KR102299269B1 - Method and apparatus for building voice database by aligning voice and script

Info

Publication number: KR102299269B1
Application number: KR1020200054279A
Authority: KR
Inventors: 장원; 윤재삼; 김봉완; 이윤한
Original assignee: 주식회사 카카오엔터프라이즈; 주식회사 카카오
Priority date: 2020-05-07
Filing date: 2020-05-07
Publication date: 2021-09-08

Abstract

The present invention relates to a voice database building method by aligning voice and script and an apparatus thereof. According to an embodiment, a voice signal and the script are aligned in a sentence unit based on first similarity between a first voice recognition sentence corresponding to part of the voice signal and a first input sentence corresponding to part of the script, second similarity between a second voice recognition sentence connecting a next sentence of the first voice recognition sentence to the first voice recognition sentence and the first input sentence, and third similarity between a second input sentence connecting a next sentence of the first input sentence to the first input sentence and the first voice recognition sentence.

Description

Method and apparatus for building voice database by aligning voice and script

The embodiments below relate to a method and apparatus for building a voice database by aligning voice and script.

Speech processing technology is technology by which a computer automatically understands, processes, or generates human natural-language utterances, and it includes speech recognition and speech synthesis. Speech recognition refers to technology that uses a computer to convert an acoustic signal, such as human speech, into text data, and is also called speech-to-text (STT). Speech recognition, which converts an analog acoustic signal into a digital signal that a computer can process, can be used in various fields. For example, voice assistant functions that control a device by recognizing a person's voice commands are built into various devices such as mobile phones, speakers, and home appliances. In general, a speech recognition model may include an acoustic model (AM) and a language model (LM). The acoustic model predicts the probability that a specific word text corresponds to a specific speech sound, and the language model predicts, independently of the speech, the probability that a specific word text fits the current position in a sentence. Using the probabilities from the two models, speech recognition can determine which word text corresponds to given input speech data. In this case, the speech recognition model may be built using a large-capacity text database.

Speech synthesis is technology by which a machine automatically generates the sound waves of speech corresponding to a text input, and is also called text-to-speech (TTS). Speech synthesis can be used in various fields such as artificial intelligence assistants, chatbots, public transportation announcements, software that reads text aloud for people who have difficulty reading (screen readers), and language learning.

To obtain good performance from various speech processing models, such as speech preprocessing, speech recognition, and speech synthesis, it is necessary to build a database containing a large volume of speech and sentences uttered by various speakers in various environments. A technique is therefore required for securing large amounts of raw speech and sentence data, from sources such as web crawling and online and offline multimedia content, as a database for speech processing modeling.

Embodiments may provide a technique for building a voice database used by a speech processing model by segmenting a raw script, and a voice signal in which the script is uttered, into sentence units and aligning them.

In addition, embodiments may provide a technique for detecting errors in which the script and the voice signal do not match while aligning the script and the voice signal in which the script is uttered, and for building a database by correcting the data detected as erroneous.

A method of building a voice database according to one aspect includes: obtaining a first similarity between a first voice recognition sentence corresponding to a part of a voice signal and a first input sentence corresponding to a part of a script of the voice signal; obtaining a second similarity between the first input sentence and a second voice recognition sentence formed by concatenating the sentence following the first voice recognition sentence to the first voice recognition sentence; obtaining a third similarity between the first voice recognition sentence and a second input sentence formed by concatenating the sentence following the first input sentence to the first input sentence; and aligning the voice signal and the script in sentence units based on the first similarity, the second similarity, and the third similarity.

The aligning may include: when the first similarity is the highest, storing a first voice signal corresponding to the first voice recognition sentence, together with the first input sentence, in the voice database; when the second similarity is the highest, updating the first voice recognition sentence to the second voice recognition sentence and passing it to the next iteration; and when the third similarity is the highest, updating the first input sentence to the second input sentence and passing it to the next iteration.

The aligning may include: determining, based on the first similarity, the second similarity, and the third similarity, whether to update the first voice recognition sentence or the first input sentence; upon determining to update the first voice recognition sentence or the first input sentence, repeating the determining of whether to update, based on a first similarity, a second similarity, and a third similarity derived from the updated first voice recognition sentence or the updated first input sentence; and upon determining not to update the first voice recognition sentence and the first input sentence, storing a first voice signal corresponding to the first voice recognition sentence, together with the first input sentence, in the database.

The repeating of the determining of whether to update may include: when it is determined to update the first voice recognition sentence, obtaining a first similarity, a second similarity, and a third similarity based on the updated first voice recognition sentence and the first input sentence; and when it is determined to update the first input sentence, obtaining a first similarity, a second similarity, and a third similarity based on the updated first input sentence and the first voice recognition sentence.
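The iteration described in the preceding paragraphs can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`edit_distance`, `similarity`, `align`), the normalization of edit distance into a similarity score, the use of a space as the joining character, and the tie-breaking order are all assumptions made for the example.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def align(recognized: List[str], script: List[str]) -> List[Tuple[Tuple[int, int], str]]:
    """Greedy sentence-level alignment.

    Each result pairs the (first, last) indices of the merged recognized
    segments with the merged script text they were matched to.
    """
    pairs = []
    i = j = 0
    while i < len(recognized) and j < len(script):
        start = i
        rec, inp = recognized[i], script[j]
        while True:
            s1 = similarity(rec, inp)
            # -1.0 marks "no next sentence left to concatenate".
            s2 = similarity(rec + " " + recognized[i + 1], inp) if i + 1 < len(recognized) else -1.0
            s3 = similarity(rec, inp + " " + script[j + 1]) if j + 1 < len(script) else -1.0
            if s1 >= s2 and s1 >= s3:
                break                      # first similarity highest: store the pair
            if s2 >= s3:
                i += 1                     # second highest: extend the recognized sentence
                rec = rec + " " + recognized[i]
            else:
                j += 1                     # third highest: extend the input sentence
                inp = inp + " " + script[j]
        pairs.append(((start, i), inp))
        i += 1
        j += 1
    return pairs
```

In this sketch the stored index range would be used to look up and join the corresponding audio segments; how the audio itself is stored is left out.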

The method of building a voice database according to one aspect may further include: segmenting the voice signal into sentence units; converting the voice signal into text based on speech recognition; and segmenting the script into sentence units.

The method of building a voice database according to one aspect may further include: receiving the voice signal and the script; preprocessing the received voice signal; and preprocessing the received script.

The first similarity, the second similarity, and the third similarity may be obtained based on an edit distance.

The aligning may include modifying at least one of the first voice signal, the first voice recognition sentence, and the first input sentence based on the first similarity according to a predetermined criterion.

The aligning may include modifying at least one of the first voice signal, the first voice recognition sentence, and the first input sentence based on the number of concatenations, which corresponds to the number of times the iteration has been repeated.
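As a rough illustration of the two criteria just described, a pair could be flagged for correction when its first similarity is too low or when too many sentences had to be concatenated to reach it. The function name `correction_reasons` and both threshold values are hypothetical; the text only says that a predetermined criterion and the number of concatenations are used.

```python
from typing import List


def correction_reasons(first_similarity: float,
                       merge_count: int,
                       min_similarity: float = 0.8,   # assumed threshold
                       max_merges: int = 3) -> List[str]:  # assumed limit
    """Return the reasons, if any, why an aligned pair should be reviewed."""
    reasons = []
    if first_similarity < min_similarity:
        reasons.append("low_similarity")
    if merge_count > max_merges:
        reasons.append("too_many_merges")
    return reasons
```

An empty list means the pair passes both checks under these assumed thresholds.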

The method of building a voice database according to one aspect may further include displaying, through an interface, a result of obtaining at least one of the first similarity, the second similarity, and the third similarity.

The storing in the database may include: performing forced alignment of the first voice signal and the first input sentence; and segmenting and storing the first voice signal and the first input sentence based on the result of the forced alignment.

The method of building a voice database may further include, after obtaining the first similarity, determining whether to obtain the second similarity and whether to obtain the third similarity based on the first similarity. In this case, the obtaining of the second similarity may include obtaining the second similarity based on that determination, and the obtaining of the third similarity may include obtaining the third similarity based on that determination.

The determining of whether to obtain the second similarity and whether to obtain the third similarity may include: when the first similarity satisfies a predetermined criterion, determining not to obtain the second similarity and the third similarity; and when the first similarity does not satisfy the predetermined criterion, determining to obtain the second similarity and the third similarity.

The obtaining of the first similarity may include: obtaining the number of characters to be deleted, the number of characters to be inserted, and the number of characters to be substituted in order to change the first voice recognition sentence into the first input sentence; and obtaining an edit distance between the first voice recognition sentence and the first input sentence based on the number of characters to be deleted, the number of characters to be inserted, and the number of characters to be substituted.
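A minimal sketch of counting the three kinds of edits, assuming a standard Levenshtein dynamic program with unit costs followed by a backtrace; the names `EditOps` and `count_edit_ops` are illustrative and do not come from the specification. When several minimal edit scripts exist, the backtrace below prefers the diagonal move, so the split between operation types can differ from another valid backtrace even though the total distance is the same.

```python
from typing import NamedTuple


class EditOps(NamedTuple):
    deletions: int
    insertions: int
    substitutions: int

    @property
    def distance(self) -> int:
        # With unit costs, the edit distance is the total number of operations.
        return self.deletions + self.insertions + self.substitutions


def count_edit_ops(recognized: str, reference: str) -> EditOps:
    """Count the edits needed to turn `recognized` into `reference`."""
    n, m = len(recognized), len(reference)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if recognized[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from recognized
                           dp[i][j - 1] + 1,         # insert into recognized
                           dp[i - 1][j - 1] + cost)  # substitute or match

    # Walk back from the bottom-right corner, tallying each operation.
    dels = ins = subs = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if recognized[i - 1] == reference[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return EditOps(dels, ins, subs)
```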

The determining of whether to obtain the second similarity and whether to obtain the third similarity may include determining to obtain one of the second similarity and the third similarity based on the number of characters to be deleted and the number of characters to be inserted.
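One way to read the two decision rules above is sketched below. The function name `plan_similarities`, the threshold value, and in particular the direction of the rule are assumptions: the sketch supposes that a surplus of insertions means the recognized sentence is too short (so only the second similarity is worth computing), and a surplus of deletions means it is too long (so only the third is). The specification states only that the decision is based on the two counts.

```python
from typing import Tuple


def plan_similarities(first_similarity: float,
                      deletions: int,
                      insertions: int,
                      accept_threshold: float = 0.9) -> Tuple[bool, bool]:
    """Return (compute_second, compute_third) for the current pair."""
    if first_similarity >= accept_threshold:
        # Assumed criterion: the pair already matches well enough.
        return (False, False)
    if insertions > deletions:
        # Recognized text appears to be missing material: try extending it.
        return (True, False)
    if deletions > insertions:
        # Recognized text appears to have extra material: try extending the script side.
        return (False, True)
    # No clear direction: compute both.
    return (True, True)
```

Skipping one or both of the extra comparisons saves edit-distance computations, which grow with the product of the two string lengths.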

The first similarity may include a similarity between the first voice recognition sentence converted into a phoneme sequence and the first input sentence converted into a phoneme sequence; the second similarity may include a similarity between the second voice recognition sentence converted into a phoneme sequence and the first input sentence converted into a phoneme sequence; and the third similarity may include a similarity between the first voice recognition sentence converted into a phoneme sequence and the second input sentence converted into a phoneme sequence.
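The specification does not say how sentences are converted into phoneme sequences. As a rough stand-in, the sketch below decomposes Hangul syllables into their constituent jamo with Unicode NFD normalization and compares the resulting sequences. This is a spelling-level approximation, not a real grapheme-to-phoneme conversion (it applies no pronunciation rules), and the names `to_jamo` and `jamo_similarity` are illustrative.

```python
import unicodedata


def to_jamo(text: str) -> str:
    # NFD splits each precomposed Hangul syllable into its conjoining jamo.
    return unicodedata.normalize("NFD", text)


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance over the symbols of two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def jamo_similarity(a: str, b: str) -> float:
    """Similarity over jamo sequences (assumed normalization to [0, 1])."""
    ja, jb = to_jamo(a), to_jamo(b)
    longest = max(len(ja), len(jb))
    return 1.0 if longest == 0 else 1.0 - edit_distance(ja, jb) / longest
```

Comparing at this finer granularity means that two syllables differing only in one consonant, such as 값 and 갑, count as a partial match rather than a complete mismatch.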

An apparatus for building a voice database according to one aspect includes: at least one processor that obtains a first similarity between a first voice recognition sentence corresponding to a part of a voice signal and a first input sentence corresponding to a part of a script of the voice signal, obtains a second similarity between the first input sentence and a second voice recognition sentence formed by concatenating the sentence following the first voice recognition sentence to the first voice recognition sentence, obtains a third similarity between the first voice recognition sentence and a second input sentence formed by concatenating the sentence following the first input sentence to the first input sentence, and aligns the voice signal and the script in sentence units based on the first similarity, the second similarity, and the third similarity; and a voice database that stores the voice signal and the script aligned in sentence units.

In aligning the voice signal and the script in sentence units, the processor may: when the first similarity is the highest, store a first voice signal corresponding to the first voice recognition sentence, together with the first input sentence, in the voice database; when the second similarity is the highest, update the first voice recognition sentence to the second voice recognition sentence and pass it to the next iteration; and when the third similarity is the highest, update the first input sentence to the second input sentence and pass it to the next iteration.

In aligning the voice signal and the script in sentence units, the processor may: determine, based on the first similarity, the second similarity, and the third similarity, whether to update the first voice recognition sentence or the first input sentence; upon determining to update the first voice recognition sentence or the first input sentence, repeat the determining of whether to update, based on a first similarity, a second similarity, and a third similarity derived from the updated first voice recognition sentence or the updated first input sentence; and upon determining not to update the first voice recognition sentence and the first input sentence, store a first voice signal corresponding to the first voice recognition sentence, together with the first input sentence, in the database.

In repeating the determining of whether to update, the processor may: when it is determined to update the first voice recognition sentence, obtain a first similarity, a second similarity, and a third similarity based on the updated first voice recognition sentence and the first input sentence; and when it is determined to update the first input sentence, obtain a first similarity, a second similarity, and a third similarity based on the updated first input sentence and the first voice recognition sentence.

The processor may segment the voice signal into sentence units, convert the voice signal into text based on speech recognition, and segment the script into sentence units.

The processor may receive the voice signal and the script, preprocess the received voice signal, and preprocess the received script.

In aligning the voice signal and the script in sentence units, the processor may modify at least one of the first voice signal, the first voice recognition sentence, and the first input sentence based on the first similarity according to a predetermined criterion.

The processor may display, through an interface, a result of obtaining at least one of the first similarity, the second similarity, and the third similarity.

In storing the first voice signal corresponding to the first voice recognition sentence, together with the first input sentence, in the database, the processor may perform forced alignment of the first voice signal and the first input sentence, and may segment and store the first voice signal and the first input sentence based on the result of the forced alignment.

After obtaining the first similarity, the processor may determine whether to obtain the second similarity and whether to obtain the third similarity based on the first similarity, and may obtain the second similarity and the third similarity based on that determination.

In determining whether to obtain the second similarity and whether to obtain the third similarity, the processor may determine not to obtain the second similarity and the third similarity when the first similarity satisfies a predetermined criterion, and may determine to obtain the second similarity and the third similarity when the first similarity does not satisfy the predetermined criterion.

By segmenting and aligning raw voice signals and script data, provided in units of a paragraph or longer, into sentence units to build a voice database, the yield of the built database can be increased.

By detecting portions where the voice signal and the script do not match, based on the similarity between the voice recognition sentences of the voice signal and the sentences included in the script, data containing errors can be separated out, for example when the speaker did not read the script exactly or when there is a problem with the voice or the input sentence.

By building the voice database from the input script data, without building a large-capacity database in advance, recognition errors for the input script data can be reduced.

FIG. 1 is a diagram illustrating a voice database building system according to an embodiment.
FIG. 2 is a diagram for explaining various methods of acquiring the voice signal and script that are input to build a voice database according to an embodiment.
FIGS. 3A and 3B are diagrams illustrating operation flowcharts of an alignment algorithm according to an embodiment.
FIGS. 4 to 6 are diagrams illustrating examples for explaining the alignment algorithm according to an embodiment in detail.
FIG. 7 is a diagram for explaining an operation of detecting and correcting errors in the voice database building system according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of rights of the patent application is not restricted or limited by these embodiments. All modifications, equivalents, and substitutes for the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are used for the purpose of description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification is present, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application.

In addition, in the description given with reference to the accompanying drawings, identical components are given the same reference numerals regardless of the drawing in which they appear, and redundant descriptions of them are omitted. In describing an embodiment, a detailed description of related known technology is omitted when it is judged that it could unnecessarily obscure the gist of the embodiment.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of an embodiment. These terms serve only to distinguish one component from another, and they do not limit the nature, sequence, or order of the component concerned. When a component is described as being "connected", "coupled", or "joined" to another component, the component may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

A component included in one embodiment and a component having a function in common with it are described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment may also apply to other embodiments, and detailed descriptions are omitted where they would overlap.

FIG. 1 is a diagram illustrating a voice database building system according to an embodiment.

Referring to FIG. 1, a voice database building system according to an embodiment may include a module 150 that aligns a voice signal and a script in sentence units based on a segmented voice list 103 of a received voice signal 101, a segmented recognized sentence list 104 obtained by performing speech recognition on the segmented voice signal, and a segmented input sentence list 105 of a received script 102. The voice signal and the script according to an embodiment may be aligned by the sentence-unit alignment module 150 and stored in a voice database 106. The voice database built by the system according to an embodiment may be used by a speech processing model. Hereinafter, the voice database used for speech processing technology may be referred to simply as the DB.

The voice signal 101 corresponds to a sound signal in which various speakers utter natural language, and the script 102 corresponds to text data corresponding to the voice signal. To use the voice signal 101 and the script 102 as a database for a speech processing model, they need to go through processing such as segmentation into sentence units and alignment.

Referring to FIG. 2, the voice signal 101 and the script 102 that are input to build a voice database may be acquired online and/or offline through various routes and methods. For example, voice signals and scripts may be acquired through web crawling, multimedia content capture, and the like.

Because voice signal and script data acquired in this way may come in units of a paragraph or longer, the voice signal and script need to be segmented into specific units before they can be used as a DB for a speech processing model. For example, the voice signal and script may be segmented into sentence units and used as a voice database for a speech processing model. In the following description, the unit into which the voice signal and script are segmented is taken to be the sentence by way of example; however, "sentence" here refers to a segmented unit of the voice signal or script and does not necessarily mean a sentence in the linguistic sense.

According to an embodiment, aligning the voice signal and the script in sentence units requires, for the voice signal, conversion into text through speech recognition in addition to segmentation of the voice signal. Hereinafter, a voice recognition sentence, or simply a recognized sentence, means the text corresponding to a voice signal segmented into sentence units.

In other words, the voice signal 101 input to the voice database building system according to an embodiment may be segmented into specific units by the speech segmentation and speech recognition module 130 and converted into text through speech recognition. The system according to an embodiment may include a module 110 that preprocesses the voice signal before it is input to the speech segmentation and speech recognition module 130. The voice signal preprocessing module 110 may, for example, remove elements that are not needed in the DB, such as music, background sound, and noise, and normalize the audio format. Preprocessing, segmentation, and speech recognition of the voice signal may be performed using any technique. The voice signal according to an embodiment is segmented and recognized to form a segmented voice list 103, which contains the parts of the voice signal, and a segmented recognition sentence list 104, which contains the texts resulting from speech recognition of those parts. The following describes the case in which the segmented speech is recognized after segmentation, but the embodiments apply in substantially the same way when speech recognition is performed simultaneously with segmentation, or when the recognized sentences are segmented after speech recognition.

The script 102 input to the voice database building system according to an embodiment may be segmented into specific units by the script segmentation module 140, and the system may include a module 120 that preprocesses the script before it is input to the script segmentation module 140. The script preprocessing module 120 may perform, for example, operations that improve segmentation performance, such as unifying the different kinds of symbols contained in the script and removing unnecessary symbols, as well as operations such as normalizing the sentence encoding. Preprocessing and segmentation of the script may be performed using any technique. The script according to an embodiment is segmented to form a segmented input sentence list 105 containing the parts of the script.
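
The script preprocessing described above can be illustrated with a minimal Python sketch. The description leaves the technique open, so everything here is an assumption made for illustration: the function name `preprocess_script`, the choice of Unicode NFKC as the encoding normalization, and the small quote-unification table are all hypothetical.

```python
import re
import unicodedata

# Hypothetical table that unifies typographic quote variants into ASCII quotes.
_QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def preprocess_script(text: str) -> str:
    """Illustrative script preprocessing (assumed steps, not the patented method)."""
    # Assumed encoding normalization: fold compatibility characters (NFKC).
    text = unicodedata.normalize("NFKC", text)
    # Assumed symbol unification: map curly quotes to straight quotes.
    text = text.translate(_QUOTE_MAP)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(preprocess_script("“Hello”\u3000 world\n"))
```

Which symbols to unify or drop depends on the script source; the table above only shows the shape such a rule set might take.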

Because a voice signal and a script are different kinds of data, they may be segmented by different methods. For example, a voice signal, being a sound signal, may be segmented into specific units according to the strength of the signal, whereas a script, being text data, may be segmented into specific units according to the kinds of symbols it contains, such as periods. Consequently, when the voice signal and the script are each segmented by an automatic segmentation algorithm, the segmentation results may differ because of differences in the kind of algorithm used and in algorithm performance. In other words, even when a voice signal and its corresponding script are each segmented into sentence units, the segmentation result of the voice signal and that of the script may not match. For example, suppose that a voice signal uttering sentences A, B, and C and a script consisting of sentences A, B, and C are segmented into sentence units. The voice signal may be segmented into sentence A+B and sentence C because sentences A and B are recognized as one sentence, while the script is segmented into sentence A, sentence B, and sentence C. Alternatively, the voice signal uttering sentence A may be recognized as sentence A', so that the voice signal is segmented into sentence A', sentence B, and sentence C.
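
As a rough illustration of the symbol-based script segmentation mentioned above, the following Python sketch splits text after sentence-final punctuation. The function name `split_script` and the punctuation set `.!?` are assumptions; the description only says that symbols such as periods may be used.

```python
import re


def split_script(text: str) -> list[str]:
    """Split a script into sentence-like units after '.', '!' or '?' (assumed rule)."""
    # The lookbehind keeps the punctuation attached to the preceding unit.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_script("A is here. B follows? C ends!"))
```

An energy-based segmenter applied to the audio has no reason to cut at exactly these points, which is why the two segmentation results can disagree and an alignment step is needed.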

The module 150 that aligns the segmentation results of the voice signal and the script may correct the segmentation results so that the segmented voice signal and the segmented script correspond to each other. In other words, the segmentation results of the voice signal and the script may differ, and this can be corrected by running an algorithm that aligns the segmented voice signal with the segmented script. The operation of the alignment module 150 according to an embodiment is described in more detail below with reference to FIGS. 3A to 6.

According to an embodiment, the segmented voice signal and the segmented script may be stored in the database 106 after being aligned. In addition, metadata related to the list that pairs the segmented voice signals with the segmented scripts, such as speaker, source, and date, may be stored in the database.

If the speaker did not read the script exactly as written, or if the voice signal or the script contains an error, the alignment module 150 may detect this as an error and store it in a separate list. The error detection operation according to an embodiment is described in detail below with reference to FIG. 7.

FIGS. 3A and 3B are flowcharts illustrating the operation of an alignment algorithm according to an embodiment.

Hereinafter, the first speech recognition sentence is data corresponding to one of the segmented voice signals, namely the text data obtained by speech recognition of that segmented voice signal. That is, it is one of the sentences included in the segmented recognition sentence list 104. Likewise, the first input sentence is one of the segments of the script, that is, one of the input sentences included in the segmented input sentence list 105.

Referring to FIG. 3A, the alignment algorithm according to an embodiment includes steps of aligning the voice signal and the script in sentence units based on a first similarity, a second similarity, and a third similarity. The first similarity is the similarity between the first speech recognition sentence and the first input sentence. The second similarity is the similarity between the first input sentence and a second speech recognition sentence, which is formed by concatenating the sentence following the first speech recognition sentence to the first speech recognition sentence. The third similarity is the similarity between the first speech recognition sentence and a second input sentence, which is formed by concatenating the sentence following the first input sentence to the first input sentence.

Here, the sentence following the first speech recognition sentence is the sentence located immediately after the first speech recognition sentence in the segmentation result of the voice signal; it may be the next speech recognition sentence after the first speech recognition sentence in the segmented recognition sentence list 104. Accordingly, the second speech recognition sentence may be the sentence obtained by concatenating the first speech recognition sentence and the sentence that follows it.

The sentence following the first input sentence is the sentence located immediately after the first input sentence in the segmentation result of the script; it may be the next input sentence after the first input sentence in the segmented input sentence list 105. Accordingly, the second input sentence may be the sentence obtained by concatenating the first input sentence and the sentence that follows it.

According to an embodiment, the edit distance (Levenshtein distance) may be calculated and used to align the speech recognition sentences, which correspond to the segmented voice signals, with the input sentences, which are the segments of the script. In other words, the similarity between a speech recognition sentence and an input sentence may be obtained based on the edit distance between them.

The edit distance is the minimum number of character insertions, deletions, and substitutions needed to turn string A into string B, and it can be used as an indicator of the similarity between two strings. For example, for string A "끝이야" and string B "그치 야", "끝" must be replaced with "그" and "이" with "치" to turn string A into string B, so the edit distance between string A and string B is 2. As another example, for string A "그치" and string B "그치 야", "야" must be inserted to turn string A into string B, so the edit distance between them is 1. Both counts treat each Hangul syllable as one character and do not count the space.
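
The two worked examples above can be reproduced with a short Python sketch of the standard dynamic-programming Levenshtein distance. The function name `edit_distance` is illustrative, and dropping whitespace before comparison is an assumption made so that the results match the counts given in the examples.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters, ignoring whitespace (assumption)."""
    # Normalize to NFC so each Hangul syllable is a single character.
    a = "".join(unicodedata.normalize("NFC", a).split())
    b = "".join(unicodedata.normalize("NFC", b).split())
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("끝이야", "그치 야"))  # 2
print(edit_distance("그치", "그치 야"))    # 1
```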

The similarity between two sentences may be determined from their edit distance: a small edit distance is interpreted as high similarity, and a large edit distance as low similarity. Besides the edit distance, various other measures of the similarity or distance between two sentences may be applied to the embodiments described below.

According to an embodiment, the similarity between a speech recognition sentence and an input sentence may be the similarity between the speech recognition sentence converted into a phoneme sequence and the input sentence converted into a phoneme sequence. Phoneme sequence conversion converts a grapheme sequence into the corresponding phoneme sequence, that is, it converts a text into symbols indicating its pronunciation according to phonological rules and the like. For example, phoneme sequence conversion turns the sentence "이천 육백 칠십 오 채" (the number 2,675 written out in words, followed by the counter 채) into "이처뉵빽칠씨보채". When a sentence contains digits, the digits are likewise converted into the phoneme sequence corresponding to their pronunciation. For example, phoneme sequence conversion turns the sentence "2675채" into "이처뉵빽칠씨보채".

When the similarity between a speech recognition sentence and an input sentence is obtained by calculating the edit distance between them, each of the two sentences may first be converted into a phoneme sequence, and the edit distance may then be calculated between the converted speech recognition sentence and the converted input sentence to obtain the similarity.

For example, when the input sentence is "2675채" and the speech recognition sentence is "이천 육백 칠십 오 채", the edit distance is calculated as 7 even though the two sentences are pronounced identically. In contrast, when the input sentence and the speech recognition sentence are each converted into a phoneme sequence before the edit distance is calculated, both become "이처뉵빽칠씨보채", so the edit distance is calculated as 0.
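
The effect of comparing phoneme sequences instead of raw text can be sketched as follows. A real system would use a grapheme-to-phoneme converter; here the hypothetical table `TOY_G2P` merely hard-codes the two conversions given in the example above, and `edit_distance` ignores whitespace as an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters, ignoring whitespace (assumption)."""
    a, b = "".join(a.split()), "".join(b.split())
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical stand-in for a grapheme-to-phoneme converter: it only knows
# the two sentences from the example.
TOY_G2P = {
    "2675채": "이처뉵빽칠씨보채",
    "이천 육백 칠십 오 채": "이처뉵빽칠씨보채",
}


def to_phonemes(sentence: str) -> str:
    return TOY_G2P[sentence]


script_sentence = "2675채"
recognized_sentence = "이천 육백 칠십 오 채"

print(edit_distance(script_sentence, recognized_sentence))  # 7
print(edit_distance(to_phonemes(script_sentence),
                    to_phonemes(recognized_sentence)))      # 0
```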

Referring to FIG. 3A, when the first speech recognition sentence has been updated to the second speech recognition sentence, or the first input sentence has been updated to the second input sentence, the alignment algorithm according to an embodiment returns to its first step, obtaining the similarities (300), treats the second speech recognition sentence as the new first speech recognition sentence or the second input sentence as the new first input sentence, and repeats the subsequent steps. That is, the sequence of obtaining the similarities based on the first speech recognition sentence and the first input sentence, deciding based on those similarities whether to update the first speech recognition sentence or the first input sentence, and performing the update may be iterated each time the first speech recognition sentence or the first input sentence is updated.

More specifically, when the second similarity is determined to be the highest in steps 340 and 350, the first speech recognition sentence is updated to the second speech recognition sentence (360) and passed to the next iteration; when the third similarity is determined to be the highest in steps 340 and 350, the first input sentence is updated to the second input sentence (370) and passed to the next iteration. According to an embodiment, when the first similarity is the highest, the iteration of the alignment algorithm ends, and the first voice signal corresponding to the first speech recognition sentence and the first input sentence may be stored in the voice database.

Step 340 of FIG. 3A compares the first similarity with the second and third similarities using strict inequalities, but depending on the embodiment the comparison may include equality. That is, according to an embodiment, the iteration of the alignment algorithm may also end when the first similarity equals the second similarity and is greater than the third similarity, when the first similarity equals the third similarity and is greater than the second similarity, and when the first, second, and third similarities are all equal.

Step 350 of FIG. 3A compares the second similarity with the third similarity using a strict inequality, but depending on the embodiment the comparison may include equality. That is, when the second similarity and the third similarity are equal, either the first speech recognition sentence may be updated to the second speech recognition sentence (360) or the first input sentence may be updated to the second input sentence (370), depending on the embodiment. In that case, however, only one of step 360 and step 370 is performed, not both. According to an embodiment, a further criterion may be introduced to decide which operation to perform when the second and third similarities are equal. For example, when the second and third similarities obtained from the edit distance are equal, the second and third similarities may be obtained again using a different criterion for the similarity between sentences. Alternatively, when the second and third similarities obtained from the edit distance are equal, one of step 360 and step 370 may be selected based on the number of characters that must be inserted or deleted and the number of characters that must be substituted to make the two strings identical when the edit distance is computed.
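
The iteration of FIG. 3A (steps 300 to 370) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: similarity is taken to be the negative edit distance on raw text (no phoneme conversion), neighboring sentences are joined with a space, step 340 uses the equality-inclusive variant so that ties end the iteration, and a tie between the second and third similarities is resolved arbitrarily in favor of step 370. The names `align`, `similarity`, and `edit_distance` are illustrative.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters, ignoring whitespace (assumption)."""
    a, b = "".join(a.split()), "".join(b.split())
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    # Assumed measure: a smaller edit distance means a higher similarity.
    return -edit_distance(a, b)


def align(recognized: list[str], script: list[str]) -> list[tuple[str, str, int]]:
    """Return (recognition sentence, input sentence, merge count) triples."""
    pairs = []
    i = j = 0
    while i < len(recognized) and j < len(script):
        rec, inp = recognized[i], script[j]
        i, j = i + 1, j + 1
        merges = 0
        while True:
            s1 = similarity(rec, inp)
            # Second similarity: extend the recognition sentence, if possible.
            s2 = (similarity(rec + " " + recognized[i], inp)
                  if i < len(recognized) else -math.inf)
            # Third similarity: extend the input sentence, if possible.
            s3 = (similarity(rec, inp + " " + script[j])
                  if j < len(script) else -math.inf)
            if s1 >= s2 and s1 >= s3:   # step 340, equality-inclusive variant
                break
            if s2 > s3:                 # step 350 -> step 360
                rec, i = rec + " " + recognized[i], i + 1
            else:                       # step 370 (also taken on a tie)
                inp, j = inp + " " + script[j], j + 1
            merges += 1
        pairs.append((rec, inp, merges))
    return pairs


recognized = ["the cat sat on the mat", "it slept"]
script = ["the cat sat", "on the mat", "it slept"]
for rec, inp, merges in align(recognized, script):
    print(merges, "|", rec, "|", inp)
```

In this toy input the recognizer merged two script sentences into one, so the first pair is produced after one merge on the script side. A fuller version would also track which audio segments were merged so that the corresponding voice signal can be stored with each pair.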

Referring to FIG. 3B, the alignment algorithm according to an embodiment may include first obtaining the first similarity (310) and then determining, based on the first similarity, whether to obtain the second similarity and the third similarity. For example, when the first similarity meets a predetermined criterion, it is determined that the second and third similarities are not to be obtained; when the first similarity does not meet the predetermined criterion, it is determined that the second and third similarities are to be obtained.

Referring to FIG. 3B, the alignment algorithm according to an embodiment may include comparing the first similarity with a predetermined first threshold (320) to determine whether to obtain the second and third similarities. When the first similarity is less than or equal to the first threshold, the second and third similarities are obtained (330). When the first similarity is greater than the first threshold, the alignment algorithm ends without obtaining the second and third similarities, and the first voice signal corresponding to the first speech recognition sentence and the first input sentence may be stored in the voice database. In other words, when the first input sentence and the first speech recognition sentence are substantially identical, the alignment algorithm may end without obtaining the second and third similarities. Here, the first input sentence and the first speech recognition sentence may be judged substantially identical when the edit distance between them is smaller than a predetermined threshold distance, or when the first similarity between them is greater than a predetermined threshold similarity. When the two sentences are substantially identical, the first voice signal and the first input sentence are judged to be aligned, and the first voice signal corresponding to the first speech recognition sentence and the first input sentence are stored in the voice database without obtaining the second and third similarities, which makes the alignment operation more efficient.
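
The early exit of step 320 can be sketched as follows. The description leaves the similarity scale and the threshold value open, so the normalization `1 - distance / longer length` and the default `first_threshold=0.9` are assumptions chosen only to make the example concrete; the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed similarity in [0, 1]: 1.0 means the strings are identical."""
    a, b = "".join(a.split()), "".join(b.split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def needs_neighbor_check(rec: str, inp: str, first_threshold: float = 0.9) -> bool:
    """Step 320: obtain the second and third similarities only when the
    first similarity is less than or equal to the first threshold."""
    return normalized_similarity(rec, inp) <= first_threshold


print(needs_neighbor_check("abcd", "abcd"))  # False: store the pair directly
print(needs_neighbor_check("abcd", "ab"))    # True: compute the other similarities
```

Skipping the two extra comparisons for near-identical pairs saves two edit-distance computations per pair, which matters because each one is quadratic in the sentence length.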

FIG. 3B shows the case of determining whether the first similarity exceeds the first threshold (320), but according to an embodiment it may instead be determined whether the first similarity is greater than or equal to the first threshold. The first threshold used as the criterion in step 320 may be any predetermined value.

Referring to FIG. 3B, the alignment algorithm according to an embodiment may include, when the first similarity is the highest, storing the first voice signal corresponding to the first speech recognition sentence (or the first voice signal and the first speech recognition sentence) and the first input sentence in a suspected-mismatch list based on the first similarity or the number of concatenations (390). The number of concatenations is the sum of the number of times the first speech recognition sentence has been updated to the second speech recognition sentence and the number of times the first input sentence has been updated to the second input sentence, and it may be obtained by incrementing a counter (380) each time the alignment algorithm iterates. That is, the number of concatenations may correspond to the number of iterations of the alignment algorithm.

According to an embodiment, when the first similarity is the highest and the number of concatenations exceeds a predetermined second threshold, or when the first similarity is the highest and is less than a predetermined third threshold, the first voice signal (or the first voice signal and the first speech recognition sentence) and the first input sentence may be stored in the suspected-mismatch list. For example, when the first similarity is greater than the first threshold, or greater than both the second and third similarities, no further concatenation (updating of the speech recognition sentence or of the input sentence) may be performed. At that point, if the number of concatenations is greater than the second threshold, it is possible that two or more sentences have been aligned together, so the pair may be stored in the suspected-mismatch list. Alternatively, regardless of the number of concatenations, the pair may also be stored in the suspected-mismatch list when the first similarity is less than the third threshold. The third threshold may be set equal to or different from the first threshold.

In contrast, when the first similarity is the highest, the number of concatenations is less than or equal to the second threshold, and the first similarity is greater than or equal to the third threshold, the alignment algorithm may end without storing the pair in the suspected-mismatch list. The second and third thresholds used to decide whether to store a pair in the suspected-mismatch list may be any predetermined values.

FIG. 3B shows the case of determining whether the number of concatenations exceeds the second threshold and whether the first similarity is less than the third threshold, but depending on the embodiment it may instead be determined whether the number of concatenations is greater than or equal to the second threshold, or whether the first similarity is less than or equal to the third threshold. Furthermore, the criterion shown in FIG. 3B is only one example; various criteria based on the first similarity or the number of concatenations may be used to decide whether to store a pair in the suspected-mismatch list.
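
The decision of whether to route an aligned pair to the suspected-mismatch list (step 390) reduces to two comparisons, sketched below. The function name and the default threshold values are placeholders, since the description states only that the thresholds are predetermined; the similarity is assumed to be on a 0 to 1 scale.

```python
def is_mismatch_suspect(first_similarity: float,
                        concatenations: int,
                        second_threshold: int = 2,
                        third_threshold: float = 0.8) -> bool:
    """Return True if the pair should go to the suspected-mismatch list.

    Follows the FIG. 3B variant: flag the pair when the number of
    concatenations exceeds the second threshold, or when the first
    similarity is below the third threshold. Threshold defaults are
    placeholders, not values from the description.
    """
    return concatenations > second_threshold or first_similarity < third_threshold


aligned, suspects = [], []
# Hypothetical results: (recognition sentence, input sentence, similarity, concatenations)
results = [
    ("it slept", "it slept", 1.0, 0),
    ("the cat sat on a hat", "the cat sat on the mat", 0.75, 1),
    ("a b c d", "a b c d", 1.0, 3),
]
for rec, inp, sim, count in results:
    (suspects if is_mismatch_suspect(sim, count) else aligned).append((rec, inp))

print(len(aligned), len(suspects))  # 1 2
```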

According to an embodiment, aligning the voice signal and the script in sentence units may include: determining, based on the first, second, and third similarities, whether to update the first speech recognition sentence or the first input sentence; when it is determined to update the first speech recognition sentence or the first input sentence, repeating the determination of whether to update based on the first, second, and third similarities obtained from the updated first speech recognition sentence or the updated first input sentence; and when it is determined not to update either the first speech recognition sentence or the first input sentence, storing the first voice signal corresponding to the first speech recognition sentence and the first input sentence in the database.

When the first speech recognition sentence or the first input sentence has been updated, the alignment algorithm according to an embodiment may repeat the steps of obtaining the first similarity, the second similarity, and the third similarity based on the updated sentence and determining whether to update based on the obtained similarities. When neither the first speech recognition sentence nor the first input sentence is updated, the alignment algorithm may terminate.
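The iterative procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `edit_distance`, `similarity`, and `align_one`, the use of a negated character-level edit distance as the similarity, the rule that a tie ends the loop, and the English toy sentences are all hypothetical choices made for the example.

```python
import re
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance.

    Assumption: case, whitespace, and punctuation are ignored.
    """
    a = re.sub(r"\W+", "", a).lower()
    b = re.sub(r"\W+", "", b).lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    # Assumption: a smaller edit distance means a larger similarity.
    return -edit_distance(a, b)


def align_one(asr: List[str], script: List[str]) -> Tuple[str, str, int]:
    """Grow the first ASR sentence or the first script sentence until
    neither connection improves the match; return both texts and the
    number of connections that were made."""
    asr_text, script_text = asr[0], script[0]
    i = j = 1                      # index of the next unused sentence
    n_connections = 0
    while True:
        s1 = similarity(asr_text, script_text)
        s2 = (similarity(asr_text + " " + asr[i], script_text)
              if i < len(asr) else float("-inf"))
        s3 = (similarity(asr_text, script_text + " " + script[j])
              if j < len(script) else float("-inf"))
        if s1 >= s2 and s1 >= s3:  # no update: the pair is ready to store
            return asr_text, script_text, n_connections
        if s2 >= s3:               # connect the next ASR sentence
            asr_text, i = asr_text + " " + asr[i], i + 1
        else:                      # connect the next script sentence
            script_text, j = script_text + " " + script[j], j + 1
        n_connections += 1


asr_sentences = ["what on earth", "are you doing"]
script_sentences = ["What on earth are you doing?", "Talk to the listeners."]
result = align_one(asr_sentences, script_sentences)
print(result)
```

In this toy run the first iteration connects the second recognized sentence, and the second iteration finds the first similarity to be the largest, so the loop stops after one connection.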

Although not shown in the drawings, according to an embodiment, the voice signal and the script aligned by the alignment algorithm may be further segmented into smaller units. In this case, storing the first voice signal corresponding to the speech recognition sentence and the first input sentence in the database may include performing forced alignment on the first voice signal and the first input sentence, and segmenting and storing the first voice signal and the first input sentence based on the forced alignment result. For example, when the first voice signal corresponds to sentences A+B and the first input sentence is sentences A+B, the first voice signal and the first input sentence may be force-aligned by a forced alignment algorithm; based on the result, the first voice signal may be segmented into a voice signal corresponding to sentence A and a voice signal corresponding to sentence B, the first input sentence may be segmented into sentence A and sentence B, and the segments may be stored. Whether to segment the first voice signal and the first input sentence by forced alignment before storing them in the database may be determined according to a predetermined criterion.
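The segmentation step can be sketched as follows, assuming a forced aligner has already produced one time-stamped entry per whitespace-separated word of the script, in order. The `(word, start_sec, end_sec)` format, the function name `split_by_sentence`, and the sample timings are hypothetical; the source does not specify the aligner or its output.

```python
from typing import List, Tuple

# Hypothetical forced-aligner output: (word, start_sec, end_sec)
WordTime = Tuple[str, float, float]


def split_by_sentence(word_times: List[WordTime],
                      sentences: List[str]) -> List[Tuple[str, float, float]]:
    """Return (sentence, start_sec, end_sec) for each non-empty sentence,
    assuming word_times covers the sentences word by word, in order."""
    spans = []
    k = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        words = word_times[k:k + n_words]
        spans.append((sentence, words[0][1], words[-1][2]))
        k += n_words
    return spans


word_times = [("it", 0.00, 0.20), ("is", 0.20, 0.35), ("over", 0.35, 0.80),
              ("so", 1.10, 1.30), ("nervous", 1.30, 1.90)]
segments = split_by_sentence(word_times, ["it is over", "so nervous"])
print(segments)
```

Each returned span gives the time range to cut from the merged voice signal for sentence A and sentence B respectively.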

FIGS. 4 to 6 illustrate examples for describing the alignment algorithm according to an embodiment in detail.

Referring to FIG. 4, voice signals (410, 420), obtained by segmenting a voice signal into the sentence units "지금 도대체" ("what on earth ... now") and "뭐 하고 있는 거야" ("are you doing"), are shown together with input sentences (430, 440), obtained by segmenting the script into the sentence units "지금 도대체 뭐 하고 있는 거야?" ("What on earth are you doing now?") and "청취자들과 소통하란 말야" ("I am telling you to communicate with the listeners"). The first similarity (450), the second similarity (460), and the third similarity (470) shown in FIG. 4 are computed from the voice signals, speech recognition sentences, and input sentences shown in FIG. 4.

Referring to FIG. 4, the first similarity (450) is obtained based on the edit distance between the first speech recognition sentence (411) and the first input sentence (430). The first speech recognition sentence (411), "지금 도대체", and the first input sentence (430), "지금 도대체 뭐 하고 있는 거야", differ by the 7 characters of "뭐 하고 있는 거야", so the edit distance is 7. The second similarity (460) is obtained based on the edit distance between the first input sentence (430) and the second speech recognition sentence, in which the next sentence (421) of the first speech recognition sentence (411) is connected to the first speech recognition sentence (411). The second speech recognition sentence, "지금 도대체 뭐 하고 있는 거야", and the first input sentence (430), "지금 도대체 뭐 하고 있는 거야", are identical, so the edit distance is 0. The third similarity (470) is obtained based on the edit distance between the first speech recognition sentence (411) and the second input sentence, in which the next sentence (440) of the first input sentence (430) is connected to the first input sentence (430). The first speech recognition sentence, "지금 도대체", and the second input sentence, "지금 도대체 뭐 하고 있는 거야 청취자들과 소통하란 말야", differ by the 18 characters of "뭐 하고 있는 거야 청취자들과 소통하란 말야", so the edit distance is 18. As described above, a smaller edit distance between two sentences means that the sentences are more similar. The similarities are therefore ordered inversely to the edit distances: second similarity > first similarity > third similarity.
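The three edit distances of the FIG. 4 example can be reproduced with a short sketch. It assumes a character-level Levenshtein distance in which spaces and punctuation are not counted, which is consistent with the character counts given in the text; the function names `normalize` and `edit_distance` are illustrative only.

```python
import re


def normalize(sentence: str) -> str:
    # Assumption: whitespace and punctuation do not count as characters.
    return re.sub(r"\W+", "", sentence)


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance on normalized text."""
    a, b = normalize(a), normalize(b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


asr_first = "지금 도대체"                  # first speech recognition sentence (411)
asr_next = "뭐 하고 있는 거야"             # next recognized sentence (421)
script_first = "지금 도대체 뭐 하고 있는 거야?"  # first input sentence (430)
script_next = "청취자들과 소통하란 말야"    # next input sentence (440)

d1 = edit_distance(asr_first, script_first)
d2 = edit_distance(asr_first + " " + asr_next, script_first)
d3 = edit_distance(asr_first, script_first + " " + script_next)
print(d1, d2, d3)  # 7 0 18
```

Because the second distance is the smallest, the second similarity is the largest, matching the ordering stated for FIG. 4.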

The alignment algorithm according to an embodiment includes a step (360) of updating the first speech recognition sentence to the second speech recognition sentence when the second similarity is the largest. That is, because the alignment algorithm updates the first speech recognition sentence (411) to the second speech recognition sentence, the first speech recognition sentence becomes "지금 도대체 뭐 하고 있는 거야". The steps of obtaining the first similarity, the second similarity, and the third similarity and of updating may then be repeated based on the updated first speech recognition sentence and the first input sentence.

Referring to FIG. 5, voice signals (510, 520) for "그치 야 너무 떨려" (a misrecognized utterance) and "내 커리어는" ("my career is"), which are some of the voice signals obtained by segmenting a voice signal into sentence units, are shown together with input sentences (530, 540), "끝이야." ("It's over.") and "너무 떨려." ("I'm so nervous."), which are some of the input sentences obtained by segmenting the script into sentence units. The first similarity (550), the second similarity (560), and the third similarity (570) shown in FIG. 5 are computed from the voice signals, speech recognition sentences, and input sentences shown in FIG. 5.

Referring to FIG. 5, the first similarity (550) is obtained based on the edit distance between the first speech recognition sentence (511) and the first input sentence (530). The first speech recognition sentence (511), "그치 야 너무 떨려", and the first input sentence (530), "끝이야.", differ by the 2 characters of "끝이" versus "그치" and the 4 characters of "너무 떨려", so the edit distance is 6. The second similarity (560) is obtained based on the edit distance between the first input sentence (530) and the second speech recognition sentence, in which the next sentence (521) of the first speech recognition sentence (511) is connected to the first speech recognition sentence (511). The second speech recognition sentence, "그치 야 너무 떨려 내 커리어는", and the first input sentence (530), "끝이야.", differ by the 2 characters of "끝이" versus "그치" and the 9 characters of "너무 떨려 내 커리어는", so the edit distance is 11. The third similarity (570) is obtained based on the edit distance between the first speech recognition sentence (511) and the second input sentence, in which the next sentence (540) of the first input sentence (530) is connected to the first input sentence (530). The first speech recognition sentence, "그치 야 너무 떨려", and the second input sentence, "끝이야 너무 떨려", differ by the 2 characters of "끝이" versus "그치", so the edit distance is 2. As described above, a smaller edit distance between two sentences means that the sentences are more similar. The similarities are therefore ordered inversely to the edit distances: third similarity > first similarity > second similarity.

The alignment algorithm according to an embodiment includes a step (370) of updating the first input sentence to the second input sentence when the third similarity is the largest. That is, because the alignment algorithm updates the first input sentence (530) to the second input sentence, the first input sentence becomes "끝이야. 너무 떨려.".

Referring to FIG. 6, the first input sentence (630), "끝이야. 너무 떨려." ("It's over. I'm so nervous."), which is the updated form of the first input sentence (530) of FIG. 5, is shown together with the next sentence (640) of the first input sentence (630), "내 커리어는 박살났군" ("my career is ruined"). The first similarity (650), the second similarity (660), and the third similarity (670) shown in FIG. 6 are computed from the voice signals, speech recognition sentences, and input sentences shown in FIG. 6.

Referring to FIG. 6, the first similarity (650) is obtained based on the edit distance between the first speech recognition sentence (511) and the first input sentence (630). The first speech recognition sentence (511), "그치 야 너무 떨려", and the first input sentence (630), "끝이야. 너무 떨려.", differ by the 2 characters of "끝이" versus "그치", so the edit distance is 2. The second similarity (660) is obtained based on the edit distance between the first input sentence (630) and the second speech recognition sentence, in which the next sentence (521) of the first speech recognition sentence (511) is connected to the first speech recognition sentence (511). The second speech recognition sentence, "그치 야 너무 떨려 내 커리어는", and the first input sentence (630), "끝이야. 너무 떨려.", differ by the 2 characters of "끝이" versus "그치" and the 5 characters of "내 커리어는", so the edit distance is 7. The third similarity (670) is obtained based on the edit distance between the first speech recognition sentence (511) and the second input sentence, in which the next sentence (640) of the first input sentence (630) is connected to the first input sentence (630). The first speech recognition sentence, "그치 야 너무 떨려", and the second input sentence, "끝이야. 너무 떨려. 내 커리어는 박살났군", differ by the 2 characters of "끝이" versus "그치" and the 9 characters of "내 커리어는 박살났군", so the edit distance is 11. As described above, a smaller edit distance between two sentences means that the sentences are more similar. The similarities are therefore ordered inversely to the edit distances: first similarity > second similarity > third similarity.

The alignment algorithm according to an embodiment terminates when the first similarity is the largest, and the first voice signal corresponding to the first speech recognition sentence and the first input sentence may be stored in the voice database. In other words, when the first similarity is the largest, the system according to an embodiment may determine not to update the first speech recognition sentence and the first input sentence and, following that decision, store the first voice signal corresponding to the first speech recognition sentence and the first input sentence in the database. That is, the voice signal (510) corresponding to the first speech recognition sentence (511), "그치 야 너무 떨려", and the first input sentence (630) may be stored in the database.
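The decisions reached in FIG. 5 and FIG. 6 can be traced with a single-step sketch. It assumes a character-level Levenshtein distance that ignores spaces and punctuation, and it assumes that a tie is resolved in favor of storing; the names `edit_distance` and `decide` and the action labels are hypothetical.

```python
import re
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance, ignoring spaces and punctuation."""
    a, b = re.sub(r"\W+", "", a), re.sub(r"\W+", "", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def decide(asr: str, asr_next: str, script: str,
           script_next: str) -> Tuple[str, Tuple[int, int, int]]:
    """Return the action for one iteration and the three edit distances."""
    d1 = edit_distance(asr, script)                      # first similarity
    d2 = edit_distance(asr + " " + asr_next, script)     # second similarity
    d3 = edit_distance(asr, script + " " + script_next)  # third similarity
    if d1 <= d2 and d1 <= d3:
        action = "store"            # first similarity is the largest
    elif d2 <= d3:
        action = "extend_asr"       # second similarity is the largest
    else:
        action = "extend_script"    # third similarity is the largest
    return action, (d1, d2, d3)


fig5 = decide("그치 야 너무 떨려", "내 커리어는", "끝이야.", "너무 떨려.")
fig6 = decide("그치 야 너무 떨려", "내 커리어는",
              "끝이야. 너무 떨려.", "내 커리어는 박살났군")
print(fig5)  # ('extend_script', (6, 11, 2))
print(fig6)  # ('store', (2, 7, 11))
```

The first call corresponds to FIG. 5, where the input sentence is extended; the second corresponds to FIG. 6, where the pair is stored.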

According to an embodiment, when the first similarity is obtained based on the edit distance between the first speech recognition sentence and the first input sentence, one of the step of obtaining the second similarity and the step of obtaining the third similarity may be selected based on the number of characters that must be inserted, deleted, and/or substituted to change the first speech recognition sentence into the first input sentence, as determined while computing the edit distance. In other words, obtaining the first similarity may include obtaining the number of characters to be deleted, the number of characters to be inserted, and the number of characters to be substituted in order to change the first speech recognition sentence into the first input sentence, and obtaining the edit distance between the first speech recognition sentence and the first input sentence based on those numbers. In this case, it may be determined, based on the obtained number of characters to be deleted and number of characters to be inserted, to obtain one of the second similarity and the third similarity.

For example, when the number of characters to be inserted to change the first speech recognition sentence into the first input sentence accounts for at least a certain proportion of the edit distance, it may be determined to obtain the second similarity between the first input sentence and the second speech recognition sentence, in which the next sentence of the first speech recognition sentence is connected to the first speech recognition sentence. Conversely, when the number of characters to be deleted to change the first speech recognition sentence into the first input sentence accounts for at least a certain proportion of the edit distance, it may be determined to obtain the third similarity between the first speech recognition sentence and the second input sentence, in which the next sentence of the first input sentence is connected to the first input sentence.

When it has been determined in this way to obtain one of the second similarity and the third similarity, that similarity is obtained and compared with the first similarity. As a result, either the first input sentence or the first speech recognition sentence is updated and another iteration of the alignment algorithm is performed, or the first voice signal corresponding to the first speech recognition sentence and the first input sentence are stored in the database and the alignment algorithm terminates.
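The selection heuristic described above can be sketched by counting the edit operations along one optimal alignment. The 0.5 proportion, the function names `edit_ops` and `choose_similarity`, and the fallback of computing both similarities when neither insertions nor deletions dominate are assumptions; the source only states that a certain proportion is used.

```python
import re
from typing import Tuple


def edit_ops(src: str, dst: str) -> Tuple[int, int, int]:
    """Count (insertions, deletions, substitutions) needed to turn src into
    dst along one optimal Levenshtein alignment (spaces/punctuation ignored)."""
    src, dst = re.sub(r"\W+", "", src), re.sub(r"\W+", "", dst)
    n, m = len(src), len(dst)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]))
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:  # walk back from the end of the table
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != dst[j - 1]):
            sub += src[i - 1] != dst[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return ins, dele, sub


def choose_similarity(asr: str, script: str, proportion: float = 0.5) -> str:
    """Decide which extra similarity to compute (proportion is hypothetical)."""
    ins, dele, sub = edit_ops(asr, script)
    dist = ins + dele + sub
    if dist == 0:
        return "none"      # identical sentences
    if ins / dist >= proportion:
        return "second"    # ASR text is too short: connect next ASR sentence
    if dele / dist >= proportion:
        return "third"     # ASR text is too long: connect next script sentence
    return "both"          # assumed fallback when neither dominates


print(edit_ops("지금 도대체", "지금 도대체 뭐 하고 있는 거야"))          # (7, 0, 0)
print(choose_similarity("지금 도대체", "지금 도대체 뭐 하고 있는 거야"))  # second
print(choose_similarity("그치 야 너무 떨려", "끝이야."))                  # third
```

With this rule only one additional edit distance needs to be computed per iteration in the FIG. 4 and FIG. 5 cases, instead of two.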

FIG. 7 illustrates an operation of detecting and correcting errors in a voice database building system according to an embodiment.

Referring to FIG. 7, the voice database building system according to an embodiment may further include a module (160) that corrects at least one of the first voice signal, the first speech recognition sentence, and the first input sentence based on the first similarity obtained in the step of aligning the voice signal and the script in sentence units.

More specifically, when the first voice signal corresponding to the first speech recognition sentence and the first input sentence are stored in the database following the decision not to update the first speech recognition sentence and the first input sentence, the first voice signal and the first input sentence may be separately classified and stored as data requiring correction, according to a predetermined criterion based on the first similarity between the first voice signal and the first input sentence. For example, when the first similarity between the first speech recognition sentence and the first input sentence for which no update was decided is smaller than a predetermined value, the first voice signal (or the first voice signal and the first speech recognition sentence) and the first input sentence may be classified as data requiring correction and stored in the suspected-mismatch list (108); when the first similarity is greater than the predetermined value, the first voice signal (or the first voice signal and the first speech recognition sentence) and the first input sentence may be stored in the match list (107). As described above, even when the first similarity is greater than the predetermined value, the first voice signal (or the first voice signal and the first speech recognition sentence) and the first input sentence may be stored in the suspected-mismatch list (108) based on the number of connections.
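The routing between the match list and the suspected-mismatch list can be sketched as follows. The `AlignedPair` record, the assumption that the similarity is normalized to the range 0 to 1, and the threshold values `min_similarity` and `max_connections` are hypothetical; the source leaves the criterion open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedPair:
    audio_id: str          # identifier of the first voice signal
    asr_text: str          # first speech recognition sentence
    script_text: str       # first input sentence
    similarity: float      # first similarity (assumed normalized to 0..1)
    n_connections: int     # connections made during alignment


def route(pair: AlignedPair, min_similarity: float = 0.9,
          max_connections: int = 3) -> str:
    """Return the name of the list the pair should be stored in."""
    if pair.similarity < min_similarity:
        return "suspected_mismatch"
    if pair.n_connections > max_connections:
        # Similar enough, but many connections were needed to get there.
        return "suspected_mismatch"
    return "match"


match_list: List[AlignedPair] = []
suspected_mismatch_list: List[AlignedPair] = []

pairs = [
    AlignedPair("utt-001", "it is over", "It is over.", 1.0, 0),
    AlignedPair("utt-002", "its ova so nervous", "It is over. So nervous.", 0.7, 1),
    AlignedPair("utt-003", "a b c d e", "a b c d e", 1.0, 4),
]
for pair in pairs:
    target = match_list if route(pair) == "match" else suspected_mismatch_list
    target.append(pair)

print([p.audio_id for p in match_list])               # ['utt-001']
print([p.audio_id for p in suspected_mismatch_list])  # ['utt-002', 'utt-003']
```

Keeping the two lists separate lets the correction module work only on the suspected-mismatch entries.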

In other words, the system according to an embodiment may determine, based on the first similarity and according to a predetermined criterion, whether the script and the corresponding voice signal contain an error (for example, the speaker misread the script, the speech recognition result is erroneous, or the script contains a typo). The first voice signal (or the first voice signal and the first speech recognition sentence) and the first input sentence that the system determines to contain an error may be classified as data requiring correction and stored separately.

For example, suppose the system according to an embodiment determines that an error exists whenever the edit distance between two sentences is not 0. Because the edit distance between the first speech recognition sentence (511) and the first input sentence (630) of FIG. 6 is 2, the first voice signal (510), the first speech recognition sentence (511), and the first input sentence (630) are determined to contain an error and may be stored in the suspected-mismatch list.

In the correction module (160) according to an embodiment, the data stored in the suspected-mismatch list may be corrected in various ways. For example, when the speech recognition sentence is corrected with reference to the input sentence, the first voice signal of FIG. 6 may be stored in the database as being recognized as "끝이야 너무 떨려" ("It's over, I'm so nervous"). Alternatively, a user interface may be provided that uses the suspected-mismatch list so that the corresponding data can be corrected easily.

According to an embodiment, the result of obtaining at least one of the first similarity, the second similarity, and the third similarity may be displayed through an interface. For example, the sentences compared for the first similarity, the second similarity, and the third similarity, together with the computed edit distance values, may be displayed through the interfaces shown in FIGS. 4, 5, and 6. In addition, through an interface such as that of FIG. 6, the portion of the first speech recognition sentence (511) that differs from the first input sentence (630) may be underlined.
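Marking the differing portion for display can be sketched with Python's standard `difflib` module. Note that `difflib.SequenceMatcher` is not a Levenshtein alignment; it is swapped in here only as a convenient way to locate differing spans. Square brackets stand in for the underline, and the function name `mark_differences` and the removal of spaces and punctuation before comparison are assumptions.

```python
import difflib
import re


def mark_differences(asr: str, script: str) -> str:
    """Return the ASR text (spaces and punctuation removed) with the parts
    that differ from the script wrapped in square brackets."""
    a = re.sub(r"\W+", "", asr)
    b = re.sub(r"\W+", "", script)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        segment = a[i1:i2]
        if tag == "equal":
            out.append(segment)
        elif segment:
            # 'replace' or 'delete': characters present only in the ASR text
            out.append("[" + segment + "]")
    return "".join(out)


marked = mark_differences("그치 야 너무 떨려", "끝이야. 너무 떨려.")
print(marked)  # [그치]야너무떨려
```

For the FIG. 6 pair, only the misrecognized "그치" is marked, which is the portion an operator would need to review.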

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A method of building a voice database, the method comprising:
obtaining a first similarity between a first speech recognition sentence corresponding to a part of a voice signal and a first input sentence corresponding to a part of a script of the voice signal;
obtaining a second similarity between the first input sentence and a second speech recognition sentence in which a next sentence of the first speech recognition sentence is connected to the first speech recognition sentence;
obtaining a third similarity between the first speech recognition sentence and a second input sentence in which a next sentence of the first input sentence is connected to the first input sentence; and
aligning the voice signal and the script in sentence units based on the first similarity, the second similarity, and the third similarity.

The method of claim 1, wherein the aligning comprises:
when the first similarity is the highest, storing a first voice signal corresponding to the first speech recognition sentence and the first input sentence in a voice database;
when the second similarity is the highest, updating the first speech recognition sentence to the second speech recognition sentence and passing it to a next iteration; and
when the third similarity is the highest, updating the first input sentence to the second input sentence and passing it to the next iteration.

The method of claim 1, wherein the aligning comprises:
determining whether to update the first speech recognition sentence or the first input sentence based on the first similarity, the second similarity, and the third similarity;
when it is determined to update the first speech recognition sentence or the first input sentence, repeatedly performing the determining of whether to update, based on a first similarity, a second similarity, and a third similarity that are based on the updated first speech recognition sentence or the updated first input sentence; and
when it is determined not to update the first speech recognition sentence and the first input sentence, storing a first voice signal corresponding to the first speech recognition sentence and the first input sentence in the database.

4. The method of claim 3,
wherein the repeating of the determining of whether to update comprises:
obtaining a first similarity, a second similarity, and a third similarity based on the updated first speech recognition sentence and the first input sentence when it is determined to update the first speech recognition sentence; and
obtaining a first similarity, a second similarity, and a third similarity based on the updated first input sentence and the first speech recognition sentence when it is determined to update the first input sentence.
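A minimal sketch of the repeated update-and-recompute loop follows. Several things here are assumptions made for illustration: the function names `ratio` and `align`, the use of `difflib.SequenceMatcher` as a stand-in similarity measure, joining connected sentences with a single space, and working on recognized text only (a real system would also track which voice segments each pair covers).

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the measure defined by the claims."""
    return SequenceMatcher(None, a, b).ratio()


def align(
    recognized: List[str],
    script: List[str],
    similarity: Callable[[str, str], float] = ratio,
) -> List[Tuple[str, str]]:
    """Pair recognized sentences with script sentences, connecting as needed."""
    pairs: List[Tuple[str, str]] = []
    i = j = 0
    while i < len(recognized) and j < len(script):
        cur_rec, cur_in = recognized[i], script[j]
        i, j = i + 1, j + 1
        while True:
            # Recompute all three similarities from the current (possibly updated) sentences.
            first = similarity(cur_rec, cur_in)
            second = similarity(f"{cur_rec} {recognized[i]}", cur_in) if i < len(recognized) else -1.0
            third = similarity(cur_rec, f"{cur_in} {script[j]}") if j < len(script) else -1.0
            if first >= second and first >= third:
                pairs.append((cur_rec, cur_in))  # no update needed: store the pair
                break
            if second >= third:
                cur_rec = f"{cur_rec} {recognized[i]}"  # update the recognized sentence
                i += 1
            else:
                cur_in = f"{cur_in} {script[j]}"  # update the input sentence
                j += 1
    return pairs


recognized = ["hello world", "how are", "you today"]
script = ["hello world", "how are you today"]
for pair in align(recognized, script):
    print(pair)
```

Here the recognizer split one script sentence in two, so the second pass connects "how are" with "you today" before storing the pair.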

5. The method of claim 1,
further comprising:
segmenting the voice signal into sentence units;
converting the voice signal into text based on speech recognition; and
segmenting the script into sentence units.
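As a rough illustration of the script-side segmentation only, the sketch below splits text after sentence-final punctuation. The function name `split_sentences` and the punctuation rule are assumptions; the claims do not say how sentence boundaries are found, and segmenting the voice signal itself is not shown.

```python
import re
from typing import List


def split_sentences(script: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", script.strip())
    return [p for p in parts if p]


print(split_sentences("Hello there. How are you? Fine!"))
```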

6. The method of claim 1,
further comprising:
receiving the voice signal and the script;
pre-processing the received voice signal; and
pre-processing the received script.

7. The method of claim 1,
wherein the first similarity, the second similarity, and the third similarity are obtained based on an edit distance.
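One way an edit distance could be turned into a similarity is sketched below. The character-level Levenshtein distance is standard; normalizing by the longer string's length to get a score in [0, 1] is an assumption, as are the function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, replace each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
print(similarity("abc", "abc"))            # 1.0
```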

8. The method of claim 2,
wherein the aligning comprises:
modifying at least one of the first voice signal, the first speech recognition sentence, and the first input sentence, based on the first similarity, according to a predetermined criterion.

9. The method of claim 2,
wherein the aligning comprises:
modifying at least one of the first voice signal, the first speech recognition sentence, and the first input sentence based on a number of connections corresponding to the number of times the iteration is repeated.

10. The method of claim 1,
further comprising:
displaying, through an interface, a result of obtaining at least one of the first similarity, the second similarity, and the third similarity.

11. The method of claim 3,
wherein the storing in the database comprises:
performing forced voice alignment of the first voice signal and the first input sentence; and
segmenting and storing the first voice signal and the first input sentence based on a result of the forced voice alignment.

12. The method of claim 1,
further comprising, after obtaining the first similarity,
determining, based on the first similarity, whether to obtain the second similarity and whether to obtain the third similarity,
wherein the obtaining of the second similarity comprises
obtaining the second similarity based on the determination, and
wherein the obtaining of the third similarity comprises
obtaining the third similarity based on the determination.

13. The method of claim 12,
wherein the determining of whether to obtain the second similarity and whether to obtain the third similarity comprises:
determining not to obtain the second similarity and the third similarity when the first similarity satisfies a predetermined criterion; and
determining to obtain the second similarity and the third similarity when the first similarity does not satisfy the predetermined criterion.
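The gating described above can be sketched as follows. The "predetermined criterion" is assumed here to be a simple threshold on the first similarity, and the function name, the default threshold of 0.95, and the space used to connect sentences are all hypothetical.

```python
from typing import Callable, Optional, Tuple


def gated_similarities(
    rec: str,
    next_rec: str,
    inp: str,
    next_inp: str,
    similarity: Callable[[str, str], float],
    threshold: float = 0.95,  # assumed form of the "predetermined criterion"
) -> Tuple[float, Optional[float], Optional[float]]:
    """Compute the first similarity, and the other two only if it falls short."""
    first = similarity(rec, inp)
    if first >= threshold:
        # Criterion satisfied: skip the second and third similarities.
        return first, None, None
    second = similarity(f"{rec} {next_rec}", inp)
    third = similarity(rec, f"{inp} {next_inp}")
    return first, second, third


def exact(a: str, b: str) -> float:
    """Toy similarity used only for this demonstration."""
    return 1.0 if a == b else 0.0


print(gated_similarities("a", "b", "a", "c", exact))    # (1.0, None, None)
print(gated_similarities("a", "b", "a b", "c", exact))  # (0.0, 1.0, 0.0)
```

Skipping the two extra comparisons saves work when most sentence pairs already match one-to-one.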

14. The method of claim 12,
wherein the obtaining of the first similarity comprises:
obtaining the number of characters to be deleted, the number of characters to be inserted, and the number of characters to be replaced in order to change the first speech recognition sentence into the first input sentence; and
obtaining an edit distance between the first speech recognition sentence and the first input sentence based on the number of characters to be deleted, the number of characters to be inserted, and the number of characters to be replaced,
and wherein the determining of whether to obtain the second similarity and whether to obtain the third similarity comprises:
determining, based on the number of characters to be deleted and the number of characters to be inserted, to obtain one of the second similarity and the third similarity.
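The sketch below counts the deletions, insertions, and replacements along one minimum-cost edit path and then picks which similarity to obtain. The selection rule is an assumption, since the claim does not state it: if more insertions than deletions are needed, the recognized sentence is treated as too short and the second similarity is chosen; otherwise the third is chosen. Function names are hypothetical.

```python
from typing import Tuple


def edit_ops(source: str, target: str) -> Tuple[int, int, int]:
    """Return (deletions, insertions, replacements) turning source into target.

    Each cell holds (cost, deletions, insertions, replacements) for one
    minimum-cost path; the edit distance is the sum of the three counts.
    """
    prev = [(j, 0, j, 0) for j in range(len(target) + 1)]
    for i, cs in enumerate(source, 1):
        cur = [(i, i, 0, 0)]
        for j, ct in enumerate(target, 1):
            c, d, n, s = prev[j - 1]
            diag = (c, d, n, s) if cs == ct else (c + 1, d, n, s + 1)
            c, d, n, s = prev[j]
            delete = (c + 1, d + 1, n, s)
            c, d, n, s = cur[j - 1]
            insert = (c + 1, d, n + 1, s)
            cur.append(min(diag, delete, insert))
        prev = cur
    _, deletions, insertions, replacements = prev[-1]
    return deletions, insertions, replacements


def which_to_obtain(recognized: str, input_sentence: str) -> str:
    """Assumed rule: more insertions than deletions means extend the recognized side."""
    deletions, insertions, _ = edit_ops(recognized, input_sentence)
    return "second" if insertions > deletions else "third"


print(edit_ops("kitten", "sitting"))                    # (0, 1, 2)
print(which_to_obtain("how are", "how are you today"))  # second
```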

15. The method of claim 1, wherein:
the first similarity includes a similarity between the first speech recognition sentence converted into a phoneme sequence and the first input sentence converted into a phoneme sequence;
the second similarity includes a similarity between the second speech recognition sentence converted into a phoneme sequence and the first input sentence converted into a phoneme sequence; and
the third similarity includes a similarity between the first speech recognition sentence converted into a phoneme sequence and the second input sentence converted into a phoneme sequence.
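To illustrate why comparing below the character level can help, the sketch below decomposes Hangul syllables into jamo with Unicode NFD normalization and compares the resulting sequences. This is only an orthographic approximation of a phoneme sequence: a real system would apply grapheme-to-phoneme rules, which are not shown. The function names and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions.

```python
import unicodedata
from difflib import SequenceMatcher


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def jamo_similarity(a: str, b: str) -> float:
    """Stand-in similarity computed over jamo sequences instead of syllables."""
    return SequenceMatcher(None, to_jamo(a), to_jamo(b)).ratio()


# "같이" and "가치" share no syllable characters, yet most of their jamo match.
syllable_level = SequenceMatcher(None, "같이", "가치").ratio()
jamo_level = jamo_similarity("같이", "가치")
print(len(to_jamo("한국")))          # 6
print(syllable_level < jamo_level)  # True
```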

16. A computer program stored in a medium to execute, in combination with hardware, the method of any one of claims 1 to 15.

17. A voice database building device, comprising:
at least one processor configured to obtain a first similarity between a first speech recognition sentence corresponding to a part of a voice signal and a first input sentence corresponding to a part of a script of the voice signal, obtain a second similarity between the first input sentence and a second speech recognition sentence in which a next sentence of the first speech recognition sentence is connected to the first speech recognition sentence, obtain a third similarity between the first speech recognition sentence and a second input sentence in which a next sentence of the first input sentence is connected to the first input sentence, and align the voice signal and the script in sentence units based on the first similarity, the second similarity, and the third similarity; and
a voice database configured to store the voice signal and the script aligned in the sentence units.

18. The device of claim 17,
wherein the processor is configured to,
in aligning the voice signal and the script in sentence units,
store a first voice signal corresponding to the first speech recognition sentence, together with the first input sentence, in the voice database when the first similarity is the highest; update the first speech recognition sentence to the second speech recognition sentence and pass it to a next iteration when the second similarity is the highest; and update the first input sentence to the second input sentence and pass it to the next iteration when the third similarity is the highest.

19. The device of claim 17,
wherein the processor is configured to,
in aligning the voice signal and the script in sentence units,
determine whether to update the first speech recognition sentence or the first input sentence based on the first similarity, the second similarity, and the third similarity; upon determining to update the first speech recognition sentence or the first input sentence, repeat the determining of whether to update, based on a first similarity, a second similarity, and a third similarity obtained from the updated first speech recognition sentence or the updated first input sentence; and store a first voice signal corresponding to the first speech recognition sentence, together with the first input sentence, in the database in response to determining to update neither the first speech recognition sentence nor the first input sentence.

20. The device of claim 19,
wherein the processor is configured to,
in repeating the determining of whether to update,
obtain a first similarity, a second similarity, and a third similarity based on the updated first speech recognition sentence and the first input sentence when it is determined to update the first speech recognition sentence, and obtain a first similarity, a second similarity, and a third similarity based on the updated first input sentence and the first speech recognition sentence when it is determined to update the first input sentence.

21. The device of claim 17,
wherein the processor is configured to
segment the voice signal into sentence units, convert the voice signal into text based on speech recognition, and segment the script into sentence units.

22. The device of claim 18,
wherein the processor is configured to,
in aligning the voice signal and the script in sentence units,
modify at least one of the first voice signal, the first speech recognition sentence, and the first input sentence, based on the first similarity, according to a predetermined criterion.

23. The device of claim 17,
wherein the processor is configured to,
in storing a first voice signal corresponding to the first speech recognition sentence and the first input sentence in the database,
perform forced voice alignment of the first voice signal and the first input sentence, and segment and store the first voice signal and the first input sentence based on a result of the forced voice alignment.

24. The device of claim 17,
wherein the processor is configured to,
after obtaining the first similarity,
determine, based on the first similarity, whether to obtain the second similarity and whether to obtain the third similarity, and obtain the second similarity and the third similarity based on the determination.

25. The device of claim 24,
wherein the processor is configured to,
in determining whether to obtain the second similarity and whether to obtain the third similarity,
determine not to obtain the second similarity and the third similarity when the first similarity satisfies a predetermined criterion, and determine to obtain the second similarity and the third similarity when the first similarity does not satisfy the predetermined criterion.