KR102185387B1

KR102185387B1 - Sound recognition subtitle production system and control method thereof

Info

Publication number: KR102185387B1
Application number: KR1020190082859A
Authority: KR
Inventors: 이성철
Original assignee: 한국방송통신대학교 산학협력단
Priority date: 2019-07-09
Filing date: 2019-07-09
Publication date: 2020-12-01

Abstract

The present invention relates to a speech recognition subtitle generation system. The system includes: a basic subtitle data generation unit that determines a basic text for input audio data in an input audio file containing the input audio data, determines at least one candidate text for the determined basic text, determines one of the at least one candidate text as a final text to generate basic subtitle text for the input audio data, performs morpheme analysis on the basic subtitle text and applies word-level spacing to generate a basic subtitle, and generates basic subtitle data containing the basic subtitle; and a subtitle generation unit that generates a preliminary subtitle by dividing the basic subtitle into sentences and applying spacing within each divided sentence. According to the present invention, the accuracy of subtitle generation through speech recognition is improved.

Description

Voice recognition subtitle generation system and method {SOUND RECOGNITION SUBTITLE PRODUCTION SYSTEM AND CONTROL METHOD THEREOF}

The present invention relates to a system and method for automatically generating subtitles through speech recognition.

Various services, such as captioning (open captions or closed captions), sign language, and video description, are provided so that people with disabilities can use TV broadcast content.

Among these, captioned broadcasting helps hearing-impaired viewers watch broadcasts, or provides a silent broadcast at the viewer's choice, as with a mute function.

Currently, many broadcast programs provide captions.

Captioned broadcasting is a text broadcasting service mainly for the hearing impaired. It generally takes one of two forms: open captions, such as lyrics shown on screen while a singer performs, and closed captions, in which the audio is processed into captions separately from the main broadcast and provided as a supplementary broadcast for the hearing impaired. Captioned broadcasting usually refers to closed captions.

A captioned broadcast can be viewed by selecting the caption function on a receiver or set-top box that supports caption reception.

Captions are produced either in advance or in real time by a stenographer or a speech recognition system.

In addition, with the development of Internet technology, online lectures that can be taken over the Internet without attending an academy or school in person have become popular.

Video content is produced separately for such online lectures.

Subtitle services for the hearing impaired are also provided for such Internet-based online broadcasts.

In most such online broadcasts, the video content is produced separately, for example by recording a lecture given in front of a camera. For the subtitle service, the lecture content is transcribed by a stenographer or the like, and the transcript is edited into subtitles as-is.

In this case, sentences entirely unrelated to the actual lecture content are subtitled and output as-is, and unnecessary filler sounds such as '음' ('um') and '어' ('uh'), which the lecturer repeats meaninglessly out of speech habit, are subtitled as well.

When language or sentences unrelated to the lecture content are subtitled in an educational Internet lecture, which is provided for study rather than as drama or entertainment, they considerably hinder understanding of the lecture.

In addition, general speech recognition uses a single acoustic model and a single language model regardless of the speaker.

This yields high recognition performance when the input speech belongs to similar classes, but high performance cannot be guaranteed when the classes of the input speech differ.

Republic of Korea Patent Registration No. 10-1233124 (publication date: February 21, 2013; title of invention: Subtitle production system)

The problem to be solved by the present invention is to improve the accuracy of subtitle generation through speech recognition.

To solve the above problem, the speech recognition subtitle generation system of the present invention includes: a basic subtitle data generation unit that determines a basic text for input audio data in an input audio file containing the input audio data, determines at least one candidate text for the determined basic text, determines one of the at least one candidate text as a final text to generate basic subtitle text for the input audio data, performs morpheme analysis on the basic subtitle text and applies word-level spacing to it to generate a basic subtitle, and generates basic subtitle data containing the basic subtitle; and a subtitle generation unit that is connected to the basic subtitle data generation unit, divides the basic subtitle into sentences, and applies spacing within each divided sentence to generate a preliminary subtitle.

The speech recognition subtitle generation system according to the above feature may further include per-lecturer speech models and per-lecturer language models, each connected to the basic subtitle data generation unit, and the input audio file may further include lecturer information. Using the lecturer information, the basic subtitle data generation unit may determine the speech model and the language model for the lecturer from the per-lecturer speech models and the per-lecturer language models, respectively; determine, using the lecturer's speech model, at least one candidate text for the basic text corresponding to the speech information of the input audio data; and select, using the lecturer's language model, one of the at least one candidate text as the final text for the basic text, thereby generating the basic subtitle text for the input audio data.
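The two-stage selection described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the callable types `SpeechModel` and `LanguageModel`, the function `generate_basic_subtitle_text`, the lecturer key `"lee"`, and the toy models are all hypothetical, and the language model is assumed to return a score where higher means more plausible.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (not from the patent text):
# a speech model maps audio to one or more candidate texts,
# a language model maps a candidate text to a score (higher is better).
SpeechModel = Callable[[bytes], List[str]]
LanguageModel = Callable[[str], float]


def generate_basic_subtitle_text(
    audio: bytes,
    lecturer_id: str,
    speech_models: Dict[str, SpeechModel],
    language_models: Dict[str, LanguageModel],
) -> str:
    """Pick the lecturer's models, list candidates, keep the best-scoring one."""
    speech_model = speech_models[lecturer_id]      # per-lecturer speech model
    language_model = language_models[lecturer_id]  # per-lecturer language model
    candidates = speech_model(audio)
    if not candidates:
        raise ValueError("speech model returned no candidate text")
    return max(candidates, key=language_model)


# Toy stand-ins so the sketch can be run without any real models.
toy_speech_models = {
    "lee": lambda audio: ["오늘 강이를 시작합니다", "오늘 강의를 시작합니다"],
}
toy_language_models = {
    "lee": lambda text: 1.0 if "강의" in text else 0.0,
}

final_text = generate_basic_subtitle_text(
    b"", "lee", toy_speech_models, toy_language_models
)
```

Keying both model tables by the same lecturer identifier mirrors the text's use of the lecturer information to look up both models; how that identifier is derived from the lecturer's name and affiliation is left open here.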

The speech recognition subtitle generation system according to the above feature may further include a morpheme analysis unit that is connected to the basic subtitle data generation unit and performs morpheme analysis on the basic subtitle text to separate it into words.

The morpheme analysis unit may input part-of-speech information for each separated word to the basic subtitle data generation unit, and the basic subtitle data generation unit may perform the spacing using the part of speech of each word and include the part-of-speech information in the basic subtitle data.
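As a rough illustration of spacing driven by part-of-speech information, the sketch below joins a morpheme sequence into spaced words. The text does not state which tag set or which attachment rules are used, so everything here is an assumption: tags are taken to follow a Sejong-style scheme in which particles (`J*`), endings (`E*`), and derivational suffixes (`XS*`) attach to the preceding morpheme, and the name `space_by_pos` is hypothetical. Contracted forms and punctuation are not handled.

```python
from typing import List, Tuple

# Assumption: Sejong-style tag prefixes. Morphemes with these tags are
# written without a space after the preceding morpheme.
ATTACHING_PREFIXES = ("J", "E", "XS")


def space_by_pos(morphemes: List[Tuple[str, str]]) -> str:
    """Join (surface, tag) pairs into a string with word-level spacing."""
    words: List[str] = []
    for surface, tag in morphemes:
        if words and tag.startswith(ATTACHING_PREFIXES):
            words[-1] += surface   # attach particle/ending/suffix to previous word
        else:
            words.append(surface)  # any other tag starts a new word
    return " ".join(words)


spaced = space_by_pos(
    [("오늘", "MAG"), ("강의", "NNG"), ("를", "JKO"), ("듣", "VV"), ("는다", "EF")]
)
```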

The basic subtitle data may include a word start time and a word end time assigned to each word of the basic subtitle using time information of the candidate text.

The subtitle generation unit may split off a sentence where the time between two adjacent words of the basic subtitle, determined from their word start time and word end time, is equal to or greater than a set silence time; further split each sentence separated on the basis of the set silence time using the parts of speech of its words; and further split each sentence separated on the basis of part of speech using the number of characters, thereby separating the final sentences of the basic subtitle.

The subtitle generation unit may generate the preliminary subtitle by correcting the spacing within the final sentences using parts of speech and spacing orthography rules.

The speech recognition subtitle generation apparatus according to the above feature may further include a per-lecturer error dictionary database connected to the subtitle generation unit. The subtitle generation unit may select, from the per-lecturer error dictionary database, the error dictionary database corresponding to the lecturer using the lecturer information included in the basic subtitle data; compare the words in the final sentences with the words in the selected error dictionary database; and replace non-standard words in the final sentences with standard words to generate the preliminary subtitle.

The speech recognition subtitle generation apparatus according to the above feature may further include an output unit and a user input unit connected to the subtitle generation unit. The subtitle generation unit may output the preliminary subtitle to the output unit; when the preliminary subtitle is edited through the user input unit, output the edited subtitle entered through the user input unit to the output unit; and when an apply operation is performed on the edited subtitle through the user input unit, select the edited subtitle as the final subtitle.

A speech recognition subtitle generation method according to another feature of the present invention may include: determining, by a basic subtitle data generation unit, a basic text for input audio data contained in an input audio file, and determining at least one candidate text for the determined basic text; determining, by the basic subtitle data generation unit, a final text from the at least one candidate text to generate basic subtitle text for the input audio data; generating, by the basic subtitle data generation unit, a basic subtitle by applying word-level spacing to the basic subtitle text according to the result of morpheme analysis of the basic subtitle text; dividing, by a subtitle generation unit, the basic subtitle into sentences; and generating, by the subtitle generation unit, a preliminary subtitle by applying spacing within each divided sentence.

The determining of the at least one candidate text may include: determining, by the basic subtitle data generation unit, the speech model and the language model for the lecturer from the per-lecturer speech models and the per-lecturer language models, respectively, using lecturer information contained in the input audio file; and determining, by the basic subtitle data generation unit, the at least one candidate text for the basic text corresponding to the speech information of the input audio data using the lecturer's speech model.

The generating of the basic subtitle text may include selecting, by the basic subtitle data generation unit, one of the at least one candidate text using the lecturer's language model to determine the final text for the basic text.

The generating of the basic subtitle may include performing the word-level spacing using the part of speech of each word obtained from the morpheme analysis.

The generating of the basic subtitle may include generating a word time code that assigns a word start time and a word end time to each word of the basic subtitle using time information of the candidate text.

The dividing into sentences may further include: splitting off a sentence where the time between two adjacent words of the basic subtitle, determined from their word start time and word end time, is equal to or greater than the set silence time; further splitting each sentence separated on the basis of the set silence time using the parts of speech of its words; and further splitting each sentence separated on the basis of part of speech using the number of characters, thereby separating the final sentences of the basic subtitle.

The dividing into sentences may further include generating the preliminary subtitle by correcting the spacing within the final sentences using parts of speech and spacing orthography rules.

The generating of the preliminary subtitle may include: selecting, from the per-lecturer error dictionary database, the error dictionary database corresponding to the lecturer using the lecturer information included in the basic subtitle data; and comparing the words in the final sentences with the words in the selected error dictionary database and replacing non-standard words in the final sentences with standard words to generate the preliminary subtitle data.

The speech recognition subtitle generation method according to the above feature may further include: outputting, by the subtitle generation unit, the preliminary subtitle to an output unit; determining, by the subtitle generation unit, whether the preliminary subtitle has been edited through a user input unit; when it is determined that the preliminary subtitle has been edited, outputting, by the subtitle generation unit, the edited subtitle entered through the user input unit to the output unit; and when an apply operation is performed on the edited subtitle through the user input unit, selecting, by the subtitle generation unit, the edited subtitle as the final subtitle.

According to these features of the present invention, a plurality of candidate texts are first selected on the basis of the basic text corresponding to the input audio data, and one of the candidate texts is then selected as the final text, which improves the accuracy of the basic subtitle.

In addition, the basic subtitle is split into sentences in three stages, and spacing is then applied to the basic subtitle whose sentences have been finally separated, so that the preliminary subtitle corresponding to the input audio data is generated automatically and accurately.

In particular, because the candidate texts and the final text are selected to suit the characteristics of the lecturer using the per-lecturer speech model and language model, the accuracy of subtitle generation is improved further.

FIG. 1 is a schematic block diagram of a speech recognition subtitle generation system according to an embodiment of the present invention.
FIG. 2 is a schematic block diagram of the speech recognition subtitle generation apparatus shown in FIG. 1.
FIG. 3 is an operational flowchart of the speech recognition subtitle generation system according to an embodiment of the present invention.
FIG. 4 is a detailed flowchart of the basic subtitle data generation operation shown in FIG. 3.
FIGS. 5A and 5B are detailed flowcharts of the preliminary subtitle data generation operation shown in FIG. 3.
FIG. 6 is a detailed flowchart of the final subtitle data generation operation shown in FIG. 3.
FIG. 7 shows an example of an initial subtitle editing screen in the speech recognition subtitle generation system according to an embodiment of the present invention.
FIG. 8 shows an example of a preliminary subtitle screen in the speech recognition subtitle generation system according to an embodiment of the present invention.
FIG. 9 shows an example in which the subtitle editing window of the preliminary subtitle screen is activated in the speech recognition subtitle generation system according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of technologies or configurations already known in the field are partly omitted where they could obscure the gist of the invention. The terms used in this specification are chosen to properly express the embodiments of the present invention and may vary according to the practitioners or conventions of the field. Accordingly, these terms should be defined on the basis of the content of this specification as a whole.

The terminology used herein is only for describing specific embodiments and is not intended to limit the present invention. Singular forms used herein include plural forms unless the wording clearly indicates otherwise. As used in this specification, 'comprising' specifies particular characteristics, regions, integers, steps, operations, elements, and/or components, and does not exclude the presence or addition of other characteristics, regions, integers, steps, operations, elements, components, and/or groups.

Hereinafter, a speech recognition subtitle generation system and a control method thereof according to an embodiment of the present invention are described with reference to the accompanying drawings.

First, a speech recognition subtitle generation system 1 according to an embodiment of the present invention is described with reference to FIG. 1.

As shown in FIG. 1, the speech recognition subtitle generation system 1 of this example includes a user input unit 10, a subtitle generation apparatus 20 connected to the user input unit 10, a storage unit 30 connected to the subtitle generation apparatus 20, a morpheme analysis unit 40 connected to the subtitle generation apparatus 20, a database unit 50 connected to the subtitle generation apparatus 20 and the morpheme analysis unit 40, an output unit 60 connected to the subtitle generation apparatus 20, a deep learning training unit 70, and a model unit 80 connected to the deep learning training unit 70.

The user input unit 10 generates signals related to the user's input operations, such as entering commands or data to control the operation of the speech recognition subtitle generation system 1, and outputs them to the subtitle generation apparatus 20.

The user input unit 10 may consist of a keypad, a dome switch, a touch pad, a jog switch, a mouse, or the like.

The subtitle generation apparatus 20 is a control device that performs overall control of the speech recognition subtitle generation system 1. It receives an audio file to be converted into subtitles (hereinafter, the 'input audio file'), converts the audio data contained in the input audio file (hereinafter, the 'input audio data') into the corresponding text to generate a basic subtitle, and generates basic subtitle data containing the basic subtitle.

In this example, the input audio file stores the audio data to be converted into subtitles and lecturer information for that audio data. The lecturer information may include the lecturer's name and affiliation.

In addition, the subtitle generation apparatus 20 divides the basic subtitle contained in the basic subtitle data into sentences and corrects its spacing to generate a subtitle with sentence division and spacing applied (hereinafter, the 'preliminary subtitle'), generates preliminary subtitle data containing the preliminary subtitle, and outputs the preliminary subtitle corresponding to the generated preliminary subtitle data to the output unit 60.

In addition, when the user corrects the preliminary subtitle output to the output unit 60 using the user input unit 10, the subtitle generation apparatus 20 takes the corrected preliminary subtitle as the final subtitle and stores final subtitle data containing the final subtitle in the storage unit 30.

The structure and operation of the subtitle generation apparatus 20 are described in detail below.

The storage unit 30 is a storage medium, such as a memory, that stores data required for the operation of the subtitle generation apparatus 20 and data generated during its operation.

Accordingly, the storage unit 30 of this example stores the word time codes, the sentence time codes, the set silence time, the set character count, and the basic subtitle data, preliminary subtitle data, and final subtitle data generated during the operation of the subtitle generation apparatus 20.

Here, the word time code is time information on the start and end of each word and comprises a word start time and a word end time for each word.

The sentence time code is time information on the start and end of each sentence and comprises a sentence start time and a sentence end time for each sentence.
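The two time codes can be pictured as simple records. The sketch below is only an illustration: the class and field names are hypothetical, times are assumed to be in seconds, and deriving a sentence's start and end from its first and last word is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTimeCode:
    word: str
    start: float  # word start time (seconds assumed)
    end: float    # word end time (seconds assumed)


@dataclass
class SentenceTimeCode:
    text: str
    start: float  # sentence start time
    end: float    # sentence end time


def sentence_time_code(words: List[WordTimeCode]) -> SentenceTimeCode:
    """Assumed derivation: a sentence spans from its first word to its last."""
    if not words:
        raise ValueError("a sentence needs at least one word")
    return SentenceTimeCode(
        text=" ".join(w.word for w in words),
        start=words[0].start,
        end=words[-1].end,
    )


example = sentence_time_code(
    [WordTimeCode("오늘", 0.0, 0.4), WordTimeCode("강의를", 0.5, 1.1)]
)
```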

The set silence time and the set character count are setting values used to split the basic subtitle, which consists of a single continuous sentence with no sentence breaks, into sentences. Both values are set by an administrator and may be changed as needed.

The set silence time is the time between the lecturer finishing one utterance and beginning the next, that is, a period in which the lecturer is not speaking. Accordingly, the subtitle generation apparatus 20 determines that a new sentence starts when the time between the end time of one word and the start time of the adjacent next word in the basic subtitle exceeds the set silence time.
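The silence-based rule above can be sketched as a single pass over the timed words. This is a minimal illustration under stated assumptions: each word is represented as a hypothetical `(text, start, end)` tuple with times in seconds, the function name `split_on_silence` is made up for this example, and a new sentence starts only when the gap strictly exceeds the threshold, following the wording of the paragraph above.

```python
from typing import List, Tuple

# Hypothetical representation: (text, start time, end time), seconds assumed.
Word = Tuple[str, float, float]


def split_on_silence(words: List[Word], silence_threshold: float) -> List[List[Word]]:
    """Group consecutive words; start a new sentence after a long enough pause."""
    sentences: List[List[Word]] = []
    for word in words:
        if sentences and word[1] - sentences[-1][-1][2] <= silence_threshold:
            sentences[-1].append(word)  # pause is short: same sentence
        else:
            sentences.append([word])    # first word, or pause exceeds threshold
    return sentences


timed_words = [("오늘", 0.0, 0.4), ("강의를", 0.5, 1.1), ("다음은", 2.5, 3.0)]
groups = split_on_silence(timed_words, silence_threshold=1.0)
texts = [" ".join(w[0] for w in group) for group in groups]
```

With the sample data, the 0.1 s pause keeps the first two words together, while the 1.4 s pause before the third word opens a new sentence.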

The set character count is the maximum number of characters allowed in one sentence when the single-sentence basic subtitle is split. The count includes not only the actual characters but also the spaces between them; each character is counted as 2 bytes and each space as 1 byte. As an example, the set character count may be 80 bytes.

Accordingly, when the number of characters in a sentence reaches the set character count, the subtitle generation apparatus 20 treats the characters located beyond the set character count as the start of a new sentence, splitting the sentence in units of the set character count.

The morpheme analysis unit 40 performs morpheme analysis when the basic subtitle and the preliminary subtitle are generated, and operates under the control of the subtitle generation apparatus 20.

The database unit 50 stores, in database form, the data required for the subtitle generation operation of the subtitle generation apparatus 20 and the data required for the operation of the morpheme analysis unit 40.

The database unit 50 of this example includes a per-lecturer error dictionary database (DB) 51 and a morpheme dictionary database 52.

The per-lecturer error dictionary database 51 stores an error dictionary for each lecturer.

오류 사전은 각 강의자가 반복적이거나 습관적으로 비표준어로 사용하는 단어(이하, 이 비표준어로 사용하는 단어를 '비표준어 단어'라 함)와 비표준어 단어에 대응하는 표준어 단어를 구비한다. The error dictionary includes words that each lecturer uses repeatedly or habitually as non-standard words (hereinafter, words used as non-standard words are referred to as'non-standard words') and standard words corresponding to non-standard words.

따라서, 생성된 예비 자막에 오류 사전에 구비된 비표준어 단어가 존재하면, 해당 비표준어 단어에 대응하는 표준어 단어로 교체되는 강의자 특화 오류 보정이 이루어진다.Accordingly, if there is a non-standard word provided in the error dictionary in the generated preliminary subtitle, a lecturer-specific error correction is performed that is replaced with a standard word corresponding to the non-standard word.

The morpheme dictionary database 52 is the morpheme dictionary used by the morpheme analyzer 40 for morpheme analysis; it stores the part of speech, conjugation information, and the like for each morpheme.

The output unit 60 visually outputs captions, such as the preliminary caption, that are generated and output by the speech recognition apparatus 20 under the control of the speech recognition apparatus 20. It may include at least one display device among a liquid crystal display, an organic light emitting diode display, a flexible display, and a 3D display.

The deep learning training unit 70 can perform two main roles.

First, for training, it receives a lecturer audio file, which is an audio file for each lecturer, and the corresponding lecturer text file, and applies a neural network algorithm to generate an acoustic model for each lecturer (the per-lecturer acoustic model) 81 and a language model for each lecturer (the per-lecturer language model) 82.

Second, for real-time recognition, it can output text information from given input audio data by using the per-lecturer acoustic model 81 and the per-lecturer language model 82 generated by training.

Accordingly, as its specific second operation, the deep learning training unit 70 may use the speech model for the lecturer to determine at least one candidate text for the basic text corresponding to the speech information of the input audio data, and may use the language model for the lecturer to select one of the at least one candidate text as the final text for the basic text, thereby generating the basic caption text for the input audio data.

This second operation may instead be performed by a deep learning speech recognition unit built as a separate component connected to the caption generating apparatus 20 and the model unit 80.

The second operation of the deep learning training unit 70 may also be performed by the caption generating apparatus 20, and this specification describes it as being performed by the caption generating apparatus 20.

The deep learning training unit 70 may build these models 81 and 82 using the model parameter training method employed in computer-implementable caption generation systems based on deep neural networks. Because the model construction method using a deep neural network in this example is a known method, a detailed description is omitted.

Accordingly, when an input audio file containing input audio data to be converted into captions is input, the caption generating apparatus 20, the deep learning training unit 70, or the deep learning speech recognition unit generates basic caption data corresponding to the input audio data by using the acoustic model 81 and the language model 82 for the relevant lecturer.

The per-lecturer acoustic model 81 and the per-lecturer language model 82 are built by the operation of the deep learning training unit 70 and may take the form of databases.

The per-lecturer acoustic model 81 is a collection built by modeling the pronunciation characteristics of each phonological environment through frequency analysis of each lecturer's audio data, so that the acoustic characteristics of each phoneme's pronunciation are represented, statistically or by pattern classification, as several thousand to several tens of thousands of models.

For example, sound units of Korean such as 'ㄱ', 'ㄴ', 'ㄷ', ..., 'ㅏ', 'ㅑ', 'ㅓ', ... are symbolized and learned, and are then decoded for use.

More specifically, the per-lecturer acoustic model 81 stores, for each lecturer, the frequency of the lecturer's audio data (hereinafter, 'speech frequency'), its amplitude (hereinafter, 'speech amplitude'), the text corresponding to the speech frequency and speech amplitude (the basic text), and at least one text that the lecturer correctly intends by that basic text (the candidate text).

The per-lecturer language model 82 stores the result of statistically modeling sentence-level syntactic structure in each lecturer's language information (text information): text data of tens of millions to hundreds of millions of words is statistically modeled as 1-grams, 2-grams, 3-grams, and so on. Here, 'gram' refers to the number of adjacent words: a 1-gram uses a single word, a 2-gram is two consecutive adjacent words, and a 3-gram is three consecutive adjacent words.

For example, when the immediately preceding word is '나' ('I') and the candidate next words are the particles '은' and '는', the combination probability that '은' follows '나' (that is, '나은') and the combination probability that '는' follows '나' (that is, '나는') are stored for each lecturer. This combination probability, which is the probability of adjacent texts occurring together for each lecturer, is called the text combination probability.
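
As a rough illustration of how such 2-gram text combination probabilities could be estimated, the Python sketch below uses simple relative frequencies over adjacent token pairs. The description does not specify the estimation method, so this is an assumption; the function name `bigram_probabilities` and the toy token list are hypothetical and are not data from the patent.

```python
from collections import Counter


def bigram_probabilities(tokens):
    """Estimate P(next | previous) for adjacent token pairs by relative frequency."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Every token except the last one appears as the "previous" item of some pair.
    previous_counts = Counter(tokens[:-1])
    return {
        pair: count / previous_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy token sequence, invented purely for illustration.
toy_tokens = ["나", "는", "학생", "나", "는", "교사", "나", "은", "선택"]
probs = bigram_probabilities(toy_tokens)
```

In this toy sequence '나' is followed by '는' twice and by '은' once, so `probs[("나", "는")]` is 2/3 and `probs[("나", "은")]` is 1/3.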

Next, the speech recognition apparatus 20 will be described with reference to FIG. 2.

As shown in FIG. 2, the speech recognition apparatus 20 of this example includes a basic caption data generation unit 21, a caption generation unit 22 connected to the basic caption data generation unit 21, and a timer 23 connected to the basic caption data generation unit 21.

The basic caption data generation unit 21 receives an input audio file, generates a basic caption from it, and then generates basic caption data containing the basic caption.

In addition to the basic caption, the basic caption data contains a word start time and a word end time for each word, as well as lecturer information.

The caption generation unit 22 receives the basic caption data from the basic caption data generation unit 21, splits the basic caption into sentences, and applies word spacing to the split sentences to generate a preliminary caption. The caption generation unit 22 then generates preliminary caption data containing the preliminary caption and stores it in the storage unit 30.

In addition to the preliminary caption, the preliminary caption data contains a sentence start time and a sentence end time for each sentence, as well as lecturer information. The sentence start time and sentence end time can be calculated from the word start times and word end times.

The caption generation unit 22 also outputs the preliminary caption data to the output unit 60, determines the corrections made to the preliminary caption from the signals input through the user input unit 10, and generates the final caption.

The timer 23 is used to assign the word start times and word end times of the basic caption; the basic caption data generation unit 21 determines the word start time and word end time of each word using the input audio data and the time counted by the timer 23.

Next, the operation of the speech recognition caption generation system 1 having this structure will be described in detail with reference to FIGS. 3 to 6.

When the power required for its operation is supplied, the speech recognition caption generation system 1 starts operating (S1).

Accordingly, as shown in FIG. 3, the speech recognition caption generation system 1 generates captions for the input audio data by performing the basic caption data generation operation (S10) with the basic caption data generation unit 21, and the preliminary caption data generation operation (S20) and the final caption data generation operation (S30) with the caption generation unit 22.

First, the basic caption data generation operation (S10) will be described with reference to FIG. 4.

When the caption generating apparatus 20 is activated by the operation of the speech recognition caption generation system 1, the basic caption data generation unit 21 first determines whether an input audio file to be converted into captions has been input (S11).

If it determines that an input audio file has been input (S11), the basic caption data generation unit 21 stores the input audio file in the storage unit 30 (S12).

The basic caption data generation unit 21 then determines the lecturer corresponding to the input audio file by using the lecturer information stored in the input audio file (S13).

Next, using the acoustic model 81 that corresponds to the determined lecturer among the connected per-lecturer acoustic models 81, the basic caption data generation unit 21 searches, in chronological order, for the basic text corresponding to the speech frequency and speech amplitude of the input audio data (hereinafter, the speech frequency and speech amplitude are referred to as 'audio information'), determines at least one candidate text corresponding to the retrieved basic text, and stores it in the storage unit 30 (S15).

For example, when the basic text corresponding to the speech information is '갱성', the candidate texts for '갱성' may be '경성' and '강성'; when the basic text corresponding to the speech information is '거기', the candidate texts for '거기' may be '고기' and '거기'.

At this time, the basic caption data generation unit 21 operates the timer 23 to determine the time information of the basic text for the audio information retrieved in chronological order (that is, the basic text start time and basic text end time), and stores it in the storage unit 30, in association with the determined candidate text, as the time information of the candidate text (that is, the candidate text start time and candidate text end time).

Through this operation of the basic caption data generation unit 21, the first stage of text conversion of the input audio file is carried out.

Next, to determine the final text for the basic text, the basic caption data generation unit 21 uses the text combination probabilities of the language model 82 that corresponds to the determined lecturer among the per-lecturer language models 82.

The basic caption data generation unit 21 thus determines the final text from among the at least one candidate text for the basic text and generates the basic caption text for the input audio data (S15). If there is only one candidate text for the basic text, the basic caption data generation unit 21 takes that candidate text as the final text for the basic text when generating the basic caption text for the input audio data.

For example, when the immediately preceding text is '나' and the candidate next texts are '은' and '는', the basic caption data generation unit 21 uses the text combination probabilities of the relevant language model 82 to compare the combination probability that '은' follows '나' (that is, '나은') with the combination probability that '는' follows '나' (that is, '나는'), and selects the candidate text with the higher probability as the final text.

Accordingly, if the combination probability of '는' following '나' ('나는') is greater than the combination probability of '은' following '나' ('나은'), the basic caption data generation unit 21 finally selects '는' from the candidate texts '은' and '는'.
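
The selection step just described amounts to taking the candidate with the highest text combination probability. A minimal Python sketch follows; the function name `pick_final_text` and the probability values are hypothetical placeholders chosen for illustration and do not come from the patent.

```python
def pick_final_text(previous, candidates, combination_prob):
    """Return the candidate whose combination with the preceding text is most probable.

    combination_prob maps (previous_text, candidate_text) pairs to probabilities;
    pairs that are missing from the table are treated as probability 0.
    """
    if len(candidates) == 1:
        # A single candidate is taken as the final text directly.
        return candidates[0]
    return max(candidates, key=lambda c: combination_prob.get((previous, c), 0.0))


# Hypothetical probabilities for one lecturer's language model.
example_probs = {("나", "는"): 0.7, ("나", "은"): 0.1}
final_text = pick_final_text("나", ["은", "는"], example_probs)
```

With these placeholder values, `final_text` is '는', matching the outcome described for '나는' versus '나은'.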

In the same way, when the candidate texts preceding '에서' are '경성' and '강성', and the combination probability of '경성에서' is greater than that of '강성에서', the basic caption data generation unit 21 selects '경성' as the final text from the candidate texts '경성' and '강성'.

When the candidate texts preceding '를' are '고기' and '거기', the combination probability of '고기' with '를' is greater than that of '거기' with '를', but both combinations are possible. In such a case, 3-gram information is used to select the most suitable string combination.

One example of basic caption text produced by the basic caption data generation unit 21 through this process is '나는경성에서고기를먹었다' ('I ate meat in Gyeongseong', written without spaces).

Next, the basic caption data generation unit 21 takes the basic caption text, uses the morpheme analyzer 40 to split it into words, assigns each split word a word start time and a word end time as its time information by using the candidate-text time information stored in the storage unit 30, and stores the result in the storage unit 30 (S16). The morpheme analyzer 40 also passes the part-of-speech information of the split words to the basic caption data generation unit 21, which stores the part-of-speech information of each word in the storage unit 30 in association with that word.

In this example, particles (조사) also count as words, so particles are split off as separate words in the basic caption text.

As an alternative example, the operation of assigning time information to each split word may be performed by the basic caption data generation unit 21.

In this case, the morpheme analyzer 40 finds the words in the basic caption text supplied by the basic caption data generation unit 21, splits the basic caption text into words, and passes the result to the basic caption data generation unit 21.

The basic caption data generation unit 21 then assigns a word start time and a word end time to each split word by using the candidate-text time information stored in the storage unit 30, and stores them in the storage unit 30.

Then, the basic caption data generation unit 21 applies word spacing to the basic caption text that has been split into words, and generates the basic caption from the spaced basic caption text (S17). Additionally, in step S17, the basic caption data generation unit 21 may use the per-lecturer error dictionary database 51 to perform lecturer-specific error correction on the basic caption text to which spacing correction has been applied.

This lecturer-specific error correction using the per-lecturer error dictionary database 51 may instead be performed by a separate error correction unit built and connected to the caption generating apparatus.

In this example, the basic caption data generation unit 21 applies word spacing using the part-of-speech information of each word analyzed by the morpheme analyzer 40. For example, if the parts of speech of two adjacent words are a noun and a particle, or a stem and an ending, the two words are written together; in all other cases, adjacent words are separated by a space.
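
The spacing rule in the preceding paragraph can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the tag names (`noun`, `particle`, `stem`, `ending`) and the function name `apply_spacing` are hypothetical, the pronoun '나' is tagged as a noun, and '었다' is treated as a single ending, whereas a real morpheme analyzer would use a finer tag set.

```python
# Adjacent part-of-speech pairs that are written together without a space.
ATTACHED_PAIRS = {("noun", "particle"), ("stem", "ending")}


def apply_spacing(tagged_words):
    """Join (text, pos) pairs, inserting a space unless the adjacent pair is attached."""
    if not tagged_words:
        return ""
    pieces = [tagged_words[0][0]]
    for (_, prev_pos), (text, pos) in zip(tagged_words, tagged_words[1:]):
        if (prev_pos, pos) not in ATTACHED_PAIRS:
            pieces.append(" ")
        pieces.append(text)
    return "".join(pieces)


tagged = [
    ("나", "noun"), ("는", "particle"),
    ("경성", "noun"), ("에서", "particle"),
    ("고기", "noun"), ("를", "particle"),
    ("먹", "stem"), ("었다", "ending"),
]
spaced = apply_spacing(tagged)  # "나는 경성에서 고기를 먹었다"
```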

The basic caption generated through this process is a single sentence, with no sentence boundaries, to which word spacing based on part of speech has been applied.

Once the basic caption for the input audio data of the input audio file has been formed in this way, the basic caption data generation unit 21 generates basic caption data containing the generated basic caption, the word time code of each word (that is, its word start time and word end time), the part-of-speech information, and the lecturer information, stores it in the storage unit 30, and outputs it to the caption generation unit 22 (S18).

Next, the operation of the caption generation unit 22, which generates preliminary caption data from the basic caption data, will be described with reference to FIGS. 5A and 5B.

When basic caption data is input from the basic caption data generation unit 21, the caption generation unit 22 splits the basic caption into sentences using the silence time, the character count, and the part of speech of each word as analyzed by the morpheme analyzer 40.

To do so, the caption generation unit 22 first uses the word time codes of two adjacent words, namely the word end time of a first word and the word start time of the second word immediately following it, to calculate the time between the two adjacent words (S21), and compares the calculated time with the silence setting time to determine whether it is equal to or greater than the silence setting time (S22).

If there are portions where the calculated time is equal to or greater than the silence setting time, the caption generation unit 22 determines each such portion to be one sentence.

Accordingly, for every portion where the calculated time is equal to or greater than the silence setting time, the caption generation unit 22 inserts a line break immediately after the earlier of the two adjacent words, thereby separating one sentence (S23).
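
Steps S21 to S23 can be sketched in Python as below. The representation of a word as a `(text, start, end)` tuple, the function name `split_on_silence`, and the threshold value used in the example are assumptions made for illustration; the description does not give a concrete silence setting time.

```python
def split_on_silence(words, silence_setting_time):
    """Split (text, start, end) words into sentences wherever the gap between
    one word's end time and the next word's start time is at least the threshold."""
    sentences, current = [], []
    for index, word in enumerate(words):
        current.append(word)
        is_last = index == len(words) - 1
        # S21/S22: gap between this word's end and the next word's start.
        if is_last or words[index + 1][1] - word[2] >= silence_setting_time:
            # S23: break immediately after the earlier of the two words.
            sentences.append(current)
            current = []
    return sentences


# Illustrative timings in seconds; the 1.0 s threshold is a hypothetical value.
timed_words = [("a", 0.0, 0.5), ("b", 0.6, 1.0), ("c", 2.5, 3.0)]
silence_split = split_on_silence(timed_words, silence_setting_time=1.0)
```

Here the gap between "b" and "c" is 1.5 s, so the sketch yields two sentences: one containing "a" and "b", and one containing "c".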

Next, the caption generation unit 22 performs sentence splitting using the part of speech of the words in the basic caption (for example, sentence-final endings).

More specifically, the caption generation unit 22 determines whether each sentence split on the basis of the silence setting time contains a word whose part of speech is a sentence-final ending (S24). As already described, the part of speech of each word is the one determined by the morpheme analyzer 40.

If such a word exists, the caption generation unit 22 inserts a line break immediately after the word with the sentence-final ending, splitting the sentence again right after that word (S25).
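
A minimal Python sketch of steps S24 and S25 follows. The tag name `final_ending`, the catch-all tag `other`, and the function name `split_on_final_ending` are hypothetical; the tagged example sentence is invented for illustration.

```python
def split_on_final_ending(tagged_words, final_tag="final_ending"):
    """Split a list of (text, pos) words right after every word tagged as a
    sentence-final ending."""
    sentences, current = [], []
    for text, pos in tagged_words:
        current.append((text, pos))
        if pos == final_tag:
            sentences.append(current)
            current = []
    if current:
        # Trailing words with no sentence-final ending stay together.
        sentences.append(current)
    return sentences


tagged_example = [
    ("비가", "other"), ("왔다", "final_ending"),
    ("우산을", "other"), ("챙겼다", "final_ending"),
]
ending_split = split_on_final_ending(tagged_example)
```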

Next, the caption generation unit 22 counts the characters in each sentence split using sentence-final endings (S26) and compares the count with the set character count to determine whether the count is equal to or greater than the set character count (S27).

If the count is equal to or greater than the set character count, the caption generation unit 22 inserts a line break immediately after the word containing the character at which the count becomes equal to the set character count, splitting the sentence once more and completing the sentence splitting of the basic caption (S28).

In this example, a blank (that is, a single space) is also counted toward the character count, so the word at which a sentence is split may include a blank.
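
One possible reading of steps S26 to S28 is sketched below in Python: words are accumulated, and a break is placed after the word at which the running count reaches the set character count. The function name `split_on_length`, the use of space-separated word strings, and the small limit in the example are assumptions for illustration; the description leaves the exact handling of the boundary word open.

```python
def split_on_length(words, set_character_count=80):
    """Break a list of word strings after the word at which the running count
    (2 bytes per character, 1 byte per separating space) reaches the limit."""
    lines, current, used = [], [], 0
    for word in words:
        separator = 1 if current else 0  # the space before this word, if any
        used += separator + sum(1 if ch == " " else 2 for ch in word)
        current.append(word)
        if used >= set_character_count:
            lines.append(" ".join(current))
            current, used = [], 0
    if current:
        lines.append(" ".join(current))
    return lines


# A deliberately small limit of 10 bytes keeps the example short.
length_split = split_on_length(["가나"] * 5, set_character_count=10)
```

With the 10-byte limit, the third '가나' pushes the running count from 9 to 14, so the break falls after it and the remaining two words form the next line.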

In this way, the caption generation unit 22 first splits the basic caption input from the basic caption data generation unit 21 into sentences on the basis of the silence setting time, then splits each resulting sentence a second time on the basis of part of speech (that is, sentence-final endings), and finally splits each of those sentences again by character count. The order of these splitting steps may be changed, however: the sentences may first be split by part of speech, then by the silence setting time, and finally by character count.

Because sentence splitting is performed three times in this way, the accuracy of sentence splitting for the basic caption is improved.

Once sentence splitting of the basic caption is complete and its final sentences have been separated, the caption generation unit 22 determines the sentence start time and sentence end time of each sentence from the word start time of the first word and the word end time of the last word in each finally separated sentence (S29), and stores the determined sentence start time and sentence end time in the storage unit 30 as the sentence time code of each final sentence (S210).
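
The derivation of a sentence time code from word time codes (S29) can be sketched in Python as follows. The `Word` record and the function name `sentence_time_code` are hypothetical, and the timings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # word start time, in seconds
    end: float    # word end time, in seconds


def sentence_time_code(words):
    """Return (sentence start time, sentence end time) for a non-empty sentence:
    the start time of its first word and the end time of its last word."""
    if not words:
        raise ValueError("a sentence must contain at least one word")
    return words[0].start, words[-1].end


sentence = [Word("나", 0.0, 0.2), Word("는", 0.2, 0.4), Word("경성", 0.5, 1.0)]
time_code = sentence_time_code(sentence)  # (0.0, 1.0)
```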

Next, the caption generation unit 22 applies word spacing to each final sentence (S211).

In this example, the caption generation unit 22 performs spacing correction, writing words apart or together on the basis of the parts of speech of two adjacent words in each final sentence and predefined orthographic spacing rules.

For example, if a word with a sentence-final ending (e.g., '왔다') is immediately followed by a word with a connective ending (e.g., '가라'), the two words are written together (e.g., '왔다가라'); a numeral and a unit-type dependent noun are written together; and '이다' immediately following a noun unit is attached to it. In addition, if a noun follows a nominalizing transformative ending, a space is inserted between them.

Examples of writing apart and writing together by the caption generation unit 22 of this example are as follows.

1) 2017 년 3 월 16일 → 2017년 3월 16일 (March 16, 2017)

2) 먼저 식물체 화기 구조 입니다 → 먼저 식물체 화기 구조입니다 (First is the structure of the plant's floral organs)

3) 반복 됨 으로써 → 반복 됨으로써 (by being repeated)

4) 학습이 끝 나고 나면 → 학습이 끝나고 나면 (once the learning is finished)

5) 학생들이 가장 힘 들어 하는 → 학생들이 가장 힘들어 하는 (that students find most difficult)

6) 형질이 고정 돼있지 않습니다 → 형질이 고정돼 있지 않습니다 (the trait is not fixed)

7) 품종이 만들어 지면 → 품종이 만들어지면 (when a variety is created)

8) 다음 시간 에는 → 다음 시간에는 (in the next session)

After spacing correction has been applied to each final sentence in this way, the caption generation unit 22 uses the per-lecturer error dictionary database 51 to perform lecturer-specific error correction on each spacing-corrected final sentence (S212).

To this end, the subtitle generation unit 22 uses the lecturer information included in the basic subtitle data to select the error dictionary database 51 of the corresponding lecturer from the connected per-lecturer error dictionary databases 51, and compares the words in each final sentence with the words in the selected error dictionary database 51.

If the comparison shows that a non-standard word registered in the error dictionary database 51 appears in a final sentence, that word is replaced with the corresponding standard word stored in the error dictionary database 51, thereby performing the lecturer-specific error correction.
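
The dictionary lookup described above can be sketched as follows. The dictionary contents, the lecturer identifiers, and the function name `apply_error_dictionary` are hypothetical placeholders (English words are used for readability), and whitespace tokenization is an assumption; the specification states only that words in the final sentence are compared against the selected lecturer's dictionary and replaced with the stored standard word.

```python
# Hypothetical per-lecturer dictionaries: non-standard word -> standard word.
ERROR_DICTIONARIES = {
    "lecturer_kim": {"gonna": "going to", "kinda": "somewhat"},
    "lecturer_lee": {"ain't": "is not"},
}


def apply_error_dictionary(sentence, lecturer_id, dictionaries=ERROR_DICTIONARIES):
    """Replace each word found in the selected lecturer's error dictionary."""
    # Select the dictionary for this lecturer; unknown lecturers get no corrections.
    error_dict = dictionaries.get(lecturer_id, {})
    # Assumption: words are whitespace-delimited after spacing correction.
    return " ".join(error_dict.get(word, word) for word in sentence.split())


print(apply_error_dictionary("we are gonna start", "lecturer_kim"))
```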

The basic subtitle on which this lecturer-specific error correction has been performed becomes the preliminary subtitle. The subtitle generation unit 22 generates preliminary subtitle data that includes the preliminary subtitle, the time code for each sentence, and the lecturer information, and stores it in the storage unit 30 (S213).
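
One possible in-memory shape for the preliminary subtitle data is sketched below. The class and field names are hypothetical, and representing the time code as start and end times in seconds is an assumption; the specification lists only the three components (preliminary subtitle, per-sentence time code, lecturer information).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SentenceCaption:
    text: str     # preliminary subtitle text of one sentence
    start: float  # sentence start time in seconds (assumed unit)
    end: float    # sentence end time in seconds (assumed unit)


@dataclass
class PreliminarySubtitleData:
    lecturer: str  # lecturer information carried over from the basic subtitle data
    sentences: List[SentenceCaption] = field(default_factory=list)


def build_preliminary_data(lecturer: str,
                           sentences: List[Tuple[str, float, float]]) -> PreliminarySubtitleData:
    """Bundle corrected sentences and their time codes with the lecturer information."""
    return PreliminarySubtitleData(
        lecturer=lecturer,
        sentences=[SentenceCaption(text, start, end) for text, start, end in sentences],
    )


data = build_preliminary_data("lecturer_kim", [("first sentence", 0.0, 3.2)])
print(data)
```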

Once the preliminary subtitle data has been generated by this operation, the subtitle generation unit 22 outputs an initial subtitle editing screen through the output unit 60 and performs a final subtitle data generation operation, in which an editor such as an administrator or the lecturer corrects the preliminary subtitle (S30).

The final subtitle data generation operation of the subtitle generation unit 22 is described with reference to FIG. 6.

When the final subtitle data generation operation starts after the preliminary subtitle has been generated (S30), the subtitle generation unit 22 uses the screen-related image data for the output unit 60 stored in the storage unit 30 to output an initial subtitle editing screen, such as that shown in FIG. 7, to the output unit 60 so that the editor can edit the subtitle (S31).

After outputting the initial subtitle editing screen, the subtitle generation unit 22 uses the signal applied from the user input unit 10 to determine whether the subtitle play/edit button B31 has been operated and a subtitle playback and editing start signal has been generated (S32).

If it is determined that a subtitle playback and editing start signal has been input from the user input unit 10 by the editor's action (S32), the subtitle generation unit 22 uses the data stored in the storage unit 30 to output, through the output unit 60, a preliminary subtitle screen that presents the preliminary subtitle generated for the corresponding input audio file sentence by sentence (S33).

An example of the preliminary subtitle screen is shown in FIG. 8.

As shown in FIG. 8, the preliminary subtitle screen output through the output unit 60 includes a video output window W31, a subtitle list output window W32, a subtitle information output window W33, and a subtitle editing window W34.

The video output window W31 outputs the video associated with the preliminary subtitle, that is, the video associated with the input audio file. This video may be stored in the storage unit 30.

In the subtitle list output window W32, the preliminary subtitle is output sentence by sentence in chronological order, with a serial number assigned in the 'Number' column. The start time and end time of each sentence are shown in the 'Start Time' and 'End Time' columns, respectively, and the duration of the sentence is shown in the 'Length' column. In FIG. 8, the preliminary subtitle of each sentence is shown in the 'Subtitle Content' column.

Thus, in FIG. 8, '1' in the 'Number' column is the number assigned to the first sentence, and the information on the sentence of the preliminary subtitle numbered '1' is shown in the 'Start Time', 'End Time', 'Length', and 'Subtitle Content' columns, respectively.
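
The rows of the subtitle list can be derived directly from the per-sentence time codes, as the following sketch shows. The function names and the `HH:MM:SS` display format are assumptions; the specification names only the five columns.

```python
def format_time(seconds):
    """Format a time in seconds as HH:MM:SS (assumed display format)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_list_rows(sentences):
    """Build one row per sentence from (start, end, text) tuples in time order."""
    rows = []
    for number, (start, end, text) in enumerate(sentences, start=1):
        rows.append({
            "Number": number,                        # serial number in time order
            "Start Time": format_time(start),
            "End Time": format_time(end),
            "Length": format_time(end - start),      # duration of the sentence
            "Subtitle Content": text,
        })
    return rows


rows = build_list_rows([(0, 4, "first sentence"), (65, 70, "second sentence")])
for row in rows:
    print(row)
```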

The subtitle information output window W33 is a window that outputs information related to the preliminary subtitle.

The subtitle editing window W34 displays the preliminary subtitle of the entry selected from the list shown in the subtitle list output window W32, so that the editor can edit it.

When the preliminary subtitle is output sentence by sentence through the output unit 60 in this form, the editor uses the user input unit 10 to select the sentence to be corrected.

Accordingly, after outputting the preliminary subtitle screen to the output unit 60, the subtitle generation unit 22 reads the signal output from the user input unit 10 to determine whether the editor has selected a subtitle sentence for editing (S34).

If the signal input through the user input unit 10 indicates that the editor has made a selection (S34), the subtitle generation unit 22 determines from the value of the signal the number of the sentence selected by the editor, displays a predetermined indicator (e.g., √) at that number, and outlines the field for that number with a border of a predetermined color, thereby marking the sentence of the preliminary subtitle selected by the editor (S35).

In addition, as shown in FIG. 9, the subtitle generation unit 22 displays the selected sentence of the preliminary subtitle in the subtitle editing window W34 and switches its operating state to a subtitle editing mode for that sentence.

As shown in FIG. 9, the subtitle editing window W34 displays the subtitle start time and subtitle end time as well as the selected subtitle, together with 'MARK IN' and 'MARK OUT'. In video processing, 'MARK IN' denotes the start of a particular video segment and 'MARK OUT' denotes its end; they are used to synchronize the start and end information of the subtitle with the start and end information of the video.

The editor therefore uses the user input unit 10 to edit the preliminary subtitle shown in the subtitle editing window W34 so that it is correct and accurate and, once editing is complete, selects the apply button B33 to apply the edited subtitle. During editing, the subtitle generation unit 22 displays the data input from the user input unit 10, that is, the edited content, in the subtitle editing window W34.

When the editing of the subtitle is complete, the editor selects the apply button B33 using the user input unit 10. If instead the reset button B32 is selected through the user input unit 10, all edits entered through the user input unit 10 are discarded and the window is reset to its initial state, that is, to the corresponding sentence of the preliminary subtitle.

Accordingly, if the subtitle generation unit 22 determines that the apply button B33 has been selected through the user input unit 10 (S37), it takes the content currently in the subtitle editing window W34 as the final subtitle for the sentence and outputs it to the storage unit 30 (S38).

If, however, it is determined that the reset button B32 has been operated through the user input unit 10 (S39), the subtitle generation unit 22 reads the original preliminary subtitle stored in the storage unit 30 and displays it in the subtitle editing window W34 (S310).

Because the reset button B32 can be selected through the user input unit 10 even after the apply button B33 has been operated, the editor can edit the sentence of the preliminary subtitle without hesitation.

Thus, if the reset button B32 is selected after the apply button B33 has been operated, the subtitle generation unit 22 deletes the edited subtitle, which was stored in the storage unit 30 by the apply operation, from both the storage unit 30 and the subtitle editing window W34, and instead reads the unedited sentence of the preliminary subtitle from the storage unit 30 and outputs it again in the subtitle editing window W34.

In this way, the editor uses the apply button B33 and the reset button B32 to correct the preliminary subtitle of any desired sentence.
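
The apply and reset behavior described above amounts to a small state machine per sentence, sketched below. The class `SentenceEditor` and its attribute names are hypothetical; the sketch models only the text state (editing window content, stored final subtitle, stored preliminary subtitle) and not the screen or the storage unit itself.

```python
class SentenceEditor:
    """Tracks the editing state of one preliminary subtitle sentence."""

    def __init__(self, preliminary_text):
        self.preliminary = preliminary_text  # unedited sentence kept in storage
        self.window = preliminary_text       # content of the editing window
        self.final = None                    # final subtitle, set by apply()

    def edit(self, new_text):
        # Editor input is shown in the editing window but not yet stored.
        self.window = new_text

    def apply(self):
        # Apply button: the current window content becomes the final subtitle.
        self.final = self.window
        return self.final

    def reset(self):
        # Reset button: discard edits (even applied ones) and restore the
        # unedited preliminary sentence in the editing window.
        self.final = None
        self.window = self.preliminary
        return self.window


editor = SentenceEditor("the trait is not fixd")
editor.edit("the trait is not fixed")
editor.apply()
print(editor.final)
editor.reset()
print(editor.window, editor.final)
```

Keeping the preliminary text separate from the window content is what makes a reset possible after an apply.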

Through these operations, the editor edits each sentence of the preliminary subtitle to produce the final subtitle for that sentence. The final subtitle of each sentence therefore carries the exact content the editor intends, which improves the accuracy of the subtitles.

In addition, at a university (e.g., Korea National Open University), lecturers (e.g., professors) often use different vocabulary depending on the subject they specialize in or teach. With this in mind, acoustic and language models specialized for each lecturer are built, and the speech recognition process is performed using the lecturer-specific models, which improves recognition performance.
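
The use of lecturer-specific models can be illustrated with the following toy sketch, in which the lecturer's acoustic model proposes candidate texts and the lecturer's language model selects one of them. Everything here is a placeholder: the models are plain Python functions with made-up outputs and scores, and the names `recognize`, `ACOUSTIC_MODELS`, and `LANGUAGE_MODELS` are hypothetical. A real system would use trained models.

```python
# Toy stand-ins for trained models, keyed by a hypothetical lecturer ID.
# Acoustic model: audio -> list of candidate texts.
ACOUSTIC_MODELS = {
    "prof_botany": lambda audio: ["plant fire structure", "plant floral structure"],
}
# Language model: text -> score (higher means more plausible for this lecturer).
LANGUAGE_MODELS = {
    "prof_botany": lambda text: 1.0 if "floral" in text else 0.1,
}


def recognize(audio, lecturer_id,
              acoustic_models=ACOUSTIC_MODELS, language_models=LANGUAGE_MODELS):
    """Pick the candidate text that the lecturer's language model scores highest."""
    acoustic = acoustic_models[lecturer_id]  # select models by lecturer information
    language = language_models[lecturer_id]
    candidates = acoustic(audio)             # at least one candidate text
    return max(candidates, key=language)     # final text


print(recognize(b"raw-audio-bytes", "prof_botany"))
```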

Embodiments of the voice recognition subtitle generation system and method of the present invention have been described above. The present invention is not limited to the embodiments described above or to the accompanying drawings, and various modifications and variations are possible from the standpoint of those of ordinary skill in the art to which the invention pertains. Accordingly, the scope of the present invention should be determined by the claims of this specification and their equivalents.

1: voice recognition subtitle generation system 10: user input unit
20: subtitle generation device 30: storage unit
40: morpheme analysis unit 60: output unit
50: database unit 51: per-lecturer error dictionary database
52: morpheme dictionary database
70: deep learning training unit 80: model unit
81: per-lecturer acoustic model 82: per-lecturer language model

Claims

A voice recognition subtitle generation system comprising:
a basic subtitle data generation unit that determines a basic text for input audio data in an input audio file containing the input audio data, determines at least one candidate text for the determined basic text, determines one of the at least one candidate text as a final text to generate a basic subtitle text for the input audio data, performs morpheme analysis on the basic subtitle text and applies word-level spacing to the basic subtitle text to generate a basic subtitle, and generates basic subtitle data including the basic subtitle; and
a subtitle generation unit that is connected to the basic subtitle data generation unit, divides the basic subtitle into sentence units to obtain final sentences for the basic subtitle, and generates a preliminary subtitle by performing spacing correction on the final sentences using parts of speech and spacing orthography.

The voice recognition subtitle generation system of claim 1, further comprising a per-lecturer speech model and a per-lecturer language model, each connected to the basic subtitle data generation unit,
wherein the input audio file further includes lecturer information, and
wherein the basic subtitle data generation unit:
uses the lecturer information to determine, from the per-lecturer speech model and the per-lecturer language model, a speech model and a language model for the corresponding lecturer;
determines the at least one candidate text for the basic text corresponding to speech information of the input audio data using the speech model for the lecturer; and
selects one of the at least one candidate text using the language model for the lecturer to determine the final text for the basic text, thereby generating the basic subtitle text for the input audio data.

The voice recognition subtitle generation system of claim 2, further comprising a morpheme analysis unit that is connected to the basic subtitle data generation unit, performs morpheme analysis on the basic subtitle text, and separates the text into words.

The voice recognition subtitle generation system of claim 3,
wherein the morpheme analysis unit inputs part-of-speech information for each separated word to the basic subtitle data generation unit, and
wherein the basic subtitle data generation unit performs the spacing using the part of speech of each word and includes the part-of-speech information in the basic subtitle data.

The voice recognition subtitle generation system of claim 4, wherein the basic subtitle data includes a word start time and a word end time assigned to each word of the basic subtitle using time information on the candidate text.

The voice recognition subtitle generation system of claim 5, wherein the subtitle generation unit:
separates the basic subtitle into a sentence wherever the time between two adjacent words, determined from the word start time and the word end time of the two adjacent words, is equal to or greater than a silence setting time;
further separates each sentence obtained on the basis of the silence setting time into sentences using the parts of speech of its words; and
further separates each sentence obtained on the basis of the parts of speech into sentences using its number of characters, thereby obtaining the final sentences for the basic subtitle.

(Deleted)

The voice recognition subtitle generation system of claim 1, further comprising a per-lecturer error dictionary database connected to the subtitle generation unit,
wherein the subtitle generation unit:
selects, from the per-lecturer error dictionary database, the error dictionary database corresponding to the lecturer using lecturer information included in the basic subtitle data; and
compares words in the final sentences with words in the selected error dictionary database and changes a non-standard word present in a final sentence into a standard word to generate the preliminary subtitle.

The voice recognition subtitle generation system of claim 8, further comprising an output unit and a user input unit connected to the subtitle generation unit,
wherein the subtitle generation unit:
outputs the preliminary subtitle to the output unit;
when the preliminary subtitle is edited through the user input unit, outputs the edited subtitle input through the user input unit to the output unit; and
when an operation for applying the edited subtitle is performed through the user input unit, selects the edited subtitle as a final subtitle.

A voice recognition subtitle generation method comprising:
determining a basic text for input audio data included in an input audio file, and determining at least one candidate text for the determined basic text;
determining, by a basic subtitle data generation unit, one of the at least one candidate text as a final text to generate a basic subtitle text for the input audio data;
generating, by the basic subtitle data generation unit, a basic subtitle by applying word-level spacing to the basic subtitle text according to a result of morpheme analysis of the basic subtitle text;
dividing, by a subtitle generation unit, the basic subtitle input from the basic subtitle data generation unit into sentence units to obtain final sentences for the basic subtitle; and
generating, by the subtitle generation unit, a preliminary subtitle by performing spacing correction on the final sentences using parts of speech and spacing orthography.

The voice recognition subtitle generation method of claim 10, wherein determining the at least one candidate text comprises:
determining, by the basic subtitle data generation unit, a speech model and a language model for the corresponding lecturer from a per-lecturer speech model and a per-lecturer language model using lecturer information included in the input audio file; and
determining, by the basic subtitle data generation unit, the at least one candidate text for the basic text corresponding to speech information of the input audio data using the speech model for the lecturer.

The voice recognition subtitle generation method of claim 10, wherein, in generating the basic subtitle text, the basic subtitle data generation unit selects one of the at least one candidate text using a language model for the lecturer to determine the final text for the basic text.

The voice recognition subtitle generation method of claim 10, wherein generating the basic subtitle comprises applying, by the basic subtitle data generation unit, word-level spacing using the part of speech of each word obtained from the morpheme analysis.

The voice recognition subtitle generation method of claim 10, wherein generating the basic subtitle comprises generating, by the basic subtitle data generation unit, a word time code that assigns a word start time and a word end time to each word of the basic subtitle using time information on the candidate text.

The voice recognition subtitle generation method of claim 14, wherein dividing into sentence units further comprises, by the subtitle generation unit:
separating the basic subtitle into a sentence wherever the time between two adjacent words, determined from the word start time and the word end time of the two adjacent words, is equal to or greater than a silence setting time;
further separating each sentence obtained on the basis of the silence setting time into sentences using the parts of speech of its words; and
further separating each sentence obtained on the basis of the parts of speech into sentences using its number of characters, thereby obtaining the final sentences for the basic subtitle.

(Deleted)

The voice recognition subtitle generation method of claim 10, wherein generating the preliminary subtitle comprises, by the subtitle generation unit:
selecting, from a per-lecturer error dictionary database, the error dictionary database corresponding to the lecturer using lecturer information included in the basic subtitle data; and
generating the preliminary subtitle data by comparing words in the final sentences with words in the selected error dictionary database and changing a non-standard word present in a final sentence into a standard word.

The voice recognition subtitle generation method of claim 10, further comprising:
outputting, by the subtitle generation unit, the preliminary subtitle to an output unit;
determining, by the subtitle generation unit, whether the preliminary subtitle has been edited through a user input unit;
outputting, by the subtitle generation unit, the edited subtitle input through the user input unit to the output unit when it is determined that the preliminary subtitle has been edited; and
selecting, by the subtitle generation unit, the edited subtitle as a final subtitle when it is determined that an operation for applying the edited subtitle has been performed through the user input unit.