KR102385779B1

KR102385779B1 - Electronic apparatus and method for caption synchronization of contents

Info

Publication number: KR102385779B1
Application number: KR1020200103005A
Authority: KR
Inventors: 오주현
Original assignee: 한국방송공사
Priority date: 2020-08-18
Filing date: 2020-08-18
Publication date: 2022-04-13
Also published as: KR20220022159A

Abstract

An electronic device is disclosed. The electronic device includes a communication device that receives video content and caption data for the video content; a memory; and a processor that performs speech recognition on the video content to extract speech recognition data and time information, compares the extracted speech recognition data with the received caption data to correct the time information in the caption data to the extracted time information, and stores the caption data with the corrected time information in the memory.

Description

ELECTRONIC APPARATUS AND METHOD FOR CAPTION SYNCHRONIZATION OF CONTENTS

The present disclosure relates to an electronic device and method for synchronizing captions with content, and more particularly to an electronic device and method that perform caption synchronization automatically.

Under the relevant regulations, domestic terrestrial broadcasters provide closed-caption service for hearing-impaired viewers throughout the entire broadcast time. Closed captions are currently produced by a stenographer who types the captions in real time; the captions are then sent back to the broadcaster and inserted into the terrestrial broadcast.

Recently, broadcasters have been reprocessing broadcast content after it airs and using it for online services and the like. During this process, pre-production, advertisement insertion, and similar steps introduce a time offset, so the start time of the terrestrial broadcast and the start time of the processed content may differ. However, the captions produced for the terrestrial broadcast (hereinafter, closed captions) are stored in the system as they are, without separate editing, so when they are used in an online service they are out of sync with the video.

Accordingly, an object of the present disclosure is to provide an electronic device and method that automatically perform caption synchronization for content.

To achieve the above object, an electronic device according to the present disclosure includes a communication device that receives video content and caption data for the video content; a memory; and a processor that performs speech recognition on the video content to extract speech recognition data and time information, compares the extracted speech recognition data with the received caption data to correct the time information in the caption data to the extracted time information, and stores the caption data with the corrected time information in the memory.

In this case, the processor may extract only the character string from the extracted speech recognition data, extract only the character string from the received caption data, align the character string of the speech recognition data with the character string of the caption data, and, based on the alignment result, correct the time information in the caption data to the extracted time information.

In this case, the processor may calculate the similarity between the character string of the speech recognition data and the character string of the caption data syllable by syllable, and may align the two character strings using, as reference points, the syllables whose calculated similarity is a preset similarity.

In this case, the processor may calculate the similarity according to whether the initial consonant, medial vowel, and final consonant within a syllable are identical.

In this case, the processor may calculate a first value when the initial consonant, medial vowel, and final consonant within a syllable are all identical, a second value smaller than the first value when one of the three differs, and a third value smaller than the second value when only one of the three is identical, and may align the character string of the speech recognition data with the character string of the caption data using the syllables having the first value or the second value as reference points.
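The three-level syllable similarity described above can be illustrated with a minimal Python sketch. It relies on the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). The function names and the numeric scores 1.0, 0.6, and 0.3 are assumptions chosen for illustration; the disclosure only requires that the first value exceed the second and the second exceed the third. Returning 0.0 when no component matches, and comparing non-Hangul characters by plain equality, are likewise assumptions.

```python
HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3  # last precomposed Hangul syllable

# Assumed example scores; only the ordering FIRST > SECOND > THIRD comes from the text.
FIRST_VALUE = 1.0
SECOND_VALUE = 0.6
THIRD_VALUE = 0.3


def decompose(syllable):
    """Return (initial, medial, final) indices of a Hangul syllable, or None."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return None
    index = code - HANGUL_BASE
    # 588 = 21 medials * 28 finals; final index 0 means "no final consonant".
    return index // 588, (index % 588) // 28, index % 28


def syllable_similarity(a, b):
    """Score two characters by how many of their three components are identical."""
    parts_a, parts_b = decompose(a), decompose(b)
    if parts_a is None or parts_b is None:
        # Assumption: non-Hangul characters are compared by plain equality.
        return FIRST_VALUE if a == b else 0.0
    matches = sum(x == y for x, y in zip(parts_a, parts_b))
    return {3: FIRST_VALUE, 2: SECOND_VALUE, 1: THIRD_VALUE, 0: 0.0}[matches]
```

In this sketch an absent final consonant is treated as a component in its own right, so two open syllables count as having the same final.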

Meanwhile, the processor may align the character string of the speech recognition data with the character string of the caption data using dynamic programming and the calculated similarity.
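As a rough illustration of alignment by dynamic programming, the following Python sketch fills a table that maximizes the total similarity of matched characters and then traces back to recover the matched index pairs. This is one possible formulation (a weighted longest-common-subsequence style alignment with no gap penalty), not necessarily the one used in the disclosure. The function name `align`, the `min_score` threshold, and the default exact-match similarity are assumptions; a component-based syllable similarity could be passed in through the `similarity` parameter instead.

```python
def align(asr, caption, similarity=None, min_score=0.5):
    """Return (caption_index, asr_index) pairs for characters matched by the alignment."""
    if similarity is None:
        # Assumed default: exact character match only.
        similarity = lambda a, b: 1.0 if a == b else 0.0
    n, m = len(caption), len(asr)
    # best[i][j] = best total similarity when aligning caption[:i] with asr[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = similarity(caption[i - 1], asr[j - 1])
            matched = best[i - 1][j - 1] + (score if score >= min_score else 0.0)
            best[i][j] = max(matched, best[i - 1][j], best[i][j - 1])
    # Trace back from the bottom-right corner to collect the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        score = similarity(caption[i - 1], asr[j - 1])
        if score >= min_score and best[i][j] == best[i - 1][j - 1] + score:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1  # caption character left unmatched
        else:
            j -= 1  # recognized character left unmatched
    pairs.reverse()
    return pairs
```

Only pairs whose similarity reaches `min_score` are kept as anchors, which mirrors the idea of aligning on syllables that have a preset similarity.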

Meanwhile, based on the alignment result, the processor may correct the time information for the first syllable of a line of the caption data to the time information of the corresponding character string in the speech recognition data.
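The per-line time correction can be sketched as follows. The sketch assumes that each caption line is a `(start_ms, text)` tuple whose text has already been reduced to bare syllables, that the alignment result is a list of `(caption_index, asr_index)` pairs over the concatenated caption text, and that a timestamp in milliseconds is available for every recognized character. All of these shapes, and the name `retime_lines`, are hypothetical. Keeping the original time when the first syllable of a line has no aligned counterpart is also an assumption, since the text does not say what happens in that case.

```python
def retime_lines(lines, pairs, asr_times_ms):
    """Replace each line's start time with the time of the recognized character
    aligned to the line's first syllable.

    lines        -- list of (start_ms, text) caption lines
    pairs        -- (caption_index, asr_index) pairs over the concatenated caption text
    asr_times_ms -- timestamp in ms for each recognized character
    """
    caption_to_asr = dict(pairs)
    corrected = []
    offset = 0  # index of the current line's first syllable in the concatenated text
    for start_ms, text in lines:
        asr_index = caption_to_asr.get(offset)
        if asr_index is not None:
            start_ms = asr_times_ms[asr_index]
        # Assumption: a line whose first syllable is unaligned keeps its original time.
        corrected.append((start_ms, text))
        offset += len(text)
    return corrected
```

For example, with two lines `"ab"` and `"cd"` stamped at 5000 ms and 9000 ms, and recognized characters at 1200, 1400, 1600, 2100, and 2300 ms, the pairs `[(0, 0), (1, 1), (2, 3), (3, 4)]` move the lines to 1200 ms and 2100 ms.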

Meanwhile, the processor may extract, from the received caption data, a character string that excludes auxiliary text information describing the situation in the video.

Meanwhile, the caption data for the video content may be explanatory captions transmitted in electronic-code form in correspondence with the video content.

Meanwhile, the processor may control the communication device so that the corrected caption data is transmitted to an external device.

Meanwhile, a caption synchronization method of an electronic device according to an embodiment of the present disclosure includes: receiving video content and caption data for the video content; performing speech recognition on the video content to extract speech recognition data and time information; comparing the extracted speech recognition data with the received caption data to correct the time information in the caption data to the extracted time information; and storing the caption data with the corrected time information.

In this case, the correcting may include: extracting only the character string from the extracted speech recognition data; extracting only the character string from the received caption data; aligning the character string of the speech recognition data with the character string of the caption data; and correcting, based on the alignment result, the time information in the caption data to the extracted time information.

In this case, the aligning may include: calculating the similarity between the character string of the speech recognition data and the character string of the caption data syllable by syllable; and aligning the two character strings using, as reference points, the syllables whose calculated similarity is a preset similarity.

In this case, the calculating of the similarity may calculate the similarity according to whether the initial consonant, medial vowel, and final consonant within a syllable are identical.

In this case, the calculating of the similarity may calculate a first value when the initial consonant, medial vowel, and final consonant within a syllable are all identical, a second value smaller than the first value when one of the three differs, and a third value smaller than the second value when only one of the three is identical; and the aligning based on the syllables having the preset similarity may align the character string of the speech recognition data with the character string of the caption data using the syllables having the first value or the second value as reference points.

Meanwhile, the aligning based on the syllables having the preset similarity may align the character string of the speech recognition data with the character string of the caption data using dynamic programming and the calculated similarity.

Meanwhile, the correcting may, based on the alignment result, correct the time information for the first syllable of a line of the caption data to the time information of the corresponding character string in the speech recognition data.

Meanwhile, the extracting of only the character string from the received caption data may extract a character string that excludes auxiliary text information describing the situation in the video.

Meanwhile, the caption synchronization method may further include transmitting the corrected caption data to an external device.

Meanwhile, in a computer-readable recording medium including a program for executing a caption synchronization method according to an embodiment of the present disclosure, the caption synchronization method includes: receiving video content and caption data for the video content; performing speech recognition on the video content to extract speech recognition data and time information; comparing the extracted speech recognition data with the received caption data to correct the time information in the caption data to the extracted time information; and storing the caption data with the corrected time information.

As described above, according to various embodiments of the present disclosure, caption data can be synchronized automatically using the speech recognition result for the video content. In particular, because the speech recognition result and the caption data are aligned using dynamic programming, the synchronization can be performed with high accuracy.

FIG. 1 is a diagram showing the configuration of a content generation system according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a detailed configuration of an electronic device according to an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating an example of captions generated using speech recognition;
FIG. 4 is a diagram illustrating an example of closed captions;
FIG. 5 is a diagram for explaining dynamic programming according to an embodiment of the present disclosure;
FIG. 6 is a diagram for explaining a similarity calculation operation according to an embodiment of the present disclosure;
FIG. 7 is a diagram comparing the alignment results obtained when the similarity is calculated from whether whole syllables are identical and when it is calculated by dividing each syllable into three phonemes;
FIG. 8 is a diagram illustrating an example of caption data corrected according to the present disclosure; and
FIG. 9 is a flowchart for explaining a caption synchronization method according to an embodiment of the present disclosure.

The terms used in this specification are briefly described first, and the present disclosure is then described in detail.

The terms used in the embodiments of the present disclosure are, as far as possible, general terms in wide current use, selected in consideration of their function in the present disclosure; they may nevertheless vary with the intention of those skilled in the art, with precedent, or with the emergence of new technology. In certain cases a term has been chosen arbitrarily by the applicant, and its meaning is then set out in detail in the corresponding part of the description. The terms used in the present disclosure should therefore be defined on the basis of their meaning and the overall content of the present disclosure, not merely by their names.

The embodiments of the present disclosure may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to the specific embodiments, and the disclosure should be understood to include all modifications, equivalents, and substitutes falling within the disclosed spirit and technical scope. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it could obscure the subject matter.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present disclosure, a 'module' or 'unit' performs at least one function or operation and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of 'modules' or 'units' may be integrated into at least one module and implemented by at least one processor, except for a 'module' or 'unit' that needs to be implemented in specific hardware.

Embodiments of the present disclosure are described below in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in several different forms and is not limited to the embodiments described here. To describe the present disclosure clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Hereinafter, the present disclosure is described in more detail with reference to the drawings.

FIG. 1 is a diagram showing the configuration of a content generation system according to an embodiment of the present disclosure.

Referring to FIG. 1, a content generation system 1000 according to an embodiment of the present disclosure may include a broadcast transmission device 10, a caption generation device 20, a server 30, and an electronic device 100.

The broadcast transmission device 10 may transmit a terrestrial broadcast using video content. In doing so, the broadcast transmission device 10 may provide a closed-caption broadcast service using the caption data that the caption generation device 20 generates for the video content. Here, a closed-caption broadcast service is a service that displays text corresponding to the audio signal as captions on the TV screen: for hearing-impaired viewers, the audible message of a TV program is converted into electronic-code form and transmitted so that it appears on the TV screen as explanatory captions.

The broadcast transmission device 10 may also transmit video content whose terrestrial broadcast has been completed to the server 30. The broadcast transmission device 10 may transmit the video content to the server 30 as it is, or may first insert advertisements, split the content into a plurality of parts, or perform similar operations, and thus transmit edited content to the server 30.

The caption generation device 20 may generate caption data for the video content transmitted by the broadcast transmission device 10 and may transmit the generated caption data to the server 30 and/or the broadcast transmission device 10. Here, the caption data is data generated from a stenographer's keyboard input and may be explanatory captions transmitted in electronic-code form in correspondence with the video content. Because such caption data is produced by a professional stenographer, it is highly accurate; but because stenographic input takes time, it has a time offset relative to the video content. In addition to human speech, the caption data may generally include auxiliary text information that describes the situation in the video, such as a vehicle horn or a mobile phone ringing. Caption data generated by a stenographer in this way may be referred to as closed captions.

The server 30 may store the video content transmitted by the broadcast transmission device 10 and may store the caption data for that video content received from the caption generation device 20. The server 30 may be an archive server.

The electronic device 100 may receive caption data from the server 30 and correct the synchronization of the received caption data. The electronic device 100 may then transmit the corrected caption data to the server 30.

In doing so, the electronic device 100 may receive the video content corresponding to the caption data from the server 30 and perform speech recognition (speech to text) on the received video content to generate speech recognition data that contains text for the speech in the video content together with time information (or time synchronization information) for each piece of text. Such speech recognition data may be referred to as speech recognition captions, a speech recognition result, and so on.

Once the speech recognition data has been generated, the electronic device 100 may compare the caption data with the extracted speech recognition data and correct the time information in the caption data to the time information in the speech recognition data. The detailed configuration and operation of the electronic device 100 are described later with reference to FIG. 2.

As described above, the content generation system 1000 according to this embodiment corrects the time information in the caption data using speech recognition data generated by speech recognition on the video content, so the caption data can be synchronized to the video automatically, without human intervention.

Although FIG. 1 is illustrated and described with the devices connected directly to one another, in an implementation the components may be connected via a separate external component. Likewise, although the electronic device 100 has been described as connected to a single server 30, it may be connected to a plurality of servers and may receive video content and caption data separately from each of them.

Also, although FIG. 1 is illustrated and described with the server 30 and the electronic device 100 as separate devices, in an implementation the two may be implemented as one. That is, instead of transmitting the video content to the electronic device 100 for caption synchronization, the server 30 that stores the video content may itself perform the synchronization of the caption data.

FIG. 2 is a diagram illustrating a detailed configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may include a communication device 110, a memory 120, a display 130, an operation input device 140, and a processor 150. Here, the electronic device 100 may be a PC, notebook PC, smartphone, server, or the like capable of image processing.

The communication device 110 is connected to the caption generation device 20 and/or the server 30 and may transmit and receive video content and caption data. Specifically, the communication device 110 is provided to connect the electronic device 100 to an external device, and may connect to a mobile device through a local area network (LAN) and the Internet, or through a USB (Universal Serial Bus) port.

In addition to a wired connection, the communication device 110 may be connected to other electronic devices via a router or access point connected to the public Internet, and may be connected to that router or access point either by wire or wirelessly, for example by Wi-Fi, Bluetooth, or cellular communication.

The communication device 110 may also transmit the caption data whose time information has been corrected by the process described later (that is, the synchronized caption data) to an external device (for example, the server 30).

The memory 120 is a component for storing the operating system for driving the electronic device 100 and the software, data, and the like for performing speech recognition and caption synchronization. The memory 120 may be implemented in various forms, such as RAM, ROM, flash memory, an HDD, external memory, or a memory card, and is not limited to any one of them.

The memory 120 stores video content. Here, the video content includes audio as well as video and may be a video file such as MP4, AVI, or MOV. These file formats are merely examples, and the video content is not limited to them.

The memory 120 also stores caption data corresponding to the video content. Here, the caption data may include the character string information to be displayed by the caption service and the time information at which that character string information is to be displayed. The time information may be expressed in milliseconds.
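As a small illustration of caption data that pairs display text with millisecond time information, the following Python sketch defines a record type and a helper that renders a millisecond value as a readable timestamp. The names `CaptionLine` and `format_ms`, and the `HH:MM:SS.mmm` rendering, are hypothetical and do not reflect any actual storage format described in the disclosure.

```python
from dataclasses import dataclass


@dataclass
class CaptionLine:
    """One caption entry: when it should appear (in ms) and what it shows."""
    start_ms: int
    text: str


def format_ms(ms):
    """Render a millisecond offset as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


line = CaptionLine(start_ms=3_723_004, text="여보세요")
stamp = format_ms(line.start_ms)
```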

The memory 120 may also store the character string information, time information, similarity information, and the like generated during the process described later, and may store the caption data with the corrected time information.

The display 130 displays a user interface window for selecting a function supported by the electronic device 100. Specifically, the display 130 may display a user interface window for selecting among the various functions provided by the electronic device 100. The display 130 may be a monitor such as an LCD, CRT, or OLED, and may also be implemented as a touch screen that simultaneously performs the function of the operation input device 140 described later.

The operation input device 140 may receive, from the user, a selection of a function of the electronic device 100 and a control command for that function. Specifically, the operation input device 140 may receive a command indicating whether to correct the synchronization of the received caption data. The operation input device 140 may also receive a selection of the caption data (or video content) on which caption synchronization is to be performed.

The processor 150 controls each component in the electronic device 100. Specifically, when a boot command is input by the user, the processor 150 may boot using the operating system stored in the memory 120.

When video content or caption data for caption synchronization is selected, the processor 150 may control the communication device 110 to receive the caption data corresponding to the selected video content, or the video content corresponding to the selected caption data.

When the video content and the corresponding caption data are received, the processor 150 may perform speech recognition on the video content to extract speech recognition data and time information. For example, the processor 150 may perform speech recognition using any one of the widely known speech recognition libraries or services (for example, Kaldi, CMU Sphinx, HTK, or AWS Transcribe).

The processor 150 may then compare the extracted speech recognition data with the caption data and correct the time information in the caption data to the extracted time information. Specifically, the processor 150 may first extract only the character string from the extracted speech recognition data. As described above, the speech recognition data may include a character string and the time information of that character string; for the process described below, the processor 150 extracts only the character string from it.
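As a minimal sketch of this extraction step, the following assumes that the recognizer returns a list of (word, start time in seconds) pairs. That shape, the function name `flatten_words`, and the rule that every syllable inherits its word's start time are illustrative assumptions, not something the disclosure specifies. The second sample word and its time are invented for the example.

```python
from typing import List, Tuple

# Hypothetical recognizer output: (recognized word, start time in seconds).
Word = Tuple[str, float]


def flatten_words(words: List[Word]) -> Tuple[str, List[float]]:
    """Return the bare character string plus one start time per character.

    Assumption: each character (syllable) inherits the start time of the
    word it belongs to, so the string can be aligned on its own while the
    time list is kept in parallel for the later correction step.
    """
    chars: List[str] = []
    times: List[float] = []
    for text, start in words:
        for ch in text:
            if ch.isspace():
                continue  # spaces carry no syllable and are dropped
            chars.append(ch)
            times.append(start)
    return "".join(chars), times


text, times = flatten_words([("동상이", 18.72), ("왔어요", 19.5)])
print(text, times)
```

Keeping the times in a parallel list means the alignment only has to deal with plain strings, and the index of any aligned syllable can be used to look its time up again.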

The processor 150 may likewise extract only the character string from the received caption data. For example, caption data may contain symbols such as '?' and '!', whereas a character string produced by speech recognition does not. The processor 150 may therefore extract from the received caption data only the character string with such symbols removed. Caption data may also contain text that describes the situation (text auxiliary information), for example "a phone rings" or "a car horn sounds". Because such text auxiliary information is not produced by speech recognition, the processor 150 may extract the character string from the received caption data with the text auxiliary information excluded. Text auxiliary information is often written inside parentheses "(" and ")", so the processor 150 may extract the character string from the received caption data with the text inside parentheses excluded.
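The cleaning described above can be sketched as follows. The function name `clean_caption_text` is hypothetical, and dropping whitespace along with the symbols is an assumption made here so that the result is a pure syllable sequence; the disclosure itself only mentions symbols and parenthesized auxiliary text.

```python
import re

# Matches one non-nested parenthesized span, e.g. "(전화 벨이 울린다)".
_AUX_TEXT = re.compile(r"\([^)]*\)")


def clean_caption_text(line: str) -> str:
    """Strip parenthesized auxiliary text, then keep only letters and digits."""
    without_aux = _AUX_TEXT.sub("", line)
    # str.isalnum() is True for Hangul syllables and False for '?', '!', '~'
    # and whitespace, so punctuation and spaces are dropped in one pass.
    return "".join(ch for ch in without_aux if ch.isalnum())


print(clean_caption_text("(전화 벨이 울린다) 본부장님~ 어디 가세요?"))
```

Nested parentheses or other bracket styles would need a more careful rule; this sketch handles only the simple case the text describes.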

The processor 150 may then align the character string of the speech recognition data with the character string of the caption data. Specifically, the processor 150 may align the two character strings using dynamic programming. Dynamic programming is widely used in bioinformatics to align two long sequences, for example when analyzing genome base sequences. There are many dynamic programming algorithms; the present disclosure uses the Needleman-Wunsch algorithm, but is not limited thereto. The method using the Needleman-Wunsch algorithm is described in detail with reference to FIG. 5.

The processor 150 may then correct the time information in the caption data to the extracted time information based on the alignment result. Specifically, based on the alignment result, the processor 150 may correct the time information of the starting point (that is, the first syllable) of a line of the caption data to the time information of the character string of the speech recognition data that corresponds to that line's character string. For example, as shown in FIG. 4, the time information 28025 (that is, 28.025 seconds) for the character string '본부장님~' may be corrected to the time information corresponding to '본부장님' in the speech recognition result (for example, 40.003 seconds). This correction may be performed line by line within the caption data.
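The line-by-line correction can be sketched as below, assuming the alignment is available as a list that maps each caption syllable index to a speech recognition syllable index, or to `None` where the caption syllable was aligned to a gap. The function name `retime_lines` and both fallbacks are assumptions of this sketch: if the first syllable has no partner, the next aligned syllable in the line is used, and if no syllable in the line is aligned, the old start time is kept. The disclosure only states that the first syllable's time is corrected.

```python
from typing import List, Optional


def retime_lines(
    line_lengths: List[int],        # syllables per caption line, in order
    mapping: List[Optional[int]],   # caption syllable index -> ASR syllable index
    asr_times: List[float],         # start time (s) of each ASR syllable
    old_starts: List[float],        # original start time (s) of each line
) -> List[float]:
    """Return a corrected start time for every caption line."""
    new_starts: List[float] = []
    pos = 0
    for length, old in zip(line_lengths, old_starts):
        start = old  # fallback: nothing in this line could be aligned
        for i in range(pos, pos + length):
            j = mapping[i]
            if j is not None:
                start = asr_times[j]  # time of the first aligned syllable
                break
        new_starts.append(start)
        pos += length
    return new_starts


# Two caption lines of 4 and 2 syllables; the fifth caption syllable is a gap.
print(retime_lines(
    [4, 2],
    [0, 1, 2, 3, None, 5],
    [40.003, 40.2, 40.4, 40.6, 41.0, 41.5],
    [28.025, 30.0],
))
```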

The processor 150 may then store the caption data whose time information has been corrected in the memory 120. When the caption data is stored in an external device (for example, the server 30) and provided from there to other devices, the processor 150 may control the communication device 110 to transmit the corrected caption data to the external device.

Although the electronic device 100 has been described above as correcting (or updating) the time information of the caption data, the same operation may also be described as generating new caption data by combining the character string information of the caption data with the time information of the speech recognition data.

As described above, the electronic device 100 according to the present embodiment corrects the time information in the caption data using speech recognition data generated by speech recognition on the video content, so caption data can be synchronized to video data more easily and quickly.

Although FIGS. 1 and 2 have been illustrated and described as performing the above operation only on content produced by a broadcasting station, the operation may also be applied to video content produced elsewhere, provided that separate caption data entered by a stenographer exists for it.

Although FIG. 2 has been illustrated and described with the electronic device 100 including a display and a manipulation input device, the display and the manipulation input device may be omitted when the electronic device 100 is implemented as a device such as a server.

FIG. 3 is an example of a caption generated by applying speech recognition to video data, and FIG. 4 is a diagram illustrating an example of a closed caption for the video data of FIG. 3.

Referring to FIG. 3, speech recognition for video content has advanced considerably. Because recognition is performed on the video content itself, the time synchronization information of the resulting text is very accurate. However, as illustrated, parts of the recognized text differ from what the speaker actually said. In particular, a comparison with the closed caption of FIG. 4, which a stenographer entered directly, confirms that the stenographer's caption is superior in accuracy.

Referring to FIG. 4, caption data includes a text string for each caption (that is, text information) and time information (or time synchronization information) indicating when that text string is to be displayed in the video. As illustrated, the time information is expressed in the form <SYNC Start=...>. In the illustrated example, the word '본부장님' at the beginning starts at 28.025 seconds. Referring to FIG. 3, however, the corresponding portion, which was misrecognized as '동상이', actually starts at 18.72 seconds in the video. In other words, the closed caption is offset from the actual video by roughly 10 seconds.
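The following sketch shows how the <SYNC Start=...> values might be read and how the offset in this example works out. It assumes that Start is an integer number of milliseconds, as the 28025 / 28.025 seconds example implies. The sample markup, including its second SYNC entry, is an invented fragment rather than the content of FIG. 4, and the function names are hypothetical.

```python
import re
from typing import List

# Start values are assumed to be integer milliseconds (28025 -> 28.025 s).
_SYNC = re.compile(r"<SYNC\s+Start\s*=\s*(\d+)\s*>", re.IGNORECASE)

# Hypothetical caption fragment; only the first Start value is from the text.
SAMPLE = (
    "<SYNC Start=28025><P Class=KRCC>\n본부장님~\n"
    "<SYNC Start=30110><P Class=KRCC>\n&nbsp;\n"
)


def sync_starts_seconds(caption_markup: str) -> List[float]:
    """Return every SYNC start time in seconds, in document order."""
    return [int(ms) / 1000.0 for ms in _SYNC.findall(caption_markup)]


def caption_offset(caption_start: float, speech_start: float) -> float:
    """Offset of the caption relative to the recognized speech, in seconds."""
    return round(caption_start - speech_start, 3)


starts = sync_starts_seconds(SAMPLE)
print(starts[0], caption_offset(starts[0], 18.72))
```

With the two values from the text, 28.025 s and 18.72 s, the difference is 9.305 s, which is the "roughly 10 seconds" offset mentioned above.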

Closed captions entered by a stenographer are thus highly accurate in content, but offsets introduced while the captions are stored, and during broadcast production and archiving, cause the time synchronization information in the closed captions to differ from the actual video. Conversely, speech-recognition captions have relatively accurate time synchronization information, but their text contains many errors. The advantages and disadvantages of speech-recognition captions and closed captions are summarized in Table 1 below.

Table 1
                              Text information    Time synchronization information
Closed captions               Accurate            Inaccurate
Speech-recognition captions   Inaccurate          Accurate

Accordingly, the present disclosure uses the text information of the closed captions but synchronizes the time synchronization information to the time information of the speech-recognition captions, and can thereby generate caption data in which both the text information and the time synchronization information are highly accurate.

To compare the text of the closed captions with the text from speech recognition, the present disclosure uses dynamic programming. Closed captions contain a large amount of text, and the same phrase may appear repeatedly. If alignment were based simply on comparing the positions of particular words, the two texts could therefore be misaligned. For this reason, the present disclosure aligns the two texts using dynamic programming, which is widely used in base sequence analysis. The specific operation is described below with reference to FIG. 5.

FIG. 5 is a diagram for explaining dynamic programming according to an embodiment of the present disclosure.

Dynamic programming is a method in which, when a problem consists of subproblems that repeat over several stages, the answer to the whole problem is obtained from the answers to the subproblems at each stage. It works upward from the smallest subproblem to the higher-level problems.

Referring to FIG. 5, the top of the table lists the bases of a first base sequence, and the left side of the table lists the bases of a second base sequence. In a comparison using dynamic programming, the similarity of the two bases is compared starting from the bottom right of the table and finally ending at the top left. This process yields a global alignment of the two base sequences.

To apply this dynamic programming to character strings, the smallest subproblem is taken to be the syllable; that is, the similarity is calculated syllable by syllable. The specific similarity calculation is described later with reference to FIG. 6.

Using the calculated similarities, the path whose cumulative similarity value is maximal is found, as in FIG. 5, which finally aligns the two character strings.
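A minimal sketch of such a global alignment is given below. It follows the common textbook formulation of Needleman-Wunsch, which fills the score table from the top left and traces the best path back from the bottom right. The function name, the returned index mapping, and the tie-breaking order in the traceback are choices made for this sketch, and the similarity function is passed in so that any of the scoring schemes can be plugged in.

```python
from typing import Callable, List, Optional


def needleman_wunsch(
    caption: str,
    recognized: str,
    sim: Callable[[str, str], int],
    gap: int = -1,
) -> List[Optional[int]]:
    """Globally align two strings.

    Returns, for each caption character, the index of the recognized
    character it is aligned with, or None if it is aligned to a gap.
    """
    n, m = len(caption), len(recognized)
    # score[i][j] = best cumulative similarity of caption[:i] vs recognized[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(caption[i - 1], recognized[j - 1]),
                score[i - 1][j] + gap,   # caption character aligned to a gap
                score[i][j - 1] + gap,   # recognized character aligned to a gap
            )
    # Trace the maximal path back from the bottom-right corner.
    mapping: List[Optional[int]] = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim(caption[i - 1], recognized[j - 1]):
            mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return mapping


exact = lambda a, b: 1 if a == b else -1
print(needleman_wunsch("본부장님어디", "동부장님이어디", exact))
```

In the example, the misrecognized first syllable is still paired with its caption counterpart, and the extra recognized syllable '이' is absorbed as a gap, so the later syllables stay correctly paired. The table needs memory proportional to the product of the two string lengths, which a long programme may make impractical without banding or chunking; the disclosure does not discuss that aspect.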

FIG. 6 is a diagram for explaining a similarity calculation operation according to an embodiment of the present disclosure.

In the Needleman-Wunsch algorithm, the scores used to set up the similarity matrix may be defined as follows.

match_score = similarity(syllable1, syllable2)

gap_score = -1

Specifically, two syllables are compared; if they are identical, the similarity is a first value (+1), and if they are not identical, or if information is missing, the similarity is a second value (-1).

With this approach, however, syllables that sound similar but are written differently, such as '공' and '봉', are judged to be mismatches, which can degrade performance.
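The whole-syllable scoring and its limitation can be illustrated in a few lines. The names `GAP_SCORE` and `exact_similarity` are hypothetical; the values +1 and -1 are the ones stated above.

```python
GAP_SCORE = -1  # score for aligning a syllable with a gap


def exact_similarity(syllable1: str, syllable2: str) -> int:
    """Whole-syllable comparison: +1 if identical, otherwise -1."""
    return 1 if syllable1 == syllable2 else -1


# '공' and '봉' share their vowel and final consonant, yet they score the
# same as two completely unrelated syllables.
print(exact_similarity("공", "공"), exact_similarity("공", "봉"), exact_similarity("공", "가"))
```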

Therefore, each syllable may instead be decomposed into its initial, medial, and final phonemes, and the similarity calculated by comparing the phonemes position by position. For example, comparing the initial, medial, and final sounds of two syllables, the similarity may be a first value (for example, 3) if all three phonemes are the same, a second value (for example, 2) if exactly one of them differs, and a third value (for example, 1) if only one of them is the same.

Examples of comparing two syllables in this way are shown in FIG. 6. For example, when the initial, medial, and final phonemes are all identical, as in '가' and '가', the similarity is 3 points; when only the medial differs, as in '펭' and '팽', the similarity is 2 points; and when only the initial is the same, as in '요' and '원', the similarity is 1 point.
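A sketch of this phoneme-level scoring follows. It relies on the arithmetic layout of precomposed Hangul syllables in Unicode (base U+AC00, 19 initials, 21 medials, 28 finals including "no final") to split a syllable into three indices. The function names are hypothetical, and returning -1 when no phoneme matches is an assumption carried over from the mismatch value of the whole-syllable scheme; the text above does not state that case.

```python
from typing import Tuple

HANGUL_BASE = 0xAC00     # code point of '가'
HANGUL_COUNT = 11172     # 19 initials * 21 medials * 28 finals


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    code = ord(syllable) - HANGUL_BASE
    if not 0 <= code < HANGUL_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    # Final index 0 means the syllable has no final consonant.
    return code // 588, (code % 588) // 28, code % 28


def phoneme_similarity(syllable1: str, syllable2: str, mismatch: int = -1) -> int:
    """Score 3, 2 or 1 by the number of matching phonemes; `mismatch` if none."""
    matches = sum(a == b for a, b in zip(decompose(syllable1), decompose(syllable2)))
    return matches if matches > 0 else mismatch


for pair in [("가", "가"), ("펭", "팽"), ("요", "원"), ("공", "봉")]:
    print(pair, phoneme_similarity(*pair))
```

The three example pairs from the text score 3, 2, and 1 respectively, and '공' / '봉', which the whole-syllable comparison treats as a plain mismatch, now scores 2.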

Although only the scheme of assigning +3/+2/+1 points per phoneme has been described above, these values are merely examples; an implementation may use other values, and may assign different weights to the initial, medial, and final sounds when calculating the similarity. For example, +5 points may be given when all three phonemes are the same, and, when only some phonemes are the same, +2 points for a matching initial and +1 point for a matching medial or final.
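The weighted variant can be sketched as below, using the example values +5, +2, and +1. Two points are assumptions of this sketch rather than statements of the disclosure: partial matches are summed (a matching medial and final give 1 + 1 = 2), and a pair with no matching phoneme receives -1. The function names are hypothetical.

```python
from typing import Tuple

HANGUL_BASE = 0xAC00  # code point of '가'


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    code = ord(syllable) - HANGUL_BASE
    if not 0 <= code < 11172:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    return code // 588, (code % 588) // 28, code % 28


def weighted_similarity(
    syllable1: str,
    syllable2: str,
    full: int = 5,        # all three phonemes identical
    w_initial: int = 2,   # initial matches
    w_medial: int = 1,    # medial matches
    w_final: int = 1,     # final matches
    mismatch: int = -1,   # assumption: nothing matches
) -> int:
    i1, m1, f1 = decompose(syllable1)
    i2, m2, f2 = decompose(syllable2)
    if (i1, m1, f1) == (i2, m2, f2):
        return full
    # Assumption: weights of the matching positions are added together.
    score = w_initial * (i1 == i2) + w_medial * (m1 == m2) + w_final * (f1 == f2)
    return score if score > 0 else mismatch


for pair in [("가", "가"), ("펭", "팽"), ("요", "원"), ("공", "봉")]:
    print(pair, weighted_similarity(*pair))
```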

FIG. 7 is a diagram comparing the alignment result when the similarity is calculated from whole-syllable identity with the result when the similarity is calculated by dividing each syllable into three phonemes.

Specifically, the left side of the table is the alignment result when the similarity is calculated only from whether syllables are identical, and the right side is the alignment result when the similarity is calculated by dividing each syllable into three phonemes.

Referring to FIG. 7, it can be seen that calculating the similarity by dividing each syllable into three phonemes enables more accurate alignment.

FIG. 8 is a diagram illustrating an example of caption data corrected according to the present disclosure.

The result of FIG. 8 was obtained by performing caption synchronization according to the present disclosure on video content approximately 28 minutes 24 seconds long. The text accuracy is high because, as described above, the closed captions are used as they are. The average synchronization error per caption syllable was about 0.19 syllables, which shows that most of the captions were aligned accurately at the syllable level.

The caption synchronization described above took 4 minutes 7 seconds in total, of which aligning the speech-recognition captions with the closed captions and correcting the time information took 78 seconds. That is, much of the total time was spent on speech recognition. This is because a cloud system was used for speech recognition in this test; if speech recognition is performed locally, the time required for the whole synchronization can be reduced further.

Although only the synchronization of Korean captions has been described above, an implementation may also be applied to other languages (for example, English, Chinese, or Japanese). For English captions, for example, letters of the alphabet may serve as the phonemes and combinations of a few letters (that is, vowel + consonant) as the syllables when comparing two English texts. For Chinese captions, the pinyin corresponding to one Chinese character may serve as the syllable, and the initial (shengmu) and final (yunmu) that make up one pinyin as the phonemes, when comparing two Chinese texts.

FIG. 9 is a flowchart illustrating a caption synchronization method according to an embodiment of the present disclosure.

Referring to FIG. 9, video content and caption data for the video content are first received (S910). For example, when the video content or caption data to be synchronized is selected, the selected item is received, and the caption data or video content corresponding to it is searched for and received. In some implementations, at least one of the video content and the caption data may already be stored, in which case the corresponding receiving operation may be omitted.

Speech recognition is then performed on the video content to extract speech recognition data and time information (S920). Specifically, a speech recognition library may be used to recognize the speech in the video content and to generate speech recognition data containing the text of the speech and the time information of that text.

The extracted speech recognition data is then compared with the received caption data, and the time information in the caption data is corrected to the extracted time information (S930). For example, only the character string is first extracted from the extracted speech recognition data, and only the character string is extracted from the received caption data. The character string of the speech recognition data and the character string of the caption data are then aligned, and the time information in the caption data is corrected to the extracted time information based on the alignment result. Each of these operations has been described above, so a redundant description is omitted.

The caption data whose time information has been corrected is then stored (S940). If the caption data was received from an external device, the caption data with the corrected time information may be transmitted to that external device.

Accordingly, the caption synchronization method according to the present embodiment corrects the time information in the caption data using speech recognition data generated by speech recognition on the video content, so the caption data can easily be synchronized to the video. The caption synchronization method of FIG. 9 may be executed on an electronic device having the configuration of FIG. 2, and may also be executed on electronic devices having other configurations.

The caption synchronization method described above may be implemented as a program including an executable algorithm that can be run on a computer, and the program may be stored in and provided on a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, programs for performing the various methods described above may be stored in and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB storage device, memory card, or ROM.

Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described. Those of ordinary skill in the art to which the disclosure pertains may make various modifications without departing from the gist of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the technical spirit or outlook of the present disclosure.

1000: content creation system 10: broadcast transmission device
20: caption generating device 30: server
100: electronic device 110: communication device
120: memory 130: display
140: manipulation input device 150: processor

Claims

An electronic device comprising:
a communication device for receiving video content and caption data for the video content;
a memory; and
a processor that performs speech recognition on the video content to extract speech recognition data and time information, compares the extracted speech recognition data with the received caption data, corrects time information in the caption data to the extracted time information, and stores the caption data whose time information has been corrected in the memory,
wherein the processor
extracts only a character string from the extracted speech recognition data, extracts only a character string from the received caption data, calculates a similarity between the character string of the speech recognition data and the character string of the caption data by assigning different weight values to the initial, medial, and final sounds within a syllable unit, aligns the character string of the speech recognition data and the character string of the caption data based on syllables having a preset similarity among the calculated similarities, and corrects the time information in the caption data to the extracted time information based on the alignment result.

delete

The electronic device of claim 1,
wherein the processor
corrects, based on the alignment result, the time information of the first syllable of a line of the caption data to the time information of the character string of the speech recognition data corresponding to that line's character string.

The electronic device of claim 1,
wherein the processor
extracts the character string from the received caption data excluding text auxiliary information corresponding to a description of a situation in the video.

The electronic device of claim 1,
wherein the caption data for the video content
is a commentary caption transmitted in the form of an electronic code in correspondence with the video content.

delete

A method for synchronizing subtitles in an electronic device, the method comprising:
receiving video content and caption data for the video content;
performing voice recognition on the video content to extract voice recognition data and time information;
comparing the extracted speech recognition data with the received caption data and correcting time information in the caption data with the extracted time information; and
storing the caption data whose time information has been corrected,
wherein the correcting comprises:
extracting only a character string from the extracted speech recognition data;
extracting only a character string from the received caption data;
calculating a similarity between the character string of the speech recognition data and the character string of the caption data by assigning different weight values to the initial, medial, and final sounds within a syllable unit, and aligning the two character strings based on syllables having a preset similarity among the calculated similarities; and
correcting the time information in the caption data to the extracted time information based on the alignment result.

delete

The method of claim 7,
wherein the correcting of the time information in the caption data to the extracted time information based on the alignment result comprises
correcting, based on the alignment result, the time information of the first syllable of a line of the caption data to the time information of the character string of the speech recognition data corresponding to that line's character string.

The method of claim 7,
wherein the extracting of only the character string from the received caption data comprises
extracting the character string from the received caption data excluding text auxiliary information corresponding to a description of a situation in the video.

delete

A computer-readable recording medium comprising a program for executing a subtitle synchronization method,
The subtitle synchronization method includes:
receiving video content and caption data for the video content;
performing voice recognition on the video content to extract voice recognition data and time information;
comparing the extracted speech recognition data with the received caption data and correcting time information in the caption data with the extracted time information; and
storing the caption data whose time information has been corrected,
wherein the correcting comprises:
extracting only a character string from the extracted speech recognition data;
extracting only a character string from the received caption data;
calculating a similarity between the character string of the speech recognition data and the character string of the caption data by assigning different weight values to the initial, medial, and final sounds within a syllable unit, and aligning the two character strings based on syllables having a preset similarity among the calculated similarities; and
correcting the time information in the caption data to the extracted time information based on the alignment result.