KR102541162B1

KR102541162B1 - Electronic apparatus and method for caption synchronization of contents

Info

Publication number: KR102541162B1
Application number: KR1020210076765A
Authority: KR
Inventors: 오주현
Original assignee: 한국방송공사
Priority date: 2021-06-14
Filing date: 2021-06-14
Publication date: 2023-06-12
Also published as: KR20220167602A

Abstract

An electronic device is disclosed. The electronic device includes a communication device that receives video content and caption data for the video content, a memory, and a processor. The processor performs speech recognition on the video content to extract speech recognition data and time information, aligns the character string of the extracted speech recognition data with the character string of the caption data, replaces the time information in the caption data with the extracted time information based on the alignment result, and stores the caption data with the corrected time information in the memory. The processor calculates a similarity between the character string of the speech recognition data and the character string of the caption data, and aligns the two character strings based on the calculated similarity and a gap score applied in units of words and/or sentences.

Description

Electronic device and method for performing subtitle synchronization for content

The present disclosure relates to an electronic device and method for performing caption synchronization on content, and more particularly, to an electronic device and method that perform caption synchronization on content with high accuracy.

In Korea, terrestrial broadcasters provide a closed-caption service for the hearing-impaired throughout the entire broadcasting time, as required by the relevant laws. At present, closed captions are produced by having a stenographer type the captions in real time; the captions are then sent back to the broadcaster and inserted into the terrestrial broadcast.

Recently, broadcasters have been reprocessing broadcast content after its terrestrial broadcast for use in online services and the like. In this process, a time offset arises from pre-production, advertisement insertion, and so on, so the start point of the processed broadcast content may differ from the start point at the time of the terrestrial broadcast.

However, because the captions for terrestrial broadcasting (hereinafter, closed captions) are stored in the system as they are, without separate editing, they fall out of sync when they are used in an online service.

Accordingly, a method has recently been proposed that generates speech recognition text using speech recognition technology and compares the generated text with the closed captions to correct the sync of the closed captions.

However, because the existing method compares text only in units of syllables, it has problems such as the sync being split apart within a single sentence or within a single word of the closed captions.

Accordingly, an object of the present disclosure is to provide an electronic device and method that perform caption synchronization on content with high accuracy.

To achieve the above object, an electronic device according to the present disclosure includes a communication device that receives video content and caption data for the video content, a memory, and a processor. The processor performs speech recognition on the video content to extract speech recognition data and time information, aligns the character string of the extracted speech recognition data with the character string of the caption data, replaces the time information in the caption data with the extracted time information based on the alignment result, and stores the caption data with the corrected time information in the memory. The processor calculates a similarity between the character string of the speech recognition data and the character string of the caption data, and aligns the two character strings based on the calculated similarity and a gap score applied in units of words and/or sentences.

In this case, the processor may divide the character string in the caption data into sentences and, for a gap in the character string within a divided sentence, assign a first gap score whose sign is opposite to that of the similarity, and may align the character string of the speech recognition data with the character string of the caption data accordingly.

In this case, the processor may detect punctuation marks in the caption data and divide the character string in the caption data into sentences based on the detected punctuation marks.
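The punctuation-based sentence division can be illustrated with a short Python sketch. The source only says "punctuation marks", so the set of sentence-ending marks (`.`, `?`, `!`) and the function name `split_sentences` are assumptions made for illustration.

```python
import re

# Assumption: a sentence is a run of text followed by one or more of . ? !
# The disclosure does not list which punctuation marks are used.
_SENTENCE = re.compile(r"[^.?!]+[.?!]*")


def split_sentences(caption_text: str) -> list[str]:
    """Split caption text into sentences at sentence-ending punctuation."""
    pieces = (m.strip() for m in _SENTENCE.findall(caption_text))
    return [p for p in pieces if p]


print(split_sentences("밥 먹었어? 응, 먹었어. 가자!"))
```

The sentence boundaries found this way are what a later alignment step would need in order to tell a gap inside a sentence from a gap between sentences.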

Meanwhile, for a gap between the divided sentences, the processor may assign a second gap score having the same sign as the similarity, and may align the character string of the speech recognition data with the character string of the caption data accordingly.

Meanwhile, the processor may divide each character string into word units and, for a gap in the character string within a divided word, assign a third gap score whose sign is opposite to that of the similarity, and may align the character string of the speech recognition data with the character string of the caption data accordingly.
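The position-dependent gap scoring can be sketched as a small dynamic-programming alignment in Python. This is a minimal illustration, not the disclosed implementation: the function name `align_score`, the score values `gap_in=-2.0` and `gap_between=0.5`, and the caller-supplied `sim` function are all assumptions. A gap inside a caption sentence is penalized, while a gap at a sentence boundary receives a small score with the same sign as a match. A third gap score for gaps inside a word could be added in the same way with a word-boundary table.

```python
from typing import Callable, Sequence


def align_score(
    caption_sentences: Sequence[str],
    asr_text: str,
    sim: Callable[[str, str], float],
    gap_in: float = -2.0,       # assumed score for a gap inside a sentence
    gap_between: float = 0.5,   # assumed score for a gap between sentences
) -> float:
    """Best global alignment score of the caption against the ASR text."""
    caption = "".join(caption_sentences)
    # boundary[i] is True when a gap placed after i caption characters
    # falls between two sentences (or before the first / after the last).
    boundary = [False] * (len(caption) + 1)
    boundary[0] = True
    pos = 0
    for sentence in caption_sentences:
        pos += len(sentence)
        boundary[pos] = True

    n, m = len(caption), len(asr_text)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        # ASR characters before the first caption character: boundary gap.
        score[0][j] = score[0][j - 1] + gap_between
    for i in range(1, n + 1):
        # Caption character with no recognized counterpart
        # (assumed here to use the in-sentence gap score).
        score[i][0] = score[i - 1][0] + gap_in
        for j in range(1, m + 1):
            asr_gap = gap_between if boundary[i] else gap_in
            score[i][j] = max(
                score[i - 1][j - 1] + sim(caption[i - 1], asr_text[j - 1]),
                score[i - 1][j] + gap_in,
                score[i][j - 1] + asr_gap,
            )
    return score[n][m]


def exact(a: str, b: str) -> float:
    """Toy similarity used only for this example."""
    return 1.0 if a == b else -1.0


# An extra recognized character between the sentences costs nothing,
# while the same extra character inside a sentence lowers the score.
print(align_score(["ab", "cd"], "abXcd", exact))
print(align_score(["ab", "cd"], "aXbcd", exact))
```

With these assumed values the alignment prefers to place unmatched recognized text between sentences rather than splitting a sentence, which is the behavior the gap scores are meant to encourage.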

Meanwhile, the processor may determine a search area corresponding to the number of characters in the caption data, and may calculate the similarity between the character string of the speech recognition data and the character string of the caption data within the determined search area.
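One possible reading of the search area is a band around the diagonal of the alignment matrix whose width grows with the caption length. The Python sketch below illustrates that reading only; the function name `search_window`, the `divisor` parameter, and the proportional-width rule are assumptions, since this passage does not say how the area is derived from the character count.

```python
def search_window(i: int, n_caption: int, n_asr: int, divisor: int = 10) -> tuple[int, int]:
    """Return the half-open range [lo, hi) of ASR indices compared with caption index i.

    Assumption: the band half-width is the caption length divided by `divisor`,
    and the band is centered on the proportional position of i in the ASR text.
    """
    if n_caption <= 0:
        return (0, 0)
    half_width = max(1, n_caption // divisor)
    center = i * n_asr // n_caption
    lo = max(0, center - half_width)
    hi = min(n_asr, center + half_width + 1)
    return (lo, hi)


# Caption of 100 characters against 200 recognized characters.
print(search_window(50, 100, 200))
```

Restricting the similarity calculation to such a window keeps the cost roughly proportional to the caption length times the band width instead of the full product of both lengths.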

Meanwhile, the processor may calculate the similarity based on whether the initial consonant, medial vowel, and final consonant match within each syllable.

In this case, within each syllable, the processor may produce a first value if the initial consonant, medial vowel, and final consonant are all identical, a second value smaller than the first value if one of the three differs, a third value smaller than the second value if only one of the three is identical, and a fourth value smaller than the third value if none of the three match. The processor may then align the character string of the speech recognition data with the character string of the caption data based on the values calculated per syllable and the gap score.

In this case, the first to fourth values may be nonlinear values.
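The four-level syllable similarity can be sketched in Python using the standard arithmetic decomposition of precomposed Hangul syllables (U+AC00 to U+D7A3). The concrete values `1.0`, `0.7`, `0.2`, and `-1.0` are assumptions chosen only to be decreasing and unevenly spaced; the disclosure does not give numbers. The fallback for non-Hangul characters is also an assumption.

```python
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3

# Assumed scores keyed by the number of matching parts (3, 2, 1, 0).
# They are decreasing and unevenly spaced; the source gives no values.
SCORES = {3: 1.0, 2: 0.7, 1: 0.2, 0: -1.0}


def is_hangul_syllable(ch: str) -> bool:
    return HANGUL_FIRST <= ord(ch) <= HANGUL_LAST


def decompose(ch: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) indices of a precomposed Hangul syllable."""
    code = ord(ch) - HANGUL_FIRST
    return code // 588, (code % 588) // 28, code % 28


def syllable_similarity(a: str, b: str) -> float:
    """Score two characters by how many of their three parts match."""
    if not (is_hangul_syllable(a) and is_hangul_syllable(b)):
        # Assumption: non-Hangul characters are compared for exact equality.
        return SCORES[3] if a == b else SCORES[0]
    matches = sum(x == y for x, y in zip(decompose(a), decompose(b)))
    return SCORES[matches]


for other in "강간곤문":
    print("강", other, syllable_similarity("강", other))
```

Grading the score this way lets a near miss from the recognizer, such as a wrong final consonant, still count as strong evidence that two syllables correspond.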

Meanwhile, based on the alignment result, the processor may replace the time information for the first syllable of a line of the caption data with the time information for the corresponding character string of the speech recognition data.

In this case, if no character string corresponding to the first character string of a line of the caption data is present, the processor may estimate time information for the first character string using a preset utterance time and the time information for the character string of the speech recognition data that corresponds to a character string adjacent to the first character string, and may replace the time information for the first character string with the estimated time information.
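The fallback estimate can be sketched in Python as follows. This is only one plausible reading: the function name `line_start_time`, the per-syllable utterance time of 200 ms, and the rule of stepping back from the nearest matched syllable in the same line are assumptions, not details given in this passage.

```python
from typing import Optional, Sequence


def line_start_time(
    matched_ms: Sequence[Optional[int]],
    syllable_ms: int = 200,  # assumed preset utterance time per syllable
) -> Optional[int]:
    """Estimate the start time of a caption line.

    matched_ms[k] is the recognized time (ms) aligned to the k-th syllable
    of the line, or None when that syllable has no aligned counterpart.
    """
    for offset, t in enumerate(matched_ms):
        if t is not None:
            # Step back from the nearest matched syllable by the assumed
            # utterance time of the unmatched syllables before it.
            return max(0, t - offset * syllable_ms)
    return None  # nothing in the line was matched


print(line_start_time([None, None, 5400, 5600]))
print(line_start_time([1200, None]))
```

When the first syllable itself is matched the offset is zero, so the same function covers both the direct case and the estimated case.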

In this case, the processor may extract, from the received caption data, a character string that excludes auxiliary text information describing the situation in the video.

Meanwhile, the caption data for the video content may be explanatory captions transmitted in the form of electronic codes in correspondence with the video content.

Meanwhile, the processor may control the communication device so that the corrected caption data is transmitted to an external device.

Meanwhile, a caption synchronization method of an electronic device according to an embodiment of the present disclosure includes receiving video content and caption data for the video content, performing speech recognition on the video content to extract speech recognition data and time information, aligning the character string of the extracted speech recognition data with the character string of the caption data, correcting time information in the caption data based on the alignment result, and storing the caption data with the corrected time information. In the aligning, a similarity between the character string of the speech recognition data and the character string of the caption data is calculated, and the two character strings are aligned based on the calculated similarity and a gap score applied in units of words and/or sentences.

As described above, according to various embodiments of the present disclosure, caption data can be synchronized automatically using the speech recognition result for the video content. In particular, because the alignment is performed so as to suppress gaps within sentences and/or words, the caption data can be synchronized with high accuracy.

FIG. 1 is a diagram showing the configuration of a content creation system according to an embodiment of the present disclosure;
FIG. 2 is a diagram showing a specific configuration of an electronic device according to an embodiment of the present disclosure;
FIG. 3 is a diagram for explaining dynamic programming according to an embodiment of the present disclosure;
FIG. 4 is a diagram for explaining a restricted area according to an embodiment of the present disclosure;
FIG. 5 is a diagram for explaining the relationship between content and the restricted area;
FIG. 6 is a diagram showing an example of a time code estimated according to an embodiment of the present disclosure;
FIG. 7 is a diagram showing an example of time code estimation according to a first embodiment;
FIG. 8 is a diagram showing an example of time code estimation according to a second embodiment;
FIG. 9 is a diagram comparing a conventional alignment result with an alignment result according to an embodiment of the present disclosure; and
FIG. 10 is a flowchart illustrating a caption synchronization method according to an embodiment of the present disclosure.

The terms used in this specification are briefly described first, and the present disclosure is then described in detail.

The terms used in the embodiments of the present disclosure are, as far as possible, general terms that are currently in wide use, selected in consideration of their functions in the present disclosure; however, they may vary with the intention of those skilled in the art, precedent, the emergence of new technologies, and the like. In certain cases a term has been chosen arbitrarily by the applicant, and its meaning is then described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meaning and on the overall content of the present disclosure, not simply on their names.

The embodiments of the present disclosure may be modified in various ways and may take various forms, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the scope to specific embodiments, and should be understood to include all modifications, equivalents, and substitutes that fall within the disclosed spirit and technical scope. In describing the embodiments, a detailed description of related known technology is omitted where it could obscure the subject matter.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present disclosure, a "module" or "unit" performs at least one function or operation, and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of "modules" or "units" may be integrated into at least one module and implemented by at least one processor, except for a "module" or "unit" that needs to be implemented in specific hardware.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the present disclosure clearly, parts of the drawings unrelated to the description are omitted, and similar parts are given similar reference numerals throughout the specification.

Hereinafter, the present disclosure is described in more detail with reference to the drawings.

FIG. 1 is a diagram showing the configuration of a content creation system according to an embodiment of the present disclosure.

Referring to FIG. 1, a content creation system 1000 according to an embodiment of the present disclosure may include a broadcast transmitting device 10, a caption generating device 20, a server 30, and an electronic device 100.

The broadcast transmitting device 10 may transmit a terrestrial broadcast using video content. In doing so, the broadcast transmitting device 10 may provide a closed-caption broadcasting service using the caption data for the video content generated by the caption generating device 20. Here, a closed-caption broadcasting service is a service that displays text corresponding to the audio signal as captions on the TV screen: for the hearing-impaired, the audible messages of a TV program are converted into electronic codes and transmitted so that they appear as explanatory captions on the TV screen.

The broadcast transmitting device 10 may also transmit the video content whose terrestrial broadcast has been completed to the server 30. In this case, the broadcast transmitting device 10 may transmit the video content to the server 30 as it is, or may transmit edited content to the server 30 after performing operations such as inserting advertisements or dividing the content into a plurality of parts.

The caption generating device 20 may generate caption data for the video content transmitted by the broadcast transmitting device 10, and may transmit the generated caption data to the server 30 and/or the broadcast transmitting device 10. Here, the caption data is data generated from a stenographer's typed input, and may be explanatory captions transmitted in the form of electronic codes in correspondence with the video content. Because such caption data is produced by a professional stenographer, it is highly accurate; however, because stenographic input takes time, it has a time offset relative to the video content. In addition, caption data generally contains not only human speech but may also include auxiliary text information that describes the situation in the video, such as a car horn or a mobile phone ringing. Caption data generated by a stenographer in this way may be referred to as closed captions.

The server 30 may store the video content transmitted from the broadcast transmitting device 10, and may store the caption data for that video content received from the caption generating device 20. The server 30 may be an archive server.

The electronic device 100 may receive caption data from the server 30 and correct the sync of the received caption data. The electronic device 100 may then transmit the corrected caption data to the server 30.

In doing so, the electronic device 100 may receive the video content corresponding to the caption data from the server 30 and perform speech recognition (speech to text) on the received video content, thereby generating speech recognition data that contains text information for the speech included in the video content and time information (also called time synchronization information or a time code) for each piece of text. Such speech recognition data may be referred to as speech recognition captions, a speech recognition result, and the like.

Once the speech recognition data is generated, the electronic device 100 may compare the caption data with the speech recognition data and replace the time information in the caption data with the time information in the speech recognition data. The specific configuration and operation of the electronic device 100 are described later with reference to FIG. 2.

As described above, the content creation system 1000 according to the present embodiment corrects the time information in the caption data using speech recognition data generated by performing speech recognition on the video content, so the caption data can be synchronized to the video automatically, without human intervention.

Meanwhile, although FIG. 1 shows and describes the devices as being directly connected to one another, in an actual implementation the components may be connected via a separate external component. Also, although the electronic device 100 has been described as being connected to only one server 30, the electronic device 100 may be connected to a plurality of servers and may receive the video content and the caption data separately from the respective servers.

In addition, although FIG. 1 shows and describes the server 30 and the electronic device 100 as separate devices, in an actual implementation the two may be implemented as one. That is, the server 30 that stores the video content may itself perform the synchronization of the caption data, instead of transmitting the video content to the electronic device 100 to have the caption synchronization performed there.

FIG. 2 is a diagram showing a specific configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may include a communication device 110, a memory 120, a display 130, a manipulation input device 140, and a processor 150. The electronic device 100 may be a PC, a notebook PC, a smartphone, a server, or the like that is capable of image processing.

The communication device 110 is connected to the caption generating device 20 and/or the server 30, and may transmit and receive video content and caption data. Specifically, the communication device 110 is provided to connect the electronic device 100 to an external device, and may connect to a mobile device through a local area network (LAN) and the Internet, or through a Universal Serial Bus (USB) port.

In addition, besides a direct wired connection, the communication device 110 may be connected to another electronic device via a router or access point connected to the public Internet, and may be connected to the router or access point not only by wire but also wirelessly, for example over Wi-Fi, Bluetooth, or cellular communication.

The communication device 110 may also transmit the caption data whose time information has been corrected by the process described later (that is, the synchronized caption data) to an external device (for example, the server 30).

The memory 120 is a component for storing the operating system that runs the electronic device 100, the software for performing speech recognition and caption synchronization, data, and the like. The memory 120 may be implemented in various forms such as RAM, ROM, flash memory, an HDD, external memory, or a memory card, and is not limited to any one of them.

The memory 120 stores video content. Here, the video content includes audio as well as video, and may be a video file such as MP4, AVI, or MOV. These file formats are merely examples, and the video content is not limited to them.

The memory 120 also stores the caption data corresponding to the video content. Here, the caption data may include character string information to be displayed by the caption service and time information indicating when that character string information is to be displayed. The time information may be expressed in milliseconds.

The memory 120 may also store the character string information, time information, similarity information, and the like generated during the process described later, and may store the caption data with the corrected time information.

The display 130 displays a user interface window through which the various functions provided by the electronic device 100 can be selected. The display 130 may be a monitor such as an LCD, CRT, or OLED display, and may also be implemented as a touch screen that simultaneously performs the function of the manipulation input device 140 described later.

The manipulation input device 140 may receive, from the user, a selection of a function of the electronic device 100 and a control command for that function. Specifically, the manipulation input device 140 may receive a command indicating whether to correct the sync of the received caption data. The manipulation input device 140 may also receive a selection of the caption data (or video content) on which caption synchronization is to be performed.

The processor 150 controls each component in the electronic device 100. Specifically, when a boot command is input by the user, the processor 150 may boot using the operating system stored in the memory 120.

When video content or caption data is selected for caption synchronization, the processor 150 may control the communication device 110 to receive the caption data corresponding to the selected video content, or the video content corresponding to the selected caption data.

그리고 프로세서(150)는 영상 콘텐츠 및 영상 콘텐츠에 대응되는 자막 데이터가 수신되면, 영상 콘텐츠에 대한 음성 인식을 수행하여 음성 인식 데이터 및 시간 정보를 추출할 수 있다. 예를 들어, 프로세서(150)는 음성 인식과 관련하여 널리 알려진 라이브러리(예를 들어, Kaldi, CMU Sphinx, HTK, AWS Transcribe) 중 어느 하나를 이용하여 음성 인식을 수행할 수 있다. When video content and caption data corresponding to the video content are received, the processor 150 may perform voice recognition on the video content to extract voice recognition data and time information. For example, the processor 150 may perform speech recognition using any one of well-known libraries related to speech recognition (eg, Kaldi, CMU Sphinx, HTK, AWS Transcribe).

그리고 프로세서(150)는 추출된 음성 인식 데이터와 자막 데이터를 비교하여 자막 데이터 내의 시간 정보를 추출된 시간 정보로 수정할 수 있다. 구체적으로, 프로세서(150)는 먼저 추출된 음성 인식 데이터에서 문자열만을 추출할 수 있다. 예를 들어, 음성 인식 데이터는 상술한 바와 같이 문자열 및 해당 문자열의 시간 정보를 포함할 수 있으며, 후술하는 과정을 위하여, 프로세서(150)는 음성 인식 데이터에서 문자열만을 추출할 수 있다. In addition, the processor 150 may compare the extracted voice recognition data and the caption data to correct time information in the caption data to the extracted time information. Specifically, the processor 150 may extract only a character string from previously extracted voice recognition data. For example, the voice recognition data may include a character string and time information of the corresponding character string as described above, and for a process to be described later, the processor 150 may extract only the character string from the voice recognition data.

For example, caption data may include symbols such as '?' and '!', but text produced by speech recognition often does not contain such symbols accurately. Accordingly, the processor 150 may extract from the received caption data only the character strings with these symbols removed. Caption data may also include text that describes a situation (text auxiliary information), for example "a phone rings" or "a car horn sounds". Because such text auxiliary information is not produced by speech recognition, the processor 150 may extract from the received caption data the character strings excluding the text auxiliary information. Text auxiliary information is often written inside parentheses "(" and ")", so the processor 150 may extract from the received caption data the character strings excluding any text enclosed in parentheses.
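The cleaning step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `clean_caption_text` and the exact symbol set are assumptions, since the description only names '?', '!' and parenthesized text as examples. In practice, sentence boundaries would need to be recorded before the punctuation is stripped, because they are used later for alignment.

```python
import re

# Assumed patterns; the description gives only '?', '!' and "(...)" as examples.
_PARENTHESIZED = re.compile(r"\([^()]*\)")   # text auxiliary information, e.g. "(phone rings)"
_SYMBOLS = re.compile(r"[?!.,\"'~]")         # symbols that speech recognition rarely outputs


def clean_caption_text(text: str) -> str:
    """Return caption text without parenthesized notes and without symbols."""
    without_aux = _PARENTHESIZED.sub(" ", text)
    without_symbols = _SYMBOLS.sub("", without_aux)
    # Collapse the whitespace left behind by the removals.
    return " ".join(without_symbols.split())


print(clean_caption_text("(전화 벨이 울린다) 여보세요? 네!"))
```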

The processor 150 may also divide the character string in the received caption data into sentences based on its punctuation marks. For example, terminal punctuation marks such as '.', '!', and '?' complete a sentence, so the character string between two such marks can be treated as one sentence. A punctuation mark such as ',', on the other hand, indicates that the sentence continues, so it need not be used to delimit sentences.
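A minimal sketch of this sentence segmentation, assuming a hypothetical helper `split_sentences` and assuming that only '.', '!' and '?' end a sentence:

```python
import re

# A sentence is a run of non-terminal characters followed by any terminal marks.
_SENTENCE = re.compile(r"[^.!?]+[.!?]*")


def split_sentences(caption: str) -> list:
    """Split caption text after '.', '!' or '?'; commas do not split."""
    return [s.strip() for s in _SENTENCE.findall(caption) if s.strip()]


print(split_sentences("안녕하세요. 오늘, 날씨가 좋네요! 그렇죠?"))
```

Repeated marks such as "?!" stay attached to the sentence they close, which keeps an ellipsis from producing empty sentences.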

The processor 150 may also divide the character string of the speech recognition data into words. For ease of explanation, "word" below refers not only to a word that can stand on its own but also to such a word combined with a particle (josa) indicating its grammatical function. Korean and English generally separate words with spaces. The processor 150 may therefore identify the spaces in the character string of the speech recognition data and treat the character string between two spaces as one word.
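Because the later alignment works on syllables, it helps to keep, for every syllable, the index of the word it belongs to. The sketch below shows one way to do that; the function name `word_ids` and the parallel-list representation are illustrative assumptions, not something the description specifies.

```python
def word_ids(text: str) -> tuple:
    """Return (syllables without spaces, word index of each syllable)."""
    chars, ids = [], []
    for wid, word in enumerate(text.split()):   # split on whitespace
        for ch in word:
            chars.append(ch)
            ids.append(wid)
    return "".join(chars), ids


syllables, ids = word_ids("뉴스가 전해 드립니다")
print(syllables, ids)
```

A word boundary then lies between positions `k - 1` and `k` whenever `ids[k - 1] != ids[k]`.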

The processor 150 may then align the character string of the speech recognition data with the character string of the caption data. Specifically, the processor 150 may perform this alignment using dynamic programming. Dynamic programming is well suited to aligning two long strings and is widely used in bioinformatics to analyze genome base sequences. Among the many dynamic programming algorithms, the present disclosure uses the Needleman-Wunsch algorithm, but is not limited thereto. The method using the Needleman-Wunsch algorithm is described in detail with reference to FIG. 5.

Conventional dynamic programming, however, compares similarity based on the character strings alone, so the character strings within a single sentence and/or word may be aligned to different targets that each happen to show high similarity. When this happens, text within one sentence may be displayed at different sync timings, so it is necessary to prevent the character strings within a single sentence and/or word from being split apart.

To this end, the present disclosure aligns the character string of the speech recognition data with the character string of the caption data using not only the similarity between character strings but also gap scores defined per word and/or sentence. The specific method is described below with reference to FIG. 3.

The processor 150 may then replace the time information in the caption data with the extracted time information based on the alignment result. Specifically, the processor 150 may replace the time information of the start point (i.e., the first syllable) of each line of caption data with the time information of the speech recognition character string aligned to it. If no speech recognition character string corresponds to the start point of a line, the processor 150 may estimate the line's time information from the time information of the speech recognition character strings corresponding to the preceding or following line, and correct the line's time information with the estimate. This operation is described below with reference to FIGS. 6 to 8.
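The retiming step can be sketched as a lookup over the alignment result. The data layout here is an assumption made for illustration: `pairs` holds (caption syllable index, speech recognition syllable index) matches, `asr_times` holds one timestamp per speech recognition syllable, and `line_starts` holds the caption index of each line's first syllable. Lines whose first syllable has no match are left as `None` so that a separate estimation step can fill them in.

```python
def retime_lines(line_starts, pairs, asr_times):
    """Return a new start time per caption line, or None where unmatched."""
    matched = dict(pairs)   # caption index -> speech recognition index
    return [
        asr_times[matched[start]] if start in matched else None
        for start in line_starts
    ]


asr_times = [0.0, 0.2, 0.4, 0.6, 0.8]
pairs = [(0, 0), (1, 1), (3, 4)]
print(retime_lines([0, 3, 5], pairs, asr_times))
```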

The processor 150 may then store the caption data with the corrected time information in the memory 120. If the caption data is stored in an external device (e.g., the server 30) and provided to other devices from there, the processor 150 may control the communication device 110 to transmit the corrected caption data to the external device.

Although the electronic device 100 has been described above as correcting (or updating) the time information of the caption data, the same operation may also be described as generating new caption data by combining the character string information of the caption data with the time information of the speech recognition data.

As described above, the electronic device 100 according to the present embodiment corrects the time information in the caption data using speech recognition data generated by performing speech recognition on the video content, which makes it easier and faster to synchronize the caption data with the video data. In particular, performing the alignment so that gaps are suppressed within sentences and/or words allows the caption data to be synchronized with high accuracy.

Although FIGS. 1 and 2 have been illustrated and described as performing the above operations only on content produced by a broadcasting station, the operations may also be applied to video content produced elsewhere, provided that separate caption data entered by a stenographer is available.

Likewise, although FIG. 2 illustrates and describes the electronic device 100 as including a display and a manipulation input device, these may be omitted when the electronic device 100 is implemented as a device such as a server.

The dynamic programming method according to the present disclosure is described below.

Speech recognition for video content has improved considerably as speech recognition technology has advanced. Because recognition is performed on the video content itself, the time synchronization information of the resulting text is very accurate. The textual accuracy of speech recognition, however, is still lower than that of captions entered directly by a stenographer, so it is difficult to provide captions from the speech recognition result alone.

Closed captions entered by a stenographer, on the other hand, are highly accurate in content, but offsets arise from caption storage, broadcast production, and archiving processes, so the time synchronization information in the closed captions differs from the actual video. The advantages and disadvantages of speech recognition captions and closed captions are summarized in Table 1 below.

Table 1

| | Text information | Time synchronization information |
|---|---|---|
| Closed captions | Accurate | Inaccurate |
| Speech recognition captions | Inaccurate | Accurate |

Accordingly, the present disclosure uses the text information of the closed captions but synchronizes their time information to the time information of the speech recognition captions, thereby generating caption data in which both the text information and the time synchronization information are highly accurate.

To compare the text of the closed captions with the text of the speech recognition result, the present disclosure uses dynamic programming.

Dynamic programming is a method of solving a problem that consists of repeated subproblems at several levels by building the answer to the whole problem from the answers to the subproblems at each level. It works upward from the smallest subproblems to the larger ones.

Dynamic programming of this kind is well suited to aligning general strings such as base sequences, but the caption synchronization problem requires additional constraints to be taken into account.

First, in general sequence alignment, such as the alignment of base sequences, insertions and deletions occur where elements do not match; these are called gaps.

In the caption synchronization problem, the closed caption string can be divided into sentences (or words) and the speech recognition string into words, and the additional constraint is that large gaps are unlikely to occur inside these units.

In TV programs, large gaps generally arise in two cases. The first is when the spoken content is shown as on-screen captions. When on-screen captions are displayed, the closed captions do not include the corresponding text. The speech recognition result, however, contains text generated from the speech, so a large gap may occur.

The second is a song that speech recognition cannot handle. For a song, the closed captions contain text corresponding to the lyrics, but speech recognition is not performed on the song and no text is generated, so a large gap may occur.

The present disclosure performs the alignment with these additional constraints taken into account.

FIG. 3 is a diagram for explaining the dynamic programming method according to an embodiment of the present disclosure. Specifically, FIG. 3 shows a similarity matrix 310 that does not reflect the additional constraints described above and a similarity matrix 320 that does.

The alignment operation without the additional constraints is described first, followed by the alignment operation with them.

Referring to FIG. 3, in the first similarity matrix 310 the closed caption string is arranged along the horizontal axis and the speech recognition string along the vertical axis.

To apply dynamic programming to character strings, the smallest subproblem is taken to be the syllable; that is, the similarity can be calculated syllable by syllable.

One way to calculate this syllable-level similarity is to compare two syllables and assign a first value (+1) if they are identical and a second value (-1) if they are not identical or if information is missing.

However, with a simple identical/non-identical comparison, a pair such as '공' (gong) and '봉' (bong), which sound similar but are written as different syllables, is judged a mismatch, and performance may suffer.

Therefore, each syllable may instead be decomposed into its initial, medial, and final phonemes, and the similarity calculated by comparing the phonemes position by position. For example, comparing the initial, medial, and final phonemes of two syllables, a first value (e.g., 38) may be assigned if all three match, a second value (e.g., 19) if exactly one differs, a third value (e.g., 3) if only one matches, and a fourth value (e.g., 0) if all three differ. The values may thus be assigned non-linearly.
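The phoneme-level comparison can be sketched using the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3), in which each syllable encodes 19 initials, 21 medials, and 28 finals (including "no final"). The function names are hypothetical, the 38/19/3/0 values are the example values from the text, and the fallback for non-Hangul characters is an assumption.

```python
HANGUL_BASE = 0xAC00
HANGUL_COUNT = 19 * 21 * 28          # 11172 precomposed syllables

# Number of matching phonemes -> example score from the description.
SCORE_BY_MATCHES = {3: 38, 2: 19, 1: 3, 0: 0}


def decompose(ch: str):
    """Return (initial, medial, final) indices, or None if not a Hangul syllable."""
    code = ord(ch) - HANGUL_BASE
    if not 0 <= code < HANGUL_COUNT:
        return None
    return code // 588, (code % 588) // 28, code % 28


def syllable_similarity(a: str, b: str) -> int:
    ja, jb = decompose(a), decompose(b)
    if ja is None or jb is None:
        # Assumption: non-Hangul characters are compared as whole characters.
        return 38 if a == b else 0
    matches = sum(x == y for x, y in zip(ja, jb))
    return SCORE_BY_MATCHES[matches]


print(syllable_similarity("공", "봉"))  # → 19 (medial and final match)
```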

Although only the scheme of assigning +38/+19/+3/0 points at the phoneme level has been described, these values are merely examples. An implementation may use other values, and may calculate the similarity with different weights for the initial, medial, and final phonemes.

Once the syllable-level similarities have been calculated, the two character strings can be aligned by using them to find the path with the maximum cumulative similarity.

The path found in this way is shown as the solid line in the similarity matrix 310.

Looking at the middle region 311 of the similarity matrix 310, there is a section in which the path descends at a constant slope. Such a section is a region where a gap can arise between character strings within a single closed-caption sentence. For example, for the single closed-caption sentence "뉴스가 전해 드립니다." ("News is delivered."), the part "뉴스가" and the part "전해 드립니다" may be mapped to different regions that are separated in time.

If one sentence in the closed captions is mapped to character strings that are separated in the speech recognition result, the sync will be inaccurate.

Therefore, the alignment may be performed with additional gap scores that minimize gaps within a sentence (or word) of the closed captions and also minimize gaps within a word of the speech recognition result.

This alignment requires knowing in advance how the sentences in the closed captions and the words in the speech recognition text are delimited. As described above, sentences can be delimited by detecting punctuation marks in the closed captions, and words can be delimited by spaces. The similarity matrix reflecting these boundaries is shown as 320 in FIG. 3.

In the second similarity matrix 320, the closed caption string is arranged along the horizontal axis and the speech recognition string along the vertical axis. The blue vertical lines indicate sentence boundaries in the closed captions, and the red horizontal lines indicate word boundaries in the speech recognition captions.

By dividing the character string in the caption data into sentences in this way, a first gap score whose sign is opposite to that of the similarity can be assigned to a gap inside a sentence. Conversely, a gap between sentences can be assigned a gap score with the same sign as the similarity. Gap scores can be assigned in a similar manner after dividing the caption string into words, or the speech recognition string into words.

Once the syllable-level similarities and the gap scores have been calculated, the two character strings can be aligned by using them to find the path with the maximum cumulative score.

Compared with the first similarity matrix 310 (specifically, in the middle regions 311 and 312), the second similarity matrix 320, calculated with the gap scores described above, shows fewer gaps within closed-caption sentences, with gaps occurring only between sentences.

Table 2 below summarizes the syllable-level scoring scheme described above together with the scoring scheme that reflects the additional constraints.

Table 2

| Type | Value | Description |
|---|---|---|
| Match scores | 0 | No phoneme of the syllable matches |
| Match scores | 3 | One phoneme of the syllable matches |
| Match scores | 19 | Two phonemes of the syllable match |
| Match scores | 38 | All phonemes of the syllable match |
| Gap scores | -0.9 | Gap within a word in the closed captions |
| Gap scores | -15 | Gap within a sentence in the closed captions |
| Gap scores | -0.5 | Gap within a word in the speech recognition result |
| Gap scores | 0 | Gap between words in the speech recognition result |
| Gap scores | 3 | Gap after the end of a sentence in the closed captions |

The table above lists the gap-score items used in this example, but an implementation may use items other than these, or only some of them. The numerical values are likewise merely examples, and other values may be used depending on the type of content and the system environment.
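As a rough illustration of how a position-dependent gap score differs from a constant one, the sketch below scores the insertion of unmatched speech recognition text at a given position in the closed-caption string. Only two of the Table 2 items are modeled, and the way they are attached to positions is an interpretation rather than something the description spells out; the function name and the treatment of the text edges are likewise assumptions.

```python
INSIDE_SENTENCE_GAP = -15.0   # Table 2: gap within a closed-caption sentence
AFTER_SENTENCE_GAP = 3.0      # Table 2: gap after the end of a closed-caption sentence


def caption_gap_score(j: int, sentence_ids: list) -> float:
    """Score for a gap opened just before caption syllable j.

    sentence_ids[k] is the sentence index of caption syllable k.
    """
    if j == 0 or j == len(sentence_ids):
        # Assumption: the start and end of the text behave like sentence boundaries.
        return AFTER_SENTENCE_GAP
    if sentence_ids[j - 1] != sentence_ids[j]:
        return AFTER_SENTENCE_GAP
    return INSIDE_SENTENCE_GAP


ids = [0, 0, 0, 1, 1]
print([caption_gap_score(j, ids) for j in range(len(ids) + 1)])
```

In an aligner such as Needleman-Wunsch, a function like this would replace the constant gap term for moves that consume speech recognition syllables without consuming caption syllables, so that the best path is pushed to place such gaps at sentence boundaries.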

The string alignment described above requires similarity calculations over all character pairs in the two texts. Because the calculation time grows in proportion to the square of the number of characters to be aligned, the time needed for the similarity calculation must be reduced.

A method of reducing the calculation time is described below with reference to FIGS. 4 and 5.

FIG. 4 is a diagram for explaining a restricted area according to an embodiment of the present disclosure, and FIG. 5 is a diagram for explaining the relationship between content and the restricted area.

FIG. 4 shows the search area in which the similarity calculation is performed, together with the restricted area that is excluded from it.

As shown in FIG. 3, the path with the maximum cumulative similarity passes through the central region of the similarity matrices 310 and 320. Therefore, skipping the similarity calculation for the upper-right and lower-left regions outside the central region does not affect the path calculation.

Narrowing the central region to be searched speeds up the calculation, but accuracy drops if the path with the maximum cumulative similarity leaves the central region, so an optimal size for the central region must be determined.

To this end, various content was analyzed, and a correlation between the length of the closed captions and the size of the search area was found, as shown in FIG. 5. Specifically, the relationship between the number of characters in the closed captions and the size of the search area can be modeled as a logarithmic function. Accordingly, when closed captions are input, the electronic device 100 may count their characters, determine the search area corresponding to that count, and generate the similarity matrix described above only for the determined search area.
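The band restriction can be sketched as follows. The description states only that the search-area size follows a logarithmic function of the caption length; the coefficients `a` and `b`, the reading of "size" as a half-width around the main diagonal, and both function names are placeholders chosen for illustration.

```python
import math


def band_half_width(n_caption_chars: int, a: float = 50.0, b: float = 0.0) -> int:
    """Half-width of the search band; a and b are placeholder coefficients."""
    if n_caption_chars <= 1:
        return 1
    return max(1, round(a * math.log(n_caption_chars) + b))


def in_search_area(i: int, j: int, n_rows: int, n_cols: int, half_width: int) -> bool:
    """True if cell (row i, column j) lies within the band around the diagonal."""
    diagonal_row = j * n_rows / n_cols   # row where the main diagonal crosses column j
    return abs(i - diagonal_row) <= half_width


print(band_half_width(100), band_half_width(10000))
```

A banded aligner would evaluate only cells for which `in_search_area` is true and treat every other cell as unreachable.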

To reduce the memory used by the constrained dynamic programming, the black area of FIG. 4 (i.e., the restricted area) may be removed and only the search area (the green area) stored, rearranged into a rectangular form.

Once the text of the caption data and the text of the speech recognition data have been aligned as described above, the time information in the closed captions can be corrected using the alignment result.

Specifically, the time information can be corrected by replacing the time information of the first syllable of each line (or sentence) of the caption data with the time information of the corresponding character string in the speech recognition data.

However, if no character string in the speech recognition data corresponds to the first syllable of a line of caption data, the time information cannot be corrected in this manner.

For example, as in the song case described earlier, when a character string exists in the closed captions but no corresponding character string exists in the speech recognition data, the time information of the speech recognition data cannot be used.

A method of correcting the time information in such cases is described below.

FIG. 6 is a diagram showing an example of time codes estimated according to an embodiment of the present disclosure, FIG. 7 is a diagram showing an example of time code estimation according to a first embodiment, and FIG. 8 is a diagram showing an example of time code estimation according to a second embodiment.

First, linear regression may be performed on the time codes of the characters matched during alignment in order to estimate the time information of the unmatched characters. For this purpose, for characters other than the first character of a word, a per-character time code may be estimated from the total number of characters and the utterance duration. For example, the time code may be estimated using values such as (SR/start) shown in FIG. 6.

With simple least-squares linear regression alone, a small number of outliers can make the estimate inaccurate, as shown in FIG. 7. A method such as RANSAC (random sample consensus) can therefore be used to minimize the influence of outliers, as shown in FIG. 8.
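A self-contained sketch of RANSAC line fitting for this purpose, where each point is (caption character index, matched time code in seconds). The threshold, iteration count, and fixed seed are illustrative assumptions, and at least two points are required.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(points, threshold=0.5, iterations=200, seed=0):
    """Fit a line to the largest set of points consistent with some 2-point line."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue                      # a vertical line cannot map index to time
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the consensus set; fall back to all points if none was found.
    return fit_line(best_inliers or points)


points = [(i, 10 + 0.2 * i) for i in range(20)] + [(5, 90.0), (12, 70.0)]
slope, intercept = ransac_line(points)
print(round(slope, 3), round(intercept, 3))
```

The time code of an unmatched character at index `k` would then be estimated as `slope * k + intercept`.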

If fewer than a certain number of matches occur, time code estimation by regression may be abandoned. In that case the time code can be recovered by searching backward and forward for lines that have an accurate time code and using them as references.
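One possible form of this fallback is sketched below. The description says only that neighboring lines with accurate time codes are searched and referenced; interpolating linearly by line position between the nearest reliable neighbors, and copying the nearest one at the edges, are assumptions made here for illustration.

```python
def fill_missing_times(times):
    """Fill None entries using the nearest lines that have a known time."""
    known = [i for i, t in enumerate(times) if t is not None]
    filled = list(times)
    for i, t in enumerate(times):
        if t is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is not None and nxt is not None:
            # Assumed rule: interpolate by line position between the neighbors.
            ratio = (i - prev) / (nxt - prev)
            filled[i] = times[prev] + ratio * (times[nxt] - times[prev])
        elif prev is not None:
            filled[i] = times[prev]
        elif nxt is not None:
            filled[i] = times[nxt]
    return filled


print(fill_missing_times([1.0, None, None, 4.0, None]))
```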

FIG. 9 is a diagram comparing a conventional alignment result with an alignment result according to an embodiment of the present disclosure.

FIG. 9 shows an alignment result 910 obtained without the additional constraints and an alignment result 920 obtained with them.

Comparing the two results: without the additional constraints, the character strings within a single closed-caption (CC) sentence are mapped to character strings that are separated in the speech recognition result. With the additional constraints, such separation of the closed-caption character strings is limited, and the synchronization accuracy improves significantly.

FIG. 10 is a flowchart illustrating a caption synchronization method according to an embodiment of the present disclosure.

Referring to FIG. 10, video content and caption data for the video content are first received (S1010). For example, when video content or caption data to be synchronized is selected, the selected video content or caption data may be received, and the caption data or video content corresponding to the received item may then be searched for and received. Depending on the implementation, at least one of the video content and the caption data may be stored in advance, in which case the receiving operation described above may be omitted.

Next, speech recognition is performed on the video content to extract speech recognition data and time information (S1020). Specifically, a speech recognition library may be used to recognize the speech in the video content and generate speech recognition data containing the recognized text and the time information of that text.

The character string of the extracted speech recognition data is then aligned with the character string of the caption data (S1030). Specifically, a similarity between the character string of the speech recognition data and the character string of the caption data may be calculated, and the two character strings may be aligned based on the calculated similarity and an interval score applied in word and/or sentence units.
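The alignment step can be sketched in Python as a dynamic-programming alignment in the style of global sequence alignment. This is an illustrative sketch and not the implementation of the disclosure. It follows the claims in comparing the initial, medial, and final sounds of each Hangul syllable and in giving gaps inside a sentence a score of opposite sign to the similarity while gaps between sentences get a score of the same sign. All numeric values, the `skip_caption` penalty, and the function names are assumptions.

```python
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3

# Hypothetical table: number of matching sounds -> similarity score.
SIMILARITY = {3: 10.0, 2: 4.0, 1: 1.0, 0: -6.0}


def jamo(ch):
    """Split a precomposed Hangul syllable into (initial, medial, final)."""
    code = ord(ch) - HANGUL_BASE
    if not 0 <= code <= HANGUL_LAST - HANGUL_BASE:
        return None  # not a Hangul syllable
    return code // 588, (code % 588) // 28, code % 28


def similarity(a, b):
    ja, jb = jamo(a), jamo(b)
    if ja is None or jb is None:
        return SIMILARITY[3] if a == b else SIMILARITY[0]
    return SIMILARITY[sum(x == y for x, y in zip(ja, jb))]


def align(caption_sentences, sr_text, gap_inside=-3.0, gap_between=0.5,
          skip_caption=-2.0):
    """Return (caption_index, sr_index) pairs of aligned characters."""
    cap = [(ch, s) for s, sent in enumerate(caption_sentences) for ch in sent]
    n, m = len(cap), len(sr_text)

    def sr_gap(i):
        # Skipping recognized text is penalized only inside a sentence.
        inside = 0 < i < n and cap[i - 1][1] == cap[i][1]
        return gap_inside if inside else gap_between

    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i and j:
                sim = similarity(cap[i - 1][0], sr_text[j - 1])
                options.append((score[i - 1][j - 1] + sim, "match"))
            if i:
                options.append((score[i - 1][j] + skip_caption, "skip_caption"))
            if j:
                options.append((score[i][j - 1] + sr_gap(i), "skip_sr"))
            if options:
                score[i][j], move[i][j] = max(options)

    # Trace back from the end to collect the aligned pairs.
    pairs, i, j = [], n, m
    while i or j:
        step = move[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif step == "skip_caption":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# The recognizer heard a filler syllable between the two sentences.
pairs = align(["안녕하세요", "감사합니다"], "안녕하세요음감사합니다")
```

Because skipping recognized text costs nothing at a sentence boundary but is penalized inside a sentence, the filler syllable is absorbed between the two sentences and each sentence stays contiguous.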

Next, the time information in the caption data is corrected based on the alignment result (S1040). Specifically, the time information in the caption data may be replaced with the extracted time information based on the alignment result. For example, if the first syllable of a line of the caption data has a corresponding character string, the time information of that syllable may be corrected to the time information of the speech recognition data corresponding to the syllable. If the first character string of a line of the caption data has no corresponding character string, the time information for the first character string may be estimated using a preset utterance time and the time information of the character string of the speech recognition data that corresponds to a character string adjacent to the first character string, and the time information of the first character string may then be corrected to the estimated value.
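The two cases above can be sketched in Python as follows. The disclosure states only that the adjacent character's recognized time and a preset utterance time are used, so the back-off rule here (subtracting one preset per-character duration for each position skipped) and the names `line_start_time` and `char_duration` are assumptions.

```python
def line_start_time(matched_times, char_duration=0.2):
    """Pick the corrected start time for one caption line.

    matched_times[k] is the recognized time (seconds) of the k-th
    character of the line, or None if that character was not matched.
    Assumption: char_duration is the preset utterance time of one
    character, and an unmatched first character is dated by stepping
    back from the nearest matched character after it.
    """
    if not matched_times:
        return None
    if matched_times[0] is not None:
        return matched_times[0]  # first character matched directly
    for k, t in enumerate(matched_times):
        if t is not None:
            return t - k * char_duration
    return None  # no character in the line was matched


direct = line_start_time([5.0, None, 5.4])
estimated = line_start_time([None, None, 8.0], char_duration=0.25)
```

A line in which no character was matched returns `None`, which corresponds to the case where the time code has to be recovered from neighboring lines instead.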

The caption data with the corrected time information is then stored (S1050). If the caption data was received from an external device, the caption data with the corrected time information may be transmitted to that external device.

Accordingly, the caption synchronization method according to the present embodiment corrects the time information in the caption data using speech recognition data generated by speech recognition of the video content, so the caption data can easily be synchronized with the video. In addition, because the alignment is performed so as to suppress gaps within sentences and/or words, the caption data can be synchronized with high accuracy. The caption synchronization method of FIG. 10 may be executed on an electronic device having the configuration of FIG. 2, and may also be executed on electronic devices having other configurations.

In addition, the caption synchronization method described above may be implemented as a program containing an algorithm executable on a computer, and the program may be stored in and provided on a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, programs for performing the various methods described above may be stored in and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB, memory card, or ROM.

Although preferred embodiments of the present disclosure have been shown and described above, the present disclosure is not limited to the specific embodiments described. Various modifications may of course be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the present disclosure as claimed in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present disclosure.

1000: content creation system 10: broadcast transmission device
20: caption generating device 30: server
100: electronic device 110: communication device
120: memory 130: display
140: operation input device 150: processor

Claims

An electronic device comprising:
a communication device that receives video content and caption data for the video content;
a memory; and
a processor that performs speech recognition on the video content to extract speech recognition data and time information, aligns a character string of the extracted speech recognition data with a character string of the caption data, corrects time information in the caption data to the extracted time information based on the alignment result, and stores the caption data having the corrected time information in the memory,
wherein the processor
calculates a similarity between the character string of the speech recognition data and the character string of the caption data based on whether the initial consonant, medial vowel, and final consonant match within each syllable, and aligns the character string of the speech recognition data with the character string of the caption data based on the calculated similarity and an interval score for at least one of word units and sentence units.

The electronic device according to claim 1,
wherein the processor
divides the character string in the caption data into sentence units, and aligns the character string of the speech recognition data with the character string of the caption data by assigning a first interval score, having a sign opposite to that of the similarity, to gaps between characters within each divided sentence.

The electronic device according to claim 2,
wherein the processor
detects punctuation marks in the caption data and divides the character string in the caption data into sentence units based on the detected punctuation marks.

The electronic device according to claim 2,
wherein the processor
aligns the character string of the speech recognition data with the character string of the caption data by assigning a second interval score, having the same sign as the similarity, to gaps between the divided sentences.

The electronic device according to claim 1,
wherein the processor
divides the character string in the caption data into word units, and aligns the character string of the speech recognition data with the character string of the caption data by assigning a third interval score, having a sign opposite to that of the similarity, to gaps between characters within each divided word.

The electronic device according to claim 1,
wherein the processor
determines a search area corresponding to the number of characters of the caption data, and calculates the similarity between the character string of the speech recognition data and the character string of the caption data within the determined search area.

delete

The electronic device according to claim 1,
wherein the processor
calculates, for each syllable, a first value if the initial consonant, medial vowel, and final consonant all match, a second value smaller than the first value if one of the initial consonant, medial vowel, and final consonant differs, a third value smaller than the second value if only one of the initial consonant, medial vowel, and final consonant matches, and a fourth value smaller than the third value if none of the initial consonant, medial vowel, and final consonant match, and
aligns the character string of the speech recognition data with the character string of the caption data based on the value calculated for each syllable and the interval score.

The electronic device according to claim 8,
wherein the first to fourth values vary non-linearly.

The electronic device according to claim 1,
wherein the processor
corrects, based on the alignment result, the time information of the first syllable of a line of the caption data to the time information of the character string of the speech recognition data corresponding to that syllable.

The electronic device according to claim 10,
wherein the processor,
if there is no character string in the caption data corresponding to the first character string of the line of the caption data, estimates time information for the first character string using a preset utterance time and the time information of the character string of the speech recognition data corresponding to a character string adjacent to the first character string, and corrects the time information of the first character string to the estimated time information.

The electronic device according to claim 1,
wherein the processor
extracts, from the received caption data, a character string that excludes auxiliary text information corresponding to a situation description of the video.

The electronic device according to claim 1,
wherein the caption data for the video content
is a commentary caption transmitted in the form of an electronic code corresponding to the video content.

The electronic device according to claim 1,
wherein the processor
controls the communication device to transmit the corrected caption data to an external device.

A caption synchronization method of an electronic device, the method comprising:
receiving video content and caption data for the video content;
extracting speech recognition data and time information by performing speech recognition on the video content;
aligning a character string of the extracted speech recognition data with a character string of the caption data;
correcting time information in the caption data based on the alignment result; and
storing the caption data having the corrected time information,
wherein the aligning comprises
calculating a similarity between the character string of the speech recognition data and the character string of the caption data based on whether the initial consonant, medial vowel, and final consonant match within each syllable, and aligning the character string of the speech recognition data with the character string of the caption data based on the calculated similarity and an interval score for at least one of word units and sentence units.