KR20220142723A

KR20220142723A - Electronic apparatus that outputs subtitle on screen where video is played based on voice recognition and operating method thereof

Info

Publication number: KR20220142723A
Application number: KR1020210049114A
Authority: KR
Inventors: 원찬식; 최보람; 강희석; 손은채
Original assignee: 주식회사 한글과컴퓨터; (주)엠디에스인텔리전스; 주식회사 한컴위드
Priority date: 2021-04-15
Filing date: 2021-04-15
Publication date: 2022-10-24
Also published as: KR102523814B1

Abstract

Disclosed are an electronic device for outputting a subtitle on a screen where a video is played based on voice recognition, and an operating method thereof. According to the electronic device for outputting a subtitle on a screen where a video is played based on voice recognition and the operating method thereof, when a user watches a video, the user can conveniently view the subtitle at the playback time he/she wants to watch.

Description

An electronic device that outputs subtitles on a screen on which an image is played based on voice recognition and an operation method thereof

The present invention relates to an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played, and to an operating method thereof.

Recently, as the COVID-19 era has dragged on, the proportion of video content watched indoors has been increasing. Accordingly, real-time TV viewing has increased, and the number of VOD views has also increased.

As the proportion of video content watched indoors increases, so does the need to support more efficient viewer-facing functions for video content.

In this regard, conventional technology does not provide a function that outputs subtitles, based on voice recognition, on the screen on which video content is played while a viewer is watching it, so it has not been easy for viewers to see the subtitles at the playback time point they want to watch.

If a technology were introduced that outputs subtitles, based on voice recognition, on the screen on which video content is played, so that viewers can see the subtitles at the playback time point they want to watch, viewers would find watching video content more convenient.

Therefore, research is needed on a subtitle service technology capable of outputting subtitles, based on voice recognition, on a screen on which a video is played.

The present invention aims to present an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played, and an operating method thereof, so that a user watching a video can conveniently view the subtitles at the playback time point he or she wants to watch.

An electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played according to an embodiment of the present invention includes: a division unit that, when a playback command for first video data is received from a user, plays the first video data while dividing the first video data at a preset first playback time interval to generate a plurality of segmented video data, and identifies information on the playback time point of each of the plurality of segmented video data within the total playback time of the first video data; a subtitle generation unit that performs voice recognition by sequentially applying the plurality of segmented video data as input to a pre-built voice recognition model, thereby generating a plurality of subtitles corresponding to the respective segmented video data; a subtitle table generation unit that generates a subtitle table in which the plurality of subtitles and the information on the playback time points of the segmented video data corresponding to the respective subtitles are recorded in correspondence with each other; and a subtitle display unit that, when a subtitle display command for the first video data is received from the user at a first playback time point while the first video data is being played, displays, each time one of the playback time points recorded in the subtitle table is reached from the first playback time point onward, the subtitle recorded in the subtitle table for that playback time point by floating it at a preset first point in the screen area where the video is displayed.

In addition, an operating method of an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played according to an embodiment of the present invention includes: when a playback command for first video data is received from a user, playing the first video data while dividing the first video data at a preset first playback time interval to generate a plurality of segmented video data, and identifying information on the playback time point of each of the plurality of segmented video data within the total playback time of the first video data; performing voice recognition by sequentially applying the plurality of segmented video data as input to a pre-built voice recognition model, thereby generating a plurality of subtitles corresponding to the respective segmented video data; generating a subtitle table in which the plurality of subtitles and the information on the playback time points of the segmented video data corresponding to the respective subtitles are recorded in correspondence with each other; and, when a subtitle display command for the first video data is received from the user at a first playback time point while the first video data is being played, displaying, each time one of the playback time points recorded in the subtitle table is reached from the first playback time point onward, the subtitle recorded in the subtitle table for that playback time point by floating it at a preset first point in the screen area where the video is displayed.

By presenting an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played, and an operating method thereof, the present invention can help a user watching a video conveniently view the subtitles at the playback time point he or she wants to watch.

FIG. 1 is a diagram illustrating the structure of an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams for explaining the operation of an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating an operating method of an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played according to an embodiment of the present invention.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. This description is not intended to limit the present invention to specific embodiments and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, like reference numerals are used for like components, and unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

In this document, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them. In addition, in various embodiments of the present invention, each component, functional block, or means may consist of one or more sub-components, and the electrical, electronic, and mechanical functions performed by each component may be implemented with various well-known devices or mechanical elements such as electronic circuits, integrated circuits, and ASICs (Application Specific Integrated Circuits), and may be implemented separately or with two or more integrated into one.

Meanwhile, the blocks of the accompanying block diagram and the steps of the flowchart may be interpreted as computer program instructions that are loaded into the processor or memory of equipment capable of data processing, such as a general-purpose computer, a special-purpose computer, a portable notebook computer, or a network computer, and that perform designated functions. Because these computer program instructions may be stored in a memory provided in a computer device or in a computer-readable memory, the functions described in the blocks of the block diagram or the steps of the flowchart may also be produced as an article of manufacture containing instruction means that perform them. In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks or steps may be executed out of the stated order. For example, two blocks or steps shown in succession may be performed substantially simultaneously or in reverse order, and in some cases some blocks or steps may be omitted.

FIG. 1 is a diagram illustrating the structure of an electronic device that outputs subtitles, based on voice recognition, on a screen on which a video is played according to an embodiment of the present invention.

Referring to FIG. 1, the electronic device 110 according to an embodiment of the present invention includes a division unit 111, a subtitle generation unit 112, a subtitle table generation unit 113, and a subtitle display unit 114.

When a playback command for first video data is received from a user, the division unit 111 plays the first video data while dividing the first video data at a preset first playback time interval to generate a plurality of segmented video data, and identifies information on the playback time point of each of the plurality of segmented video data within the total playback time of the first video data.

The subtitle generation unit 112 performs voice recognition by sequentially applying the plurality of segmented video data as input to a pre-built voice recognition model, thereby generating a plurality of subtitles corresponding to the respective segmented video data.

The subtitle table generation unit 113 generates a subtitle table in which the plurality of subtitles and the information on the playback time points of the segmented video data corresponding to the respective subtitles are recorded in correspondence with each other.

When a subtitle display command for the first video data is received from the user at a first playback time point while the first video data is being played, the subtitle display unit 114 displays, each time one of the playback time points recorded in the subtitle table is reached from the first playback time point onward, the subtitle recorded in the subtitle table for that playback time point by floating it at a preset first point in the screen area where the video is displayed.

Hereinafter, the operations of the division unit 111, the subtitle generation unit 112, the subtitle table generation unit 113, and the subtitle display unit 114 are described in detail by way of example, with reference to FIGS. 1 and 2.

First, when a playback command for the first video data is received by the electronic device 110 from the user, the division unit 111 can play the first video data while dividing it at the preset first playback time interval to generate a plurality of segmented video data.

Suppose the preset first playback time interval is '1 minute' and the total playback time of the first video data is '5 minutes'. The division unit 111 can then divide the first video data at '1 minute' intervals to generate a plurality of segmented video data, each with a playback time of '1 minute': 'segmented video data 1, segmented video data 2, segmented video data 3, segmented video data 4, segmented video data 5'.

After that, the division unit 111 can identify, within the total playback time of '5 minutes' of the first video data, the information on the playback time points of 'segmented video data 1, segmented video data 2, segmented video data 3, segmented video data 4, segmented video data 5', namely 'playback time point 1 (0 seconds), playback time point 2 (1 minute), playback time point 3 (2 minutes), playback time point 4 (3 minutes), playback time point 5 (4 minutes)'.
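The segmentation in this example can be sketched as follows. This is only a minimal illustration, assuming durations are expressed in whole seconds; the function name `segment_start_times` is hypothetical and does not appear in the disclosure.

```python
def segment_start_times(total_seconds, interval_seconds):
    """Return the playback start time (in seconds) of each segment.

    Hypothetical helper: the video is cut every `interval_seconds`,
    so segment i starts at i * interval_seconds.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return list(range(0, total_seconds, interval_seconds))


# A 5-minute video divided at 1-minute intervals, as in the example above.
starts = segment_start_times(total_seconds=300, interval_seconds=60)
print(starts)  # [0, 60, 120, 180, 240]
```

If the total playback time is not a multiple of the interval, this sketch simply yields a shorter final segment; the text does not say how that case is handled.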

Then, the subtitle generation unit 112 can perform voice recognition by sequentially applying 'segmented video data 1, segmented video data 2, segmented video data 3, segmented video data 4, segmented video data 5' as input to the pre-built voice recognition model, thereby generating the plurality of subtitles corresponding to the respective segmented video data as 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5'.

After that, the subtitle table generation unit 113 can generate a subtitle table, as shown in Table 1 below, in which 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5' are recorded in correspondence with the information on the playback time points of the corresponding 'segmented video data 1, segmented video data 2, segmented video data 3, segmented video data 4, segmented video data 5', namely 'playback time point 1 (0 seconds), playback time point 2 (1 minute), playback time point 3 (2 minutes), playback time point 4 (3 minutes), playback time point 5 (4 minutes)'.

Table 1

Subtitles     Playback time point of segmented video data
Subtitle 1    Playback time point 1 (0 seconds) of segmented video data 1
Subtitle 2    Playback time point 2 (1 minute) of segmented video data 2
Subtitle 3    Playback time point 3 (2 minutes) of segmented video data 3
Subtitle 4    Playback time point 4 (3 minutes) of segmented video data 4
Subtitle 5    Playback time point 5 (4 minutes) of segmented video data 5
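One possible in-memory form of such a subtitle table is sketched below. The `SubtitleEntry` type and `build_subtitle_table` function are hypothetical names chosen for illustration, and the subtitle strings are placeholders standing in for the output of the voice recognition model.

```python
from dataclasses import dataclass


@dataclass
class SubtitleEntry:
    start_seconds: int  # playback time point of the segment
    subtitle: str       # text recognized from that segment


def build_subtitle_table(start_times, subtitles):
    """Pair each subtitle with the playback time point of its segment."""
    if len(start_times) != len(subtitles):
        raise ValueError("one subtitle is expected per segment")
    return [SubtitleEntry(s, t) for s, t in zip(start_times, subtitles)]


# Placeholder subtitles; real ones would come from voice recognition.
table = build_subtitle_table(
    [0, 60, 120, 180, 240],
    [f"Subtitle {i}" for i in range(1, 6)],
)
for entry in table:
    print(entry.start_seconds, entry.subtitle)
```

Keeping the entries sorted by playback time point makes later lookups by time straightforward.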

Now assume that, while the first video data is being played on the electronic device 110, a subtitle display command for the first video data is received by the electronic device 110 from the user at the playback time point '1 minute 30 seconds'.

Then, starting from '1 minute 30 seconds', each time one of the playback time points recorded in the subtitle table of Table 1 above is reached, namely 'playback time point 3 (2 minutes), playback time point 4 (3 minutes), playback time point 5 (4 minutes)', the subtitle display unit 114 can display the subtitle recorded for that playback time point, namely 'subtitle 3, subtitle 4, subtitle 5', by floating it at the preset first point in the screen area where the video is displayed.

Specifically, as shown in FIG. 2, when the playback time point of the first video data reaches 'playback time point 3 (2 minutes)', the subtitle display unit 114 can display 'subtitle 3' by floating it at the first point 211 in the screen area 210 where the video is displayed; when the playback time point reaches 'playback time point 4 (3 minutes)', it can display 'subtitle 4' by floating it at the first point 211 in the screen area 210; and when the playback time point reaches 'playback time point 5 (4 minutes)', it can display 'subtitle 5' by floating it at the first point 211 in the screen area 210.
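Selecting which table entries remain to be shown after the display command can be sketched as a sorted lookup. The function name `subtitles_from` and the tuple layout are assumptions made for this illustration; the actual rendering at the first point 211 is outside the scope of the sketch.

```python
from bisect import bisect_left


def subtitles_from(table, command_seconds):
    """Return the entries whose playback time point is reached at or
    after the moment the display command was received.

    `table` is a list of (start_seconds, subtitle) sorted by start time.
    """
    starts = [start for start, _ in table]
    return table[bisect_left(starts, command_seconds):]


table = [
    (0, "Subtitle 1"),
    (60, "Subtitle 2"),
    (120, "Subtitle 3"),
    (180, "Subtitle 4"),
    (240, "Subtitle 5"),
]

# Display command received at 1 minute 30 seconds (90 s).
for start, text in subtitles_from(table, 90):
    print(f"at {start} s show {text!r}")
```

In this sketch a command received exactly at a recorded playback time point includes that entry; the text does not specify this boundary case, so that choice is an assumption.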

According to an embodiment of the present invention, the subtitle generation unit 112 may, for each of the plurality of segmented video data, remove all sound components in frequency bands other than a first frequency band preset as the frequency band corresponding to the human voice, and then perform voice recognition by sequentially applying each filtered segmented video data as input to the voice recognition model, thereby generating the plurality of subtitles.

For example, assume that the first frequency band preset as the frequency band corresponding to the human voice is '50Hz ~ 4000Hz'.

In this case, following the example above, the subtitle generation unit 112 can remove, from each of 'segmented video data 1, segmented video data 2, segmented video data 3, segmented video data 4, segmented video data 5', all sound components in frequency bands other than the first frequency band of '50Hz ~ 4000Hz'.

After that, the subtitle generation unit 112 can perform voice recognition by sequentially applying each segmented video data, from which all sound components outside the first frequency band of '50Hz ~ 4000Hz' have been removed, as input to the voice recognition model, thereby generating the plurality of subtitles as 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5'.
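The text does not say how the out-of-band components are removed. One simple way to illustrate the idea is a frequency-domain mask: transform the audio samples, zero every bin outside 50 Hz to 4000 Hz, and transform back. The sketch below uses a naive discrete Fourier transform from the standard library so that it stays self-contained; the function name `band_pass`, the sample rate, and the test tones are all assumptions, and a practical system would use an FFT or a proper filter design.

```python
import cmath
import math


def band_pass(samples, sample_rate, low_hz=50.0, high_hz=4000.0):
    """Zero all frequency components outside [low_hz, high_hz].

    Naive O(n^2) DFT, suitable only for short illustrative signals.
    """
    n = len(samples)
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


sample_rate = 16000
n = 320
# A 1000 Hz tone stands in for speech; a 6000 Hz tone stands in for noise.
voice = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
noise = [0.5 * math.sin(2 * math.pi * 6000 * t / sample_rate) for t in range(n)]
mixed = [v + x for v, x in zip(voice, noise)]

filtered = band_pass(mixed, sample_rate)
error = max(abs(f - v) for f, v in zip(filtered, voice))
print(f"max deviation from the in-band tone: {error:.2e}")
```

Both tones fall exactly on DFT bins here, so the out-of-band tone is removed cleanly; real audio would show some leakage at the band edges.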

According to an embodiment of the present invention, the electronic device 110 may further include a translated subtitle generation unit 115, a translated subtitle recording processing unit 116, and a translated subtitle additional display unit 117.

When an additional display command instructing that subtitles in a first foreign language be additionally displayed is received from the user at a second playback time point while the first video data is being played, the translated subtitle generation unit 115 refers to the subtitle table and translates the plurality of subtitles into the first foreign language through a pre-built translation engine, thereby generating a plurality of translated subtitles.

When the plurality of translated subtitles are generated, the translated subtitle recording processing unit 116 additionally records, in the subtitle table, the translated subtitle corresponding to each of the plurality of subtitles in correspondence with that subtitle.

Each time one of the playback time points recorded in the subtitle table is reached from the second playback time point onward, the translated subtitle additional display unit 117 displays the subtitle and the translated subtitle recorded in the subtitle table for that playback time point by floating them at the first point and at a preset second point, respectively, in the screen area where the video is displayed.

Hereinafter, the operations of the translated subtitle generation unit 115, the translated subtitle recording processing unit 116, and the translated subtitle additional display unit 117 are described in detail by way of example, with reference to FIGS. 1 and 3.

First, assume that, because the first video data is video data dubbed in 'Korean', the plurality of subtitles generated by the subtitle generation unit 112 are in 'Korean', that the first foreign language is 'English', and that, while the first video data is being played, an additional display command instructing that subtitles in 'English' be additionally displayed is received by the electronic device 110 from the user at the playback time point '2 minutes 30 seconds'.

Then, the translated subtitle generation unit 115 can refer to the subtitle table of Table 1 above and translate 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5' into 'English' through the pre-built translation engine, thereby generating the plurality of translated subtitles as 'translated subtitle 1, translated subtitle 2, translated subtitle 3, translated subtitle 4, translated subtitle 5'.

Once the plurality of translated subtitles are generated by the translated subtitle generation unit 115, the translated subtitle recording processing unit 116 can additionally record, in the subtitle table of Table 1 above, the translated subtitles 'translated subtitle 1, translated subtitle 2, translated subtitle 3, translated subtitle 4, translated subtitle 5' in correspondence with 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5', respectively, as shown in Table 2 below.

Table 2

Subtitles     Translated subtitles     Playback time point of segmented video data
Subtitle 1    Translated subtitle 1    Playback time point 1 (0 seconds) of segmented video data 1
Subtitle 2    Translated subtitle 2    Playback time point 2 (1 minute) of segmented video data 2
Subtitle 3    Translated subtitle 3    Playback time point 3 (2 minutes) of segmented video data 3
Subtitle 4    Translated subtitle 4    Playback time point 4 (3 minutes) of segmented video data 4
Subtitle 5    Translated subtitle 5    Playback time point 5 (4 minutes) of segmented video data 5

After that, starting from the playback time point '2 minutes 30 seconds', each time one of the playback time points recorded in the subtitle table of Table 2 above is reached, namely 'playback time point 4 (3 minutes), playback time point 5 (4 minutes)', the translated subtitle additional display unit 117 can display the subtitles 'subtitle 4, subtitle 5' and the translated subtitles 'translated subtitle 4, translated subtitle 5' recorded for those playback time points by floating them at the first point and at the preset second point, respectively, in the screen area where the video is displayed.

More specifically, as shown in FIG. 3, when the playback time point of the first video data reaches 'playback time point 4 (3 minutes)', the translated subtitle additional display unit 117 can display 'subtitle 4' and 'translated subtitle 4' by floating them at the first point 211 and the second point 311, respectively, in the screen area 210 where the video is displayed; and when the playback time point reaches 'playback time point 5 (4 minutes)', it can display 'subtitle 5' and 'translated subtitle 5' by floating them at the first point 211 and the second point 311, respectively, in the screen area 210.
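The translated-subtitle flow can be sketched as extending each table row with a translation and then selecting the rows still to come after the command. Everything named here is hypothetical: `fake_translate` merely stands in for the pre-built translation engine, and the tuple layout is an assumption for illustration.

```python
def fake_translate(text):
    """Stand-in for the translation engine; returns a labeled placeholder."""
    return f"Translated {text}"


def add_translations(table, translate):
    """Extend each (start_seconds, subtitle) row with its translation."""
    return [(start, subtitle, translate(subtitle)) for start, subtitle in table]


table = [
    (0, "Subtitle 1"),
    (60, "Subtitle 2"),
    (120, "Subtitle 3"),
    (180, "Subtitle 4"),
    (240, "Subtitle 5"),
]
extended = add_translations(table, fake_translate)

# Additional display command received at 2 minutes 30 seconds (150 s).
pending = [row for row in extended if row[0] >= 150]
for start, subtitle, translated in pending:
    print(f"at {start} s: first point <- {subtitle!r}, second point <- {translated!r}")
```

Note that the whole table is translated even though only the later rows are displayed, which matches the example in the text where all five subtitles are translated.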

According to an embodiment of the present invention, the electronic device 110 may further include a speech synthesis unit 118, a mixing unit 119, a combining unit 120, and a dubbing playback unit 121.

When a dubbing playback command requesting that the first video data be dubbed into the first foreign language and played is received from the user at a third playback time point while the first video data is being played, the speech synthesis unit 118 performs speech synthesis by applying each of the plurality of translated subtitles as input to a pre-built speech synthesis model, thereby generating a plurality of synthesized voices corresponding to the plurality of translated subtitles.

믹싱부(119)는 상기 복수의 분할 영상 데이터들 각각에 대해, 사람의 음성에 대응되는 주파수 대역인 것으로 사전 설정된 제1 주파수 대역의 소리 성분을 제거한 후, 상기 복수의 합성 음성들을 상기 제1 주파수 대역의 소리 성분이 제거된 각 분할 영상 데이터와 믹싱(Mixing)함으로써, 복수의 믹싱 영상 데이터들을 생성한다.The mixing unit 119 removes a sound component of a first frequency band preset to be a frequency band corresponding to a human voice from each of the plurality of divided image data, and then converts the plurality of synthesized voices to the first frequency. A plurality of mixed image data is generated by mixing the divided image data from which the sound component of the band has been removed.

결합부(120)는 상기 복수의 믹싱 영상 데이터들을 재생 시간 순서에 따라 하나의 영상 데이터로 결합함으로써, 제1 더빙 영상 데이터를 생성한다.The combiner 120 generates first dubbed image data by combining the plurality of mixed image data into one image data according to a reproduction time sequence.

더빙 재생부(121)는 상기 제1 더빙 영상 데이터를 상기 제3 재생 시점부터 재생한다.The dubbing reproducing unit 121 reproduces the first dubbed image data from the third reproducing time point.

이하에서는, 음성 합성부(118), 믹싱부(119), 결합부(120) 및 더빙 재생부(121)의 동작을 예를 들어, 상세히 설명하기로 한다.Hereinafter, the operations of the voice synthesis unit 118 , the mixing unit 119 , the combiner 120 , and the dubbing reproducing unit 121 will be described in detail, for example.

First, assume that the subtitles and translated subtitles have been generated as shown in Table 2, and that, while the first image data is being played on the electronic device 110, a dubbing playback command requesting that the first image data be dubbed into 'English' and played back is received from the user at the playback time of '3 minutes 30 seconds'.

Then, the voice synthesis unit 118 may apply each of 'Translated Subtitle 1, Translated Subtitle 2, Translated Subtitle 3, Translated Subtitle 4, Translated Subtitle 5' as an input to the pre-built voice synthesis model and perform voice synthesis, thereby generating 'Synthesized Voice 1, Synthesized Voice 2, Synthesized Voice 3, Synthesized Voice 4, Synthesized Voice 5' as the plurality of synthesized voices corresponding to those translated subtitles.

The mixing unit 119 may then remove, from each of 'Divided Image Data 1, Divided Image Data 2, Divided Image Data 3, Divided Image Data 4, Divided Image Data 5', the sound component of the first frequency band preset as the frequency band corresponding to the human voice.

If the first frequency band is preset to '50 Hz to 4000 Hz', the mixing unit 119 may remove the sound component of '50 Hz to 4000 Hz' from each of 'Divided Image Data 1, Divided Image Data 2, Divided Image Data 3, Divided Image Data 4, Divided Image Data 5'.
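This disclosure does not specify how the sound component of the first frequency band is removed. As a minimal sketch under that caveat, one possible approach is to zero the corresponding bins of a discrete Fourier transform. The function `remove_band` and the test tones below are hypothetical, and the naive transform is used only to keep the example dependency-free; a practical implementation would more likely rely on an FFT library or a designed band-stop filter.

```python
import cmath
import math


def remove_band(samples, sample_rate, low_hz, high_hz):
    """Zero every DFT bin whose frequency lies in [low_hz, high_hz], then resynthesize."""
    n = len(samples)
    # Forward DFT (naive O(n^2); acceptable for a short illustrative signal).
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # For a real signal, bins k and n - k carry the same physical frequency.
        freq_hz = min(k, n - k) * sample_rate / n
        if low_hz <= freq_hz <= high_hz:
            spectrum[k] = 0
    # Inverse DFT; the imaginary part is only numerical noise for a real input.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


SAMPLE_RATE = 16000
N = 160  # 10 ms of audio, so DFT bins fall on multiples of 100 Hz

# Hypothetical test signal: a 1000 Hz tone inside the band plus a 6000 Hz tone outside it.
voice = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]
music = [math.sin(2 * math.pi * 6000 * t / SAMPLE_RATE) for t in range(N)]
mixed = [v + m for v, m in zip(voice, music)]

filtered = remove_band(mixed, SAMPLE_RATE, 50, 4000)
residual = max(abs(f - m) for f, m in zip(filtered, music))
print(f"largest deviation from the 6000 Hz tone: {residual:.2e}")
```

Removing a fixed band in this way also removes any non-voice sound that happens to fall between 50 Hz and 4000 Hz, which is a limitation of the band-based approach itself rather than of this sketch.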

Thereafter, the mixing unit 119 may generate 'Mixed Image Data 1, Mixed Image Data 2, Mixed Image Data 3, Mixed Image Data 4, Mixed Image Data 5' as the plurality of mixed image data by mixing 'Synthesized Voice 1, Synthesized Voice 2, Synthesized Voice 3, Synthesized Voice 4, Synthesized Voice 5' with the respective divided image data from which the sound component of the first frequency band, '50 Hz to 4000 Hz', has been removed.

Specifically, the mixing unit 119 may generate 'Mixed Image Data 1' by mixing 'Synthesized Voice 1' with 'Divided Image Data 1' from which the human-voice sound component has been removed. In the same way, it may generate 'Mixed Image Data 2' from 'Synthesized Voice 2' and 'Divided Image Data 2', 'Mixed Image Data 3' from 'Synthesized Voice 3' and 'Divided Image Data 3', 'Mixed Image Data 4' from 'Synthesized Voice 4' and 'Divided Image Data 4', and 'Mixed Image Data 5' from 'Synthesized Voice 5' and 'Divided Image Data 5', in each case using the divided image data from which the human-voice sound component has been removed.

Once the mixing unit 119 has generated 'Mixed Image Data 1, Mixed Image Data 2, Mixed Image Data 3, Mixed Image Data 4, Mixed Image Data 5', the combining unit 120 may generate the first dubbed image data by combining them into a single piece of image data in order of playback time.

The dubbing playback unit 121 may then play the first dubbed image data from the playback time of '3 minutes 30 seconds'.

As a result, the user can watch the first dubbed image data, dubbed in 'English', from '3 minutes 30 seconds' onward.
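The mixing, combining, and playback steps of this example can be sketched as follows. This is an illustrative outline of the audio track only, under several assumptions: `DividedAudio`, `mix`, `build_dubbed_track`, `play_from`, and `fake_tts` are hypothetical names, the synthesized voice is simply padded or cut to the length of its divided image data, and the voice synthesis model is replaced by a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DividedAudio:
    start_s: int          # playback time of the divided image data, in seconds
    samples: List[float]  # its audio with the first frequency band already removed


def mix(background: List[float], voice: List[float]) -> List[float]:
    """Sample-wise sum; the synthesized voice is padded or cut to the segment length."""
    fitted = (voice + [0.0] * len(background))[: len(background)]
    return [b + v for b, v in zip(background, fitted)]


def build_dubbed_track(
    parts: List[DividedAudio],
    translated_subtitles: List[str],
    synthesize: Callable[[str], List[float]],
) -> List[float]:
    """Mix each synthesized voice into its segment, then join segments in playback order."""
    # Assumes parts[i] and translated_subtitles[i] belong to the same playback time.
    mixed = [
        DividedAudio(part.start_s, mix(part.samples, synthesize(text)))
        for part, text in zip(parts, translated_subtitles)
    ]
    track: List[float] = []
    for part in sorted(mixed, key=lambda p: p.start_s):
        track.extend(part.samples)
    return track


def play_from(track: List[float], start_s: int, sample_rate: int) -> List[float]:
    """Return the portion of the dubbed track from the requested playback time onward."""
    return track[start_s * sample_rate:]


SAMPLE_RATE = 2  # deliberately tiny so the example stays readable
parts = [DividedAudio(60 * i, [float(i)] * (60 * SAMPLE_RATE)) for i in range(5)]
subtitles = [f"Translated Subtitle {i + 1}" for i in range(5)]
fake_tts = lambda text: [0.5] * 10  # placeholder for the voice synthesis model

track = build_dubbed_track(parts, subtitles, fake_tts)
remaining = play_from(track, 210, SAMPLE_RATE)  # 3 minutes 30 seconds
print(len(track), len(remaining))
```

Fitting the synthesized voice to the segment length keeps every segment at its original duration, so the playback times in the subtitle table remain valid for the combined track.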

According to an embodiment of the present invention, the electronic device 110 may further include a confirmation unit 122, an extraction unit 123, a search result display unit 124, and a playback unit 125.

When, while the first image data is being played, the user enters a first keyword and a search command instructing a search for playback times in the first image data that match the first keyword is received from the user, the confirmation unit 122 refers to the subtitle table and checks whether any of the plurality of subtitles contains the first keyword.

When it is confirmed that at least one first subtitle containing the first keyword exists among the plurality of subtitles, the extraction unit 123 extracts, from the subtitle table, information on at least one fourth playback time corresponding to the at least one first subtitle.

The search result display unit 124 designates the information on the at least one fourth playback time as the search result matching the first keyword and displays it at a preset third point in the screen area where the image is displayed.

When, while the information on the at least one fourth playback time is displayed at the third point, a selective playback command for a fifth playback time, which is one of the at least one fourth playback time, is received from the user, the playback unit 125 plays the first image data from the fifth playback time.

Hereinafter, the operations of the confirmation unit 122, the extraction unit 123, the search result display unit 124, and the playback unit 125 are described in detail by way of example.

First, assume that, while the first image data is being played, the user enters a first keyword and a search command instructing a search for playback times in the first image data that match the first keyword is received by the electronic device 110.

Then, the confirmation unit 122 may refer to the subtitle table of Table 1 and check whether any of the plurality of subtitles 'Subtitle 1, Subtitle 2, Subtitle 3, Subtitle 4, Subtitle 5' contains the first keyword.

If the confirmation unit 122 confirms that 'Subtitle 1, Subtitle 3, Subtitle 5' are the subtitles containing the first keyword, the extraction unit 123 may extract, from the subtitle table of Table 1, information on the playback times corresponding to those subtitles, namely 'playback time 1 (0 seconds), playback time 3 (2 minutes), playback time 5 (4 minutes)'.

Thereafter, the search result display unit 124 may designate the information on 'playback time 1 (0 seconds), playback time 3 (2 minutes), playback time 5 (4 minutes)' as the search result matching the first keyword and display it at a preset third point in the screen area where the image is displayed.

When, while this information is displayed at the third point by the search result display unit 124, a selective playback command for one of those playback times, 'playback time 3 (2 minutes)', is received by the electronic device 110 from the user, the playback unit 125 may play the first image data from 'playback time 3 (2 minutes)'.

As a result, the user can watch the first image data from 'playback time 3 (2 minutes)' onward.
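For illustration, the keyword search over the subtitle table can be sketched as a simple substring match. The subtitle texts, the keyword `"weather"`, and the function names `find_playback_times` and `select_playback_time` are hypothetical; this disclosure does not state how matching is performed, so case-sensitive substring matching is an assumption.

```python
from typing import List, Tuple

# Hypothetical subtitle table: (playback time in seconds, subtitle text).
SUBTITLE_TABLE: List[Tuple[int, str]] = [
    (0, "Welcome to the weather report"),
    (60, "First, the headlines"),
    (120, "The weather will be clear tomorrow"),
    (180, "Now over to sports"),
    (240, "That is all for the weather"),
]


def find_playback_times(table: List[Tuple[int, str]], keyword: str) -> List[int]:
    """Return the playback times of every subtitle that contains the keyword."""
    return [time_s for time_s, subtitle in table if keyword in subtitle]


def select_playback_time(results: List[int], chosen_s: int) -> int:
    """Accept a selection only if it is one of the displayed search results."""
    if chosen_s not in results:
        raise ValueError(f"{chosen_s} s is not among the search results {results}")
    return chosen_s


results = find_playback_times(SUBTITLE_TABLE, "weather")
start_s = select_playback_time(results, 120)
print(results, start_s)
```

Here the matching rows correspond to playback times 1, 3, and 5 of the example, and selecting 120 seconds corresponds to resuming playback from 'playback time 3 (2 minutes)'.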

FIG. 4 is a flowchart illustrating a method of operating an electronic device that outputs subtitles on a screen on which an image is played based on voice recognition, according to an embodiment of the present invention.

In step S410, when a playback command for first image data is received from a user, the first image data is played, a plurality of divided image data are generated by dividing the first image data at a preset first playback time interval, and information on the playback time of each of the plurality of divided image data within the total playback time of the first image data is identified.

In step S420, a plurality of subtitles corresponding to the respective divided image data are generated by sequentially applying the plurality of divided image data as inputs to a pre-built voice recognition model and performing voice recognition.

In step S430, a subtitle table is generated in which the plurality of subtitles and the information on the playback time of the divided image data corresponding to each subtitle are recorded in correspondence with each other.
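Steps S410 to S430 can be sketched as follows. This is a minimal outline under stated assumptions: the function names `divide` and `build_subtitle_table`, the 5-minute total playback time, the 60-second interval, and the placeholder `fake_recognizer` are all hypothetical, and the voice recognition model itself is not implemented.

```python
from typing import Callable, Dict, List, Tuple


def divide(total_s: int, interval_s: int) -> List[Tuple[int, int]]:
    """S410: return the (start, end) playback times of each divided image data."""
    return [(s, min(s + interval_s, total_s)) for s in range(0, total_s, interval_s)]


def build_subtitle_table(
    total_s: int,
    interval_s: int,
    recognize: Callable[[int, int], str],
) -> List[Dict[str, object]]:
    """S420 and S430: recognize each segment in order and record it with its playback time."""
    table = []
    for start_s, end_s in divide(total_s, interval_s):
        table.append({"playback_time_s": start_s, "subtitle": recognize(start_s, end_s)})
    return table


# Placeholder for the pre-built voice recognition model (assumption).
fake_recognizer = lambda start_s, end_s: f"Subtitle {start_s // 60 + 1}"

table = build_subtitle_table(300, 60, fake_recognizer)
for row in table:
    print(row)
```

The last segment is clipped to the total playback time, so a length that is not a multiple of the interval still yields a final, shorter segment.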

In step S440, when a subtitle display command for the first image data is received from the user at a first playback time while the first image data is being played, then, from the first playback time onward, each time a playback time recorded in the subtitle table is reached, the subtitle recorded for that playback time is displayed by floating it at a preset first point in the screen area where the image is displayed.

According to an embodiment of the present invention, in step S420, all sound components in frequency bands other than a first frequency band, which is preset as the frequency band corresponding to the human voice, may be removed from each of the plurality of divided image data, and the plurality of subtitles may then be generated by sequentially applying each divided image data from which those sound components have been removed as an input to the voice recognition model and performing voice recognition.

In addition, according to an embodiment of the present invention, the method of operating the electronic device may further include: when an additional display command instructing that subtitles in a first foreign language be additionally displayed is received from the user at a second playback time while the first image data is being played, generating a plurality of translated subtitles by referring to the subtitle table and translating the plurality of subtitles into the first foreign language through a pre-built translation engine; when the plurality of translated subtitles are generated, additionally recording, for each of the plurality of subtitles in the subtitle table, the translated subtitle corresponding to that subtitle; and, from the second playback time onward, each time a playback time recorded in the subtitle table is reached, displaying the subtitle and the translated subtitle recorded for that playback time by floating them at the first point and a preset second point, respectively, in the screen area where the image is displayed.
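The step of additionally recording translated subtitles in the subtitle table can be illustrated as follows. The function `record_translations`, the dictionary layout of the table, the language code `"en"`, and the placeholder `fake_translate` are hypothetical; the pre-built translation engine is not modeled.

```python
from typing import Callable, Dict, List


def record_translations(
    table: List[Dict[str, object]],
    translate: Callable[[str, str], str],
    target_language: str,
) -> List[Dict[str, object]]:
    """Add, to every row of the subtitle table, the translation of that row's subtitle."""
    for row in table:
        row["translated"] = translate(str(row["subtitle"]), target_language)
    return table


# Hypothetical subtitle table with one row per divided image data.
table = [{"playback_time_s": 60 * i, "subtitle": f"Subtitle {i + 1}"} for i in range(5)]

# Placeholder for the pre-built translation engine (assumption).
fake_translate = lambda text, language: f"[{language}] {text}"

record_translations(table, fake_translate, "en")
for row in table:
    print(row)
```

Keeping the translation in the same row as the original subtitle means a single lookup by playback time yields both texts for display.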

In this case, according to an embodiment of the present invention, the method of operating the electronic device may further include: when a dubbing playback command requesting that the first image data be dubbed into the first foreign language and played back is received from the user at a third playback time while the first image data is being played, generating a plurality of synthesized voices corresponding to the plurality of translated subtitles by applying each of the translated subtitles as an input to a pre-built voice synthesis model and performing voice synthesis; removing, from each of the plurality of divided image data, the sound component of a first frequency band preset as the frequency band corresponding to the human voice, and then generating a plurality of mixed image data by mixing each synthesized voice with the corresponding divided image data from which the sound component of the first frequency band has been removed; generating first dubbed image data by combining the plurality of mixed image data into a single piece of image data in order of playback time; and playing the first dubbed image data from the third playback time.

In addition, according to an embodiment of the present invention, the method of operating the electronic device may further include: when, while the first image data is being played, the user enters a first keyword and a search command instructing a search for playback times in the first image data that match the first keyword is received from the user, checking, by referring to the subtitle table, whether any of the plurality of subtitles contains the first keyword; when it is confirmed that at least one first subtitle containing the first keyword exists among the plurality of subtitles, extracting, from the subtitle table, information on at least one fourth playback time corresponding to the at least one first subtitle; designating the information on the at least one fourth playback time as the search result matching the first keyword and displaying it at a preset third point in the screen area where the image is displayed; and, when a selective playback command for a fifth playback time, which is one of the at least one fourth playback time, is received from the user while the information on the at least one fourth playback time is displayed at the third point, playing the first image data from the fifth playback time.

The method of operating an electronic device according to an embodiment of the present invention has been described above with reference to FIG. 4. Because this method may correspond to the configuration and operation of the electronic device 110 described with reference to FIGS. 1 to 3, a more detailed description is omitted.

The method of operating an electronic device that outputs subtitles on a screen on which an image is played based on voice recognition, according to an embodiment of the present invention, may be implemented as a computer program stored in a storage medium so as to be executed in combination with a computer.

In addition, the method may be implemented in the form of computer program instructions to be executed in combination with a computer and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the present invention has been described above with reference to specific details, such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the invention pertains can make various modifications and variations based on this description.

Accordingly, the spirit of the present invention should not be confined to the described embodiments; not only the claims set out below but also everything equal or equivalent to those claims falls within the scope of the spirit of the present invention.

110: Electronic device for outputting subtitles on a screen on which an image is played based on voice recognition
111: division unit 112: subtitle generation unit
113: subtitle table generation unit 114: subtitle display unit
115: translated subtitle generation unit 116: translated subtitle recording processing unit
117: translated subtitle addition display unit 118: voice synthesis unit
119: mixing unit 120: combining unit
121: dubbing playback unit 122: confirmation unit
123: extraction unit 124: search result display unit
125: playback unit

Claims

1. An electronic device for outputting subtitles on a screen on which an image is played based on voice recognition, the electronic device comprising:
a division unit that, when a playback command for first image data is received from a user, plays the first image data, generates a plurality of divided image data by dividing the first image data at a preset first playback time interval, and identifies information on the playback time of each of the plurality of divided image data within the total playback time of the first image data;
a subtitle generation unit that generates a plurality of subtitles corresponding to the respective divided image data by sequentially applying the plurality of divided image data as inputs to a pre-built voice recognition model and performing voice recognition;
a subtitle table generation unit that generates a subtitle table in which the plurality of subtitles and the information on the playback time of the divided image data corresponding to each of the plurality of subtitles are recorded in correspondence with each other; and
a subtitle display unit that, when a subtitle display command for the first image data is received from the user at a first playback time while the first image data is being played, displays, each time a playback time recorded in the subtitle table is reached from the first playback time onward, the subtitle recorded for that playback time by floating it at a preset first point in the screen area where the image is displayed.

2. The electronic device of claim 1, wherein the subtitle generation unit removes, from each of the plurality of divided image data, all sound components in frequency bands other than a first frequency band preset as the frequency band corresponding to the human voice, and then generates the plurality of subtitles by sequentially applying each divided image data from which those sound components have been removed as an input to the voice recognition model and performing voice recognition.

3. The electronic device of claim 1, further comprising:
a translated subtitle generation unit that, when an additional display command instructing that subtitles in a first foreign language be additionally displayed is received from the user at a second playback time while the first image data is being played, refers to the subtitle table and generates a plurality of translated subtitles by translating the plurality of subtitles into the first foreign language through a pre-built translation engine;
a translated subtitle recording processing unit that, when the plurality of translated subtitles are generated, additionally records, for each of the plurality of subtitles in the subtitle table, the translated subtitle corresponding to that subtitle; and
a translated subtitle addition display unit that, each time a playback time recorded in the subtitle table is reached from the second playback time onward, displays the subtitle and the translated subtitle recorded for that playback time by floating them at the first point and a preset second point, respectively, in the screen area where the image is displayed.

4. The electronic device of claim 3, further comprising:
a voice synthesis unit that, when a dubbing playback command requesting that the first image data be dubbed into the first foreign language and played back is received from the user at a third playback time while the first image data is being played, generates a plurality of synthesized voices corresponding to the plurality of translated subtitles by applying each of the plurality of translated subtitles as an input to a pre-built voice synthesis model and performing voice synthesis;
a mixing unit that removes, from each of the plurality of divided image data, the sound component of a first frequency band preset as the frequency band corresponding to the human voice, and then generates a plurality of mixed image data by mixing each of the plurality of synthesized voices with the corresponding divided image data from which the sound component of the first frequency band has been removed;
a combining unit that generates first dubbed image data by combining the plurality of mixed image data into a single piece of image data in order of playback time; and
a dubbing playback unit that plays the first dubbed image data from the third playback time.

5. The electronic device of claim 1, further comprising:
a confirmation unit that, when the user enters a first keyword while the first image data is being played and a search command instructing a search for playback times in the first image data that match the first keyword is received from the user, refers to the subtitle table and checks whether any of the plurality of subtitles contains the first keyword;
an extraction unit that, when it is confirmed that at least one first subtitle containing the first keyword exists among the plurality of subtitles, extracts, from the subtitle table, information on at least one fourth playback time corresponding to the at least one first subtitle;
a search result display unit that designates the information on the at least one fourth playback time as the search result matching the first keyword and displays it at a preset third point in the screen area where the image is displayed; and
a playback unit that, when a selective playback command for a fifth playback time, which is one of the at least one fourth playback time, is received from the user while the information on the at least one fourth playback time is displayed at the third point, plays the first image data from the fifth playback time.

A method of operating an electronic device for outputting subtitles on a screen on which an image is played based on voice recognition, the method comprising:
when a playback command for first image data is received from a user, playing the first image data, generating a plurality of divided image data by dividing the first image data at a preset first playback time interval, and then checking information on the playback time of each of the plurality of divided image data within the total playback time of the first image data;
generating a plurality of subtitles corresponding to each of the plurality of divided image data by sequentially applying the plurality of divided image data as input to a pre-built voice recognition model to perform voice recognition;
generating a subtitle table in which the plurality of subtitles and the information on the playback time of the divided image data corresponding to each of the plurality of subtitles are recorded in correspondence with each other; and
when a subtitle display command for the first image data is received from the user at a first playback time while the first image data is being played, displaying, from the first playback time onward and each time a playback time recorded in the subtitle table arrives, the subtitle corresponding to that playback time by floating it at a preset first point in the screen area where the image is displayed.
The method of operating an electronic device, comprising the above steps.
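
The steps of this claim can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation disclosed in the patent: the function names, the representation of the subtitle table as a list of (playback time, subtitle) pairs, and the `recognize` callable standing in for the pre-built voice recognition model are all hypothetical.

```python
from bisect import bisect_left
from typing import Callable, List, Tuple

# Hypothetical representation: one (playback time in seconds, subtitle) pair per row.
SubtitleTable = List[Tuple[float, str]]


def segment_start_times(total_seconds: float, interval: float) -> List[float]:
    """Playback time of each divided segment within the total playback time."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    starts = []
    index = 0
    while index * interval < total_seconds:
        starts.append(index * interval)
        index += 1
    return starts


def build_subtitle_table(
    starts: List[float], recognize: Callable[[int], str]
) -> SubtitleTable:
    """Record each segment's playback time together with its recognized subtitle.

    `recognize` is a stand-in for the voice recognition model; here it
    receives only the segment index.
    """
    return [(start, recognize(index)) for index, start in enumerate(starts)]


def subtitles_from(table: SubtitleTable, first_playback_time: float) -> SubtitleTable:
    """Rows whose playback time arrives at or after the display command."""
    times = [time for time, _ in table]
    return table[bisect_left(times, first_playback_time):]


# Usage with a stub recognizer (assumption: 10 s video, 3 s interval).
starts = segment_start_times(10, 3)
table = build_subtitle_table(starts, lambda i: f"line {i}")
pending = subtitles_from(table, 4)
```

A player would then float each row of `pending` at the first point on screen when playback reaches that row's playback time.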

7. The method of claim 6,
wherein the step of generating the plurality of subtitles comprises
removing, from each of the plurality of divided image data, all sound components in frequency bands other than a first frequency band preset as the frequency band corresponding to a human voice, and then generating the plurality of subtitles by sequentially applying each divided image data from which those sound components have been removed as input to the voice recognition model to perform voice recognition.
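
The pre-filtering in this claim can be sketched as follows. The claim does not say how the out-of-band components are removed or what the first frequency band is, so this sketch makes two assumptions: it zeroes frequency bins of a naive discrete Fourier transform, and it uses 300 to 3400 Hz (the conventional telephony voice band) as a placeholder for the preset band. The function names are hypothetical.

```python
import cmath
import math
from typing import List


def dft(samples: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, adequate for a short demo."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def keep_band(
    samples: List[float], sample_rate: float, low_hz: float, high_hz: float
) -> List[float]:
    """Remove every frequency component outside [low_hz, high_hz]."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bin k and its mirror bin n - k carry the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return idft(spectrum)


# Usage: a 1000 Hz tone (inside the assumed band) plus a 100 Hz hum (outside it).
RATE, N = 8000, 80
tone = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(N)]
hum = [math.sin(2 * math.pi * 100 * t / RATE) for t in range(N)]
filtered = keep_band([a + b for a, b in zip(tone, hum)], RATE, 300.0, 3400.0)
```

In this demo the hum is removed and the tone survives; the filtered samples would then be passed to the recognition model in place of the raw audio.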

8. The method of claim 6,
when an additional display command instructing additional display of subtitles in a first foreign language is received from the user at a second playback time while the first image data is being played, generating a plurality of translated subtitles by referring to the subtitle table and translating the plurality of subtitles into the first foreign language through a pre-built translation engine;
when the plurality of translated subtitles are generated, additionally recording on the subtitle table, for each of the plurality of subtitles, the translated subtitle corresponding to that subtitle; and
displaying, from the second playback time onward and each time a playback time recorded in the subtitle table arrives, the subtitle and the translated subtitle corresponding to that playback time by floating them at the first point and at a preset second point, respectively, in the screen area where the image is displayed.
The method of operating an electronic device, further comprising the above steps.
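
A minimal sketch of this claim, assuming the subtitle table is held as a list of dictionaries and that the translation engine is available as a plain callable. Both the row layout and the function names are hypothetical, and `str.upper` merely stands in for a real translator.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical row layout: {"time": seconds, "subtitle": text}
Row = Dict[str, object]


def add_translations(table: List[Row], translate: Callable[[str], str]) -> List[Row]:
    """Record a translated subtitle next to each original subtitle in the table."""
    for row in table:
        row["translated"] = translate(str(row["subtitle"]))
    return table


def overlays_from(
    table: List[Row], second_playback_time: float
) -> List[Tuple[float, str, str]]:
    """(time, text for the first point, text for the second point) for every
    row whose playback time arrives at or after the additional display command."""
    return [
        (float(row["time"]), str(row["subtitle"]), str(row["translated"]))
        for row in table
        if float(row["time"]) >= second_playback_time
    ]


# Usage with a stub translation engine.
table = [{"time": 0.0, "subtitle": "hello"}, {"time": 5.0, "subtitle": "bye"}]
add_translations(table, str.upper)
overlays = overlays_from(table, 3.0)
```

Keeping the translation in the same row as the original subtitle means a single lookup per playback time yields both strings to display.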

9. The method of claim 8,
when a dubbing playback command instructing that the first image data be dubbed in the first foreign language and played is received from the user at a third playback time while the first image data is being played, generating a plurality of synthesized voices corresponding to the plurality of translated subtitles by applying each of the plurality of translated subtitles as input to a pre-built voice synthesis model to perform voice synthesis;
for each of the plurality of divided image data, removing the sound component of a first frequency band preset as the frequency band corresponding to a human voice, and then generating a plurality of mixed image data by mixing each of the plurality of synthesized voices into the corresponding divided image data from which the sound component of the first frequency band has been removed;
generating first dubbed image data by combining the plurality of mixed image data into one image data in order of playback time; and
playing the first dubbed image data from the third playback time.
The method of operating an electronic device, further comprising the above steps.
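
The dubbing flow can be sketched on the audio track alone. Everything below is an assumption made for illustration: audio is a list of float samples, the voice synthesis model and the voice-band removal are injected as callables, mixing is a sample-wise sum, and the synthesized voice is padded or truncated to the segment length so that segment timing is preserved. The claim itself does not specify any of these details.

```python
from typing import Callable, Dict, List, Tuple

Samples = List[float]


def fit_length(voice: Samples, length: int) -> Samples:
    """Pad with silence or truncate so the voice matches the segment length."""
    return (voice + [0.0] * length)[:length]


def mix(background: Samples, voice: Samples) -> Samples:
    """Sample-wise sum of two equally long tracks."""
    return [b + v for b, v in zip(background, voice)]


def dub(
    segments: List[Tuple[float, Samples]],
    translated: Dict[float, str],
    synthesize: Callable[[str], Samples],
    remove_voice_band: Callable[[Samples], Samples],
) -> Samples:
    """Build one dubbed track from (playback time, samples) segments."""
    mixed = []
    for start, samples in segments:
        background = remove_voice_band(samples)          # drop the original voice
        voice = fit_length(synthesize(translated[start]), len(background))
        mixed.append((start, mix(background, voice)))
    mixed.sort(key=lambda item: item[0])                 # order by playback time
    dubbed: Samples = []
    for _, samples in mixed:
        dubbed.extend(samples)
    return dubbed


# Usage with stub models: halving stands in for voice-band removal,
# and the synthesizer emits one unit sample per character.
segments = [(3.0, [2.0, 2.0]), (0.0, [4.0, 4.0, 4.0])]
translated = {0.0: "hi", 3.0: "bye"}
dubbed = dub(segments, translated, lambda text: [1.0] * len(text),
             lambda s: [x / 2 for x in s])
```

Playback from the third playback time would then start at the matching offset of the dubbed track.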

10. The method of claim 6,
when, while the first image data is being played, a first keyword is input by the user and a search command instructing a search for a playback time matching the first keyword in the first image data is received, checking whether a subtitle including the first keyword exists among the plurality of subtitles by referring to the subtitle table;
when it is confirmed that at least one first subtitle including the first keyword exists among the plurality of subtitles, extracting, from the subtitle table, information on at least one fourth playback time corresponding to the at least one first subtitle;
designating the information on the at least one fourth playback time as a search result matching the first keyword and displaying the information at a preset third point in a screen area where an image is displayed; and
when a selection playback command for a fifth playback time, which is any one of the at least one fourth playback time, is received from the user while the information on the at least one fourth playback time is displayed at the third point, playing the first image data from the fifth playback time.
The method of operating an electronic device, further comprising the above steps.
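
The keyword search over the subtitle table can be sketched as below. This assumes a table of (playback time in seconds, subtitle) pairs and a case-sensitive substring match; the claim does not define the matching rule, and the function names and the mm:ss display format are hypothetical.

```python
from typing import List, Tuple

SubtitleTable = List[Tuple[int, str]]


def search_playback_times(table: SubtitleTable, keyword: str) -> List[int]:
    """Playback times of every subtitle that contains the keyword."""
    return [time for time, text in table if keyword in text]


def format_time(seconds: int) -> str:
    """Render a playback time as mm:ss for display at the third point."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def select_playback_time(results: List[int], choice: int) -> int:
    """Playback time the user picked from the displayed results."""
    if not 0 <= choice < len(results):
        raise IndexError("selection is not among the displayed results")
    return results[choice]


# Usage: search, show the hits, then resume playback from the chosen one.
table = [(0, "hello world"), (5, "goodbye"), (75, "world news")]
hits = search_playback_times(table, "world")
labels = [format_time(t) for t in hits]
resume_at = select_playback_time(hits, 1)
```

An empty result list corresponds to the case where no subtitle contains the keyword, so nothing is displayed at the third point.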

A computer-readable recording medium recording a computer program for executing the method of any one of claims 6 to 10 through combination with a computer.

A computer program stored in a storage medium for executing the method of any one of claims 6 to 10 through combination with a computer.