KR102523814B1

KR102523814B1 - Electronic apparatus that outputs subtitle on screen where video is played based on voice recognition and operating method thereof

Info

Publication number: KR102523814B1
Application number: KR1020210049114A
Authority: KR
Inventors: 원찬식; 최보람; 강희석; 손은채
Original assignee: 주식회사 한글과컴퓨터; 주식회사 한컴위드; (주)엠디에스인텔리전스
Priority date: 2021-04-15
Filing date: 2021-04-15
Publication date: 2023-05-15
Also published as: KR20220142723A

Abstract

Disclosed are an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition, and an operating method thereof. By providing such a device and method, the present invention can help a user conveniently view subtitles from the playback time point the user wishes to watch while viewing a video.

Description

Electronic device for outputting subtitles on the screen where video is played based on voice recognition and its operating method

The present invention relates to an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition, and to an operating method thereof.

Recently, as the COVID-19 era has dragged on, the proportion of people watching video content indoors has been increasing. Accordingly, real-time TV viewing has increased, and the number of VOD views has also increased.

As the proportion of people watching video content indoors increases, the need to support more efficient functions for viewers of video content is also growing.

In this regard, existing technology does not provide a function that outputs subtitles, based on voice recognition, on the screen on which video content is played while a viewer is watching it, so it has not been easy for viewers to see subtitles from the playback time point they wish to watch.

If a technology were introduced that outputs subtitles, based on voice recognition, on the screen on which video content is played, so that viewers can see subtitles from the playback time point they wish to watch, viewers would find watching video content more convenient.

Therefore, research is needed on a subtitle service technology capable of outputting subtitles on a screen on which a video is played based on voice recognition.

By providing an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition, and an operating method thereof, the present invention aims to help a user conveniently view subtitles from the playback time point the user wishes to watch while viewing a video.

An electronic device that outputs subtitles on a screen on which a video is played based on voice recognition, according to an embodiment of the present invention, includes: a division unit that, when a playback command for first video data is received from a user, plays the first video data while dividing the first video data at a preset first playback time interval to generate a plurality of divided video data, and identifies information on the playback time point of each of the plurality of divided video data within the total playback time of the first video data; a subtitle generator that generates a plurality of subtitles corresponding to each of the plurality of divided video data by sequentially applying the plurality of divided video data as inputs to a pre-built voice recognition model to perform voice recognition; a subtitle table generator that generates a subtitle table in which the plurality of subtitles and the information on the playback time point of the divided video data corresponding to each of the plurality of subtitles are recorded in correspondence with each other; and a subtitle display unit that, when a subtitle display command for the first video data is received from the user at a first playback time point while the first video data is being played, floats and displays, each time a playback time point recorded in the subtitle table is reached from the first playback time point onward, the subtitle recorded for that playback time point at a preset first point in the screen area where the video is displayed.

In addition, an operating method of an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition, according to an embodiment of the present invention, includes: when a playback command for first video data is received from a user, playing the first video data while dividing the first video data at a preset first playback time interval to generate a plurality of divided video data, and identifying information on the playback time point of each of the plurality of divided video data within the total playback time of the first video data; generating a plurality of subtitles corresponding to each of the plurality of divided video data by sequentially applying the plurality of divided video data as inputs to a pre-built voice recognition model to perform voice recognition; generating a subtitle table in which the plurality of subtitles and the information on the playback time point of the divided video data corresponding to each of the plurality of subtitles are recorded in correspondence with each other; and, when a subtitle display command for the first video data is received from the user at a first playback time point while the first video data is being played, floating and displaying, each time a playback time point recorded in the subtitle table is reached from the first playback time point onward, the subtitle recorded for that playback time point at a preset first point in the screen area where the video is displayed.

By providing an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition, and an operating method thereof, the present invention can help a user conveniently view subtitles from the playback time point the user wishes to watch while viewing a video.

FIG. 1 is a diagram showing the structure of an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams for explaining the operation of an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating an operating method of an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition according to an embodiment of the present invention.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. This description is not intended to limit the present invention to specific embodiments, and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, similar reference numerals are used for similar components, and unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

In this document, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, in various embodiments of the present invention, each component, functional block, or means may be composed of one or more sub-components, and the electrical, electronic, and mechanical functions performed by each component may be implemented with various known elements or mechanical elements such as electronic circuits, integrated circuits, and ASICs (Application Specific Integrated Circuits). They may each be implemented separately, or two or more may be integrated into one.

Meanwhile, the blocks of the accompanying block diagrams and the steps of the flowcharts may be interpreted as computer program instructions that are loaded into the processor or memory of equipment capable of data processing, such as a general-purpose computer, a special-purpose computer, a portable notebook computer, or a network computer, and that perform designated functions. Because these computer program instructions may be stored in a memory provided in a computer device or in a computer-readable memory, the functions described in the blocks of a block diagram or the steps of a flowchart may also be produced as an article of manufacture containing instruction means for performing them. Further, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the blocks or steps may be executed out of the stated order. For example, two blocks or steps shown in succession may be performed substantially simultaneously or in reverse order, and in some cases some blocks or steps may be omitted.

FIG. 1 is a diagram showing the structure of an electronic device that outputs subtitles on a screen on which a video is played based on voice recognition according to an embodiment of the present invention.

Referring to FIG. 1, the electronic device 110 according to an embodiment of the present invention includes a division unit 111, a subtitle generator 112, a subtitle table generator 113, and a subtitle display unit 114.

When a playback command for first video data is received from a user, the division unit 111 plays the first video data while dividing the first video data at a preset first playback time interval to generate a plurality of divided video data, and identifies information on the playback time point of each of the plurality of divided video data within the total playback time of the first video data.

The subtitle generator 112 generates a plurality of subtitles corresponding to each of the plurality of divided video data by sequentially applying the plurality of divided video data as inputs to a pre-built voice recognition model to perform voice recognition.

The subtitle table generator 113 generates a subtitle table in which the plurality of subtitles and the information on the playback time point of the divided video data corresponding to each of the plurality of subtitles are recorded in correspondence with each other.

When a subtitle display command for the first video data is received from the user at a first playback time point while the first video data is being played, the subtitle display unit 114 floats and displays, each time a playback time point recorded in the subtitle table is reached from the first playback time point onward, the subtitle recorded for that playback time point at a preset first point in the screen area where the video is displayed.

Hereinafter, the operations of the division unit 111, the subtitle generator 112, the subtitle table generator 113, and the subtitle display unit 114 are described in detail by way of example with reference to FIGS. 1 and 2.

First, when a playback command for the first video data is received by the electronic device 110 from the user, the division unit 111 may play the first video data while dividing the first video data at a preset first playback time interval, thereby generating a plurality of divided video data.

Suppose the preset first playback time interval is '1 minute' and the total playback time of the first video data is '5 minutes'. The division unit 111 may then divide the first video data at '1 minute' intervals, generating a plurality of divided video data, each with a playback time of '1 minute', as 'divided video data 1, divided video data 2, divided video data 3, divided video data 4, divided video data 5'.

Thereafter, the division unit 111 may identify, within the total playback time of '5 minutes' of the first video data, the information on the playback time points of 'divided video data 1, divided video data 2, divided video data 3, divided video data 4, divided video data 5', namely 'playback time point 1 (0 seconds), playback time point 2 (1 minute), playback time point 3 (2 minutes), playback time point 4 (3 minutes), playback time point 5 (4 minutes)'.
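The division step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Segment` type and the `split_by_interval` function are hypothetical names, and letting the last segment be shorter when the total time is not a multiple of the interval is an assumption the text does not address.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    index: int          # 1-based number of the divided video data
    start_seconds: int  # playback time point within the whole video
    end_seconds: int    # exclusive end of the segment


def split_by_interval(total_seconds: int, interval_seconds: int) -> list[Segment]:
    """Split a playback duration into consecutive fixed-length segments."""
    if total_seconds <= 0 or interval_seconds <= 0:
        raise ValueError("durations must be positive")
    segments = []
    for i, start in enumerate(range(0, total_seconds, interval_seconds), start=1):
        # Assumption: a trailing remainder becomes a shorter final segment.
        end = min(start + interval_seconds, total_seconds)
        segments.append(Segment(i, start, end))
    return segments


# The example from the text: a 5-minute video divided at 1-minute intervals.
segments = split_by_interval(total_seconds=300, interval_seconds=60)
start_points = [s.start_seconds for s in segments]
```

Here `start_points` holds the five playback time points of the example, expressed in seconds.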

Then, the subtitle generator 112 may sequentially apply 'divided video data 1, divided video data 2, divided video data 3, divided video data 4, divided video data 5' as inputs to the pre-built voice recognition model to perform voice recognition, thereby generating the plurality of subtitles corresponding to each of them as 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5'.

Thereafter, the subtitle table generator 113 may generate a subtitle table, as shown in Table 1 below, in which 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5' are recorded in correspondence with the information on the playback time points of the corresponding 'divided video data 1, divided video data 2, divided video data 3, divided video data 4, divided video data 5', namely 'playback time point 1 (0 seconds), playback time point 2 (1 minute), playback time point 3 (2 minutes), playback time point 4 (3 minutes), playback time point 5 (4 minutes)'.

[Table 1]

| Subtitles | Playback time point of divided video data |
|---|---|
| Subtitle 1 | Playback time point 1 of divided video data 1 (0 seconds) |
| Subtitle 2 | Playback time point 2 of divided video data 2 (1 minute) |
| Subtitle 3 | Playback time point 3 of divided video data 3 (2 minutes) |
| Subtitle 4 | Playback time point 4 of divided video data 4 (3 minutes) |
| Subtitle 5 | Playback time point 5 of divided video data 5 (4 minutes) |
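As a rough sketch of how such a subtitle table might be built, the following Python applies a recognizer to each divided video data in order and records the result next to its playback time point. The names `build_subtitle_table` and `fake_recognizer` are hypothetical, and the recognizer is a stand-in that simply decodes bytes; the text does not specify the voice recognition model or its interface.

```python
from typing import Callable


def build_subtitle_table(
    segments: list[tuple[int, bytes]],
    recognize: Callable[[bytes], str],
) -> list[dict]:
    """Run recognition on each segment in order and record (time point, subtitle)."""
    table = []
    for start_seconds, media in segments:  # sequential, in playback order
        table.append({"start_seconds": start_seconds, "subtitle": recognize(media)})
    return table


def fake_recognizer(media: bytes) -> str:
    # Stand-in for the pre-built voice recognition model (assumption):
    # the "audio" here is just UTF-8 text.
    return media.decode("utf-8")


# Five one-minute segments whose payloads stand in for recognizable speech.
segments = [(i * 60, f"subtitle {i + 1}".encode()) for i in range(5)]
subtitle_table = build_subtitle_table(segments, fake_recognizer)
```

Keeping the recognizer as a parameter leaves the table logic independent of whichever speech model is actually used.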

Now suppose that, while the first video data is being played on the electronic device 110, a subtitle display command for the first video data is received by the electronic device 110 from the user at the playback time point of '1 minute 30 seconds'.

Then, from '1 minute 30 seconds' onward, each time one of the playback time points recorded in the subtitle table of Table 1, namely 'playback time point 3 (2 minutes), playback time point 4 (3 minutes), playback time point 5 (4 minutes)', is reached, the subtitle display unit 114 may float and display the subtitle recorded for that playback time point, namely 'subtitle 3, subtitle 4, subtitle 5', at the preset first point in the screen area where the video is displayed.

Specifically, as illustrated in FIG. 2, the subtitle display unit 114 may float and display 'subtitle 3' at the first point 211 in the screen area 210 where the video is displayed when the playback time point of the first video data reaches 'playback time point 3 (2 minutes)', may float and display 'subtitle 4' at the first point 211 when the playback time point reaches 'playback time point 4 (3 minutes)', and may float and display 'subtitle 5' at the first point 211 when the playback time point reaches 'playback time point 5 (4 minutes)'.
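The display behavior in this example can be sketched as a small state machine driven by the player clock. The `SubtitleOverlay` class and its `enable` and `tick` methods are hypothetical names; treating a command that arrives exactly on a recorded time point as still showing that subtitle is an assumption, and the actual floating of text on screen is left out.

```python
class SubtitleOverlay:
    """Decides which subtitle to float as playback reaches recorded time points."""

    def __init__(self, table):
        # table: list of (start_seconds, subtitle) pairs
        self.table = sorted(table)
        self.next_index = None  # None until a display command arrives

    def enable(self, command_seconds):
        # Skip entries whose playback time point already passed before the command.
        self.next_index = sum(1 for start, _ in self.table if start < command_seconds)

    def tick(self, position_seconds):
        """Return the subtitle to float at this position, or None."""
        if self.next_index is None or self.next_index >= len(self.table):
            return None
        start, text = self.table[self.next_index]
        if position_seconds >= start:
            self.next_index += 1
            return text
        return None


table = [(i * 60, f"subtitle {i + 1}") for i in range(5)]
overlay = SubtitleOverlay(table)
overlay.enable(command_seconds=90)  # display command at 1 minute 30 seconds

# Simulate the player clock every 30 seconds until the end of the video.
shown = []
for position in range(90, 300, 30):
    text = overlay.tick(position)
    if text is not None:
        shown.append((position, text))
```

With the command at 90 seconds, `shown` contains subtitles 3, 4, and 5 at 120, 180, and 240 seconds, matching the example.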

According to an embodiment of the present invention, the subtitle generator 112 may, for each of the plurality of divided video data, remove all sound components in frequency bands other than a first frequency band preset as the frequency band corresponding to human voice, and then sequentially apply each divided video data from which those sound components have been removed as an input to the voice recognition model to perform voice recognition, thereby generating the plurality of subtitles.

For example, suppose the first frequency band preset as the frequency band corresponding to human voice is '50 Hz to 4000 Hz'.

In that case, following the example above, the subtitle generator 112 may, for each of 'divided video data 1, divided video data 2, divided video data 3, divided video data 4, divided video data 5', remove all sound components in frequency bands other than the first frequency band of '50 Hz to 4000 Hz' preset as the frequency band corresponding to human voice.

Thereafter, the subtitle generator 112 may sequentially apply each divided video data, from which all sound components outside the first frequency band of '50 Hz to 4000 Hz' have been removed, as an input to the voice recognition model to perform voice recognition, thereby generating the plurality of subtitles as 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5'.
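One way to illustrate removing sound components outside the voice band is to zero the out-of-band bins of a discrete Fourier transform. The sketch below uses a plain O(n²) DFT from the standard library so that it stays self-contained; the text does not say how the filtering is done, so this technique, the `keep_band` name, and the 16 kHz sample rate are all assumptions. A real system would more likely use an FFT or a designed band-pass filter.

```python
import cmath
import math


def keep_band(samples, sample_rate, low_hz, high_hz):
    """Zero every DFT bin outside [low_hz, high_hz] and return the filtered signal."""
    n = len(samples)
    # Forward DFT (O(n^2); fine for a short illustration).
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins above n/2 mirror negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    # Inverse DFT; for real input the imaginary part is only rounding noise.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# A 1000 Hz tone (inside 50-4000 Hz) plus a 6000 Hz tone (outside the band).
rate, n = 16000, 160
voice = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]
hiss = [0.5 * math.sin(2 * math.pi * 6000 * t / rate) for t in range(n)]
filtered = keep_band([v + h for v, h in zip(voice, hiss)], rate, 50, 4000)

# After filtering, only the in-band tone should remain.
max_error = max(abs(f - v) for f, v in zip(filtered, voice))
```

Both test tones fall exactly on DFT bins at this length, so the out-of-band tone is removed cleanly and `max_error` is only floating-point noise.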

According to an embodiment of the present invention, the electronic device 110 may further include a translated subtitle generator 115, a translated subtitle recording processor 116, and a translated subtitle additional display unit 117.

When an additional display command instructing that subtitles in a first foreign language be additionally displayed is received from the user at a second playback time point while the first video data is being played, the translated subtitle generator 115 refers to the subtitle table and translates the plurality of subtitles into the first foreign language through a pre-built translation engine, thereby generating a plurality of translated subtitles.

When the plurality of translated subtitles are generated, the translated subtitle recording processor 116 additionally records, for each of the plurality of subtitles in the subtitle table, the translated subtitle corresponding to that subtitle.

From the second playback time point onward, each time a playback time point recorded in the subtitle table is reached, the translated subtitle additional display unit 117 floats and displays the subtitle and the translated subtitle recorded for that playback time point at the first point and at a preset second point, respectively, in the screen area where the video is displayed.

Hereinafter, the operations of the translated subtitle generator 115, the translated subtitle recording processor 116, and the translated subtitle additional display unit 117 are described in detail by way of example with reference to FIGS. 1 and 3.

First, suppose that, because the first video data is video data dubbed in 'Korean', the plurality of subtitles generated by the subtitle generator 112 are subtitles in 'Korean', that the first foreign language is 'English', and that, while the first video data is being played, an additional display command instructing that subtitles in 'English' be additionally displayed is received by the electronic device 110 from the user at the playback time point of '2 minutes 30 seconds'.

Then, the translated subtitle generator 115 may refer to the subtitle table of Table 1 and translate 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5' into 'English' through the pre-built translation engine, thereby generating the plurality of translated subtitles as 'translated subtitle 1, translated subtitle 2, translated subtitle 3, translated subtitle 4, translated subtitle 5'.

When the plurality of translated subtitles have been generated by the translated subtitle generator 115 in this way, the translated subtitle recording processor 116 may additionally record, for each of 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5' in the subtitle table of Table 1, the corresponding translated subtitle among 'translated subtitle 1, translated subtitle 2, translated subtitle 3, translated subtitle 4, translated subtitle 5', as shown in Table 2 below.

[Table 2]

| Subtitles | Translated subtitles | Playback time point of divided video data |
|---|---|---|
| Subtitle 1 | Translated subtitle 1 | Playback time point 1 of divided video data 1 (0 seconds) |
| Subtitle 2 | Translated subtitle 2 | Playback time point 2 of divided video data 2 (1 minute) |
| Subtitle 3 | Translated subtitle 3 | Playback time point 3 of divided video data 3 (2 minutes) |
| Subtitle 4 | Translated subtitle 4 | Playback time point 4 of divided video data 4 (3 minutes) |
| Subtitle 5 | Translated subtitle 5 | Playback time point 5 of divided video data 5 (4 minutes) |

Thereafter, from the playback time point of '2 minutes 30 seconds' onward, each time one of the playback time points recorded in the subtitle table of Table 2, namely 'playback time point 4 (3 minutes), playback time point 5 (4 minutes)', is reached, the translated subtitle additional display unit 117 may float and display the subtitle recorded for that playback time point, namely 'subtitle 4, subtitle 5', and the translated subtitle, namely 'translated subtitle 4, translated subtitle 5', at the first point and the preset second point, respectively, in the screen area where the video is displayed.

In more detail, as illustrated in FIG. 3, the translated subtitle additional display unit 117 may float and display 'subtitle 4' and 'translated subtitle 4' at the first point 211 and the second point 311, respectively, in the screen area 210 where the video is displayed when the playback time point of the first video data reaches 'playback time point 4 (3 minutes)', and may float and display 'subtitle 5' and 'translated subtitle 5' at the first point 211 and the second point 311, respectively, when the playback time point reaches 'playback time point 5 (4 minutes)'.
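The translation and dual-display steps of this example can be sketched as follows. The function names and the `fake_translate` stand-in are hypothetical; the text only refers to a pre-built translation engine without describing its interface, and the dictionary keys `first_point` and `second_point` merely label where each string would be floated.

```python
def add_translations(table, translate):
    """Attach a translated subtitle to every row of the subtitle table, in place."""
    for row in table:
        row["translated"] = translate(row["subtitle"])
    return table


def rows_to_display(table, command_seconds):
    """Rows whose playback time point is reached at or after the command time."""
    return [row for row in table if row["start_seconds"] >= command_seconds]


def fake_translate(text):
    # Stand-in for the pre-built translation engine (assumption).
    return "translated " + text


table = [{"start_seconds": i * 60, "subtitle": f"subtitle {i + 1}"} for i in range(5)]
add_translations(table, fake_translate)

# Additional display command at 2 minutes 30 seconds (150 s): the original
# subtitle goes to the first point and the translation to the second point.
overlays = [
    {"first_point": row["subtitle"], "second_point": row["translated"]}
    for row in rows_to_display(table, 150)
]
```

Note that every row is translated, even those already played, which mirrors the text: Table 2 records translations for all five subtitles although only rows 4 and 5 are displayed.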

According to an embodiment of the present invention, the electronic device 110 may further include a voice synthesis unit 118, a mixing unit 119, a combining unit 120, and a dubbing playback unit 121.

When a dubbing playback command requesting that the first video data be dubbed into the first foreign language and played is received from the user at a third playback time point while the first video data is being played, the voice synthesis unit 118 applies each of the plurality of translated subtitles as an input to a pre-built voice synthesis model to perform voice synthesis, thereby generating a plurality of synthesized voices corresponding to the plurality of translated subtitles.

For each of the plurality of divided video data, the mixing unit 119 removes the sound component of a first frequency band, preset as the frequency band corresponding to the human voice, and then mixes the plurality of synthesized voices with the respective divided video data from which the sound component of the first frequency band has been removed, thereby generating a plurality of mixed video data.

The combining unit 120 combines the plurality of mixed video data into a single video data in order of playback time, thereby generating first dubbed video data.

The dubbing playback unit 121 plays the first dubbed video data from the third playback time.

Hereinafter, the operations of the voice synthesis unit 118, the mixing unit 119, the combining unit 120, and the dubbing playback unit 121 will be described in detail by way of example.

First, assume that the subtitles and translated subtitles have been generated as shown in Table 2, and that, while the first video data is being played on the electronic device 110, a dubbing playback command requesting that the first video data be dubbed in 'English' and played is received by the electronic device 110 from the user at the playback time of '3 minutes 30 seconds'.

Then, the voice synthesis unit 118 may apply each of 'translated subtitle 1, translated subtitle 2, translated subtitle 3, translated subtitle 4, translated subtitle 5' as an input to the pre-built voice synthesis model and perform voice synthesis, thereby generating the plurality of synthesized voices corresponding to these translated subtitles as 'synthesized voice 1, synthesized voice 2, synthesized voice 3, synthesized voice 4, synthesized voice 5'.

Next, for each of 'divided video data 1, divided video data 2, divided video data 3, divided video data 4, divided video data 5', the mixing unit 119 may remove the sound component of the first frequency band preset as the frequency band corresponding to the human voice.

If the first frequency band preset as the frequency band corresponding to the human voice is '50 Hz to 4000 Hz', the mixing unit 119 may remove the sound component of '50 Hz to 4000 Hz' from each of 'divided video data 1, divided video data 2, divided video data 3, divided video data 4, divided video data 5'.
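One conceivable way to remove the sound component of a preset band is to zero the corresponding bins of a discrete Fourier transform. The following Python sketch illustrates this under stated assumptions: the description does not specify the filtering technique, and the function name `remove_band`, the 16 kHz sampling rate, and the test tones are hypothetical. Hard zeroing of bins is used only for brevity; a practical implementation would more likely use a properly designed band-stop filter.

```python
import cmath
import math


def remove_band(samples, sample_rate, low_hz, high_hz):
    """Zero every DFT bin whose frequency lies in [low_hz, high_hz]."""
    n = len(samples)
    # Forward DFT (O(n^2); acceptable for a short illustration).
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] = 0
    # Inverse DFT; for real input the imaginary part is only rounding noise.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


SAMPLE_RATE = 16000  # assumed sampling rate in Hz
N = 160              # 10 ms of audio, so each DFT bin spans 100 Hz

# A 1000 Hz tone inside the preset band and a 6000 Hz tone outside it.
voice_tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]
high_tone = [math.sin(2 * math.pi * 6000 * t / SAMPLE_RATE) for t in range(N)]
mixture = [v + h for v, h in zip(voice_tone, high_tone)]

filtered = remove_band(mixture, SAMPLE_RATE, 50, 4000)
residual = max(abs(f - h) for f, h in zip(filtered, high_tone))
print(f"largest deviation from the 6000 Hz tone: {residual:.2e}")
```

In this toy signal the in-band tone is removed and the out-of-band tone is left essentially unchanged.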

Thereafter, the mixing unit 119 may mix 'synthesized voice 1, synthesized voice 2, synthesized voice 3, synthesized voice 4, synthesized voice 5' with the respective divided video data from which the sound component of the first frequency band, '50 Hz to 4000 Hz', has been removed, thereby generating the plurality of mixed video data as 'mixed video data 1, mixed video data 2, mixed video data 3, mixed video data 4, mixed video data 5'.

Specifically, the mixing unit 119 may mix 'synthesized voice 1' with 'divided video data 1', from which the human-voice sound component has been removed, to generate 'mixed video data 1'; mix 'synthesized voice 2' with 'divided video data 2', from which the human-voice sound component has been removed, to generate 'mixed video data 2'; mix 'synthesized voice 3' with 'divided video data 3', from which the human-voice sound component has been removed, to generate 'mixed video data 3'; mix 'synthesized voice 4' with 'divided video data 4', from which the human-voice sound component has been removed, to generate 'mixed video data 4'; and mix 'synthesized voice 5' with 'divided video data 5', from which the human-voice sound component has been removed, to generate 'mixed video data 5'.

Once 'mixed video data 1, mixed video data 2, mixed video data 3, mixed video data 4, mixed video data 5' have been generated by the mixing unit 119 in this way, the combining unit 120 may combine them into a single video data in order of playback time, thereby generating the first dubbed video data.

Then, the dubbing playback unit 121 may play the first dubbed video data from the playback time of '3 minutes 30 seconds'.

In this way, the user can watch the first dubbed video data, dubbed in 'English', from '3 minutes 30 seconds'.
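The order of operations in the dubbing example (mix per segment, combine in playback order, then play from the third playback time) can be sketched in Python as follows. This is a simplified illustration that models only the audio track as a list of samples and omits the video frames. The names `Segment`, `mix`, `combine`, and `start_index`, the toy sampling rate, and the sample values are all assumptions and do not describe the actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_s: int        # playback time of the divided video data, in seconds
    audio: List[float]  # audio samples with the voice band already removed


def mix(segment: Segment, synthesized: List[float]) -> Segment:
    """Sample-wise sum of the background audio and the synthesized voice.

    A shorter synthesized voice is padded with silence; a longer one is
    truncated to the segment length (both choices are assumptions).
    """
    padded = synthesized + [0.0] * (len(segment.audio) - len(synthesized))
    mixed = [a + b for a, b in zip(segment.audio, padded)]
    return Segment(segment.start_s, mixed)


def combine(segments: List[Segment]) -> List[float]:
    """Concatenate the mixed segments in order of playback time."""
    ordered = sorted(segments, key=lambda s: s.start_s)
    return [sample for seg in ordered for sample in seg.audio]


def start_index(playback_time_s: int, sample_rate: int) -> int:
    """Index of the first sample to play for a given playback time."""
    return playback_time_s * sample_rate


SAMPLE_RATE = 2  # toy rate (samples per second) to keep the example small
SEGMENT_S = 60   # assumed one-minute division interval

# Five silent one-minute segments and five constant-valued "synthesized voices".
segments = [Segment(i * SEGMENT_S, [0.0] * (SEGMENT_S * SAMPLE_RATE)) for i in range(5)]
voices = [[float(i + 1)] * (SEGMENT_S * SAMPLE_RATE) for i in range(5)]

mixed = [mix(seg, voice) for seg, voice in zip(segments, voices)]
dubbed = combine(list(reversed(mixed)))  # input order does not matter

first = start_index(210, SAMPLE_RATE)    # dubbing command at 3 minutes 30 seconds
print(len(dubbed), first, dubbed[first])
```

Because the combining step sorts by playback time, the dubbed track is the same regardless of the order in which the mixed segments are produced, and playback at 3 minutes 30 seconds begins inside the fourth segment.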

According to an embodiment of the present invention, the electronic device 110 may further include a confirmation unit 122, an extraction unit 123, a search result display unit 124, and a playback unit 125.

When, while the first video data is being played, a first keyword is input by the user together with a search command instructing a search for playback times in the first video data that match the first keyword, the confirmation unit 122 refers to the subtitle table and checks whether any of the plurality of subtitles contains the first keyword.

When it is confirmed that at least one first subtitle containing the first keyword exists among the plurality of subtitles, the extraction unit 123 extracts, from the subtitle table, information on at least one fourth playback time corresponding to the at least one first subtitle.

The search result display unit 124 designates the information on the at least one fourth playback time as the search result matching the first keyword and displays it at a preset third point in the screen area where the video is displayed.

When, with the information on the at least one fourth playback time displayed at the third point, a selective playback command for a fifth playback time, which is one of the at least one fourth playback time, is received from the user, the playback unit 125 plays the first video data from the fifth playback time.

Hereinafter, the operations of the confirmation unit 122, the extraction unit 123, the search result display unit 124, and the playback unit 125 will be described in detail by way of example.

First, assume that, while the first video data is being played, a first keyword is input by the user and a search command instructing a search for playback times in the first video data that match the first keyword is received by the electronic device 110 from the user.

Then, the confirmation unit 122 may refer to the subtitle table of Table 1 and check whether any of the plurality of subtitles 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5' contains the first keyword.

If the confirmation unit 122 confirms that, among 'subtitle 1, subtitle 2, subtitle 3, subtitle 4, subtitle 5', the subtitles containing the first keyword are 'subtitle 1, subtitle 3, subtitle 5', the extraction unit 123 may extract, from the subtitle table of Table 1, information on 'playback time 1 (0 seconds), playback time 3 (2 minutes), playback time 5 (4 minutes)', which are the playback times corresponding to 'subtitle 1, subtitle 3, subtitle 5'.

Thereafter, the search result display unit 124 may designate the information on 'playback time 1 (0 seconds), playback time 3 (2 minutes), playback time 5 (4 minutes)' as the search result matching the first keyword and display it at the preset third point in the screen area where the video is displayed.

With the information on 'playback time 1 (0 seconds), playback time 3 (2 minutes), playback time 5 (4 minutes)' displayed at the third point by the search result display unit 124 in this way, when a selective playback command for 'playback time 3 (2 minutes)', which is one of these playback times, is received by the electronic device 110 from the user, the playback unit 125 may play the first video data from 'playback time 3 (2 minutes)'.

In this way, the user can watch the first video data from 'playback time 3 (2 minutes)'.
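The keyword search over the subtitle table can be illustrated with the following minimal Python sketch. The subtitle texts, the keyword, the case-sensitive substring match, and the names `SUBTITLE_TABLE`, `find_playback_times`, and `format_time` are hypothetical and are chosen only so that subtitles 1, 3, and 5 match, as in the example above.

```python
from typing import Dict, List

# Hypothetical subtitle table: playback time in seconds -> subtitle text.
SUBTITLE_TABLE: Dict[int, str] = {
    0: "The train leaves at noon.",
    60: "We should hurry.",
    120: "Is the train on time?",
    180: "I forgot my ticket.",
    240: "There is the train!",
}


def find_playback_times(table: Dict[int, str], keyword: str) -> List[int]:
    """Return, in ascending order, the playback times whose subtitle contains the keyword.

    A plain case-sensitive substring match is assumed here.
    """
    return sorted(t for t, subtitle in table.items() if keyword in subtitle)


def format_time(seconds: int) -> str:
    """Format a playback time as minutes:seconds for display."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


results = find_playback_times(SUBTITLE_TABLE, "train")
print([format_time(t) for t in results])  # candidates shown at the third point

selected = results[1]  # the user selects the second result
print(f"play from {format_time(selected)}")
```

An empty result list corresponds to the case in which no subtitle containing the keyword is found, so nothing is extracted.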

FIG. 4 is a flowchart illustrating an operating method of an electronic device that outputs subtitles on a screen on which a video is played, based on voice recognition, according to an embodiment of the present invention.

In step S410, when a playback command for first video data is received from a user, the first video data is played while being divided at a preset first playback time interval to generate a plurality of divided video data, and information on the playback time of each of the plurality of divided video data within the total playback time of the first video data is identified.

In step S420, the plurality of divided video data are sequentially applied as inputs to a pre-built voice recognition model and voice recognition is performed, thereby generating a plurality of subtitles corresponding to the respective divided video data.

In step S430, a subtitle table is generated in which each of the plurality of subtitles is recorded in association with information on the playback time of the divided video data corresponding to that subtitle.

In step S440, when a subtitle display command for the first video data is received from the user at a first playback time while the first video data is being played, then, from the first playback time onward, each time one of the playback times recorded in the subtitle table is reached, the subtitle recorded for that playback time in the subtitle table is floated and displayed at a preset first point in the screen area where the video is displayed.
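Steps S410 through S430 can be sketched in Python as follows. This is only an illustration under stated assumptions: the five-minute total playback time and one-minute interval are example values, `fake_recognizer` is a placeholder standing in for the pre-built voice recognition model, and the function names `split_into_segments` and `build_subtitle_table` are hypothetical.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # (start, end) playback times in seconds


def split_into_segments(total_s: int, interval_s: int) -> List[Segment]:
    """S410: divide the total playback time at a preset interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return [(start, min(start + interval_s, total_s))
            for start in range(0, total_s, interval_s)]


def build_subtitle_table(segments: List[Segment],
                         recognize: Callable[[Segment], str]) -> List[Tuple[int, str]]:
    """S420 and S430: recognize each segment in order and record the
    resulting subtitle together with the segment's playback time."""
    return [(start, recognize((start, end))) for start, end in segments]


def fake_recognizer(segment: Segment) -> str:
    """Placeholder for the voice recognition model (not a real recognizer)."""
    start, end = segment
    return f"subtitle for {start}-{end}s"


segments = split_into_segments(total_s=300, interval_s=60)
table = build_subtitle_table(segments, fake_recognizer)
for start, subtitle in table:
    print(start, subtitle)
```

The resulting table pairs each subtitle with the start of its segment, so step S440 only needs to compare the current playback time with the recorded playback times.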

In this case, according to an embodiment of the present invention, in step S420, for each of the plurality of divided video data, all sound components in frequency bands other than the first frequency band, preset as the frequency band corresponding to the human voice, may be removed, and each divided video data from which those sound components have been removed may then be sequentially applied as an input to the voice recognition model to perform voice recognition, thereby generating the plurality of subtitles.

In addition, according to an embodiment of the present invention, the operating method of the electronic device may further include: when an additional display command instructing that subtitles in a first foreign language be additionally displayed is received from the user at a second playback time while the first video data is being played, referring to the subtitle table and translating the plurality of subtitles into the first foreign language through a pre-built translation engine, thereby generating a plurality of translated subtitles; when the plurality of translated subtitles have been generated, additionally recording, for each of the plurality of subtitles in the subtitle table, the translated subtitle corresponding to that subtitle in association with it; and, from the second playback time onward, each time one of the playback times recorded in the subtitle table is reached, floating and displaying the subtitle and the translated subtitle recorded for that playback time in the subtitle table at the first point and at a preset second point, respectively, in the screen area where the video is displayed.

In this case, according to an embodiment of the present invention, the operating method of the electronic device may further include: when a dubbing playback command requesting that the first video data be dubbed in the first foreign language and played is received from the user at a third playback time while the first video data is being played, applying each of the plurality of translated subtitles as an input to a pre-built voice synthesis model and performing voice synthesis, thereby generating a plurality of synthesized voices corresponding to the plurality of translated subtitles; for each of the plurality of divided video data, removing the sound component of the first frequency band, preset as the frequency band corresponding to the human voice, and then mixing the plurality of synthesized voices with the respective divided video data from which the sound component of the first frequency band has been removed, thereby generating a plurality of mixed video data; combining the plurality of mixed video data into a single video data in order of playback time, thereby generating first dubbed video data; and playing the first dubbed video data from the third playback time.

In addition, according to an embodiment of the present invention, the operating method of the electronic device may further include: when, while the first video data is being played, a first keyword is input by the user together with a search command instructing a search for playback times in the first video data that match the first keyword, referring to the subtitle table and checking whether any of the plurality of subtitles contains the first keyword; when it is confirmed that at least one first subtitle containing the first keyword exists among the plurality of subtitles, extracting, from the subtitle table, information on at least one fourth playback time corresponding to the at least one first subtitle; designating the information on the at least one fourth playback time as the search result matching the first keyword and displaying it at a preset third point in the screen area where the video is displayed; and, when a selective playback command for a fifth playback time, which is one of the at least one fourth playback time, is received from the user while the information on the at least one fourth playback time is displayed at the third point, playing the first video data from the fifth playback time.

The operating method of the electronic device according to an embodiment of the present invention has been described above with reference to FIG. 4. Because this operating method may correspond to the configuration of the operations of the electronic device 110 described with reference to FIGS. 1 to 3, a more detailed description is omitted.

The operating method of an electronic device that outputs subtitles on a screen on which a video is played, based on voice recognition, according to an embodiment of the present invention may be implemented as a computer program stored in a storage medium for execution in combination with a computer.

In addition, the operating method of an electronic device that outputs subtitles on a screen on which a video is played, based on voice recognition, according to an embodiment of the present invention may be implemented in the form of computer program instructions for execution in combination with a computer and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, the present invention has been described with reference to specific matters such as particular components and to limited embodiments and drawings, but these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations based on this description.

Therefore, the spirit of the present invention should not be limited to the described embodiments, and not only the claims set forth below but also everything equal or equivalent to those claims falls within the scope of the spirit of the present invention.

110: electronic device that outputs subtitles on a screen on which a video is played, based on voice recognition
111: division unit 112: subtitle generation unit
113: subtitle table generation unit 114: subtitle display unit
115: translated subtitle generation unit 116: translated subtitle recording processing unit
117: translated subtitle additional display unit 118: voice synthesis unit
119: mixing unit 120: combining unit
121: dubbing playback unit 122: confirmation unit
123: extraction unit 124: search result display unit
125: playback unit

Claims

An electronic device that outputs subtitles on a screen on which a video is played, based on voice recognition, the electronic device comprising:
a division unit that, when a playback command for first video data is received from a user, plays the first video data while dividing the first video data at a preset first playback time interval to generate a plurality of divided video data, and identifies information on the playback time of each of the plurality of divided video data within the total playback time of the first video data;
a subtitle generation unit that sequentially applies the plurality of divided video data as inputs to a pre-built voice recognition model and performs voice recognition, thereby generating a plurality of subtitles corresponding to the respective divided video data;
a subtitle table generation unit that generates a subtitle table in which each of the plurality of subtitles is recorded in association with information on the playback time of the divided video data corresponding to that subtitle;
a subtitle display unit that, when a subtitle display command for the first video data is received from the user at a first playback time while the first video data is being played, floats and displays, from the first playback time onward and each time one of the playback times recorded in the subtitle table is reached, the subtitle recorded for that playback time in the subtitle table at a preset first point in a screen area where the video is displayed;
a translated subtitle generation unit that, when an additional display command instructing that subtitles in a first foreign language be additionally displayed is received from the user at a second playback time while the first video data is being played, refers to the subtitle table and translates the plurality of subtitles into the first foreign language through a pre-built translation engine, thereby generating a plurality of translated subtitles;
a translated subtitle recording processing unit that, when the plurality of translated subtitles have been generated, additionally records, for each of the plurality of subtitles in the subtitle table, the translated subtitle corresponding to that subtitle in association with it;
a translated subtitle additional display unit that, from the second playback time onward and each time one of the playback times recorded in the subtitle table is reached, floats and displays the subtitle and the translated subtitle recorded for that playback time in the subtitle table at the first point and at a preset second point, respectively, in the screen area where the video is displayed;
a voice synthesis unit that, when a dubbing playback command requesting that the first video data be dubbed in the first foreign language and played is received from the user at a third playback time while the first video data is being played, applies each of the plurality of translated subtitles as an input to a pre-built voice synthesis model and performs voice synthesis, thereby generating a plurality of synthesized voices corresponding to the plurality of translated subtitles;
a mixing unit that, for each of the plurality of divided video data, removes the sound component of a first frequency band preset as the frequency band corresponding to the human voice, and then mixes the plurality of synthesized voices with the respective divided video data from which the sound component of the first frequency band has been removed, thereby generating a plurality of mixed video data;
a combining unit that combines the plurality of mixed video data into a single video data in order of playback time, thereby generating first dubbed video data; and
a dubbing playback unit that plays the first dubbed video data from the third playback time.

The electronic device of claim 1,
wherein the subtitle generation unit,
for each of the plurality of divided video data, removes all sound components in frequency bands other than the first frequency band preset as the frequency band corresponding to the human voice, and then sequentially applies each divided video data from which those sound components have been removed as an input to the voice recognition model and performs voice recognition, thereby generating the plurality of subtitles.

(Deleted)

The electronic device of claim 1, further comprising:
a confirmation unit that, when a first keyword is input by the user while the first video data is being played, together with a search command instructing a search for playback times in the first video data that match the first keyword, refers to the subtitle table and checks whether any of the plurality of subtitles contains the first keyword;
an extraction unit that, when it is confirmed that at least one first subtitle containing the first keyword exists among the plurality of subtitles, extracts, from the subtitle table, information on at least one fourth playback time corresponding to the at least one first subtitle;
a search result display unit that designates the information on the at least one fourth playback time as the search result matching the first keyword and displays it at a preset third point in the screen area where the video is displayed; and
a playback unit that, when a selective playback command for a fifth playback time, which is one of the at least one fourth playback time, is received from the user while the information on the at least one fourth playback time is displayed at the third point, plays the first video data from the fifth playback time.

A method of operating an electronic device that outputs subtitles on a screen on which a video is reproduced based on voice recognition, the method comprising:
When a playback command for the first image data is received by the user, the first image data is reproduced, and a plurality of divided image data is generated by dividing the first image data by a first playback time interval set in advance; checking information about a playback time point of each of the plurality of divided image data within a total playback time of the first video data;
generating a plurality of subtitles corresponding to each of the plurality of divided image data by sequentially applying the plurality of divided image data as inputs to a pre-built voice recognition model and performing voice recognition;
generating a caption table in which the plurality of captions and information about playback time of divided video data corresponding to each of the plurality of captions are recorded in correspondence with each other;
when a subtitle display command for the first video data is received from the user at a first playback time point while the first video data is being reproduced, floating and displaying, from the first playback time point onward and each time a playback time point recorded in the subtitle table is reached, the subtitle for that playback time point at a preset first point in the screen area where the video is displayed;
when an additional display command instructing additional display of subtitles in a first foreign language is received from the user at a second playback time point while the first video data is being reproduced, generating a plurality of translated subtitles by translating, with reference to the subtitle table, the plurality of subtitles into the first foreign language through a pre-built translation engine;
when the plurality of translated subtitles are generated, additionally recording on the subtitle table the translated subtitle corresponding to each of the plurality of subtitles, in correspondence with that subtitle;
from the second playback time point onward, each time a playback time point recorded in the subtitle table is reached, floating and displaying the subtitle and the translated subtitle for that playback time point at the first point and at a preset second point, respectively, in the screen area where the video is displayed;
when a dubbing playback command requesting that the first video data be dubbed into the first foreign language and played is received from the user at a third playback time point while the first video data is being reproduced, generating a plurality of synthesized voices corresponding to the plurality of translated subtitles by applying each of the plurality of translated subtitles as an input to a pre-built speech synthesis model to perform speech synthesis;
generating a plurality of mixed video data by, for each of the plurality of divided video data, removing the sound component of a first frequency band preset as the frequency band corresponding to human voice, and then mixing the corresponding one of the plurality of synthesized voices into the divided video data from which the sound component of the first frequency band has been removed;
generating first dubbed video data by combining the plurality of mixed video data into one video data in order of playback time point; and
reproducing the first dubbed video data from the third playback time point.
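The segmentation, subtitle table, and timed-display steps of the method above can be sketched as follows. This is only an illustrative sketch, not the claimed implementation: the names `SubtitleRow`, `segment_start_times`, `build_subtitle_table`, and `row_at` are hypothetical, and the `recognize` callback merely stands in for the pre-built voice recognition model, which the claims do not specify.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SubtitleRow:
    start_sec: float                  # playback time point of the divided video data
    subtitle: str                     # text returned by the recognizer for that segment
    translated: Optional[str] = None  # filled in later if a translation is requested


def segment_start_times(total_sec: float, interval_sec: float) -> List[float]:
    """Playback time point of each segment when dividing at a fixed interval."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    starts = []
    i = 0
    while i * interval_sec < total_sec:
        starts.append(i * interval_sec)
        i += 1
    return starts


def build_subtitle_table(total_sec: float, interval_sec: float,
                         recognize: Callable[[float, float], str]) -> List[SubtitleRow]:
    """Run the (assumed) recognizer on each segment and record text with its time point."""
    rows = []
    for start in segment_start_times(total_sec, interval_sec):
        end = min(start + interval_sec, total_sec)
        rows.append(SubtitleRow(start, recognize(start, end)))
    return rows


def row_at(table: List[SubtitleRow], playback_sec: float) -> Optional[SubtitleRow]:
    """Row whose segment contains playback_sec, or None before the first row."""
    starts = [row.start_sec for row in table]
    idx = bisect_right(starts, playback_sec) - 1
    return table[idx] if idx >= 0 else None
```

With a 25-second video divided every 10 seconds, `build_subtitle_table(25.0, 10.0, recognize)` yields rows at 0, 10, and 20 seconds, and `row_at(table, 12.5)` selects the row starting at 10 seconds. Translated subtitles can be stored in the `translated` field of the same row, which mirrors the claim's requirement that each translation be recorded in correspondence with its subtitle.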

The method of claim 6, wherein
generating the plurality of subtitles comprises,
for each of the plurality of divided video data, removing all sound components in frequency bands other than the first frequency band preset as the frequency band corresponding to human voice, and then generating the plurality of subtitles by sequentially applying, as inputs to the voice recognition model, the divided video data from which all sound components in the frequency bands other than the first frequency band have been removed, to perform voice recognition.
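One way to picture the band-limiting step is a frequency-domain mask: transform the audio samples, zero every bin outside the voice band, and transform back. The sketch below makes several assumptions that are not in the claims: it uses a naive discrete Fourier transform (suitable only for short illustrative signals), the function names are hypothetical, and the 300 to 3400 Hz bounds in the usage note are an assumed example because the claim only says the band is preset.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform, for illustration only."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def keep_band(samples, sample_rate, low_hz, high_hz):
    """Zero every frequency bin outside [low_hz, high_hz] and return the filtered signal."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 mirror negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return inverse_dft(spectrum)
```

For example, `keep_band(samples, 8000, 300, 3400)` would suppress a 100 Hz hum while keeping a 1000 Hz tone. Inverting the condition inside the loop gives the complementary operation, removing the voice band, which is what the dubbing step of claim 6 calls for before the synthesized voice is mixed in.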

delete

The method of claim 6, further comprising:
when a first keyword is input by the user while the first video data is being reproduced and a search command instructing a search for a playback time point matching the first keyword in the first video data is received from the user, checking whether a subtitle including the first keyword exists among the plurality of subtitles by referring to the subtitle table;
when it is determined that at least one first subtitle including the first keyword exists among the plurality of subtitles, extracting information about at least one fourth playback time point corresponding to the at least one first subtitle from the subtitle table;
designating the information about the at least one fourth playback time point as a search result matching the first keyword and displaying it at a preset third point in the screen area where the video is displayed; and
when a selective playback command for a fifth playback time point, which is any one of the at least one fourth playback time point, is received from the user while the information about the at least one fourth playback time point is displayed at the third point, reproducing the first video data from the fifth playback time point.
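The keyword search recited above amounts to a lookup over the subtitle table. A minimal sketch follows, assuming the table is held as a list of `(playback_seconds, subtitle_text)` pairs and that matching is a plain substring test; both the representation and the function names are hypothetical and are not specified by the claims.

```python
def search_playback_points(subtitle_table, keyword):
    """Return the playback time points of every subtitle that contains keyword."""
    if not keyword:
        return []
    return [start for start, text in subtitle_table if keyword in text]


def format_time(seconds):
    """Render a playback time point as HH:MM:SS for the on-screen result list."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

The returned list corresponds to the "fourth playback time points" shown as search results. When the user selects one of them (the "fifth playback time point"), the player would seek to that value and resume playback from there.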

A computer-readable recording medium on which is recorded a computer program for executing, in combination with a computer, the method of any one of claims 6, 7, or 10.

A computer program stored in a storage medium for executing, in combination with a computer, the method of any one of claims 6, 7, or 10.