KR102343336B1

KR102343336B1 - Video Authoring System and Method

Info

Publication number: KR102343336B1
Application number: KR1020200067509A
Authority: KR
Inventors: 고일두
Original assignee: 고일두
Priority date: 2020-06-04
Filing date: 2020-06-04
Publication date: 2021-12-23
Also published as: KR20210150724A

Abstract

The present invention relates to a system for producing a video work using speech recognition and speech synthesis. The system comprises: an audio extraction module (100) that extracts audio data from a video; a speech recognition module (200) that recognizes the audio data extracted by the audio extraction module (100) and converts it into text to generate an explanatory audio script; a script editing module (300) for editing the explanatory audio script generated by the speech recognition module (200); a speech synthesis module (400) that reads the explanatory audio script finalized in the script editing module (300) and synthesizes speech from it to generate work audio material; a video extraction module (500) that extracts a video file from the video to generate work video material; and a work creation module (600) that mixes the work audio material and the work video material to generate the video work.

Description

Video Authoring System and Method

The present invention relates to a new kind of video authoring system and method that uses speech recognition, speech synthesis, automatic translation, and audio and video editing technologies to produce or edit video works conveniently and quickly.

In general, a video work such as video lecture material is produced by recording the lecture directly with a recording device such as a camera and then editing the footage. To complete the final video work (for example, lecture or educational material), a user unaccustomed to filming may have to repeat the recording several to several dozen times until the intended content is expressed well. The recorded files must then be edited by cutting out, collecting, and joining only the successful parts, so arriving at a satisfactory result is cumbersome.

Among existing technologies for generating lecture materials is Korean Patent Application Publication No. 10-2017-0029725. It discloses an apparatus that uses image compositing to generate a lecture video, a presentation video, or a video to be transmitted to the other party in a video conference (hereinafter collectively 'lecture video'). The apparatus comprises: a camera image receiving module that receives, from a camera, captured video of a lecturer or video conference participant (hereinafter 'lecturer') or of an audience (hereinafter 'camera image'); an editing interface providing module that provides the user with an editing screen for composing the lecture video screen in 2D or 3D; a setting input receiving module that receives from the user, through the editing screen and using a mouse, keyboard, wireless pointing device, or the like, settings for how a plurality of images including the camera image and presentation material screen images (hereinafter 'component images') are to be arranged within the lecture video (hereinafter 'screen layout settings'); a lecture video generating module that generates a lecture video composed according to the received screen layout settings; a lecture video adjustment module that, when a specific condition is met, changes the composition of the lecture video according to that condition; and a control unit that controls each of these modules to carry out the series of processes involved in generating a lecture video by image compositing.

Such a lecture video generating apparatus can implement a method of generating a lecture video by image compositing that comprises: (a) receiving materials to be used in the presentation, such as PPT slides and images, together with captured video of a lecturer, presenter, or video conference participant (hereinafter 'lecturer') or of an audience from a plurality of cameras (hereinafter 'camera image') and one or more presentation material screen images; (b) providing the lecturer or user of the apparatus (hereinafter collectively 'user') with an editing screen for compositing, in 2D or 3D space, the lecture video screens to be shown to the audience, along with composite scenes of the presentation materials and camera inputs such as the presenter, and setting up in advance the composite scenes to be used during the live lecture by arranging the component images on a plurality of 2D layers or in 3D space (hereinafter 'screen layout settings') so that they form attractive composite screens in a variety of scenes; (c) the user giving the lecture while selecting lecture video screens in real time, where the user or lecturer alone, using only a pointer such as a mouse, touch screen, or wireless pointing device, can switch lecture materials, switch screen layouts, change scenes, and change presentation materials at the same time, pointing and switching materials while viewing the required composite scene; (d) recording the video produced in real time while the lecturer lectures, such as the composite scenes and lecture material screens, and feeding that output into a video conference program as an input so that the program uses the output video as if it came from a single camera, that is, recording and transmitting by a remote presentation conference method; and (e) changing the composition of the lecture video according to a specific condition when that condition is met. However, this prior art does not provide a way to produce or edit video works conveniently and quickly using speech recognition, speech synthesis, automatic translation, and audio and video editing technologies.

[Prior Art Documents]

Korean Patent Application Publication No. 10-2017-0029725

Korean Registered Patent No. 10-1292563

An object of the present invention is to provide a new kind of video authoring system and method that uses speech recognition, speech synthesis, automatic translation, and audio and video editing technologies to produce or edit video works conveniently and quickly.

The technical configuration of the present invention, devised to achieve the above object, is as follows.

The present invention relates to a system for producing a video work using speech recognition and speech synthesis. The system comprises: an audio extraction module (100) that extracts audio data from a video; a speech recognition module (200) that recognizes the audio data extracted by the audio extraction module (100) and converts it into text to generate an explanatory audio script; a script editing module (300) for editing (adding to, correcting, or deleting from) the explanatory audio script generated by the speech recognition module (200); a speech synthesis module (400) that reads the explanatory audio script finalized in the script editing module (300) and synthesizes speech from it to generate work audio material; a video extraction module (500) that extracts a video file from the video to generate work video material; and a work creation module (600) that mixes the work audio material and the work video material to generate the video work.

The present invention also relates to a method of producing a video work using the video authoring system. The method comprises: an audio data extraction process in which the audio extraction module (100) extracts audio data from a previously produced video; an explanatory audio script generation process in which the speech recognition module (200) recognizes the speech in the audio data extracted by the audio extraction module (100) and converts it into text to generate an explanatory audio script; a work audio material generation process in which the speech synthesis module (400) reads the explanatory audio script finalized in the script editing module (300) and synthesizes speech from it to generate work audio material; a work video material generation process in which the video extraction module (500) extracts a video file from the previously produced video to generate work video material; and a video work generation process in which the work creation module (600) generates the video work while mixing or editing the work audio material and the work video material.
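The order of the processes listed above can be sketched as a small pipeline. This is only an illustration: the `Stages` container, the `author_video` function, and the choice to pass each stage in as a callable are assumptions made here, not something the patent specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stages:
    """Hypothetical bundle of the modules, each modelled as a plain callable."""
    extract_audio: Callable[[str], str]   # module 100
    recognize: Callable[[str], str]       # module 200
    edit: Callable[[str], str]            # module 300
    synthesize: Callable[[str], str]      # module 400
    extract_video: Callable[[str], str]   # module 500
    mix: Callable[[str, str], str]        # module 600


def author_video(source: str, s: Stages) -> str:
    """Run the stages in the order the method describes."""
    audio = s.extract_audio(source)
    script = s.edit(s.recognize(audio))   # recognize, then let the user edit
    narration = s.synthesize(script)
    video = s.extract_video(source)
    return s.mix(narration, video)
```

Passing the stages in as callables keeps the sketch independent of any particular speech or video library.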

The technical effects of the configuration of the present invention are as follows.

First, an explanatory audio script is generated using speech recognition, the generated script is easily edited (added to, corrected, or deleted from) as needed, work audio material is then generated from it using speech synthesis, and this is mixed with the work video material to produce a video work of a quality that meets the requirements.

Second, because a multilingual translation module (800) is provided that translates the text of the explanatory audio script into multiple languages, video works translated into various languages can be produced.

Third, when the audio data contained in a video file is extracted, it can be separated into explanatory audio and background effect audio and each assigned its use. The speech recognition module (200) recognizes only the content designated as explanatory audio and converts it into text to generate the explanatory audio script, and the speech synthesis module (400) reads that script and synthesizes speech from it to generate work audio material. Content designated as background effect audio bypasses the speech recognition module (200) and the speech synthesis module (400) and is used as work audio material as it is.

Fourth, once an explanatory audio script (authoring script) has been produced, the work is easy to revise by amending and supplementing the script. A new work is easy to create by cutting and collecting script blocks from several explanatory audio scripts (authoring scripts), reordering them, and so on. The work can also be adjusted in varied and fine-grained ways from an existing explanatory audio script (authoring script), for example by changing the voice or gender used for speech synthesis or adjusting the speaking rate.

FIG. 1 is a block diagram showing the overall configuration of the present invention.
FIG. 2 is a flowchart showing the video work production process according to the present invention.

Specific embodiments of the present invention are described in more detail below with reference to the accompanying drawings.

The present invention relates to a system for producing a video work using speech recognition and speech synthesis. As shown in FIG. 1, it comprises an input module (700), an audio extraction module (100), a speech recognition module (200), a script editing module (300), a speech synthesis module (400), a video extraction module (500), a work creation module (600), a multilingual translation module (800), a block recognition module (900), and the like.

These components may be implemented as various hardware devices and as programs (including web applications) embedded in hardware devices to perform preset operations. They are distinguished by function; physically, the system may consist of a microphone, speaker, keyboard, mouse, digital camera, monitor, and a main unit containing memory, a CPU, and the like.

The audio extraction module (100) extracts audio data from a video.

A video contains the picture, explanatory audio, and background effect audio. The audio extraction module (100) can separate and extract the audio data from such a video, and for this purpose a readily available video editing program such as ffmpeg or Moviepy may be built into the audio extraction module (100).
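As a minimal sketch of how the audio extraction step might call ffmpeg, the helper below only builds the command line; the function name, the mono 16 kHz WAV output, and the file names are illustrative assumptions rather than details taken from the patent.

```python
from typing import List


def build_audio_extract_cmd(video_path: str, audio_path: str,
                            sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that drops the picture and writes mono PCM audio.

    The output format (mono, 16-bit PCM) is an assumption chosen because many
    speech recognition services accept it; the patent does not prescribe one.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,          # input video
        "-vn",                     # discard the video stream
        "-ac", "1",                # downmix to one channel
        "-ar", str(sample_rate),   # resample to the requested rate
        "-c:a", "pcm_s16le",       # 16-bit PCM, suitable for a .wav file
        audio_path,
    ]
```

The command can then be executed with `subprocess.run(build_audio_extract_cmd("lecture.mp4", "lecture.wav"), check=True)` on a machine where ffmpeg is installed.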

While extracting the audio data, the audio extraction module (100) can separate it into explanatory audio and background effect audio and assign each its use.

The speech recognition module (200) recognizes only the content designated as explanatory audio and converts it into text to generate the explanatory audio script, and the speech synthesis module (400) reads the explanatory audio script for that content and synthesizes speech from it to generate work audio material.

As the flowchart of FIG. 2 shows, content designated as background effect audio during audio data extraction bypasses the speech recognition module (200) and the speech synthesis module (400) and is used as work audio material as it is. That is, it is mixed with the explanatory audio that has passed through the speech synthesis module (400).

The block recognition module (900) distinguishes, within the audio data extracted by the audio extraction module (100), the time slots in which content designated as explanatory audio is present from the time slots in which it is absent. When explanatory audio has been present and is then absent for a preset length of time, the module treats what follows as a separate script block.

The block recognition function is optional. When the block recognition module (900) is active, it automatically divides the audio data extracted by the audio extraction module (100) into a number of script blocks, and the speech recognition module (200) can generate an explanatory audio script for each block.
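The block-splitting rule described above can be illustrated as follows. The sketch assumes that some earlier step has already produced the time intervals in which explanatory audio is present; the function name `split_into_blocks` and the `max_gap` parameter are hypothetical names for the "preset length of time".

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) where explanatory audio is present


def split_into_blocks(speech: List[Interval], max_gap: float = 1.5) -> List[Interval]:
    """Group speech intervals into script blocks.

    Consecutive intervals stay in one block while the silence between them is
    shorter than max_gap; a silence of max_gap or more starts a new block.
    """
    blocks: List[Interval] = []
    for start, end in sorted(speech):
        if blocks and start - blocks[-1][1] < max_gap:
            # Short silence: extend the current block to cover this interval.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            # First interval, or a long enough silence: open a new block.
            blocks.append((start, end))
    return blocks
```

For example, with a 1.5-second threshold, speech at 0.0 to 2.0 s and 2.5 to 4.0 s falls into one block, while speech resuming at 7.0 s opens another.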

Once an explanatory audio script has been generated for each script block, script editing, multilingual translation, work audio material generation, and so on can be carried out block by block. The work creation module (600) can then generate the final video work while mixing or editing the work audio material generated for each block with the matching work video material.

The script block is the basic unit of work in generating the final work. The script blocks are arranged in order in the final work to form a single video, but the processes of translating each block, synthesizing its speech, and combining it with video are independent of those for the other blocks.

Accordingly, the time needed to generate the final work is reduced in the following two ways.

1) Parallel work: In a batch job that regenerates the work from beginning to end, the script blocks are processed in parallel so that the per-block works are generated simultaneously, and once all per-block works have been created they are combined into a single video.

2) Background work: Editing normally proceeds by creating and editing script blocks one after another. While the next block is being edited, the per-block work for the block just edited is generated in a background process. Generating a block work usually takes less time than editing, so it finishes before the user moves on to the next block; even when it does not, several block works can be generated at the same time. Once the last block has been edited and its block work created, all that remains is to combine the block works into the final work.

Creating the per-block works usually takes the longest. Processing this step in parallel cuts the working time to roughly one fifth for a 20-minute video of 50 blocks. Processing it in the background means the block works are created behind the editing work, so generation adds almost no time.

The process of combining the per-block works into the final work cannot be run in parallel or in the background, so it is performed separately as a sequential step.
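The two scheduling strategies and the final sequential assembly might be sketched with Python's standard `concurrent.futures` module. The names `render_all_blocks`, `BackgroundRenderer`, and `assemble` are hypothetical, and `render_block` stands in for whatever per-block translate, synthesize, and mix routine an implementation provides.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List


def render_all_blocks(blocks: List[str], render_block: Callable[[str], str],
                      max_workers: int = 4) -> List[str]:
    """Parallel work: render every block at once, keeping the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(render_block, blocks))  # map preserves input order


class BackgroundRenderer:
    """Background work: start rendering a block as soon as its editing is done."""

    def __init__(self, render_block: Callable[[str], str], max_workers: int = 2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._render = render_block
        self._futures: List[Future] = []

    def block_edited(self, script: str) -> None:
        # Returns immediately so the user can keep editing the next block.
        self._futures.append(self._pool.submit(self._render, script))

    def finish(self) -> List[str]:
        # Wait for any renders still running, in block order.
        clips = [f.result() for f in self._futures]
        self._pool.shutdown()
        return clips


def assemble(clips: List[str]) -> str:
    """Sequential final step; a placeholder for real clip concatenation."""
    return "+".join(clips)
```

Threads are used here on the assumption that the per-block work mostly waits on external tools or network services; a process pool would be the analogous choice for CPU-bound rendering.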

The speech recognition module (200) recognizes the speech in the audio data extracted by the audio extraction module (100) and converts it into text to generate the explanatory audio script. The process of extracting the explanatory audio (speech) from the video and recognizing it to generate the explanatory audio script can also be seen in the flowchart of FIG. 2.

The speech recognition module (200) can perform speech recognition (STT, Speech To Text) by calling an API (Application Program Interface) provided by Google, Amazon, or Naver. Details of their use can be found at the URLs below.

google:

https://cloud.google.com/speech-to-text/docs/apis?hl=ko

amazon:

https://aws.amazon.com/ko/transcribe/

naver:

https://www.ncloud.com/product/aiService/csr

The script editing module (300) edits (adds to, corrects, or deletes from) the explanatory audio script generated by the speech recognition module (200), and can use ordinary text editor functions such as those of Notepad.

That is, information needed for authoring can be added. Commas and periods mark where the explanation pauses and where it stops, the number of commas or periods can be increased or decreased to adjust the length of the pause or stop, and background sound or background music can be inserted while the explanation pauses.
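One way to read the punctuation convention above is that each comma or period contributes a fixed amount of silence, so repeating the mark lengthens the pause. The sketch below follows that reading; the function name `parse_pauses` and the per-mark durations are illustrative assumptions, since the patent gives no numeric values.

```python
import re
from typing import List, Tuple


def parse_pauses(script: str, comma_sec: float = 0.3,
                 period_sec: float = 0.6) -> List[Tuple[str, float]]:
    """Split a script into (phrase, pause_seconds) pairs.

    Each comma after a phrase adds comma_sec of silence and each period adds
    period_sec, so "..." pauses three times as long as ".".
    """
    segments: List[Tuple[str, float]] = []
    # A phrase is a run of non-punctuation followed by a run of commas/periods.
    for match in re.finditer(r"([^,.]+)([,.]*)", script):
        text = match.group(1).strip()
        marks = match.group(2)
        pause = marks.count(",") * comma_sec + marks.count(".") * period_sec
        if not text:
            # Marks separated only by spaces: add their pause to the previous phrase.
            if segments:
                prev_text, prev_pause = segments[-1]
                segments[-1] = (prev_text, prev_pause + pause)
            continue
        segments.append((text, pause))
    return segments
```

A speech synthesis step could then synthesize each phrase and insert the computed silence, or background sound, after it.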

In addition, misrecognized text can be freely corrected and any necessary explanation can be added.

That is, as the flowchart of FIG. 2 shows, the explanatory audio script can be edited, corrected, or updated; text can be newly typed, or copied and reused.

The speech synthesis module (400) reads the explanatory audio script finalized in the script editing module (300) and synthesizes speech from it to generate work audio material, as the flowchart of FIG. 2 shows. The speech synthesis module (400) can perform speech synthesis (TTS, Text To Speech) by calling an API (Application Program Interface) provided by Google, Amazon, or Naver. Details of their use can be found at the URLs below.

google:

https://cloud.google.com/text-to-speech/docs/reference/rest

amazon:

https://aws.amazon.com/ko/polly/

naver:

https://www.ncloud.com/product/aiService/css

As the flowchart of FIG. 2 shows, the video extraction module (500) extracts a video file from a video to generate work video material. For this purpose the video extraction module (500) uses the editing functions of a readily available video editing program such as ffmpeg or Moviepy, and the start and end positions of the part of the original video to be used are specified.
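Selecting a part of the source video by start and end position could look like the following ffmpeg command builder. The function name and the decision to drop the original audio track with `-an` are assumptions for illustration; the patent only states that start and end positions are specified.

```python
from typing import List


def build_clip_cmd(src: str, dst: str, start_sec: float, end_sec: float) -> List[str]:
    """Build an ffmpeg command that cuts [start_sec, end_sec) out of src, picture only."""
    if end_sec <= start_sec:
        raise ValueError("end_sec must be greater than start_sec")
    return [
        "ffmpeg",
        "-y",                               # overwrite dst if it exists
        "-ss", str(start_sec),              # seek to the start position in the input
        "-i", src,
        "-t", str(end_sec - start_sec),     # keep this many seconds
        "-an",                              # drop the original audio track
        dst,
    ]
```

Expressing the end position as a duration with `-t` avoids having to reason about how seeking shifts timestamps.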

As the flowchart of FIG. 2 shows, the video extraction module (500) can generate work video material not only from previously produced video but also by extracting some or all of images, text, or PPT (presentation) materials.

That is, pictures produced in various applications, screen captures, camera photographs, and the like can be used as sources; a video can be made from text to be shown on screen; and a video that includes PPT slide transitions or animations can be used as a video source.

As the flowchart of FIG. 2 shows, the work creation module (600) generates the video work while mixing or editing the work audio material and the work video material. The editing functions of a readily available video editing program such as ffmpeg or Moviepy can be used for this purpose.
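The mixing step might be expressed as an ffmpeg command along these lines. This is a sketch under assumptions: the function name, the optional background track, and the use of ffmpeg's `amix` filter to blend synthesized narration with background effect audio are choices made here, not details given in the patent.

```python
from typing import List, Optional


def build_mix_cmd(video: str, narration: str, output: str,
                  background: Optional[str] = None) -> List[str]:
    """Build an ffmpeg command that attaches audio to the work video material."""
    cmd = ["ffmpeg", "-y", "-i", video, "-i", narration]
    if background is None:
        # Picture from input 0, audio from input 1.
        cmd += ["-map", "0:v:0", "-map", "1:a:0"]
    else:
        cmd += [
            "-i", background,
            # Blend narration (input 1) with background effect audio (input 2).
            "-filter_complex", "[1:a][2:a]amix=inputs=2:duration=longest[a]",
            "-map", "0:v:0", "-map", "[a]",
        ]
    # Copy the video stream unchanged; ffmpeg chooses a default audio encoder.
    cmd += ["-c:v", "copy", output]
    return cmd
```

Copying the video stream keeps the step fast because only the audio is re-encoded.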

The input module (700) may be a device such as a microphone for entering speech, or a keyboard or mouse for entering text. When speech is entered through a microphone, the speech recognition module (200) recognizes it and converts it into text to generate the explanatory audio script, as the flowchart of FIG. 2 shows, and work audio material is generated through the speech synthesis module (400).

When a mouse or keyboard (or touch screen) is chosen as the input module (700) and text is entered through it, the explanatory audio script is generated directly without passing through the speech recognition module (200), and work audio material is generated through the speech synthesis module (400).

Text can be entered by typing it directly or by using the text of an existing document (cutting or copying and pasting part of the existing text).
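The two input paths described above, speech through the recognizer and typed or pasted text used directly, can be sketched as a small dispatch function. The `make_script` name, the `"voice"` and `"text"` labels, and the `recognize` callable are hypothetical.

```python
from typing import Callable, Union


def make_script(input_kind: str, payload: Union[bytes, str],
                recognize: Callable[[bytes], str]) -> str:
    """Return the explanatory audio script for one piece of user input.

    Speech goes through the recognizer; typed or pasted text is used as is.
    """
    if input_kind == "voice":
        if not isinstance(payload, bytes):
            raise TypeError("voice input must be raw audio bytes")
        return recognize(payload)
    if input_kind == "text":
        if not isinstance(payload, str):
            raise TypeError("text input must be a string")
        return payload  # bypasses speech recognition entirely
    raise ValueError("unknown input kind: " + input_kind)
```

Either way, the result is plain text that the speech synthesis step can consume.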

As the flowchart of FIG. 2 shows, the multilingual translation module (800) translates the text of the explanatory audio script into multiple languages.

The multilingual translation module (800) can use, for example, Google's translation API. The speech synthesis module (400) reads the explanatory audio script translated by the multilingual translation module (800) and synthesizes speech in each language to generate work audio material.

That is, work audio material in multiple languages can easily be generated from an explanatory audio script written in Korean.
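The translate-then-synthesize flow can be sketched as below. The `translate` and `synthesize` callables are placeholders for whichever translation and TTS services an implementation chooses; their signatures, and the function name `build_multilingual_audio`, are assumptions made for this example.

```python
from typing import Callable, Dict, List


def build_multilingual_audio(script: str, languages: List[str],
                             translate: Callable[[str, str], str],
                             synthesize: Callable[[str, str], bytes]) -> Dict[str, bytes]:
    """Produce one narration track per target language from a single script."""
    audio_by_language: Dict[str, bytes] = {}
    for lang in languages:
        translated = translate(script, lang)                     # module 800
        audio_by_language[lang] = synthesize(translated, lang)   # module 400
    return audio_by_language
```

Because each language is handled independently, the loop could also be run in parallel in the same way as the per-block work.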

The method of producing work audio material according to the present invention includes the following processes.

(1) Audio data extraction process

In this process, audio data is extracted from a previously produced video through the audio extraction module (100). A readily available video editing program such as ffmpeg or Moviepy may be built into the audio extraction module (100).

Alternatively, during audio script editing, the user can speak into a microphone and use the STT function to create or edit the audio script.

A video contains the picture, explanatory audio, and background effect audio. While extracting the audio data, the audio extraction module (100) can separate it into explanatory audio and background effect audio and assign each its use.

If required, the audio data extraction process can divide the audio into a number of script blocks. The block recognition module (900) distinguishes, within the audio data extracted by the audio extraction module (100), the time slots in which content designated as explanatory audio is present from the time slots in which it is absent. When such content has been present and is then absent for a preset length of time, the module treats what follows as a separate script block. The subsequent processes can then be carried out block by block.

(2) Explanatory audio script generation process

In this process, the speech recognition module (200) recognizes the speech in the audio data extracted by the audio extraction module (100) and converts it into text to generate an explanatory audio script. The speech recognition module (200) may perform speech recognition (STT, Speech To Text) by calling an API (Application Programming Interface) provided by Google, Amazon, or Naver.

The speech recognition module (200) recognizes only the content designated as explanatory audio and converts it into text to generate the explanatory audio script. Content designated as background effect audio in the audio data extraction process does not pass through the speech recognition module (200) or the speech synthesis module (400) and is used as-is as work audio material.

The explanatory audio script may also be generated in either of two other ways. When speech is input directly through an input module (700) such as a microphone, the speech recognition module (200) recognizes the speech and converts it into text to generate the explanatory audio script. When text is input through an input module (700) such as a mouse or keyboard (or touch screen), the explanatory audio script is generated directly without passing through the speech recognition module (200).

When script blocks have been designated, the speech recognition module (200) generates the explanatory audio script separately for each script block.

(2-1) Script editing process

In this process, the script editing module (300) edits the explanatory audio script (adding, correcting, or deleting text). A general text editor such as Notepad may be used, or a specialized editor tailored to explanatory audio scripts.

That is, information needed for authoring can be added. Commas and periods mark where the narration pauses and where it ends, and the pause and ending times are adjusted by increasing or decreasing the number of commas or periods. Background sound or background music may also be inserted where the narration pauses. In addition, misrecognized text can be freely corrected, and any needed explanation can be added.
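As an illustrative sketch only, the use of repeated commas and periods to control pause length could be interpreted as follows. The per-mark durations `COMMA_PAUSE_SEC` and `PERIOD_PAUSE_SEC` and the function name `parse_script` are assumptions made for illustration; the specification does not fix any particular values or data format.

```python
import re
from typing import List, Tuple, Union

COMMA_PAUSE_SEC = 0.3   # assumed pause contributed by each comma
PERIOD_PAUSE_SEC = 0.6  # assumed pause contributed by each period

Segment = Tuple[str, Union[str, float]]


def parse_script(script: str) -> List[Segment]:
    """Split a script into ("speak", text) and ("pause", seconds) segments."""
    segments: List[Segment] = []
    # Alternate between runs of commas/periods and runs of ordinary text.
    for token in re.findall(r"[,.]+|[^,.]+", script):
        if token[0] in ",.":
            pause = (token.count(",") * COMMA_PAUSE_SEC
                     + token.count(".") * PERIOD_PAUSE_SEC)
            segments.append(("pause", round(pause, 3)))
        else:
            text = token.strip()
            if text:
                segments.append(("speak", text))
    return segments
```

Under these assumed values, adding one more period to a run of periods lengthens the corresponding pause by 0.6 seconds. The sketch treats every comma and period as a pause marker, so text such as decimal numbers would need separate handling in a real editor.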

This script editing process is performed for each script block.

An important function of the present invention is the editing of an existing video work. For example, replacing part of the explanatory audio or part of the video in an existing video work takes little effort with the present invention.

For lecture video works in particular, editing often centers on the explanatory audio.

The conventional approach is to re-shoot or re-record and then splice the pieces together. With the present invention, the block recognition function automatically divides the video work into blocks, and only the explanatory audio of the blocks to be changed is edited using the editing method of the present invention. This takes far less effort and time than the conventional approach.

Of course, the automatically recognized explanatory audio of the blocks that are not being edited may contain some recognition errors. In that case, there are two options.

(1) Insert the original audio as background sound and omit the explanatory audio, so that only the original audio is used.

(2) Correct all recognition errors in the recognized explanatory audio script.

With option (1), the explanatory audio of the edited blocks is machine-synthesized speech while the original audio is in the original author's voice, so the two sound different. For lecture works, however, where accurate delivery of the content matters more, this is a reasonable choice.

Option (2) takes considerable initial effort and time. For a work that will be used over a long period, such as a lecture video, however, once the work has been prepared it can be revised with very little effort from the second revision onward.

In actual experience producing lecture works, when the accuracy of the lecture content is important, as in mathematics or engineering, it takes considerable effort to turn the initially authored lecture video into an explanatory audio script. Once the explanatory audio script exists, however, the process of listening to the lecture video (at increased speed) and revising it is repeated several times to improve the accuracy of the content and how well it is understood by students. This iteration does not take much time, and the work is typically revised and supplemented five or more times.

(2-2) Multilingual translation process

In this process, the multilingual translation module (800) translates the text of the explanatory audio script into multiple languages. The multilingual translation module (800) may use, for example, Google's translation API.

When the multilingual translation process is added, the speech synthesis module (400) recognizes the explanatory audio script translated into multiple languages by the multilingual translation module (800) and synthesizes speech in those languages, generating work audio material or corresponding multilingual subtitles.

This multilingual translation process is performed for each script block.
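As one possible illustration of generating subtitles from translated script blocks, the sketch below writes each block as an entry in the SubRip (SRT) subtitle format. The choice of SRT, the function names, and the (start, end, text) tuple layout are assumptions for illustration; the specification does not name a subtitle format, and the translated text is assumed to have been produced beforehand.

```python
from typing import List, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def blocks_to_srt(blocks: List[Tuple[float, float, str]]) -> str:
    """Render (start_sec, end_sec, translated_text) blocks as SRT text."""
    entries = []
    for index, (start, end, text) in enumerate(blocks, start=1):
        entries.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Entries are separated by a blank line.
    return "\n".join(entries)
```

Calling `blocks_to_srt` once per target language, with the same block timings and the text translated into that language, would yield one subtitle file per language.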

(3) Work audio material generation process

In this process, the speech synthesis module (400) recognizes the explanatory audio script finalized through the script editing module (300) and synthesizes speech to generate work audio material. The speech synthesis module (400) may perform speech synthesis (TTS, Text To Speech) by calling an API (Application Programming Interface) provided by Google, Amazon, or Naver.

When script blocks have been designated, the speech synthesis module (400) likewise synthesizes speech separately for each script block to generate the work audio material.

(4) Work video material generation process

In this process, the video extraction module (500) extracts video material from a previously produced video to generate work video material. The video extraction module (500) uses the editing functions of an existing video editing program such as ffmpeg or Moviepy, and the start and end positions of the footage to be used are specified within the original video.
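By way of a hedged example, extracting the footage between a specified start and end position could be done by invoking the ffmpeg command-line tool. The sketch below only builds the argument list; the function name `video_clip_command` and the file names in the usage note are hypothetical, and the specification does not prescribe these particular options.

```python
from typing import List


def video_clip_command(src: str, dst: str, start: str, end: str) -> List[str]:
    """Build an ffmpeg command that keeps only the video between start and end."""
    return [
        "ffmpeg",
        "-y",          # overwrite the output file if it already exists
        "-i", src,     # original video
        "-ss", start,  # start position of the footage to use
        "-to", end,    # end position of the footage to use
        "-an",         # drop the audio stream so only video remains
        dst,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where ffmpeg is installed, for example `video_clip_command("lecture.mp4", "block1_video.mp4", "00:00:12", "00:00:18")`.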

The video extraction module (500) can generate work video material not only from previously produced video but also by extracting some or all of an image, text, or PPT (presentation) material. Photos or screen captures produced in various application programs and photos taken with a camera can be used as sources, video can be made from text to be shown on screen, and a video that includes PPT slide transitions and animations can be used as a video source.

(5) Video work generation process

In this process, the work creation module (600) generates the video work by mixing or editing the work audio material and the work video material.

The editing functions of an existing video editing program such as ffmpeg or Moviepy may be used for this work.

When script blocks have been designated, the work creation module (600) generates the video work by mixing or editing the work audio material generated for each script block with the work video material that matches it.
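As a minimal sketch of the mixing step for one script block, the synthesized audio and the matching video could be combined with the ffmpeg command-line tool. The function name `mux_command` is hypothetical, and the options shown are one reasonable choice rather than the method required by the specification.

```python
from typing import List


def mux_command(video: str, audio: str, dst: str) -> List[str]:
    """Build an ffmpeg command that combines one video file with one audio file."""
    return [
        "ffmpeg",
        "-y",             # overwrite the output file if it already exists
        "-i", video,      # input 0: work video material
        "-i", audio,      # input 1: work audio material
        "-map", "0:v:0",  # take the first video stream from input 0
        "-map", "1:a:0",  # take the first audio stream from input 1
        "-c:v", "copy",   # copy the video stream without re-encoding
        "-shortest",      # stop at the end of the shorter input
        dst,
    ]
```

Copying the video stream avoids re-encoding, which keeps the per-block mixing step fast; the list can be executed with `subprocess.run(cmd, check=True)` where ffmpeg is installed.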

(6) The processing speed of the entire process for generating a video work can be improved significantly through parallel processing. When the content of the work has been divided into a plurality of script blocks, each script block can be processed independently of the others, so multiple script blocks can be processed simultaneously in parallel.

Once the work for each script block is complete, all that remains is to collect the work elements of each script block and generate the final video work.

For example, when one work consists of a first block, a second block, and a third block, the work on the second and third blocks can proceed in parallel before the work on the first block is complete. Once the work for each block is complete, the work elements of the first, second, and third blocks are collected and edited into the final video work, which maximizes work efficiency. This parallel processing can be implemented with technical elements that are already in practical use.
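The block-level parallelism described above can be sketched with the Python standard library as follows. The function `process_blocks` and its `worker` argument (standing in for the per-block recognition, editing, synthesis, and mixing steps) are hypothetical names introduced for illustration, and a thread pool is only one of several existing mechanisms that could be used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_blocks(
    blocks: List[str],
    worker: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run `worker` on every script block in parallel.

    Results are returned in the original block order, so they can be
    joined into the final work even if blocks finish out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, blocks))
```

Because the results come back in block order, the final assembly step can simply concatenate them once all blocks have finished.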

Specific embodiments of the present invention have been described above with reference to the accompanying drawings. The scope of protection of the present invention is not necessarily limited to these embodiments, however, and it is made clear that various design changes, additions or deletions of known technology, simple numerical limitations, and the like that do not alter the technical gist of the present invention also fall within its scope of protection.

100: audio extraction module
200: voice recognition module
300: script editing module
400: speech synthesis module
500: video extraction module
600: work creation module
700: input module
800: multilingual translation module
900: block recognition module

Claims

A video authoring system for producing a video work using speech recognition and speech synthesis, comprising:
an audio extraction module (100) that extracts audio data from a video;
a speech recognition module (200) that recognizes the audio data extracted by the audio extraction module (100) and converts it into text to generate an explanatory audio script;
a script editing module (300) that edits the explanatory audio script generated by the speech recognition module (200); and
a speech synthesis module (400) that recognizes the explanatory audio script finalized through the script editing module (300) and synthesizes speech to generate work audio material,
wherein, while extracting the audio data, the audio extraction module (100) separates it into explanatory audio and background effect audio and designates the use of each,
the system further comprising a block recognition module (900) that distinguishes, among the audio data extracted by the audio extraction module (100), time slots in which content designated as explanatory audio exists from time slots in which it does not, and that marks a different script block when the explanatory audio has been present and is then absent for a preset period of time,
wherein the audio data extracted by the audio extraction module (100) is automatically divided into a plurality of script blocks by the block recognition module (900), and
the speech recognition module (200) is capable of generating an explanatory audio script for each script block.

The video authoring system of claim 1, further comprising:
an input module (700) for inputting speech or text,
wherein, when speech is input through the input module (700), the speech recognition module (200) recognizes the speech and converts it into text to generate the explanatory audio script, and
when text is input through the input module (700), the explanatory audio script is generated directly without passing through the speech recognition module (200).

The video authoring system of claim 1, further comprising:
a video extraction module (500) capable of generating work video material by extracting a video file from a video, or by extracting some or all of an image, text, or PPT material; and
a work creation module (600) that generates a video work by mixing the work audio material and the work video material.

The video authoring system of claim 1, wherein:
the speech recognition module (200) recognizes only the content designated as explanatory audio and converts it into text to generate the explanatory audio script,
the speech synthesis module (400) recognizes the explanatory audio script for the content designated as explanatory audio and synthesizes speech to generate the work audio material, and
the content designated as background effect audio is mixed in and used as work audio material as-is, without passing through the speech recognition module (200) and the speech synthesis module (400).

The video authoring system of claim 1, further comprising:
a multilingual translation module (800) that translates the text of the explanatory audio script into multiple languages,
wherein the speech synthesis module (400) recognizes the explanatory audio script translated into multiple languages by the multilingual translation module (800) and synthesizes speech in multiple languages to generate the work audio material and corresponding multilingual subtitles.

A video authoring method using a video authoring system, comprising:
an audio data extraction process in which the audio extraction module (100) extracts audio data from a previously produced video;
an explanatory audio script generation process in which the speech recognition module (200) recognizes the speech in the audio data extracted by the audio extraction module (100) and converts it into text to generate an explanatory audio script;
a work audio material generation process in which the speech synthesis module (400) recognizes the explanatory audio script finalized through the script editing module (300) and synthesizes speech to generate work audio material;
a work video material generation process in which the video extraction module (500) generates work video material by extracting a video file from a previously produced video, or by extracting some or all of an image, text, or PPT material; and
a video work generation process in which the work creation module (600) generates a video work by mixing or editing the work audio material and the work video material,
wherein, in the audio data extraction process,
the audio data is separated into explanatory audio and background effect audio while being extracted, and the use of each is designated, and
among the audio data extracted by the audio extraction module (100), time slots in which content designated as explanatory audio exists are distinguished from time slots in which it does not, and a different script block is marked when the explanatory audio has been present and is then absent for a preset period of time,
wherein, in the explanatory audio script generation process,
the speech recognition module (200) generates an explanatory audio script for each script block,
wherein, in the work audio material generation process,
the speech synthesis module (400) synthesizes speech for each script block to generate the work audio material, and
wherein, in the video work generation process,
the work creation module (600) generates the video work by mixing or editing the work audio material generated for each script block with the work video material that matches it.

The video authoring method of claim 6,
wherein, in the audio data extraction process,
content designated as background effect audio is mixed in and used as work audio material as-is in the work audio material generation process, without passing through the speech recognition module (200) and the speech synthesis module (400), and
wherein, in the explanatory audio script generation process,
only the content designated as explanatory audio is recognized and converted into text to generate the explanatory audio script.

The video authoring method of claim 6, further comprising:
a multilingual translation process of translating, into multiple languages, the text of the explanatory audio script generated in the explanatory audio script generation process,
wherein, when the multilingual translation process is added, the speech synthesis module (400) recognizes the explanatory audio script translated into multiple languages by the multilingual translation module (800) and synthesizes speech in multiple languages to generate work audio material or corresponding multilingual subtitles.

delete