KR102577734B1

KR102577734B1 - AI learning method for subtitle synchronization of live performance

Info

Publication number: KR102577734B1
Application number: KR1020210166433A
Authority: KR
Inventors: 박민철; 주현수; 김대연; 최성원; 고현우
Original assignee: 한국과학기술연구원
Priority date: 2021-11-29
Filing date: 2021-11-29
Publication date: 2023-09-14
Also published as: KR20230079574A

Abstract

The purpose of this specification is to provide a method for training a model that infers the performance time from a performer's lines or lyrics in a live performance. To solve this problem, the performance time inference model learning method according to the present specification may separately train a first inference model, which receives visual data of a performance and infers time information, and a second inference model, which receives audio data of the performance and infers time information, and may then train a third inference model, which receives the first inference time value and the second inference time value and infers time information.

Description

AI learning method for subtitle synchronization of live performances {AI LEARNING METHOD FOR SUBTITLE SYNCHRONIZATION OF LIVE PERFORMANCE}

The present invention relates to an artificial intelligence learning method and, more specifically, to an artificial intelligence learning method for subtitle synchronization of live performances.

The content described in this section merely provides background information on the embodiments described in this specification and does not necessarily constitute prior art.

In video media such as movies and dramas, actors' lines are sometimes displayed as subtitles. Such subtitles consist of the subtitle content (the actor's lines) and timing information indicating when each subtitle is displayed. A program therefore displays subtitles on the screen by showing predetermined subtitle content at predetermined times measured from the start of the video.

Recently, subtitles are also sometimes displayed at live performances such as music concerts, plays, and musicals. A screen or display is installed at one side of the stage, and subtitles are displayed in step with the performer's lines or song lyrics. In addition, technology is being developed in which audience members wear AR glasses and subtitles are shown on the display of the AR glasses.

Displaying subtitles at such live performances requires a different control technique from displaying subtitles in video media. Unlike video media, in a live performance the moment at which even the same line or lyric begins can vary, and lines or lyrics may be repeated or skipped depending on conditions at the venue or the performer's momentary judgment.

Because of this live and variable nature, in practice an engineer responsible for subtitles at the venue currently follows the performance by eye and ear and controls the subtitles manually each time.

Laid-Open Patent Publication No. 10-2009-0129016

The purpose of this specification is to provide a method for training a model that infers the performance time from a performer's lines or lyrics in a live performance.

This specification is not limited to the above-mentioned problem, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above-described problem, the performance time inference model learning method according to the present specification may include: (a) a step in which a processor generates visual data and audio data from performance shooting information at predetermined time intervals and adds performance time information to each item of visual data and audio data; (b) a step in which the processor sets, for a first inference model that receives visual data and infers time information, the performance time information corresponding to each item of visual data as a first correct answer value, and trains the first inference model until the difference between the first inference time value output when each item of visual data is input and the first correct answer value is minimized; (c) a step in which the processor sets, for a second inference model that receives audio data and infers time information, the performance time information corresponding to each item of audio data as a second correct answer value, and trains the second inference model until the difference between the second inference time value output when each item of audio data is input and the second correct answer value is minimized; and (d) a step in which the processor sets, for a third inference model that receives the first inference time value and the second inference time value and infers time information, the performance time information corresponding both to the visual data used to infer the first inference time value and to the audio data used to infer the second inference time value as a third correct answer value, and trains the third inference model until the difference between the third inference time value output when the first inference time value and the second inference time value are input and the third correct answer value is minimized.

According to an embodiment of the present specification, when generating the audio data in step (a), the processor may convert the audio signal in the performance shooting information into spectrogram-based visualized data.

According to an embodiment of the present specification, in step (a), the processor may generate visual data and audio data for each section in which the RGB values of the visualized data change by at least a preset reference amount.

According to an embodiment of the present specification, step (d) may be executed when the difference between the first inference time value and the second inference time value received by the processor is less than or equal to a preset reference difference value.

The performance time inference model learning method according to the present specification may be implemented as a computer program written to perform each step of the method on a computer and recorded on a computer-readable recording medium.

To solve the above-described problem, a performance time inference model learning device according to the present specification may include: a memory storing performance shooting information; a preprocessor that generates visual data and audio data from the performance shooting information at predetermined time intervals and adds performance time information to each item of visual data and audio data; a first inference model learning unit that sets, for a first inference model that receives visual data and infers time information, the performance time information corresponding to each item of visual data as a first correct answer value, and trains the first inference model until the difference between the first inference time value output when each item of visual data is input and the first correct answer value is minimized; a second inference model learning unit that sets, for a second inference model that receives audio data and infers time information, the performance time information corresponding to each item of audio data as a second correct answer value, and trains the second inference model until the difference between the second inference time value output when each item of audio data is input and the second correct answer value is minimized; and a third inference model learning unit that sets, for a third inference model that receives the first inference time value and the second inference time value and infers time information, the performance time information corresponding both to the visual data used to infer the first inference time value and to the audio data used to infer the second inference time value as a third correct answer value, and trains the third inference model until the difference between the third inference time value output when the first inference time value and the second inference time value are input and the third correct answer value is minimized.

According to an embodiment of the present specification, when generating the audio data, the preprocessor may convert the audio signal in the performance shooting information into spectrogram-based visualized data.

According to an embodiment of the present specification, the preprocessor may generate visual data and audio data for each section in which the RGB values of the visualized data change by at least a preset reference amount.

According to an embodiment of the present specification, the third inference model learning unit may train the third inference model when the difference between the received first inference time value and second inference time value is less than or equal to a preset reference difference value.

To solve the above-described problem, a performance subtitle synchronization device according to the present specification may include: a performance time inference model trained by the performance time inference model learning device according to the present specification; a memory containing subtitle information corresponding to performance time information; and a display control unit that controls output of the subtitle information corresponding to the performance time information inferred by the performance time inference model.

Other specific details of the present invention are included in the detailed description and drawings.

A performance time inference model trained according to the present specification can use the shooting information of a live performance to display subtitles at the exact timing of the performer's lines or lyrics. Compared with the conventional practice in which a person directly controls when subtitles are displayed, this removes the opportunity for human error and can thereby reduce mishaps during performances.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

Figure 1 is a flowchart schematically showing a performance time inference model learning method according to the present specification.
Figure 2 is a block diagram schematically showing the configuration of a performance time inference model learning device according to the present specification.
Figure 3 is a reference diagram of preprocessed visual data and audio data.

The advantages and features of the invention disclosed in this specification, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, this specification is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure complete and to fully convey the scope of this specification to a person of ordinary skill in the technical field to which it pertains (hereinafter "those skilled in the art"), and the scope of rights of this specification is defined only by the claims.

The terms used in this specification are for describing the embodiments and are not intended to limit the scope of rights of this specification. In this specification, the singular form also includes the plural unless the phrase specifically states otherwise. As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more elements other than those mentioned.

Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the mentioned elements and every combination of one or more of them. Although "first", "second", and so on are used to describe various elements, these elements are of course not limited by these terms. The terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical spirit of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those skilled in the art to which this specification pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are clearly and specifically defined.

Hereinafter, the performance time inference model learning method according to the present specification is described with reference to the accompanying drawings.

Figure 1 is a flowchart schematically showing a performance time inference model learning method according to the present specification.

Figure 2 is a block diagram schematically showing the configuration of a performance time inference model learning device according to the present specification.

Referring to Figures 1 and 2 together, the performance time inference model learning method according to the present specification may broadly include a step of preprocessing performance shooting information for training (S100), a step of training a first inference model that infers the performance time from visual data (S110), a step of training a second inference model that infers the performance time from audio data (S120), and a step of training a third inference model that infers a final performance time from the performance time inferred by the first inference model from visual data and the performance time inferred by the second inference model from audio data (S130). Each step is described in more detail below.

First is the performance shooting information preprocessing step (S100).

In this specification, performance shooting information is a recording of a performance taking place on stage, that is, data containing both video and audio information. The performance shooting information can be obtained by recording the performers' rehearsal. Alternatively, if the performance is repeated several times, it can be obtained by recording an actual performance. The obtained performance shooting information is stored in memory and preprocessed for training.

The preprocessor 100 according to the present specification may divide the performance shooting information into visual data and audio data and generate each separately. In doing so, the preprocessor 100 may add performance time information to each item of visual data and audio data. In this specification, performance time information indicates how much time has elapsed since the performance started. For example, in a two-hour performance, if the performer sings song A while dancing at the 01 hour 05 minute 30 second mark, the performance time information [010530] may be added to the visual data and audio data for that song and dance.
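The [010530] example above can be sketched in Python as follows. This is a minimal illustration only: the function names and the dictionary layout of a tagged segment are assumptions made for this sketch and do not appear in the specification.

```python
def format_performance_time(elapsed_seconds: int) -> str:
    """Format the seconds elapsed since the performance started as an HHMMSS tag."""
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must be non-negative")
    hours, remainder = divmod(elapsed_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}{minutes:02d}{seconds:02d}"


def tag_segment(visual, audio, elapsed_seconds: int) -> dict:
    """Attach one shared performance time tag to a visual/audio pair.

    The record layout is hypothetical; the specification only states that
    performance time information is added to each item of data.
    """
    return {
        "visual": visual,
        "audio": audio,
        "performance_time": format_performance_time(elapsed_seconds),
    }
```

Here 1 hour 5 minutes 30 seconds corresponds to 3930 elapsed seconds, which the sketch formats as the six-digit tag used in the example.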

According to an embodiment of the present specification, when generating the audio data, the preprocessor 100 may convert the audio signal in the performance shooting information into spectrogram-based visualized data.

Figure 3 is a reference diagram of preprocessed visual data and audio data.

Referring to Figure 3, the visual data is shown above and the audio data below, with the performance time information as the reference.

According to the present specification, the preprocessor 100 may normalize the data of the audio file in the performance shooting information to values between -1 and 1.
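One way to carry out this normalization is peak normalization, sketched below. The specification does not state which normalization is used, so the choice of dividing by the largest absolute sample, and the function name, are assumptions of this sketch.

```python
def normalize_audio(samples):
    """Scale audio samples into the range [-1, 1] by dividing by the peak magnitude.

    Peak normalization is an assumption of this sketch; the specification only
    says the audio data is normalized to values between -1 and 1.
    """
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        # Silent or empty input: nothing to scale.
        return [0.0 for _ in samples]
    return [s / peak for s in samples]
```

For 16-bit PCM input, dividing by the fixed value 32768 instead of the observed peak would be an equally plausible reading.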

The preprocessor 100 may then apply a Fourier transform to decompose the data, which is arranged along the time axis, into components along the frequency and time axes, thereby increasing its dimensionality. The preprocessor 100 may then extract useful information based on MFCC (mel-frequency cepstral coefficients). Here, the preprocessor 100 may set the number of pieces of information to extract at random, and may collect and use those corresponding to ten pieces of information:

X_bags = {M_1, M_2, …, M_10}
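The Fourier step that turns a one-dimensional signal into a time-frequency representation can be sketched as a short-time discrete Fourier transform. This is a deliberately small illustration: the frame size, the hop length, and the absence of a window function are assumptions of the sketch, and the subsequent MFCC extraction is not shown.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size, hop):
    """Return one list of magnitudes per frame, for frequency bins 0..frame_size // 2.

    Each frame is transformed with a direct DFT, so the output has a time axis
    (frames) and a frequency axis (bins). No window function is applied.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        bins = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames
```

A practical implementation would normally use an FFT routine and a window such as Hann. The direct DFT is used here only to keep the sketch self-contained.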

Meanwhile, the first inference model and the second inference model according to the present specification are independent of each other, so they can be trained in any order. Preferably, the visual data used to train the first inference model and the audio data used to train the second inference model have the same performance time information.

In addition, the preprocessor 100 may generate the visual data and audio data as data of a predetermined time length. According to one example, the preprocessor 100 may generate visual data and audio data at preset time intervals, such as every 1, 2, or 3 seconds. According to another example, the preprocessor 100 may generate visual data and audio data for each section in which the RGB values of the visualized data change by at least a preset reference amount. This takes advantage of the fact that sections where the audio changes sharply are easy to identify. More specifically, the preprocessor 100 may compare the average RGB value of the converted spectrogram over a given time t with the average value of the spectrogram obtained by dividing t into n equal parts (n from 0 to 10), find the n at which the difference grows to at least 1/5 of 255, set point n-1 as the optimal time, and generate the visual data and audio data based on that point. In yet another embodiment, the preprocessor 100 may generate visual data and audio data at each point where the video, rather than the audio, changes sharply.
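The two segmentation strategies can be sketched as follows. The second function is a simplified reading, not the n-division rule described above: it assumes the spectrogram has already been reduced to one mean intensity (0 to 255) per column, which is a hypothetical input format, and it marks a boundary wherever adjacent columns differ by at least 1/5 of 255.

```python
def fixed_interval_boundaries(duration_s: int, interval_s: int):
    """Segment start times at a preset interval, for example every 1, 2, or 3 seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return list(range(0, duration_s, interval_s))


def change_boundaries(mean_intensity, threshold=255 / 5):
    """Indices where the mean intensity jumps by at least `threshold`.

    `mean_intensity` is a hypothetical per-column average of the spectrogram's
    RGB values; the default threshold is 1/5 of 255, as in the text.
    """
    return [
        i
        for i in range(1, len(mean_intensity))
        if abs(mean_intensity[i] - mean_intensity[i - 1]) >= threshold
    ]
```

Whichever rule is used, the same boundaries would be applied to both the visual data and the audio data so that each pair shares one performance time.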

The visual data generated in this way may be used to train the first inference model, and the audio data may be used to train the second inference model.

Next is the first inference model training step (S110).

In this specification, the first inference model is an artificial intelligence model that receives visual data and infers time information. The first inference model learning unit 110 may set the performance time information corresponding to each item of visual data as the first correct answer value. The first inference model learning unit 110 may input each item of visual data into the first inference model and cause it to output inferred time information (hereinafter the "first inference time value"). The first inference model learning unit 110 may then train the first inference model until the difference between the output first inference time value and the first correct answer value is minimized.

Next is the second inference model training step (S120).

In this specification, the second inference model is an artificial intelligence model that receives audio data and infers time information. The second inference model learning unit 120 may set the performance time information corresponding to each item of audio data as the second correct answer value. The second inference model learning unit 120 may input each item of audio data into the second inference model and cause it to output inferred time information (hereinafter the "second inference time value"). The second inference model learning unit 120 may then train the second inference model until the difference between the output second inference time value and the second correct answer value is minimized.
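Steps S110 and S120 share the same training pattern: feed an input, compare the inferred time with the correct answer value, and adjust the model to shrink the difference. The sketch below shows that pattern with a deliberately tiny stand-in model. The specification does not name a model architecture or loss function, so the scalar feature, the linear model, and the mean squared error loss are all assumptions of this sketch.

```python
def train_time_regressor(features, targets, lr=0.01, epochs=2000):
    """Fit time ~ w * feature + b by gradient descent on mean squared error.

    `features` stands in for whatever representation the model derives from
    visual or audio data; `targets` are the correct answer values (performance
    times in seconds). Both are hypothetical placeholders.
    """
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, t in zip(features, targets):
            err = (w * x + b) - t  # inferred time minus correct answer value
            grad_w += 2 * err * x / n
            grad_b += 2 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

In practice the first and second inference models would more likely be neural networks operating on image frames and spectrograms, but the loop of inferring, comparing with the correct answer value, and updating is the same.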

In this way, the first inference model and the second inference model can each complete training as models that infer the performance time from visual data and audio data, respectively. Ideally, for visual data and audio data with the same time information, the first inference model and the second inference model infer the same performance time. However, the two models may also infer different performance times for such data. For example, if a performer sits on a chair and sings without moving, the performance time inferred from the visual data can be expected to be less accurate. Conversely, when a performer sings the chorus of the first verse and the chorus of the second verse, which sound alike, the performance time inferred from the audio data can be expected to be less accurate. When the two inferred performance times differ in this way, an artificial intelligence that ultimately infers a single performance time is needed. The third inference model plays this role.

Next is the third inference model training step (S130).

In this specification, the third inference model is an artificial intelligence model that receives the first inference time value and the second inference time value and infers time information. The third inference model learning unit 130 may set, as the third correct answer value, the performance time information corresponding both to the visual data used to infer the first inference time value and to the audio data used to infer the second inference time value. The third inference model learning unit 130 may input the first inference time value and the second inference time value into the third inference model and cause it to output inferred time information (hereinafter the "third inference time value"). The third inference model learning unit 130 may then train the third inference model until the difference between the output third inference time value and the third correct answer value is minimized.
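As an illustration of what a third model could look like, the sketch below fuses the two inference time values with a weighted sum and fits the weights by least squares, which minimizes the squared difference from the third correct answer value. The linear form and the closed-form solution are assumptions of this sketch; the specification does not specify the structure of the third inference model.

```python
def fit_fusion_weights(first_times, second_times, targets):
    """Least-squares weights (w1, w2) so that w1*t1 + w2*t2 approximates the target time."""
    a11 = sum(x * x for x in first_times)
    a12 = sum(x * y for x, y in zip(first_times, second_times))
    a22 = sum(y * y for y in second_times)
    b1 = sum(x * t for x, t in zip(first_times, targets))
    b2 = sum(y * t for y, t in zip(second_times, targets))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("the two estimates are collinear; weights are not unique")
    # Solve the 2x2 normal equations with Cramer's rule.
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


def fuse(w1, w2, first_time, second_time):
    """Third inference time value as a weighted combination of the two inputs."""
    return w1 * first_time + w2 * second_time
```

A learned weighting of this kind lets the fused estimate lean on whichever modality has been more reliable in the training data.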

Meanwhile, the difference between the first inference time value and the second inference time value may sometimes be excessive. For example, when a performer sits and sings a three-minute song, the first inference model may output, from the visual data, a first inference time value near the beginning of the song, while the second inference model may output, from the audio data, a second inference time value near the end of the song. When the two inference time values differ by more than a certain amount, the data may actually hinder training. Therefore, the third inference model learning unit 130 according to the present specification may train the third inference model when the difference between the received first inference time value and second inference time value is less than or equal to a preset reference difference value. The reference difference value may be set to various values, such as 1 second, 2 seconds, 3 seconds, or 5 seconds. It may also be set using the average of the differences between the first inference time values and the second inference time values.
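The filtering rule described above can be sketched as follows. The function names and the tuple layout are assumptions of this sketch; times are taken to be in seconds.

```python
def mean_abs_difference(first_times, second_times):
    """Average absolute gap between the two estimates, usable as a reference difference value."""
    diffs = [abs(a - b) for a, b in zip(first_times, second_times)]
    return sum(diffs) / len(diffs)


def select_training_triples(first_times, second_times, targets, max_diff):
    """Keep only samples whose two estimates differ by at most `max_diff` seconds."""
    return [
        (a, b, t)
        for a, b, t in zip(first_times, second_times, targets)
        if abs(a - b) <= max_diff
    ]
```

For example, with first inference time values of 10, 65, and 120 seconds, second inference time values of 11, 230, and 118 seconds, and a reference difference value of 3 seconds, the middle sample is excluded from training of the third inference model.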

The first, second, and third inference models trained in this way may be components of the performance time inference model. Accordingly, when shooting information of a live performance is input at the actual venue, the performance time inference model according to the present specification can infer which point of the overall performance the current scene corresponds to and output performance time information.

Here, subtitle information corresponding to the performance time information may be stored in memory in advance. Subtitle information is data that includes subtitle content and time information specifying which subtitle should be displayed at which point in the performance. The display control unit (not shown) according to the present specification may control the output so that the subtitle information corresponding to the performance time information inferred by the performance time inference model is displayed. The display control unit, the memory, and the performance time inference model may be components of a performance subtitle synchronization device.
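The lookup performed by the display control unit can be sketched as a search over subtitle cues sorted by start time. The cue layout of (start time in seconds, text) and the rule that a cue stays active until the next one begins are assumptions of this sketch; the specification only states that subtitle information contains content and time information.

```python
import bisect


def find_subtitle(cues, performance_time_s):
    """Return the text of the cue active at the inferred performance time.

    `cues` is a list of (start_seconds, text) tuples sorted by start time.
    Returns None if the inferred time precedes the first cue.
    """
    starts = [start for start, _ in cues]
    i = bisect.bisect_right(starts, performance_time_s) - 1
    return cues[i][1] if i >= 0 else None
```

Because the lookup depends only on the inferred performance time, a repeated or skipped passage simply moves the lookup to a different cue without any manual intervention.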

The first to third inference model learning units may include a processor, an application-specific integrated circuit (ASIC), other chipsets, logic circuits, registers, a communication modem, a data processing device, and the like known in the technical field to which the present invention belongs, in order to execute calculations and various control logic. In addition, when the above-described control logic is implemented in software, the first to third inference model learning units may be implemented as a set of program modules. In this case, the program modules may be stored in the memory and executed by the processor.

In order for the computer to read the program and execute the methods implemented by the program, the computer program may include code written in a computer language such as C/C++, C#, Java, Python, or machine language that the processor (CPU) of the computer can read through a device interface of the computer. Such code may include functional code related to functions that define the operations needed to execute the methods, and may include control code related to the execution procedure needed for the processor of the computer to execute those functions in a predetermined order. Such code may further include memory-reference code indicating at which location (address) in the internal or external memory of the computer any additional information or media needed for the processor to execute the functions should be referenced. In addition, when the processor of the computer needs to communicate with another remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer and what information or media should be transmitted and received during communication.

The storage medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short moment, such as a register, cache, or memory. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. That is, the program may be stored on various recording media on various servers that the computer can access, or on various recording media on the user's computer. The medium may also be distributed across computer systems connected by a network, so that computer-readable code is stored in a distributed manner.

Although embodiments of the present specification have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present specification belongs will understand that the present invention may be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

100: Preprocessing unit
110: First inference model learning unit
120: Second inference model learning unit
130: Third inference model learning unit

Claims

A performance time inference model learning method comprising:
(a) a step in which a processor generates visual data and audio data from performance shooting information at predetermined time intervals and adds performance time information to each item of visual data and audio data;
(b) a step in which the processor sets, for a first inference model that receives visual data and infers time information, the performance time information corresponding to each item of visual data as a first correct answer value, and trains the first inference model until the difference between the first inference time value output when each item of visual data is input and the first correct answer value is minimized;
(c) a step in which the processor sets, for a second inference model that receives audio data and infers time information, the performance time information corresponding to each item of audio data as a second correct answer value, and trains the second inference model until the difference between the second inference time value output when each item of audio data is input and the second correct answer value is minimized; and
(d) a step in which the processor sets, for a third inference model that receives the first inference time value and the second inference time value and infers time information, the performance time information corresponding to both the visual data used to infer the first inference time value and the audio data used to infer the second inference time value as a third correct answer value, and trains the third inference model until the difference between the third inference time value output when the first and second inference time values are input and the third correct answer value is minimized,
wherein, when generating the audio data in step (a), the processor generates the audio data by converting an audio signal in the performance shooting information into spectrogram-based visualized data.

(Deleted)

The performance time inference model learning method of claim 1, wherein, in step (a), the processor generates the visual data and the audio data for each section in which the RGB values of the visualized data change by more than a preset reference change amount.

The performance time inference model learning method of claim 1, wherein step (d) is executed when the difference between the first inference time value and the second inference time value input to the processor is less than or equal to a preset reference difference value.

A computer program recorded on a computer-readable recording medium and written to cause a computer to perform each step of the performance time inference model learning method according to any one of claims 1, 3, and 4.

A performance time inference model learning device comprising:
a memory storing performance shooting information;
a preprocessing unit that generates visual data and audio data from the performance shooting information at predetermined time intervals and adds performance time information to each item of visual data and audio data;
a first inference model learning unit that sets, for a first inference model that receives visual data and infers time information, the performance time information corresponding to each item of visual data as a first correct answer value, and trains the first inference model until the difference between the first inference time value output when each item of visual data is input and the first correct answer value is minimized;
a second inference model learning unit that sets, for a second inference model that receives audio data and infers time information, the performance time information corresponding to each item of audio data as a second correct answer value, and trains the second inference model until the difference between the second inference time value output when each item of audio data is input and the second correct answer value is minimized; and
a third inference model learning unit that sets, for a third inference model that receives the first inference time value and the second inference time value and infers time information, the performance time information corresponding to both the visual data used to infer the first inference time value and the audio data used to infer the second inference time value as a third correct answer value, and trains the third inference model until the difference between the third inference time value output when the first and second inference time values are input and the third correct answer value is minimized,
wherein, when generating the audio data, the preprocessing unit generates the audio data by converting an audio signal in the performance shooting information into spectrogram-based visualized data.

(Deleted)

The performance time inference model learning device of claim 6, wherein the preprocessing unit generates the visual data and the audio data for each section in which the RGB values of the visualized data change by more than a preset reference change amount.

The performance time inference model learning device of claim 6, wherein the third inference model learning unit trains the third inference model when the difference between the input first inference time value and second inference time value is less than or equal to a preset reference difference value.

A performance subtitle synchronization device comprising:
a performance time inference model trained by the performance time inference model learning device according to any one of claims 6, 8, and 9;
a memory storing subtitle information corresponding to performance time information; and
a display control unit that controls output of the subtitle information corresponding to the performance time information inferred by the performance time inference model.