KR20140115536A

KR20140115536A - Apparatus for editing of multimedia contents and method thereof

Info

Publication number: KR20140115536A
Application number: KR1020130030117A
Authority: KR
Inventors: 정찬의
Original assignee: 디노플러스 (주)
Priority date: 2013-03-21
Filing date: 2013-03-21
Publication date: 2014-10-01
Also published as: WO2014148665A3; WO2014148665A2; KR101493006B1

Abstract

The present invention relates to a device and a method for editing multimedia content that synchronize voice data and text data during production of multimedia content. The device comprises: a text object generation unit that sequentially separates input text data into paragraph, sentence, and word units and then generates word-unit text objects; a voice recognition unit that designates the sentence end position of input voice data, detects phoneme sections, and then performs voice recognition; a voice object generation unit that generates voice text objects from the voice data recognized by the voice recognition unit; and an automatic synchronization unit that synchronizes voice and text by comparing the text objects and the voice text objects by template matching. Because voice data and text data are synchronized automatically, synchronization time is reduced compared with existing manual operation, and the efficiency and accuracy of the synchronization operation are improved.

Description

Apparatus for editing multimedia contents and method thereof

The present invention relates to multimedia content editing, and more particularly, to a multimedia content editing apparatus and method that synchronize voice data and text data when producing multimedia content.

In general, a voice object and a text object need to be synchronized when producing multimedia content. In particular, synchronizing voice and text in educational multimedia content or in a karaoke lyrics service can raise educational efficiency or help a singer keep time. Here, the voice object (or voice text object) means voice data, and the text object means text data. Synchronization means matching voice with text.

A general method for synchronizing voice data and text data when producing multimedia content is as follows.

Text data and voice data are synchronized through the steps of visually displaying the text data and the voice data on a time axis, selecting a section of the voice data and listening to it, selecting the text identical to the voice heard, and storing the start time and end time of the heard voice data section as attribute values of the selected text. That is, a synchronization operator repeatedly selects a section to listen to based on the silent sections of the voice data, listens to it, and then maps it to the corresponding text data.

Meanwhile, a conventional technique for synchronizing text data and voice data is disclosed in Korean Laid-Open Patent Publication No. 1995-0030128 (published November 24, 1995).

The disclosed prior art visually conveys various musical information through changes in the presentation of subtitles on the monitor of a karaoke system; that is, it provides musical information visually through changes in the form of the song lyrics.

Korean Laid-Open Patent Publication No. 1995-0030128 (published November 24, 1995)

However, the general method of synchronizing a voice object and a text object described above has the disadvantage that synchronization takes as long as the operator needs to listen to all of the voice data.

In addition, the accuracy of synchronization under this general method varies with the operator's skill, and external factors such as the working conditions or the operator's mood increase the frequency of synchronization errors.

In addition, the prior art described above cannot synchronize voice data and text data.

An object of the present invention is to solve the problems described above by providing a multimedia content editing apparatus and method that synchronize voice data and text data when producing multimedia content.

Another object of the present invention is to provide a multimedia content editing apparatus and method that automatically synchronize voice data and text data when producing multimedia content, thereby reducing the time required for synchronization and increasing work efficiency.

Yet another object of the present invention is to provide a multimedia content editing apparatus and method that perform synchronization by carrying out voice recognition with the text data as the target, converting the recognized result into text, comparing the text data with the recognized text, and mapping identical text to each other.

To achieve the above objects, a multimedia content editing apparatus according to the present invention comprises: a text object generation unit that sequentially separates input text data into paragraph, sentence, and word units and then generates word-unit text objects; a voice recognition unit that designates the sentence end position of input voice data, detects phoneme sections, and then performs voice recognition; a voice object generation unit that generates voice text objects from the voice data recognized by the voice recognition unit; and an automatic synchronization unit that synchronizes voice and text by comparing the text objects and the voice text objects by template matching.

The multimedia content editing apparatus according to the present invention further comprises: a non-synchronization information generation unit that is connected to the automatic synchronization unit and generates non-synchronization information from text objects that were not synchronized; and a non-synchronization information display unit that visually displays the non-synchronization information generated by the non-synchronization information generation unit so that a user can synchronize it manually.

To achieve the above objects, a multimedia content editing method according to the present invention comprises: (a) generating word-unit text objects from input text data; (b) generating voice text objects from input voice data through voice recognition; and (c) generating templates of the text objects and the voice text objects and performing automatic synchronization by template matching.

The multimedia content editing method according to the present invention further comprises: (d) generating a text object template from the objects not synchronized in step (c), and generating and displaying non-synchronization information based on the text object template; and (e) storing the attributes of the synchronized objects.

Step (a) comprises: (a-1) separating the input text data into paragraphs; (a-2) separating each paragraph into sentences; and (a-3) separating each sentence into words and generating the word-unit text data as text objects.

Step (b) comprises: (b-1) designating the end position of a sentence in the input voice data; (b-2) automatically detecting phoneme sections in the sentence based on silent sections; (b-3) performing voice recognition with reference to the text object information to obtain recognized text; and (b-4) generating voice text objects from the obtained text.

Step (c) comprises: (c-1) generating a text template set composed of the word-unit text objects; (c-2) generating a voice text template set composed of the word set of the voice text objects resulting from voice recognition; (c-3) matching the text template set against the voice text template set; (c-4) detecting identical words from the template matching result; and (c-5) generating synchronization information from the detected identical words.

Step (d) comprises: (d-1) constructing a text object template from the text objects not synchronized in step (c); (d-2) generating attribute information for each object in the text object template; (d-3) displaying the unsynchronized section in color on the voice signal display screen; and (d-4) generating candidate points for dividing the voice word object within that section into a plurality of objects and displaying them as non-synchronization information.

In step (d-2), a template set composed of the text objects that were not automatically synchronized is generated by excluding the automatically synchronized objects from the text object template set; each object in the generated template set is given a unique sequential number; and attribute information is generated by designating an attribute value concerning the object immediately adjacent on its right.

According to the present invention, voice data and text data can be synchronized automatically, which shortens synchronization time compared with existing manual work.

In addition, automatic synchronization of voice data and text data minimizes the frequency of synchronization errors caused by external factors such as the operator's skill, the working conditions, and the operator's mood.

In addition, automatic synchronization improves the efficiency and accuracy of the synchronization work.

FIG. 1 is a block diagram of a multimedia content editing apparatus according to a preferred embodiment of the present invention.
FIG. 2 is a diagram explaining the silent sections used to separate voice data in the present invention.
FIG. 3 is a flowchart showing a multimedia content editing method according to a preferred embodiment of the present invention.
FIG. 4 is a flowchart of an embodiment of the text object generation step of FIG. 3.
FIG. 5 is a flowchart of an embodiment of the voice text object generation step of FIG. 3.
FIG. 6 is a first example of voice words separated into six objects in the present invention.
FIG. 7 is a second example of voice words separated into six objects in the present invention.
FIG. 8 is a third example of voice words separated into six objects in the present invention.
FIG. 9 is a fourth example of voice words separated into six objects in the present invention.
FIG. 10 is a diagram explaining automatic separation of candidate sections by the GUI method in the present invention.
FIG. 11 is a flowchart of an embodiment of the automatic synchronization step of FIG. 3.
FIG. 12 is a flowchart of an embodiment of the non-synchronization information generation and display step of FIG. 3.

Hereinafter, a multimedia content editing apparatus and method according to a preferred embodiment of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a multimedia content editing apparatus according to a preferred embodiment of the present invention.

The multimedia content editing apparatus according to the preferred embodiment of the present invention includes a text input unit (10), a text object generation unit (20), a voice input unit (30), a voice recognition unit (40), a voice object generation unit (50), an automatic synchronization unit (60), a non-synchronization information generation unit (70), and a non-synchronization information display unit (80).

The text input unit (10) receives text data, and the text object generation unit (20) sequentially separates the text data input through the text input unit (10) into paragraph, sentence, and word units and then generates word-unit text objects.

The voice input unit (30) receives voice data, and the voice recognition unit (40) designates the sentence end position of the voice data input through the voice input unit (30), detects phoneme sections, and then performs voice recognition.

The voice object generation unit (50) generates voice text objects from the voice data recognized by the voice recognition unit (40), and the automatic synchronization unit (60) synchronizes voice and text by comparing the text objects and the voice text objects by template matching.

The non-synchronization information generation unit (70) is connected to the automatic synchronization unit (60) and generates non-synchronization information from the text objects that were not synchronized, and the non-synchronization information display unit (80) visually displays the non-synchronization information generated by the non-synchronization information generation unit (70) so that the user can synchronize it manually.

FIG. 3 is a flowchart showing a multimedia content editing method according to a preferred embodiment of the present invention, in which S denotes a step.

The multimedia content editing method according to the preferred embodiment of the present invention comprises: (a) generating word-unit text objects from input text data (S10); (b) generating voice text objects from input voice data through voice recognition (S20); (c) generating templates of the text objects and the voice text objects and performing automatic synchronization by template matching (S30); (d) generating a text object template from the objects not synchronized in step (c), and generating and displaying non-synchronization information based on the text object template (S40); and (e) storing the attributes of the synchronized objects (S50).

Objects for which voice data and text data were synchronized in step (c) proceed directly to step (e) without passing through step (d).

As shown in FIG. 4, step (a) comprises: (a-1) separating the input text data into paragraphs (S11 to S12); (a-2) separating each paragraph into sentences (S13); and (a-3) separating each sentence into words and generating the word-unit text data as text objects (S14).

As shown in FIG. 5, step (b) comprises: (b-1) designating the end position of a sentence in the input voice data (S21 to S22); (b-2) automatically detecting phoneme sections in the sentence based on silent sections (S23); (b-3) performing voice recognition with reference to the text object information to obtain recognized text (S24); and (b-4) generating voice text objects from the obtained text (S25).

As shown in FIG. 11, step (c) comprises: (c-1) generating a text template set composed of the word-unit text objects (S31); (c-2) generating a voice text template set composed of the word set of the voice text objects resulting from voice recognition (S32); (c-3) matching the text template set against the voice text template set (S33); (c-4) detecting identical words from the template matching result (S34); and (c-5) generating synchronization information from the detected identical words (S35).

As shown in FIG. 12, step (d) comprises: (d-1) constructing a text object template from the text objects not synchronized in step (c) (S41); (d-2) generating attribute information for each object in the text object template (S42); (d-3) displaying the unsynchronized section in color on the voice signal display screen (S43); and (d-4) generating candidate points for dividing the voice word object within that section into a plurality of objects and displaying them as non-synchronization information (S44).

Hereinafter, the multimedia content editing apparatus and method according to the preferred embodiment of the present invention are described in detail with reference to the accompanying FIGS. 1 to 12.

First, to produce multimedia content by synchronizing text and voice, the text input unit (10) receives text data and the voice input unit (30) receives voice.

Here, the text input unit (10) may be a keyboard or a part that receives text data extracted from a specific text file. Likewise, the voice input unit (30) may be a microphone for voice signal input, or may take voice data extracted from a specific voice file.

The text data input through the text input unit (10) is transmitted to the text object generation unit (20), which generates text objects for synchronization from the input text data (S10 to S20).

For example, to generate the text objects, as shown in FIG. 4, the input text data is first separated into paragraphs in steps S11 to S12. Paragraph separation searches for a period starting from the first text; if no further data is found within a preset data interval after the period, which is used to identify paragraphs, the text up to that point is judged to be a paragraph.

Next, in step S13, each separated paragraph is separated into sentences. Sentences are separated on the basis of periods. For example, the span from the first text detected in a paragraph to the first period detected after it is separated as one sentence. Likewise, the span from the first text detected after that period to the next period is separated as another sentence.

Finally, in step S14, each separated sentence is separated into word units, and the word-unit text data is generated as text objects. A word boundary is found where the run of text stops: the text up to the character just before the point at which no text is detected is separated as one word. In other words, a text object is a separated word unit. The term "word unit" is used because the separated unit may be a bare word or a word that includes a postposition (particle) or the like.
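The paragraph, sentence, and word separation of steps S11 to S14 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names `TextObject` and `make_text_objects` are hypothetical, a blank line stands in for the preset data interval that marks a paragraph boundary, and whitespace stands in for the point at which no text is detected between words.

```python
import re
from dataclasses import dataclass


@dataclass
class TextObject:
    """Hypothetical word-unit text object (field names are assumptions)."""
    paragraph: int  # index of the paragraph the word came from
    sentence: int   # index of the sentence within that paragraph
    text: str       # the word unit itself, e.g. a word plus its particle


def make_text_objects(text: str) -> list[TextObject]:
    objects = []
    # (a-1) Assumption: a blank line marks the gap between paragraphs.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        # (a-2) A sentence runs from the first text up to the next period.
        chunks = re.findall(r"[^.]+\.?", paragraph)
        sentences = [c.strip() for c in chunks if c.strip()]
        for s_idx, sentence in enumerate(sentences):
            # (a-3) Assumption: whitespace separates word units.
            for word in sentence.rstrip(".").split():
                objects.append(TextObject(p_idx, s_idx, word))
    return objects
```

Applied to the sentence "신데렐라는 호박 마차를 타고 궁전으로 갑니다.", this sketch yields six word-unit objects, each keeping its particle attached.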

The text objects generated from the text data through this process are transmitted to the automatic synchronization unit (60).

Meanwhile, the voice recognition unit (40) generates voice text objects from the voice data input through the voice input unit (30) (S20).

As shown in FIG. 5, to generate the voice text objects, the end position of each sentence is designated in the input voice data in steps S21 and S22, and in step S23 phoneme sections within the sentence are detected automatically based on the silent sections of FIG. 2. As FIG. 2 shows, ordinary voice data contains sections in which a voice signal is present and, between them, silent sections in which there is no voice signal. Phoneme sections can therefore be detected automatically on the basis of these silent sections.
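One simple way to locate sections bounded by silence is an amplitude threshold, sketched below. The text does not specify how silent sections are measured, so everything here is an assumption made for illustration: the function name `find_voiced_segments`, the `threshold` and `min_silence` parameters, and the use of raw sample indices rather than times.

```python
def find_voiced_segments(samples, threshold=0.02, min_silence=3):
    """Return (start, end) sample-index pairs of voiced runs, end exclusive.

    Assumption: a silent section is a run of at least `min_silence`
    consecutive samples whose absolute amplitude is below `threshold`.
    Shorter quiet gaps are treated as part of the surrounding voice.
    """
    segments = []
    start = None  # start index of the voiced run in progress, if any
    quiet = 0     # length of the quiet run seen since the last loud sample
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # The voiced run ended where the quiet run began.
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # Signal ended during a voiced run; drop any trailing quiet samples.
        segments.append((start, len(samples) - quiet))
    return segments
```

A real system would more likely work on short-time energy over frames than on individual samples, but the boundary logic is the same.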

Then, in step S24, voice recognition is performed with reference to the text object information, and the recognized text is obtained. Here, voice recognition may use conventional methods such as dynamic time warping (DTW), the hidden Markov model (HMM), continuous speech recognition using a distributed neural network, or proposed-word speech recognition.

Next, in step S25, the voice object generation unit (50) generates voice text objects from the text obtained through voice recognition in the voice recognition unit (40) and transmits them to the automatic synchronization unit (60).

The automatic synchronization unit (60) synchronizes the text objects and the voice text objects in step S30.

Here, the text objects and voice text objects are first synchronized automatically by template matching; the text data and voice data that could not be synchronized automatically are then synchronized through a GUI-based synchronization process.

A text object (word) and a voice object (word) are designated as synchronization targets mapped to each other if the two objects are the same text, that is, the same word. Whether they are the same word can be judged by comparing whether the characters representing the two objects are identical. A text object can itself be expressed as characters, but a voice object must be converted into characters through a separate process, which voice recognition makes possible. Therefore, whether two objects are synchronized is decided by comparing the character string of the text object with the character string obtained by recognizing the voice object and judging whether they are identical.

More specifically, as shown in FIG. 11, in step S31 a text template set is generated from the input word-unit text objects. Here, a template is a frame constructed so that text objects and voice text objects can easily be compared with each other.

Because the text data must be the reference, a template is generated that represents the text sentence as the set of strings of its word-unit text objects. That is, the text template set A is generated as A = {aaa, bbb, ccc, ...}.

Next, in step S32, a voice text template set B is generated for the voice words resulting from voice recognition. Here, voice text is the voice recognition result converted into word-unit strings, and the voice text template set B is generated as B = {a'a'a', b'b'b', c'c'c', ...}.

Then, in step S33, a template matching process is performed that matches the voice word set B against template set A as the reference.

The strings in the two sets are compared by comparing each element (string) of set B with each element (string) of template set A one-to-one, in order. That is, the set of template matching elements T is determined as T = {(aaa, a'a'a'), (aaa, b'b'b'), ..., (bbb, b'b'b'), (bbb, c'c'c'), ...}, and the elements are compared sequentially; when the two object strings are identical, that is, when (aaa) = (a'a'a'), the two objects are judged to be completely matched. For example, the text sentence "신데렐라는 호박 마차를 타고 궁전으로 갑니다." ("Cinderella rides a pumpkin carriage to the palace.") is expressed as the text template set A = {신데렐라는, 호박, 마차를, 타고, 궁전으로, 갑니다}, which consists of six word-unit objects. If the result of recognizing the voice sentence corresponding to this text sentence is as shown in FIG. 6, and it is expressed as the word-unit string set B = {신데렐라는, 호박, 마차를, 타고, 궁전으로, 갑니다}, template comparison yields six matched results, T = {(신데렐라는, 신데렐라는), (호박, 호박), (마차를, 마차를), (타고, 타고), (궁전으로, 궁전으로), (갑니다, 갑니다)}; since all objects match, synchronization of the text sentence and the voice sentence is complete.
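The in-order, one-to-one comparison can be sketched as a short Python function. This is one possible reading of the comparison order, offered as an illustration only: the function name `template_match` is hypothetical, and the choice to resume scanning B just after the previous match (so that matched pairs never cross) is an assumption drawn from the order in which the pairs of T are listed.

```python
def template_match(text_words, voice_words):
    """Return (i, j) index pairs where text_words[i] == voice_words[j].

    Each text word is compared, in order, against the voice words that
    follow the previous match, so matched pairs never cross each other.
    """
    matches = []
    next_j = 0  # first voice word not yet consumed by a match
    for i, a in enumerate(text_words):
        for j in range(next_j, len(voice_words)):
            if a == voice_words[j]:  # identical strings: fully matched
                matches.append((i, j))
                next_j = j + 1
                break
    return matches


A = ["신데렐라는", "호박", "마차를", "타고", "궁전으로", "갑니다"]
B = ["신데렐라는", "호박", "마차를", "타고", "궁전으로", "갑니다"]
matches = template_match(A, B)
# Synchronization is complete when every text word found its voice word.
print(len(matches) == len(A))
```

Because the scan is greedy, a misrecognized voice word that happens to equal a later text word could consume that match early; a production matcher would need a more careful alignment.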

Also, when speech recognition errors produce the voice text object shown in FIG. 7, the two sets are expressed as A = {신데렐라는, 호박, 마차를, 타고, 궁전으로, 갑니다} and B = {신데렐라는, 호박 마차를, 타고, 궁전으로, 갑니다}, and four matching objects are obtained: T = {(신데렐라, 신데렐라), (타고, 타고), (궁전으로, 궁전으로), (갑니다, 갑니다)}.

At this time, for the unmatched objects, the asynchronous information generation unit 70 generates a template A' as the set of unmatched text objects and also generates the set B' of unmatched voice text objects. The sets of unmatched objects are expressed as follows.

A' = {호박, 마차를}, B' = {호박 마차를}

Likewise, when speech recognition errors produce the voice text object shown in FIG. 8, the sets of unmatched objects are expressed as follows.

A' = {마차를, 타고}, B' = {마차를 타고}
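As an illustration of how the unmatched sets A' and B' might be derived, the sketch below uses Python's standard `difflib.SequenceMatcher` to find the ordered matches and then keeps whatever is left over. This is a swapped-in, general-purpose sequence comparison, not the matching procedure of the specification, and the function name `unmatched_sets` is hypothetical. The FIG. 8 recognition result is not spelled out in the text, so the list used for it here is an assumption inferred from the sets given above.

```python
from difflib import SequenceMatcher


def unmatched_sets(template_a: list[str], speech_b: list[str]) -> tuple[list[str], list[str]]:
    """Return (A', B'): the words of A and of B that found no exact match."""
    matcher = SequenceMatcher(None, template_a, speech_b, autojunk=False)
    matched_a, matched_b = set(), set()
    for block in matcher.get_matching_blocks():
        # Each block says template_a[a:a+size] == speech_b[b:b+size].
        matched_a.update(range(block.a, block.a + block.size))
        matched_b.update(range(block.b, block.b + block.size))
    a_prime = [w for i, w in enumerate(template_a) if i not in matched_a]
    b_prime = [w for j, w in enumerate(speech_b) if j not in matched_b]
    return a_prime, b_prime


A = ["신데렐라는", "호박", "마차를", "타고", "궁전으로", "갑니다"]

# FIG. 7: "호박" and "마차를" were recognized as a single speech word.
B_fig7 = ["신데렐라는", "호박 마차를", "타고", "궁전으로", "갑니다"]
# FIG. 8 (assumed): "마차를" and "타고" were recognized as a single speech word.
B_fig8 = ["신데렐라는", "호박", "마차를 타고", "궁전으로", "갑니다"]

a7, b7 = unmatched_sets(A, B_fig7)
a8, b8 = unmatched_sets(A, B_fig8)
```

In both cases A' holds two text objects and B' holds one merged speech word, which is the situation the manual GUI step is meant to resolve.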

Through this process, template matching is performed in step S33, identical words are detected in step S34, and synchronization attribute information is generated and stored in the internal memory in step S35.

The text data and the voice data that are the targets of media synchronization originally contain the same characters (strings). That is, the characters (strings) of the text objects and the voice text objects, each separated into word units, should in principle be exactly identical, and the comparison should be judged with the text objects as the reference. If a perfect speech recognition engine existed, the speech-recognized objects and the text objects would match completely.

In an embodiment of the present invention, a limited-vocabulary speech recognition method is applied. Because recognition is restricted to the words contained in the text data, the recognition rate is very high. Speech recognition technology still has limits, however, so the recognition rate for a particular object may be extremely low, or recognition may fail altogether. Moreover, when the voice data is first separated into phoneme units at silence intervals for word-unit speech recognition, errors in separating the speech words also cause misrecognition.

For these reasons, a method is needed to compensate when the text objects and the speech-recognized objects do not match, and the present invention proposes a compensation method based on template matching.

For example, in step S40, asynchronous information is generated and displayed so that only the unmatched sets described above are synchronized through the GUI.

To this end, as shown in FIG. 12, a template of the text objects that were not automatically synchronized is generated in step S41, and template object attribute information is generated in step S42.

For example, in generating the template set A' of objects that were not automatically synchronized, each such object is assigned a unique sequential number and an attribute value that indicates whether the object immediately to its right is an automatically synchronized object. An attribute value of 0 means that the object to the right was automatically synchronized. A nonzero value is the unique sequential number of the object to the right, which means that the object to the right was also not automatically synchronized. That is, the template A' has attribute values comprising the number of unsynchronized objects N (N = 1, 2, ..., n) and their section information C(N). Here, C(N) = 0 means that the object to the right of unsynchronized object N is a synchronized object, and C(N) = k (k > N) means that the object to the right is also unsynchronized and that its unique number is k.

Therefore, in the case of FIG. 7, A' = {호박, 마차를}, so the number of objects that were not automatically synchronized is N = 2, with A'(1) = {호박} and A'(2) = {마차를}, and the attribute values are C(1) = 2 and C(2) = 0. Taking these attribute values into account, the relationship among A'(1) = {호박}, A'(2) = {마차를}, and B' = {호박 마차를} shows that the speech word object B' = {호박 마차를} must be separated into two. The synchronization process then proceeds through the GUI on the basis of the attribute information C(N) of A'.
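The numbering and section information described above can be sketched as follows. The function names `section_info` and `run_length`, and the representation of unsynchronized objects by their word positions in A, are assumptions made for illustration. The sketch also assumes C(N) = 0 when the unsynchronized object is the last word of the sentence, a case the text does not address.

```python
def section_info(unmatched_positions: list[int]) -> dict[int, int]:
    """Map each unsynchronized object's number N (1-based) to C(N).

    unmatched_positions holds the 0-based positions in the text template A
    of the words that were not automatically synchronized.
    C(N) = 0 -> the word to the right was synchronized (or does not exist).
    C(N) = k -> the word to the right is unsynchronized object number k.
    """
    ordered = sorted(unmatched_positions)
    number_of = {pos: n for n, pos in enumerate(ordered, start=1)}
    return {n: number_of.get(pos + 1, 0) for pos, n in number_of.items()}


def run_length(c: dict[int, int], start: int) -> int:
    """Count the consecutive unsynchronized objects beginning at number `start`.

    This is the number of pieces the merged speech word covering the run
    has to be separated into.
    """
    length, n = 1, start
    while c[n] != 0:
        n = c[n]
        length += 1
    return length


# FIG. 7: "호박" (position 1) and "마차를" (position 2) were not synchronized.
C = section_info([1, 2])
pieces = run_length(C, 1)
```

For FIG. 7 this gives C(1) = 2 and C(2) = 0, so the run starting at object 1 has length 2 and the single speech word must be separated into two.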

For the GUI-based synchronization process, in step S43 the sections that were not automatically synchronized by template matching are displayed in color on the voice signal display screen. Displaying the regions that were not automatically synchronized in color draws the operator's attention so that they can be identified at a glance.

In step S44, candidate points p for dividing the speech word object into n pieces within the section are generated. For example, the speech word object {(호박 마차를)} of FIG. 7 must be divided into two candidate object regions as in FIG. 6, so one separation candidate point p is generated as shown in FIG. 10, and the candidate voice object regions are displayed on the screen.
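The text does not state how the position of a candidate point p is chosen. The sketch below assumes one simple heuristic, placing the points in proportion to the character counts of the unsynchronized text objects, and uses hypothetical section boundaries in milliseconds; the function names are also hypothetical. The operator would still confirm or adjust each region by listening.

```python
def candidate_points(start_ms: int, end_ms: int, text_objects: list[str]) -> list[int]:
    """Return len(text_objects) - 1 candidate split points inside the section.

    Assumed heuristic: each text object receives a share of the section
    proportional to its character count. Integer arithmetic keeps the
    points on whole milliseconds.
    """
    total = sum(len(t) for t in text_objects)
    points, consumed = [], 0
    for text in text_objects[:-1]:
        consumed += len(text)
        points.append(start_ms + (end_ms - start_ms) * consumed // total)
    return points


def candidate_regions(start_ms: int, end_ms: int, points: list[int]) -> list[tuple[int, int]]:
    """Turn the split points into (start, end) candidate voice object regions."""
    bounds = [start_ms, *points, end_ms]
    return list(zip(bounds[:-1], bounds[1:]))


# Hypothetical section occupied by the merged speech word "호박 마차를".
points = candidate_points(1000, 2000, ["호박", "마차를"])
regions = candidate_regions(1000, 2000, points)
```

Two text objects produce one candidate point and two candidate regions, matching the FIG. 7 example in which a single point p is generated.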

With the unsynchronized sections displayed and the candidate points generated in this way, the operator clicks a candidate voice object region to listen to it and, if what is heard corresponds to the same text, clicks the text object region with the mouse. When the text object region is selected with the mouse, the candidate voice object and the clicked text object are synchronized.

The synchronization information produced in this way is stored in the internal memory together with the automatic synchronization information. When multimedia content produced with this synchronization information is played back, the voice data and the text data are synchronized.
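One possible way to hold the stored synchronization information is sketched below. The record layout (`SyncEntry` and its fields), the function names, and all timings are hypothetical; the specification only states that manually and automatically produced synchronization information are stored together and used at playback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncEntry:
    word_index: int  # position of the text object in the sentence
    text: str        # the text object (word)
    start_ms: int    # start of the matching voice region
    end_ms: int      # end of the matching voice region (exclusive)
    source: str      # "auto" (template matching) or "manual" (GUI)


def merge_sync_info(auto_entries, manual_entries):
    """Combine both kinds of entries into one table ordered by word position."""
    return sorted([*auto_entries, *manual_entries], key=lambda e: e.word_index)


def word_at(entries, t_ms):
    """Return the word whose voice region contains playback time t_ms, if any."""
    for entry in entries:
        if entry.start_ms <= t_ms < entry.end_ms:
            return entry.text
    return None


# Hypothetical timings for the FIG. 7 example.
auto = [
    SyncEntry(0, "신데렐라는", 0, 1000, "auto"),
    SyncEntry(3, "타고", 2000, 2500, "auto"),
    SyncEntry(4, "궁전으로", 2500, 3300, "auto"),
    SyncEntry(5, "갑니다", 3300, 4000, "auto"),
]
manual = [
    SyncEntry(1, "호박", 1000, 1400, "manual"),
    SyncEntry(2, "마차를", 1400, 2000, "manual"),
]

sync_table = merge_sync_info(auto, manual)
current_word = word_at(sync_table, 1500)
```

During playback, a lookup such as `word_at` would let a player highlight the word that corresponds to the current audio position.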

Although the invention made by the present inventor has been described in detail with reference to the above embodiment, the present invention is not limited to that embodiment, and it goes without saying that various modifications may be made without departing from its gist.

The present invention is applicable to techniques for producing multimedia content by automatically synchronizing text data and voice data. In particular, it can be applied effectively to the production of educational multimedia content.

10 ... Text input unit
20 ... Text object generation unit
30 ... Voice input unit
40 ... Voice recognition unit
50 ... Voice object generation unit
60 ... Automatic synchronization unit
70 ... Asynchronous information generation unit
80 ... Asynchronous information display unit

Claims

A text object generation unit that sequentially separates input text data into paragraph, sentence, and word units and then generates text objects in word units;
a voice recognition unit that designates a sentence end position in input voice data, detects phoneme sections, and then recognizes the voice;
a voice object generation unit that generates a voice text object from the voice data recognized by the voice recognition unit;
and an automatic synchronization unit that synchronizes the voice and the text by comparing the text object and the voice text object by template matching.

The apparatus of claim 1, further comprising: an asynchronous information generation unit connected to the automatic synchronization unit to generate non-synchronized text objects as asynchronous information; and an asynchronous information display unit for visually displaying the asynchronous information generated by the asynchronous information generation unit so that a user can synchronize it manually.

(a) generating, by a text object generation unit, text objects in word units from input text data;
(b) generating, by a voice object generation unit, a voice text object from input voice data through speech recognition; and
(c) generating, in an automatic synchronization unit, templates of the text object and the voice text object, and performing automatic synchronization by template matching.

The method of claim 3, further comprising: (d) generating a text object template for objects not synchronized in step (c), generating and displaying asynchronization information based on the text object template; and (e) storing attributes of the synchronized object.

The method of claim 3 or 4, wherein the step (a) comprises the steps of: (a-1) separating input text data into paragraphs; (a-2) separating each separated paragraph into sentences; (a-3) separating each separated sentence into words, and generating text data of the separated word units as a text object.

The method of claim 3 or 4, wherein the step (b) comprises the steps of: (b-1) designating an end position of a sentence in the input voice data; (b-2) automatically detecting phoneme sections based on silence sections in the sentence; (b-3) obtaining speech-recognized text by performing speech recognition with reference to text object information; and (b-4) generating the obtained text as a voice text object.

The method of claim 3 or 4, wherein the step (c) comprises the steps of: (c-1) generating a set of text templates composed of the text objects in word units; (c-2) generating a set of voice text templates composed of the words in the voice text object produced by speech recognition; (c-3) matching the set of text templates with the set of voice text templates; (c-4) detecting identical words from the template matching result; and (c-5) generating the detected identical words as synchronization information.

The method of claim 4, wherein the step (d) comprises the steps of: (d-1) constructing a text object template from the text objects not synchronized in the step (c); (d-2) generating attribute information for each object included in the text object template; (d-3) displaying the non-synchronized section in color on the voice signal display screen; and (d-4) generating candidate points for dividing the speech word object in the section into a plurality of objects and displaying the candidate points as asynchronous information.

The method of claim 8, wherein the step (d-2) comprises: generating a template set composed of the text objects that remain unsynchronized after the automatically synchronized objects are excluded from the set of text object templates; and generating the attribute information by assigning a sequential number to each object and designating an attribute value for the object immediately adjacent to it.