KR102140438B1

KR102140438B1 - Method of mapping text data onto audio data for synchronization of audio contents and text contents and system thereof

Info

Publication number: KR102140438B1
Application number: KR1020130108730A
Authority: KR
Inventors: 김무중; 김준수
Original assignee: 주식회사 청담러닝
Priority date: 2013-09-10
Filing date: 2013-09-10
Publication date: 2020-08-04
Also published as: KR20150029846A

Abstract

A method of mapping text data to audio data includes: extracting utterance time information of the audio data; dividing the text data into sentence units; extracting pause time information of the audio data from the audio data based on the divided text data; extracting phonemes included in the divided text data and calculating an utterance section ratio of the text data; and mapping the text data to the audio data based on the extracted utterance time information and pause time information of the audio data and the utterance section ratio of the text data.

Description

METHOD OF MAPPING TEXT DATA ONTO AUDIO DATA FOR SYNCHRONIZATION OF AUDIO CONTENTS AND TEXT CONTENTS AND SYSTEM THEREOF

The present invention relates to a system and method for mapping text data to audio data and, more specifically, to a technique for mapping text data to audio data based on time information of the audio data and ratio information of the text data in order to synchronize audio content and text content.

Techniques for mapping text data to audio data are used in the field of language learning to synchronize audio content and text content and thereby increase learning efficiency. They map text data divided into sentence units onto audio data that includes blank sections. Because such techniques consider only the blank sections of the audio data and the sentence-level division of the text data, the audio data and the text data may fail to match within the utterance sections.

For example, Published Patent Application No. 2012-0129015, "Language Content Generation Method and Terminal Therefor," maps text data to audio data by matching text data pre-divided into sentence units one-to-one with audio data that includes blank sections obtained by waveform analysis. Because that publication does not consider the utterance sections of the audio data and the text data in the mapping process, the audio data and the text data may be mismatched within the utterance sections.

Accordingly, a mapping technique that takes into account the utterance section information of both the audio data and the text data is required.

Embodiments of the present invention provide a method, apparatus and system for mapping text data to audio data in consideration of the utterance section information of the audio data and the text data.

Embodiments of the present invention also provide a method, apparatus and system that, in considering the utterance section information of the audio data and the text data, use the utterance time of the sentences included in the audio data and the text data, of the words included in each sentence, or of the phonemes included in each word.

Embodiments of the present invention further provide a method, apparatus and system for mapping text data to audio data adaptively to changes in the playback speed of the audio data.

A method of mapping text data to audio data according to an embodiment of the present invention includes: extracting utterance time information of the audio data; dividing the text data into sentence units; extracting pause time information of the audio data from the audio data based on the divided text data; extracting phonemes included in the divided text data and calculating an utterance section ratio of the text data; and mapping the text data to the audio data based on the extracted utterance time information and pause time information of the audio data and the utterance section ratio of the text data.

The extracting of the utterance time information of the audio data may include: calculating at least one of a zero crossing rate (ZCR) or energy of the audio data in units of preset frames; classifying the frames included in the audio data as at least one of a voiced frame, an unvoiced frame or a silence frame based on the calculation result; and obtaining the utterance time information of the audio data per sentence according to the classification result.

The extracting of the pause time information of the audio data may include: setting pause sections of the audio data based on pause section information of the divided text data; and obtaining start time information and end time information of the set pause sections of the audio data.

The setting of the pause sections of the audio data may include: detecting candidate pause sections of the audio data; comparing the detected candidate pause sections with a preset reference pause section; and selecting the pause sections of the audio data from among the candidate pause sections based on the comparison result.

The calculating of the utterance section ratio of the text data may include: applying a preset fixed ratio to the extracted phonemes; and generating the utterance section ratio of the text data per sentence based on the phonemes to which the preset fixed ratio has been applied.

The mapping of the text data to the audio data may include: calculating utterance time information of the text data corresponding to the utterance time information of the audio data by applying the utterance section ratio of the text data to the utterance time information of the audio data; and generating pause time information of the text data corresponding to the pause time information of the audio data.

The dividing of the text data into sentence units may be a step of dividing the text data into sentence units based on symbol information inserted in advance so that each sentence included in the text data can be distinguished.

The method of mapping text data to audio data may further include applying at least one of pre-emphasis, DC component removal or noise removal to the audio data.

The method of mapping text data to audio data may further include generating audio/text data as a result of mapping the text data to the audio data.

A system for mapping text data to audio data according to an embodiment of the present invention includes: an utterance time information extraction unit that extracts utterance time information of the audio data; a sentence division unit that divides the text data into sentence units; a pause time information extraction unit that extracts pause time information of the audio data from the audio data based on the divided text data; an utterance section ratio calculation unit that extracts phonemes included in the divided text data and calculates an utterance section ratio of the text data; and a mapping unit that maps the text data to the audio data based on the extracted utterance time information and pause time information of the audio data and the utterance section ratio of the text data.

The utterance time information extraction unit may include: a calculation unit that calculates at least one of a zero crossing rate (ZCR) or energy of the audio data in units of preset frames; a classification unit that classifies the frames included in the audio data as at least one of a voiced frame, an unvoiced frame or a silence frame based on the calculation result; and an acquisition unit that obtains the utterance time information of the audio data per sentence according to the classification result.

The pause time information extraction unit may include: a setting unit that sets pause sections of the audio data based on pause section information of the divided text data; and an acquisition unit that obtains start time information and end time information of the set pause sections of the audio data.

The utterance section ratio calculation unit may include: an application unit that applies a preset fixed ratio to the extracted phonemes; and a generation unit that generates the utterance section ratio of the text data per sentence based on the phonemes to which the preset fixed ratio has been applied.

The mapping unit may include: a calculation unit that calculates utterance time information of the text data corresponding to the utterance time information of the audio data by applying the utterance section ratio of the text data to the utterance time information of the audio data; and a generation unit that generates pause time information of the text data corresponding to the pause time information of the audio data.

The system for mapping text data to audio data may further include a pre-processing unit that applies at least one of pre-emphasis, DC component removal or noise removal to the audio data.

The system for mapping text data to audio data may further include an audio/text data generation unit that generates audio/text data as a result of mapping the text data to the audio data.

Embodiments of the present invention can provide a method, apparatus and system for mapping text data to audio data in consideration of the utterance section information of the audio data and the text data.

Embodiments of the present invention can also provide a method, apparatus and system that, in considering the utterance section information of the audio data and the text data, use the utterance time of the sentences included in the audio data and the text data, of the words included in each sentence, or of the phonemes included in each word.

Embodiments of the present invention can further provide a method, apparatus and system for mapping text data to audio data adaptively to changes in the playback speed of the audio data.

FIG. 1 is a flowchart illustrating a method of mapping text data to audio data according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating in detail the step of extracting the utterance time information of the audio data shown in FIG. 1.
FIG. 3 is a flowchart illustrating in detail the step of extracting the pause time information of the audio data shown in FIG. 1.
FIG. 4 is a flowchart illustrating in detail the step of setting the pause sections of the audio data shown in FIG. 3.
FIG. 5 is a flowchart illustrating in detail the step of calculating the utterance section ratio of the text data shown in FIG. 1.
FIG. 6 is a flowchart illustrating in detail the step of mapping the text data to the audio data shown in FIG. 1.
FIG. 7 is a diagram illustrating a process of mapping text data to audio data according to an embodiment of the present invention.
FIG. 8 is a block diagram illustrating a system for mapping text data to audio data according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the present invention is not limited or restricted by these embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 is a flowchart illustrating a method of mapping text data to audio data according to an embodiment of the present invention.

Referring to FIG. 1, the system for mapping text data to audio data according to an embodiment of the present invention performs pre-processing that applies at least one of pre-emphasis, DC component removal or noise removal to the audio data (110).

The system then extracts the utterance time information of the audio data (120).

The system divides the text data into sentence units (130).

The system extracts the pause time information of the audio data from the audio data based on the divided text data (140). Here, the system may extract the pause time information based on the number of pieces of text data obtained by the sentence-level division.

The system extracts the phonemes included in the divided text data and calculates the utterance section ratio of the text data (150). Here, the system may extract the words included in the sentence-divided text data and the phonemes included in those words to calculate the utterance section ratio.

Based on the utterance time information of the audio data, the pause time information of the audio data and the utterance section ratio of the text data, the system maps the text data to the audio data (160).

As a result of mapping the text data to the audio data, the system generates audio/text data (170).

Each of these steps is described in detail below.
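As a rough illustration of the pre-processing in step 110, the sketch below applies DC component removal followed by pre-emphasis to a list of samples. The function names and the coefficient `alpha=0.97` are assumptions chosen for illustration (0.97 is a commonly used pre-emphasis coefficient); the description does not specify them, and noise removal is left out.

```python
def remove_dc(samples):
    """Subtract the mean so the waveform is centred on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def pre_emphasis(samples, alpha=0.97):
    """First-order filter y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value."""
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor and is kept as-is
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out


def preprocess(samples, alpha=0.97):
    """Hypothetical step 110: DC removal, then pre-emphasis."""
    return pre_emphasis(remove_dc(samples), alpha)
```

Removing the DC offset first matters for the later zero-crossing measurement, since a constant offset would shift the level around which sign changes are counted.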

FIG. 2 is a flowchart illustrating in detail the step of extracting the utterance time information of the audio data shown in FIG. 1.

Referring to FIG. 2, the system for mapping text data to audio data according to an embodiment of the present invention may calculate at least one of the zero crossing rate (ZCR) or the energy of the audio data in units of preset frames (210). The ZCR calculation may be performed by measuring, for each frame included in the audio data, how much the signal changes about the value '0', and then comparing the measured ZCR values of the frames relative to one another. Likewise, the energy calculation may be performed by extracting the energy of each frame included in the audio data and comparing the extracted energy values of the frames relative to one another.
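The per-frame measurements of step 210 can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names are hypothetical, frames are assumed to be non-overlapping, and energy is taken as the mean squared amplitude, which the description does not prescribe.

```python
def split_frames(samples, frame_len):
    """Cut the signal into non-overlapping frames, dropping a short tail."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame (an assumed definition of energy)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)
```

A rapidly alternating frame such as `[1, -1, 1, -1]` gives the maximum ZCR of 1.0, while a frame that never changes sign gives 0.0.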

Based on the calculation results described above, the system for mapping text data to audio data may classify each frame included in the audio data as at least one of a voiced frame, an unvoiced frame or a silence frame (220). For example, for audio data of '_strick_', the system may calculate and compare the ZCR and energy of the speech frames corresponding to each of the /s/, /t/, /r/, /i/, /c/ and /k/ phonemes, and thereby classify each speech frame as a voiced frame, an unvoiced frame or a silence frame. More specifically, if comparing the calculated ZCR values shows that the frames corresponding to the /s/ and /t/ phonemes have higher ZCR values than the frame corresponding to the /i/ phoneme, the frames corresponding to the /s/ and /t/ phonemes may be classified as unvoiced frames. Similarly, if comparing the calculated energy values shows that the energy decreases in the order of the frames corresponding to the /i/, /r/, /s/, /t/ and /k/ phonemes, the high-energy frames corresponding to the /i/ and /r/ phonemes may be classified as voiced frames, the frames with the next-highest energy, corresponding to the /s/, /t/ and /k/ phonemes, may be classified as unvoiced frames, and the frames corresponding to the beginning and end of '_strick_' may be classified as silence frames.
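The classification in step 220 can be illustrated with a simple rule. Note the simplification: the description compares frames relative to one another, whereas this sketch uses fixed thresholds, and every threshold value and name below is a hypothetical choice made only for illustration.

```python
def classify_frame(zcr, energy,
                   silence_energy=0.001,   # assumed: below this the frame is silence
                   voiced_energy=0.05,     # assumed: at or above this the frame may be voiced
                   unvoiced_zcr=0.3):      # assumed: at or above this the frame is noise-like
    """Label one frame as 'silence', 'voiced' or 'unvoiced'."""
    if energy < silence_energy:
        return "silence"
    if energy >= voiced_energy and zcr < unvoiced_zcr:
        return "voiced"      # high energy, few zero crossings (e.g. a vowel such as /i/)
    return "unvoiced"        # lower energy or many zero crossings (e.g. /s/, /t/)
```

In practice the thresholds would need to be tuned to the recording, or replaced by the relative comparison the description mentions.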

The system may also obtain the utterance time information of the audio data based on the result of classifying each frame included in the audio data (230). Here, the utterance time information may be obtained per sentence of the audio data, and the system may obtain it based on the number of frames. For example, when frames classified as silence frames continue for a preset number of frames or more, the system may recognize that section as a pause section and divide the audio data into sentence units on the basis of the recognized pause sections, thereby obtaining utterance time information per sentence. More specifically, if the audio data includes a first sentence, 'Have you answered yes to any of the questions above?', and a second sentence, 'Video game addictions are becoming recognized as a real problem.', the system determines whether the frames between the first sentence and the second sentence are classified as silence frames and continue for the preset number or more. If the silence frames do continue for the preset number or more, the system recognizes that section as a pause section separating sentences and divides the audio data into the audio data of the first sentence and the audio data of the second sentence, thereby obtaining the utterance time information of each.
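Step 230 can be sketched as a single pass over the frame labels. This is an illustrative reading under stated assumptions: labels are the strings `"voiced"`, `"unvoiced"` and `"silence"`, each frame lasts `frame_ms` milliseconds, and the function name and return shape are hypothetical.

```python
def split_utterances(labels, frame_ms, min_silence_frames):
    """Return (start_ms, end_ms) for each stretch of speech that is
    separated from the next by at least min_silence_frames silence frames."""
    segments = []
    start = None        # first frame of the utterance currently being collected
    last_speech = None  # most recent non-silence frame seen
    silence_run = 0
    for i, label in enumerate(labels):
        if label == "silence":
            silence_run += 1
            if start is not None and silence_run >= min_silence_frames:
                # The pause is long enough: close the current utterance.
                segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
                start = None
        else:
            if start is None:
                start = i
            last_speech = i
            silence_run = 0
    if start is not None:
        segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return segments
```

Short silences inside a sentence (shorter than `min_silence_frames`) do not split it, so the utterance time of a sentence includes its brief internal gaps.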

FIG. 3 is a flowchart illustrating in detail the step of extracting the pause time information of the audio data shown in FIG. 1.

Referring to FIG. 3, the system for mapping text data to audio data according to an embodiment of the present invention may set the pause sections of the audio data based on the pause section information of the divided text data (310). Here, a pause section of the audio data may be set, based on the pause section information of the text data, where the frames included in the audio data continue as silence frames for a preset number or more. This is described in detail with reference to FIG. 4.

The system may also obtain start time information and end time information for each pause section set in the audio data (320). Here, the start time information and end time information of a pause section may be obtained relative to the playback time of the entire audio data file. For example, the start time at which a pause section begins and the end time at which it ends may be obtained by taking the playback start time of the entire audio data file as the reference '0:00' and measuring the corresponding times.
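To make step 320 concrete, the sketch below converts a pause given in frame indices into start and end offsets measured from the beginning of the file. The function names, the millisecond unit and the `m:ss` display format are assumptions for illustration only.

```python
def pause_boundaries(first_frame, n_frames, frame_ms):
    """Start and end of a pause in milliseconds from the start of the file."""
    start_ms = first_frame * frame_ms
    end_ms = (first_frame + n_frames) * frame_ms
    return start_ms, end_ms


def format_offset(ms):
    """Render an offset as m:ss, with '0:00' as the start of playback."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"
```

For instance, with 20 ms frames, a pause that begins at frame 150 and lasts 50 frames spans 3000 ms to 4000 ms, that is, 0:03 to 0:04.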

FIG. 4 is a flowchart illustrating in detail the step of setting the pause sections of the audio data shown in FIG. 3.

Referring to FIG. 4, the system for mapping text data to audio data according to an embodiment of the present invention may classify each frame included in the audio data and detect, as candidate pause sections, the sections in which silence frames continue for a preset number or more (410). Because a pause at a comma may be longer than a pause between sentences, the system may detect a preset number more candidate pause sections than there are pause sections in the text data.

The system may then compare the detected candidate pause sections with a preset reference pause section, based on the pause section information of the divided text data (420). For example, suppose the text data is divided into 11 sentences, and therefore has 10 pause sections, based on symbol information inserted in advance so that each sentence can be distinguished, while 15 candidate pause sections are detected in the audio data. To distinguish runs of silence frames between words from pauses between sentences, the system compares the candidate pause sections of the audio data with the preset reference pause section (420).

The system may then select the pause sections of the audio data from among the candidate pause sections based on the comparison result (430). For example, if the reference pause section is defined as six or more consecutive silence frames, the system may select, from the 15 candidate pause sections, those in which silence frames continue for six or more frames as the pause sections between sentences, so that the pause sections of the audio data match the 10 pause sections of the text data.
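Steps 410 to 430 can be sketched as candidate detection followed by filtering against the reference length. The names are hypothetical, and one rule is an assumption that the description does not state: if more candidates pass the reference than the text has pauses, this sketch keeps the longest ones.

```python
def candidate_pauses(labels, min_frames):
    """Return (first_frame, length) for every silence run of at least min_frames."""
    runs, start = [], None
    for i, label in enumerate(list(labels) + ["end"]):  # sentinel closes a trailing run
        if label == "silence":
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_frames:
                runs.append((start, i - start))
            start = None
    return runs


def select_pauses(candidates, reference_frames, expected_count):
    """Keep candidates at least as long as the reference pause section."""
    selected = [c for c in candidates if c[1] >= reference_frames]
    if len(selected) > expected_count:
        # Assumption, not from the description: prefer the longest pauses,
        # then restore chronological order.
        longest = sorted(selected, key=lambda c: c[1], reverse=True)[:expected_count]
        selected = sorted(longest)
    return selected
```

Here `expected_count` plays the role of the number of pause sections in the text data (10 in the example above), and `reference_frames` corresponds to the reference of six silence frames.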

FIG. 5 is a flowchart illustrating in detail the step of calculating the utterance section ratio of the text data shown in FIG. 1.

Referring to FIG. 5, the system for mapping text data to audio data according to an embodiment of the present invention may extract the phonemes included in each word of the sentence-divided text data and apply a preset fixed ratio to the extracted phonemes (510). For example, if an extracted phoneme is a liquid consonant, namely /m/, /l/ or /n/, a preset fixed ratio of 0.7 may be applied, and if the extracted phoneme is any other consonant, a preset fixed ratio of 0.5 may be applied.

The system may also generate the utterance section ratio of the text data per sentence based on the phonemes to which the preset fixed ratio has been applied (520). For example, the utterance section ratio of the text data per sentence may be generated by calculating, for each sentence, the total of the ratios of the phonemes to which the preset fixed ratio has been applied.
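Steps 510 and 520 can be sketched as a weighted phoneme count per sentence. The weights 0.7 and 0.5 come from the example above; everything else is an assumption: the vowel weight of 1.0, the single-letter phoneme symbols, the vowel set, and the normalization that turns each sentence total into a share of the whole text.

```python
LIQUID_WEIGHT = 0.7      # from the example: /m/, /l/, /n/
CONSONANT_WEIGHT = 0.5   # from the example: remaining consonants
VOWEL_WEIGHT = 1.0       # assumption: the description gives no value for vowels

LIQUIDS = {"m", "l", "n"}
VOWELS = {"a", "e", "i", "o", "u"}  # simplified stand-in for a real phoneme set


def phoneme_weight(phoneme):
    """Fixed ratio applied to a single phoneme."""
    if phoneme in VOWELS:
        return VOWEL_WEIGHT
    if phoneme in LIQUIDS:
        return LIQUID_WEIGHT
    return CONSONANT_WEIGHT


def utterance_ratios(sentences):
    """sentences: list of phoneme lists. Returns each sentence's share of the total."""
    totals = [sum(phoneme_weight(p) for p in phonemes) for phonemes in sentences]
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0 for _ in totals]
    return [t / grand_total for t in totals]
```

A real system would need a grapheme-to-phoneme step to produce the phoneme lists; that step is outside this sketch.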

FIG. 6 is a flowchart illustrating in detail the step of mapping the text data to the audio data shown in FIG. 1.

Referring to FIG. 6, the system for mapping text data to audio data according to an embodiment of the present invention may apply the utterance section ratio of the text data to the utterance time information of the audio data to calculate the utterance time information of the text data corresponding to the utterance time information of the audio data (610).

The system may also generate pause time information of the text data corresponding to the pause time information of the audio data (620).

In addition, the system for mapping text data to audio data may map the text data to the audio data by generating the text data to be mapped to the audio data through the process of calculating the utterance time information of the text data corresponding to the utterance time information of the audio data (610) and the process of generating the pause time information of the text data corresponding to the pause time information of the audio data (620) (not shown).
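One possible reading of step 610 is a proportional allocation: the total utterance time measured in the audio is distributed over the sentences according to their utterance section ratios. The following Python sketch shows that reading only; the normalization by the sum of the ratios and the function name are assumptions, not details stated in the description.

```python
def allocate_utterance_times(total_utterance_sec, sentence_ratios):
    """Distribute the audio utterance time over sentences in proportion
    to their utterance section ratios (an assumed reading of step 610)."""
    total_ratio = sum(sentence_ratios)
    if total_ratio <= 0:
        raise ValueError("sentence ratios must sum to a positive value")
    return [total_utterance_sec * r / total_ratio for r in sentence_ratios]
```

For instance, with 6 seconds of speech and two sentences whose ratios are 2.0 and 1.0, the first sentence would receive twice as much time as the second.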

FIG. 7 is a diagram illustrating a process of mapping text data to audio data according to an embodiment of the present invention.

Referring to FIG. 7, the system for mapping text data to audio data according to an embodiment of the present invention may apply the utterance section ratio 710 of the text data to the utterance time information of the audio data 730 to calculate utterance time information of the text data 720 to be mapped, thereby generating the utterance sections of the text data 720 to be mapped. For example, the system may apply the utterance section ratio of the first sentence 711 included in the text data to the utterance time information of the first sentence 731 included in the audio data 730 to calculate time information of the first sentence 721 included in the text data 720 to be mapped, thereby generating the utterance section of the first sentence 721 included in the text data 720 to be mapped (740).

In addition, the system for mapping text data to audio data may set the pause sections of the text data 720 to be mapped by generating pause time information of the text data 720 to be mapped that corresponds to the pause time information of the audio data 730. For example, the system may generate time information for the pause section 723 between the first sentence 721 and the second sentence 722 included in the text data 720 to be mapped, corresponding to the time information for the pause section 733 between the first sentence 731 and the second sentence 732 included in the audio data 730, thereby setting the pause section 723 between the first sentence 721 and the second sentence 722 included in the text data 720 to be mapped.

After generating all of the sentences and pause sections included in the text data 720 to be mapped, the system for mapping text data to audio data may map the text data 720 to be mapped to the audio data (750).
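The layout of FIG. 7, in which sentence sections and pause sections alternate along the time axis, can be sketched as follows. This is a simplified illustration under assumed inputs: a list of sentence durations, a list of pause durations with one pause between each pair of adjacent sentences, and a hypothetical tuple format for the result.

```python
def build_timeline(start_sec, utterance_durations, pause_durations):
    """Lay out sentences and pauses back to back on the time axis.

    Returns a list of (kind, index, start, end) tuples, where kind is
    "sentence" or "pause". The tuple format is an assumption.
    """
    if len(pause_durations) != max(len(utterance_durations) - 1, 0):
        raise ValueError("expected one pause between each pair of sentences")
    timeline = []
    cursor = start_sec
    for i, duration in enumerate(utterance_durations):
        # Sentence section (corresponds to 721, 722, ... in FIG. 7).
        timeline.append(("sentence", i, cursor, cursor + duration))
        cursor += duration
        if i < len(pause_durations):
            # Pause section copied from the audio (corresponds to 723).
            timeline.append(("pause", i, cursor, cursor + pause_durations[i]))
            cursor += pause_durations[i]
    return timeline
```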

FIG. 8 is a block diagram illustrating a system for mapping text data to audio data according to an embodiment of the present invention.

Referring to FIG. 8, the system for mapping text data to audio data according to an embodiment of the present invention includes a pre-processing application unit 810, an utterance time information extraction unit 820, a sentence division unit 830, a pause time information extraction unit 840, an utterance section ratio calculation unit 850, a mapping unit 860, and an audio/text data generation unit 870.

The pre-processing application unit 810 may apply at least one of pre-emphasis, DC component removal, or noise removal to the audio data.
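Two of the pre-processing operations named above can be sketched in plain Python. The description does not give formulas, so this sketch assumes the common forms: DC removal by subtracting the mean, and a first-order pre-emphasis filter y[n] = x[n] - a * x[n-1] with a hypothetical default coefficient of 0.97. Noise removal is omitted.

```python
def remove_dc(samples):
    """Remove the DC component by subtracting the mean of the signal."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def pre_emphasis(samples, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1].

    The coefficient 0.97 is a commonly used value and an assumption here.
    """
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out
```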

The utterance time information extraction unit 820 extracts utterance time information of the audio data.

In addition, the utterance time information extraction unit 820 may include a calculation unit 821 that calculates at least one of a ZCR or energy of the audio data in units of preset frames, a classification unit 822 that classifies each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame based on the calculation result, and an acquisition unit 823 that acquires the utterance time information of the audio data in sentence units according to the classification result.
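The frame-level calculation and classification performed by units 821 and 822 can be sketched as follows. The description does not state thresholds or a decision rule, so the rule used here (low energy means silence, otherwise a high zero-crossing rate means unvoiced) and both threshold values are assumptions chosen only to make the example concrete.

```python
def split_frames(samples, frame_len):
    """Split the signal into non-overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def classify_frame(frame, energy_silence=1e-4, zcr_unvoiced=0.3):
    """Label a frame as "silence", "unvoiced", or "voiced".

    Both thresholds are hypothetical and would need tuning on real audio.
    """
    if energy(frame) < energy_silence:
        return "silence"
    if zero_crossing_rate(frame) > zcr_unvoiced:
        return "unvoiced"
    return "voiced"
```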

The sentence division unit 830 divides the text data into sentence units.

Here, the sentence division unit 830 may divide the text data into sentence units based on symbol information inserted in advance so that each sentence included in the text data can be distinguished.
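Sentence division based on a pre-inserted symbol can be illustrated with a short Python sketch. The description does not say which symbol is used, so the "|" marker and the function name below are hypothetical.

```python
def split_sentences(text, marker="|"):
    """Split text into sentences at a pre-inserted marker symbol.

    The "|" marker is a hypothetical choice; empty pieces are dropped.
    """
    return [part.strip() for part in text.split(marker) if part.strip()]
```

For example, `split_sentences("Hello there. | How are you? |")` yields the two sentences without surrounding whitespace.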

The pause time information extraction unit 840 extracts pause time information of the audio data from the audio data based on the divided text data.

In addition, the pause time information extraction unit 840 may include a setting unit 841 that sets the pause sections of the audio data based on the pause section information of the divided text data, and an acquisition unit 842 that acquires start time information and end time information of the set pause sections of the audio data.
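A minimal sketch of pause section setting and start/end time acquisition is shown below. It assumes that each frame has already been labeled (for example as "silence", "voiced", or "unvoiced"), that candidate pauses are runs of consecutive silence frames, and that the preset reference is a minimum duration. These are assumptions for illustration; the description does not define the comparison rule.

```python
def candidate_pauses(labels, frame_sec):
    """Return (start, end) times in seconds for each run of silence frames."""
    pauses = []
    run_start = None
    for i, label in enumerate(labels):
        if label == "silence" and run_start is None:
            run_start = i  # a silence run begins at this frame
        elif label != "silence" and run_start is not None:
            pauses.append((run_start * frame_sec, i * frame_sec))
            run_start = None
    if run_start is not None:
        # The signal ends while still inside a silence run.
        pauses.append((run_start * frame_sec, len(labels) * frame_sec))
    return pauses


def select_pauses(candidates, min_pause_sec):
    """Keep candidates at least as long as the reference duration
    (an assumed comparison rule)."""
    return [(s, e) for s, e in candidates if e - s >= min_pause_sec]
```

Short silences inside a sentence, such as stop closures, are filtered out by the reference duration, so that only gaps long enough to separate sentences remain.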

The utterance section ratio calculation unit 850 extracts the phonemes included in the divided text data and calculates the utterance section ratio of the text data.

In addition, the utterance section ratio calculation unit 850 may include an application unit 851 that applies a preset fixed ratio to each extracted phoneme, and a generation unit 852 that generates the utterance section ratio of the text data for each sentence based on the phonemes to which the preset fixed ratios have been applied.

The mapping unit 860 maps the text data to the audio data based on the extracted utterance time information of the audio data, the pause time information, and the utterance section ratio of the text data.

In addition, the mapping unit 860 may include a calculation unit 861 that calculates utterance time information of the text data corresponding to the utterance time information of the audio data by applying the utterance section ratio of the text data to the utterance time information of the audio data, and a generation unit 862 that generates, from the utterance time information of the text data, pause time information of the text data corresponding to the pause time information of the audio data.

The audio/text data generation unit 870 may generate audio/text data as a result of mapping the text data to the audio data.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

Claims

A method of mapping text data to audio data, the method comprising:
extracting utterance time information of audio data;
dividing text data into sentence units;
extracting pause time information of the audio data from the audio data based on the divided text data;
extracting phonemes included in the divided text data and calculating an utterance section ratio of the text data; and
mapping the text data to the audio data based on the extracted utterance time information of the audio data, the pause time information, and the utterance section ratio of the text data,
wherein the calculating of the utterance section ratio comprises
calculating the utterance section ratio of the divided text data using a ratio applied to each of the phonemes extracted from the divided text data.

The method of claim 1,
wherein the extracting of the utterance time information of the audio data comprises:
calculating at least one of a ZCR (Zero Crossing Ratio) or energy of the audio data in units of preset frames;
classifying each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame based on the calculation result of at least one of the ZCR or the energy; and
acquiring the utterance time information of the audio data in the sentence units according to the classification result.

The method of claim 1,
wherein the extracting of the pause time information of the audio data comprises:
setting a pause section of the audio data based on pause section information of the divided text data; and
acquiring start time information and end time information of the set pause section of the audio data.

The method of claim 3,
wherein the setting of the pause section of the audio data comprises:
detecting candidate pause sections of the audio data;
comparing the detected candidate pause sections with a preset reference pause section; and
selecting the pause section of the audio data from among the candidate pause sections based on the comparison result.

(Deleted)

The method of claim 1,
wherein the mapping of the text data to the audio data comprises:
calculating utterance time information of the text data corresponding to the utterance time information of the audio data by applying the utterance section ratio of the text data to the utterance time information of the audio data; and
generating pause time information of the text data corresponding to the pause time information of the audio data.

The method of claim 1,
wherein the dividing of the text data into the sentence units comprises
dividing the text data into the sentence units based on symbol information inserted in advance so that each sentence included in the text data is distinguished.

The method of claim 1, further comprising
applying at least one of pre-emphasis, DC component removal, or noise removal to the audio data.

The method of claim 1, further comprising
generating audio/text data as a result of mapping the text data to the audio data.

A computer-readable recording medium on which a program for performing the method of any one of claims 1 to 4 and 6 to 9 is recorded.

A system for mapping text data to audio data, the system comprising:
an utterance time information extraction unit configured to extract utterance time information of audio data;
a sentence division unit configured to divide text data into sentence units;
a pause time information extraction unit configured to extract pause time information of the audio data from the audio data based on the divided text data;
an utterance section ratio calculation unit configured to extract phonemes included in the divided text data and calculate an utterance section ratio of the text data; and
a mapping unit configured to map the text data to the audio data based on the extracted utterance time information of the audio data, the pause time information, and the utterance section ratio of the text data,
wherein the utterance section ratio calculation unit
calculates the utterance section ratio of the divided text data using a ratio applied to each of the phonemes extracted from the divided text data.

The system of claim 11,
wherein the utterance time information extraction unit comprises:
a calculation unit configured to calculate at least one of a ZCR (Zero Crossing Ratio) or energy of the audio data in units of preset frames;
a classification unit configured to classify each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame based on the calculation result of at least one of the ZCR or the energy; and
an acquisition unit configured to acquire the utterance time information of the audio data in the sentence units according to the classification result.

The system of claim 11,
wherein the pause time information extraction unit comprises:
a setting unit configured to set a pause section of the audio data based on pause section information of the divided text data; and
an acquisition unit configured to acquire start time information and end time information of the set pause section of the audio data.

(Deleted)

The system of claim 11,
wherein the mapping unit comprises:
a calculation unit configured to calculate utterance time information of the text data corresponding to the utterance time information of the audio data by applying the utterance section ratio of the text data to the utterance time information of the audio data; and
a generation unit configured to generate pause time information of the text data corresponding to the pause time information of the audio data.

The system of claim 11, further comprising
a pre-processing application unit configured to apply at least one of pre-emphasis, DC component removal, or noise removal to the audio data.

The system of claim 11, further comprising
an audio/text data generation unit configured to generate audio/text data as a result of mapping the text data to the audio data.