KR100359988B1

KR100359988B1 - real-time speaking rate conversion system

Info

Publication number: KR100359988B1
Application number: KR1020000065861A
Authority: KR
Inventors: 오영환; 조훈영
Original assignee: 오영환
Priority date: 2000-11-07
Filing date: 2000-11-07
Publication date: 2002-11-07
Also published as: KR20020036014A

Abstract

The present invention relates to a speaking rate conversion apparatus, and more particularly to an apparatus that controls the playback speed of speech to an arbitrary rate when speech is reproduced on a general audio playback device such as a computer, a VTR, or a language-learning device. It is characterized by supporting real-time processing and by working well whether or not noise is present.

The speaking rate conversion apparatus according to a feature of the present invention comprises a speech signal input unit, a data bus, a data storage unit, a signal analysis unit, a speaking rate control unit, a signal synthesis unit, a speech signal output unit, a program bus, and a main control unit.

The present invention processes the voiced and unvoiced portions of the input signal separately, applying waveform similarity overlap-add synthesis to voiced sound and a window function to unvoiced sound. Unlike the conventional approach of changing the sampling frequency, it preserves the timbre of the input speech without losing data. Unlike the conventional pitch synchronous overlap add (PSOLA) method, its processing algorithm is simple yet permits fine-grained speaking rate conversion, and it yields good sound quality even when various kinds of noise are mixed in.

Description

Real-time speaking rate conversion system

The present invention relates to a speaking rate conversion apparatus, and more particularly to an apparatus that controls the playback speed of speech to an arbitrary rate when speech is reproduced on a general audio playback device such as a computer, a VTR, or a language-learning device. It is characterized by supporting real-time processing and by working well whether or not noise is present.

When the playback speed of a tape or video player is changed, speeding it up or slowing it down alters the timbre of the original voice, so the speech becomes unintelligible.
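The timbre change can be illustrated on a single sinusoid. The following is a minimal worked example, assuming an idealized speed change by a factor \alpha (a symbol introduced here for illustration, not taken from the patent) in which the player simply runs through the recording \alpha times faster:

```latex
x(t) = \sin(2\pi f_0 t), \quad 0 \le t \le T
\qquad\Longrightarrow\qquad
x_\alpha(t) = x(\alpha t) = \sin\bigl(2\pi (\alpha f_0)\, t\bigr), \quad 0 \le t \le T/\alpha
```

The duration shrinks from T to T/\alpha, but every frequency component is scaled by \alpha as well. At \alpha = 1.5, for instance, a 200 Hz voice fundamental is heard at 300 Hz, which is why a plain speed change alters how the voice sounds.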

One conventional way to prevent this is simply to change the sampling frequency of the speech along the time axis of the signal. When playback is sped up, however, the amount of information decreases, so the meaning and acoustic characteristics of the speech cannot be conveyed properly.

A more advanced method analyzes the original signal to detect the pitch of the speech or audio signal and takes it as the sound source component. Using the pitch period as a reference, it separates out the acoustic characteristics of the signal to form segments of the speech or audio signal, modifies the sound source component by deleting or adding portions of it to synthesize a source signal at the desired playback speed, and then combines this with the segments to output a speech signal. This method is called pitch synchronous overlap add (PSOLA).

Compared with the sampling-frequency method above, this method has the advantage of changing only the playback speed, faster or slower, while preserving the timbre and acoustic characteristics of the original signal. Consequently, when a VTR or similar device plays back fast or slow for searching or monitoring, the problems of the earlier technique do not arise: the timbre does not change unpleasantly, and the speech does not become unintelligible through the loss of information caused by a changed sampling frequency.

However, this method has the following limitations.

First, the pitch positions must be detected accurately from the input signal in order to serve as the sound source component. In real speech signals, various kinds of noise are mixed in, which makes the pitch positions difficult to estimate accurately, and the quality of the synthesized output may degrade as a result.

Second, because the method separates segments of the speech or audio signal on the basis of the pitch period, each segment, which is the basic unit of the synthesized speech, has a fixed size on the time axis. This makes fine adjustment of the playback speed, at a resolution of 0.1x or finer, difficult.

The present invention is intended to solve these problems. Its object is to provide a speaking rate conversion apparatus that processes the voiced and unvoiced portions of the input signal separately, applying waveform similarity overlap-add synthesis to voiced sound and a window function to unvoiced sound; that, unlike the conventional sampling-frequency method, preserves the timbre of the input speech without losing data; that, unlike the conventional pitch synchronous overlap add (PSOLA) method, uses a simple processing algorithm and, because it does not have to overlap-add in pitch units, permits fine-grained speaking rate conversion; and that yields good sound quality even when various kinds of noise are mixed in.

FIG. 1 is a processing flowchart of a real-time speaking rate conversion apparatus according to an embodiment of the present invention.

FIG. 2 is a block diagram of a real-time speaking rate conversion apparatus according to an embodiment of the present invention.

FIG. 3 illustrates the principle of speaking rate conversion by the waveform similarity overlap-add synthesis method.

FIG. 4 illustrates the method of searching for the optimal segment using a local similarity distance according to the present invention.

To achieve this object, the speaking rate conversion apparatus according to a feature of the present invention comprises a speech signal input unit, a data bus, a data storage unit, a signal analysis unit, a speaking rate control unit, a signal synthesis unit, a speech signal output unit, a program bus, and a main control unit.

The speech signal input unit converts the input analog signal into a digital signal under the control signal of the main control unit and passes it to the signal analysis unit and the data storage unit for subsequent processing. It consists of an A/D converter, which receives the analog signal and converts it into a digital signal, and an input buffer, which temporarily stores the converted digital data.

The data bus carries data between the components of the speaking rate conversion apparatus. It receives the digital data from the speech signal input unit and delivers it for storage in the data storage unit, and in later processing steps it delivers the data stored in the data storage unit to the signal analysis unit, the speaking rate control unit, and the signal synthesis unit.

The data storage unit receives a control signal from the main control unit, stores the data transmitted over the data bus, and outputs the stored data over the data bus according to the processing step.

Under the control signal of the main control unit, the signal analysis unit processes the data input from the speech signal input unit so that data with the silent sections removed (they are unnecessary for the processing of the present invention) and with voiced and unvoiced sound identified can be passed to the speaking rate control unit. It consists of a silent section removal unit, which removes silent sections, and a voiced/unvoiced classification unit, which identifies the voiced and unvoiced sections in the input data.

Under the control signal of the main control unit, the speaking rate control unit receives the signal from which silent sections have been removed and in which voiced and unvoiced sound have been distinguished. It converts the voiced sections into a rate-changed signal by applying the waveform similarity overlap add (WSOLA) method, converts the unvoiced sections by applying a window function, and outputs each converted signal to the signal synthesis unit.

Under the control signal of the main control unit, the signal synthesis unit combines the signals input from the speaking rate control unit into the final speech signal with a changed speaking rate and passes it to the speech signal output unit.

Under the control signal of the main control unit, the speech signal output unit converts the final speech signal into an analog signal and outputs it. It consists of an output buffer, which temporarily stores the final processed digital speech signal, and a D/A converter.

The program bus delivers commands issued by the main controller to the part to be controlled.

The main control unit controls the entire course of processing throughout the apparatus. It consists of an instruction memory, in which the instructions needed to control each part are stored in order, and a main controller, which generates the control signals.

As described above, one of the main features of the present invention is that the voiced and unvoiced sections of the speech signal are distinguished and processed separately. The reason is that the duration of an utterance is determined mainly by its voiced sections, whereas unvoiced sections usually stay nearly constant regardless of speaking rate. Moreover, unvoiced sections are often extremely short, so converting their speed with the same method as voiced sections could shorten them excessively and degrade the sound quality.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings.

As shown in FIG. 1, the real-time speaking rate conversion apparatus according to the embodiment follows this processing sequence: the speech signal input unit converts the input speech signal into digital data and sends it to the signal analysis unit; the signal analysis unit removes silent sections and identifies voiced and unvoiced sound; the speaking rate control unit processes each kind of section separately, applying the waveform similarity overlap-add method to voiced sections and a window function to unvoiced sections so that the signal joins naturally; the signal synthesis unit combines the data of the sections into the final rate-converted speech signal; and the speech signal output unit converts it into analog form and outputs it.

As shown in FIG. 2, the real-time speaking rate conversion apparatus according to the embodiment comprises a speech signal input unit (10), a data bus (20), a data storage unit (30), a signal analysis unit (40), a speaking rate control unit (50), a signal synthesis unit (60), a speech signal output unit (70), a program bus (90), and a main control unit (80).

The speech signal input unit (10) consists of an A/D converter (11), which receives an analog signal and converts it into a digital signal, and an input buffer (12), which temporarily stores the converted digital data.

The signal analysis unit (40) consists of a silent section removal unit (41) and a voiced/unvoiced classification unit (42), which identifies the voiced and unvoiced sections in the input data.

The speech signal output unit (70) consists of an output buffer (71), which temporarily stores the final speech signal, and a D/A converter (72).

Hereinafter, the operation of the real-time speaking rate conversion apparatus according to the embodiment is described with reference to FIGS. 2, 3, and 4 and Table 1.

As shown in FIG. 2, the signal analysis unit (40) receives the digital data of the input speech signal from the speech signal input unit (10) and removes silent sections through the silent section removal unit (41). This makes effective use of the limited input buffer and also reduces unnecessary computation. Silent sections are detected using the energy of the speech segments, computed frame by frame: when the energy values of a run of consecutive speech segments fall below a predetermined threshold, the run is judged to be a silent section.
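The energy-threshold rule can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `remove_silence` and the parameters `frame_len`, `threshold`, and `min_run` are hypothetical, and reading "a run of consecutive segments" as "at least `min_run` quiet frames in a row" is an assumption.

```python
def remove_silence(samples, frame_len, threshold, min_run=2):
    """Drop runs of at least `min_run` consecutive low-energy frames.

    Frame energy is the sum of squared samples. A trailing partial
    frame is ignored in this sketch.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    quiet = [sum(s * s for s in frame) < threshold for frame in frames]

    kept = []
    i = 0
    while i < len(frames):
        if not quiet[i]:
            kept.extend(frames[i])
            i += 1
            continue
        # Find the end of this run of quiet frames.
        j = i
        while j < len(frames) and quiet[j]:
            j += 1
        if j - i < min_run:
            # A short dip is kept; only longer runs count as silence.
            for frame in frames[i:j]:
                kept.extend(frame)
        i = j
    return kept


# Eight silent samples, a burst, a one-frame dip, and another burst.
demo = [0.0] * 8 + [1.0] * 4 + [0.0] * 2 + [1.0] * 2
trimmed = remove_silence(demo, frame_len=2, threshold=0.5, min_run=2)
```

In the demo, the leading four quiet frames are removed, while the single quiet frame between the two bursts is kept because it is shorter than `min_run`.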

The voiced/unvoiced classification unit (42) separates voiced and unvoiced sections in the data from which silent sections have been removed. Unlike a voiced section, an unvoiced section has no pitch component and therefore shows no periodic signal characteristic. Using this property, a speech segment is judged to be voiced when its autocorrelation coefficient exceeds a predetermined threshold.
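One way to realize this decision is to take the peak of the normalized autocorrelation over a range of candidate pitch lags. The sketch below assumes that interpretation; the text does not say which lag or normalization is used, and the names `max_normalized_autocorrelation`, `is_voiced`, `min_lag`, `max_lag`, and the threshold of 0.5 are all hypothetical.

```python
import math


def max_normalized_autocorrelation(frame, min_lag, max_lag):
    """Largest value of r(lag) / r(0) for lags in [min_lag, max_lag]."""
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


def is_voiced(frame, min_lag, max_lag, threshold=0.5):
    """Classify a frame as voiced when its periodicity score exceeds the threshold."""
    return max_normalized_autocorrelation(frame, min_lag, max_lag) > threshold


# A periodic frame (period 40 samples) scores high; a single click does not.
periodic = [math.sin(2.0 * math.pi * n / 40) for n in range(160)]
click = [1.0] + [0.0] * 159
```

With these inputs, `is_voiced(periodic, 20, 80)` is true and `is_voiced(click, 20, 80)` is false. In practice the lag range would be chosen from the expected pitch range and the sampling rate.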

For voiced sections, the speaking rate control unit (50) applies the waveform similarity overlap-add method to convert the speaking rate finely by the desired scale factor. This process is described in detail below.

In the waveform similarity overlap-add method, let x(m) be the waveform of the original digitized speech signal and y(n) the waveform of the speech signal synthesized after speaking rate conversion. The functional relationship between m and n is determined so that the signal around sample index n of y(n) keeps maximum local similarity to the signal around sample index m of x(m). This relationship is called the time warping function and is denoted by [symbol not reproduced]. The relationship between x and y that follows from it is expressed by Equation 1 below.

[Equation 1 not reproduced]

Here, w(n) denotes the window function.

As shown in FIG. 3, partially similar structures recur in a speech signal. By removing or overlapping portions of the recurring signal, the speech can be made faster or slower than the original.

Equation 2 below gives the general synthesis equation of the overlap-add method.

[Equation 2 not reproduced]

In the above equation, x(n) is the original speech signal and y(n) is the synthesized signal after speaking rate conversion. v(n) is a window function that lets the speech segments join smoothly. Various choices are possible; this embodiment uses the Hanning window, whose general expression is given in the following equation.

[Hanning window equation not reproduced]
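For reference, a Hanning (Hann) window can be generated as in the sketch below. It uses the periodic form w(n) = 0.5 (1 - cos(2πn/N)); whether the patent's own equation uses this form or the symmetric variant with N - 1 in the denominator is not known from this text, so the choice is an assumption. The name `hann_window` is hypothetical.

```python
import math


def hann_window(length):
    """Periodic Hann window: w(n) = 0.5 * (1 - cos(2*pi*n / length))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / length))
            for n in range(length)]


window = hann_window(8)
hop = len(window) // 2

# With 50% overlap, each sample is covered by two windows whose
# weights add up to one, so overlap-adding unmodified segments
# reproduces the input in the overlapped region.
overlap_sum = [window[n] + window[n + hop] for n in range(hop)]
```

The periodic form is convenient here because windows shifted by half their length sum to exactly one, so no extra normalization is needed during overlap-add.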

In this embodiment, to implement the relationship of Equation 2 in the apparatus, the synthesis window positions L_k were placed at a constant period, as [expression not reproduced], and the condition [expression not reproduced] was imposed, which simplifies the equation as follows.

[Simplified equation not reproduced]

In the conventional pitch synchronous overlap add (PSOLA) method, the term [symbol not reproduced] in Equation 4 is determined so as to maintain pitch synchronization. In the method of the present invention, by contrast, it is determined so that, at the frame-level joining points, the rate-converted signal has maximum similarity to the original signal.
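For orientation, the overlap-add synthesis equation is commonly written in the WSOLA literature in the form below. This is the textbook form, not a reproduction of the patent's own equations; the symbols \tau^{-1} (inverse time warping function), \Delta_k (per-segment offset), and L (constant synthesis hop) are taken from that literature and are assumptions here.

```latex
y(n) = \frac{\sum_k v(n - L_k)\, x\bigl(n + \tau^{-1}(L_k) - L_k + \Delta_k\bigr)}
            {\sum_k v(n - L_k)}

% With regularly spaced windows, L_k = kL, chosen so that \sum_k v(n - kL) = 1:
y(n) = \sum_k v(n - kL)\, x\bigl(n + \tau^{-1}(kL) - kL + \Delta_k\bigr)
```

In this form, \Delta_k is the quantity that distinguishes the methods: a pitch-synchronous scheme would fix it from detected pitch marks, whereas a waveform-similarity scheme searches a small range of offsets for the one that makes segment k continue segment k - 1 most naturally.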

The speaking rate conversion algorithm of the present invention is described with reference to FIG. 3. The algorithm is applied from the left end of the original signal x(n) toward the right. Suppose processing has been completed up to the current synthesis segment (1), which has been appended to the synthesized signal y(n) as segment (a). The next step is to find in x(n) the segment (b) that will be overlap-added to y(n).

The portion (1') of the original signal that lies closest to the most recently processed portion (1) is the one most similar to (1). Therefore, to maximize the local similarity of the speech signal, the segment (2) most similar to (1') is searched for in x(n) around the position [symbol not reproduced] and is appended to y(n) as the next segment (b).

The method of finding the optimal segment is described in more detail with reference to FIG. 4. Taking the position [symbol not reproduced] as a reference, a value [symbol not reproduced] is sought within a fixed interval [symbol not reproduced] before and after it that maximizes the similarity measure [symbol not reproduced], so that the segment connects naturally to segment m-1, which was joined to the synthesized signal in the previous step. The similarity measure may be the cross-correlation coefficient, the normalized cross-correlation coefficient, the cross-AMDF coefficient, or the like; these are listed in Table 1.

Table 1. Similarity measures

Similarity measure                         Formula
Cross-correlation coefficient              [equation not reproduced]
Normalized cross-correlation coefficient   [equation not reproduced]
Cross-AMDF coefficient                     [equation not reproduced]
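The three measures and the search over a tolerance interval can be sketched as follows, using their usual textbook definitions rather than the patent's own formulas, which are not available in this text. All function and parameter names are hypothetical. Because the cross-AMDF is a distance (smaller means more similar), the sketch negates it so that every measure can be maximized in the same way.

```python
import math


def cross_correlation(a, b):
    """Sum of sample-wise products of two equal-length segments."""
    return sum(p * q for p, q in zip(a, b))


def normalized_cross_correlation(a, b):
    """Cross-correlation divided by the geometric mean of the energies."""
    norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return cross_correlation(a, b) / norm if norm > 0.0 else 0.0


def negative_cross_amdf(a, b):
    """Negated sum of absolute differences, so larger means more similar."""
    return -sum(abs(p - q) for p, q in zip(a, b))


def best_offset(signal, template, center, tolerance, measure):
    """Offset d in [-tolerance, tolerance] whose segment starting at
    center + d is most similar to `template` under `measure`."""
    length = len(template)
    best_d, best_score = 0, -math.inf
    for d in range(-tolerance, tolerance + 1):
        start = center + d
        if start < 0 or start + length > len(signal):
            continue  # candidate would fall outside the signal
        score = measure(template, signal[start:start + length])
        if score > best_score:
            best_d, best_score = d, score
    return best_d


# A sine with a 20-sample period: the segment starting at 60 repeats the one at 40.
signal = [math.sin(2.0 * math.pi * n / 20) for n in range(200)]
template = signal[40:60]
offsets = [best_offset(signal, template, 57, 5, m)
           for m in (cross_correlation, normalized_cross_correlation,
                     negative_cross_amdf)]
```

All three measures pick an offset of 3 here, moving the nominal position 57 to 60, where the waveform is in phase with the template. The plain cross-correlation is the cheapest but favors louder candidates; the normalized form and the cross-AMDF are less sensitive to level differences.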

For unvoiced sections, a window function is applied so that the voiced and unvoiced portions join smoothly. This embodiment uses the Hanning window introduced with Equation 2 above.

As described above, one of the main features of the present invention is that voiced and unvoiced portions are processed separately. The signal synthesis unit (60) then combines the processed voiced and unvoiced portions to obtain the rate-converted speech signal.

In this embodiment, the Hanning window is used as the analysis window, with a duration of about 20 ms to 30 ms, and 50% overlap-add is performed as shown in FIG. 4. When [condition not reproduced], the scale factor can be chosen arbitrarily within the range [range not reproduced], which allows fine adjustment of the speaking rate; outside this range, the output is known to become perceptually meaningless.
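Putting the pieces together, a waveform-similarity overlap-add loop with a Hann window and 50% overlap might look like the sketch below. It is a simplified illustration under several assumptions, not the patented implementation: `rate` is a hypothetical parameter (values above 1 shorten the output), the cross-AMDF is used as the similarity measure, the window and tolerance are given in samples rather than milliseconds, and the separate handling of unvoiced sections is omitted.

```python
import math


def hann_window(length):
    """Periodic Hann window; copies shifted by length/2 sum to one."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / length))
            for n in range(length)]


def best_start(x, template, center, tolerance):
    """Start index within center +/- tolerance whose segment has the
    smallest sum of absolute differences (cross-AMDF) to `template`."""
    length = len(template)
    best, best_score = center, -math.inf
    for start in range(max(0, center - tolerance), center + tolerance + 1):
        if start + length > len(x):
            break
        score = -sum(abs(p - q) for p, q in zip(template, x[start:start + length]))
        if score > best_score:
            best, best_score = start, score
    return best


def wsola(x, rate, frame_len=32, tolerance=8):
    """Time-scale x by `rate` while keeping the local waveform shape."""
    hop = frame_len // 2                      # 50% overlap on the output side
    window = hann_window(frame_len)
    y = []
    prev_start = None
    k = 0
    while True:
        center = int(round(k * hop * rate))   # nominal read position in x
        if center + frame_len > len(x):
            break
        if prev_start is None:
            start = center
        else:
            # Natural continuation of the segment used last time.
            cont = prev_start + hop
            if cont + frame_len > len(x):
                break
            start = best_start(x, x[cont:cont + frame_len], center, tolerance)
        # Grow the output buffer, then overlap-add the windowed segment.
        y.extend([0.0] * (k * hop + frame_len - len(y)))
        for n in range(frame_len):
            y[k * hop + n] += window[n] * x[start + n]
        prev_start = start
        k += 1
    return y


x = [math.sin(2.0 * math.pi * n / 16) for n in range(640)]
y_fast = wsola(x, rate=2.0)   # about half as long, same 16-sample period
y_same = wsola(x, rate=1.0)   # reproduces x away from the edges
```

At a sampling rate of 8 kHz, the 20 ms to 30 ms window mentioned above would correspond to a `frame_len` of 160 to 240 samples; the small default here only keeps the demo short. Because the read position is `k * hop * rate`, the rate can take any real value, which is what makes fine-grained adjustment possible.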

The embodiment described above is only one embodiment; the present invention is not limited to it, and it is evident that many changes and modifications beyond the above embodiment are possible within the same technical idea.

For example, as shown in Table 1, the cross-correlation coefficient, the normalized cross-correlation coefficient, the cross-AMDF coefficient, and the like may be used interchangeably as the similarity measure in the speaking rate conversion algorithm of the present invention.

The technical idea of the present invention can also be applied to any multimedia product involving speech playback, such as a VTR, radio, or cassette player. By taking the signal from the audio output of such a device as the input to the speaking rate conversion apparatus and converting its speaking rate, the invention can be adapted to various fields, such as language-learning devices for foreign-language listening practice and speech rate adjustment for people with hearing impairments or for children.

The present invention can be applied to any multimedia product involving speech playback, such as a VTR, radio, or cassette player. By taking the signal from the audio output of such a device as the input to the speaking rate conversion apparatus and converting its speaking rate, the invention can be used in various fields, such as language-learning devices for foreign-language listening practice and speech rate adjustment for people with hearing impairments or for children.

Claims

A real-time speaking rate conversion apparatus comprising:

a speech signal input unit that converts an input analog signal into a digital signal and outputs the digital signal;

a data bus for transferring data between the components of the apparatus;

a data storage unit that stores data transmitted over the data bus and outputs the stored data over the data bus according to the processing step;

a signal analysis unit that processes the data input from the speech signal input unit, removes silent sections, identifies voiced and unvoiced sound, and outputs the data to a speaking rate control unit;

the speaking rate control unit, which receives data from the speech signal input unit, converts the rate of the voiced sections and the unvoiced sections separately, and outputs each converted signal to a signal synthesis unit;

the signal synthesis unit, which combines the signals input from the speaking rate control unit into a final speech signal with a changed speaking rate and outputs it;

a speech signal output unit that converts the final speech signal into an analog signal and outputs it;

a main control unit that controls the entire course of processing throughout the apparatus; and

a program bus that delivers commands issued by the main control unit to the part to be controlled.

The apparatus of claim 1,

wherein the speech signal input unit comprises:

an A/D converter that receives the analog speech signal and converts it into a digital signal; and

an input buffer that temporarily stores the converted digital data,

and converts the input analog signal into a digital signal under the control signal of the main control unit and passes it to the signal analysis unit and the data storage unit.

The apparatus of claim 1,

wherein the data bus:

receives the digital data from the speech signal input unit and delivers it for storage in the data storage unit; and

in subsequent processing steps, delivers the data stored in the data storage unit to the signal analysis unit, the speaking rate control unit, and the signal synthesis unit.

The apparatus of claim 1,

wherein the signal analysis unit comprises:

a silent section removal unit that removes silent sections; and

a voiced/unvoiced classification unit that identifies the voiced and unvoiced sections in the input data,

and, under the control signal of the main control unit,

processes the data input from the speech signal input unit

so that data from which silent sections unnecessary for processing have been removed, and in which voiced and unvoiced sound have been identified, is passed to the speaking rate control unit for processing.

The apparatus of claim 1,

wherein the speaking rate control unit,

under the control signal of the main control unit,

converts voiced sections into a rate-changed signal by applying waveform similarity overlap add (WSOLA),

converts unvoiced sections by applying a window function, and

outputs each converted signal to the signal synthesis unit.

The apparatus of claim 1,

wherein the speech signal output unit comprises:

an output buffer that temporarily stores the processed digital data; and

a D/A converter that converts the digital data into an analog signal,

and, under the control signal of the main control unit, converts the final speech signal into an analog signal and outputs it.

The apparatus of claim 1,

wherein the main control unit comprises:

an instruction memory that stores, in order, the instructions needed to control each part; and

a main controller that generates control signals,

and controls the entire course of processing throughout the apparatus.

The apparatus of claim 1,

wherein the signal analysis unit,

in detecting silent sections,

judges a section to be silent when the energy value of the frame-level speech segment is lower than a predetermined value, and,

in distinguishing voiced from unvoiced sections,

judges a section to be voiced when the autocorrelation coefficient of the speech segment is larger than a predetermined value.

The apparatus of claim 5,

wherein, for each voiced section of the speech signal,

the waveform similarity overlap add (WSOLA) processing

obtains, for an input signal x(n), a synthesized signal y(n) that satisfies the following equation:

[equation not reproduced]

where [symbol not reproduced] is the time warping function, which expresses the relationship between the sample indices m and n of x(m) and y(n),

v(n) is the window function,

and [symbol not reproduced] is determined so that, at the frame-level joining points, the rate-converted signal has maximum similarity to the original signal.

The apparatus of claim 9,

wherein, in determining the similarity,

[symbol not reproduced] is obtained so as to maximize the cross-correlation coefficient given by the following formula:

[equation not reproduced]

The apparatus of claim 9,

wherein, in determining the similarity,

[symbol not reproduced] is obtained so as to maximize the normalized cross-correlation coefficient given by the following formula:

[equation not reproduced]

The apparatus of claim 9,

wherein, in determining the similarity,

[symbol not reproduced] is obtained so as to maximize the cross-AMDF coefficient given by the following formula:

[equation not reproduced]