KR100363876B1

KR100363876B1 - A text to speech system using the characteristic vector of voice and the method thereof

Info

Publication number: KR100363876B1
Application number: KR1020000083144A
Authority: KR
Inventors: 백승표
Original assignee: (주)네오싸이피아
Priority date: 2000-12-27
Filing date: 2000-12-27
Publication date: 2002-12-11
Also published as: KR20020053496A

Abstract

The disclosed invention relates to a text-to-speech apparatus and method using feature vectors of speech.

The text-to-speech apparatus using feature vectors of speech according to the present invention comprises: a character database unit that stores reading-rule data, phoneme parameter data, and character code data for character strings according to predetermined criteria; a text input unit that receives a character string through a predetermined input means; a phoneme parameter extraction unit that extracts, from the character database unit, the phoneme data for the characters received by the text input unit; a Gaussian noise generation unit that, when the extracted phoneme parameter is an unvoiced sound, generates Gaussian noise as the fundamental-frequency signal for the unvoiced sound; an impulse train generation unit that, when the extracted phoneme parameter is a voiced sound, generates an impulse train as the fundamental-frequency signal for the voiced sound; a glottal pulse generation unit that converts the impulse train generated by the impulse train generation unit into the fundamental-frequency waveform of a speech signal produced by the human oral structure; a vocal tract unit that generates a speech signal for each phoneme by applying the phoneme parameter values extracted by the phoneme parameter extraction unit to the fundamental-frequency values for unvoiced or voiced sounds generated by the Gaussian noise generation unit and the glottal pulse generation unit; and an articulation unit that connects the phoneme speech signals generated by the vocal tract unit to one another and synthesizes them into a continuous speech signal.

The text-to-speech apparatus using speech feature vectors according to the present invention minimizes the database required for speech synthesis and synthesizes text into a speech signal using only phoneme parameters. This shortens synthesis processing time enough for real-time operation and allows text to be reproduced as speech close to natural sound using only phoneme parameters.

Description

Text-to-speech system using the characteristic vector of voice and the method thereof

The present invention relates to a text-to-speech apparatus and method using feature vectors of speech. More specifically, it relates to a text-to-speech apparatus and method that extract the character codes for the phonemes in a text and use phoneme parameters, which serve as the speech feature vectors for the extracted character codes, to synthesize and reproduce the text as a speech signal.

Text-to-speech conversion turns sentences into speech, using spoken language, the most natural form of communication between humans and computers. It is an important technology applicable to every area of human use, from small items such as toys and home appliances to large ones such as automobiles, elevators, and buildings.

A conventional text-to-speech apparatus and method are described below with reference to FIG. 1.

FIG. 1 is a block diagram showing the schematic configuration of a Korean text-to-speech system and its processing flow.

As shown, the conventional text-to-speech system comprises: a preprocessing unit (101) that receives and processes an input sentence; a dictionary-type number/symbol/abbreviation DB (101) from which the data needed for preprocessing is extracted; a morpheme analysis unit (103) that separates the preprocessed text into morphemes; a morpheme DB (104) serving as the data dictionary for the morpheme analysis unit (103); a parser unit (105) for syntactic analysis; a letter-to-phoneme conversion unit (106) that converts the letters of the parsed text into phonemes; an exception pronunciation DB (107) serving as a data dictionary of pronunciation rules for symbols and special characters during letter-to-phoneme conversion; a speech synthesis data generation unit (108) that generates prosody, synthesis units, and boundary information for letters, words, and sentences for the phonemes produced by the letter-to-phoneme conversion unit (106); a duration control unit (109) that sets the duration of each item of speech data in the synthesis data generated by the speech synthesis data generation unit (108); a fundamental frequency control unit (110) that generates and controls the fundamental frequency of the speech to be synthesized; a synthesized sound generation unit (111) that synthesizes the speech signal from the synthesis data, duration information, and fundamental frequency produced by the components above; and a synthesis unit DB (112) that stores a large number of synthesized sounds as data, from which the synthesized sound generation unit extracts the synthesis units required for speech synthesis.

Examples of prior art that implement parts of these techniques include Patent Registration No. 10-0194814, "Text/speech converter using multi-level input information and method thereof"; Patent Publication No. 2000-0024096, "Digital voice reproducing apparatus"; Patent Publication No. 2000-0024318, "TTS system and TTS service method using the Internet"; and Utility Model Publication No. 2000-0011449, "Computer system having a function of outputting voice information upon keyboard typing".

However, although the prior art described above can improve sound quality by processing an individual's information in multiple stages, it increases the amount of speech data, which can cause transmission problems when speech features are parameterized and processed over communication channels such as the Internet. Moreover, when extracting the parameters of an individual's prosody, it is difficult to extract and process in real time the information that best preserves the individual's characteristics.

In addition, a hardware implementation using PCM has a low compression ratio and inconveniences the user, who must purchase and install the hardware. Because a document is scanned, converted to text, and then reproduced, errors can arise when converting the scanned document to text. Coding speech features as PCM also increases the data volume. Further, although the TTS system is built from the key values entered at the keyboard and the speech matched to them, its speech reproduction method is not described sufficiently for a person skilled in the art to practice it.

When speech is stored as WAV or MP3 data for playback, the volume of data to be stored becomes large, so transmitting and playing it over the Internet or similar networks requires processing time for streaming or buffering, which makes real-time processing difficult. Even when only character parameters are transmitted, the sound quality is lower than that of the present invention described below.

The present invention is intended to solve these problems. Its object is to provide a text-to-speech apparatus and method using feature vectors of speech, in which the frequency characteristics of natural language are stored in a database phoneme by phoneme. To convert a text document into speech, only the phoneme parameters (speech feature vectors) of the phonemes corresponding to the characters of the document are extracted from a prebuilt phoneme parameter data table whose entries are speech feature vectors; these are then connected nonlinearly in the frequency domain and linearly in the time domain, synthesized into speech, and reproduced.

FIG. 1 is a block diagram showing the overall processing of a speech synthesis method in the prior art;

FIG. 2 is a block diagram showing a preferred embodiment of the text-to-speech apparatus using speech feature vectors according to the present invention;

FIG. 3 is a flowchart showing a preferred embodiment of the text-to-speech conversion process using speech feature vectors according to the present invention;

FIG. 4 is a subroutine diagram showing the speech synthesis and output steps of the process of FIG. 3; and

FIG. 5 is a diagram showing an embodiment in which the text-to-speech apparatus using feature vectors of speech according to the present invention is applied in a network environment.

* Explanation of reference numerals for the main parts of the drawings *

200: text-to-speech apparatus    201: text input unit

202: phoneme parameter extraction unit    203: character database unit

204: Gaussian noise generation unit    205: impulse train generation unit

206: glottal pulse generation unit    207: amplification unit

208: vocal tract unit    209: articulation unit

In the speech analysis and speech synthesis stages of the processing described above, the present invention uses phoneme parameters such as the frequency, duration, start time, and pitch period of each phoneme as the speech feature vector, so that building a vast database of characters, words, and the like can be simplified.

To use phoneme parameters as speech feature vectors, the phoneme parameters of each phoneme must be extracted and stored in a database. This can be achieved as in Patent Application No. 2000-0056532, "Apparatus and method for extracting speech signal features by a nonlinear method," filed by the same applicant: the speech signal is divided into voiced, unvoiced, and silent phoneme units, the phoneme parameters serving as the speech feature vector of each phoneme are extracted, and a database is built from the extracted phoneme parameters.

Using the prebuilt character database, which contains phoneme parameter data extracted by the method of the above Patent Application No. 2000-0056532, the codes of the input characters are interpreted phoneme by phoneme and the corresponding phoneme parameters are extracted. The parameters are organized per phoneme, and they are used to reproduce speech by synthesis either online or on a local system.

Hereinafter, a preferred embodiment of the text-to-speech apparatus and method using feature vectors of speech (phoneme parameters) according to the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 shows a preferred embodiment of the text-to-speech apparatus using speech feature vectors according to the present invention.

As shown, the text-to-speech apparatus according to the present invention comprises a text input unit (201), a phoneme parameter extraction unit (202), a character database unit (203), a switching unit (S), a Gaussian noise generation unit (204), an impulse train generation unit (205), a glottal pulse generation unit (206), first and second amplification units (207, 207'), a vocal tract unit (208), and an articulation unit (209). It further comprises a first data line (210) that carries phoneme parameter data to the vocal tract unit and a second data line (211) that carries pitch period data to the impulse train generation unit (205).

The components of the text-to-speech apparatus using speech feature vectors are described in detail next.

The character database unit (203) is the database the phoneme parameter extraction unit (202) needs in order to extract the phoneme parameters corresponding to characters. For this purpose, the character database unit (203) contains a reading rule table, a character code table, and a phoneme parameter table.

The reading rule table of the character database unit (203) is used when the phoneme parameter extraction unit (202) converts symbols, numbers, and special characters into character codes as they are pronounced. The reading rule table therefore holds reading-rule data for symbols, numbers, special characters, and the like; for Korean, these rules preferably follow the Hangul reading rules announced by the Ministry of Education. For example, a reading rule maps the special character "%" to the string "퍼센트" ("percent"). For numbers, "1,234" is mapped to the string "일천이백삼십사" ("one thousand two hundred thirty-four"), whereas "1, 2, 3, 4" is mapped to "일, 이, 삼, 사" ("one, two, three, four"). The table also holds, as data, predetermined time values for the rules governing silent joins at commas, periods, spaces, and the like. Which of the different number readings applies is determined by the space code. In this way, the symbols, numbers, and special characters among the characters input to the phoneme parameter extraction unit (202) are assigned character code values that correspond to phoneme parameters. The reading rule table also serves to generate synthesis units during articulation.
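The reading rules in this paragraph can be illustrated with a minimal Python sketch. It is not the patent's implementation: the table layout, the function names `read_number` and `normalize`, the 0 to 9999 limit, and the choice to always spell the digit before its unit (which reproduces the "일천이백삼십사" example) are all assumptions made for illustration. Only the "%" entry and the two number examples come from the text.

```python
import re

# Hypothetical reading-rule table: symbol -> spoken form.
# Only the "%" entry is taken from the source text.
SYMBOL_READINGS = {"%": "퍼센트"}

DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
UNITS = ["", "십", "백", "천"]  # ones, tens, hundreds, thousands

# A comma-grouped number ("1,234") or a plain run of digits ("1").
# A comma followed by a space does not join digits, so "1, 2" is two numbers.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def read_number(text: str) -> str:
    """Read a number from 0 to 9999 as one Sino-Korean word (sketch only)."""
    digits = text.replace(",", "")
    if not digits.isdigit() or len(digits) > 4:
        raise ValueError("this sketch only handles 0 to 9999")
    if int(digits) == 0:
        return "영"
    digits = str(int(digits))  # drop leading zeros
    parts = []
    for pos, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            continue  # zero digits are not spoken
        # Assumption: the digit is always spelled out, as in "일천".
        parts.append(DIGITS[d] + UNITS[len(digits) - pos - 1])
    return "".join(parts)


def normalize(text: str) -> str:
    """Replace numbers and known symbols with their spoken forms."""
    text = NUMBER.sub(lambda m: read_number(m.group()), text)
    for symbol, reading in SYMBOL_READINGS.items():
        text = text.replace(symbol, reading)
    return text
```

With these assumptions, `normalize("1,234")` yields the single word from the text, while `normalize("1, 2, 3, 4")` reads each digit separately because the space after each comma keeps the digits apart.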

The character code table of the character database unit (203) is the table the phoneme parameter extraction unit (202) consults when extracting character codes, both for text produced from symbols, numbers, and special characters by the reading rule table and for all other characters. For this purpose, the character code table holds standard Hangul code values as its data.

The phoneme parameter table stores, for each character code of the input characters obtained via the reading rule table and the character code table, the phoneme parameters that serve as the feature vector of the corresponding phoneme. It is the data table the phoneme parameter extraction unit (202) consults to extract the phoneme parameters for each character code.

The phoneme parameter table is built by the method of the earlier Patent Application No. 2000-0056532, "Apparatus and method for extracting speech signal features by a nonlinear method": a speaker's voice is input several times, and features are extracted from it nonlinearly in the time and frequency domains. The table's data are the phoneme's frequency, the duration of that frequency, its start time, and the pitch period.

The data stored in the character database unit (203) is extracted by the phoneme parameter extraction unit (202) for the phoneme of each character code, and is then used by the Gaussian noise generation unit (204), impulse train generation unit (205), glottal pulse generation unit (206), vocal tract unit (208), and articulation unit (209) to synthesize speech.

The phoneme parameter extraction unit (202) receives processed character data from the text input unit (201). Using the character database unit, it first converts the symbols, numbers, special characters, and the like in the received data into character codes as pronounced, according to the reading rules. It then extracts the character codes of the converted symbols, numbers, and special characters together with those of the other input characters and, based on these, extracts the phoneme parameters that serve as the speech feature vector for each code.

The Gaussian noise generation unit (204) generates a predetermined signal waveform as the noise signal from which unvoiced speech is produced. Applying the phoneme parameter (speech feature vector) values extracted by the phoneme parameter extraction unit to this signal yields the speech signal of the corresponding unvoiced phoneme.

The impulse train generation unit (205) generates the fundamental frequency using the pitch period for voiced sounds. This corresponds to the vibration frequency of the human vocal cords; the default pitch period value is computed statistically from pitch periods obtained from many speakers.
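As a rough illustration of what an impulse train with a given pitch period looks like in discrete time, the following Python sketch places one unit impulse every pitch period. The function name, the sample rate of 16000 Hz, and the unit amplitude are assumptions; the source specifies none of them.

```python
def impulse_train(pitch_period_s: float, duration_s: float,
                  sample_rate: int = 16000) -> list[float]:
    """Return samples that are 1.0 once per pitch period and 0.0 elsewhere.

    All parameter names and the default sample rate are illustrative
    assumptions, not values from the patent.
    """
    period = round(pitch_period_s * sample_rate)  # pitch period in samples
    if period < 1:
        raise ValueError("pitch period is shorter than one sample")
    n_samples = round(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
```

For example, a pitch period of 0.01 s (a 100 Hz fundamental) over 0.05 s at 16000 Hz gives 800 samples containing five impulses, spaced 160 samples apart.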

The glottal pulse generation unit (206) reshapes the impulse train generated by the impulse train generation unit (205) into the fundamental-frequency waveform of speech produced by the human oral structure, so that the train resembles the waveform generated by the fundamental frequency of a human voice.
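The source does not state which pulse shape the glottal pulse generation unit uses. One common textbook choice, swapped in here purely for illustration, is a Rosenberg-style pulse: a raised-cosine opening phase followed by a quarter-cosine closing phase. The sketch below builds such a pulse and convolves it with an impulse train; the function names and the phase lengths are hypothetical.

```python
import math


def glottal_pulse(open_len: int, close_len: int) -> list[float]:
    """Rosenberg-style pulse: smooth rise over open_len samples,
    then a faster fall over close_len samples (an assumed shape)."""
    rise = [0.5 * (1.0 - math.cos(math.pi * n / open_len))
            for n in range(open_len)]
    fall = [math.cos(math.pi * n / (2.0 * close_len))
            for n in range(close_len)]
    return rise + fall


def shape_train(train: list[float], pulse: list[float]) -> list[float]:
    """Convolve an impulse train with the pulse, truncated to len(train)."""
    out = [0.0] * len(train)
    for i, amplitude in enumerate(train):
        if amplitude == 0.0:
            continue  # only impulses contribute
        for j, p in enumerate(pulse):
            if i + j < len(out):
                out[i + j] += amplitude * p
    return out
```

Each impulse is replaced by one smooth pulse, so the output keeps the pitch period of the input train while gaining a shape closer to a glottal airflow waveform.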

The first and second amplification units (207, 207') act on the generated Gaussian noise and glottal pulses, respectively, amplifying them to a signal level suitable for processing in the text-to-speech apparatus of the present invention.

The vocal tract unit (208) converts the Gaussian noise (for unvoiced sounds) and the glottal pulses (for voiced sounds) into the speech signal of the phoneme corresponding to the phoneme parameters extracted by the phoneme parameter extraction unit (202). That is, for an unvoiced sound it applies the frequency characteristic, frequency duration, and frequency start time of the phoneme parameter values to the Gaussian noise to obtain the speech waveform of that phoneme; for a voiced sound it applies the voiced phoneme's parameter values to glottal pulses having a periodic pitch period, generating the speech signal of the voiced phoneme.
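The source does not disclose how the vocal tract unit imposes a phoneme's frequency characteristic on the excitation. A standard way to do so in source-filter synthesis is a two-pole digital resonator, used below as a stand-in. The parameters `freq_hz` and `bandwidth_hz`, the sample rate, and the seeded noise helper are assumptions for illustration only.

```python
import math
import random


def gaussian_noise(n_samples: int, seed: int = 0) -> list[float]:
    """Zero-mean, unit-variance Gaussian noise as an unvoiced excitation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def resonator(excitation: list[float], freq_hz: float, bandwidth_hz: float,
              sample_rate: int = 16000) -> list[float]:
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2].

    The pole radius sets the bandwidth and the pole angle sets the
    centre frequency. This is an assumed filter form, not the patent's.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0  # previous two outputs
    out = []
    for x in excitation:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

Passing `gaussian_noise(...)` through `resonator(...)` gives a noise band around the chosen frequency, which is the unvoiced case; passing a pitch-periodic pulse train through the same filter gives the voiced case.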

The articulation unit (209) connects the speech signals of the individual phonemes generated by the vocal tract unit (208) into an actual continuous speech waveform. Here, the phoneme parameter values for silence extracted by the phoneme parameter extraction unit (202) are used for joining phonemes and words and for sentence-level joins at commas, spaces, and the like. The reading rule table of the character database unit and the speech synthesis unit data are also used by the articulation unit; these data are extracted during phoneme parameter extraction, transmitted over the data line, stored in a buffer serving as a predetermined storage means, and used when needed.

The operation of the text-to-speech apparatus using speech feature vectors is described in detail next, with reference to the components above.

First, when a character string is entered, the text input unit (201) receives the character information in phoneme units. Next, the phoneme parameter extraction unit (202) consults the character database unit (203) and converts the symbols, numbers, and special characters in the input string into strings that follow the reading rules. Each character of the converted string is broken into its components: initial consonant (초성), medial vowel (중성), and final consonant (종성). Each resulting phoneme is converted, using the character code table, into the character code corresponding to that phoneme. Finally, the phoneme parameters corresponding to each character code, which are needed by the subsequent vocal tract and articulation stages, are extracted by consulting the phoneme parameter table.
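The split of each character into initial, medial, and final components can be sketched in Python using the arithmetic layout of precomposed Hangul syllables in Unicode. Using Unicode is an assumption: the source only says the character code table holds standard Hangul code values. The function name `decompose` is likewise illustrative.

```python
# Jamo in Unicode syllable order: 19 initials, 21 medials, 27 finals + none.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

SYLLABLE_BASE = 0xAC00  # code point of "가", the first Hangul syllable


def decompose(text: str) -> list[tuple[str, ...]]:
    """Split each Hangul syllable into (initial, medial, final).

    The final is "" when the syllable has none. Characters that are not
    precomposed Hangul syllables are passed through as 1-tuples.
    """
    result = []
    for ch in text:
        if "가" <= ch <= "힣":
            index = ord(ch) - SYLLABLE_BASE
            initial = CHOSEONG[index // (21 * 28)]
            medial = JUNGSEONG[(index % (21 * 28)) // 28]
            final = JONGSEONG[index % 28]
            result.append((initial, medial, final))
        else:
            result.append((ch,))
    return result
```

For "날아" this gives ("ㄴ", "ㅏ", "ㄹ") and ("ㅇ", "ㅏ", ""). The initial "ㅇ" of the second syllable is a silent placeholder, so the spoken sequence is ㄴ ㅏ ㄹ ㅏ.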

The extracted phoneme parameters consist of a parameter representing the phoneme's frequency characteristic, a parameter representing its utterance time, and a parameter representing its duration. The frequency characteristic has three basic parameters, subdivided by unvoiced sound, voiced sound, and silence. The extracted phoneme parameters therefore comprise five values: the frequency-characteristic parameters for unvoiced sound, voiced sound, and silence, plus the phoneme's utterance duration and utterance start time. The frequency-characteristic parameter is the reference value for deciding whether a character is unvoiced, voiced, or silence, the last acting as a join between phonemes, between words, or at a comma or period.
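One way to picture a row of the phoneme parameter table is as a small record keyed by character code. The Python sketch below is only a reading aid: the field names, the `Voicing` enum, the `lookup` helper, and every numeric value in the table are placeholders invented for illustration, not measurements or structures from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Voicing(Enum):
    UNVOICED = "unvoiced"
    VOICED = "voiced"
    SILENCE = "silence"


@dataclass(frozen=True)
class PhonemeParams:
    voicing: Voicing              # which frequency-characteristic class applies
    frequency_hz: float           # frequency characteristic (placeholder scalar)
    start_s: float                # utterance start time
    duration_s: float             # utterance duration
    pitch_period_s: Optional[float] = None  # only meaningful for voiced sounds


# Placeholder table: all numbers are made up to show the layout only.
PHONEME_TABLE = {
    "ㄴ": PhonemeParams(Voicing.VOICED, 250.0, 0.0, 0.06, 0.008),
    "ㅅ": PhonemeParams(Voicing.UNVOICED, 4500.0, 0.0, 0.09),
    " ": PhonemeParams(Voicing.SILENCE, 0.0, 0.0, 0.05),
}


def lookup(code: str) -> PhonemeParams:
    """Return the parameter record for a character code (KeyError if absent)."""
    return PHONEME_TABLE[code]
```

The `voicing` field plays the role the text assigns to the frequency-characteristic parameter: it is what a later stage would inspect to decide between the noise path, the pulse path, and a silent join.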

The string is processed in input order: using the frequency-characteristic data of the extracted phoneme parameters, each phoneme is judged, in the order entered, to be unvoiced, voiced, or silence.

If the input phoneme parameter value has the characteristic frequency of an unvoiced sound, the Gaussian noise generation unit (204) is driven and generates Gaussian noise as the fundamental-frequency signal for unvoiced processing. If the extracted phoneme parameter is judged to be voiced, the impulse train generation unit (205) is driven and generates an impulse train matching the voiced pitch period value sent from the phoneme parameter extraction unit (202). Because the generated impulse train is a periodically repeating pulse waveform whose period is the statistically derived pitch period of the fundamental frequency of speech, it must be converted to have the waveform of the human oral structure. This conversion is performed by the glottal pulse generation unit (206); the resulting glottal pulses have the fundamental-frequency waveform of a voiced sound produced by the human oral structure.

If the phoneme of an input character is silence, it is determined whether the silence lies between phonemes, between characters, or between words, or is caused by a comma or period. If it is silence between the phonemes of a single character, phoneme parameter data values that cause the phonemes before and after the silence to be connected over a predetermined time, nonlinearly in the frequency domain and linearly in the time domain, are sent from the phoneme parameter extraction unit (202) to the articulation unit (209), where they are later used to join the phonemes when generating a continuous speech signal.

Next, the base signals for unvoiced and voiced sounds generated by the Gaussian noise generation unit (204) and the glottal pulse generation unit (206) are amplified to a predetermined signal level by the first and second amplification units (207, 207') and fed to the vocal tract unit (208) sequentially, in the order the characters were entered. There they are converted into the speech signal of each phoneme according to the phoneme parameter values, sent from the phoneme parameter extraction unit (202), that correspond to each unvoiced or voiced sound. The speech signal of each phoneme is sent to the articulation unit (209) together with the join data for the silences. The articulation unit connects and synthesizes the signals according to the reading rules held in the character database unit and the silences that join the phonemes, producing a continuous speech signal that is output as speech by a predetermined output means.

The processing in the articulation unit (209) is described in more detail below.

Character parameters are input divided into initial, medial, and final sounds, and when speech is reproduced they undergo phonological phenomena such as liaison (연음), consonant assimilation (자음접변), the initial sound rule (두음 법칙), and regressive and progressive assimilation. These phenomena arise from the difference between writing and utterance, that is, from differences in the connection time between phonemes. They can be handled by reproducing the speech signals for the phoneme parameters in input order while adjusting each phoneme's duration and the connection time between phonemes. For example, plain, tense, and aspirated consonants such as "ㄷ" and "ㄸ" are distinguished by adjusting the duration of each phoneme. For a final consonant made up of two letters, such as "ㅀ" or "ㄺ", the first phoneme is made longer than the second, and during articulation the second is synthesized close to the initial sound of the next character, so that the synthesized output is close to natural speech.

Taking liaison as an example, "날아" is reproduced naturally simply by articulating its phonemes in input order, as "ㄴ ㅏ ㄹ ㅏ", so no separate handling is required.

In other words, the phonological phenomena that appear in speech are produced by the duration and start time of each phoneme, so articulating the input parameters in order is a reproduction method that conforms to the reading rules. Therefore, when text is reproduced as speech, the transition interval between phonemes is given a different length depending on whether it falls between phonemes within a character, between words, or between sentences, which brings the output closer to natural speech.
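The idea that timing alone drives the result can be sketched as a small scheduler that assigns each phoneme a start time from its duration and the kind of boundary that follows it. The pause lengths, phoneme durations, and boundary labels below are hypothetical; the patent gives no numeric values.

```python
# Hypothetical pause lengths in milliseconds for each boundary type.
PAUSE_MS = {
    "phoneme": 0,      # joined directly, no pause
    "character": 10,
    "word": 80,
    "comma": 200,
    "sentence": 300,
    "period": 400,
}


def schedule(units):
    """Return (label, start_ms, duration_ms) for each (label, duration_ms, boundary)."""
    timeline = []
    t = 0
    for label, duration_ms, boundary_after in units:
        timeline.append((label, t, duration_ms))
        t += duration_ms + PAUSE_MS[boundary_after]
    return timeline


# "날아" articulated in input order as ㄴ ㅏ ㄹ ㅏ (durations are made up).
units = [("ㄴ", 60, "phoneme"), ("ㅏ", 120, "phoneme"),
         ("ㄹ", 70, "phoneme"), ("ㅏ", 140, "word")]
timeline = schedule(units)
```

With these made-up durations the four phonemes start at 0, 60, 180, and 250 ms, and the next word would begin 80 ms after the last phoneme ends.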

This processing follows the order of the input characters. Unvoiced and voiced sounds are likewise switched by the switching unit (S) so that they are processed in the order in which the phonemes are input.

The text-to-speech device described above uses only phoneme parameters because speech is reproduced or synthesized by feeding an impulse train for voiced sounds, or Gaussian noise for unvoiced sounds, into the vocal tract unit (208). A further reason is that, to bring the output close to natural speech, the Gaussian noise or impulse train that serves as the source signal is selected by switching, and the switching time and duration can be set from the silence duration data in the preset voice characteristic data. Consequently, with a text-to-speech device based on phoneme parameters, whether the phonemes to be reproduced are looked up in a prebuilt database or sent over a communication network such as the Internet, transmission is fast and loss of speech can be prevented even at low bandwidth occupancy, so a voice reproduction service can be provided in real time.

FIG. 3 is a flowchart showing a preferred embodiment of the text-to-speech method using speech feature vectors (phoneme parameters) according to the present invention.

The text-to-speech method according to the present invention first registers the feature data of the voice to be output. This registration follows the previously filed Patent Application No. 10-2000-0056532, as described above (S301). Next, the text to be converted to speech is received from a keyboard or from text editing software. The received text is divided into strings of a predetermined length for data processing. In each string, symbols, numbers, and special characters are converted according to the reading rules into an ordinary string, which for Korean means a Hangul string that follows the Hangul reading rules (S303). Once the symbols, numbers, and special characters have been converted, character codes are extracted from the input string by referring to a character code table composed of the Korean standard character codes (S304). The phoneme parameter value for each character code is then extracted by referring to a phoneme parameter table in which character codes and phoneme parameters are stored in association (S305). From the extracted phoneme parameter values, unvoiced sounds, voiced sounds, and silence are separated nonlinearly in the frequency domain according to the zero-crossing rate of the frequency, and the speech signal for each is generated and output (S306). The method then determines whether further strings to be converted are still being input (S307). If so, it repeats from step S303; if not, the process ends.
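Steps S304 and S305 rely on splitting each Hangul character into its initial, medial, and final sounds before looking up phoneme parameters. The sketch below does the split with the arithmetic layout of precomposed Hangul syllables in Unicode (U+AC00 onward). Using Unicode is an assumption, since the patent only refers to "Korean standard character codes", and the function names are illustrative.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"          # 21 vowels
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


def phoneme_sequence(text):
    """Flatten Hangul text into phonemes in input order."""
    phonemes = []
    for ch in text:
        initial, medial, final = decompose(ch)
        if initial != "ㅇ":          # initial ㅇ is a silent placeholder
            phonemes.append(initial)
        phonemes.append(medial)
        if final:
            phonemes.append(final)
    return phonemes


sequence = phoneme_sequence("날아")
```

For "날아" this yields ㄴ, ㅏ, ㄹ, ㅏ, the same order the description gives for the liaison example. Each phoneme in the list would then be the key for a lookup in the phoneme parameter table.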

FIG. 4 is a subroutine diagram showing the speech synthesis and output process of step S306 in FIG. 3.

As shown, the speech synthesis and output process receives the phoneme parameters extracted in step S305 (S461), classifies each as unvoiced, voiced, or silence according to the zero-crossing rate in its frequency characteristic (S462), and switches between the following paths in the order in which the parameters arrive. When the phoneme classified in step S462 is a voiced sound, an impulse train corresponding to the pitch period is generated using the pitch period among the phoneme parameter values (S463). Because this impulse train is a continuous pulse train with a predetermined period corresponding to the statistically extracted vibration frequency of the human vocal cords, it is converted into a glottal pulse train that has the frequency characteristic of the output of the human vocal cords (S464). When the phoneme parameter value is classified as an unvoiced sound in step S462, Gaussian noise is generated as the fundamental frequency signal for the unvoiced sound (S465).

The generated Gaussian noise with the phoneme parameter values of the unvoiced sound, and the generated glottal pulse with the phoneme parameter values of the voiced sound, are sent to the vocal tract unit. There the Gaussian noise is converted into the speech signal for the corresponding unvoiced phoneme according to that phoneme's parameters, namely its frequency characteristic, duration, and start time. The glottal pulse is converted into the speech signal for the corresponding voiced phoneme by applying the voiced phoneme parameter values sent with it (S466).

Signals classified as silence in step S462 are converted into different connection signals depending on whether they join phonemes, words, or sentences, or correspond to a comma or period. For a join between phonemes, the end of the frequency of the phoneme that precedes the position of the silence and the beginning of the frequency of the phoneme that follows it are connected nonlinearly over a predetermined time interval, which removes the discontinuity between the phonemes. Between words, between sentences, and at commas and periods, pauses of different lengths are set. The silence signal is thus converted into connection setting data (S467). The speech signals for the voiced and unvoiced phonemes are sent to the articulation unit (209) together with the connection setting data and synthesized into a speech signal, using the same joining method as in the silence data extraction of step S467 (S468). Finally, the synthesized speech signal is reproduced as sound by a predetermined audio output means (S469), and control returns to the process of FIG. 3.
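Two operations in this subroutine lend themselves to a short sketch: the three-way classification by zero-crossing rate (S462) and the removal of discontinuities when phonemes are joined (S467, S468). The version below classifies a waveform frame rather than stored phoneme parameters and joins segments with a plain linear crossfade in the time domain. Both choices, along with the thresholds and names, are simplifying assumptions and not the patent's own procedure.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify(frame, energy_floor=1e-4, zcr_threshold=0.3):
    """Label a frame 'silence', 'unvoiced', or 'voiced' (thresholds are made up)."""
    energy = sum(s * s for s in frame) / len(frame)
    if energy < energy_floor:
        return "silence"
    return "unvoiced" if zero_crossing_rate(frame) > zcr_threshold else "voiced"


def crossfade(a, b, overlap):
    """Join a and b, blending the last `overlap` samples of a into the start of b."""
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)          # weight ramps linearly from a to b
        mixed.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    return a[:len(a) - overlap] + mixed + b[overlap:]


# Hypothetical frames at 8 kHz: a 100 Hz tone, a sign-alternating signal, zeros.
tone = [math.sin(2.0 * math.pi * 100.0 * n / 8000.0) for n in range(400)]
labels = [classify(tone), classify([1.0, -1.0] * 100), classify([0.0] * 100)]
joined = crossfade([1.0] * 10, [0.0] * 10, overlap=4)
```

The low-frequency tone crosses zero rarely and is labeled voiced, the alternating signal crosses at every sample and is labeled unvoiced, and the all-zero frame falls under the energy floor. `joined` steps down through 0.8, 0.6, 0.4, and 0.2 across the overlap instead of jumping from 1 to 0.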

FIG. 5 shows an application example of the text-to-speech device using speech feature vectors according to the present invention.

The illustrated embodiment is a block diagram of a network environment that provides a voice output service over a communication network such as the Internet. Its processing is as follows.

Using a terminal (500) in the user's environment, the user connects and, where login is required, authenticates with credentials such as voice or text (S501, S502). The connected user selects the content for the voice output service, such as a book, magazine, or newspaper file, through a search engine on the server (510) (S503). On receiving the service request, the server (510) sends the terminal (500) the phoneme parameters for the text of the selected file, consisting of the frequency information, start time information, and duration information of the phonemes that make up each piece of text. The terminal (500) converts the phoneme parameters received from the server (510) into speech with the text-to-speech device using speech feature vectors according to the present invention, which makes it possible to provide an online voice service for various documents.
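To give a feel for why sending phoneme parameters is cheaper than sending audio, the sketch below packs each phoneme's frequency, start time, and duration into a fixed-size binary record and compares the payload with uncompressed audio of the same length. The record layout, field widths, sample values, and the 8 kHz 16-bit PCM baseline are all assumptions; the patent defines no wire format.

```python
import struct

# Hypothetical record: frequency (Hz, uint16), start (ms, uint32), duration (ms, uint16).
RECORD = struct.Struct("<HIH")


def encode(params):
    """Pack (frequency_hz, start_ms, duration_ms) tuples into bytes."""
    return b"".join(RECORD.pack(f, s, d) for f, s, d in params)


def decode(payload):
    """Unpack bytes produced by encode() back into a list of tuples."""
    return [RECORD.unpack_from(payload, offset)
            for offset in range(0, len(payload), RECORD.size)]


# Made-up parameters for four phonemes spanning 390 ms.
params = [(110, 0, 60), (120, 60, 120), (115, 180, 70), (118, 250, 140)]
payload = encode(params)

# Size of the same 390 ms as 16-bit mono PCM at 8 kHz, for comparison.
total_ms = sum(duration for _, _, duration in params)
pcm_bytes = total_ms * 8000 * 2 // 1000
```

Under these assumptions the four records occupy 32 bytes, against 6240 bytes for the equivalent raw audio. Actual savings depend on how many parameters each phoneme really needs.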

Next, the previously filed Patent Application No. 2000-0056532, "Apparatus and method for extracting features of a speech signal using a nonlinear method", which is used in the present invention, is briefly described.

In the dynamic wavelet transform described in Patent Application No. 2000-0056532, a Haar wavelet is used for the nonlinear processing of voiced sounds and a spline wavelet for the nonlinear processing of unvoiced sounds. Human speech is divided into voiced sound, unvoiced sound, and silence, and the frequency characteristic, start time, and duration of each phoneme are then taken as the voice characteristic data.
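For readers unfamiliar with the Haar wavelet mentioned here, one level of the ordinary orthonormal Haar transform is sketched below. It is the textbook discrete transform only and not the "dynamic wavelet transform" of the earlier application, whose details are not given in this document; the function names are illustrative.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the signal from one level of Haar coefficients."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend(((a + d) * s, (a - d) * s))
    return out


x = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_step(x)
restored = haar_inverse(approx, detail)
```

The approximation coefficients carry the smoothed signal and the detail coefficients the local differences; a pair of equal samples such as (5, 5) gives a zero detail coefficient. The transform preserves signal energy and is exactly invertible up to rounding.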

The present invention is not limited to the embodiments described above and may be modified in various ways without departing from its technical idea.

The text-to-speech device using speech feature vectors according to the present invention minimizes the database required for speech synthesis and synthesizes text into speech signals using only phoneme parameters, which shortens the processing time of speech synthesis to real time.

In addition, synthesizing and reproducing text as speech using only phoneme parameters brings the reproduced speech close to natural sound.

In addition, when the text-to-speech device and method according to the present invention are applied over a network, only the voice parameters needed for speech synthesis are transmitted, which significantly reduces network traffic, and speech can be reproduced on the terminal in real time without streaming or buffering.

Claims

A character database unit that stores reading rule data, phoneme parameter data, and character code data for character strings according to predetermined criteria;

A character input unit that receives a character string through a predetermined input means;

A phoneme parameter extraction unit that extracts, from the character database unit, the phoneme data for the characters received by the character input unit;

A Gaussian noise generator that generates Gaussian noise as the fundamental frequency for an unvoiced sound when the extracted phoneme parameter is an unvoiced sound;

An impulse train generator that generates an impulse train as the fundamental frequency for a voiced sound when the extracted phoneme parameter is a voiced sound;

A glottal pulse generator that converts the impulse train generated by the impulse train generator into the fundamental frequency waveform of a speech signal produced by the human oral structure;

A vocal tract unit that generates the speech signal for a phoneme by applying the phoneme parameter value extracted by the phoneme parameter extraction unit to the fundamental frequency value for the unvoiced or voiced sound generated by the Gaussian noise generator or the glottal pulse generator; and

An articulation unit that joins the speech signals of the phonemes generated by the vocal tract unit and synthesizes them into a continuous speech signal.

The device of claim 1, wherein the character database unit comprises:

A reading rule table holding reading rule data for symbols, numbers, and special characters;

A character code table holding the character codes of the initial, medial, and final sounds of the input characters; and

A phoneme parameter table holding, as data, the phoneme parameter value of the phoneme corresponding to each character code value in the character code table.

The device of claim 2, wherein the phoneme parameter table

comprises, for each phoneme, a fundamental frequency value, a duration value, and a start time value.

The device of claim 3, wherein the fundamental frequency value

is, for each phoneme, a frequency value for one of unvoiced sound, voiced sound, or silence.

The device of claim 4, wherein the phoneme parameter value for silence,

when phonemes are joined during speech synthesis, causes the frequency of the phoneme preceding the silence and the frequency of the phoneme following the silence to be joined nonlinearly in the frequency domain and linearly in the time domain.

The device of claim 4, wherein the phoneme parameter value for silence,

for silence between characters, between words, or between sentences, or at a comma or period, is processed as a pause of the corresponding predetermined time interval.

The device of claim 1, wherein the articulation unit

implements phonological phenomena during articulation using only phoneme parameter values.

A first step of receiving phoneme parameter data, reading rule data, and character code data for a voice and building a character database;

A second step of receiving a character string to be synthesized into speech;

A third step of converting symbols, numbers, and special characters in the input string into character strings according to the reading rules, dividing the converted string into initial, medial, and final sounds according to the reading rules, and extracting the character codes of the corresponding characters;

A fourth step of extracting the phoneme parameters corresponding to the extracted character codes; and

A fifth step of synthesizing and reproducing speech using the extracted phoneme parameters.

The method of claim 8, wherein the fifth step comprises:

A sixth step of receiving the extracted phoneme parameters;

A seventh step of determining whether each input phoneme parameter is for an unvoiced sound, a voiced sound, or silence;

An eighth step of generating Gaussian noise if the parameter is determined in the seventh step to be for an unvoiced sound, generating an impulse train and converting it into a glottal pulse train if it is determined to be for a voiced sound, and extracting and transmitting the phoneme parameters for the junction of the corresponding silence if it is determined to be for silence;

A ninth step of receiving the result signal for each voiced sound, unvoiced sound, or silence processed in the eighth step together with the corresponding phoneme parameters, and generating the speech signal for each phoneme in the order in which the phonemes were input; and

A tenth step of sequentially joining the speech signals generated in the ninth step using the silence data, synthesizing them into a continuous speech signal, and outputting it as sound through a predetermined audio output means.