KR20050041749A

KR20050041749A - Voice synthesis apparatus depending on domain and speaker by using broadcasting voice data, method for forming voice synthesis database and voice synthesis service system

Info

Publication number: KR20050041749A
Application number: KR1020030077017A
Authority: KR
Inventors: 최문옥; 김상훈
Original assignee: 한국전자통신연구원
Priority date: 2003-10-31
Filing date: 2003-10-31
Publication date: 2005-05-04

Abstract

The present invention relates to a domain- and speaker-dependent speech synthesis apparatus that uses broadcast speech data in place of the relatively costly voice recording work required to implement a large-corpus-based speech synthesizer, to a method for constructing the speech synthesis database used in such an apparatus, and to a speech synthesis service system. Instead of separately designing synthesis text, selecting a speaker, and recording speech for the speech synthesis database, the invention uses speech data captured from ordinary broadcasts, organized by specific domain and speaker, to build each speech synthesis database by an automated method, and then implements a speech synthesis apparatus and a speech synthesis service system that use these databases. According to the invention, speech synthesis databases that depend on the service domain become easier to construct and extend, and the naturalness and affinity of the synthesized speech can be improved.

Description

VOICE SYNTHESIS APPARATUS DEPENDING ON DOMAIN AND SPEAKER BY USING BROADCASTING VOICE DATA, METHOD FOR FORMING VOICE SYNTHESIS DATABASE AND VOICE SYNTHESIS SERVICE SYSTEM

The present invention relates to a domain- and speaker-dependent speech synthesis apparatus using broadcast speech data, a method for constructing the speech synthesis database used in such an apparatus, and a speech synthesis service system. More specifically, it relates to a corpus-based speech synthesis apparatus built by constructing a separate speech synthesis database for each service domain and speaker according to the particular service field. In other words, it relates to a technique in which a specific broadcast voice that is highly familiar to the general public is selected and recorded for each service field, a speech synthesis database is constructed for each domain and speaker from those recordings, and a speech synthesis apparatus is implemented using the resulting databases.

Conventional corpus-based speech synthesis apparatuses have aimed either at synthesizing speech for unrestricted text or, to improve synthesized speech quality in a particular service domain, at speech synthesis that depends on a specialized service domain such as news, weather forecasts, traffic information, or stock information.

Speech synthesis for unrestricted text has the advantage that synthesized speech can be generated for text from any domain, but because the size of the speech synthesis database is physically limited, the database can hardly contain every speech unit that may occur. As a result, the quality of the synthesized speech falls short of what commercial use requires. Given this problem, and as demand for speech synthesizers in specific service domains has grown, domain-dependent speech synthesis methods have been proposed that use a domain-dependent speech synthesis database optimized for the target domain in order to raise synthesized speech quality in that domain. As related prior art, Korean Patent Application No. 1998-51342 (filed November 27, 1998), "Speech synthesis method using a database for domain-dependent speech synthesis," has been filed.

Such a conventional corpus-based speech synthesis method generally comprises a step of constructing a speech synthesis database and a step of generating synthesized speech by optimally combining the required synthesis units from the constructed database. Of these, the conventional process of constructing a speech synthesis database is shown in FIG. 1.

As shown in FIG. 1, the conventional process of constructing a speech synthesis database comprises: selecting a speaker, for example by hiring a voice actor (S10); designing the synthesis text (S11); uttering and recording the designed synthesis text (S12); checking the recording against the text for mismatches and either correcting the text or re-recording (S13); performing phoneme and pitch labeling on the combined speech data and text collected through steps S10 to S12 (S14); and composing the speech synthesis database (S15). As prior art related to the synthesis text design step (S11) of this conventional process, Patent Application No. 2002-1353 (filed January 10, 2002), "Method for selecting recording sentences for corpus-based speech synthesis," has been filed.

However, in the conventional database construction method described above, the speaker selection step (S10), the synthesis text design step (S11), the step of uttering and recording the designed text (S12), and the step of correcting the text or re-recording when the recording and the text do not match (S13) all require high cost and a long time. The database is therefore not easy to construct or extend, and because the recording in step S12 is staged, the speech tends to sound less natural, or more exaggerated, than speech in a real situation. In addition, further recording is difficult because the same voice actor must be hired again and the speech must be collected in a condition similar to that of the earlier sessions.

The present invention is intended to solve the problems of the prior art described above. Its object is to provide a domain- and speaker-dependent speech synthesis apparatus that makes the speech synthesis database easier to construct and extend and improves the affinity of the synthesized speech, a method for constructing the speech synthesis database used in such an apparatus, and a speech synthesis service system.

To achieve this object, a domain- and speaker-dependent speech synthesis apparatus according to the present invention comprises: a broadcast speech recording unit that records the broadcast speech data of a fixed speaker for each service domain and provides broadcast speech data for each domain and speaker; a plurality of speech synthesis databases in which the broadcast speech data provided by the broadcast speech recording unit is organized into a database for each domain and speaker; a synthesizer that performs speech synthesis according to a predetermined speech synthesis algorithm using the broadcast speech data stored in the respective speech synthesis databases; and a synthesized speech playback unit that reproduces the synthesized speech from the synthesizer.

To achieve this object, a method for constructing a domain- and speaker-dependent speech synthesis database according to the present invention comprises: a first step of collecting broadcast speech data for each service domain and speaker by designating and recording a specific broadcast; a second step of performing automatic text transcription and verification using a continuous speech recognizer to obtain the text corresponding to the broadcast speech data collected in the first step; a third step of taking the broadcast speech data collected and the text transcribed in the preceding steps and performing triphone-level phoneme labeling using a speech recognizer and pitch labeling using a pitch extraction tool; and a fourth step of combining the phoneme and pitch labeling results of the third step with the broadcast speech data of the first step to compose a speech synthesis database that depends on domain and speaker.

To achieve this object, a domain- and speaker-dependent speech synthesis service system according to the present invention comprises: a plurality of web contents, each covering news information, weather forecast information, or traffic information as a service domain; a content update processing unit for updating the information of each service domain in real time; a plurality of speech synthesis databases storing broadcast speech data for each service domain and speaker; and a speech synthesis server that is connected to the plurality of web contents and the content update processing unit through the Internet and is also connected to the plurality of speech synthesis databases, that constructs the domain- and speaker-specific speech synthesis databases using the plurality of web contents provided by the content update processing unit, and that, when a service request arrives from a wired or wireless information terminal, performs speech synthesis using the broadcast speech data stored in the respective databases and provides the result to the information terminal.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

First, a domain/speaker-dependent speech synthesis apparatus using broadcast speech data according to an embodiment of the present invention is described with reference to FIG. 2, which shows the configuration of the apparatus.

As shown in FIG. 2, the speech synthesis apparatus according to the embodiment comprises a broadcast speech recording unit (20), a plurality of speech synthesis databases (21 to 24), a synthesizer (25), and a synthesized speech playback unit (26). Instead of hiring a voice actor and recording pre-designed synthesis text as in the prior art, the apparatus uses the broadcast speech of a fixed speaker as the speech data for synthesis. For example, the broadcast speech recording unit (20) records the broadcast speech of a speaker who is fixed for each domain, such as Um Ki-young (엄기영) for the news domain, Lee Ik-seon (이익선) for the weather forecast domain, and Hwang Jung-min (황정민) for the traffic information domain. Exploiting the fact that broadcast speech has a fixed speaker over a given period and can be classified into a specific service domain, the plurality of domain/speaker-specific speech synthesis databases (21 to 24) are constructed from the broadcast speech recorded by the broadcast speech recording unit (20). The synthesizer (25) then performs speech synthesis according to a predetermined speech synthesis algorithm using the speech data stored in each speech synthesis database (21 to 24), and the synthesized speech playback unit (26) reproduces the final synthesized speech.
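The arrangement of FIG. 2 can be pictured as a lookup from service domain to the database built from that domain's fixed speaker. The Python sketch below is only an illustration under assumed names: `SynthesisDatabase`, `DomainSpeakerRegistry`, and the domain keys `news`, `weather`, and `traffic` are hypothetical and do not appear in the patent, and the `units` field is a placeholder for whatever unit inventory a real synthesizer would store.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SynthesisDatabase:
    """One database per (domain, speaker) pair, standing in for databases 21 to 24."""
    domain: str
    speaker: str
    # Placeholder: unit label -> waveform bytes. The real storage layout is not specified.
    units: Dict[str, bytes] = field(default_factory=dict)


class DomainSpeakerRegistry:
    """Hypothetical lookup table representing the plurality of databases."""

    def __init__(self) -> None:
        self._by_domain: Dict[str, SynthesisDatabase] = {}

    def register(self, db: SynthesisDatabase) -> None:
        self._by_domain[db.domain] = db

    def select(self, domain: str) -> SynthesisDatabase:
        """Return the database for a domain, or raise KeyError if none was built."""
        try:
            return self._by_domain[domain]
        except KeyError:
            raise KeyError(f"no synthesis database for domain {domain!r}") from None


# Domain/speaker pairs taken from the example in the description.
registry = DomainSpeakerRegistry()
for domain, speaker in [
    ("news", "Um Ki-young"),
    ("weather", "Lee Ik-seon"),
    ("traffic", "Hwang Jung-min"),
]:
    registry.register(SynthesisDatabase(domain, speaker))
```

A synthesizer in this sketch would call `registry.select(domain)` before unit selection, so that every utterance in a domain is produced from a single speaker's recordings.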

Compared with the conventional approach, this approach makes the speech synthesis database easier to construct and extend. At the same time, because the synthesizer (25) draws on a database of broadcast speech that is thoroughly familiar to the general public and representative of the domain, the affinity of the final synthesized speech with listeners can be greatly improved. Furthermore, because the method uses actual broadcast speech from the target service domain, rather than staged readings of artificially designed text as in the conventional method, it improves the naturalness of both the speech data and the synthesized speech.

Next, the method for constructing the speech synthesis database used in the speech synthesis apparatus is described with reference to FIGS. 3 and 4. FIG. 3 is a flowchart of the method for constructing a speech synthesis database according to an embodiment of the present invention.

As shown in FIG. 3, when the database construction method starts, broadcast speech data is first collected for each domain/speaker by designating and recording a specific broadcast (S30). Automatic text transcription and verification using a continuous speech recognizer is then performed to obtain the text corresponding to the collected broadcast speech data (S31). The automatic text transcription and verification step (S31) is shown in more detail in FIG. 4 and is described further below.

Next, taking the broadcast speech data collected and the text transcribed in the preceding steps as input, triphone-level phoneme labeling is performed using a speech recognizer and pitch labeling is performed using a pitch extraction tool (S32). In the final step, the phoneme and pitch labeling results are combined with the broadcast speech data to compose the domain/speaker-dependent speech synthesis database (S33). This construction process simplifies and automates the conventional steps of speaker selection and recording, synthesis text design, and error correction, which in turn makes the speech synthesis database easier to construct and extend.
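To make steps S32 and S33 more concrete, the Python sketch below expands a phoneme sequence into triphone labels and attaches a mean pitch value to each unit. It is an illustration under stated assumptions, not the patent's implementation: the `left-center+right` label notation, the `sil` boundary symbol, the `UnitEntry` record, and the input formats (phoneme segments with start and end times, and a pitch track of time and F0 pairs with 0 marking unvoiced frames) are all hypothetical. In practice the segment boundaries would come from the speech recognizer and the pitch track from the pitch extraction tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def to_triphones(phones: List[str], boundary: str = "sil") -> List[str]:
    """Expand a phoneme sequence into left-center+right triphone labels."""
    padded = [boundary] + phones + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


@dataclass
class UnitEntry:
    """One labeled unit as it might be stored in the synthesis database (assumed layout)."""
    triphone: str
    start_s: float
    end_s: float
    mean_f0_hz: Optional[float]  # None when no voiced pitch frame falls inside the unit


def build_entries(
    segments: List[Tuple[str, float, float]],   # (phoneme, start time, end time)
    pitch_track: List[Tuple[float, float]],     # (time, F0 in Hz; 0 means unvoiced)
) -> List[UnitEntry]:
    """Combine phoneme labels and pitch labels into database entries."""
    labels = to_triphones([phone for phone, _, _ in segments])
    entries = []
    for label, (_, start, end) in zip(labels, segments):
        voiced = [hz for t, hz in pitch_track if start <= t < end and hz > 0]
        mean_f0 = sum(voiced) / len(voiced) if voiced else None
        entries.append(UnitEntry(label, start, end, mean_f0))
    return entries
```

For example, `to_triphones(["k", "a", "m"])` yields `["sil-k+a", "k-a+m", "a-m+sil"]`, so each unit carries its left and right phonetic context.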

FIG. 4 is a flowchart showing the automatic text transcription and verification step of FIG. 3 in more detail.

As shown in FIG. 4, when the automatic text transcription and verification step starts, the domain/speaker-specific broadcast speech data collected in the preceding step is input (S40) and continuous speech recognition is performed (S41). For each recognition unit, such as a morpheme or word phrase, the recognizer outputs a recognition result in text form together with a recognition score. The score is then compared with a threshold to determine whether it is greater than or equal to the threshold (S42). If the score is at or above the threshold, recognition is judged to be correct and the recognized text is adopted as the text corresponding to the broadcast speech (S43). If the score is below the threshold, the text is stored in a verification target list (S47) and undergoes verification and error correction (S48); once error correction is complete, this text is likewise adopted as the text corresponding to the broadcast speech. The threshold used in step S42 can be set freely by the user. The verification target list of step S47 is generated automatically from the comparison of recognition scores with the threshold. In step S48, each item on the verification target list is checked by comparing the actual broadcast speech data with the recognized text, and the text is corrected where the two differ.
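The branching in steps S42, S43, and S47 amounts to splitting recognition results by score. The Python sketch below illustrates that split; the `Recognized` record, the function name `route_by_score`, and the score scale are assumptions made for the example, since the patent does not specify how scores are represented.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Recognized:
    """One recognition unit (e.g. a morpheme or word phrase) with its score."""
    text: str
    score: float


def route_by_score(
    results: Iterable[Recognized], threshold: float
) -> Tuple[List[Recognized], List[Recognized]]:
    """Split results into accepted text (S43) and a verification target list (S47).

    A score greater than or equal to the threshold counts as correct recognition,
    matching the comparison described for step S42.
    """
    accepted: List[Recognized] = []
    to_verify: List[Recognized] = []
    for result in results:
        if result.score >= threshold:
            accepted.append(result)
        else:
            to_verify.append(result)
    return accepted, to_verify
```

Raising the user-set threshold sends more units to the verification list, trading manual checking effort for transcription accuracy.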

Meanwhile, a language model update method and a speaker adaptation method are used to improve the recognition rate of the continuous speech recognizer used in the present invention. In step S44, text for the relevant service domain is collected automatically from recognition results that have already been completed and from the web; in the following step S45, a language model is extracted from the collected text and supplied to the speech recognizer. The language model used by the recognizer in step S41 can therefore be updated automatically. In the speaker adaptation step (S46), the broadcast speech data input in step S40 and the transcribed text from step S43 are used as input to generate an acoustic model adapted to the speaker, which is then applied to the recognizer. Both methods are applied repeatedly each time the automatic text transcription and verification step (S31) of FIG. 3 is performed, so that speech recognition optimized for the domain and speaker can be achieved.
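The language model update of steps S44 and S45 can be illustrated with a model whose counts grow as new domain text arrives. The Python sketch below is an assumption-laden example: the patent does not say which kind of language model is used, so the bigram model, the whitespace tokenization, the `<s>` and `</s>` markers, and the class name `BigramLM` are all hypothetical choices. Korean text would normally need morphological analysis rather than whitespace splitting.

```python
from collections import Counter
from typing import Iterable


class BigramLM:
    """Minimal bigram language model that can be updated incrementally."""

    def __init__(self) -> None:
        self.history_counts: Counter = Counter()  # how often each token starts a bigram
        self.bigram_counts: Counter = Counter()   # how often each (previous, word) pair occurs

    def update(self, sentences: Iterable[str]) -> None:
        """Add newly collected text (verified transcripts or web text) to the counts."""
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.history_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens[:-1], tokens[1:]))

    def prob(self, previous: str, word: str) -> float:
        """Maximum-likelihood estimate of P(word | previous); 0.0 for unseen histories."""
        total = self.history_counts[previous]
        if total == 0:
            return 0.0
        return self.bigram_counts[(previous, word)] / total
```

Calling `update` again after each transcription pass corresponds to the repeated application described above: the estimates shift toward the vocabulary and phrasing of the target domain as more text is collected.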

FIG. 5 shows the configuration of a speech synthesis service system that uses the speech synthesis databases according to an embodiment of the present invention.

The speech synthesis database according to the embodiment described above is easy to construct and yields synthesized speech with improved naturalness and affinity, and, as shown in FIG. 5, it can be used in a speech synthesis service system.

The speech synthesis service system shown in FIG. 5 consists of a plurality of web contents (52) such as news information, weather forecast information, and traffic information; a content update processing unit (53); a plurality of domain/speaker-dependent speech synthesis databases (50); and a domain/speaker-dependent speech synthesis server (51) that is connected to the web contents (52) and the content update processing unit (53) through the Internet, is also connected to the speech synthesis databases (50), and performs speech synthesis. An information terminal (54) such as a personal computer can be connected to the speech synthesis server (51) through the Internet, and a mobile information terminal (55) such as a PDA or wireless telephone can be connected through a wireless network.

The speech synthesis server (51) constructs the domain/speaker-specific speech synthesis databases (50) using the plurality of web contents provided through the Internet by the content update processing unit. When an information terminal (54, 55) requests the synthesized speech service over the Internet or a wireless network, the server generates synthesized speech from the domain/speaker-specific speech synthesis databases (50) and provides it to that terminal. Because the domain/speaker-dependent speech synthesis databases (50) are built for each service field, the invention can provide users with high-quality synthesized speech that has high affinity. Mobile information terminals (55) such as PDAs and wireless telephones in particular can make more active use of a voice service because of the nature of the devices. In that case, the familiar voice generated separately for each service field can improve the quality of the voice service and user satisfaction, and the system can also be applied as a service that compensates for a weakness of broadcasting, namely its limited ability to deliver information in real time.
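The request flow through the server (51) can be sketched as follows: the content update processing unit pushes the latest text for a domain, and a terminal request for that domain is answered with speech synthesized in that domain's speaker voice. The Python example below is a sketch under assumptions only. The class `SynthesisServer`, its method names, and the `fake_synthesize` stand-in are hypothetical; a real server would run unit-selection synthesis against the databases (50) and return audio, whereas the stand-in merely tags the text with the speaker name.

```python
from typing import Callable, Dict


class SynthesisServer:
    """Hypothetical model of the domain/speaker-dependent synthesis server (51)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize            # (speaker, text) -> audio bytes
        self._speaker_by_domain: Dict[str, str] = {}
        self._latest_text: Dict[str, str] = {}

    def register_domain(self, domain: str, speaker: str) -> None:
        """Associate a service domain with the speaker whose database serves it."""
        self._speaker_by_domain[domain] = speaker

    def update_content(self, domain: str, text: str) -> None:
        """Called by the content update processing unit (53) when web content changes."""
        self._latest_text[domain] = text

    def handle_request(self, domain: str) -> bytes:
        """Answer a terminal request with speech for the domain's latest content."""
        if domain not in self._speaker_by_domain:
            raise KeyError(f"unknown service domain {domain!r}")
        text = self._latest_text.get(domain, "")
        return self._synthesize(self._speaker_by_domain[domain], text)


def fake_synthesize(speaker: str, text: str) -> bytes:
    """Stand-in for real synthesis: returns tagged text instead of audio."""
    return f"[{speaker}] {text}".encode("utf-8")


server = SynthesisServer(fake_synthesize)
server.register_domain("weather", "Lee Ik-seon")
server.update_content("weather", "Rain is expected tonight.")
```

Keeping content updates separate from request handling reflects the description: content changes in real time, while the voice for each domain stays fixed.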

With the apparatus and method presented in the present invention, a domain-dependent synthesizer capable of generating high-quality synthesized speech for a specific domain can be built, and at the same time the affinity of the synthesized speech can be greatly improved by using, for each specific service field such as news, weather forecasts, traffic information, or stock information, the voice of a speaker who is widely known to the general public. In addition, a speech synthesis apparatus with improved naturalness can be implemented by using speech from actual service situations in the relevant service domain. Furthermore, when the speech synthesis database is constructed, the separate steps of speaker selection, synthesis text design, utterance and recording of the designed text, and text correction or re-recording when the recording and the text do not match are replaced by simplified or automated methods. Because broadcasts air at regular times with a speaker who is fixed for a specific domain, a large speech database for a single speaker in a specific domain becomes very easy to construct and extend. Based on these features, a domain/speaker-dependent speech synthesis apparatus whose synthesized speech is more natural and has higher affinity than that of conventional speech synthesis methods can be implemented more easily.

What has been described above is merely one embodiment for carrying out the domain/speaker-dependent speech synthesis apparatus using broadcast speech data, the method for constructing a speech synthesis database, and the speech synthesis service system according to the present invention. The present invention is not limited to this embodiment; its technical spirit extends to the full range of modifications that any person of ordinary skill in the art to which the invention pertains could make without departing from the gist of the invention as claimed in the following claims.

FIG. 1 is a flowchart showing the conventional process of constructing a speech synthesis database.

FIG. 2 is a diagram showing the configuration of a domain/speaker-dependent speech synthesis apparatus using broadcast speech data according to an embodiment of the present invention.

FIG. 3 is a flowchart showing a method for constructing a speech synthesis database according to an embodiment of the present invention.

FIG. 4 is a flowchart showing the automatic text transcription and verification step of FIG. 3 in more detail.

FIG. 5 is a diagram showing the configuration of a speech synthesis service system using the speech synthesis databases according to an embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

20: broadcast speech recording unit

21 to 24: domain/speaker-specific speech synthesis databases

25: synthesizer

26: synthesized speech playback unit

Claims

A domain- and speaker-dependent speech synthesis apparatus, comprising:

a broadcast speech recording unit configured to record broadcast speech data of a fixed speaker for each service domain and to provide broadcast speech data for each domain and speaker;

a plurality of speech synthesis databases in which the broadcast speech data for each domain and speaker provided by the broadcast speech recording unit is organized into a database for each domain and speaker;

a synthesizer configured to perform speech synthesis according to a predetermined speech synthesis algorithm using the broadcast speech data stored in the respective speech synthesis databases; and

a synthesized speech playback unit configured to reproduce the synthesized speech from the synthesizer.

The apparatus of claim 1,

wherein the speech synthesis databases include a news domain, a weather forecast domain, and a traffic information domain, and broadcast speech data of a representative speaker in each service domain is used.

A method for constructing a domain- and speaker-dependent speech synthesis database, comprising:

a first step of collecting broadcast speech data for each service domain and speaker by designating and recording a specific broadcast;

a second step of performing automatic text transcription and verification using a continuous speech recognizer to obtain text corresponding to the broadcast speech data collected in the first step;

a third step of receiving the broadcast speech data collected and the text transcribed in the preceding steps and performing triphone-level phoneme labeling using a speech recognizer and pitch labeling using a pitch extraction tool; and

a fourth step of combining the phoneme and pitch labeling results of the third step with the broadcast speech data of the first step to compose a speech synthesis database that depends on domain and speaker.

The method of claim 3,

wherein the third step comprises:

receiving the broadcast speech data for each domain and speaker, performing continuous speech recognition, and computing a recognition result in text form and a recognition score for each recognition unit such as a morpheme or word phrase; and

if the computed recognition score is equal to or greater than a predetermined threshold, determining that recognition was correct and adopting the recognized text as the text corresponding to the broadcast speech, and if the recognition score is less than the threshold, storing the text in a verification target list and performing verification and error correction on the text.

The method of claim 4,

wherein the verification target list is generated automatically by comparing the recognition score with the threshold.

The method of claim 3,

further comprising automatically collecting text for each service domain from recognition results and from the web, extracting a language model from the collected text, and applying the language model to the continuous speech recognizer of the second step.

The method of claim 3,

further comprising receiving the input broadcast speech data and the transcribed text, generating an acoustic model adapted to the speaker, and applying the acoustic model to the continuous speech recognizer of the second step.

A domain- and speaker-dependent speech synthesis service system, comprising:

a plurality of web contents, each covering news information, weather forecast information, or traffic information as a service domain;

a content update processing unit for updating the information of each service domain in real time;

a plurality of speech synthesis databases storing broadcast speech data for each service domain and speaker; and

a speech synthesis server that is connected to the plurality of web contents and the content update processing unit through the Internet and is also connected to the plurality of speech synthesis databases, that constructs the domain- and speaker-specific speech synthesis databases using the plurality of web contents provided by the content update processing unit, and that, when a service request is made through a wired or wireless information terminal, performs speech synthesis using the broadcast speech data stored in the respective databases and provides the result to the information terminal.