KR101203188B1

KR101203188B1 - Method and system of synthesizing emotional speech based on personal prosody model and recording medium

Info

Publication number: KR101203188B1
Application number: KR1020110034531A
Authority: KR
Inventors: 박종철; 이호준
Original assignee: 한국과학기술원 (Korea Advanced Institute of Science and Technology, KAIST)
Priority date: 2011-04-14
Filing date: 2011-04-14
Publication date: 2012-11-22
Also published as: KR20120117041A

Abstract

A method and system are provided for synthesizing emotional speech based on a personal prosody model. The emotional speech synthesis method according to the present invention comprises: an emotional prosody structure characteristic extraction step of analyzing each individual's speech to extract the characteristics of that individual's emotional prosody structure; a step of storing the extracted personal emotional prosody structures in an emotional speech database; a receiving step of receiving input text and a target emotion; a step of retrieving, from the emotional speech database, the personal emotional prosody structure corresponding to the speaker whose voice is to be synthesized; and an emotional speech synthesis step of converting the input text into emotionless speech and modifying the converted emotionless speech based on the personal emotional prosody structure corresponding to the target emotion, thereby generating and outputting emotional speech corresponding to the speaker. Because the present invention uses personal emotional prosody structures when synthesizing emotional speech, the average recognition rate reached up to 95.5% in a series of repeated recognition tests.

Description

Method and system of synthesizing emotional speech based on personal prosody model and recording medium

The present invention relates to speech synthesis technology and, in particular, to a method and a speech synthesis system for synthesizing emotional speech that conveys a speaker's emotion, using the results of modeling the emotional prosody structure of basic emotions in consideration of the individuality of speakers and the relative differences among them.

Speech is the most basic and widely used means of communication for expressing thoughts during human-to-human interaction. It is also studied as a user-friendly interface between humans and machines. For example, assistive robots use speech to provide information about daily activities such as the weather, TV programs, and medication schedules. State-of-the-art speech synthesis produces artificial speech with a very high recognition rate. In particular, the output of a text-to-speech (TTS) system that uses a unit-selection algorithm is sometimes regarded as nearly identical to a real human voice. Prosody structure includes duration, loudness, and fundamental frequency variation, while voice quality refers to the details of the voice source. Although synthesis results exhibit different prosody structures and voice quality depending on the unit-selection speech database, a general-purpose TTS system produces speech of nearly the same quality for a given sentence. If the same speech database and sentence are selected, the synthesis result is the same regardless of contextual information such as the speaker's emotional state.

However, the quality of synthesized speech and the naturalness of its intonation remain important challenges. Along with the demand for better voice quality and naturalness, speech synthesis technology is needed that delivers the required information in a natural and effective manner. To this end, various types of emotional expression are typically first converted into corresponding datasets, and the results are used to model each type of emotional speech. Large-scale dataset analysis of this kind has improved the performance of information services both quantitatively and qualitatively.

That is, along with the need to improve the voice quality of general-purpose speech synthesis systems, various studies have been conducted to render emotional spoken expression in a natural and effective manner. Some of these are described in the following references, which are incorporated herein by reference (Huang et al., 2001; Jurafsky and Martin, 2000; Tatham and Morton, 2005; Cowie et al., 2001; Gobl and Ní Chasaide, 2003; Johnstone and Scherer, 1999; Tatham and Morton, 2004). However, the average recognition rate of these techniques is only 27.1% for neutral text synthesized with emotional effects. These studies used a single database corresponding to the target emotion and used voice quality and prosody parameters as emotion-specific selection criteria. More recently, research has also been actively pursued on producing more natural synthesis results by additionally expressing the intonation information that accompanies changes in emotion. Among intonation information, pitch variation in particular is known to have the greatest influence on the expression of emotional states. However, pitch variation is strongly affected by the speaker's speech characteristics and habits, and it does not appear clearly in ordinary utterances. Because of these problems, research has focused mainly on sentence-level processing rather than on the pitch variation of individual sentence elements.

However, this approach is not suitable for interactions grounded in personal experience, such as emotional speech synthesis. Extensive research has shown that each speaker's way of expressing emotion is based on his or her own personal experience, and that large datasets differ considerably from such individual experience. Experimentally, individual speakers express emotions in a variety of ways shaped by personal experience, and these personalized, relative differences are easily overlooked when large datasets are managed in aggregate.

That is, conventional emotional speech synthesis techniques have the following limitations.

First, because emotions are very subtle and personal mental states, they cannot be categorized uniformly. Nevertheless, the prior art adopts the concept of "basic" emotions and modifies or blends these basic emotions to represent other emotions. There is no consensus on what constitutes a basic emotion, and different studies adopt different numbers of basic emotions, so this approach is not adequate.

The second problem with the prior art is that a wide variety of emotional speech synthesis techniques exist for modifying prosody, voice quality, and the information in the original text (see Bänziger and Scherer, 2005; Cowie et al., 2001; Mozziconacci, 2002; Oudeyer, 2003; Scherer, 1986; Schröder, 2001; Tatham and Morton, 2004; Murray and Arnott, 2008). According to the prior art, however, it is very difficult to establish clear guidelines on voice quality and textual information for distinguishing each emotion. In addition, the emotional prosody structure of pitch and speech rate depends more heavily on the individual speaker (that is, on personal information) than that of intensity and pause length. Such personal information makes it possible to model the relative differences in each emotional prosody structure (that is, a personal prosody model), which was not possible with conventional large-scale dataset analysis.
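By way of illustration only, the notion of modeling relative differences per speaker can be sketched as a ratio between a speaker's emotional measurements and the same speaker's neutral measurements. The function name `relative_prosody`, the feature names, and the use of a simple ratio of means are assumptions made for this sketch; the specification does not state that the personal prosody model is computed this way.

```python
from statistics import mean


def relative_prosody(neutral, emotional):
    """Return emotional/neutral ratios of the mean value of each shared feature.

    Both arguments map a feature name (hypothetical, e.g. "pitch_hz") to a
    list of measurements taken from ONE speaker. Expressing the emotional
    values relative to that speaker's own neutral speech is what makes the
    result personal rather than an average over many speakers.
    """
    ratios = {}
    for feature in sorted(neutral.keys() & emotional.keys()):
        base = mean(neutral[feature])
        if base == 0:
            raise ValueError(f"neutral mean of {feature!r} is zero")
        ratios[feature] = mean(emotional[feature]) / base
    return ratios


# Invented measurements for a single speaker, for illustration only.
neutral = {"pitch_hz": [200.0, 220.0], "rate_syll_per_s": [4.0, 6.0]}
angry = {"pitch_hz": [250.0, 254.0], "rate_syll_per_s": [6.0, 6.5]}
ratios = relative_prosody(neutral, angry)
```

Storing ratios rather than absolute values is one possible design choice: two speakers with very different neutral pitch can then still be compared by how far each one departs from his or her own baseline.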

Therefore, there is a pressing need for a technique that models the emotional prosody structure of basic emotions such as anger, fear, happiness, and sadness in consideration of the individuality of speakers and the relative differences among them, and that synthesizes emotional speech using the resulting emotional prosody structures.

There is also a pressing need for a Korean emotional speech synthesis system that can add emotional information to spoken expressions based on a personal prosody model.
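As a rough sketch of how the steps summarized in the abstract might fit together (look up a speaker's stored prosody structure for a target emotion, then modify neutral speech accordingly), consider the following. Everything here is hypothetical: the `Segment` and `ProsodyProfile` types, the in-memory `PROSODY_DB`, the scaling rules, and all numeric values are assumptions for illustration and are not taken from the specification.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    """One unit of neutral (emotionless) synthesized speech; hypothetical."""
    f0_hz: float         # fundamental frequency
    duration_s: float    # spoken duration
    intensity_db: float  # loudness
    pause_s: float       # silence following the segment


@dataclass(frozen=True)
class ProsodyProfile:
    """Per-speaker, per-emotion prosody, relative to that speaker's neutral speech."""
    pitch_scale: float = 1.0
    rate_scale: float = 1.0  # >1 means faster speech, so shorter durations
    intensity_shift_db: float = 0.0
    pause_scale: float = 1.0


# Stand-in for the emotional speech database: (speaker, emotion) -> profile.
# The numbers are invented for illustration only.
PROSODY_DB = {
    ("speaker_a", "anger"): ProsodyProfile(1.2, 1.25, 6.0, 0.8),
    ("speaker_a", "sadness"): ProsodyProfile(0.9, 0.8, -3.0, 1.5),
    ("speaker_b", "anger"): ProsodyProfile(1.05, 1.1, 9.0, 0.9),
}


def lookup_profile(speaker, emotion):
    """Retrieve the stored profile for this speaker and target emotion."""
    try:
        return PROSODY_DB[(speaker, emotion)]
    except KeyError:
        raise KeyError(f"no prosody profile for {speaker!r}/{emotion!r}") from None


def apply_profile(segments, profile):
    """Return new segments with the profile applied; the input is left unchanged."""
    return [
        replace(
            seg,
            f0_hz=seg.f0_hz * profile.pitch_scale,
            duration_s=seg.duration_s / profile.rate_scale,
            intensity_db=seg.intensity_db + profile.intensity_shift_db,
            pause_s=seg.pause_s * profile.pause_scale,
        )
        for seg in segments
    ]


def synthesize_emotional(neutral_segments, speaker, emotion):
    """Modify neutral speech using the speaker's own profile for the emotion."""
    return apply_profile(neutral_segments, lookup_profile(speaker, emotion))


neutral = [Segment(f0_hz=200.0, duration_s=0.5, intensity_db=60.0, pause_s=0.2)]
angry = synthesize_emotional(neutral, "speaker_a", "anger")
```

In this sketch the same neutral input yields different output for `speaker_a` and `speaker_b`, which is the point of keying the database by speaker as well as by emotion. A real system would operate on audio or acoustic parameters rather than on four numbers per segment.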

E. Abadjieva, I.R. Murray, and J.L. Arnott, "Applying analysis of human emotional speech to enhance synthetic speech," Third European Conference on Speech Communication and Technology, 1993.
J. Allen, M.S. Hunnicutt, D.H. Klatt, R.C. Armstrong, and D.B. Pisoni, "From text to speech: the MITalk system," Cambridge Studies in Speech Science and Communication, 1987, p. 216.
T. Bänziger and K.R. Scherer, "The role of intonation in emotional expressions," Speech Communication, vol. 46, 2005, pp. 252-267.
M.E. Beckman and J. Hirschberg, The ToBI annotation conventions, Ohio State University, 1994.
A. Black, P. Taylor, and R. Caley, The Festival speech synthesis system, University of Edinburgh, 1999.
A.F. Bobick, S.S. Intille, J.W. Davis, F. Baird, C.S. Pinhanez, L.W. Campbell, Y.A. Ivanov, A. Schütte, and A. Wilson, "The KidsRoom: A perceptually-based interactive and immersive story environment," Presence, vol. 8, 1999, pp. 369-393.
P. Boersma and D. Weenink, "Praat, a system for doing phonetics by computer," Glot International, vol. 5, 2001, pp. 341-345.
P. Boersma and D. Weenink, Praat: doing phonetics by computer, 2005, Computer program (http://www.praat.org).
C. Breazeal, "Emotion and sociable humanoid robots," International Journal of Human-Computer Studies, vol. 59, 2003, pp. 119-155.
M. Bulut, C. Busso, S. Yildirim, A. Kazemzadeh, C.M. Lee, S. Lee, and S. Narayanan, "Investigating the role of phoneme-level modifications in emotional speech resynthesis," Ninth European Conference on Speech Communication and Technology, 2005.
F. Burkhardt, "Simulation emotionaler Sprechweise mit Sprachsyntheseverfahren [Simulation of emotional speech by means of speech synthesis]," Doctoral dissertation, Technische Universität Berlin, Berlin, Germany, 2001.
F. Burkhardt and W.F. Sendlmeier, "Verification of acoustical correlates of emotional speech using formant-synthesis," ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.
J.E. Cahn, "Generating expression in synthesized speech," Master's thesis, Massachusetts Institute of Technology, 1989.
J.E. Cahn, "The generation of affect in synthesized speech," Journal of the American Voice I/O Society, vol. 8, 1990, pp. 1-1.
N. Campbell and J. Venditti, "J-ToBI: An intonation labelling system for Japanese," Proceedings of the Autumn Meeting of the Acoustical Society of Japan, 1995, pp. 317-318.
A. Chen, R.R. Muntz, S. Yuen, I. Locher, S.I. Sung, and M.B. Srivastava, "A support infrastructure for the smart kindergarten," IEEE Pervasive Computing, vol. 1, 2002, pp. 49-57.
J.W. Chung and J.C. Park, "Automated Classification of Sentential Types in Korean with Morphological Analysis," Language and Information, vol. 13, 2009, pp. 59-97.
R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, 2001, pp. 32-80.
R. Donovan, "Trainable speech synthesis," Doctoral dissertation, Cambridge University, 1996.
E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach, "Emotional speech: Towards a new generation of databases," Speech Communication, vol. 40, 2003, pp. 33-60.
M. Edgington, "Investigating the limitations of concatenative synthesis," Fifth European Conference on Speech Communication and Technology, 1997.
C. Gobl and A. Ní Chasaide, "The role of voice quality in communicating emotion, mood and attitude," Speech Communication, vol. 40, 2003, pp. 189-212.
L. Gong, "Is happy better than sad even if they are both non-adaptive? Effects of emotional expressions of talking-head interface agents," International Journal of Human-Computer Studies, vol. 65, 2007, pp. 183-191.
A. Gravano, S. Benus, H. Chavez, J. Hirschberg, and L. Wilcox, "On the role of context and prosody in the interpretation of 'okay'," Proceedings of ACL, 2007, pp. 800-807.
S.J. Haberman, "The analysis of residuals in cross-classified tables," Biometrics, vol. 29, 1973, pp. 205-220.
M. Heerink, B. Kröse, B. Wielinga, and V. Evers, "Enjoyment intention to use and actual use of a conversational robot by elderly people," Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, 2008, pp. 113-120.
B. Heuft, T. Portele, and M. Rauth, "Emotions in time domain synthesis," Proceedings of the 4th International Conference on Spoken Language Processing, 1996.
X. Huang, A. Acero, and H.W. Hon, Spoken language processing: A guide to theory, algorithm, and system development, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2001.
E. Hudlicka, "To feel or not to feel: The role of affect in human-computer interaction," International Journal of Human-Computer Studies, vol. 59, 2003, pp. 1-32.
A. Iida, N. Campbell, S. Iga, F. Higuchi, and M. Yasumura, "A speech synthesis system with emotion for assisting communication," ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000, pp. 167-172.
T. Johnstone and K.R. Scherer, "The effects of emotions on voice quality," Proceedings of the XIV Int. Congress of Phonetic Sciences, 1999, pp. 2029-2032.
S.A. Jun, "K-ToBI (Korean ToBI) Labelling Conventions," Speech Science, vol. 7, 2000, pp. 143-169.
D. Jurafsky, J.H. Martin, and A. Kehler, Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, MIT Press, 2000.
J.W. Kang, H.S. Hong, B.S. Kim, and M.J. Chung, "Assistive Mobile Robot Systems helping the Disabled Workers in a Factory Environment," International Journal of Assistive Robotics and Mechatronics, vol. 9, 2008, pp. 42-52.
Y. Kitahara and Y. Tohkura, "Prosodic control to express emotions for man-machine speech interaction," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 75, 1992, pp. 155-163.
H.-J. Lee and J.C. Park, "Lexical Disambiguation for Intonation Synthesis: A CCG Approach," Proceedings of the Korean Society for Language and Information (KSLI), 2005a, pp. 103-118.
H.-J. Lee and J.C. Park, "Vowel Sound Disambiguation for Proper Intonation Synthesis," Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation, 2005b, pp. 131-142.
H.-J. Lee and J.C. Park, "Intonation Synthesis using Emotional Information from Spoken Fairy Tale," Proceedings of the 17th Korean Association of Speech Science (KASS), 2005c, pp. 88-97.
H.-J. Lee and J.C. Park, "Customized Message Generation and Speech Synthesis in Response to Characteristic Behavioral Patterns of Children," Human-Computer Interaction. HCI Intelligent Multimodal Interaction Environments, 2007a, pp. 114-123.
H.-J. Lee and J.C. Park, "Characteristics of Spoken Discourse Markers and their Application to Speech Synthesis Systems," Proceedings of the 19th Annual Conference on Human and Cognitive Language Technology, 2007b, pp. 254-260.
H.-J. Lee and J.C. Park, "Analysis and Use of Intonation Features for Emotional States," Proceedings of the 20th Annual Conference on Human and Cognitive Language Technology, 2008, pp. 144-149.
H.-J. Lee and J.C. Park, "Interpretation of User Evaluation for Emotional Speech Synthesis System," Proceedings of the 13th International Conference on Human-Computer Interaction. Part I: New Trends, 2009, pp. 295-303.
J. Ma, L.T. Yang, B.O. Apduhan, R. Huang, L. Barolli, and M. Takizawa, "Towards a smart world and ubiquitous intelligence: a walkthrough from smart things to smart hyperspaces and UbicKids," International Journal of Pervasive Computing and Communications, vol. 1, 2005, p. 53.
T. Marumoto and N. Campbell, "Control of speaking types for emotion in a speech re-sequencing system," Proc. of the Acoustic Society of Japan, Spring meeting 2000, pp. 213-214.
H.-J. Min, D. Park, E. Chang, H.-J. Lee, and J.C. Park, "u-SPACE: Ubiquitous Smart Parenting and Customized Education," Proceedings of the 15th Human Computer Interaction, 2006, pp. 94-102.
J.M. Montero, J.M. Gutierrez-Arriola, S. Palazuelos, E. Enriquez, S. Aguilera, and J.M. Pardo, "Emotional speech synthesis: From speech database to TTS," Fifth International Conference on Spoken Language Processing, 1998.
J.M. Montero, J. Gutiérrez-Arriola, J. Colás, E. Enríquez, and J.M. Pardo, "Analysis and modelling of emotional speech in Spanish," Proc. of ICPhS, 1999, pp. 957-960.
K.E. Moyer and B. von Haller Gilmer, "The Concept of Attention Spans in Children," The Elementary School Journal, 1954, pp. 464-466.
S. Mozziconacci, "Prosody and emotions," Speech Prosody 2002, International Conference, 2002.
I.R. Murray, Simulating emotion in synthetic speech, University of Dundee, 1989.
I.R. Murray and J.L. Arnott, "Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion," Journal of the Acoustical Society of America, vol. 93, 1993, pp. 1097-1097.
I.R. Murray and J.L. Arnott, "Implementation and testing of a system for producing emotion-by-rule in synthetic speech," Speech Communication, vol. 16, 1995, pp. 369-390.
I.R. Murray and J.L. Arnott, "Applying an analysis of acted vocal emotions to improve the simulation of synthetic speech," Computer Speech & Language, vol. 22, 2008, pp. 107-129.
I.R. Murray, M.D. Edgington, D. Campion, and J. Lynn, "Rule-based emotion synthesis using concatenated speech," ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.
A. Ortony and T.J. Turner, "What's basic about basic emotions?," Psychological Review, vol. 97, 1990, pp. 315-331.
P.Y. Oudeyer, "The production and recognition of emotions in speech: features and algorithms," International Journal of Human-Computer Studies, vol. 59, 2003, pp. 157-183.
S.W. Park, Y.J. Heo, S.W. Lee, and J.H. Park, "Non-Fatal Injuries among Preschool Children in Daegu and Kyungpook," Korean Journal of Preventive Medicine and Public Health, vol. 37, 2004, pp. 274-281.
J. Pierrehumbert, "The perception of fundamental frequency declination," Journal of the Acoustical Society of America, vol. 66, 1979, pp. 363-369.
J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun, "Towards robotic assistants in nursing homes: Challenges and results," Robotics and Autonomous Systems, vol. 42, 2003, pp. 271-281.
J.F. Pitrelli, M.E. Beckman, and J. Hirschberg, "Evaluation of prosodic transcription labeling reliability in the ToBI framework," Third International Conference on Spoken Language Processing, 1994.
E. Rank and H. Pirker, "Generating emotional speech with a concatenative synthesizer," Fifth International Conference on Spoken Language Processing, 1998.
P. Roach, Introducing phonetics, Penguin Books, 1992.
N. Roy, G. Baltus, D. Fox, F. Gemperle, J. Goetz, T. Hirsch, D. Margaritis, M. Montemerlo, J. Pineau, J. Schulte, and others, "Towards personal service robots for the elderly," Workshop on Interactive Robots and Entertainment (WIRE 2000), 2000, p. 184.
C. Sas and A. Dix, "Designing for reflection on experience," Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems, 2009, pp. 4741-4744.
K.R. Scherer, "Vocal affect expression: A review and a model for future research," Psychological Bulletin, vol. 99, 1986, pp. 143-165.
M. Schröder, "Emotional Speech Synthesis: A Review," Seventh European Conference on Speech Communication and Technology, 2001.
K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, "ToBI: A standard for labeling English prosody," Second International Conference on Spoken Language Processing, 1992, pp. 867-870.
J. Stallo, "Simulating emotional speech for a talking head," Honours Thesis, Curtin University of Technology, Perth, Australia, 2000.
R. Stibbard, "Vocal expressions of emotions in non-laboratory speech: An investigation of the Reading/Leeds Emotion in Speech Project annotation data," 2001.
O. Türk and M. Schröder, "A comparison of voice conversion methods for transforming voice quality in emotional speech synthesis," Proceedings of Interspeech, 2008, pp. 2282-2285.
M. Tatham and K. Morton, Expression in speech: analysis and synthesis, Oxford University Press, 2004.
M. Tatham and K. Morton, Developments in speech synthesis, Wiley & Sons Ltd., 2005.
J. Vaissiere, "Language-independent prosodic features," Prosody: Models and measurements, 1983, pp. 53-66.
J.J. Venditti, "Japanese ToBI labelling guidelines," Manuscript, Ohio State University, 1995.
J. Vroomen, R. Collier, and S. Mozziconacci, "Duration and intonation in emotional speech," Third European Conference on Speech Communication and Technology, 1993.

An object of the present invention is to provide a personalized emotional speech synthesis technique by analyzing the emotional prosody structure of four basic emotions: anger, fear, joy, and sadness.

A study of the emotional prosody structure of the basic emotions that took into account the individuality of speakers and the relative differences among them showed that pitch and speech rate depend more on the individual speaker (that is, on personal information) than intensity and pause length do. Another object of the present invention is therefore to provide a speech synthesis technique that uses this personal information to model each individual's emotional prosody structure.

Still another object of the present invention is to provide a Korean emotional speech synthesis system that can add emotion information to an utterance based on a personal prosody model.

One aspect of the present invention for achieving the above objects relates to a method for synthesizing emotional speech based on a personal prosody model. The emotional speech synthesis method according to the present invention includes: an emotional prosody structure characteristic extraction step of analyzing each individual's speech to extract the characteristics of the personal emotional prosody structure; a step of storing the extracted personal emotional prosody structure in an emotional speech database; a receiving step of receiving input text and a target emotion; a step of retrieving, from the emotional speech database, the personal emotional prosody structure corresponding to the speaker whose speech is to be synthesized; and an emotional speech synthesis step of converting the input text into emotionless speech and modifying the converted emotionless speech based on the personal emotional prosody structure corresponding to the target emotion, thereby generating and outputting emotional speech corresponding to the speaker.

In particular, the emotional prosody structure characteristic extraction step includes a general emotional prosody structure extraction step of extracting a general emotional prosody structure from a dataset containing speech information for the speakers' basic emotions, and a personal emotional prosody structure extraction step of comparing each speaker's personal emotional prosody structure with the general emotional prosody structure and parameterizing the relative difference of the personal emotional prosody structure. Furthermore, the general emotional prosody structure extraction step includes extracting, for each speaker, speech characteristics corresponding to the emotional states of anger, fear, happiness, sadness, and neutral. The personal emotional prosody structure extraction step includes: a sentence-level emotional prosody structure analysis step of analyzing, at the sentence level, the overall pitch, intensity, and speech rate of the speaker's speech for each emotion as parameters; an intonation-phrase-level emotional prosody structure analysis step of analyzing, at the intonation phrase (IP) level, the pause length between the intonation phrases contained in the speaker's speech for each emotion as a parameter; and a syllable-level emotional prosody structure analysis step of analyzing, at the syllable level, the IP boundary pattern of each intonation phrase contained in the speaker's speech for each emotion as a parameter.

Furthermore, the sentence-level emotional prosody structure analysis step includes: computing the interquartile mean (IQM) of the pitch values of the speech for each emotion; selecting, from the intensities of the speech for each emotion, those at or above a predetermined value; computing the speech rate from the total utterance length of the speech for each emotion; normalizing the pitch values, intensity, and speech rate to remove sentence-to-sentence and speaker-to-speaker disparity; and, using the normalized results, computing the differences in the personal emotional prosody parameters for each emotion with the personal emotional prosody structure of the neutral emotional state as the reference, and constructing the personal emotional prosody structure from the computed results. The IP-level emotional prosody structure analysis step includes detecting the pause regions between intonation phrases in the speech for each emotion, and computing the total pause length by summing the lengths of the pause regions. The syllable-level emotional prosody structure analysis step includes analyzing the IP boundary pattern of the speech as a pitch contour corresponding to one of L%, H%, LH%, HL%, LHL%, HLH%, HLHL%, LHLH%, and LHLHL%.

Meanwhile, the emotional speech synthesis step includes: converting the input text into emotionless speech using a text-to-speech (TTS) system; retrieving, from the emotional speech database, the emotional prosody structure corresponding to the speaker's target emotion; and a speech modification step of generating emotional speech by modifying the emotionless speech using the parameters of the retrieved emotional prosody structure. The speech modification step includes extracting, as a parameter, the pitch contour corresponding to the target emotion from the personal emotional prosody structure, and a syllable-level modification step of modifying the pitch contour of the emotionless speech using the extracted pitch contour. The speech modification step also includes extracting, as a parameter, the pause length corresponding to the target emotion from the personal emotional prosody structure, and an IP-level modification step of modifying the pause length of the emotionless speech using the extracted pause length. The speech modification step further includes extracting, as parameters, the overall pitch, overall intensity, and speech rate corresponding to the target emotion from the personal emotional prosody structure, and a sentence-level modification step of modifying the overall pitch, overall intensity, and speech rate of the emotionless speech using the extracted overall pitch, overall intensity, and speech rate.
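The sentence-level analysis can be illustrated with a minimal Python sketch. The function names, the simplified IQM (it drops whole samples from each tail instead of weighting fractional ones), the syllables-per-second definition of speech rate, and the relative-difference formula are all assumptions made for illustration; the specification does not fix these details.

```python
from statistics import mean


def interquartile_mean(values):
    """Mean of the middle half of the sorted values (simplified IQM)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("at least one pitch value is required")
    cut = len(ordered) // 4  # number of samples dropped from each tail
    return mean(ordered[cut:len(ordered) - cut])


def mean_intensity_above(intensities_db, floor_db):
    """Mean of the intensity samples at or above a predetermined floor."""
    selected = [v for v in intensities_db if v >= floor_db]
    if not selected:
        raise ValueError("no intensity sample reaches the floor")
    return mean(selected)


def speech_rate(syllable_count, utterance_seconds):
    """Assumed definition: syllables per second over the whole utterance."""
    return syllable_count / utterance_seconds


def relative_to_neutral(emotion_value, neutral_value):
    """Relative difference of an emotion parameter, with neutral as reference."""
    return (emotion_value - neutral_value) / neutral_value
```

Under these assumed definitions, a speaker whose neutral IQM pitch is 200 Hz and whose IQM pitch for anger is 220 Hz would have a relative pitch difference of 0.1 for anger.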

Another aspect of the present invention for achieving the above objects relates to a system for synthesizing emotional speech based on a personal prosody model. The emotional speech synthesis system according to the present invention includes: an emotional prosody structure characteristic extraction unit that analyzes each individual's speech to extract the characteristics of the personal emotional prosody structure; an emotional speech database that stores the extracted personal emotional prosody structure; and an emotional speech synthesis unit that, when input text and a target emotion are received, converts the input text into emotionless speech, retrieves from the emotional speech database the personal emotional prosody structure corresponding to the speaker whose speech is to be synthesized, and modifies the converted emotionless speech based on the personal emotional prosody structure corresponding to the target emotion, thereby generating and outputting emotional speech corresponding to the speaker.

In particular, the emotional prosody structure characteristic extraction unit is adapted to perform an operation of extracting a general emotional prosody structure from a dataset containing speech information for the speakers' basic emotions, and an operation of comparing each speaker's personal emotional prosody structure with the general emotional prosody structure and parameterizing the relative difference of the personal emotional prosody structure. Furthermore, to extract the personal emotional prosody structure, the extraction unit is adapted to perform: a sentence-level emotional prosody structure analysis operation of analyzing, at the sentence level, the overall pitch, intensity, and speech rate of the speaker's speech for each emotion as parameters; an IP-level emotional prosody structure analysis operation of analyzing, at the intonation phrase (IP) level, the pause length between the intonation phrases contained in the speaker's speech for each emotion as a parameter; and a syllable-level emotional prosody structure analysis operation of analyzing, at the syllable level, the IP boundary pattern of each intonation phrase contained in the speaker's speech for each emotion as a parameter.

To analyze the sentence-level emotional prosody structure, the extraction unit is adapted to perform operations of: computing the interquartile mean (IQM) of the pitch values of the speech for each emotion; selecting, from the intensities of the speech for each emotion, those at or above a predetermined value; computing the speech rate from the total utterance length of the speech for each emotion; normalizing the pitch values, intensity, and speech rate to remove sentence-to-sentence and speaker-to-speaker disparity; and, using the normalized results, computing the differences in the personal emotional prosody parameters for each emotion with the personal emotional prosody structure of the neutral emotional state as the reference, and constructing the personal emotional prosody structure from the computed results. To analyze the IP-level emotional prosody structure, the extraction unit is adapted to perform operations of detecting the pause regions between intonation phrases in the speech for each emotion and computing the total pause length by summing the lengths of the pause regions. To analyze the syllable-level emotional prosody structure, the extraction unit is adapted to perform an operation of analyzing the IP boundary pattern of the speech as a pitch contour corresponding to one of L%, H%, LH%, HL%, LHL%, HLH%, HLHL%, LHLH%, and LHLHL%.

The emotional speech synthesis unit is adapted to perform an operation of converting the input text into emotionless speech using a TTS system, an operation of retrieving from the emotional speech database the emotional prosody structure corresponding to the speaker's target emotion, and a speech modification operation of generating emotional speech by modifying the emotionless speech using the parameters of the retrieved emotional prosody structure. To perform the speech modification operation, the synthesis unit is adapted to perform an operation of extracting, as a parameter, the pitch contour corresponding to the target emotion from the personal emotional prosody structure, and a syllable-level modification operation of modifying the pitch contour of the emotionless speech using the extracted pitch contour. The synthesis unit is also adapted to perform an operation of extracting, as a parameter, the pause length corresponding to the target emotion from the personal emotional prosody structure, and an IP-level modification operation of modifying the pause length of the emotionless speech using the extracted pause length. The synthesis unit is further adapted to perform an operation of extracting, as parameters, the overall pitch, overall intensity, and speech rate corresponding to the target emotion from the personal emotional prosody structure, and a sentence-level modification operation of modifying the overall pitch, overall intensity, and speech rate of the emotionless speech using the extracted overall pitch, overall intensity, and speech rate.
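The syllable-level analysis operation assigns each intonation phrase one of nine boundary patterns. As a rough illustration, the Python sketch below quantizes the pitch samples at the end of an intonation phrase into low (L) and high (H) tones relative to a reference pitch and collapses repeated tones. The thresholding rule, the `reference_hz` parameter, and the function name are assumptions; the specification only lists the permitted patterns and does not state how a contour is mapped to them.

```python
# The nine IP boundary patterns named in the specification.
IP_BOUNDARY_PATTERNS = {
    "L%", "H%", "LH%", "HL%", "LHL%", "HLH%", "HLHL%", "LHLH%", "LHLHL%",
}


def boundary_pattern(pitch_hz, reference_hz):
    """Label a boundary pitch contour, or return None if no pattern fits.

    Each sample is marked H when it is at or above reference_hz (assumed
    rule), otherwise L; consecutive identical tones are merged.
    """
    if not pitch_hz:
        raise ValueError("at least one pitch sample is required")
    tones = []
    for f0 in pitch_hz:
        tone = "H" if f0 >= reference_hz else "L"
        if not tones or tones[-1] != tone:
            tones.append(tone)
    label = "".join(tones) + "%"
    return label if label in IP_BOUNDARY_PATTERNS else None
```

A contour that rises above the reference and falls back, such as `[180, 190, 230, 240, 200, 170]` with a 200 Hz reference, is labeled `LHL%`. A contour with more alternations than any listed pattern yields `None` rather than a guess.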

According to the present invention, each individual's emotional prosody structure can be modeled by analyzing the emotional prosody structure of the basic emotions, and utterances that reflect emotion can be synthesized using the modeled result.

The present invention also provides a Korean emotional speech synthesis system that can add emotion information to an utterance based on a personal prosody model.

FIG. 1 is a flowchart conceptually illustrating an emotional speech synthesis method according to one aspect of the present invention.
FIGS. 2 and 3 are graphs showing the average pitch value and the average intensity value, respectively, for each speaker.
FIG. 4 is a graph showing the average pitch value of each emotional state, taking personal information into account.
FIG. 5 is a graph showing the intensity value of each emotion, taking personal information into account.
FIG. 6 is a graph showing the speech rate for each emotion.
FIG. 7 is a graph showing the normalized pause length for each emotion.
FIG. 8 is a block diagram conceptually illustrating an emotional speech synthesis system according to another aspect of the present invention.
FIG. 9 is a graph showing the original prosody structure and the emotion synthesis result.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in them.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described. Parts unrelated to the description are omitted for clarity, and like reference numerals in the drawings denote like members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "...er", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 is a flowchart conceptually illustrating an emotional speech synthesis method according to one aspect of the present invention.

Emotion is a psychological state grounded in personal experience. As interaction between people and machines in its various forms has increased rapidly in recent years, emotion has come to influence, directly and indirectly, many fields built on such interaction. The prior art has categorized the kinds of emotion and sought universality in the process of recognizing and expressing the emotions of each category. Such approaches have of course achieved a degree of success in information acquisition, information processing, and information presentation in the information technology field, and have also shown somewhat positive results in emotion-based interaction.

However, when prosodic structures highly correlated with the emotions of joy, sadness, and anger were analyzed and an emotional speech synthesis system was developed from the analysis results, repeated experiments confirmed that the recognition results for some emotions were very poor. At first this problem was attributed to a limitation of emotion synthesis based on prosodic structure. That explanation, however, could not account for why, in recognition tests that combined synthesized emotional speech with emotional speech actually uttered by people, the actually uttered emotional speech also showed poor recognition results.

Because emotion is a psychological state grounded in personal experience, emotions are also recognized and expressed on the basis of personal experience. If universality is emphasized without adequately reflecting individual characteristics, individual and relative information is lost during analysis and expression. The emotional speech synthesis method according to the present invention therefore treats emotion as individual and relative information and addresses how to express that information in an emotional speech synthesis system.

To this end, prosodic information for the emotions of joy, sadness, anger, and fear is analyzed in the form of a user model. Each speaker's characteristic way of expressing emotion is examined in terms of changes in the pitch contour, average pitch, average intensity, average utterance length, and average pause length, and the effect of the user model on the output of the emotional speech synthesis system is examined. In addition, to analyze the relative characteristics of emotional speech expression, when the synthesis results produced with the user model are evaluated, the test subjects are given an adaptation period long enough to perceive the relative characteristics of the synthesized emotional speech, and the change in recognition results is observed. In recognition tests of synthesis results with the user model applied, the recognition rate improved considerably over earlier results, and in recognition tests that took the relative characteristics of emotional speech expression into account, an almost fully accurate recognition rate was observed.

Hereinafter, the present invention is described in detail with reference to FIG. 1.

Referring to FIG. 1, each individual's speech for several basic emotions is first analyzed to extract the characteristics of the personal emotional prosody structure (S110). To extract these characteristics, a general emotional prosody structure is first extracted from a dataset containing speech information for the speakers' basic emotions. The speech is then analyzed separately at the syllable, intonation phrase, and sentence levels to generate the personal emotional prosody structure. In this case, the overall pitch, intensity, and speech rate of the speaker's speech for each emotion are analyzed at the sentence level; the pause length between the intonation phrases (IPs) contained in the speech for each emotion is analyzed at the IP level; and the IP boundary pattern of each intonation phrase contained in the speech for each emotion is analyzed at the syllable level. The sentence-, IP-, and syllable-level parameters are described in detail in the corresponding parts of the specification. In addition, each speaker's personal emotional prosody structure is compared with the general emotional prosody structure, and the relative difference of the personal structure with respect to the general structure is determined as the parameter of that personal emotional prosody structure.
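The IP-level part of step S110, detecting pause regions and summing their lengths, might look like the following Python sketch. It assumes the speech has already been reduced to a sequence of per-frame intensity values and treats every silent run that lies between two non-silent stretches as a pause between intonation phrases. The silence threshold, the frame length, and the function names are hypothetical, and real IP segmentation is outside the scope of the sketch.

```python
def pause_regions(frame_levels_db, silence_db, frame_ms):
    """Return (start_ms, end_ms) for each silent run between spoken stretches.

    Leading and trailing silence is ignored because it does not lie
    between two intonation phrases.
    """
    regions = []
    start = None
    for index, level in enumerate(frame_levels_db):
        if level < silence_db:
            if start is None:
                start = index  # a silent run begins here
        else:
            # Keep the run only if speech preceded it (start > 0).
            if start is not None and start > 0:
                regions.append((start * frame_ms, index * frame_ms))
            start = None
    return regions


def total_pause_length(regions):
    """Sum of the lengths of all detected pause regions, in milliseconds."""
    return sum(end - start for start, end in regions)
```

Working in integer milliseconds keeps the summed pause length exact, which makes it easier to compare totals across emotions for the same speaker.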

The determined personal emotional prosody structure is then stored in the emotional speech database (S120).
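One way to picture the stored data is a record per speaker and emotion that holds the sentence-, IP-, and syllable-level parameters. The Python sketch below is an in-memory stand-in; the class names, field names, and the choice of a (speaker, emotion) key are assumptions for illustration, since the specification does not describe the database layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalProsody:
    """Hypothetical record of one speaker's parameters for one emotion."""
    pitch_diff: float       # sentence level, relative to neutral
    intensity_diff: float   # sentence level, relative to neutral
    rate_diff: float        # sentence level, relative to neutral
    pause_diff: float       # IP level, relative to neutral
    boundary_pattern: str   # syllable level, for example "LH%"


class EmotionalSpeechDatabase:
    """In-memory stand-in for the emotional speech database."""

    def __init__(self):
        self._entries = {}

    def store(self, speaker, emotion, prosody):
        """Save the structure for a speaker and emotion (as in S120)."""
        self._entries[(speaker, emotion)] = prosody

    def lookup(self, speaker, emotion):
        """Return the stored structure; raises KeyError if none exists."""
        return self._entries[(speaker, emotion)]
```

Keying on both speaker and emotion reflects the point that the same target emotion maps to different parameters for different speakers.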

Once all personal emotional prosody structures have been analyzed and stored in this way, input text and a target emotion are received from outside. If the input text and the target emotion have not been received, the method waits until they are (S130). In this specification, the target emotion means the emotion information to be added to the signal obtained by converting the input text into speech.

When the input text and the target emotion are received, the input text is converted into emotionless speech using a TTS system (S140). In this specification, emotionless speech means the result of converting the input text uniformly using the same conversion algorithm. Emotionless speech is thus distinguished from emotional speech, which is speech to which emotion information has been added.

Next, the personal emotional prosody structure corresponding to the target emotion of the speaker whose speech is to be synthesized is retrieved from the emotional speech database (S150). Emotional speech is then generated by modifying the emotionless speech at the syllable, IP, and sentence levels using the parameters of the retrieved personal emotional prosody structure (S160). In particular, to generate emotional speech from the emotionless speech, the pitch contour corresponding to the target emotion is extracted from the personal emotional prosody structure, and the pitch contour of the emotionless speech is modified using the extracted pitch contour.
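The syllable-level modification can be sketched in Python under two loud assumptions: that the extracted pitch contour is stored as a boundary pattern label such as `LH%`, and that modifying the emotionless speech means overwriting its final pitch frames with a contour interpolated linearly between low and high targets. Neither the interpolation nor the function names come from the specification.

```python
def contour_from_pattern(pattern, low_hz, high_hz, frames_per_move):
    """Build a pitch contour that follows a pattern such as 'LHL%'.

    Tones are joined by linear interpolation (assumed); a single-tone
    pattern gives a flat contour of frames_per_move frames.
    """
    tones = pattern.rstrip("%")
    if not tones or set(tones) - {"L", "H"}:
        raise ValueError(f"unsupported boundary pattern: {pattern!r}")
    targets = [high_hz if tone == "H" else low_hz for tone in tones]
    if len(targets) == 1:
        return [targets[0]] * frames_per_move
    contour = []
    for current, following in zip(targets, targets[1:]):
        step = (following - current) / frames_per_move
        contour.extend(current + step * k for k in range(frames_per_move))
    contour.append(targets[-1])  # end exactly on the last tone target
    return contour


def replace_final_contour(pitch_hz, contour):
    """Overwrite the last frames of a pitch track with the given contour."""
    keep = max(len(pitch_hz) - len(contour), 0)
    return list(pitch_hz[:keep]) + list(contour)
```

For instance, `contour_from_pattern("LH%", 100.0, 200.0, 4)` rises from 100 Hz to 200 Hz in 25 Hz steps, and `replace_final_contour` keeps the track length unchanged whenever the contour is no longer than the track.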

For the intonation-phrase-level modification, the pause length corresponding to the target emotion is extracted as a parameter from the individual emotional prosody structure, and the pause length of the emotionless speech is modified using it. For the sentence-level modification, the overall pitch, overall intensity, and speech rate corresponding to the target emotion are extracted as parameters from the individual emotional prosody structure. The overall pitch, overall intensity, and speech rate of the emotionless speech may then be modified using the extracted values.

The syllable-level, intonation-phrase-level, and sentence-level modification operations performed in the emotional speech synthesis step are described in detail later in the corresponding parts of this specification.

Finally, the resulting emotional speech is played back externally (S170).

The emotional speech synthesis method according to the present invention uses individual emotional prosody structures because the recognition rate of synthesis results obtained through conventional TTS systems is not high. That is, treating emotion as universal information obtainable from the analysis of a vast dataset does not give good results. Notably, even when recordings of real human speech are used, the recognition rate is only 17.1%. It was therefore concluded that this low recognition rate stems from the limits of prosody modification. Accordingly, when developing a Korean emotional speech synthesis system, personalized information rather than universal emotion needs to be considered in finding guidelines for analyzing emotion information.

Hereinafter, the present invention is described in more detail.

Korean Emotional Speech Synthesis Using a Personal Prosody Model

In everyday conversation, we choose particular words or phrases according to their primary meaning. Depending on the context and on how they are pronounced, however, that primary meaning can shift. For example, if a word is uttered in an angry emotional state, it will convey a negative sense. Conversely, if the word is uttered in a happy emotional state, it will be interpreted as positive, except in cases of irony.

The emotional speech synthesis method according to the present invention regards the emotional prosody structure as the relative prosodic difference between different emotions. To analyze the emotional prosody structure, the present invention uses the Korean emotional speech corpus distributed by the Speech Information Technology and Industry Promotion Center of the Republic of Korea. This should not, however, be understood as limiting the invention. By way of example, the present invention uses ten emotionally neutral sentences and selects five basic emotions: anger, fear, happiness, sadness, and neutral.

Sentence-Level Prosody Structure Analysis

To determine the sentence-level prosody structure, the present invention analyzes the overall pitch, intensity, and speech rate values for each emotional state. First, the interquartile mean (IQM) of the extracted pitch values is calculated. Then, intensity values of 50 dB or more are selected, 50 dB being known as the minimum level of the audible range.
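The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the procedure used in the specification: the function names are hypothetical, and the IQM here simply discards the lowest and highest quarter of the sorted samples (rounded down), which is one common simplification.

```python
def interquartile_mean(values):
    """Mean of the samples between the first and third quartiles.

    Simplified variant (assumption): drops floor(n/4) samples from each
    end of the sorted list instead of weighting partial samples.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    cut = n // 4                      # samples discarded at each end
    middle = ordered[cut:n - cut]
    return sum(middle) / len(middle)


def audible_intensity(values_db, floor_db=50.0):
    """Keep only intensity samples at or above the 50 dB floor."""
    return [v for v in values_db if v >= floor_db]


# Toy pitch track in Hz (illustrative values, not corpus data).
pitch = [1, 2, 3, 4, 5, 6, 7, 100]
print(interquartile_mean(pitch))      # the outlier 100 is discarded
print(audible_intensity([42.0, 50.0, 63.5]))
```

Because the IQM ignores the extremes, a few octave errors or creaky-voice samples in the pitch track do not shift the sentence-level pitch value.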

To determine the effect of each speaker's speaking style (speaker-specific information) on the analysis of the sentence-level prosody structure, an analysis of variance (ANOVA) test is performed on the pitch and intensity values of speech uttered in the neutral emotional state.

FIG. 2 shows the mean pitch value for each speaker, with F = 364.924 and p = 0.000 < 0.05.

FIG. 3 shows the mean intensity value for each speaker, with F = 16.338 and p = 0.000 < 0.05.
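For readers unfamiliar with the F statistic reported for FIGS. 2 and 3, a minimal sketch of the one-way ANOVA computation is given here. It assumes one group of measurements per speaker; the function name and the toy data are hypothetical, and the p-value (which requires the F distribution) is not computed.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of groups (lists of floats)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Toy example: three "speakers" with three neutral-speech pitch values each.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F value means that the differences between speakers are large compared with the variation within each speaker, which is the situation the two figures report.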

From the ANOVA test results shown in FIGS. 2 and 3, all null hypotheses are rejected, and a strong correlation between sex and pitch value is confirmed. In addition, after each value of the emotional pitch and intensity structures is extracted, a normalization process is performed to correct the inherent sentence-dependent and speaker-dependent disparities. In the present invention, neutral is regarded as the reference state, and the difference between this reference and each emotional prosody structure extracted from the same sentence uttered by the same speaker is calculated.
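The normalization described above amounts to subtracting the neutral value of the same speaker and sentence from each emotional value. A minimal sketch follows; the dictionary layout, the function name, and the numbers are assumptions made for illustration.

```python
def normalize_to_neutral(measurements):
    """Return emotion-minus-neutral differences.

    measurements maps (speaker, sentence_id, emotion) to a value such as
    IQM pitch in Hz. The neutral entry of the same speaker and sentence
    serves as the reference.
    """
    diffs = {}
    for (speaker, sentence, emotion), value in measurements.items():
        if emotion == "neutral":
            continue
        baseline = measurements[(speaker, sentence, "neutral")]
        diffs[(speaker, sentence, emotion)] = value - baseline
    return diffs


# Illustrative values only (not taken from the corpus).
data = {
    ("KKS", 1, "neutral"): 120.0,
    ("KKS", 1, "anger"): 150.0,
    ("KKS", 1, "sadness"): 110.0,
}
print(normalize_to_neutral(data))
```

Working with differences rather than raw values removes the part of the pitch or intensity that is due to the sentence itself or to the speaker's habitual voice.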

Pitch Analysis

To examine the properties of the speaker-specific information, 240 sentences were classified according to the six speakers, and a series of ANOVA tests was performed focusing on the differences in pitch values between emotions. The test results are shown in FIG. 4.

FIG. 4 is a graph showing the mean pitch value of each emotional state with speaker-specific information taken into account.

Referring to FIG. 4, the mean pitch value of each emotional state is highly sensitive to the speaker, so a generalized pitch value is not suitable for determining a representative value for each emotion.

Intensity Analysis

A one-way ANOVA test is performed on the 240 sentences to find generalized intensity values across the different emotions. From the result of the ANOVA test shown in FIG. 4, the null hypothesis was clearly rejected.

To identify the role of speaker-specific information in the emotional intensity structure, the 240 sentences were subdivided by speaker into six sets of 40 sentences. A series of ANOVA tests was performed, and the results are shown in FIG. 5.

FIG. 5 is a graph showing the intensity value of each emotion with speaker-specific information taken into account.

Referring to FIG. 5, except for two special cases, KKS and PYH, the ordering of intensity values is well preserved as sadness, fear, happiness, and anger. This shows that each speaker has his or her own characteristic intensity value for each emotion. The same holds even when the normalization process has been performed beforehand.
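Whether a given speaker preserves such an ordering can be checked mechanically. The sketch below assumes that the stated order runs from lowest to highest intensity, which the text does not say explicitly; the function name and the numbers are hypothetical.

```python
def follows_order(values, expected_order):
    """True if values[e] increases strictly along expected_order."""
    seq = [values[e] for e in expected_order]
    return all(a < b for a, b in zip(seq, seq[1:]))


ORDER = ["sadness", "fear", "happiness", "anger"]

# Illustrative per-speaker intensities in dB (not the values of FIG. 5).
typical = {"sadness": 61.0, "fear": 63.0, "happiness": 66.0, "anger": 70.0}
atypical = {"sadness": 64.0, "fear": 62.0, "happiness": 66.0, "anger": 70.0}

print(follows_order(typical, ORDER))
print(follows_order(atypical, ORDER))
```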

Speech Rate Analysis

To analyze the relative speech rate for each emotional state, the speech regions of 300 uttered sentences are annotated. After each sentence has been annotated, the total length of the speech regions is measured for each emotional state of the speech uttered by the same speaker. The lengths of the speech regions are then normalized to compute the ratio of each emotion to neutral. Table 1 shows the normalized speech rate for each emotion, and FIG. 6 is a graph showing the speech rate for each emotion.
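A minimal sketch of this measurement is shown below. The interval format `(start_s, end_s, label)`, the label `"speech"`, and the function names are assumptions for illustration; the ratio is computed on speech-region length, so a value below 1 corresponds to faster speech than neutral.

```python
def total_speech_length(intervals):
    """Sum the durations of the intervals annotated as speech regions."""
    return sum(end - start for start, end, label in intervals if label == "speech")


def emotion_to_neutral_ratio(lengths_by_emotion):
    """Divide each emotion's speech length by the same speaker's neutral length."""
    neutral = lengths_by_emotion["neutral"]
    if neutral <= 0:
        raise ValueError("neutral speech length must be positive")
    return {
        emotion: length / neutral
        for emotion, length in lengths_by_emotion.items()
        if emotion != "neutral"
    }


# Illustrative annotation of one utterance (times in seconds).
utterance = [(0.0, 1.0, "speech"), (1.0, 1.5, "pause"), (1.5, 2.5, "speech")]
print(total_speech_length(utterance))
print(emotion_to_neutral_ratio({"neutral": 2.0, "anger": 1.5}))
```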

The results show that the individual speech rate must be taken into account in order to add conversation-related information to a spoken expression.

Intonation-Phrase-Level Prosody Structure Analysis

To analyze the emotional prosody structure at a level smaller than the sentence, the Korean emotional speech corpus is annotated using the K-ToBI labeling system. Nine intonation phrase (IP) boundary tones are known to be identified in Korean: L%, H%, LH%, HL%, LHL%, HLH%, HLHL%, LHLH%, and LHLHL%. It is also widely known that these IP boundary patterns play an important role in modeling how the pitch contour changes with the emotional state. By contrast, the surface tonal pattern of an accentual phrase (AP) generally has the form "L+H L+Ha" or "L Ha" regardless of the emotional state.

For the prosody analysis, the present invention uses a method based on the K-ToBI (Korean Tone and Break Indices) labeling system, which is grounded in linguistic knowledge. K-ToBI-based modeling requires two main tasks. First, from a set of labels that classify intonation into several types, the labels appropriate to the input sentence must be estimated; second, an F0 contour must be generated from the estimated labels. This approach allows F0 contours to be extracted statistically from large amounts of data, but it has the drawback that a large corpus must first be built using a systematic classification scheme and linguistic knowledge.

K-ToBI System

The ToBI (Tone and Break Indices) labeling system was introduced in 1992 as a prosodic labeling system for English, and ToBI labeling systems have since been devised for many other languages. The K-ToBI labeling system was built on English ToBI and Japanese J-ToBI, and its recent version consists of five tiers, including a word tier, a phonological tone tier, a break-index tier, and a miscellaneous tier. Each tier consists of the times at which events occur and their symbols, and the F0 contour is represented by initial tones, accentual tones, and boundary tones.

The prosodic structure of Korean consists of two prosodic units, the intonation phrase and the accentual phrase. An intonation phrase consists of one or more accentual phrases and is marked with a symbol such as 'H%', 'L%', or 'HL%' denoting the final tone change. 'H-' may occur at the beginning of an accentual phrase, and the end of an accentual phrase consists of 'LHa'. There are also break indices from '0' to '3' according to the degree of disjuncture. '0' denotes linking between words, and '3' marks the places where the break is most distinct. Of these four indices, '2' marks an accentual phrase boundary and '3' marks an intonation phrase boundary.
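The role of break indices 2 and 3 can be illustrated with a small grouping routine. This is only a sketch under an assumed data layout: each word carries the break index of the juncture that follows it, and the function name is hypothetical.

```python
def split_phrases(words, break_indices):
    """Group words into intonation phrases, each a list of accentual phrases.

    break_indices[i] is the break index after words[i]:
    2 closes an accentual phrase, 3 closes an accentual phrase
    and the enclosing intonation phrase.
    """
    if len(words) != len(break_indices):
        raise ValueError("one break index per word is required")
    intonation_phrases, accentual_phrases, current = [], [], []
    for word, index in zip(words, break_indices):
        current.append(word)
        if index >= 2:
            accentual_phrases.append(current)
            current = []
        if index == 3:
            intonation_phrases.append(accentual_phrases)
            accentual_phrases = []
    # Flush material that was not closed by an explicit boundary.
    if current:
        accentual_phrases.append(current)
    if accentual_phrases:
        intonation_phrases.append(accentual_phrases)
    return intonation_phrases


print(split_phrases(["w1", "w2", "w3", "w4"], [3, 2, 0, 3]))
```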

The remaining two tiers are the word tier, which marks the boundaries of words (eojeol), and the miscellaneous tier, which marks other information such as breathing or laughter.

Pitch Contour Analysis

To analyze the dominant IP boundary tones, 300 utterances are annotated using the K-ToBI labeling system. For the statistical analysis of the K-ToBI-labeled data, Pearson's chi-square test is performed. Table 2 shows the result of Pearson's chi-square test on the K-ToBI labels.

As Table 2 shows, the null hypothesis is rejected (p = 0.0 < 0.05). This result statistically supports the fact that each emotion has a distinct IP boundary tone that distinguishes it from the other emotions. Adjusted residuals are then computed to identify the distinct pitch contour of each emotion. Table 3 shows the adjusted residuals computed to identify the distinct IP boundary tone for each emotion, including neutral.
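The two statistics named above can be sketched for an emotion-by-boundary-tone contingency table as follows. The function name and the toy counts are hypothetical, every row and column is assumed to have a non-zero total that is smaller than the grand total, and the p-value (which requires the chi-square distribution) is not computed.

```python
import math


def chi_square_with_residuals(table):
    """Pearson chi-square statistic and adjusted residuals of a count table.

    table[i][j] is the number of utterances of emotion i that end in
    boundary tone j.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i in range(n_rows):
        residual_row = []
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
            scale = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            residual_row.append((table[i][j] - expected) / scale)
        residuals.append(residual_row)
    return chi2, residuals


# Toy 2x2 table (illustrative counts, not the data behind Tables 2 to 4).
chi2, residuals = chi_square_with_residuals([[10, 20], [20, 10]])
print(chi2, residuals)
```

A cell with a large positive adjusted residual (conventionally above about 1.96) is one where the emotion and the boundary tone co-occur more often than independence would predict.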

The statistical analysis of the pitch contour patterns shows very strong associations between anger and HL%, fear and H%, happiness and LH%, sadness and H%, and neutral and L%. If neutral is excluded from the pitch contour analysis, the L% IP boundary pattern is assigned to sadness, as shown in Table 4. Table 4 shows the adjusted residuals computed to identify the distinct IP boundary tone for each emotion, excluding neutral.

To evaluate the role of personalized information, the 300 sentences were classified according to the six speakers and a series of chi-square tests was performed, but no significant differences were found. This result can be attributed to the fact that the IP boundary pattern is a symbolic representation of the pitch contour rather than a relative pitch value between different emotions.

Pause Length Analysis

Although the analysis of the K-ToBI-labeled data yields a symbolic representation of the break indices, the relative pause lengths must also be analyzed in order to modify the information reflected in conversation.

The pause regions of the 300 uttered sentences are annotated, and the total length of the pause regions is computed for each sentence uttered by the same speaker. The generalized pause lengths are listed in Table 5, and FIG. 7 is a graph of the results.

Except for two special cases, CWJ and KKS, the ordering of pause lengths is well preserved as happiness, anger, fear, and sadness. Even so, the normalized pause length cannot be used as a value representing all speakers: if a generalized pause length were used as the representative value of the emotional prosody structure, the distinct characteristics of each speaker would be lost. Speech must therefore be synthesized using speaker-specific information.

Emotional Prosody Integration System Based on Personal Models

To integrate the intonation-phrase-level and sentence-level emotional prosody structures into a general-purpose TTS system, an emotional speech synthesis system as shown in FIG. 8 is provided.

For ease of understanding, a TTS system is briefly described as follows.

A typical text-to-speech (TTS) system can be broadly divided into a natural language processing unit, a phoneme/prosodeme extraction unit, and a signal processing unit. The natural language processing unit provides, through morphological and syntactic analysis, the basic information used by the phoneme/prosodeme extraction unit. Based on this information, the phoneme/prosodeme extraction unit converts the input document into its pronounced form and generates the information used in subsequent processing. That is, through morphological and syntactic analysis it converts letters, numerals, symbols, and the like into their correct Hangul form and selects pronunciations that account for the various phonological changes. It also estimates the prosodic boundaries and the various parameters used as input to the signal processing unit. Finally, the signal processing unit reconstructs speech using the parameters extracted upstream.

Within this structure, the prosody processing part generates the F0 contour (fundamental frequency contour), the physical signal of intonation used by the signal processing unit, and this information plays an important role in improving the naturalness and intelligibility of the synthesized speech. In speech, prosody refers to signal characteristics of phonetic variation such as pitch, loudness, and syllable length. Temporal characteristics such as speech rate or rhythm are sometimes included as well.

Prosody is generally considered to consist of a sequence of intonation phrases, and prosody processing starts with the extraction of these intonation phrases. The extracted intonation phrases are converted into an F0 contour, a physical signal, through various modeling methods.

FIG. 8 is a block diagram conceptually illustrating an emotional speech synthesis system according to another aspect of the present invention.

The emotional speech synthesis system 800 shown in FIG. 8 includes an emotional prosody structure characteristic extraction unit 810, a TTS system 830, an emotional speech synthesis unit 850, and an emotional speech database 890. The emotional speech synthesis unit 850 includes a syllable-level modification unit 860, an intonation-phrase-level modification unit 870, and a sentence-level modification unit 880.

The emotional prosody structure characteristic extraction unit 810 analyzes the speech of each individual, extracts the characteristics of the individual emotional prosody structure, and stores the extracted result in the emotional speech database 890. When input text and a target emotion are received, the TTS system 830 converts the input text into emotionless speech, and the emotional speech synthesis unit 850 modifies the emotionless speech according to the individual emotional prosody structure received from the emotional speech database 890, thereby generating and outputting emotional speech in which the emotion information is reflected.

In particular, the emotional prosody structure characteristic extraction unit 810 extracts a general emotional prosody structure from a dataset containing speech information for the speakers' basic emotions, compares each speaker's individual emotional prosody structure with the general emotional prosody structure, and parameterizes the relative difference of the individual structure. The extraction unit 810 extracts parameters at the sentence level, the intonation phrase level, and the syllable level. For example, it may analyze at the sentence level, as parameters, the overall pitch, intensity, and speech rate of the speech for each emotion of a speaker. It may also analyze at the intonation phrase level, as a parameter, the pause length between the intonation phrases (IPs) contained in the speech for each emotion of the speaker, and analyze at the syllable level, as a parameter, the IP boundary pattern of each intonation phrase contained in that speech.

The process of extracting each parameter at the sentence, intonation phrase, and syllable levels is as described above, so the description is not repeated here. Table 6 shows, for each individual, the parameters extracted by the emotional prosody structure characteristic extraction unit 810.

The syllable-level modification unit 860 extracts, as a parameter, the pitch contour corresponding to the target emotion from the individual emotional prosody structure and modifies the pitch contour of the emotionless speech using it. The intonation-phrase-level modification unit 870 extracts, as a parameter, the pause length corresponding to the target emotion from the individual emotional prosody structure and modifies the pause length of the emotionless speech using it. The sentence-level modification unit 880 extracts, as parameters, the overall pitch, overall intensity, and speech rate corresponding to the target emotion from the individual emotional prosody structure and uses them to modify the overall pitch, overall intensity, and speech rate of the emotionless speech.
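How the intonation-phrase-level and sentence-level parameters might be applied can be sketched as follows. The specification does not state the arithmetic, so this sketch assumes that pitch and intensity parameters are additive offsets relative to neutral and that pause length and speech rate are emotion-to-neutral ratios; all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SentenceParams:
    pitch_shift_hz: float       # assumed: added to every pitch sample
    intensity_shift_db: float   # assumed: added to every intensity sample
    length_ratio: float         # assumed: emotional / neutral speech length


def modify_sentence_level(pitch_hz, intensity_db, duration_s, params):
    """Shift overall pitch and intensity and rescale the utterance length."""
    new_pitch = [f0 + params.pitch_shift_hz for f0 in pitch_hz]
    new_intensity = [db + params.intensity_shift_db for db in intensity_db]
    new_duration = duration_s * params.length_ratio
    return new_pitch, new_intensity, new_duration


def modify_pauses(pause_lengths_s, pause_ratio):
    """Rescale every pause between intonation phrases by one ratio."""
    return [length * pause_ratio for length in pause_lengths_s]


# Illustrative parameter values only.
params = SentenceParams(pitch_shift_hz=20.0, intensity_shift_db=3.0, length_ratio=0.5)
print(modify_sentence_level([200.0, 210.0], [60.0], 2.0, params))
print(modify_pauses([0.5, 1.0], 2.0))
```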

To select an appropriate personal model for the emotionless speech synthesized by the general-purpose TTS system, the emotional speech synthesis system 800 compares the IQM of the emotionless speech with the pitch value of each speaker shown in FIG. 2. Once the personal model is selected, the corresponding parameters are applied in succession for the intonation-phrase-level and sentence-level prosody modifications.
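One plausible reading of this comparison is a nearest-neighbor lookup on pitch. The sketch below assumes that rule, which the text does not spell out; the function name is hypothetical, the speaker codes are those mentioned in this specification, and the pitch values are placeholders rather than the values of FIG. 2.

```python
def select_personal_model(tts_iqm_hz, speaker_pitch_hz):
    """Return the speaker whose mean neutral pitch is closest to the TTS voice."""
    if not speaker_pitch_hz:
        raise ValueError("no speaker models available")
    return min(
        speaker_pitch_hz,
        key=lambda speaker: abs(speaker_pitch_hz[speaker] - tts_iqm_hz),
    )


# Placeholder pitch values in Hz.
speakers = {"KKS": 120.0, "PYH": 210.0, "CWJ": 180.0}
print(select_personal_model(200.0, speakers))
```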

The script for this operation is as follows.

# PRAAT script for the modification of pitch contours
if emotion$ == "Anger"
    Formula... 'IQMPitch' + 'MaxPitch'*sin(x/dur*pi)
elsif emotion$ == "Fear"
    Formula... 'IQMPitch' + 'MaxPitch'*(x/dur)
elsif emotion$ == "Happiness"
    Formula... 'IQMPitch' + 'MaxPitch'*exp((x/dur)-0.5)
elsif emotion$ == "HappinessFinal"
    # formula not given in this listing
elsif emotion$ == "Sadness"
    # formula not given in this listing
endif

The pitch contour modification functions use a linear function, a sine function, and an exponential function to model each of the K-ToBI pitch contours shown in Table 4. All of the functions can be implemented as PRAAT scripts (see Boersma and Weenink, 2001).
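To make the shapes of these three functions easier to inspect outside PRAAT, the formulas of the script can be mirrored in a few lines of Python. This is only a sketch: the function name and the sampling scheme are assumptions, and only the three emotions whose formulas appear in the script are covered.

```python
import math


def pitch_contour(emotion, iqm_pitch, max_pitch, dur, n_points=5):
    """Sample IQMPitch + MaxPitch * shape(x / dur) at n_points from 0 to dur."""
    shapes = {
        "Anger": lambda t: math.sin(t * math.pi),   # rise, then fall
        "Fear": lambda t: t,                        # linear rise
        "Happiness": lambda t: math.exp(t - 0.5),   # accelerating rise
    }
    if emotion not in shapes:
        raise KeyError(f"no contour formula available for {emotion!r}")
    if n_points < 2 or dur <= 0:
        raise ValueError("need at least two points and a positive duration")
    xs = [dur * i / (n_points - 1) for i in range(n_points)]
    return [iqm_pitch + max_pitch * shapes[emotion](x / dur) for x in xs]


for name in ("Anger", "Fear", "Happiness"):
    print(name, [round(v, 1) for v in pitch_contour(name, 200.0, 50.0, 1.0)])
```

In all three formulas 'MaxPitch' acts as the size of the excursion added on top of the IQM pitch, so the contour shape is independent of the speaker's base pitch.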

Evaluation

A recognition test was conducted to evaluate the emotional prosody integration system and its personal models. The following five sentences were used, and the emotionless speech was generated with a commercial Korean TTS system with a female voice (see VoiceText at http://www.voiceware.co.kr).

Sentence 1: 난 가지 말라고 하면서 문을 닫았어. (I closed the door, telling them not to go.)

Sentence 2: 정말 그렇단 말이야. (I really mean it.)

Sentence 3: 나도 몰라. (I don't know either.)

Sentence 4: 우리가 하는 일이 얼마나 중요한지 너는 모를 꺼야. (You have no idea how important our work is.)

Sentence 5: 바람과 해님이 서로 힘이 더 세다고 다투고 있을 때 한 나그네가 따뜻한 외투를 입고 걸어 왔습니다. (While the wind and the sun were arguing over which of them was stronger, a traveler came walking along in a warm coat.)

Twenty emotional utterances are then generated, with each of the five sentences rendered in four emotions: anger, fear, happiness, and sadness.

FIG. 9 is a graph showing the original prosody structure and the emotional synthesis results.

FIG. 9 shows the prosody structure for the sentence "난 가지 말라고 하면서 문을 닫았어." (Sentence 1). The solid lines represent the intensity contours, and the dashed lines represent the pitch contours.

The test was conducted with 40 female kindergarten teachers with a mean age of 28.8 years. After listening to the 20 randomly ordered emotional synthetic utterances, they were asked to choose, for each utterance, which emotion they thought it expressed: anger, fear, happiness, or sadness. Table 7 shows the results of the recognition test for the 40 participants.

Referring to Table 7, anger is the easiest emotion to recognize and fear is the hardest. Nevertheless, the overall success rate, including for happiness, is higher than the value expected by chance. Compared with the evaluation results reported in the prior art, the success rates in Table 7 are very favorable, because the conventional recognition rates are close to chance level (Lee and Park, 2009; Schroeder 2001).
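The comparison with chance can be made concrete with a short tally. The sketch assumes a four-way forced choice, so that chance level is 1/4; the function name is hypothetical and the listed responses are placeholders, not the data of Table 7.

```python
CHANCE_LEVEL = 1 / 4  # four candidate emotions, assuming a forced choice


def recognition_rates(responses):
    """Per-emotion share of responses that match the intended emotion.

    responses is a list of (intended_emotion, perceived_emotion) pairs.
    """
    totals, correct = {}, {}
    for intended, perceived in responses:
        totals[intended] = totals.get(intended, 0) + 1
        if intended == perceived:
            correct[intended] = correct.get(intended, 0) + 1
    return {emotion: correct.get(emotion, 0) / totals[emotion] for emotion in totals}


# Placeholder responses for illustration only.
responses = [
    ("anger", "anger"), ("anger", "anger"),
    ("fear", "sadness"), ("fear", "fear"),
]
for emotion, rate in recognition_rates(responses).items():
    print(emotion, rate, "above chance" if rate > CHANCE_LEVEL else "at or below chance")
```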

As described above, the emotional speech synthesis technique according to the present invention regards each emotion as a change in prosodic features and finds the distinct prosodic features that set each emotion apart from the others. In particular, because the present invention does not treat emotion as universal information obtainable from the analysis of a vast dataset, the recognition rate can be improved. That is, taking the fundamental nature of emotion into account, the technique presupposes that emotion is not a kind of universal information and analyzes the emotional prosody structures of four basic emotions: anger, fear, happiness, and sadness. The individual emotional prosody structure obtained from this analysis includes parameters such as pitch, intensity, and pause length. These parameters are then used to add emotion information to the emotionless speech synthesized by the TTS system.

The emotional speech synthesis technique according to the present invention achieves an average recognition rate of up to 95.5% across all emotions in a series of repeated recognition tests supported by sufficient prior training. Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. For example, in the present invention, six speakers uttered the Korean sentences used to generate the emotional prosody structures, as shown in Table 8 below.

However, these speakers are provided only as examples, and other speakers may of course be used to generate a personal emotional prosody structure for each emotion.

Application: Development of a Korean Emotional Speech Synthesis System for the Voice Interface of Assistive Robots

Today, advances in robot technology make it possible to help people in everyday life in many ways.

Conventionally, assistive mobile robots have been described that help people with disabilities work in a factory environment. However, such assistive robot systems did not require a visual or conversational interface for human-robot interaction. There are also mobile robots that assist the elderly by providing information about daily life. For robot-human interaction, however, it is important to use facial expressions, body movements, and voice, and emotion expression techniques can be applied here.

Therefore, the speech synthesis technique according to the present invention can also be applied to a Korean emotional prosody integration system that modifies the prosody structure according to sentence type and emotional state. The emotional prosody structure is regarded as a relative prosodic difference determined for each speaker, and this difference is parameterized per individual. To identify the sentence type, a combination of the morphological and syntactic information appearing in a given sentence can be used. Such an emotional prosody integration system can be used as a post-processing module for a general-purpose TTS system.
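The per-speaker parameterization can be illustrated with a short Python sketch. It computes an interquartile mean (the statistic named in the claims for pitch values) and expresses an emotional value relative to the same speaker's neutral value. The function names, the simplified trimming rule, and the sample pitch values are assumptions for illustration only.

```python
from typing import Sequence


def interquartile_mean(values: Sequence[float]) -> float:
    """Simplified interquartile mean: drop the lowest and highest n // 4 values, average the rest.

    An exact IQM weights fractional observations when n is not a multiple of 4;
    this sketch ignores that refinement.
    """
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    k = len(ordered) // 4
    middle = ordered[k:len(ordered) - k]
    return sum(middle) / len(middle)


def relative_difference(emotional: Sequence[float], neutral: Sequence[float]) -> float:
    """Relative change of the emotional IQM with respect to the speaker's neutral IQM."""
    base = interquartile_mean(neutral)
    return (interquartile_mean(emotional) - base) / base


# Illustrative pitch samples in Hz for one hypothetical speaker.
neutral_pitch = [100.0, 190.0, 200.0, 210.0, 220.0, 230.0, 240.0, 500.0]
angry_pitch = [150.0, 250.0, 255.0, 260.0, 265.0, 270.0, 280.0, 600.0]

neutral_iqm = interquartile_mean(neutral_pitch)
pitch_change = relative_difference(angry_pitch, neutral_pitch)
```

Because the interquartile mean discards the extreme quarters, isolated pitch-tracking outliers such as the 500 Hz sample above do not shift the estimate.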

Sentence Types and Prosody Structure

In Korean, unlike in English, sentences are divided into five categories according to punctuation and final endings: declarative, imperative, propositive, interrogative, and exclamatory.

Based on a linguistic analysis of the morphological and syntactic cues of each sentence type, the present invention automatically identifies the sentence type from the results of morphological analysis.
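A rough illustration of classifying a sentence into these five categories from punctuation and final endings is sketched below in Python. It is a deliberately simplified stand-in: the invention relies on morphological analysis, whereas this sketch only matches a few surface suffixes. The ending lists, the rule order, and the function name are assumptions chosen for illustration and are far from exhaustive.

```python
# Hypothetical, non-exhaustive final endings for each sentence type.
ENDINGS = [
    ("exclamatory", ("구나", "군요")),
    ("propositive", ("자", "읍시다")),
    ("imperative", ("아라", "어라", "세요", "십시오")),
]


def classify_sentence(sentence: str) -> str:
    """Guess the sentence type from the final punctuation mark and surface ending."""
    text = sentence.strip()
    punct = text[-1] if text and text[-1] in ".?!" else ""
    body = text.rstrip(".?!").strip()
    # A question mark is treated as decisive in this sketch.
    if punct == "?":
        return "interrogative"
    # Otherwise look for a known final ending on the last word.
    for label, endings in ENDINGS:
        if body.endswith(endings):
            return label
    # Fall back on punctuation alone.
    if punct == "!":
        return "exclamatory"
    return "declarative"


examples = ["나는 학교에 간다.", "문을 닫아라.", "밥을 먹자.", "어디 가니?", "날씨가 좋구나!"]
labels = [classify_sentence(s) for s in examples]
```

Surface matching of this kind misclassifies sentences whose last word merely happens to end in one of the listed syllables, which is one reason a morphological analyzer is the more reliable basis.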

Korean Emotional Speech Synthesis System for the Voice Interface of Assistive Robots

A speech output interface for an assistive robot system capable of generating emotional speech synthesis results is implemented on the basis of sentence type and personal information. The emotional speech synthesis system according to the present invention can be used for this purpose. The system enables fine-grained prosody modification at the intonation phrase level, the word level, and the syllable level.

Thus, according to one aspect of the present invention, a Korean emotional speech synthesis system is provided that can be used in the voice interface of an assistive robot designed for the elderly. The system modifies the prosody structure of a given sentence according to sentence type and emotional state.

To identify the sentence type, a combination of the morphological and syntactic information appearing in a given sentence is used. To analyze the emotional prosody structure, the speech synthesis method according to one aspect of the present invention is used.

In addition, the method according to the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices that store data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as implementations in the form of carrier waves (for example, transmission over the Internet). The computer-readable recording medium can also store computer-readable code that is executed in a distributed manner by computer systems connected over a network.

Therefore, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

The present invention can be applied to speech synthesis technology.

Claims

delete

A method for synthesizing emotional speech based on a personal prosody model, the method comprising:
an emotional prosody structure characteristic extraction step of extracting characteristics of a personal emotional prosody structure by analyzing the speech of each individual;
storing the extracted personal emotional prosody structure in an emotional speech database;
a receiving step of receiving input text and a target emotion;
retrieving, from the emotional speech database, the personal emotional prosody structure corresponding to the speaker whose voice is to be synthesized; and
an emotional speech synthesis step of converting the input text into emotionless speech and modifying the converted emotionless speech based on the personal emotional prosody structure corresponding to the target emotion, thereby generating and outputting emotional speech corresponding to the speaker,
wherein the emotional prosody structure characteristic extraction step comprises:
a general emotional prosody structure extraction step of extracting a general emotional prosody structure from a dataset including speech information according to the basic emotions of speakers; and
a personal emotional prosody structure extraction step of comparing the personal emotional prosody structure of each speaker with the general emotional prosody structure and parameterizing the relative difference of the personal emotional prosody structure.

The method of claim 2, wherein the general emotional prosody structure extraction step comprises
extracting, for each speaker, speech characteristics corresponding to anger, fear, happiness, sadness, and a neutral emotional state.

The method of claim 2, wherein the personal emotional prosody structure extraction step comprises:
a sentence-level emotional prosody structure analysis step of analyzing, as parameters at the sentence level, the overall pitch, intensity, and speech rate of the speech for each emotion of the speaker;
an intonation-phrase-level emotional prosody structure analysis step of analyzing, as a parameter at the intonation phrase level, the pause length between intonation phrases (IPs) included in the speech for each emotion of the speaker; and
a syllable-level emotional prosody structure analysis step of analyzing, as a parameter at the syllable level, the IP boundary pattern of each intonation phrase included in the speech for each emotion of the speaker.

The method of claim 4, wherein the sentence-level emotional prosody structure analysis step comprises:
calculating the interquartile mean (IQM) of the pitch values of the speech for each emotion;
selecting intensities equal to or greater than a predetermined value from among the intensities of the speech for each emotion;
calculating the speech rate from the total utterance length of the speech for each emotion;
normalizing the pitch value, intensity, and speech rate to remove per-sentence and per-speaker discrepancies; and
using the normalized results, calculating the difference in the personal emotional prosody parameters for each emotion relative to the personal emotional prosody structure corresponding to the neutral emotional state, and constructing the personal emotional prosody structure from the calculation results.

The method of claim 4, wherein the intonation-phrase-level emotional prosody structure analysis step comprises:
detecting pause regions between the intonation phrases of the speech for each emotion; and
calculating a total pause length by summing the lengths of the pause regions.

The method of claim 4, wherein the syllable-level emotional prosody structure analysis step comprises
analyzing the intonation phrase boundary pattern as a pitch contour corresponding to one of L%, H%, LH%, HL%, LHL%, HLH%, HLHL%, LHLH%, and LHLHL%.

The method of claim 2, wherein the emotional speech synthesis step comprises:
converting the input text into emotionless speech using a text-to-speech (TTS) system;
retrieving, from the emotional speech database, the emotional prosody structure corresponding to the target emotion of the speaker; and
a speech modification step of generating the emotional speech by modifying the emotionless speech using the parameters of the retrieved emotional prosody structure.

The method of claim 8, wherein the speech modification step comprises:
extracting, as a parameter, a pitch contour corresponding to the target emotion from the personal emotional prosody structure; and
a syllable-level modification step of modifying the pitch contour of the emotionless speech using the extracted pitch contour.

The method of claim 8, wherein the speech modification step comprises:
extracting, as a parameter, a pause length corresponding to the target emotion from the personal emotional prosody structure; and
an intonation-phrase-level modification step of modifying the pause length of the emotionless speech using the extracted pause length.

The method of claim 8, wherein the speech modification step comprises:
extracting, as parameters, an overall pitch, an overall intensity, and a speech rate corresponding to the target emotion from the personal emotional prosody structure; and
a sentence-level modification step of modifying the overall pitch, overall intensity, and speech rate of the emotionless speech using the extracted overall pitch, overall intensity, and speech rate.

Claim 12 was abandoned upon payment of the registration fee.

12. A recording medium having a computer program recorded thereon, the computer program comprising instructions executable by a computer for implementing the method according to any one of claims 2 to 11.

delete

A system for synthesizing emotional speech based on a personal prosody model, the system comprising:
an emotional prosody structure characteristic extraction unit that analyzes the speech of each individual to extract characteristics of a personal emotional prosody structure;
an emotional speech database that stores the extracted personal emotional prosody structure; and
an emotional speech synthesis unit that, when input text and a target emotion are received, converts the input text into emotionless speech, retrieves from the emotional speech database the personal emotional prosody structure corresponding to the speaker whose voice is to be synthesized, and modifies the converted emotionless speech based on the personal emotional prosody structure corresponding to the target emotion, thereby generating and outputting emotional speech corresponding to the speaker,
wherein the emotional prosody structure characteristic extraction unit
extracts a general emotional prosody structure from a dataset including speech information according to the basic emotions of speakers, and
compares the personal emotional prosody structure of each speaker with the general emotional prosody structure and parameterizes the relative difference of the personal emotional prosody structure.

The system of claim 14, wherein, to extract the general emotional prosody structure, the emotional prosody structure characteristic extraction unit
is adapted to extract, for each speaker, speech characteristics corresponding to anger, fear, happiness, sadness, and a neutral emotional state.

The system of claim 14, wherein, to extract the personal emotional prosody structure, the emotional prosody structure characteristic extraction unit is adapted to perform:
a sentence-level emotional prosody structure analysis operation of analyzing, as parameters at the sentence level, the overall pitch, intensity, and speech rate of the speech for each emotion of the speaker;
an intonation-phrase-level emotional prosody structure analysis operation of analyzing, as a parameter at the intonation phrase level, the pause length between intonation phrases (IPs) included in the speech for each emotion of the speaker; and
a syllable-level emotional prosody structure analysis operation of analyzing, as a parameter at the syllable level, the IP boundary pattern of each intonation phrase included in the speech for each emotion of the speaker.

The system of claim 16, wherein, to analyze the sentence-level emotional prosody structure, the emotional prosody structure characteristic extraction unit is adapted to perform the operations of:
calculating the interquartile mean (IQM) of the pitch values of the speech for each emotion;
selecting intensities equal to or greater than a predetermined value from among the intensities of the speech for each emotion;
calculating the speech rate from the total utterance length of the speech for each emotion;
normalizing the pitch value, intensity, and speech rate to remove per-sentence and per-speaker discrepancies; and
using the normalized results, calculating the difference in the personal emotional prosody parameters for each emotion relative to the personal emotional prosody structure corresponding to the neutral emotional state, and constructing the personal emotional prosody structure from the calculation results.

The system of claim 16, wherein, to analyze the intonation-phrase-level emotional prosody structure, the emotional prosody structure characteristic extraction unit is adapted to perform the operations of:
detecting pause regions between the intonation phrases of the speech for each emotion; and
calculating a total pause length by summing the lengths of the pause regions.

The system of claim 16, wherein, to analyze the syllable-level emotional prosody structure, the emotional prosody structure characteristic extraction unit
is adapted to perform the operation of analyzing the intonation phrase boundary pattern as a pitch contour corresponding to one of L%, H%, LH%, HL%, LHL%, HLH%, HLHL%, LHLH%, and LHLHL%.

The system of claim 14, wherein the emotional speech synthesis unit is adapted to perform the operations of:
converting the input text into emotionless speech using a TTS system;
retrieving, from the emotional speech database, the emotional prosody structure corresponding to the target emotion of the speaker; and
a speech modification operation of generating the emotional speech by modifying the emotionless speech using the parameters of the retrieved emotional prosody structure.

The system of claim 20, wherein, to perform the speech modification operation, the emotional speech synthesis unit is adapted to perform:
extracting, as a parameter, a pitch contour corresponding to the target emotion from the personal emotional prosody structure; and
a syllable-level modification operation of modifying the pitch contour of the emotionless speech using the extracted pitch contour.

The system of claim 20, wherein, to perform the speech modification operation, the emotional speech synthesis unit is adapted to perform:
extracting, as a parameter, a pause length corresponding to the target emotion from the personal emotional prosody structure; and
an intonation-phrase-level modification operation of modifying the pause length of the emotionless speech using the extracted pause length.

The system of claim 20, wherein, to perform the speech modification operation, the emotional speech synthesis unit is adapted to perform:
extracting, as parameters, an overall pitch, an overall intensity, and a speech rate corresponding to the target emotion from the personal emotional prosody structure; and
a sentence-level modification operation of modifying the overall pitch, overall intensity, and speech rate of the emotionless speech using the extracted overall pitch, overall intensity, and speech rate.

Claim 24 was abandoned upon payment of the registration fee.

24. A computer-readable recording medium having recorded thereon a computer program comprising instructions executable by a computer for operating the system according to any one of claims 14 to 23.