KR102045761B1

KR102045761B1 - Device for changing voice synthesis model according to character speech context

Info

Publication number: KR102045761B1
Application number: KR1020190118791A
Authority: KR
Inventors: 송민규; 윤종성
Original assignee: 미디어젠(주)
Priority date: 2019-09-26
Filing date: 2019-09-26
Publication date: 2019-11-18
Also published as: WO2021060591A1

Abstract

The present invention relates to an apparatus for changing a speech synthesis model according to a character speech context. More particularly, it relates to an apparatus that overcomes the uniformity of the prior art, in which selecting a model causes text to be output in only one voice tone. It does so by outputting speech with various voice tones and emotions for each character according to the speech context, thereby improving the emotional quality of fairy-tale reading, character dialogue and interactive conversation. The apparatus comprises: an information input unit (100); an information analysis unit (200); a speech synthesis model selection unit (300); and a speech synthesis output unit (400).

Description

Device for changing voice synthesis model according to character speech context

The present invention relates to an apparatus for changing a speech synthesis model according to a character speech context. More particularly, it relates to an apparatus that overcomes the uniformity of the prior art, in which selecting one model causes text to be spoken in only one voice tone. It outputs speech in various voice tones and emotional tones for each character depending on the context, and thereby further improves the emotional quality of fairy-tale reading, character dialogue, interactive conversation and the like.

Holographic virtual reality technology based on speech recognition and speaker authentication has developed steadily worldwide and currently delivers highly satisfactory performance in restricted environments.

In particular, holographic virtual reality increasingly requires spoken conversation with virtual characters. When a user talks with several characters, a technique for changing the voice for each character is needed; when a user talks with a single character, a technique for expressing emotion according to the context is needed.

FIG. 1 shows an example of a hologram service using conventional speech synthesis technology. In the conventional holographic virtual reality technology shown in FIG. 1, when a specific character is selected and speaks, each sentence is output with only one voice and one tone. The resulting speech output is uniform.

Because the prior art outputs text in only one voice tone once a model is selected, it has not been able to substantially improve the emotional quality of interactive conversation.

The present invention was conceived to solve these problems of the prior art and to meet this technical need, and it makes emotionally rich speech output possible.

Accordingly, the present invention proposes an apparatus for changing a speech synthesis model according to a character speech context. The apparatus outputs speech in various tones depending on the speech context and can thereby greatly improve the emotional quality of fairy-tale reading, character dialogue and interactive conversation.

(Prior Art Document 1) Korean Registered Patent No. 10-1089184
(Prior Art Document 2) Korean Registered Patent No. 10-1006491
(Prior Art Document 3) Korean Patent Publication No. 2001-25161
(Prior Art Document 4) Korean Patent Publication No. 2002-42248
(Prior Art Document 5) Korean Patent Publication No. 2001-34987

An object of the present invention is to improve the emotional quality of interactive conversation by changing the voice for each character and by synthesizing speech that expresses emotion according to the speech context.

Another object is to output input text as speech with various voice tones and emotional tones for each character and emotional state.

To solve the problems of the prior art and achieve the above objects, the present invention is characterized by comprising:

an information input unit 100 for inputting speech synthesis target information;

an information analysis unit 200 for analyzing the speech synthesis target information input through the information input unit 100 and storing the analysis result information;

a speech synthesis model selection unit 300 for selecting a speech synthesis model with reference to the analysis result produced by the information analysis unit 200; and

a speech synthesis output unit 400 for driving a speech synthesis engine to output a character voice to which the speech synthesis model selected by the speech synthesis model selection unit 300 is applied.

The present invention eliminates the uniformity of the prior art, in which selecting one model causes text to be spoken in only one voice tone. It outputs speech in various voice tones and emotional tones for each character according to the speech context, and thereby further improves the emotional quality of fairy-tale reading, character dialogue, interactive conversation and the like.

Specifically, the invention makes it possible to select a speech synthesis model according to the character and emotional state, to output speech while changing the speech synthesis model sentence by sentence, and to keep changing the model according to the character and emotional state until the input sentences are exhausted. By outputting speech with various voice tones and emotional tones for each character according to the context in this way, it further improves the emotional quality of fairy-tale reading, character dialogue, interactive conversation and the like.

FIG. 1 is an example diagram of a hologram service using conventional speech synthesis technology.
FIG. 2 is an overall configuration diagram of the apparatus for changing a speech synthesis model according to a character speech context according to the present invention.
FIG. 3 is a block diagram of the information analysis unit 200 of the apparatus according to the present invention.
FIG. 4 is a block diagram of the speech synthesis model selection unit 300 of the apparatus according to the present invention.
FIG. 5 is a block diagram of the speech synthesis output unit 400 of the apparatus according to the present invention.

The following merely illustrates the principles of the invention. Therefore, those skilled in the art can devise various devices that embody the principles of the invention and fall within its concept and scope, even if those devices are not explicitly described or illustrated herein.

In addition, all conditional terms and embodiments listed herein are, in principle, expressly intended only to help the concept of the invention be understood, and should be understood as not limited to the specifically listed embodiments and states.

In describing the present invention, terms such as "first" and "second" may be used to describe various components, but the components are not limited by these terms.

For example, a first component may be called a second component without departing from the scope of the present invention, and similarly, a second component may be called a first component.

When a component is referred to as being connected or coupled to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may also be present in between.

The terminology used herein is for describing particular embodiments only and is not intended to limit the invention; singular forms may include plural forms unless the context clearly indicates otherwise.

In this specification, terms such as "comprise" or "include" designate the presence of the features, numbers, steps, operations, components, parts or combinations thereof described in the specification. They should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts or combinations thereof.

Hereinafter, the apparatus for changing a speech synthesis model according to a character speech context according to the present invention is described in detail by way of embodiments.

FIG. 2 is an overall configuration diagram schematically showing the apparatus for changing a speech synthesis model according to a character speech context according to the present invention.

As shown in FIG. 2, the apparatus eliminates the uniformity of the prior art, in which selecting one model causes text to be spoken in only one voice tone. It outputs speech in various voice tones and emotional tones for each character according to the speech context, thereby further improving the emotional quality of fairy-tale reading, character dialogue, interactive conversation and the like.

To achieve these effects, the apparatus for changing a speech synthesis model according to a character speech context according to the present invention is characterized by comprising an information input unit 100 for inputting speech synthesis target information, an information analysis unit 200 for analyzing that information and storing the analysis result information, a speech synthesis model selection unit 300 for selecting a speech synthesis model with reference to the analysis result, and a speech synthesis output unit 400 for driving a speech synthesis engine to output a character voice to which the speech synthesis model selected by the speech synthesis model selection unit 300 is applied.

Specifically, the information input unit 100 receives the speech synthesis target information.

The speech synthesis target information includes text information, character information and emotional state information, and at least one of the character information and the emotional state information is matched to the text information.

When entering the text information (text) to be synthesized, the user matches character information to it. For example, the user enters text information with matched character information such as '[Hong Gil-dong] Hello? [Shin Saimdang] Nice to meet you.'

The user also matches emotional state information when entering text information. For example, the user enters text information with matched emotional state information such as '[Hong Gil-dong] [joy] Oh, that's really good news. [Shin Saimdang] [sadness] I'm so sorry that happened.'

The information analysis unit 200 analyzes the speech synthesis target information input through the information input unit 100 and stores the analysis result information. For example, if the input is '[Shin Saimdang] [sadness] I'm so sorry that happened.', the unit analyzes it and generates, as analysis result information, the character 'Shin Saimdang', the emotional state 'sadness' and the text 'I'm so sorry that happened.'
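The input convention above (a bracketed character tag followed by optional bracketed emotion tags and text) can be turned into analysis result records with a few lines of code. The Python below is a minimal sketch, not the patent's implementation. The `Utterance` record, the `parse_tagged_input` function and the `KNOWN_EMOTIONS` vocabulary are hypothetical, and the sketch assumes that emotion tags can be told apart from character tags by looking them up in a fixed list. A character tag starts a new utterance. An emotion tag attaches to the current utterance wherever it appears, which also covers the trailing-tag form used in the Han Seok-bong example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical emotion vocabulary; the patent does not define one.
KNOWN_EMOTIONS = {"joy", "sadness", "surprise", "gladness", "solemnity"}

# Matches one bracketed tag such as "[Shin Saimdang]" or "[sadness]".
TAG = re.compile(r"\[([^\]]+)\]")


@dataclass
class Utterance:
    character: Optional[str]
    emotion: Optional[str]
    text: str


def parse_tagged_input(raw: str) -> List[Utterance]:
    """Split tagged input into (character, emotion, text) records."""
    utterances: List[Utterance] = []

    def current() -> Utterance:
        # Text or an emotion tag that appears before any character tag gets
        # a record with no character, so a later check can request re-input.
        if not utterances:
            utterances.append(Utterance(None, None, ""))
        return utterances[-1]

    def add_text(chunk: str) -> None:
        chunk = chunk.strip()
        if chunk:
            u = current()
            u.text = f"{u.text} {chunk}".strip()

    pos = 0
    for m in TAG.finditer(raw):
        add_text(raw[pos:m.start()])       # text between the previous tag and this one
        label = m.group(1).strip()
        if label.lower() in KNOWN_EMOTIONS:
            current().emotion = label.lower()                # emotion tag: annotate
        else:
            utterances.append(Utterance(label, None, ""))    # character tag: new speaker
        pos = m.end()
    add_text(raw[pos:])                    # text after the last tag
    return utterances
```

For the two-speaker example, `parse_tagged_input("[Hong Gil-dong] Hello? [Shin Saimdang] Nice to meet you.")` yields one record per speaker. Keeping a record with `character=None` for untagged text leaves room for the re-input check that the detection modules perform.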

The speech synthesis model selection unit 300 selects a speech synthesis model with reference to the analysis result produced by the information analysis unit 200. The speech synthesis output unit 400 drives a speech synthesis engine to output a character voice to which the speech synthesis model selected by the speech synthesis model selection unit 300 is applied.

For example, the unit selects 'calm female voice', the speech synthesis model best suited to the character 'Shin Saimdang', and uses that calm female voice to output 'I'm so sorry that happened.' with the emotional state 'sadness'.

In other words, the speech synthesis engine is driven so that the character 'Shin Saimdang' speaks 'I'm so sorry that happened.' in a 'calm female voice' with the emotional state 'sadness'.

FIG. 3 is a block diagram of the information analysis unit 200 of the apparatus according to the present invention.

As shown in FIG. 3, the information analysis unit 200 comprises a sentence segmentation module 210, a speech synthesis model selection reference module 220, a character information detection module 230, a character information storage module 240, an emotional state information detection module 250 and an emotional state information storage module 260.

Specifically, the sentence segmentation module 210 segments the text information in the speech synthesis target information input through the information input unit 100 into sentences according to predefined sentence segmentation rules, and stores the result.

In short, when text is input, it is divided at line breaks, periods, other punctuation marks and the like to perform sentence-level segmentation. The segmentation follows sentence segmentation rules that are defined and stored in advance in the information input unit 100.

In addition, when the text information in the speech synthesis target information violates the predefined sentence segmentation rules, the sentence segmentation module 210 sends a text information re-input signal to the information input unit.
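The segmentation step and its re-input signal might look roughly like the sketch below. The patent does not publish its sentence segmentation rules, so the rules here are assumptions. Text is broken at line breaks and after '.', '?' or '!' when followed by whitespace. An empty result, or any sentence longer than a hypothetical `MAX_SENTENCE_CHARS`, is treated as a rule violation. The boolean returned by the hypothetical `segment_sentences` function stands in for the text information re-input signal.

```python
import re
from typing import List, Tuple

# Assumed rule: a sentence ends at '.', '?' or '!' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")
MAX_SENTENCE_CHARS = 200  # hypothetical limit, not taken from the patent


def segment_sentences(text: str) -> Tuple[List[str], bool]:
    """Return (sentences, reinput_requested) for one block of input text."""
    sentences: List[str] = []
    for line in text.splitlines():                 # line breaks always split
        for piece in SENTENCE_END.split(line.strip()):
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    # Rule violations (both assumed): nothing usable, or an overlong sentence.
    violated = not sentences or any(len(s) > MAX_SENTENCE_CHARS for s in sentences)
    return sentences, violated
```

Returning the flag rather than raising an exception keeps the decision about how to prompt the user in the information input unit, which matches the signal-based wording of the text.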

The speech synthesis model selection reference module 220 stores reference information for selecting a speech synthesis model suited to each character and reference information for selecting a speech synthesis model suited to each emotional state.

For example, it stores per-character reference information such as 'calm female voice', 'lively female voice' and 'dejected female voice', and per-emotional-state reference information such as 'sadness', 'joy' and 'surprise'.

For example, for the character Shin Saimdang, the speech synthesis model extracted using the per-character reference information may be 'calm middle-aged female voice'. For the character Hong Gil-dong, it may be 'powerful, brave young male voice'.

The character information detection module 230 detects the character information contained in the speech synthesis target information input through the information input unit 100. It then determines speech synthesis model information suited to the detected character, using the per-character reference information stored in the speech synthesis model selection reference module 220. Finally, it stores the detected character information together with the determined speech synthesis model information in the character information storage module 240.

For example, when the character information 'Shin Saimdang' is detected, 'calm female voice' is determined as the suitable speech synthesis model information with reference to the per-character reference information stored in the speech synthesis model selection reference module 220.

In addition, when the input speech synthesis target information contains no character information, the character information detection module 230 sends a character information re-input signal to the information input unit.

The character information storage module 240 stores the character information detected by the character information detection module 230 together with the speech synthesis model information determined to suit that character.

The emotional state information detection module 250 detects the emotional state information contained in the speech synthesis target information input through the information input unit 100. It then determines speech synthesis model information suited to the detected emotional state, using the per-emotional-state reference information stored in the speech synthesis model selection reference module 220. Finally, it stores the detected emotional state information together with the determined speech synthesis model information in the emotional state information storage module 260.

For example, when the emotional state information 'sadness' is detected, 'sadness' is determined as the suitable speech synthesis model information with reference to the per-emotional-state reference information stored in the speech synthesis model selection reference module 220.

In addition, when the input speech synthesis target information contains no emotional state information, the emotional state information detection module 250 sends an emotional state information re-input signal to the information input unit.

The emotional state information storage module 260 stores the emotional state information detected by the emotional state information detection module 250 together with the speech synthesis model information determined to suit that emotional state.
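Modules 220 to 260 can be read as two lookup tables plus two small stores. The Python sketch below shows one possible reading, under stated assumptions. The table contents echo the examples above, but the keys and value strings, the `AnalysisStore` class and the `detect` function are all hypothetical. As the text describes, the sketch raises a re-input flag when the character or emotion is missing. It also raises the same flag for a label that has no entry in the reference tables, which is an extra assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical contents of the reference module (220).
CHARACTER_REFERENCE: Dict[str, str] = {
    "Shin Saimdang": "calm middle-aged female voice",
    "Hong Gil-dong": "powerful, brave young male voice",
}
EMOTION_REFERENCE: Dict[str, str] = {
    "sadness": "sad style",
    "joy": "joyful style",
    "surprise": "surprised style",
}


@dataclass
class AnalysisStore:
    """Stand-in for storage modules 240 (characters) and 260 (emotions)."""
    characters: Dict[str, str] = field(default_factory=dict)
    emotions: Dict[str, str] = field(default_factory=dict)


def detect(character: Optional[str], emotion: Optional[str],
           store: AnalysisStore) -> Tuple[bool, bool]:
    """Detect and store model info for one utterance (modules 230 and 250).

    Returns (character_reinput, emotion_reinput). True means the input unit
    should be asked to re-enter that piece of information. A missing (None)
    label and an unknown label both trigger re-input here.
    """
    character_reinput = character not in CHARACTER_REFERENCE
    if not character_reinput:
        store.characters[character] = CHARACTER_REFERENCE[character]
    emotion_reinput = emotion not in EMOTION_REFERENCE
    if not emotion_reinput:
        store.emotions[emotion] = EMOTION_REFERENCE[emotion]
    return character_reinput, emotion_reinput
```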

FIG. 4 is a block diagram of the speech synthesis model selection unit 300 of the apparatus according to the present invention.

As shown in FIG. 4, the speech synthesis model selection unit 300 comprises an output target sentence determination module 310, a character and emotional state determination module 320, a speech synthesis model information storage module 330 and a speech synthesis model selection module 340.

Specifically, the output target sentence determination module 310 fixes, in sequence, the unit sentences of the output target text that the sentence segmentation module 210 of the information analysis unit 200 has segmented into sentences.

For example, if the input output target text is 'Nice to meet you.' 'I'm so sorry that happened.', the module fixes 'Nice to meet you.' as the first sentence and 'I'm so sorry that happened.' as the second.

The character and emotional state determination module 320 uses the character information and emotional state information detected by the character information detection module 230 and the emotional state information detection module 250 of the information analysis unit 200. From these, it determines the character and emotional state to be synthesized for each sequentially fixed unit sentence.

For example, the character and emotional state to be synthesized for the unit sentence 'Nice to meet you.' are Shin Saimdang and gladness, respectively. For the unit sentence 'I'm so sorry that happened.' they are Shin Saimdang and sadness.
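Modules 310 and 320 can be pictured as turning analyzed utterances into an ordered list of per-sentence jobs, each carrying its rank, character and emotion. In this hypothetical sketch, the input is a list of `(character, emotion, text)` tuples, and `Job` and `plan_output` are illustrative names. The sentence-splitting rule (break after '.', '?' or '!' followed by whitespace) is an assumption rather than anything specified in the patent.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Assumed rule: a sentence ends at '.', '?' or '!' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


class Job(NamedTuple):
    rank: int                 # output order fixed by module 310 (1-based)
    sentence: str
    character: Optional[str]  # fixed per sentence by module 320
    emotion: Optional[str]


def plan_output(
    utterances: List[Tuple[Optional[str], Optional[str], str]],
) -> List[Job]:
    """Turn (character, emotion, text) tuples into ordered sentence jobs."""
    jobs: List[Job] = []
    for character, emotion, text in utterances:
        for sentence in _SENTENCE_END.split(text.strip()):
            if sentence:
                # Every sentence inherits the speaker and emotion of its utterance.
                jobs.append(Job(len(jobs) + 1, sentence, character, emotion))
    return jobs
```

With the two example utterances, 'Nice to meet you.' becomes job 1 (Shin Saimdang, gladness) and 'I'm so sorry that happened.' becomes job 2 (Shin Saimdang, sadness).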

The speech synthesis model information storage module 330 stores a variety of speech synthesis model information for each character and emotional state.

For example, it stores a 'calm female voice' model suited to the character 'Shin Saimdang', a 'brave young male voice' model suited to the character 'Hong Gil-dong', a 'stern mature male voice' model suited to the character 'Yi Sun-sin', and so on.

The speech synthesis model selection module 340 extracts, from the speech synthesis model information storage module 330, a speech synthesis model suited to the determined character and emotional state.

For example, suppose the input speech synthesis target information is '[Han Seok-bong's mother] Son, you write your letters and I will slice the rice cakes. [solemnity]', and the character and emotional state determination module 320 has determined the character to be Han Seok-bong's mother and the emotional state to be solemnity. The speech synthesis model selection module 340 then extracts, from the speech synthesis model information storage module 330, a speech synthesis model that can speak the input text 'Son, you write your letters and I will slice the rice cakes.' in a calm female voice with solemn emotion.
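A simple way to model the speech synthesis model information storage module 330 and the selection module 340 is a dictionary keyed by (character, emotion). The sketch below goes beyond the text in two places. It assumes that each character may have a neutral default model stored under the emotion `None`, and that selection falls back to that default when no emotion-specific model exists. The model identifiers such as `calm_female_sad` and the `select_model` function are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical model store (module 330): (character, emotion) -> model ID.
# An emotion of None marks a character's default (neutral) model.
MODEL_STORE: Dict[Tuple[str, Optional[str]], str] = {
    ("Shin Saimdang", None): "calm_female",
    ("Shin Saimdang", "sadness"): "calm_female_sad",
    ("Hong Gil-dong", None): "brave_young_male",
    ("Yi Sun-sin", None): "stern_mature_male",
    ("Han Seok-bong's mother", "solemnity"): "calm_female_solemn",
}


def select_model(character: str, emotion: Optional[str]) -> Optional[str]:
    """Pick a model for one sentence (module 340).

    Tries the exact (character, emotion) pair first, then the character's
    default model; returns None if the character has no model at all.
    """
    model = MODEL_STORE.get((character, emotion))
    if model is None:
        model = MODEL_STORE.get((character, None))
    return model
```

Keying on the pair means that adding a new emotion for a character is one dictionary entry. The fallback keeps the character's voice even when its emotional variant has not been trained.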

FIG. 5 is a block diagram of the speech synthesis output unit 400 of the apparatus according to the present invention.

As shown in FIG. 5, the speech synthesis output unit 400 basically comprises a speech synthesis engine driving module 410, a speech output module 420 and a speech output storage module 430, and may additionally comprise a termination check module 440.

Specifically, the speech synthesis engine driving module 410 drives the speech synthesis engine for each sequentially fixed unit sentence. For each sentence, it applies the speech synthesis model extracted by the speech synthesis model selection module 340 of the speech synthesis model selection unit 300 and generates a per-sentence speech synthesis result that reflects the character speech context.

For example, if the fixed unit sentence is 'I'm so sorry that happened.' and the extracted speech synthesis model is a calm female voice with sad emotion, the speech synthesis engine is driven to output the speech synthesis result 'I'm so sorry that happened.' in a calm, sad tone.

The speech output module 420 outputs the per-sentence speech synthesis results generated by the speech synthesis engine driving module 410 as speech, sentence by sentence.

The speech output storage module 430 records and stores the speech output from the speech output module 420.

That is, the output speech 'I'm so sorry that happened.' is recorded and stored so that it can be verified later.
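Modules 410 to 430 can be sketched as one small object that synthesizes a sentence, plays it and keeps a recording for later verification. The patent does not name a speech synthesis engine, so `SynthesisEngine` is a hypothetical interface and `FakeEngine` is a stand-in that returns tagged bytes instead of audio. `SpeechOutput` and its `play` callback are likewise illustrative; `play` is any function that delivers audio to the listener.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol, Tuple


class SynthesisEngine(Protocol):
    # Hypothetical engine interface; the patent does not name a TTS API.
    def synthesize(self, text: str, model_id: str) -> bytes: ...


class FakeEngine:
    """Stand-in engine that 'renders' text as tagged bytes for testing."""

    def synthesize(self, text: str, model_id: str) -> bytes:
        return f"<{model_id}>{text}".encode("utf-8")


@dataclass
class SpeechOutput:
    engine: SynthesisEngine
    play: Callable[[bytes], None]                                       # module 420
    recordings: List[Tuple[str, bytes]] = field(default_factory=list)   # module 430

    def speak(self, sentence: str, model_id: str) -> bytes:
        audio = self.engine.synthesize(sentence, model_id)  # module 410
        self.play(audio)
        self.recordings.append((sentence, audio))  # kept for later verification
        return audio
```

Keeping the sentence text next to its audio in `recordings` makes the later verification step a matter of replaying or comparing stored pairs.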

According to an additional aspect, the speech synthesis output unit 400 may further comprise a termination check module 440. The termination check module 440 determines whether all of the output target text input through the information input unit 100 has been output as speech. If it has, speech synthesis ends. If residual text remains that was input through the information input unit 100 but has not yet been output as speech, the module sends the speech synthesis model selection unit 300 a request signal to select a speech synthesis model for the residual text.

For example, the module determines whether the first sentence 'Nice to meet you.' and the second sentence 'I'm so sorry that happened.' have both been output as speech, and ends speech synthesis if they have.

Suppose instead that residual text remains that was input through the information input unit 100 but not yet output as speech, for example 'How about making a quick call to the people around you, or going for a drive?'. In that case, the module sends the speech synthesis model selection unit 300 a request signal to select a speech synthesis model for that sentence.
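The termination check amounts to a loop that keeps returning to model selection until no residual text remains. This is a minimal Python sketch under assumptions. The `run_until_exhausted` function and its callback parameters are hypothetical, and the selection and synthesis steps are passed in as plain functions so that the loop can be shown on its own.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

Sentence = Tuple[str, str, Optional[str]]  # (text, character, emotion)


def run_until_exhausted(
    sentences: List[Sentence],
    select_model: Callable[[str, Optional[str]], str],
    speak: Callable[[str, str], None],
) -> int:
    """Loop mirroring the termination check module (440).

    After each sentence is spoken, check whether residual text remains. If
    it does, go back to model selection for the next sentence; otherwise
    stop. Returns the number of sentences spoken.
    """
    pending: Deque[Sentence] = deque(sentences)
    spoken = 0
    while pending:                                  # residual text exists
        text, character, emotion = pending.popleft()
        model = select_model(character, emotion)    # request to unit 300
        speak(text, model)                          # synthesis and output by unit 400
        spoken += 1
    return spoken                                   # all text output: synthesis ends
```

Because the model is selected inside the loop, each sentence can switch speaker or emotion, which is the per-sentence model change the invention describes.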

According to the present invention, the uniformity of the prior art, in which selecting one model causes all text to be output in a single voice tone, is eliminated: voice is output in various voice tones and emotion tones for each character according to the context. This further improves the emotional quality of fairy-tale reading, character dialogue, and interactive conversation.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Those of ordinary skill in the art to which the invention belongs can of course make various modifications without departing from the gist of the invention as claimed, and such modifications should not be understood separately from the technical spirit or scope of the present invention.

100: information input unit
200: information analysis unit
300: speech synthesis model selection unit
400: speech synthesis output unit

Claims

An apparatus for changing a speech synthesis model according to a character speech context, the apparatus comprising:
an information input unit 100 for inputting speech synthesis target information;
an information analysis unit 200 for analyzing the speech synthesis target information input through the information input unit 100 and storing analysis result information;
a speech synthesis model selection unit 300 for selecting a speech synthesis model by referring to the analysis result of the information analysis unit 200; and
a speech synthesis output unit 400 for driving a speech synthesis engine to output a character voice to which the speech synthesis model selected by the speech synthesis model selection unit 300 is applied,

wherein the speech synthesis model selection unit 300 comprises:
an output target sentence determination module 310 for sequentially determining the unit sentences of the output target text segmented into sentences by the sentence unit segmentation module 210 of the information analysis unit 200;
a character and emotion state determination module 320 for determining the character and emotional state to be synthesized for each sequentially determined unit sentence, using the character information and emotional state information detected by the character information detection module 230 and the emotional state information detection module 250 of the information analysis unit 200;
a speech synthesis model information storage module 330 for storing various speech synthesis model information for each character and emotional state; and
a speech synthesis model selection module 340 for extracting a speech synthesis model suitable for the determined character and emotional state from the speech synthesis model information storage module 330,

wherein the speech synthesis target information includes text information, character information, and emotional state information, and the text information is matched with at least one of the character information and the emotional state information.
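As an illustration only, the cooperation between the speech synthesis model information storage module 330 and the speech synthesis model selection module 340 in claim 1 can be sketched as a table lookup keyed by (character, emotional state). The model identifiers, the function name `select_model`, and the fall-back to a character's neutral model are assumptions added for this sketch; they are not stated in the claim.

```python
from typing import Dict, Tuple

# Hypothetical contents of the speech synthesis model information storage
# module 330: (character, emotional state) -> model identifier.
MODEL_STORE: Dict[Tuple[str, str], str] = {
    ("narrator", "neutral"): "narrator_neutral",
    ("counselor", "neutral"): "counselor_neutral",
    ("counselor", "sad"): "counselor_sad",
}


def select_model(character: str, emotion: str,
                 store: Dict[Tuple[str, str], str] = MODEL_STORE) -> str:
    """Sketch of the speech synthesis model selection module 340.

    Looks up an exact (character, emotion) match first. The fall-back to the
    character's neutral model is an assumption added for illustration.
    """
    exact = store.get((character, emotion))
    if exact is not None:
        return exact
    fallback = store.get((character, "neutral"))
    if fallback is not None:
        return fallback
    raise KeyError(f"no speech synthesis model stored for {character!r}")
```

Keying on the pair rather than on the character alone is what lets the same character be voiced in different emotion tones from one sentence to the next.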

The apparatus of claim 1, wherein
the information analysis unit 200 comprises:
a sentence unit segmentation module 210 for segmenting the text information included in the speech synthesis target information input through the information input unit 100 into sentences by referring to a predefined sentence segmentation rule, and storing the segmented sentences;
a speech synthesis model selection reference module 220 for storing reference information for selecting a suitable speech synthesis model for each character and reference information for selecting a suitable speech synthesis model for each emotional state;
a character information detection module 230 for detecting the character information included in the speech synthesis target information input through the information input unit 100, determining speech synthesis model information suitable for the detected character information by referring to the per-character reference information stored in the speech synthesis model selection reference module 220, and storing the detected character information and the determined speech synthesis model information in the character information storage module 240;
a character information storage module 240 for storing the character information detected by the character information detection module 230 and the speech synthesis model information determined to be suitable for it;
an emotional state information detection module 250 for detecting the emotional state information included in the speech synthesis target information input through the information input unit 100, determining speech synthesis model information suitable for the detected emotional state information by referring to the per-emotional-state reference information stored in the speech synthesis model selection reference module 220, and storing the detected emotional state information and the determined speech synthesis model information in the emotional state information storage module 260; and
an emotional state information storage module 260 for storing the emotional state information detected by the emotional state information detection module 250 and the speech synthesis model information determined to be suitable for it.
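The following sketch illustrates, under assumptions, how the sentence unit segmentation module 210 of claim 2 might split matched text into ordered unit sentences that keep their character and emotional state information. The input format (a list of text blocks, each already matched to a character and an emotional state), the `UnitSentence` type, and the segmentation rule (split after '.', '?' or '!' followed by whitespace) are hypothetical; the claim refers only to "a predefined sentence segmentation rule".

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed "predefined sentence segmentation rule": a sentence ends at
# '.', '?' or '!' that is followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+")


@dataclass
class UnitSentence:
    order: int                # output priority, 1 = first
    text: str
    character: Optional[str]  # matched character information, if any
    emotion: Optional[str]    # matched emotional state information, if any


def segment_and_match(
    blocks: List[Tuple[str, Optional[str], Optional[str]]]
) -> List[UnitSentence]:
    """Split each (text, character, emotion) block into unit sentences.
    Every sentence inherits the character and emotion matched to its block."""
    units: List[UnitSentence] = []
    for text, character, emotion in blocks:
        for sentence in SENTENCE_BOUNDARY.split(text.strip()):
            if sentence:  # skip empty fragments
                units.append(UnitSentence(len(units) + 1, sentence, character, emotion))
    return units
```

Numbering units across blocks gives the priority order that the output target sentence determination module 310 of claim 1 can then walk through sequentially.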

(Deleted)

The apparatus of claim 1, wherein
the speech synthesis output unit 400 comprises:
a speech synthesis engine driving module 410 for driving the speech synthesis engine so that the speech synthesis model extracted by the speech synthesis model selection module 340 of the speech synthesis model selection unit 300 is applied to each sequentially determined unit sentence, thereby generating a per-sentence speech synthesis result value that reflects the character speech context;
a voice output module 420 for outputting, sentence by sentence, the per-sentence speech synthesis result values generated by the speech synthesis engine driving module 410; and
a voice output storage module 430 for recording and storing the voice output from the voice output module 420.
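A hedged sketch of the per-sentence flow of claim 4: the engine driving module 410 applies the selected model, the voice output module 420 outputs the result, and the voice output storage module 430 keeps a record for later verification. The `Engine` signature, the `SpeechSynthesisOutput` class, and the `play` callable are hypothetical stand-ins; a real system would substitute whatever speech synthesis engine and audio output it actually drives.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical engine interface: (sentence, model_id) -> synthesized audio bytes.
Engine = Callable[[str, str], bytes]


@dataclass
class SpeechSynthesisOutput:
    engine: Engine                     # stands in for engine driving module 410
    play: Callable[[bytes], None]      # stands in for voice output module 420
    # Stands in for voice output storage module 430: (sentence, model, audio).
    recordings: List[Tuple[str, str, bytes]] = field(default_factory=list)

    def speak(self, sentence: str, model_id: str) -> bytes:
        audio = self.engine(sentence, model_id)  # apply the selected model
        self.play(audio)                         # output one sentence at a time
        self.recordings.append((sentence, model_id, audio))  # keep for verification
        return audio
```

Storing the sentence and model identifier alongside the audio is a design choice of this sketch; it makes it possible to check afterwards which model voiced which sentence.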

The apparatus of claim 4, wherein
the speech synthesis output unit 400 further comprises
a termination confirmation module 440 that determines whether all of the output target text input through the information input unit 100 has been output as voice, terminates speech synthesis if it has, and, if residual text remains that was input through the information input unit 100 but has not been output as voice, transmits to the speech synthesis model selection unit 300 a request signal for selecting a speech synthesis model for the residual text.

The apparatus of claim 1, wherein
the apparatus for changing a speech synthesis model according to the character speech context
automatically outputs voice in various character voices and emotion tones according to each sentence, thereby improving emotional quality during interactive conversation.

The apparatus of claim 2, wherein
the sentence unit segmentation module 210 provides a text information re-input signal to the information input unit 100 when the text information in the speech synthesis target information violates the predefined sentence segmentation rule;
the character information detection module 230 provides a character information re-input signal to the information input unit 100 when no character information exists in the input speech synthesis target information; and
the emotional state information detection module 250 provides an emotional state information re-input signal to the information input unit 100 when no emotional state information exists in the input speech synthesis target information.
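As an illustration of claim 7, the three re-input checks can be expressed as a single validation function. This is a sketch only: the `ReinputSignal` enumeration and the `validate` function are hypothetical, and the rule that text must be non-empty and end in '.', '?' or '!' is an assumed stand-in for the unspecified predefined sentence segmentation rule.

```python
from enum import Enum
from typing import List, Optional


class ReinputSignal(Enum):
    TEXT = "text information re-input"
    CHARACTER = "character information re-input"
    EMOTION = "emotional state information re-input"


def validate(text: str, character: Optional[str],
             emotion: Optional[str]) -> List[ReinputSignal]:
    """Return the re-input signals to send to the information input unit 100.
    An empty list means the speech synthesis target information is accepted."""
    signals: List[ReinputSignal] = []
    stripped = text.strip()
    # Assumed segmentation rule: non-empty text ending in '.', '?' or '!'.
    if not stripped or stripped[-1] not in ".?!":
        signals.append(ReinputSignal.TEXT)       # sentence unit segmentation module 210
    if not character:
        signals.append(ReinputSignal.CHARACTER)  # character information detection module 230
    if not emotion:
        signals.append(ReinputSignal.EMOTION)    # emotional state information detection module 250
    return signals
```

Returning every failing check at once, rather than stopping at the first, lets the information input unit request all missing items in a single round.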