KR20200114606A

KR20200114606A - Method and apparatus for providing voice

Info

Publication number: KR20200114606A
Application number: KR1020190036635A
Authority: KR
Inventors: 박지웅; 김인호
Original assignee: 주식회사 엘지유플러스
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-10-07
Also published as: KR102221236B1

Abstract

According to one embodiment, a method and a device for providing a voice are provided. The method comprises receiving a first voice uttered by a speaker, obtaining speaker information corresponding to the speaker recognized based on the first voice, generating a second voice corresponding to the first voice based on an analysis result of the first voice, and providing a third voice for the speaker by correcting a frequency of the second voice based on the speaker information.

Description

TECHNICAL FIELD [Method and Apparatus of Providing Voice]

Embodiments relate to a method and an apparatus for providing a voice.

As artificial intelligence technology has developed, various services (for example, audio content provision, control through voice recognition, autonomous driving, and augmented reality) are being provided through smart devices such as an AI (Artificial Intelligence) assistant or an AI speaker. These services are generally provided through audio or video. However, some users have difficulty hearing signals in a specific frequency band because of a congenital disorder or aging. When audio content or a voice outside the audible frequency band of such users is provided, the users have difficulty recognizing it accurately, or cannot recognize it at all.

According to an embodiment, a personalized speech synthesis (Text-To-Speech; TTS) service may be provided through speaker recognition.

According to an embodiment, when a service is provided through a smart device, a voice is generated and provided in consideration of the speaker's audible frequency band, so that a voice or audio content in an optimal frequency band can be provided to a user whose audible frequency range has narrowed because of a congenital disorder or aging.

According to an embodiment, a method of providing a voice includes: receiving a first voice uttered by a speaker; recognizing the speaker based on the first voice; obtaining speaker information corresponding to the recognized speaker; generating a second voice that responds to the first voice based on an analysis result of the first voice; and providing a third voice for the speaker by correcting a frequency of the second voice based on the speaker information.

The providing of the third voice may include: determining an audible frequency band of the speaker based on the speaker information; and generating the third voice by correcting the frequency of the second voice based on a result of comparing the frequency band of the second voice with the audible frequency band of the speaker.

The generating of the third voice based on the comparison result may include, when the second voice is outside the audible frequency band of the speaker, generating the third voice by shifting the second voice into the audible frequency band of the speaker and sampling it.

The generating of the third voice based on the comparison result may include: determining whether the second voice is a signal in a high frequency band outside the audible frequency band of the speaker; and, upon determining that the second voice is a signal in the high frequency band, generating the third voice by reinforcing the energy of the high frequency band of the second voice.

The method of providing the voice may further include receiving the speaker's selection as to whether to use the third voice.

The providing of the third voice for the speaker may include, when the speaker information does not include the audible frequency band of the speaker, generating the third voice for the speaker by correcting the frequency of the second voice based on the average audible frequency for the speaker's age group.
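The fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the table `AGE_GROUP_UPPER_HZ`, its values, and the function name `audible_upper_limit_hz` are hypothetical placeholders, since the source gives no concrete figures.

```python
from typing import Optional

# Hypothetical average upper audible limits per age group, in Hz.
# Placeholder values for illustration only; not taken from the source.
AGE_GROUP_UPPER_HZ = {
    (0, 39): 16000.0,
    (40, 59): 12000.0,
    (60, 200): 8000.0,
}


def audible_upper_limit_hz(age: Optional[int],
                           personal_limit_hz: Optional[float]) -> Optional[float]:
    """Return the limit to use for correction, or None if nothing is known."""
    # Prefer the speaker's own audible band when the speaker information has it.
    if personal_limit_hz is not None:
        return personal_limit_hz
    # Otherwise fall back to the age-group average, if the age is known.
    if age is not None:
        for (low, high), limit in AGE_GROUP_UPPER_HZ.items():
            if low <= age <= high:
                return limit
    return None
```

Returning `None` leaves the caller free to skip correction when neither a personal band nor an age is available.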

The speaker information may include at least one of identification information of the speaker, the age of the speaker, the gender of the speaker, the audible frequency band of the speaker, the average audible frequency for the speaker's age group, and a voice frequency band preferred by the speaker.

The generating of the second voice may include: generating an answer sentence that responds to the first voice based on the analysis result of the first voice; and generating the second voice that responds to the first voice based on the answer sentence.

The generating of the answer sentence may include: converting the first voice into a first text by STT (Speech To Text); and generating an answer sentence that responds to the first text through natural language processing (NLP) of the first text.

The method of providing the voice may further include authenticating the recognized speaker.

The recognizing of the speaker may include: determining whether the first voice includes a predetermined wake word; and, upon determining that the first voice includes the wake word, recognizing the speaker by applying the first voice to a speaker recognition engine.

The recognizing of the speaker may include: extracting a feature from the first voice; and recognizing the speaker by applying the feature to pre-trained per-speaker models.

According to an embodiment, an apparatus for providing a voice includes: a communication interface that receives a first voice uttered by a speaker; a processor that recognizes the speaker based on the first voice, obtains speaker information corresponding to the recognized speaker, generates a second voice that responds to the first voice based on an analysis result of the first voice, and generates a third voice for the speaker by correcting a frequency of the second voice based on the speaker information; and a loudspeaker that provides the third voice.

The processor may determine the audible frequency band of the speaker based on the speaker information, and may generate the third voice by correcting the frequency of the second voice based on a result of comparing the frequency band of the second voice with the audible frequency band of the speaker.

When the second voice is outside the audible frequency band of the speaker, the processor may generate the third voice by shifting the second voice into the audible frequency band of the speaker and sampling it.

The processor may determine whether the second voice is a signal in a high frequency band outside the audible frequency band of the speaker and, upon determining that the second voice is a signal in the high frequency band, may generate the third voice by reinforcing the energy of the high frequency band of the second voice.

The apparatus for providing the voice may further include a user interface that receives the speaker's selection as to whether to use the third voice.

When the speaker information does not include the audible frequency band of the speaker, the processor may provide the third voice for the speaker by correcting the frequency of the second voice based on the average audible frequency for the speaker's age group.

The processor may determine whether the first voice includes a predetermined wake word and, upon determining that the first voice includes the wake word, may recognize the speaker by applying the first voice to a speaker recognition engine.

According to one aspect, a personalized speech synthesis (Text-To-Speech; TTS) service may be provided through speaker recognition.

According to one aspect, when a service is provided through a smart device, a voice is generated and provided in consideration of the speaker's audible frequency band, so that a voice or audio content in an optimal frequency band can be provided to a user whose audible frequency range has narrowed because of a congenital disorder or aging.

FIG. 1 is a flowchart illustrating a method of providing a voice according to an embodiment.
FIG. 2 is a diagram illustrating the configuration of an apparatus for providing a voice according to an embodiment.
FIG. 3 is a diagram illustrating operations between a smart device and a speaker recognition server according to an embodiment.
FIG. 4 is a diagram illustrating the operation of a speech synthesis (TTS) apparatus according to an embodiment.
FIG. 5 is a flowchart illustrating a method of providing a voice according to another embodiment.
FIG. 6 is a block diagram of an apparatus for providing a voice according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote the same members.

Various changes may be made to the embodiments described below. The embodiments described below are not intended to limit the forms of implementation, and should be understood to include all changes, equivalents, and substitutes thereof.

The terms used in the embodiments are used only to describe specific embodiments and are not intended to limit the embodiments. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions of them are omitted. In describing the embodiments, detailed descriptions of related known technologies are omitted when they are judged likely to obscure the gist of the embodiments unnecessarily.

FIG. 1 is a flowchart illustrating a method of providing a voice according to an embodiment. Referring to FIG. 1, a device for providing a voice according to an embodiment (hereinafter, the 'providing device') receives a first voice uttered by a speaker (110). The 'first voice' may be, for example, a voice uttered directly by the speaker or a voice recorded from the speaker. The first voice may include, for example, a question or a request directed to a smart device.

The providing device recognizes the speaker based on the first voice (120). The providing device may determine, for example, whether the first voice includes a predetermined wake word. The predetermined wake word may be, for example, an ordinary conversational phrase such as "Hi OO", or a call name such as "Alec O". Upon determining that the first voice includes the predetermined wake word, the providing device may recognize the speaker by applying the first voice to a speaker recognition engine. The providing device may recognize the speaker by, for example, extracting a feature from the first voice and applying the feature to pre-trained per-speaker models. The providing device may also perform speaker authentication, that is, check whether the speaker is a pre-registered speaker. The providing device may perform speaker recognition and/or speaker authentication using, for example, the speaker recognition engine 231 described with reference to FIG. 2 below. The speaker recognition engine is described in detail with reference to FIGS. 2 and 3 below.
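The wake-word gate and the feature-to-model matching described above can be sketched as follows. This is only an illustration under assumptions: the wake word string, the use of a text transcript for wake-word detection, cosine similarity as the matching score, and the `threshold` value are all hypothetical; the source does not specify the features or the model type.

```python
import math
from typing import Dict, List, Optional

WAKE_WORD = "hi assistant"  # hypothetical wake word, not from the source


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize_speaker(transcript: str,
                      features: List[float],
                      models: Dict[str, List[float]],
                      threshold: float = 0.8) -> Optional[str]:
    """Return the best-matching registered speaker ID, or None."""
    # Gate: only run speaker recognition when the wake word is present.
    if WAKE_WORD not in transcript.lower():
        return None
    # Score the features against every per-speaker model and keep the best
    # one that reaches the threshold; None means no registered speaker matched.
    best_id, best_score = None, threshold
    for speaker_id, model in models.items():
        score = cosine(features, model)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id
```

A `None` result covers both the missing wake word and the unregistered-speaker case, which roughly corresponds to a failed authentication.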

The providing device obtains speaker information corresponding to the speaker recognized in step 120 (130). The speaker information may include, for example, identification information of the speaker, the age of the speaker, the gender of the speaker, the audible frequency band of the speaker, the average audible frequency for the speaker's age group, and a voice frequency band preferred by the speaker. The providing device may obtain the speaker information for the recognized speaker from, for example, the user database 239 described with reference to FIG. 2 below.
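One way to picture the speaker information and its lookup is the sketch below. The field names, the `SpeakerInfo` class, the in-memory `USER_DB` dictionary, and the sample record are assumptions made for illustration; the source lists the kinds of information but not a schema or storage format, and only a subset of the listed items is modeled here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SpeakerInfo:
    """Hypothetical record holding some of the items listed in the text."""
    speaker_id: str
    age: Optional[int] = None
    gender: Optional[str] = None
    audible_band_hz: Optional[Tuple[float, float]] = None
    preferred_band_hz: Optional[Tuple[float, float]] = None


# Stand-in for the user database; the sample entry is invented.
USER_DB: Dict[str, SpeakerInfo] = {
    "user-001": SpeakerInfo("user-001", age=72, audible_band_hz=(100.0, 3000.0)),
}


def get_speaker_info(speaker_id: str) -> Optional[SpeakerInfo]:
    """Look up the record for a recognized speaker; None if not registered."""
    return USER_DB.get(speaker_id)
```

Optional fields make it explicit that a record may lack the audible band, which is the case the age-group fallback addresses.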

The providing device generates a second voice that responds to the first voice based on an analysis result of the first voice (140). The providing device may generate an answer sentence that responds to the first voice based on the analysis result of the first voice. The providing device may, for example, convert the first voice into a first text using STT (Speech To Text) technology. The providing device may generate an answer sentence that responds to the first text through natural language processing (NLP) of the first text. The providing device may generate the second voice that responds to the first voice based on the answer sentence. Hereinafter, the 'second voice' may be understood as a voice generated by the providing device, more specifically by the speech synthesis (TTS) module of the providing device (see reference numeral 250 in FIG. 2).
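The STT, NLP, and TTS stages described above form a simple chain, sketched here with injected callables. The function `generate_second_voice` and the three `fake_*` stand-ins are hypothetical; real STT, NLP, and TTS engines would replace them, and the source does not name any particular engine.

```python
from typing import Callable


def generate_second_voice(first_voice: bytes,
                          stt: Callable[[bytes], str],
                          nlp: Callable[[str], str],
                          tts: Callable[[str], bytes]) -> bytes:
    """Chain the three stages: speech -> text -> answer sentence -> speech."""
    first_text = stt(first_voice)   # first voice to first text
    answer = nlp(first_text)        # answer sentence responding to the text
    return tts(answer)              # second voice synthesized from the answer


# Toy stand-ins so the sketch runs end to end; they only move bytes and text.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_nlp(text: str) -> str:
    if "fine dust" in text.lower():
        return "Fine dust is very bad today."
    return "Sorry, I did not understand."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Passing the stages as parameters keeps the flow testable without committing to any specific engine.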

The providing device provides a third voice for the speaker by correcting the frequency of the second voice based on the speaker information (150). Hereinafter, the 'third voice' may be a voice obtained by frequency-correcting the second voice. The providing device may, for example, determine the audible frequency band of the speaker based on the speaker information. The providing device may generate the third voice by correcting the frequency of the second voice based on a result of comparing the frequency band of the second voice with the audible frequency band of the speaker. For example, when the second voice is outside the audible frequency band of the speaker, the providing device may generate the third voice by shifting the second voice into the audible frequency band of the speaker and sampling it. Alternatively, the providing device may determine whether the second voice is a signal in a high frequency band outside the audible frequency band of the speaker. Upon determining that the second voice is a signal in the high frequency band, the providing device may generate the third voice by reinforcing the energy of the high frequency band of the second voice. The providing device may, for example, increase the energy value of the high frequency so that it is heard like the other frequency bands. For example, suppose a user does not hear the 6 kHz frequency band well. If that band is provided at the same sound pressure level (SPL) as the ordinary frequency bands, the user may not hear the content well. Accordingly, the providing device may reinforce the band by increasing the energy value of that frequency so that the user can hear the 6 kHz band at the existing sound pressure level as well.
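The energy reinforcement described above can be illustrated as a band-limited gain. In this sketch the spectrum is represented as (frequency in Hz, amplitude) pairs, and the band edges and the gain in decibels are hypothetical inputs; the source does not state how much energy is added or how the band is delimited. The only fixed fact used is the standard conversion from decibels to a linear amplitude ratio, 10^(dB/20).

```python
from typing import List, Tuple


def reinforce_band(spectrum: List[Tuple[float, float]],
                   band_hz: Tuple[float, float],
                   gain_db: float) -> List[Tuple[float, float]]:
    """Scale the amplitude of components inside band_hz by gain_db decibels."""
    low, high = band_hz
    factor = 10.0 ** (gain_db / 20.0)  # decibels to linear amplitude ratio
    # Components outside the band are passed through unchanged.
    return [(f, a * factor if low <= f <= high else a) for f, a in spectrum]
```

For the 6 kHz example, a call such as `reinforce_band(spectrum, (5500.0, 6500.0), 20.0)` would raise only the components around 6 kHz, with the band width and gain chosen by the caller.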

FIG. 2 is a diagram illustrating the configuration of an apparatus for providing a voice according to an embodiment. FIG. 2 shows a smart device 210 and a voice providing device 230 according to an embodiment. For convenience of description, the smart device 210 and the voice providing device 230 are described below as separate components, as shown in FIG. 2, but the configuration is not limited to this. The voice providing device 230 may also be integrated into the smart device 210 as a single device. The voice providing device 230 may include, for example, a speaker recognition engine 231, an STT/NLP module 233, a speech synthesis (TTS) module 235, a frequency correction module 237, and a user database 239. The speaker recognition engine 231 may be implemented, for example, as a separate speaker recognition server.

The smart device 210 may be a device that includes an AI assistant function, such as a smartphone, an AI speaker, an IoT (Internet of Things) device, or a set-top box. The smart device 210 may receive the first voice uttered by the speaker (201). The smart device 210 may transmit a file containing the first voice to the speaker recognition engine 231 (202) to request speaker recognition (speaker identification) and/or speaker authentication for that voice. The operations between the smart device 210 and the speaker recognition engine 231 for speaker recognition and/or speaker authentication are described in detail with reference to FIG. 3 below.

For example, the speaker recognition engine 231 may recognize the speaker of the first voice and authenticate whether the recognized speaker is a pre-registered speaker. When authentication of the speaker is complete, that is, when the speaker is determined to be an authenticated speaker, the speaker recognition engine 231 may request the smart device 210 to activate a microphone for speech recognition (203). The speaker recognition engine 231 may request the smart device 210 to activate the microphone through, for example, a wake-up request. In addition, based on the user identifier determined through speaker authentication, the speaker recognition engine 231 may transmit the user information and the TTS setting information of that user to the speech synthesis (TTS) module 235 (204). The user information may include, for example, a user ID and audible frequency information of the user. The user information and the TTS setting information of the user may also be referred to as 'speaker information'. The TTS setting information may include setting information on the frequency band desired by the user. The speaker information may be stored in the user database 239. The user database 239 may be included in the frequency correction module 237 or may be configured separately from the frequency correction module 237.

After the microphone is activated, the smart device 210 may transmit the voice uttered by the user (for example, "How is the fine dust today?") to the STT/NLP module 233 (205).

The STT/NLP module 233 may, for example, analyze the intent and/or meaning of the uttered voice and generate, as text, an answer sentence that responds to it (for example, "Fine dust is at a very bad level today. Please wear a mask."). More specifically, based on the analysis result of the uttered voice, the STT/NLP module 233 may convert the uttered voice into a first text by the speech synthesis (TTS) module 235. The STT/NLP module 233 may generate an answer sentence in text form that responds to the first text through natural language processing (NLP) of the first text. The STT/NLP module 233 may transmit the answer sentence in text form to the speech synthesis (TTS) module 235 (206).

The speech synthesis (TTS) module 235 may generate the second voice from the answer sentence in text form ("Fine dust is at a very bad level today. Please wear a mask."). The speech synthesis (TTS) module 235 converts text into speech as follows.

For example, suppose a text such as "How's the weather?" is to be converted into speech. In this case, the speech synthesis (TTS) module 235 may, through text preprocessing, identify proper nouns and the like that are included in a user dictionary among the terms in the text. The speech synthesis (TTS) module 235 may, for example, parse the text using a part-of-speech dictionary database and tag the part of speech of each word according to the parsing result. The speech synthesis (TTS) module 235 may, for example, set boundaries between words using a delimiter, convert character strings into pronunciations using a pronunciation dictionary or the like, and thereby generate the entire sentence and provide it as speech.
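The delimiter-based word splitting and the pronunciation dictionary lookup described above can be sketched as follows. The `PRONUNCIATIONS` table, its phoneme strings, and the function name are hypothetical, and part-of-speech tagging and user-dictionary handling are left out; this shows only the splitting and lookup steps, not a full TTS front end.

```python
from typing import Dict, List

# Hypothetical pronunciation dictionary; the phoneme strings are illustrative.
PRONUNCIATIONS: Dict[str, str] = {
    "how's": "HH AW Z",
    "the": "DH AH",
    "weather": "W EH DH ER",
}


def to_pronunciations(text: str, delimiter: str = " ") -> List[str]:
    """Split text at the delimiter and map each word to its pronunciation."""
    # Strip surrounding punctuation and lowercase each word before the lookup.
    words = [w.strip("?!.,").lower() for w in text.split(delimiter)]
    # Words missing from the dictionary are passed through unchanged.
    return [PRONUNCIATIONS.get(w, w) for w in words if w]
```

Passing unknown words through unchanged is a placeholder choice; a real system would more likely apply letter-to-sound rules at that point.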

The second voice generated by the speech synthesis (TTS) module 235 may be, for example, a signal in a 22.02 kHz frequency band. The speech synthesis (TTS) module 235 may transmit the second voice to the frequency correction module 237. For the configuration of the speech synthesis (TTS) module 235, see FIG. 4 below.

The frequency correction module 237 may, for example, correct the frequency of the second voice based on the speaker information stored in the user database 239. For example, suppose the frequency of the second voice is in the 4 kHz band and the audible frequency band according to the speaker information is the 3 kHz band. In this case, the frequency correction module 237 may generate the third voice by correcting the second voice, which has a frequency in the 4 kHz band, into a signal in the 3 kHz band that the speaker can hear. The frequency correction module 237 may provide the third voice to the smart device 210 through the speech synthesis (TTS) module 235 (207). The smart device 210 may provide the third voice ("Fine dust is at a very bad level today. Please wear a mask.") to the speaker.
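The 4 kHz to 3 kHz example above can be illustrated with a uniform downward shift of spectral components. This sketch works on (frequency in Hz, amplitude) pairs rather than on audio samples, and the uniform-shift mapping with clamping at 0 Hz is an assumption; the source says the voice is moved into the audible band but does not specify the mapping.

```python
from typing import List, Tuple


def shift_into_band(components: List[Tuple[float, float]],
                    audible_max_hz: float) -> List[Tuple[float, float]]:
    """Shift all components down so the highest one sits at audible_max_hz."""
    if not components:
        return []
    top = max(f for f, _ in components)
    offset = top - audible_max_hz
    if offset <= 0:
        return list(components)  # already inside the audible band
    # Apply the same offset to every component; clamp at 0 Hz.
    return [(max(f - offset, 0.0), a) for f, a in components]
```

A proportional scaling of every frequency would be another possible mapping; which one suits a given listener is outside what the source describes.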

The frequency correction module 237 may determine the audible frequency of each individual speaker from the speaker information stored in the user database 239, and may shift a voice signal in a frequency band that the individual speaker does not hear well into a frequency band that the speaker hears well, and sample it. Alternatively, the frequency correction module 237 may generate the third voice by reinforcing the energy in the frequency band that the speaker does not hear well, or by shifting the frequencies of that band.

The providing device 230 may also play to the speaker each third voice generated through energy reinforcement or frequency shifting, and let the speaker select whichever third voice the speaker prefers. For example, when the speaker has not entered information on the audible frequency band or has not made any separate setting related to the audible frequency, the providing device 230 may provide the smart device 210 with the second voice in the default frequency band generated by the speech synthesis (TTS) module 235, without any frequency correction.

FIG. 3 is a diagram for explaining operations between a smart device and a speaker recognition server according to an embodiment. Referring to FIG. 3, a process is illustrated in which the speaker recognition server 300 learns a model for each speaker based on a call name input through the smart device 210 and identifies speakers. The speaker recognition server 300 may correspond to the speaker recognition engine 231 described above.

For example, suppose that a call name A uttered by a speaker is received by the smart device 210. The smart device 210 may recognize whether the call name A is, for example, a pre-arranged call name that triggers operation of the speaker recognition server 300. Upon determining that the call name A is the pre-arranged call name, the smart device 210 may link to (or activate) the speaker recognition server 300 by transmitting the call name A to the speaker recognition server 300.
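The gating behavior described above can be sketched as follows. This is a simplified illustration that assumes the utterance is already available as text, whereas a real device would detect the call name in audio; `CALL_NAMES`, `handle_utterance`, and the `forward` callback are hypothetical names.

```python
from typing import Callable, Optional

# Hypothetical pre-arranged call name; the source does not name one.
CALL_NAMES = {"hey speaker"}


def handle_utterance(transcript: str, forward: Callable[[str], str]) -> Optional[str]:
    """Forward the utterance to the recognition server only if it starts with a call name."""
    normalized = " ".join(transcript.lower().split())
    if not any(normalized.startswith(name) for name in CALL_NAMES):
        return None                      # not a call name: the server is not triggered
    return forward(transcript)           # call name detected: hand off to the server


if __name__ == "__main__":
    def fake_server(text: str) -> str:
        # Stand-in for the speaker recognition server.
        return "forwarded: " + text

    print(handle_utterance("Hey Speaker, what is the weather?", fake_server))
    print(handle_utterance("What is the weather?", fake_server))
```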

For example, when the call name A is input for the first time, the speaker recognition server 300 may perform speaker registration 310 using the voice corresponding to the call name A. The speaker recognition server 300 may generate a final speaker model 320 by, for example, extracting features from the call name A (311) and training a speaker model using the extracted features (313). Depending on the embodiment, the speaker model 320 may also be generated by training a speaker model on features extracted from a recorded voice of the speaker. The recorded voice of the speaker may be, for example, stored in a recording database.

In addition, for example, when the call name A is input after the speaker registration 310, the speaker recognition server 300 may perform speaker recognition 330 using the call name A. The speaker recognition server 300 may extract features from the call name A (331) and perform speaker recognition (333) by comparing the extracted features with the information registered in the speaker model 320.
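The registration and recognition steps can be sketched as below. This is a minimal illustration under strong assumptions: features are short numeric vectors supplied by the caller, "training" is replaced by averaging the enrollment vectors, and comparison uses cosine similarity with a hypothetical threshold. The source does not specify the feature type, the model, or the comparison metric.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpeakerRegistry:
    """Toy stand-in for speaker registration (310) and recognition (330)."""

    def __init__(self, threshold=0.9):
        self.models = {}             # speaker_id -> averaged feature vector
        self.threshold = threshold   # hypothetical acceptance threshold

    def enroll(self, speaker_id, feature_vectors):
        """Build a speaker model from features extracted at registration."""
        self.models[speaker_id] = mean_vector(feature_vectors)

    def recognize(self, features):
        """Return the best-matching speaker, or None if nobody is close enough."""
        best_id, best_score = None, self.threshold
        for speaker_id, model in self.models.items():
            score = cosine_similarity(features, model)
            if score >= best_score:
                best_id, best_score = speaker_id, score
        return best_id


if __name__ == "__main__":
    registry = SpeakerRegistry()
    registry.enroll("speaker_a", [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
    registry.enroll("speaker_b", [[0.0, 1.0, 0.0]])
    print(registry.recognize([0.95, 0.05, 0.0]))
```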

FIG. 4 is a diagram for explaining the operation of a speech synthesis (TTS) apparatus according to an embodiment. Referring to FIG. 4, a TTS apparatus 400 according to an embodiment may synthesize text into speech through an end-to-end deep neural network (DNN). The deep neural network may include, for example, an encoder 410, an attention model 420, a decoder 430, and a vocoder 440.

Suppose that text is input to the TTS apparatus 400. In this case, the encoder 410 may convert the input text into numbers or vectors that represent the characteristics of the text. The encoder 410 may determine, from the input text, a feature vector representing the characteristics of the text. The encoder 410 may be configured as, for example, a CBHG network including a 1-D convolution bank, a highway network, and a bidirectional gated recurrent unit (GRU). When the TTS apparatus 400 is trained, for example, the text and the ground-truth audio may be input together.

The attention model 420 may determine context information of the text from the feature vector determined by the encoder 410. The attention model 420 may predict the next spectrogram from the context information of the text. The spectrogram is, for example, a sound spectrogram, and may represent the distribution of intensity and frequency of a spoken word.
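How an attention step turns encoder feature vectors into context information can be sketched as follows. This assumes plain dot-product attention with a softmax over the encoder states; the source does not state which scoring function the attention model 420 uses, and the names `softmax` and `attention_context` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(query, encoder_states):
    """Return (weights, context) for dot-product attention.

    query          -- current decoder state, a list of floats
    encoder_states -- one feature vector per input text position
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of the encoder states.
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0]]      # toy encoder outputs for two characters
    weights, context = attention_context([0.0, 3.0], states)
    print(weights, context)
```

The position whose encoder state best matches the query receives the largest weight, so the context vector is dominated by the part of the text currently being synthesized.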

The decoder 430 may generate the next spectrogram based on the next spectrogram predicted by the attention model 420. The attention model 420 and the decoder 430 may be configured as, for example, an attention-based recurrent neural network (attention RNN).

The vocoder 440 may convert the parameter values for each phoneme into audio or speech. The vocoder 440 may convert the next spectrogram generated by the decoder 430 into audio or speech. The vocoder 440 may include, for example, the Griffin-Lim vocoder algorithm.

FIG. 5 is a flowchart illustrating a method of providing a voice according to another embodiment. Referring to FIG. 5, a providing device according to an embodiment receives a first voice uttered by a speaker (505).

The providing device recognizes the speaker based on the first voice (510).

The providing device obtains speaker information corresponding to the recognized speaker (515).

The providing device generates a second voice responsive to the first voice based on an analysis result of the first voice (520).

The providing device may determine whether the speaker information includes information on the speaker's audible frequency band (525). For example, when it is determined in step 525 that the speaker information includes information on the speaker's audible frequency band, the providing device may identify the speaker's audible frequency band based on the speaker information (530).

The providing device may generate a third voice by correcting the frequency of the second voice based on a result of comparing the frequency band of the second voice with the speaker's audible frequency band (535).

The providing device may receive the speaker's selection as to whether to use the third voice (540). The providing device may provide the third voice according to the speaker's selection in step 540 (545).

Alternatively, depending on the embodiment, when it is determined in step 525 that the speaker information does not include information on the speaker's audible frequency band, the providing device may generate a third voice for the speaker by correcting the frequency of the second voice based on the average audible frequency for the speaker's age group (550). Thereafter, the providing device may receive the speaker's selection as to whether to use the third voice (540) and provide the third voice according to the speaker's selection (545).
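The branching in steps 525 to 550 can be summarized in a short sketch. The data layout (`SpeakerInfo`, bands as `(low_hz, high_hz)` tuples, an age-group table keyed by minimum age) is an assumption made for illustration, the numeric bands in the usage example are placeholders rather than audiological data, and falling back to the second voice when the speaker declines the third voice is likewise assumed rather than stated in the source.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Band = Tuple[float, float]  # (low_hz, high_hz)


@dataclass
class SpeakerInfo:
    speaker_id: str
    age: Optional[int] = None
    audible_band: Optional[Band] = None


def resolve_target_band(info: SpeakerInfo,
                        age_group_bands: Dict[int, Band]) -> Optional[Band]:
    """Steps 525/530/550: use the stored band, else the age-group average."""
    if info.audible_band is not None:
        return info.audible_band
    if info.age is None:
        return None
    # age_group_bands maps the minimum age of each group to its average band.
    eligible = [min_age for min_age in age_group_bands if min_age <= info.age]
    return age_group_bands[max(eligible)] if eligible else None


def needs_correction(voice_band: Band, target_band: Band) -> bool:
    """Step 535: correct only if the voice extends outside the target band."""
    low, high = target_band
    return voice_band[0] < low or voice_band[1] > high


def voice_to_provide(second_voice, third_voice, speaker_accepts: bool):
    """Steps 540/545: provide the third voice only if the speaker chose it."""
    return third_voice if speaker_accepts else second_voice


if __name__ == "__main__":
    placeholder_bands = {0: (20.0, 16000.0), 60: (20.0, 8000.0)}  # illustrative only
    info = SpeakerInfo("speaker_a", age=70)
    target = resolve_target_band(info, placeholder_bands)
    print(target, needs_correction((3500.0, 9000.0), target))
```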

FIG. 6 is a block diagram of a device for providing a voice according to an embodiment. Referring to FIG. 6, a device for providing a voice ("providing device") 600 according to an embodiment includes a communication interface 610, a processor 630, and a speaker 670. The providing device 600 may further include a recorder 620, a memory 650, and a user interface (not shown). The communication interface 610, the recorder 620, the processor 630, the memory 650, and the speaker 670 may communicate with one another through a communication bus 605.

The communication interface 610 receives a first voice uttered by a speaker.

The recorder 620 may record the first voice uttered by the speaker.

The processor 630 recognizes the speaker based on the first voice, which is, for example, received through the communication interface 610 or recorded by the recorder 620. The processor 630 obtains speaker information corresponding to the recognized speaker. The processor 630 generates a second voice responsive to the first voice based on an analysis result of the first voice. The processor 630 generates a third voice for the speaker by correcting the frequency of the second voice based on the speaker information.

The processor 630 identifies the speaker's audible frequency band based on the speaker information. The processor 630 generates the third voice by correcting the frequency of the second voice based on a result of comparing the frequency band of the second voice with the speaker's audible frequency band.

For example, when the second voice is outside the speaker's audible frequency band, the processor 630 generates the third voice by shifting the second voice into the speaker's audible frequency band and sampling it. Alternatively, the processor 630 determines whether the second voice is a signal in a high-frequency band outside the speaker's audible frequency band. Upon determining that the second voice is a signal in the high-frequency band, the processor 630 generates the third voice by reinforcing the energy of the high-frequency band of the second voice. The processor 630 may determine whether the speaker information lacks the speaker's audible frequency band. For example, when the speaker information does not include the speaker's audible frequency band, the processor 630 provides a third voice for the speaker by correcting the frequency of the second voice based on the average audible frequency for the speaker's age group.

The processor 630 may determine, for example, whether the first voice includes a predetermined call word. Upon determining that the first voice includes the call word, the processor 630 may recognize the speaker by applying the first voice to the speaker recognition engine.

The memory 650 may store the speaker information obtained by the processor 630.

The speaker 670 provides the third voice generated by the processor 630.

The providing device 600 may receive the speaker's selection as to whether to use the third voice through, for example, a user interface displayed on the screen of a touch display (not shown).

In addition, the processor 630 may perform at least one of the methods described above with reference to FIGS. 1 to 5, or an algorithm corresponding to at least one of those methods. The processor 630 may be a data processing device implemented in hardware and having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. For example, the data processing device implemented in hardware may include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), and a field-programmable gate array (FPGA).

The processor 630 may execute a program and control the providing device 600. The program code executed by the processor 630 may be stored in the memory 650.

The memory 650 may store various kinds of information received through the communication interface 610. The memory 650 may store at least one of the first voice, the second voice, and the third voice.

In addition, the memory 650 may store various kinds of information generated during processing by the processor 630 described above, as well as various data and programs. The memory 650 may include volatile memory or nonvolatile memory. The memory 650 may include a mass storage medium such as a hard disk to store various data.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to those embodiments, and a person of ordinary skill in the field to which the present invention belongs can make various modifications and variations based on this description.

Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims set out below and by their equivalents.

600: providing device
605: communication bus
610: communication interface
620: recorder
630: processor
650: memory
670: speaker

Claims

A method of providing a voice, the method comprising:
receiving a first voice uttered by a speaker;
recognizing the speaker based on the first voice;
obtaining speaker information corresponding to the recognized speaker;
generating a second voice responsive to the first voice based on an analysis result of the first voice; and
providing a third voice for the speaker by correcting a frequency of the second voice based on the speaker information.

The method of claim 1, wherein providing the third voice comprises:
identifying an audible frequency band of the speaker based on the speaker information; and
generating the third voice by correcting the frequency of the second voice based on a result of comparing a frequency band of the second voice with the audible frequency band of the speaker.

The method of claim 2, wherein generating the third voice based on the comparison result comprises:
when the second voice is outside the audible frequency band of the speaker, generating the third voice by shifting the second voice into the audible frequency band of the speaker and sampling the shifted voice.

The method of claim 2, wherein generating the third voice based on the comparison result comprises:
determining whether the second voice is a signal in a high-frequency band outside the audible frequency band of the speaker; and
generating the third voice by reinforcing energy in the high-frequency band of the second voice upon determining that the second voice is a signal in the high-frequency band.

The method of claim 2, further comprising:
receiving a selection by the speaker as to whether to use the third voice.

The method of claim 2, wherein providing the third voice for the speaker comprises:
when the speaker information does not include the audible frequency band of the speaker, generating the third voice for the speaker by correcting the frequency of the second voice based on an average audible frequency for the speaker's age group.

The method of claim 1, wherein the speaker information includes at least one of identification information of the speaker, an age of the speaker, a gender of the speaker, an audible frequency band of the speaker, an average audible frequency for the speaker's age group, and a voice frequency band preferred by the speaker.

The method of claim 1, wherein generating the second voice comprises:
generating an answer sentence responsive to the first voice based on the analysis result of the first voice; and
generating the second voice responsive to the first voice based on the answer sentence.

The method of claim 8, wherein generating the answer sentence comprises:
converting the first voice into first text by speech-to-text (STT); and
generating the answer sentence responsive to the first text through natural language processing (NLP) of the first text.

The method of claim 1, further comprising:
authenticating the recognized speaker.

The method of claim 1, wherein recognizing the speaker comprises:
determining whether the first voice includes a predetermined call word; and
recognizing the speaker by applying the first voice to a speaker recognition engine upon determining that the first voice includes the call word.

The method of claim 1, wherein recognizing the speaker comprises:
extracting a feature from the first voice; and
recognizing the speaker by applying the feature to a pre-trained speaker-specific model.

A computer program stored in a computer-readable recording medium to execute, in combination with hardware, the method of claim 1.

A device for providing a voice, the device comprising:
a communication interface configured to receive a first voice uttered by a speaker;
a processor configured to recognize the speaker based on the first voice, obtain speaker information corresponding to the recognized speaker, generate a second voice responsive to the first voice based on an analysis result of the first voice, and generate a third voice for the speaker by correcting a frequency of the second voice based on the speaker information; and
a speaker configured to provide the third voice.

The device of claim 14, wherein the processor is configured to identify an audible frequency band of the speaker based on the speaker information, and to generate the third voice by correcting the frequency of the second voice based on a result of comparing a frequency band of the second voice with the audible frequency band of the speaker.

The device of claim 15, wherein the processor is configured to, when the second voice is outside the audible frequency band of the speaker, generate the third voice by shifting the second voice into the audible frequency band of the speaker and sampling the shifted voice.

The device of claim 15, wherein the processor is configured to determine whether the second voice is a signal in a high-frequency band outside the audible frequency band of the speaker, and to generate the third voice by reinforcing energy in the high-frequency band of the second voice upon determining that the second voice is a signal in the high-frequency band.

The device of claim 15, further comprising:
a user interface configured to receive a selection by the speaker as to whether to use the third voice.

The device of claim 15, wherein the processor is configured to, when the speaker information does not include the audible frequency band of the speaker, provide the third voice for the speaker by correcting the frequency of the second voice based on an average audible frequency for the speaker's age group.

The device of claim 15, wherein the processor is configured to determine whether the first voice includes a predetermined call word, and to recognize the speaker by applying the first voice to a speaker recognition engine upon determining that the first voice includes the call word.