KR20160049804A

KR20160049804A - Apparatus and method for controlling outputting target information to voice using characteristic of user voice

Info

Publication number: KR20160049804A
Application number: KR1020140147474A
Authority: KR
Inventors: 권오현
Original assignee: 현대모비스 주식회사
Priority date: 2014-10-28
Filing date: 2014-10-28
Publication date: 2016-05-10
Also published as: CN105575383A; KR102311922B1

Abstract

The present invention provides an apparatus and a method for controlling voice output of target information using a user's voice characteristics, which provide a text-to-speech (TTS) service based on characteristic information obtained from the user's voice. According to the present invention, the apparatus comprises: a characteristic information generation unit that generates characteristic information of a user based on the user's voice information; a target information generation unit that generates second target information in speech form from first target information in text form based on the characteristic information; and a target information output unit that outputs the second target information.

Description

Apparatus and method for controlling voice output of target information using a user's voice characteristics

The present invention relates to a control apparatus and method for outputting target information by voice. More particularly, it relates to a control apparatus and method for outputting target information by voice in a vehicle.

In general, TTS (Text To Speech) is a technology that converts characters or symbols into speech. TTS builds a pronunciation database of phonemes and concatenates them to generate continuous speech. The key is to synthesize natural-sounding speech by adjusting the loudness, duration, and pitch of the voice.

That is, TTS is a text-to-speech conversion apparatus that converts a character string (sentence) into speech, and it is broadly divided into three stages: language processing, prosody generation, and waveform synthesis. When text is input, the language processing stage analyzes the grammatical structure of the input document, prosody resembling that of a human reader is generated from the analyzed structure, and synthesized speech is produced by assembling basic units from a stored speech DB according to the generated prosody.

TTS places no limit on the target vocabulary and converts ordinary text-form information into speech. Implementations therefore combine phonetics, speech analysis, speech synthesis, and speech recognition technologies to output more natural and varied speech.

However, when a terminal providing such conventional TTS reads out a text message or the like, it always uses the same preset voice regardless of who the other party is, and therefore fails to satisfy the needs of diverse users.

Korean Patent Laid-Open Publication No. 2011-0032256 proposes a TTS announcement broadcasting apparatus. However, this apparatus merely converts designated text into speech and therefore cannot solve the above problem.

The present invention has been devised to solve the above problem. Its object is to propose an apparatus and method for controlling voice output of target information using a user's voice characteristics, which provide a TTS (Text To Speech) service based on characteristic information obtained from the user's voice.

However, the objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above object, the present invention proposes an apparatus for controlling voice output of target information using a user's voice characteristics, comprising: a characteristic information generation unit that generates characteristic information of a user based on the user's voice information; a target information generation unit that generates second target information in speech form from first target information in text form based on the characteristic information; and a target information output unit that outputs the second target information.

Preferably, the characteristic information generation unit extracts from the voice information at least one of formant information, frequency (Log f0) information, LPC (Linear Predictive Coefficient) information, spectral envelope information, energy information, speech rate (Pitch Period) information, and log spectrum information, and generates the characteristic information in real time based on the at least one piece of information.

Preferably, the characteristic information generation unit generates, as the characteristic information and in real time, at least one of gender information, age information, and emotion information of the user.

Preferably, the characteristic information generation unit generates the characteristic information after removing noise information from the voice information.

Preferably, the characteristic information generation unit generates the characteristic information by applying, to the voice information, weight information obtained by training on input information corresponding to the voice information and on the target information of each input.

Preferably, the characteristic information generation unit obtains the weight information using an ANN (Artificial Neural Network) algorithm, an EBP (Error Back Propagation) algorithm, and a gradient descent method.

Preferably, the target information generation unit extracts reference information corresponding to the characteristic information from a database, and generates the second target information by tuning, based on the reference information, the information obtained by converting the first target information into speech.

Preferably, the target information generation unit generates the second target information by tuning the information obtained by converting the first target information into speech, based on speech rate (Pitch Period) information or frequency (Log f0) information obtained from the reference information.

Preferably, the target information generation unit generates the second target information based on speaker identification information obtained from the characteristic information, together with the reference information.

Preferably, the target information generation unit obtains the speaker identification information based on a Gaussian mixture model (GMM).

The present invention also proposes a method for controlling voice output of target information using a user's voice characteristics, comprising: generating characteristic information of a user based on the user's voice information; generating second target information in speech form from first target information in text form based on the characteristic information; and outputting the second target information.

Preferably, the step of generating the characteristic information extracts from the voice information at least one of formant information, frequency (Log f0) information, LPC (Linear Predictive Coefficient) information, spectral envelope information, energy information, speech rate (Pitch Period) information, and log spectrum information, and generates the characteristic information in real time based on the at least one piece of information.

Preferably, the step of generating the characteristic information generates, as the characteristic information and in real time, at least one of gender information, age information, and emotion information of the user.

Preferably, the step of generating the characteristic information generates the characteristic information after removing noise information from the voice information.

Preferably, the step of generating the characteristic information generates the characteristic information by applying, to the voice information, weight information obtained by training on input information corresponding to the voice information and on the target information of each input.

Preferably, the step of generating the characteristic information obtains the weight information using an ANN (Artificial Neural Network) algorithm, an EBP (Error Back Propagation) algorithm, and a gradient descent method.

Preferably, the step of generating the second target information extracts reference information corresponding to the characteristic information from a database, and generates the second target information by tuning, based on the reference information, the information obtained by converting the first target information into speech.

Preferably, the step of generating the second target information generates the second target information by tuning the information obtained by converting the first target information into speech, based on speech rate (Pitch Period) information or frequency (Log f0) information obtained from the reference information.

Preferably, the step of generating the second target information generates the second target information based on speaker identification information obtained from the characteristic information, together with the reference information.

Preferably, the step of generating the second target information obtains the speaker identification information based on a Gaussian mixture model (GMM).

By providing a TTS (Text To Speech) service based on characteristic information obtained from a user's voice, the present invention can obtain the following effects.

First, a natural speech recognition system can be realized that communicates bidirectionally rather than one-sidedly.

Second, by providing a TTS service tailored to the driver's gender, age, disposition, and so on, the system can give the vehicle's speech recognition system a voice that is familiar and easy to understand rather than mechanical.

FIG. 1 is a conceptual diagram illustrating the internal configuration of a vehicle voice guidance system according to an embodiment of the present invention.
FIGS. 2 and 3 are reference diagrams for explaining the speaker voice analyzer of the vehicle voice guidance system shown in FIG. 1.
FIG. 4 is a flowchart illustrating a method of operating the vehicle voice guidance system according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the invention. Although preferred embodiments are described below, the technical idea of the present invention is not limited to them and may be modified and practiced in various ways by those skilled in the art.

The present invention aims to provide a more natural and familiar voice guidance service by analyzing the voice characteristics of the individual driver in a vehicle.

FIG. 1 is a conceptual diagram illustrating the internal configuration of a vehicle voice guidance system according to an embodiment of the present invention.

The vehicle voice guidance system 100 uses the driver's voice to provide voice guidance in a pattern similar to the current driver's voice. As shown in FIG. 1, it includes a noise eliminator 110, a voice feature information extractor 120, a speaker voice analyzer 130, a TTS DB extractor 140, a TTS DB 150, a speaker voice tuner 160, a GMM model extractor 170, and a speaker voice converter 180.

In general, in-vehicle navigation guidance voices and speech recognition prompt voices use a specific TTS DB that is fixed at production. As a result, consumer needs for voice guidance suited to age, gender, and driver disposition are not adequately met. For example, a lively, fast voice of someone in their twenties may be hard for elderly listeners to understand, while a gentle, slow voice of someone in their fifties may seem boring and characterless to younger listeners.

The vehicle voice guidance system 100 according to the present invention aims to provide young, middle-aged, and elderly drivers, men and women, and drivers with lively or gentle dispositions with a voice quality that is familiar and easy to understand, rather than a mechanical TTS guidance voice.

In addition, as technology shifts toward two-way communication, the vehicle voice guidance system 100 aims to distinguish drivers using the speaker identification function of speech recognition and to proactively suggest functions suited to each driver, in keeping with the trend toward artificial intelligence.

This is described in more detail below with reference to FIG. 1.

When the speaker's voice information is input, the noise eliminator 110 removes noise components from it. The noise eliminator 110 removes in-vehicle noise so that the driver's voice is captured more clearly.

The voice feature information extractor 120 extracts the speaker's voice feature information from the voice information from which the noise components have been removed. It extracts the feature information of each individual's voice in order to analyze the speaker's age, gender, disposition, and so on.

The voice feature information extractor 120 extracts from the voice information voice feature information such as formant information, frequency (Log f0) information, LPC (Linear Predictive Coefficient) information, spectral envelope information, energy information, speech rate (Pitch Period) information, and log spectrum information.
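Two of the features listed above, energy and fundamental frequency, can be estimated from a single frame of samples as sketched below. This is a minimal illustration and not the extractor 120 itself: the text does not say how each feature is computed, so the autocorrelation pitch estimator, the function names, the 16 kHz sampling rate, and the 60 Hz to 400 Hz search range are all assumptions.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def estimate_f0(frame, sample_rate=16000, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak.

    Assumed method: the source does not specify how f0 is obtained.
    """
    lag_min = int(sample_rate / f_max)  # shortest pitch period searched
    lag_max = int(sample_rate / f_min)  # longest pitch period searched
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation of the frame with itself shifted by `lag` samples.
        score = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def log_f0(frame, sample_rate=16000):
    """Natural logarithm of the estimated fundamental frequency."""
    return math.log(estimate_f0(frame, sample_rate))
```

The frame should span at least a few pitch periods; with a 16 kHz signal, 1024 samples is enough for the assumed search range.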

The speaker voice analyzer 130 classifies the speaker's age, gender, disposition, and so on using the voice feature information extracted by the voice feature information extractor 120. It can use the Log f0 information to distinguish gender: if the average Log f0 value is 120 Hz to 240 Hz the speaker can be judged to be female, and if it is 0 Hz to 120 Hz the speaker can be judged to be male.
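The frequency ranges quoted above amount to a simple threshold rule, sketched here. The function name and the "undetermined" label are hypothetical, and because the text assigns exactly 120 Hz to both ranges, treating 120 Hz as the start of the female range is an assumption.

```python
def classify_gender_by_f0(mean_f0_hz):
    """Apply the f0 ranges quoted in the text to a mean fundamental frequency."""
    if 0.0 <= mean_f0_hz < 120.0:
        return "male"
    if 120.0 <= mean_f0_hz <= 240.0:  # boundary handling at 120 Hz is assumed
        return "female"
    return "undetermined"  # outside the ranges the text gives
```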

Once individual voice feature information has been extracted by the voice feature information extractor 120, the speaker voice analyzer 130 performs modeling with an artificial neural network (ANN) algorithm and can thereby extract the weight information of the network, generalized by age, gender, disposition, and so on. Based on this generalized weight information (that is, the modeling result data from the artificial neural network algorithm), the speaker voice analyzer 130 can estimate the speaker's age, gender, disposition, and so on from the driver's voice feature information input in real time.

To estimate the speaker's age, gender, disposition, and so on, the speaker voice analyzer 130 may use, as artificial neural network algorithms, a neural network for age analysis, a neural network for gender analysis, a neural network for disposition analysis, and the like.

The speaker voice analyzer 130 is described further below with reference to FIGS. 2 and 3.

FIGS. 2 and 3 are reference diagrams for explaining the speaker voice analyzer of the vehicle voice guidance system shown in FIG. 1.

An artificial neural network (ANN) algorithm models the workings of the human brain as connections between nerve cells and uses that model for classification. In this embodiment, the speaker voice analyzer 130 implements the artificial neural network algorithm by performing the following two stages in sequence. FIG. 2 is a reference diagram for explaining the structure of a neuron (processing element) of the artificial neural network used in the algorithm applied to the present invention.

1. Training stage (Training, Modeling)

In the training stage, the speaker voice analyzer 130 feeds a large number of input vectors and target vectors into the given neural network to perform pattern classification, and thereby obtains optimized connection weights 220.

2. Classification stage (Classification)

In the classification stage, the speaker voice analyzer 130 computes an output value 240 from the learned weights 220 and the input vector 210 through the operation formula 230. The speaker voice analyzer 130 may compute the difference between the weights 220 and the input vector 210 and take the closest output as the final result. In the operation formula 230, θ denotes a threshold.
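The operation formula 230 appears only in FIG. 2, which is not reproduced in this text. A common form for a single neuron, given here as an assumption, is a weighted sum of the inputs combined with the threshold θ and passed through an activation function. In this sketch the sigmoid activation, the sign convention for θ (some texts subtract it), and the function names are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, theta):
    """Weighted sum of the inputs plus the threshold term, then sigmoid.

    Assumed form of a single processing element; not taken from FIG. 2.
    """
    net = sum(w * x for w, x in zip(weights, inputs)) + theta
    return sigmoid(net)
```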

When analyzing the speaker's age, gender, disposition, and so on from the speaker's voice feature information using the artificial neural network algorithm, the speaker voice analyzer 130 may apply a multi-layer perceptron, and in particular the error back propagation (EBP) algorithm. This is described in more detail below with reference to FIG. 3. FIG. 3 is a reference diagram showing the structure of the EBP algorithm applied to the present invention.

Conventionally, perceptron theory has been used in speech processing to recognize speech (determining what was said when speech is input) or to identify a person's emotion.

A multilayer perceptron is a neural network with one or more intermediate layers between the input layer and the output layer. The network is connected in the direction of input layer, hidden layer, output layer, and it is a feedforward network: there are no connections within a layer and no direct connections from the output layer back to the input layer.

To apply such a multilayer perceptron to the speaker voice analyzer 130, the present invention adopts the EBP algorithm.

The EBP algorithm has one or more hidden layers between the input layer and the output layer. As shown in Equation 1, it uses the generalized delta rule to train in the direction that minimizes, by the gradient descent method, a cost function defined as the sum of squared errors between the desired target value (D_pj) and the actual output value (O_pj), and thereby obtains the desired weight values.

Here, p denotes the p-th training pattern and E_p denotes the error for the p-th pattern. D_pj denotes the j-th element of the target for the p-th pattern, and O_pj denotes the j-th element of the actual output.
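Equation 1 itself is not reproduced in this text. With the symbols defined above, a sum-of-squared-errors cost is usually written as follows; the factor 1/2 is a common convention and is an assumption here.

```latex
E_p = \frac{1}{2} \sum_{j} \left( D_{pj} - O_{pj} \right)^{2}
```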

Using the EBP algorithm described above, the speaker voice analyzer 130 computes the hidden-layer error from the error arising at the output layer in order to train the hidden layer, propagates this value back toward the input layer, and repeats the process until the output-layer error reaches the desired level, thereby obtaining optimized weight values.

The speaker voice analyzer 130 can carry out the training stage with the EBP algorithm according to the following procedure.

First, in the first step, the weights and thresholds are initialized.

Next, in the second step, an input vector and a target vector are presented.

Next, in the third step, the input value to the j-th hidden-layer neuron is calculated using the presented input vector. Equation 2 may be used here.

Here, net_pj denotes the input value to the j-th hidden-layer neuron. W_ji denotes the connection weight from the j-th neuron to the i-th neuron, and X_pi denotes the input vector. θ_j denotes the threshold.

Next, in the fourth step, the hidden-layer output (O_pj) is calculated using a sigmoid function.

Next, in the fifth step, the input value to output-layer neuron k is calculated using the hidden-layer outputs. Equation 3 may be used here.

Here, net_pk denotes the input value to output-layer neuron k.

Next, in the sixth step, the output-layer output (O_pk) is calculated using the sigmoid function (f'()).

Next, in the seventh step, the error between the target output and the actual output for the input pattern is calculated, and the output-layer error sum is accumulated as the error of the training pattern. Equation 4 may be used here.

Here, d_pk denotes the target output for the input pattern and O_pk denotes the actual output for the input pattern. δ_pk denotes the error between the target output and the actual output. E denotes the output-layer error sum, and E_p denotes the error of the training pattern.

Next, in the eighth step, the hidden-layer error (δ_pj) is calculated using the output-layer error value (δ_pk), the weights between the hidden layer and the output layer (W_kj), and so on. Equation 5 may be used here.

Next, in the ninth step, the output-layer weights (W_kj) are updated using the output value (O_pj) of hidden-layer neuron j obtained in the fourth step and the output-layer error value (δ_pk) obtained in the seventh step. The thresholds are adjusted at the same time. Equation 6 may be used here.

Here, η and β denote gain values, and t denotes time.

Next, in the tenth step, the weights and thresholds between the input layer and the hidden layer are updated in the same way as for the output layer. Equation 7 may be used here.

Next, in the eleventh step, the procedure branches back to the second step and repeats until every training pattern has been learned.

Next, in the twelfth step, training ends if the output-layer error sum E is at or below the allowed value or if the maximum number of iterations has been exceeded. Otherwise, the procedure returns to the second step and repeats.
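The twelve steps above can be sketched as a small training loop. Equations 2 to 7 are not reproduced in this text, so the code uses the textbook forms of back propagation for a single hidden layer with sigmoid units. It is an illustration under stated assumptions rather than the patented implementation: η is assumed to scale the weight updates and β the threshold updates, thresholds are added to the net input, and all function names, default values, and the initialization range are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_ebp(patterns, n_hidden=2, eta=0.5, beta=0.5,
              tolerance=0.01, max_epochs=5000, seed=0):
    """Train a one-hidden-layer perceptron by error back propagation.

    patterns: list of (input_vector, target_vector) pairs.
    Returns (model, per-epoch error sums).
    """
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # Step 1: initialise weights and thresholds with small random values.
    w_ji = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    th_j = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_kj = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    th_k = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    history = []
    for _ in range(max_epochs):
        total_error = 0.0
        for x, d in patterns:  # steps 2 and 11: present every pattern in turn
            # Steps 3-4: hidden-layer net input and sigmoid output.
            o_j = [sigmoid(sum(w * xi for w, xi in zip(w_ji[j], x)) + th_j[j])
                   for j in range(n_hidden)]
            # Steps 5-6: output-layer net input and sigmoid output.
            o_k = [sigmoid(sum(w * oj for w, oj in zip(w_kj[k], o_j)) + th_k[k])
                   for k in range(n_out)]
            # Step 7: output error terms and accumulated squared error.
            d_k = [(d[k] - o_k[k]) * o_k[k] * (1.0 - o_k[k]) for k in range(n_out)]
            total_error += 0.5 * sum((d[k] - o_k[k]) ** 2 for k in range(n_out))
            # Step 8: hidden error terms, computed before W_kj is changed.
            d_j = [o_j[j] * (1.0 - o_j[j]) *
                   sum(d_k[k] * w_kj[k][j] for k in range(n_out))
                   for j in range(n_hidden)]
            # Step 9: update output-layer weights and thresholds.
            for k in range(n_out):
                for j in range(n_hidden):
                    w_kj[k][j] += eta * d_k[k] * o_j[j]
                th_k[k] += beta * d_k[k]
            # Step 10: update hidden-layer weights and thresholds.
            for j in range(n_hidden):
                for i in range(n_in):
                    w_ji[j][i] += eta * d_j[j] * x[i]
                th_j[j] += beta * d_j[j]
        history.append(total_error)
        if total_error <= tolerance:  # step 12: stop once the error is small
            break
    return (w_ji, th_j, w_kj, th_k), history


def predict(model, x):
    """Forward pass through the trained network."""
    w_ji, th_j, w_kj, th_k = model
    o_j = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + t)
           for row, t in zip(w_ji, th_j)]
    return [sigmoid(sum(w * oj for w, oj in zip(row, o_j)) + t)
            for row, t in zip(w_kj, th_k)]
```

As a usage example, training on four one-dimensional patterns such as `[([0.0], [0.0]), ([0.2], [0.0]), ([0.8], [1.0]), ([1.0], [1.0])]` separates low inputs from high ones. A real classifier for age, gender, or disposition would use the extracted voice features as the input vector.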

Meanwhile, when there are multiple speakers, the speaker voice analyzer 130 can also use the multilayer perceptron to analyze each speaker's age, gender, disposition, and so on from that speaker's voice feature information. This is described below.

In a typical noise filtering method, the speech recognition utterance begins a certain time after the speech recognition microphone opens. The signal entering the microphone before the utterance is therefore judged to be in-vehicle noise, and only that noise is filtered from the signal.

However, even though a directional microphone pointed toward the driver is installed in the vehicle, only the signal received during the brief period before the utterance is judged to be noise. If someone in another seat speaks at the moment of the driver's utterance, the voices are mixed and the speech recognition rate drops.

Therefore, in the present invention, a directional microphone is installed in each of the four seat areas of the vehicle, and the microphone signals of the other areas are identified as noise and filtered with reference to the input signal of the microphone in the driver area. While the signal is being processed, the characteristics of the driver in the driver area are determined in real time so that the multimedia device can provide information suited to the driver.

This is described in more detail below. In the following description, the driver's seat is defined as area A, the front passenger seat as area B, and the seats behind the driver's seat and the front passenger seat as areas C and D, respectively.

When the driver starts the voice recognition function, the microphones in areas A, B, C, and D open simultaneously and receive the voice signals of the four areas. Vehicle noise, as opposed to human speech, arrives at the four microphones with nearly identical values, so the vehicle noise value is filtered from the area A signal. The voices of the four areas are then analyzed. First, the voice vector values that indicate gender are analyzed for the four areas. If a vector value indicating a gender different from that of area A is extracted from area B, C, or D, the signal corresponding to that vector value is filtered from the area A signal. Once the gender analysis is complete, age, mood/condition, and so on are analyzed in the same way.

The driver's voice signal will be the strongest in area A, but when voice signals from areas B, C, and D are also present, it is difficult to extract only the driver's voice cleanly from the area A signal. This is why the method above is used.

In doing so, correlation, ICA, beamforming, or other algorithms may be used to determine whether the signals are independent or similar.
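As a rough illustration of the correlation option, the following sketch computes a zero-lag normalized correlation between two microphone signals and compares it with a threshold. The function names and the 0.8 threshold are assumptions made for this example; the patent does not specify how the correlation is computed or thresholded.

```python
import math
from typing import Sequence


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-lag Pearson correlation of two equal-length signals, in [-1, 1]."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a constant signal carries no correlation information
    return cov / (norm_a * norm_b)


def looks_similar(a: Sequence[float], b: Sequence[float], threshold: float = 0.8) -> bool:
    """Treat two signals as similar when |correlation| reaches the (assumed) threshold."""
    return abs(normalized_correlation(a, b)) >= threshold
```

A real system would likely search over time lags as well, since the same voice reaches different seats with different delays; this sketch compares aligned samples only.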

Filtering through the four microphones makes it possible to identify each speaker's individual characteristics, and noise filtering that uses this information can raise the recognition rate.

A vehicle generally has four designated seats, and the in-vehicle voice recognition system is usually used by the driver. If occupants of the other seats speak while the driver is using the voice recognition system, several voices are superimposed, which makes it difficult for the recognition system to recognize the driver's command. Voice recognition systems in common use today set a speech-free interval before the recognition interval, treat that interval as noise, and filter that noise from the interval in which speech arrives.

The present invention uses perceptron theory to extract voice features and identify the speaker's characteristics, and uses that data to provide information suited to the speaker in real time. With a perceptron, it is possible ① to provide customized information according to the speaker's characteristics, or ② to recognize the speaker's position and provide the function the speaker wants at that position. Items ① and ② are described in more detail below.

1. Providing customized information according to speaker characteristics

When the system is built on a multilayer perceptron, the driver's voice can be extracted even when several voices are superimposed. The method is not limited to the driver; the other occupants can be recognized as well. One example is extracting only the voice characteristics of area A and ignoring the voice signals of areas B, C, and D.

The perceptron approach presupposes an algorithm that has already been trained with back propagation on a large database.

As an example of perceptron modeling, a large number of voice samples from women in their 20s from Seoul who are in good condition are analyzed, and their characteristics (formants, fundamental frequency, energy value, LPC values, etc.) are extracted and fed to the input. When the output target is set to "woman in her 20s from Seoul in good condition", the perceptron structure runs back propagation internally and the appropriate weight values are determined. Once people with a variety of characteristics have been trained in this way, the characteristics of any incoming voice can be found within the trained structure. The LPC value is a linear predictive coding value, one of the speech coding schemes based on a model of human speech production, and forms a 26-dimensional vector.

The formants, fundamental frequency, and 26-dimensional LPC vector of numerous voice samples for a specific target are input, and back propagation determines the appropriate weight values. This task is repeated for many targets (a woman in her 20s from Seoul in good condition, a man in his 30s from the Gyeongsang region in poor condition, and so on).

After this training, whatever voice is input, feeding its feature vectors into the modeled perceptron structure reveals the speaker's characteristics.
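The training described above can be sketched with a very small multilayer perceptron. The following is a minimal illustration, assuming one hidden layer, sigmoid units, squared error, and per-sample gradient descent. The class name, the two-dimensional toy feature vectors, and the two output classes are all invented for this example and stand in for the real formant, fundamental frequency, energy, and LPC inputs. The loop stops when the epoch's error sum E falls to a tolerance or the maximum number of epochs is reached.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


class TinyMLP:
    """One hidden layer with sigmoid units, trained by plain back propagation."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x: Sequence[float]) -> Tuple[List[float], List[float]]:
        inputs = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w_hidden]
        outputs = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0])))
                   for row in self.w_out]
        return hidden, outputs

    def train_step(self, x: Sequence[float], target: Sequence[float], lr: float) -> float:
        hidden, outputs = self.forward(x)
        # Output deltas for squared error with a sigmoid activation.
        d_out = [(o - t) * o * (1.0 - o) for o, t in zip(outputs, target)]
        # Hidden deltas: push the output deltas back through w_out (before updating it).
        d_hid = [h * (1.0 - h) * sum(d_out[k] * self.w_out[k][j] for k in range(len(d_out)))
                 for j, h in enumerate(hidden)]
        for k, row in enumerate(self.w_out):
            for j, v in enumerate(hidden + [1.0]):
                row[j] -= lr * d_out[k] * v
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(list(x) + [1.0]):
                row[i] -= lr * d_hid[j] * v
        return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, target))


def train(net: TinyMLP, samples, lr: float = 0.5,
          tolerance: float = 0.01, max_epochs: int = 2000) -> List[float]:
    """Return the error sum E per epoch; stop when E <= tolerance or epochs run out."""
    history: List[float] = []
    for _ in range(max_epochs):
        error_sum = sum(net.train_step(x, t, lr) for x, t in samples)
        history.append(error_sum)
        if error_sum <= tolerance:
            break
    return history


# Toy data (invented): two normalized features per sample, one-hot target per class.
SAMPLES = [
    ([0.9, 0.8], [1.0, 0.0]),
    ([0.8, 0.9], [1.0, 0.0]),
    ([0.2, 0.1], [0.0, 1.0]),
    ([0.1, 0.3], [0.0, 1.0]),
]
```

After `train(TinyMLP(2, 3, 2), SAMPLES)`, calling `forward` on a new feature vector returns one score per class, and the largest score indicates the class whose training samples the input most resembles.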

Seat selection is based on PTT. If there are four PTT buttons, the voice entering the microphone at the seat whose PTT button was pressed is treated as the voice to be analyzed, and the rest is treated as noise and filtered. Recognition is performed on the filtered voice so that the optimal information can be provided to the speaker. For example, when the speaker gives a command to the multimedia product to find nearby restaurants, restaurants that suit the speaker's characteristics are found first.

Summarizing the above, the following features can be derived.

First, the PTT position is determined, and a vector is extracted according to the characteristics of each voice signal.

Next, the characteristic vectors of the four signals are input to the multilayer perceptron structure.

Next, the characteristics of each voice signal are extracted.

Next, if a signal has characteristics different from those of the reference voice (A), those differing characteristic values are treated as noise in the A microphone signal and filtered.

Next, voice recognition is performed on the data from which only the area A voice has been extracted, and the meaning of the voice is determined.

Finally, information optimized for the command of the speaker in area A is provided.
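The comparison against the reference voice in the steps above can be sketched as follows. This is an illustrative sketch only: it assumes the perceptron has already produced a set of trait labels per seat area, and the function name and the trait dictionary layout are hypothetical. It reports, for each non-reference area, which traits differ from the reference area and would therefore be treated as noise in the reference microphone signal.

```python
from typing import Dict, List


def areas_to_filter(traits_by_area: Dict[str, Dict[str, str]],
                    reference: str = "A") -> Dict[str, List[str]]:
    """For each non-reference area, list the traits that differ from the reference area."""
    ref_traits = traits_by_area[reference]
    mismatches: Dict[str, List[str]] = {}
    for area, traits in traits_by_area.items():
        if area == reference:
            continue
        differing = [name for name, value in traits.items()
                     if ref_traits.get(name) != value]
        if differing:
            mismatches[area] = differing
    return mismatches
```

With area A labeled female and in her 20s, an area B speaker labeled male and in his 20s would be reported as `{"B": ["gender"]}`. The `reference` argument would be set from the seat whose PTT button was pressed.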

2. Recognizing the speaker's position and providing the function the speaker wants at that position

Seat selection is based on PTT. If there are four PTT buttons, the voice entering the microphone at the seat whose PTT button was pressed is treated as the voice to be analyzed, and the rest is treated as noise and filtered. In the case of climate control, for example, when a person sitting in area D gives a command related to the air conditioner temperature, the climate control level can be changed according to the command for the area D climate control unit only.
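The climate control example can be sketched as a per-seat lookup keyed by the PTT position. The data structure, function name, and temperature values below are assumptions made for illustration; the patent does not describe the climate control interface.

```python
from dataclasses import dataclass
from typing import Dict

SEATS = ("A", "B", "C", "D")


@dataclass
class ClimateZone:
    temperature_c: float


def apply_temperature_command(zones: Dict[str, ClimateZone],
                              ptt_seat: str, delta_c: float) -> None:
    """Change the set temperature only for the zone whose PTT button was pressed."""
    if ptt_seat not in zones:
        raise ValueError(f"unknown seat area: {ptt_seat}")
    zones[ptt_seat].temperature_c += delta_c
```

Calling `apply_temperature_command(zones, "D", -2.0)` lowers the area D setting by two degrees and leaves areas A, B, and C unchanged.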

The description now returns to FIG. 1.

The TTS DB 150 is a database that stores reference feature information related to age (teens, 20s, 30s, 40s, 50s, 60s, 70s and older, etc.), reference feature information related to gender (male, female, etc.), and reference feature information related to disposition (gentle, lively, etc.).

The TTS DB extractor 140 retrieves from the TTS DB 150 the information corresponding to the speaker's age, gender, disposition, and so on, as determined by the speaker voice analyzer 130.
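One way to picture the lookup is a table keyed by age band, gender, and disposition. The sketch below is hypothetical throughout: the record fields, the key format, and the numeric values are placeholders chosen for illustration and are not taken from the patent.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class ReferenceVoice(NamedTuple):
    mean_log_f0: float      # natural log of the fundamental frequency in Hz
    pitch_period_ms: float  # pitch period in milliseconds


# Placeholder entries only; a real TTS DB would hold measured reference features.
TTS_DB: Dict[Tuple[str, str, str], ReferenceVoice] = {
    ("20s", "female", "lively"): ReferenceVoice(mean_log_f0=5.35, pitch_period_ms=4.8),
    ("40s", "male", "gentle"): ReferenceVoice(mean_log_f0=4.70, pitch_period_ms=9.1),
}


def lookup_reference(age: str, gender: str, disposition: str) -> Optional[ReferenceVoice]:
    """Return the reference features for the analyzed traits, or None if absent."""
    return TTS_DB.get((age, gender, disposition))
```

Returning `None` for a missing combination leaves the fallback policy (for example, a default voice) to the caller.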

The speaker voice tuner 160 tunes the voice to be output for the TTS service based on the information retrieved from the TTS DB 150. The speaker voice tuner 160 can perform the tuning by applying speech rate information (Pitch Period), pitch height information (Log f0), and the like obtained from the driver's voice to the voice to be output.
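As one possible reading of the Log f0 tuning, the sketch below shifts a synthesized f0 contour so that its mean log f0 matches a target value. The function name, the convention that 0 marks an unvoiced frame, and the choice of a uniform shift in the log domain are assumptions; the patent does not state how the tuning is carried out.

```python
import math
from typing import List, Sequence


def shift_log_f0(f0_hz: Sequence[float], target_mean_log_f0: float) -> List[float]:
    """Scale voiced frames so the contour's mean log f0 equals the target.

    Frames with f0 <= 0 are treated as unvoiced and returned as 0.0.
    """
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        return [0.0 for _ in f0_hz]
    current_mean = sum(math.log(f) for f in voiced) / len(voiced)
    # Adding a constant in the log domain is a multiplication in Hz.
    ratio = math.exp(target_mean_log_f0 - current_mean)
    return [f * ratio if f > 0.0 else 0.0 for f in f0_hz]
```

A shift in the log domain preserves the relative shape of the intonation contour while moving the overall pitch toward the target.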

The GMM (Gaussian Mixture Model) model extractor 170 generates a Gaussian mixture model based on the speaker's voice feature information extracted by the voice feature information extractor 120.

The speaker voice converter 180 additionally converts the voice tuned by the speaker voice tuner 160 by applying the Gaussian mixture model to it. In the present invention, the voice tuned by the speaker voice tuner 160 can be provided as the voice for the TTS service. However, the present invention is not limited to this, and the voice may be additionally converted through the GMM (Gaussian Mixture Model) so that the speaker's voice characteristics are appropriately reflected in real time.

The speaker voice converter 180 using the Gaussian mixture model is described in further detail below.

The Gaussian mixture density of a random vector x ∈ Rⁿ can be expressed as Equation 8.

In the above, p() denotes a Gaussian function whose component parameters are a mean and a variance. Q denotes the total number of single Gaussian densities, and α_i denotes the weight of the i-th single Gaussian density.

Here, writing the single Gaussian density as b_i(x), it is defined as in Equation 9.
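The images for Equations 8 and 9 are not reproduced in this text. For orientation, the textbook form of a Gaussian mixture density that is consistent with the symbols used here (Q components, weights α_i, single densities b_i(x), with μ_i and C_i denoting the mean vector and covariance matrix of component i) would read as follows. This is a reconstruction from the standard definition, not a copy of the patent's equations.

```latex
p(x \mid \lambda) = \sum_{i=1}^{Q} \alpha_i \, b_i(x),
\qquad \sum_{i=1}^{Q} \alpha_i = 1, \quad \alpha_i \ge 0

b_i(x) = \frac{1}{(2\pi)^{n/2} \, \lvert C_i \rvert^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{T} C_i^{-1} (x - \mu_i) \right)
```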

The complete Gaussian mixture density is therefore specified by the following three parameters.

λ = {α_i, μ_i, C_i}, i = 1, …, Q
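To make the parameter set concrete, the following sketch evaluates a Gaussian mixture density in the scalar case, where each covariance C_i reduces to a variance. The function names are assumptions for this example, and the multivariate case used for real voice feature vectors would need matrix determinants and inverses instead.

```python
import math
from typing import Sequence


def gaussian_density(x: float, mean: float, variance: float) -> float:
    """Single Gaussian density b_i(x) in one dimension."""
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def gmm_density(x: float, weights: Sequence[float],
                means: Sequence[float], variances: Sequence[float]) -> float:
    """Weighted sum of Q single Gaussian densities; weights must sum to 1."""
    if not (len(weights) == len(means) == len(variances)):
        raise ValueError("weights, means, and variances must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(a * gaussian_density(x, m, v)
               for a, m, v in zip(weights, means, variances))
```

With a single component of weight 1, `gmm_density` reduces to `gaussian_density`, which is a quick sanity check on any parameter set.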

If x ∈ Rⁿ is defined as the voice selected by the TTS DB extractor 140 and y ∈ Rⁿ is defined as the driver's voice, then z = (x, y)^T can be defined as the joint density voice of the selected voice and the driver's voice. This is expressed by the following equation.

Accordingly, the speaker voice converter 180 finds the mapping function F(x) that minimizes the mean square error, as in Equation 11.

E[…] denotes expectation, and F(x) denotes the spectral vector of the estimated voice.
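The image for Equation 11 is not reproduced in this text. A standard way to write the criterion, together with the well-known minimizer for a joint-density Gaussian mixture model, is sketched below. The superscripted blocks (μ_i^x, μ_i^y, C_i^{xx}, C_i^{yx}) and the posterior weight h_i(x) are notation introduced for this sketch, obtained by partitioning each component's mean and covariance of z = (x, y)^T; the patent's own equation may be written differently.

```latex
\varepsilon_{\mathrm{mse}} = E\!\left[ \lVert y - F(x) \rVert^{2} \right]

F(x) = E[\, y \mid x \,]
     = \sum_{i=1}^{Q} h_i(x) \left[ \mu_i^{y} + C_i^{yx} \left( C_i^{xx} \right)^{-1} (x - \mu_i^{x}) \right]

h_i(x) = \frac{\alpha_i \, \mathcal{N}(x;\, \mu_i^{x}, C_i^{xx})}
              {\sum_{j=1}^{Q} \alpha_j \, \mathcal{N}(x;\, \mu_j^{x}, C_j^{xx})}
```

The mean square error is minimized by the conditional expectation of y given x, which for a Gaussian mixture is a posterior-weighted sum of per-component linear regressions.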

Next, a method of operating the vehicle voice guidance providing system 100 described with reference to FIGS. 1 to 3 is described. FIG. 4 is a flowchart illustrating a method of operating the vehicle voice guidance providing system according to an embodiment of the present invention.

When the driver utters a specific command (S405), the voice feature information extractor 120 extracts feature information from the speaker's voice (S410).

The speaker voice analyzer 130 then analyzes gender, age, disposition, and so on from the feature information in real time (S415).

The TTS DB extractor 140 then selects from the TTS DB 150 the information corresponding to each analysis result (S420).

The speaker voice tuner 160 then tunes the speech-converted information based on the information selected by the TTS DB extractor 140 (S425).

The speaker voice converter 190 then converts the tuned voice, based on the GMM model obtained from the speaker's voice, so that it approaches the driver's actual voice (S430).

The TTS output unit (not shown) then outputs the voice converted by the speaker voice converter 190 (S435).
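The order of steps S410 to S435 can be sketched as a simple chain of stages. The function name, the dictionary of stage callables, and their signatures are hypothetical and serve only to show the data flow; each stage would be supplied by the corresponding component of the system.

```python
from typing import Any, Callable, Dict


def run_voice_guidance(utterance: Any, text: str,
                       stages: Dict[str, Callable[..., Any]]) -> Any:
    """Run the stages in flowchart order and return the output stage's result."""
    features = stages["extract_features"](utterance)   # S410
    traits = stages["analyze_speaker"](features)       # S415
    reference = stages["select_reference"](traits)     # S420
    tuned = stages["tune"](text, reference)            # S425
    converted = stages["convert"](tuned, features)     # S430
    return stages["output"](converted)                 # S435
```

Passing the stages in as callables keeps the ordering logic separate from the signal processing, so each stage can be replaced or tested on its own.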

One embodiment of the present invention has been described above with reference to FIGS. 1 to 4. Preferred forms of the present invention that can be inferred from this embodiment are described below.

A target information voice output control apparatus according to a preferred embodiment of the present invention includes a characteristic information generation unit, a target information generation unit, a target information output unit, a power supply unit, and a main control unit.

The power supply unit supplies power to each component of the target information voice output control apparatus. The main control unit controls the overall operation of each component of the apparatus. Given that the apparatus is applied to a vehicle, the power supply unit and the main control unit may be omitted in this embodiment.

The characteristic information generation unit generates characteristic information of the user based on the user's voice information. The characteristic information generation unit corresponds to the voice feature information extractor 120 of FIG. 1.

The characteristic information generation unit may extract from the voice information at least one of formant information, frequency (Log f0) information, LPC (Linear Predictive Coefficient) information, spectral envelope information, energy information, speech rate (Pitch Period) information, and log spectrum information, and may generate the characteristic information in real time based on the at least one piece of information.

The characteristic information generation unit may generate, as the characteristic information, at least one of gender information of the user, age information of the user, and emotion information of the user in real time. In this case the characteristic information generation unit corresponds to the combination of the voice feature information extractor 120 and the speaker voice analyzer 130 of FIG. 1.

The characteristic information generation unit may generate the characteristic information after removing noise information from the voice information. In this case the characteristic information generation unit corresponds to the combination of the noise remover 110 and the voice feature information extractor 120 of FIG. 1.

The characteristic information generation unit may generate the characteristic information by applying weight information to the voice information, where the weight information is obtained by training on pieces of input information corresponding to the voice information and the training target (desired output) of each piece of input information.

The characteristic information generation unit may obtain the weight information using an ANN (Artificial Neural Network) algorithm, an EBP (Error Back Propagation) algorithm, and the gradient descent method.

The target information generation unit generates second target information in voice form from first target information in text form, based on the characteristic information.

The target information generation unit may extract reference information corresponding to the characteristic information from a database, and may generate the second target information by tuning, based on this reference information, the information obtained by converting the first target information into voice. In this case the target information generation unit corresponds to the combination of the TTS DB 150, the TTS DB extractor 140, and the speaker voice tuner 160 of FIG. 1.

The target information generation unit may generate the second target information by tuning the information obtained by converting the first target information into voice, based on speech rate (Pitch Period) information or frequency (Log f0) information obtained from the reference information.

The target information generation unit may generate the second target information based on the reference information together with speaker identification information obtained from the characteristic information. In this case the target information generation unit corresponds to the combination of the TTS DB 150, the TTS DB extractor 140, the speaker voice tuner 160, the GMM model extractor 170, and the speaker voice converter 180.

The target information generation unit may obtain the speaker identification information based on a Gaussian mixture model (GMM).

Next, a method of operating the target information voice output control apparatus is described.

First, the characteristic information generation unit generates characteristic information of the user based on the user's voice information.

The target information generation unit then generates second target information in voice form from first target information in text form, based on the characteristic information.

The target information output unit then outputs the second target information.

Even though all the components constituting the embodiments of the present invention described above are described as being combined into one or as operating in combination, the present invention is not necessarily limited to these embodiments. That is, within the scope of the object of the present invention, the components may be selectively combined into one or more units and operate together. In addition, although each of the components may be implemented as a single independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. Such a computer program may be stored on computer-readable media such as a USB memory, a CD, or a flash memory, and may be read and executed by a computer to implement the embodiments of the present invention. The recording media of the computer program may include magnetic recording media, optical recording media, carrier wave media, and the like.

In addition, all terms, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs, unless defined otherwise in the detailed description. Commonly used terms, such as those defined in dictionaries, should be interpreted as consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in the present invention.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention belongs will be able to make various modifications, changes, and substitutions without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed in the present invention and the accompanying drawings are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments and drawings. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of rights of the present invention.

Claims

1. An apparatus for controlling voice output of target information using voice characteristics of a user, the apparatus comprising:
a characteristic information generation unit for generating characteristic information of a user based on voice information of the user;
a target information generation unit for generating second target information in voice form from first target information in text form based on the characteristic information; and
a target information output unit for outputting the second target information.

2. The apparatus according to claim 1,
wherein the characteristic information generation unit extracts from the voice information at least one of formant information, frequency (Log f0) information, LPC (Linear Predictive Coefficient) information, spectral envelope information, energy information, speech rate (Pitch Period) information, and log spectrum information, and generates the characteristic information in real time based on the at least one piece of information.

3. The apparatus according to claim 1,
wherein the characteristic information generation unit generates, as the characteristic information, at least one of gender information of the user, age information of the user, and emotion information of the user in real time.

4. The apparatus according to claim 1,
wherein the characteristic information generation unit generates the characteristic information after removing noise information from the voice information.

5. The apparatus according to claim 1,
wherein the characteristic information generation unit generates the characteristic information by applying to the voice information weight information obtained by training on pieces of input information corresponding to the voice information and the training target of each piece of input information.

6. The apparatus according to claim 5,
wherein the characteristic information generation unit obtains the weight information using an ANN (Artificial Neural Network) algorithm, an EBP (Error Back Propagation) algorithm, and the gradient descent method.

7. The apparatus according to claim 1,
wherein the target information generation unit extracts reference information corresponding to the characteristic information from a database, and generates the second target information by tuning, based on the reference information, information obtained by converting the first target information into voice.

8. The apparatus according to claim 7,
wherein the target information generation unit generates the second target information by tuning the information obtained by converting the first target information into voice, based on speech rate (Pitch Period) information or frequency (Log f0) information obtained from the reference information.

9. The apparatus according to claim 7,
wherein the target information generation unit generates the second target information based on the reference information together with speaker identification information obtained from the characteristic information.

10. The apparatus according to claim 9,
wherein the target information generation unit obtains the speaker identification information based on a Gaussian mixture model (GMM).

11. A method for controlling voice output of target information using voice characteristics of a user, the method comprising:
generating characteristic information of a user based on voice information of the user;
generating second target information in voice form from first target information in text form based on the characteristic information; and
outputting the second target information.

12. The method according to claim 11,
wherein the generating of the characteristic information comprises extracting from the voice information at least one of formant information, frequency (Log f0) information, LPC (Linear Predictive Coefficient) information, spectral envelope information, energy information, speech rate (Pitch Period) information, and log spectrum information, and generating the characteristic information in real time based on the at least one piece of information.

13. The method according to claim 11,
wherein the generating of the characteristic information comprises generating, as the characteristic information, at least one of gender information of the user, age information of the user, and emotion information of the user in real time.

14. The method according to claim 11,
wherein the generating of the second target information comprises extracting reference information corresponding to the characteristic information from a database, and generating the second target information by tuning, based on the reference information, information obtained by converting the first target information into voice.

15. The method according to claim 14,
wherein the generating of the second target information comprises generating the second target information based on the reference information together with speaker identification information obtained from the characteristic information.