CN105575383A - Apparatus and method for controlling target information voice output using voice characteristics of a user

Apparatus and method for controlling target information voice output using voice characteristics of a user

Info

Publication number
CN105575383A
CN105575383A
Authority
CN
China
Prior art keywords
information
voice
object information
user
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510657714.4A
Other languages
Chinese (zh)
Inventor
权吾泫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hyundai Mobis Co Ltd
Original Assignee
Hyundai Mobis Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hyundai Mobis Co Ltd filed Critical Hyundai Mobis Co Ltd
Publication of CN105575383A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides an apparatus and method for controlling target information voice output using voice characteristics of a user, which provide a TTS service according to characteristic information obtained from the user's voice. The apparatus comprises: a characteristic information generation unit that generates characteristic information of the user from the user's voice information; an object information generation unit that generates second object information in speech form from first object information in text form according to the characteristic information; and an object information output unit that outputs the second object information. The apparatus can build a natural speech recognition system and can provide friendly, easy-to-understand voices rather than mechanical ones.

Description

Apparatus and method for controlling target information voice output using voice characteristics of a user
Technical field
The present invention relates to an apparatus and method for controlling the voice output of target information, and more particularly to an apparatus and method for controlling the voice output of target information in a vehicle.
Background technology
Generally, text-to-speech (Text To Speech; hereinafter 'TTS') is a technology that converts text or symbols into voice output. TTS builds a pronunciation database of phonemes and concatenates its entries into continuous speech; the key is to synthesize natural voices by adjusting volume, length, pitch and the like.
That is, TTS is a text-to-speech device that converts character strings (sentences) into speech, and is roughly divided into three steps: language processing, prosody generation and waveform synthesis. Concretely, when text is received, the language processing step analyzes its syntactic structure; from the analyzed structure a prosody is generated as if a person were reading aloud; and synthesized speech is produced according to the generated prosody from the basic units collected in a stored speech database (hereinafter 'DB').
TTS is not restricted to a fixed vocabulary: it converts information in ordinary word form into speech, so in constructing such a system, phonetics, speech analysis, speech synthesis, speech recognition and related technologies are applied to output a variety of natural voices. A minimal code sketch of this three-stage pipeline is given below.
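The following is a minimal sketch, not the patent's implementation: three functions mirroring the language-processing, prosody-generation and waveform-synthesis stages. The lexicon `G2P`, the prosody rule and the unit database `unit_db` are hypothetical placeholders.

```python
import numpy as np

G2P = {"hello": ["HH", "AH", "L", "OW"]}   # hypothetical grapheme-to-phoneme lexicon

def language_processing(text):
    """Stage 1: map the received text to a phoneme sequence."""
    phones = []
    for word in text.lower().split():
        phones.extend(G2P.get(word, []))
    return phones

def generate_prosody(phones):
    """Stage 2: assign a duration scale per phoneme (a flat stand-in for
    the 'as a person reads aloud' prosody model)."""
    return [1.2 if p in ("AH", "OW") else 0.8 for p in phones]

def synthesize(phones, durations, unit_db):
    """Stage 3: concatenate stored unit waveforms, resampling each unit
    to its target duration (crude length control)."""
    out = []
    for p, d in zip(phones, durations):
        unit = unit_db[p]                      # base waveform for this phoneme
        n = max(1, int(len(unit) * d))
        idx = np.linspace(0, len(unit) - 1, n)
        out.append(np.interp(idx, np.arange(len(unit)), unit))
    return np.concatenate(out) if out else np.zeros(0)
```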
However, terminals currently providing such TTS, for example to read out text messages, output the same preset voice no matter who the other party is, and therefore cannot satisfy the demands of all types of users.
Korean Published Patent No. 2011-0032256 discloses a TTS guidance broadcasting device. However, since that device merely converts designated text into speech, it cannot solve the above problem.
Summary of the invention
Technical problem
To solve the above problem, an object of the present invention is to provide an apparatus and method for controlling target information voice output that provide a differentiated TTS (Text To Speech) service according to characteristic information obtained from the user's voice, that is, according to the voice characteristics of the user (characteristic of user voice).
However, the objects of the present invention are not limited to what is recorded above; other objects not recorded here will be clearly understood by those skilled in the art from the following description.
Technical solution
To achieve the above object, the present invention provides an apparatus for controlling target information voice output using voice characteristics of a user, comprising: a characteristic information generation unit that generates characteristic information of the user from voice information of the user; an object information generation unit that generates second object information in speech form from first object information in text form according to the characteristic information; and an object information output unit that outputs the second object information.
Preferably, the characteristic information generation unit extracts at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information and log spectrum information from the voice information, and generates the characteristic information in real time from the at least one piece of information.
Preferably, the characteristic information generation unit generates in real time at least one of gender information of the user, age information of the user and emotion information of the user as the characteristic information.
Preferably, the characteristic information generation unit generates the characteristic information after removing noise information from the voice information.
Preferably, the characteristic information generation unit generates the characteristic information by applying weight information to the voice information, wherein the weight information is obtained by training on input information corresponding to the voice information and target information for each piece of input information.
Preferably, the characteristic information generation unit obtains the weight information using an artificial neural network (ANN) algorithm, an error back-propagation (EBP) algorithm and the gradient descent method.
Preferably, the object information generation unit extracts reference information corresponding to the characteristic information from a database, and generates the second object information by tuning, according to the reference information, the information obtained by converting the first object information into speech.
Preferably, the object information generation unit generates the second object information by tuning the information obtained by converting the first object information into speech according to pitch period information or frequency (log f0) information obtained from the reference information.
Preferably, the object information generation unit generates the second object information according to the reference information and speaker identification information obtained from the characteristic information.
Preferably, the object information generation unit obtains the speaker identification information according to a Gaussian mixture model (GMM).
Further, the present invention provides a method for controlling target information voice output using voice characteristics of a user, comprising: a step of generating characteristic information of the user from voice information of the user; a step of generating second object information in speech form from first object information in text form according to the characteristic information; and a step of outputting the second object information.
Preferably, in the step of generating the characteristic information, at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information and log spectrum information is extracted from the voice information, and the characteristic information is generated in real time from the at least one piece of information.
Preferably, in the step of generating the characteristic information, at least one of gender information of the user, age information of the user and emotion information of the user is generated in real time as the characteristic information.
Preferably, in the step of generating the characteristic information, the characteristic information is generated after noise information is removed from the voice information.
Preferably, in the step of generating the characteristic information, the characteristic information is generated by applying weight information to the voice information, wherein the weight information is obtained by training on input information corresponding to the voice information and target information for each piece of input information.
Preferably, in the step of generating the characteristic information, the weight information is obtained using an artificial neural network (ANN) algorithm, an error back-propagation (EBP) algorithm and the gradient descent method.
Preferably, in the step of generating the second object information, reference information corresponding to the characteristic information is extracted from a database, and the second object information is generated by tuning, according to the reference information, the information obtained by converting the first object information into speech.
Preferably, in the step of generating the second object information, the second object information is generated by tuning the information obtained by converting the first object information into speech according to pitch period information or frequency (log f0) information obtained from the reference information.
Preferably, in the step of generating the second object information, the second object information is generated according to the reference information and speaker identification information obtained from the characteristic information.
Preferably, in the step of generating the second object information, the speaker identification information is obtained according to a Gaussian mixture model (GMM).
Technical effects
The present invention provides a text-to-speech (Text To Speech, hereinafter 'TTS') service according to characteristic information obtained from the user's voice, and thus has the following effects:
First, communication changes from a one-way mode to a two-way mode, so a natural speech recognition system can be built.
Second, the system provides a TTS service matched to the driver's gender, age, preferences and the like, so the vehicle's speech recognition system can provide friendly, easy-to-understand voices rather than mechanical ones.
Brief description of the drawings
Fig. 1 is a conceptual diagram showing the internal configuration of a vehicle voice guidance system according to an embodiment of the present invention;
Fig. 2 and Fig. 3 are reference diagrams illustrating the speaker voice analyzer in the vehicle voice guidance system shown in Fig. 1;
Fig. 4 is a flowchart showing the working method of the vehicle voice guidance system according to an embodiment of the present invention.
Embodiments
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. First, note that in assigning reference numerals to the components in the figures, the same components are given the same reference numerals as far as possible even when they appear in different drawings. Also, in describing the present invention, detailed description of related known structures or functions is omitted when it is judged that it would obscure the subject matter of the present invention. Preferred embodiments of the present invention are described below, but the technical idea of the present invention is not limited or restricted to them, and a person of ordinary skill in the art can implement it with various modifications.
The object of the present invention is to analyze the voice characteristics of the driver in a vehicle and provide a more natural and warm voice guidance service.
Fig. 1 is a conceptual diagram showing the internal configuration of a vehicle voice guidance system according to an embodiment of the present invention.
The vehicle voice guidance system 100 is a system that uses the driver's voice to provide voice guidance in a pattern similar to the voice of the current driver. As shown in Fig. 1, it comprises a noise remover 110, a voice characteristic information extractor 120, a speaker voice analyzer 130, a text-to-speech database extractor (hereinafter 'TTS DB extractor') 140, a text-to-speech database (hereinafter 'TTS DB') 150, a speaker voice tuner 160, a Gaussian mixture model extractor (hereinafter 'GMM extractor') 170, and a speaker voice converter 180.
In vehicles, navigation guidance voices and speech recognition guidance voices generally use a specific TTS DB fixed at production time. Consumers' needs for voice guidance suited to age, gender and driver preference therefore cannot be fully met. For example, an elderly person may not easily understand the fast speech of an energetic speaker in their twenties, while a young person may find the slow speech of a speaker in their fifties dull and characterless.
The object of the vehicle voice guidance system 100 of the present invention is to provide friendly, easy-to-understand speech quality for drivers who are young, middle-aged or elderly, male or female, and of active or gentle personality, instead of providing a mechanical TTS guidance voice.
Further, another object of the vehicle voice guidance system 100 is, in keeping with the trend toward artificial intelligence, to first distinguish the driver through the speaker identification function of speech recognition and then recommend the functions best suited to that driver under a two-way communication scheme.
A specific description is given below with reference to Fig. 1.
The function of the noise remover 110 is to remove noise components from the speaker's voice information when that information is received. The noise remover 110 obtains clear driver speech by removing in-vehicle noise.
The function of the voice characteristic information extractor 120 is to extract the speaker's voice characteristic information from the noise-removed voice information. To analyze the speaker's age, gender, preference and the like, the voice characteristic information extractor 120 extracts the individual's voice characteristic information.
The voice characteristic information extractor 120 extracts voice characteristic information such as formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information and log spectrum information from the voice information. A sketch of one way to compute this feature set is given below.
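The patent does not disclose its extraction procedures, so the following is a sketch under standard assumptions: librosa's YIN estimator for f0, frame RMS for energy, and the formants and spectral envelope approximated from the LPC polynomial.

```python
import numpy as np
import scipy.signal
import librosa

def extract_features(y, sr=16000, lpc_order=26):
    lpc = librosa.lpc(y, order=lpc_order)              # LPC coefficients
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)      # per-frame f0 in Hz
    log_f0 = np.log(f0)                                # "log f0" information
    pitch_period = 1.0 / f0                            # pitch period in seconds
    energy = librosa.feature.rms(y=y)[0]               # frame energy
    log_spectrum = np.log(np.abs(librosa.stft(y)) + 1e-10)
    # Spectral envelope approximated by the LPC all-pole magnitude response.
    _, h = scipy.signal.freqz(1.0, lpc, worN=256)
    envelope = 20.0 * np.log10(np.abs(h) + 1e-10)
    # Formants estimated from the angles of the LPC polynomial roots.
    roots = [r for r in np.roots(lpc) if np.imag(r) > 0.01]
    formants = np.sort(np.angle(roots) * sr / (2.0 * np.pi))
    return {"formants": formants, "log_f0": log_f0, "lpc": lpc,
            "envelope": envelope, "energy": energy,
            "pitch_period": pitch_period, "log_spectrum": log_spectrum}
```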
The function of the speaker voice analyzer 130 is to classify the speaker's age, gender, preference and the like using the voice characteristic information extracted by the voice characteristic information extractor 120. When distinguishing gender, the speaker voice analyzer 130 can use the log f0 information: a mean value of 120 Hz to 240 Hz can be judged to be a female voice, and a mean value of 0 Hz to 120 Hz a male voice. This rule reduces to a few lines of code, sketched below.
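As a sketch, the stated thresholds applied to the f0 track from the extractor above (the handling of values outside both ranges is an assumption; the text does not specify it):

```python
import numpy as np

def classify_gender(f0_hz):
    """Apply the mean-f0 rule from the text: 120-240 Hz female, 0-120 Hz male."""
    mean_f0 = float(np.mean(f0_hz))
    if 120.0 <= mean_f0 <= 240.0:
        return "female"
    if 0.0 <= mean_f0 < 120.0:
        return "male"
    return "unknown"   # outside the ranges given in the text (assumption)
```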
After the voice characteristic information extractor 120 extracts the individual's voice characteristic information, the speaker voice analyzer 130 performs modeling with an artificial neural network (ANN) algorithm and extracts general weight information of the ANN used to analyze age, gender, preference and the like. From this general weight information (that is, the modeling result data obtained with the ANN algorithm), the speaker voice analyzer 130 can extract characteristic information from the driver's voice input in real time, and from it estimate the speaker's age, gender, preference and so on.
To estimate the speaker's age, gender, preference and the like, the speaker voice analyzer 130 can use ANN algorithms such as an age analysis neural network, a gender analysis neural network and a preference analysis neural network.
The speaker voice analyzer 130 is explained further below with reference to Fig. 2 and Fig. 3.
Fig. 2 and Fig. 3 are reference diagrams illustrating the speaker voice analyzer in the vehicle voice guidance system shown in Fig. 1.
An artificial neural network (ANN) algorithm models the connections between nerve cells to reproduce the discriminating behavior of the human brain. In this embodiment, the speaker voice analyzer 130 realizes the ANN algorithm by performing the following two steps in sequence. Fig. 2 is a reference diagram illustrating the structure of the neural units (processing elements) of the ANN applied to the ANN algorithm of the present invention.
1. Learning step (training, modeling)
In the learning step, the speaker voice analyzer 130 feeds a large number of input vectors and target vectors into the specified neural network for pattern classification, to obtain the optimal weight values 220.
2. Classification step
In the classification step, the speaker voice analyzer 130 computes the output value 240 from the arithmetic expression 230 between the learned weight values 220 and the input vector 210. The speaker voice analyzer 130 can compute the difference between the weight values 220 and the input vector 210 and select the closest output as the final result. In the arithmetic expression 230, θ denotes a threshold value.
When analyzing the speaker's age, gender, preference and the like from the speaker's voice characteristic information using the ANN algorithm, the speaker voice analyzer 130 can apply a multi-layer perceptron (MLP), and in particular the error back-propagation (EBP) algorithm. This is explained further below with reference to Fig. 3. Fig. 3 is a reference diagram showing the structure of the EBP algorithm applied to the present invention.
Perceptron theory related to speech has so far been used to recognize the emotion of a voice (judging the content of the voice when it is received) or to distinguish people.
A multi-layer perceptron is a neural network having one or more intermediate layers between the input layer and the output layer. The network is connected in the order input layer, hidden layer, output layer; it is a feed-forward network with no direct connections within a layer and no connections from the output layer back to the input layer.
To adapt this multi-layer perceptron to the speaker voice analyzer 130, the present invention adopts the EBP algorithm.
In the present invention, the EBP algorithm has one or more hidden layers between the input layer and the output layer. As shown in Mathematical Expression 1, the required weight values are learned by the gradient descent method in the direction that minimizes a cost function, where the cost function is the sum of squared errors between the desired value $D_{pj}$, defined with the generalized delta rule, and the actual output value $O_{pj}$:
[Mathematical expression 1]
$$E = \sum_{p} E_p, \qquad E_p = \frac{1}{2}\sum_{j}\left(D_{pj} - O_{pj}\right)^2$$
where $p$ denotes the $p$-th learning pattern and $E_p$ denotes the error for pattern $p$. Further, $D_{pj}$ denotes the $j$-th element of the target for pattern $p$, and $O_{pj}$ denotes the $j$-th element of the actual output.
By using the EBP algorithm described above, the speaker voice analyzer 130 computes, from the error arising at the output layer, the hidden-layer error used for hidden-layer learning, propagates this value backwards toward the input layer, and repeats the process until the output-layer error reaches the target level, thereby obtaining the optimal weight values as described above.
The speaker voice analyzer 130 can perform the training step using the EBP algorithm as follows (a code sketch of the complete loop is given after the steps).
First, step 1 initializes the weight values and threshold values.
Then, step 2 provides an input vector $X_p$ and a target vector $d_p$.
Then, step 3 uses the provided input vector to compute the input value to the $j$-th neural unit of the hidden layer. Mathematical Expression 2 can be used:
[Mathematical expression 2]
$$\mathrm{net}_{pj} = \sum_{i=0}^{N-1} W_{ji} X_{pi} - \theta_j$$
where $\mathrm{net}_{pj}$ denotes the input value to the $j$-th hidden-layer neural unit, $W_{ji}$ denotes the connection weight between the $j$-th hidden unit and the $i$-th input unit, $X_{pi}$ denotes the input vector, $\theta_j$ denotes the threshold value, and $N$ denotes the number of input neural units.
Then, step 4 computes the output $O_{pj}$ of the hidden layer using the sigmoid function.
Then, step 5 uses the hidden-layer output to compute the input value to output-layer neural unit $k$. Mathematical Expression 3 can be used:
[Mathematical expression 3]
$$\mathrm{net}_{pk} = \sum_{j=0}^{L-1} W_{kj} O_{pj} - \theta_k$$
where $\mathrm{net}_{pk}$ denotes the input value to output-layer neural unit $k$, and $L$ denotes the number of hidden neural units.
Then, step 6 computes the output $O_{pk}$ of the output layer from $\mathrm{net}_{pk}$ and the sigmoid function.
Then, step 7 computes the error between the target output and the actual output for the input pattern, and accumulates the output-layer errors as the error of the learning pattern. Mathematical Expression 4 can be used:
[Mathematical expression 4]
$$\delta_{pk} = (d_{pk} - O_{pk})\, f_k'(\mathrm{net}_{pk}) = (d_{pk} - O_{pk})\, O_{pk}(1 - O_{pk})$$
$$E = E + E_p, \qquad E_p = \sum_{k=1}^{M-1} \delta_{pk}^2$$
where $d_{pk}$ denotes the target output of the input pattern and $O_{pk}$ denotes the actual output. Further, $\delta_{pk}$ denotes the error between the target output and the actual output, $E$ denotes the accumulated output-layer error, $E_p$ denotes the error of the learning pattern, and $M$ denotes the number of output neural units.
Then, step 8 computes the hidden-layer error $\delta_{pj}$ from the output-layer error $\delta_{pk}$, the hidden-to-output weights $W_{kj}$, and so on. Mathematical Expression 5 can be used:
[Mathematical expression 5]
$$\delta_{pj} = f_j'(\mathrm{net}_{pj}) \sum_{k=0}^{M-1} \delta_{pk} W_{kj} = O_{pj}(1 - O_{pj}) \sum_{k=0}^{M-1} \delta_{pk} W_{kj}$$
Then, step 9 updates the output-layer weights $W_{kj}$ using the output value $O_{pj}$ of hidden neural unit $j$ obtained in step 4 and the output-layer error $\delta_{pk}$ obtained in step 7. The threshold value is also adjusted at this point; it is treated as a weight associated with a constant input and is therefore updated in the analogous way. Mathematical Expression 6 can be used:
[Mathematical expression 6]
$$W_{kj}(t+1) = W_{kj}(t) + \eta\, \delta_{pk} O_{pj}$$
$$\theta_k(t+1) = \theta_k(t) + \beta\, \delta_{pk}$$
where $\eta$ and $\beta$ are gain values; in particular, $\eta$ denotes the learning rate and $t$ denotes the time step. $W_{kj}(t)$ denotes the weight from hidden neural unit $j$ to output neural unit $k$ at time $t$.
Then, step 10 likewise updates the weights $W_{ji}$ and thresholds $\theta_j$ between the input layer and the hidden layer. Mathematical Expression 7 can be used:
[Mathematical expression 7]
$$W_{ji}(t+1) = W_{ji}(t) + \eta\, \delta_{pj} X_{pi}$$
$$\theta_j(t+1) = \theta_j(t) + \beta\, \delta_{pj}$$
Then, step 11 branches back to step 2 and repeats until all learning patterns have been learned.
Then, step 12 terminates when the accumulated output-layer error $E$ is at or below a permissible value or the maximum number of repetitions is exceeded; otherwise it returns to step 2 and performs the subsequent steps.
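The following is a compact numpy transcription of steps 1-12, a sketch rather than the patent's implementation: one hidden layer, sigmoid units, and the update rules of Expressions 6 and 7. The thresholds are folded in as biases $b = -\theta$ so the gradient-descent signs work out; the layer size, $\eta$, $\beta$ and stopping constants are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_ebp(X, D, L=8, eta=0.5, beta=0.5, max_epochs=5000, tol=1e-3):
    """X: (P, N) input patterns, D: (P, M) target patterns."""
    P, N = X.shape
    M = D.shape[1]
    rng = np.random.default_rng(0)
    W_ji = rng.uniform(-0.5, 0.5, (L, N)); b_j = np.zeros(L)  # step 1
    W_kj = rng.uniform(-0.5, 0.5, (M, L)); b_k = np.zeros(M)
    for _ in range(max_epochs):
        E = 0.0
        for p in range(P):                      # step 2: present one pattern
            x, d = X[p], D[p]
            O_j = sigmoid(W_ji @ x + b_j)       # steps 3-4 (Expression 2)
            O_k = sigmoid(W_kj @ O_j + b_k)     # steps 5-6 (Expression 3)
            delta_k = (d - O_k) * O_k * (1 - O_k)            # step 7 (Expr. 4)
            E += 0.5 * np.sum((d - O_k) ** 2)
            delta_j = O_j * (1 - O_j) * (W_kj.T @ delta_k)   # step 8 (Expr. 5)
            W_kj += eta * np.outer(delta_k, O_j)             # step 9 (Expr. 6)
            b_k += beta * delta_k
            W_ji += eta * np.outer(delta_j, x)               # step 10 (Expr. 7)
            b_j += beta * delta_j
        if E <= tol:                            # step 12: permissible error
            break                               # step 11: else repeat patterns
    return W_ji, b_j, W_kj, b_k
```

The loop can be exercised, for instance, on the XOR problem with `X = np.array([[0,0],[0,1],[1,0],[1,1]], float)` and `D = np.array([[0],[1],[1],[0]], float)`.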
In addition, when there are several speakers, the speaker voice analyzer 130 can use the multi-layer perceptron to analyze each speaker's age, gender, preference and the like from each speaker's voice characteristic information. This is explained below.
Under the usual noise filtering method, the speech recognition microphone transmits the recognition speech only after it has been open for a predetermined time, so the signal entering the microphone before the speech is judged to be in-vehicle noise and is then filtered out of the signal.
A directional microphone aimed at the driver is provided in the vehicle, but since the signal input during the short interval before the speech is what gets judged as noise, if another occupant is speaking at the moment the driver utters the recognition speech, the voices mix and the speech recognition rate drops.
Therefore, in the present invention directional microphones are arranged in each of the four seating areas of the vehicle; taking the input signal of the microphone in the driver's area as the reference, the microphone signals of the other areas are judged to be noise and filtered out. The driver's area is discriminated in real time during signal processing, so that the multimedia equipment can provide information suited to the driver.
To describe this further, in the following the driver's seat is defined as area A, the front passenger's seat as area B, and the seats behind the driver's seat and the front passenger's seat as areas C and D, respectively.
When the driver starts the speech recognition function, the microphones of areas A, B, C and D are opened simultaneously and the voice signals of the four areas are received. Since the vehicle noise level (excluding human speech) received by the four microphones is almost identical, the vehicle noise level is filtered out of A. The voices of the four areas are then analyzed. First, the representative gender speech vector values of the four areas are analyzed; if a vector value expressing a gender different from that of area A is extracted from area B, C or D, the signal corresponding to that vector value is filtered out of area A, with area A as the reference. After the gender analysis is finished, age, mood/state and the like are analyzed by the same procedure.
The driver's voice signal is necessarily the strongest in area A, but when voice signals from areas B, C and D are also present, the driver's complete speech cannot be extracted from area A alone; this is why the method is adopted.
Correlation, ICA (independent component analysis), beamforming or other algorithms can be used here to judge whether the signals are independent or similar.
While filtering with the four microphones, the individual characteristics of the speakers can be analyzed at the same time, and the information obtained by analyzing individual characteristics can in turn be used to filter noise, improving the recognition rate.
A vehicle generally has four seats, and the user of the in-vehicle speech recognition system is generally the driver. If another occupant speaks while the driver is using the speech recognition system, the voices of several people are superimposed and the system cannot identify the driver's command. The speech recognition systems in general use today set a speechless interval before the recognition interval and regard the input of that interval as noise, filtering it during the speech input interval.
The present invention extracts the features of a voice and identifies the speaker using multi-layer perceptron theory, and provides information suited to the speaker in real time from these data. By adopting a multi-layer perceptron, (1) adapted information can be provided according to the speaker's features, or (2) the speaker's position can be identified and the functions needed by the speaker at that position can be provided. (1) and (2) are explained further below.
1. Providing adapted information according to speaker features
When the system is built with a multi-layer perceptron, the driver's voice can be extracted even when several voices are superimposed. The method is applicable not only to the driver; other occupants can also be identified. For example, only the voice features of area A are extracted, and the voice signals of areas B, C and D are ignored.
The major premise of the multi-layer perceptron is that an algorithm that learns from a large DB by back-propagation is formed in advance.
Concretely, multi-layer perceptron modeling works as follows: a large number of voices of, for example, Seoul women aged 20-29 in a good state are analyzed, features are extracted (formants, fundamental frequency, energy values, LPC values, etc.) and fed to the input side, with 'Seoul woman aged 20-29 in a good state' as the output target; the internal perceptron structure then determines suitable weight values through the back-propagation process. When many kinds of people are learned in this way, the trained structure can find the features of any input voice. The LPC value is a linear predictive coding value, one of the speech coding schemes based on the human speech production model, and has a 26-dimensional vector.
When the formant, fundamental frequency and 26-dimensional LPC model values of a large number of voices of specific targets are input, the back-propagation process toward the multiple targets (for example, 'Seoul woman aged 20-29 in a good state', 'Gyeongsang-do man aged 30-40 in a bad state', ...) repeatedly fixes the suitable weight rules.
After this learning process, whatever the voice, the speaker's features can be known simply by inputting that voice's feature vectors to the perceptron structure that models them.
Push-to-talk (hereinafter 'PTT') keys are used as the seat selection criterion. If there are four PTT keys, the voice received by the microphone at the position of the pressed PTT key is judged, according to that position, to be the voice to analyze, and the rest is judged to be noise and filtered out. Recognition is performed on the filtered voice and the best information is provided for the speaker; for example, when the speaker gives a command to the media product and wants to search for a restaurant, restaurants matching the speaker's features are searched first.
Organizing the above description, the following flow can be derived (a minimal code sketch follows the list).
First, the PTT position is discriminated and a vector corresponding to the features of each voice signal is extracted.
Then, the feature vectors of the four signals are input to the multi-layer perceptron structure.
Then, the features of each voice signal are extracted respectively.
Then, when features different from the reference voice A are present, the other feature values in the A microphone signal are judged to be noise and filtered out.
Then, the voice is recognized using the data from which only the area-A voice has been extracted, and the meaning of the voice is determined.
Then, the best information is provided in response to the command of the speaker in area A.
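A minimal sketch of this flow, under the assumption that each seat microphone yields one feature vector per frame and that "different from the reference" is measured by a simple distance threshold (the patent does not specify the criterion):

```python
import numpy as np

def filter_reference_channel(feats_by_seat, ptt_seat="A", threshold=1.0):
    """feats_by_seat: seat id -> (frames x dims) feature matrix.
    Returns the reference-seat frames with intruding frames dropped."""
    ref = feats_by_seat[ptt_seat]                 # seat selected by the PTT key
    others = [v for k, v in feats_by_seat.items() if k != ptt_seat]
    keep = np.ones(len(ref), dtype=bool)
    for t, frame in enumerate(ref):
        # a frame resembling another seat's frame is judged to be noise
        if any(t < len(o) and np.linalg.norm(frame - o[t]) < threshold
               for o in others):
            keep[t] = False
    return ref[keep]
```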
2. Identifying the speaker's position and providing the functions needed at that position
Push-to-talk (PTT) keys are used as the seat selection criterion. If there are four PTT keys, the voice received by the microphone at the position of the pressed PTT key is judged, according to that position, to be the voice to analyze, and the rest is judged to be noise and filtered out. Taking the air conditioner as an example, if the passenger in area D gives a command about the air-conditioning temperature, only the air-conditioning unit of area D can be made to adjust its setting according to the command.
The description now returns to Fig. 1.
The TTS DB 150 is a database storing reference characteristic information about age (10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70 and above, etc.), reference characteristic information about gender (male, female, etc.), reference characteristic information about preference (gentle, active, etc.), and similar information.
The function of the TTS DB extractor 140 is to retrieve from the TTS DB 150 the information corresponding to the speaker's age, gender, preference and the like found by the speaker voice analyzer 130.
The function of the speaker voice tuner 160 is to tune the voice to be output for the TTS service according to the information retrieved from the TTS DB 150. The speaker voice tuner 160 can tune the voice to be output by applying to it the pitch period information, frequency height (log f0) information and the like obtained from the driver's voice; a sketch of such tuning follows.
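A sketch of the tuning step using librosa's built-in effects: the synthesized voice is shifted toward the driver's mean f0 and stretched toward the reference pacing. The mapping from the retrieved reference values to semitones and a stretch rate is an assumption for illustration.

```python
import numpy as np
import librosa

def tune_voice(tts_wave, sr, tts_mean_f0, driver_mean_f0, rate=1.0):
    """Shift pitch by the f0 ratio (in semitones) and stretch the tempo."""
    n_steps = 12.0 * np.log2(driver_mean_f0 / tts_mean_f0)
    shifted = librosa.effects.pitch_shift(tts_wave, sr=sr, n_steps=n_steps)
    return librosa.effects.time_stretch(shifted, rate=rate)
```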
The function of the GMM extractor 170 is to generate a Gaussian mixture model from the speaker's voice characteristic information extracted by the voice characteristic information extractor 120.
The function of the speaker voice converter 180 is to apply the Gaussian mixture model to the voice tuned by the speaker voice tuner 160 so as to convert the voice further. In the present invention, the voice tuned by the speaker voice tuner 160 can itself be provided as the voice for the TTS service. However, the present invention is not limited to this; the speaker's voice can be converted further with the GMM (Gaussian Mixture Model), to ensure that the speaker's voice characteristics are converted appropriately in real time.
The speaker voice converter 180 using the Gaussian mixture model is explained further below.
The Gaussian mixture density of a specific random vector $x \in R^n$ can be expressed by Mathematical Expression 8:
[Mathematical expression 8]
$$p(x \mid \lambda) = \sum_{i=1}^{Q} \alpha_i\, b_i(x), \qquad \sum_{i=1}^{Q} \alpha_i = 1, \quad \alpha_i \ge 0$$
where $p(x \mid \lambda)$ is the mixture density parameterized by $\lambda$, composed of Gaussian functions each having a mean and a covariance. $Q$ denotes the total number of single Gaussian densities, and $\alpha_i$ denotes the weight of each single Gaussian density.
$b_i(x)$ denotes a component of the multidimensional Gaussian mixture density. Each $b_i(x)$ is a single Gaussian density expressed as in Mathematical Expression 9:
[Mathematical expression 9]
$$b_i(x) = \frac{1}{(2\pi)^{n/2}\,|C_i|^{1/2}} \exp\!\left[-\frac{1}{2}(x - \mu_i)^T C_i^{-1}(x - \mu_i)\right]$$
where $\mu_i$ is an $n \times 1$ mean vector and $C_i$ is an $n \times n$ covariance matrix.
Therefore, the complete Gaussian mixture density is specified by the following three kinds of parameters:
$$\lambda = \{\alpha_i, \mu_i, C_i\}, \quad i = 1, \dots, Q$$
Define $x \in R^n$ as the voice selected by the TTS DB extractor 140 and $y \in R^n$ as the driver's voice; then $z = (x, y)^T$ can be defined as the joint density between the voice selected by the TTS DB extractor 140 and the driver's voice. This can be expressed by the following mathematical expression:
[Mathematical expression 10]
$$p(z \mid \lambda) = \sum_{i=1}^{Q} \frac{\alpha_i}{(2\pi)^{n}\,|C_i|^{1/2}} \exp\!\left[-\frac{1}{2}(z - \mu_i)^T C_i^{-1}(z - \mu_i)\right], \qquad \sum_{i=1}^{Q} \alpha_i = 1, \quad \alpha_i \ge 0$$
Therefore, the speaker voice converter 180 finds the mapping function $F(x)$ that minimizes the mean square error, as shown in Mathematical Expression 11.
[Mathematical expression 11]
$$\varepsilon_{mse} = E\left[\,\lVert y - F(x) \rVert^2\,\right]$$
where $E$ denotes the expectation and $F(x)$ denotes the estimated spectral vector of the voice.
When the joint density estimation method is used, $F(x)$ can be defined as shown in Mathematical Expression 12. See 'A. Kain and M. Macon, "Spectral voice conversion for text-to-speech synthesis", Proc. ICASSP, pp. 285-288, 1998.'
[Mathematical expression 12]
$$F(x) = E[y \mid x] = \sum_{i=1}^{Q} h_i(x)\left[\mu_i^{y} + C_i^{yx}\left(C_i^{xx}\right)^{-1}\left(x - \mu_i^{x}\right)\right]$$
$$h_i(x) = \frac{\dfrac{\alpha_i}{(2\pi)^{n/2}\,|C_i^{xx}|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x - \mu_i^{x}\right)^T \left(C_i^{xx}\right)^{-1}\left(x - \mu_i^{x}\right)\right]}{\sum_{j=1}^{Q} \dfrac{\alpha_j}{(2\pi)^{n/2}\,|C_j^{xx}|^{1/2}} \exp\!\left[-\frac{1}{2}\left(x - \mu_j^{x}\right)^T \left(C_j^{xx}\right)^{-1}\left(x - \mu_j^{x}\right)\right]}$$
$$C_i = \begin{bmatrix} C_i^{xx} & C_i^{xy} \\ C_i^{yx} & C_i^{yy} \end{bmatrix}, \qquad \mu_i = \begin{bmatrix} \mu_i^{x} \\ \mu_i^{y} \end{bmatrix}$$
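A sketch of Expressions 8-12 using scikit-learn for the joint-density fit and numpy for the conversion. It assumes the training pairs are already time-aligned source/target feature vectors (alignment, e.g. by dynamic time warping, is outside this sketch).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, Q=8):
    """Fit lambda = {alpha_i, mu_i, C_i} on z = (x, y) (Expressions 8-10).
    X, Y: (samples x n) aligned source/target feature matrices."""
    Z = np.hstack([X, Y])
    return GaussianMixture(n_components=Q, covariance_type="full",
                           random_state=0).fit(Z)

def convert(gmm, x):
    """F(x) = sum_i h_i(x) [mu_i^y + C_i^yx (C_i^xx)^-1 (x - mu_i^x)]."""
    n = x.shape[0]
    mu_x, mu_y = gmm.means_[:, :n], gmm.means_[:, n:]
    C = gmm.covariances_
    Cxx, Cyx = C[:, :n, :n], C[:, n:, :n]
    # h_i(x): posterior over components given the source vector only;
    # the (2*pi)^(n/2) factors of Expression 12 cancel in the ratio.
    log_h = np.array([
        np.log(gmm.weights_[i])
        - 0.5 * np.linalg.slogdet(Cxx[i])[1]
        - 0.5 * (x - mu_x[i]) @ np.linalg.solve(Cxx[i], x - mu_x[i])
        for i in range(len(gmm.weights_))])
    h = np.exp(log_h - log_h.max())
    h /= h.sum()
    y = np.zeros(mu_y.shape[1])
    for i in range(len(h)):
        y += h[i] * (mu_y[i] + Cyx[i] @ np.linalg.solve(Cxx[i], x - mu_x[i]))
    return y
```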
The working method of the vehicle voice guidance system 100 described with reference to Figs. 1 to 3 is explained below. Fig. 4 is a flowchart showing the working method of the vehicle voice guidance system according to an embodiment of the present invention.
In step S405, the driver speaks a specific command, and in step S410 the voice characteristic information extractor 120 extracts characteristic information from the speaker's voice.
Then, in step S415, the speaker voice analyzer 130 analyzes gender, age, preference and the like in real time from the characteristic information.
Then, in step S420, the TTS DB extractor 140 selects from the TTS DB 150 the information corresponding to each analysis result.
Then, in step S425, the speaker voice tuner 160 tunes the speech-converted information according to the information selected by the TTS DB extractor 140.
Then, in step S430, the speaker voice converter 180 converts the tuned speech into actual speech close to the driver's voice, according to the GMM obtained from the speaker's voice.
Then, in step S435, the TTS output unit (not shown) outputs the voice converted by the speaker voice converter 180.
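Tying the sketches above together, the end-to-end shape of steps S405-S435 might look as follows; the TTS DB stand-in `tts_db` and its fields are hypothetical, and only the gender axis of the analysis is shown.

```python
import numpy as np

def guidance_for(driver_wave, sr, text, tts_db, joint_gmm):
    feats = extract_features(driver_wave, sr)              # S410
    driver_f0 = np.exp(feats["log_f0"])
    profile = classify_gender(driver_f0)                   # S415 (gender only)
    ref = tts_db[profile]                                  # S420: reference info
    base = ref["synthesize"](text)                         # base TTS waveform
    tuned = tune_voice(base, sr, ref["mean_f0"],           # S425: pitch tuning
                       float(np.mean(driver_f0)))
    # S430 would map each spectral frame of `tuned` through
    # convert(joint_gmm, frame) before output (S435).
    return tuned
```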
An example of the present invention has been described above with reference to Figs. 1 to 4. Preferred configurations of the present invention that can be derived from this example are described below.
According to a preferred embodiment of the present invention, the target information voice output control apparatus comprises a characteristic information generation unit, an object information generation unit, an object information output unit, a power supply unit and a main control unit.
The function of the power supply unit is to supply power to each component of the target information voice output control apparatus. The function of the main control unit is to control all the work of each component of the apparatus. When the apparatus is applied to a vehicle, the power supply unit and the main control unit may be omitted from this embodiment without harm.
The function of the characteristic information generation unit is to generate the user's characteristic information from the user's voice information. The characteristic information generation unit is a concept corresponding to the voice characteristic information extractor 120 of Fig. 1.
The characteristic information generation unit extracts at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information and log spectrum information from the voice information, and generates the characteristic information in real time from the at least one piece of information.
The characteristic information generation unit can generate the characteristic information in real time, the characteristic information including at least one of the user's gender information, age information and emotion information. This characteristic information generation unit is a concept combining the voice characteristic information extractor 120 and the speaker voice analyzer 130 of Fig. 1.
The characteristic information generation unit can generate the characteristic information after removing noise information from the voice information. This characteristic information generation unit is a concept combining the noise remover 110 and the voice characteristic information extractor 120 of Fig. 1.
The characteristic information generation unit can generate the characteristic information by applying to the voice information weight information obtained by training on input information corresponding to the voice information and target information for each piece of input information.
The characteristic information generation unit can obtain the weight information using an artificial neural network (ANN) algorithm, an error back-propagation (EBP) algorithm and the gradient descent method.
The function of the object information generation unit is to generate second object information in speech form from first object information in text form according to the characteristic information.
The object information generation unit extracts reference information corresponding to the characteristic information from the database, and generates the second object information by tuning, according to this reference information, the information obtained by converting the first object information into speech. This object information generation unit is a concept combining the TTS DB 150, the TTS DB extractor 140 and the speaker voice tuner 160 of Fig. 1.
The object information generation unit can generate the second object information by tuning the information obtained by converting the first object information into speech according to pitch period information or frequency (log f0) information obtained from the reference information.
The object information generation unit can generate the second object information according to the reference information and the speaker identification information obtained from the characteristic information. This object information generation unit is a concept combining the TTS DB 150, the TTS DB extractor 140, the speaker voice tuner 160, the GMM extractor 170 and the speaker voice converter 180.
The object information generation unit can obtain the speaker identification information according to a Gaussian mixture model (GMM).
The working method of the target information voice output control apparatus is described below.
First, the characteristic information generation unit generates the user's characteristic information from the user's voice information.
Then, the object information generation unit generates second object information in speech form from first object information in text form according to the characteristic information.
Then, the object information output unit outputs the second object information.
All the components constituting the embodiments of the present invention have been described above as combined into one or operating in combination, but the present invention is not necessarily limited to these embodiments. That is, within the scope of the object of the present invention, one or more of all the components may be selectively combined and operated. Also, all the components may each be implemented as a piece of independent hardware, or some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the functions combined in one or more pieces of hardware. This computer program can be stored in computer-readable media such as USB memories, CDs or flash memories, and read and executed by a computer to realize the embodiments of the present invention. The storage media of the computer program may include magnetic storage media, optical storage media, carrier wave media and the like.
Further, all terms, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention belongs, unless defined otherwise in the description. Terms in general use, as defined in dictionaries, should be interpreted as consistent with the meaning they have in the context of the related art, and unless defined in the present invention, are not to be interpreted in an idealized or overly formal sense.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the above embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (15)

1. utilize an object information voice output controller for the phonetic feature of user, it is characterized in that, comprising:
Characteristic information generating unit, it generates the characteristic information of described user according to the voice messaging of user;
Object information generating unit, it is according to described characteristic information, utilizes the first object information of textual form to generate the second object information of speech form; And
Object information efferent, it exports described second object information.
2. the object information voice output controller utilizing the phonetic feature of user according to claim 1, is characterized in that:
Described characteristic information generating unit extracts at least one information resonance peak information, frequency information, linear predictor coefficient information, spectral enveloping line information, energy information, speech rate information and logarithmic spectrum information from described voice messaging, and generates described characteristic information in real time according to described at least one information.
3. the object information voice output controller utilizing the phonetic feature of user according to claim 1, is characterized in that:
Described characteristic information generating unit generates at least one information in the emotion information of the gender information of described user, the age information of described user and described user in real time as described characteristic information.
4. the object information voice output controller utilizing the phonetic feature of user according to claim 1, is characterized in that:
Described characteristic information is generated after described characteristic information generating unit removes noise information from described voice messaging.
5. the object information voice output controller utilizing the phonetic feature of user according to claim 1, is characterized in that:
Described characteristic information generating unit is suitable for weight information to described voice messaging and generates described characteristic information, and wherein, described weight information is the information obtained corresponding to the input information of described voice messaging and the target information of each input information by study.
6. the object information voice output controller utilizing the phonetic feature of user according to claim 5, is characterized in that:
Described characteristic information generating unit utilizes artificial neural network algorithm, error backpropagation algorithm and gradient descent method to obtain described weight information.
7. the object information voice output controller utilizing the phonetic feature of user according to claim 1, is characterized in that:
Described object information generating unit is extracted and is corresponded to the reference information of described characteristic information from database, and converts to described first object information the information that voice obtain according to described reference information to and carry out described second object information of adjustment generation.
8. the object information voice output controller utilizing the phonetic feature of user according to claim 7, is characterized in that:
Described object information generating unit, according to the speech rate information obtained from described reference information or frequency information, converts to described first object information information that voice obtain to and carries out adjustment and generate described second object information.
9. the object information voice output controller utilizing the phonetic feature of user according to claim 7, is characterized in that:
Described object information generating unit generates described second object information according to described reference information and the speaker identification information obtained from described characteristic information.
10. the object information voice output controller utilizing the phonetic feature of user according to claim 9, is characterized in that:
Described object information generating unit obtains described speaker identification information according to gauss hybrid models.
11. 1 kinds of object information voice output control methods utilizing the phonetic feature of user, is characterized in that, comprising:
The step of the characteristic information of described user is generated according to the voice messaging of user;
According to described characteristic information, the first object information of textual form is utilized to generate the step of the second object information of speech form; And
Export the step of described second object information.
The 12. object information voice output control methods utilizing the phonetic feature of user according to claim 11, is characterized in that:
Generate the step of described characteristic information specifically, extract at least one information resonance peak information, frequency information, linear predictor coefficient information, spectral enveloping line information, energy information, speech rate information and logarithmic spectrum information from described voice messaging, and generate described characteristic information in real time according to described at least one information.
The 13. object information voice output control methods utilizing the phonetic feature of user according to claim 11, is characterized in that:
Generate the step of described characteristic information specifically, at least one information in the emotion information of the gender information of described user, the age information of described user and described user of generating in real time is as described characteristic information.
The 14. object information voice output control methods utilizing the phonetic feature of user according to claim 11, is characterized in that:
Generate the step of described second object information specifically, extract from database and correspond to the reference information of described characteristic information, and according to described reference information to the information that voice obtain is converted to described first object information and carry out described second object information of adjustment generation.
15. The target information voice output control method using voice characteristics of a user according to claim 14, wherein:
the generating of the second target information comprises generating the second target information according to the reference information and the speaker identification information obtained from the characteristic information.
CN201510657714.4A 2014-10-28 2015-10-13 Apparatus and method for controlling target information voice output through using voice characteristics of user Pending CN105575383A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2014-0147474 2014-10-28
KR1020140147474A KR102311922B1 (en) 2014-10-28 2014-10-28 Apparatus and method for controlling outputting target information to voice using characteristic of user voice

Publications (1)

Publication Number Publication Date
CN105575383A true CN105575383A (en) 2016-05-11

Family

ID=55885440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510657714.4A Pending CN105575383A (en) 2014-10-28 2015-10-13 Apparatus and method for controlling target information voice output through using voice characteristics of user

Country Status (2)

Country Link
KR (1) KR102311922B1 (en)
CN (1) CN105575383A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504743A * 2016-11-14 2017-03-15 Beijing Guangnian Wuxian Technology Co., Ltd. Voice interaction output method for an intelligent robot, and robot
CN108519870A * 2018-03-29 2018-09-11 Lenovo (Beijing) Co., Ltd. Information processing method and electronic device
CN108922540A * 2018-07-27 2018-11-30 Chongqing Youbanjia Technology Co., Ltd. Method and system for conducting continuous AI dialogue with elderly users

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101864824B1 * 2016-11-03 2018-06-05 Sejong University Industry-Academia Cooperation Foundation Apparatus and method for reliability measurement of speaker
KR102441066B1 * 2017-10-12 2022-09-06 Hyundai Motor Company Voice formation system of vehicle and method thereof
KR102247902B1 2018-10-16 2021-05-04 LG Electronics Inc. Terminal
KR102479899B1 * 2019-07-30 2022-12-21 KT Corporation Server, device and method for providing speech synthesis service
KR102351021B1 * 2019-11-15 2022-01-14 Selvas AI Inc. Method for screening voice training data and apparatus using the same
KR102277205B1 * 2020-03-18 2021-07-15 Humelo Inc. Apparatus for converting audio and method thereof
WO2024043592A1 * 2022-08-26 2024-02-29 Samsung Electronics Co., Ltd. Electronic device, and method for controlling rate of text to speech conversion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130078919A * 2012-01-02 2013-07-10 Hyundai Mobis Co., Ltd. Hands-free apparatus for vehicle and control method thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101375329A * 2005-03-14 2009-02-25 Voxonic, Inc. An automatic donor ranking and selection system and method for voice conversion
CN101356427A * 2006-01-24 2009-01-28 Cisco Technology, Inc. Email text-to-speech conversion in sender's voice
CN102834842A * 2010-03-23 2012-12-19 Nokia Corporation Method and apparatus for determining a user age range
CN103516854A * 2012-06-15 2014-01-15 Samsung Electronics Co., Ltd. Terminal apparatus and control method thereof
CN103680512A * 2012-09-03 2014-03-26 Hyundai Mobis Co., Ltd. Speech recognition level improving system and method for vehicle array microphone
CN103236259A * 2013-03-22 2013-08-07 LG Electronics R&D Center (Shanghai) Co., Ltd. Voice recognition processing and feedback system, and voice response method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Alexander Kain and Michael W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP '98, IEEE. *

Also Published As

Publication number Publication date
KR102311922B1 (en) 2021-10-12
KR20160049804A (en) 2016-05-10

Similar Documents

Publication Publication Date Title
CN105575383A (en) Apparatus and method for controlling target information voice output through using voice characteristics of user
Shahamiri Speech vision: An end-to-end deep learning-based dysarthric automatic speech recognition system
EP4002362B1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
CN108172218B (en) Voice modeling method and device
Morgan Deep and wide: Multiple layers in automatic speech recognition
Schuller et al. Emotion recognition in the noise applying large acoustic feature sets
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
Hojo et al. An Investigation of DNN-Based Speech Synthesis Using Speaker Codes.
CN108806667A (en) The method for synchronously recognizing of voice and mood based on neural network
Bhat et al. Automatic assessment of sentence-level dysarthria intelligibility using BLSTM
CN105760852A (en) Driver emotion real time identification method fusing facial expressions and voices
CN108364639A (en) Speech processing system and method
KR102221513B1 (en) Voice emotion recognition method and system
Malcangi Text-driven avatars based on artificial neural networks and fuzzy logic
CN109979436B (en) BP neural network voice recognition system and method based on spectrum self-adaption method
CN105206257A (en) Voice conversion method and device
KR102505927B1 (en) Deep learning-based emotional text-to-speech apparatus and method using generative model-based data augmentation
Henter et al. Gaussian process dynamical models for nonparametric speech representation and synthesis
KR20200084443A (en) System and method for voice conversion
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
Airaksinen et al. Data augmentation strategies for neural network F0 estimation
CN115836300A (en) Self-training WaveNet for text-to-speech
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Sarma et al. Phoneme-based speech segmentation using hybrid soft computing framework
Rajput et al. Back propagation feed forward neural network approach for speech recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160511