KR20160049804A - Apparatus and method for controlling outputting target information to voice using characteristic of user voice - Google Patents

Apparatus and method for controlling outputting target information to voice using characteristic of user voice Download PDF

Info

Publication number
KR20160049804A
Authority
KR
South Korea
Prior art keywords
information
voice
target
characteristic
user
Prior art date
Application number
KR1020140147474A
Other languages
Korean (ko)
Inventor
Kwon Oh-hyun
Original Assignee
Hyundai Mobis Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hyundai Mobis Co., Ltd.
Priority to KR1020140147474A priority Critical patent/KR20160049804A/en
Publication of KR20160049804A publication Critical patent/KR20160049804A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06NCOMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computer systems based on biological models
    • G06N3/02Computer systems based on biological models using neural network models
    • G06N3/08Learning methods
    • G06N3/084Back-propagation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Abstract

The present invention provides an apparatus and a method for controlling a voice output of target information using voice characteristics of a user, which provide a text to speech (TTS) service based on characteristic information obtained from the voice of the user. According to the present invention, the apparatus for controlling a voice output of target information comprises: a characteristic information generation unit for generating characteristic information of a user based on voice information of the user; a target information generation unit for generating second target information in the form of speech from first target information in the form of text based on the characteristic information; and a target information output unit for outputting the second target information.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an apparatus and method for controlling a target information voice output using a voice characteristic of a user.

The present invention relates to a control apparatus and method for outputting target information by voice. More particularly, the present invention relates to a control apparatus and method for outputting target information from a vehicle by voice.

In general, TTS (Text To Speech) is a technology that converts characters or symbols into speech. TTS builds a phonetic database of speech units and creates continuous speech by concatenating them; in doing so, it is important to synthesize a natural voice by adjusting the magnitude and duration of the speech.

That is, TTS is a character-to-speech conversion technology for converting a string (sentence) into speech, and it is divided into three stages: language processing, prosody generation, and waveform synthesis. When text is input, the system analyzes the document structure, generates prosody matching the way a human would read the analyzed structure, and concatenates the basic units of a stored speech DB according to the generated prosody to produce a synthesized voice.

TTS has no limit on the target vocabulary. Because it converts arbitrary text into voice, the system can produce more natural and varied speech by combining phonetics, speech analysis, speech synthesis, and speech recognition technologies.
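For reference, the conventional fixed-voice TTS behavior described above can be reproduced with an off-the-shelf engine. The sketch below uses the pyttsx3 package purely to illustrate that baseline (one stored voice for every listener); it is not part of the disclosed apparatus.

```python
import pyttsx3

# Conventional TTS: the same stored voice and rate are used regardless of who is listening.
engine = pyttsx3.init()
engine.setProperty("rate", 170)          # fixed speaking rate
engine.say("Turn right in 300 meters.")  # guidance text to synthesize
engine.runAndWait()
```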

However, when the conventional TTS terminal outputs a voice message such as a text message, it always outputs the same voice regardless of the other party, thereby failing to satisfy the needs of various users.

Korean Patent Publication No. 2011-0032256 proposes a TTS announcement device. However, the above-mentioned problem cannot be solved because that apparatus merely converts designated text into speech.

SUMMARY OF THE INVENTION

The present invention has been conceived to solve the problems described above, and it is an object of the present invention to provide an apparatus and method for controlling a target information audio output using a voice characteristic of a user, the apparatus and method providing a TTS (Text To Speech) service based on characteristic information obtained from the user's voice.

However, the objects of the present invention are not limited to those mentioned above, and other objects not mentioned can be clearly understood by those skilled in the art from the following description.

According to an aspect of the present invention, there is provided an apparatus for controlling a target information audio output, comprising: a characteristic information generation unit configured to generate characteristic information of a user based on voice information of the user; a target information generation unit configured to generate second target information in the form of speech from first target information in the form of text based on the characteristic information; and a target information output unit configured to output the second target information.

Preferably, the characteristic information generation unit extracts at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information, and log spectrum information from the voice information, and generates the characteristic information in real time based on the extracted information.

Preferably, the characteristic information generating unit generates at least one of the gender information of the user, the age information of the user, and the emotion information of the user in real time as the characteristic information.

Preferably, the characteristic information generating unit removes noise information from the voice information and then generates the characteristic information.

Preferably, the characteristic information generating unit generates the characteristic information by applying, to the voice information, weight information obtained by training with input information corresponding to voice information and target information for each piece of input information.

Preferably, the characteristic information generating unit obtains the weight information using an ANN (Artificial Neural Network) algorithm, an EBP (Error Back Propagation) algorithm, and a Gradient Descent Method.

Preferably, the target information generating unit extracts reference information corresponding to the characteristic information from a database, and generates the second target information by tuning, based on the reference information, the information obtained by converting the first target information into speech.

Preferably, the target information generating unit generates the second target information by tuning the information obtained by converting the first target information into speech based on pitch period information or frequency (log f0) information obtained from the reference information.

Preferably, the target information generating unit generates the second target information based on speaker identification information obtained from the characteristic information together with the reference information.

Preferably, the target information generating unit obtains the speaker identification information based on a Gaussian mixture model (GMM).

According to another aspect of the present invention, there is provided a method of controlling a target information audio output using a voice characteristic of a user, the method comprising: generating characteristic information of the user based on voice information of the user; generating second target information in the form of speech from first target information in the form of text based on the characteristic information; and outputting the second target information.

Preferably, the step of generating the characteristic information includes extracting at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information, and log spectrum information from the voice information, and generating the characteristic information in real time based on the extracted information.

Preferably, the step of generating the characteristic information generates at least one of the gender information of the user, the age information of the user, and the emotion information of the user in real time as the characteristic information.

Preferably, the generating of the characteristic information generates the characteristic information after removing the noise information from the voice information.

Preferably, the step of generating the characteristic information generates the characteristic information by applying, to the voice information, weight information obtained by training with input information corresponding to voice information and target information for each piece of input information.

Preferably, the step of generating the characteristic information acquires the weight information using an artificial neural network (ANN) algorithm, an error back propagation (EBP) algorithm, and a gradient descent method.

Preferably, the step of generating the second target information includes extracting reference information corresponding to the characteristic information from a database, and generating the second target information by tuning, based on the reference information, the information obtained by converting the first target information into speech.

Preferably, the step of generating the second target information generates the second target information by tuning the information obtained by converting the first target information into speech based on pitch period information or frequency (log f0) information obtained from the reference information.

Preferably, the step of generating the second target information generates the second target information based on speaker identification information obtained from the characteristic information together with the reference information.

Preferably, the step of generating the second target information acquires the speaker identification information based on a Gaussian mixture model (GMM).

The present invention can provide the following effects by providing a TTS (Text To Speech) service based on characteristic information obtained from a user's voice.

First, a natural speech recognition system can be realized by moving away from one-way announcements toward bidirectional communication.

Second, the system provides a TTS service tailored to the driver's sex, age, disposition, and the like, thereby giving the vehicle's voice recognition system a friendly, easy-to-understand character rather than a mechanical one.

FIG. 1 is a conceptual diagram illustrating an internal configuration of a voice guidance system for a vehicle according to an embodiment of the present invention.
FIGS. 2 and 3 are reference views for explaining a speaker voice analyzer constituting the vehicle voice guidance system shown in FIG. 1.
FIG. 4 is a flowchart illustrating an operation method of the vehicle voice guidance system according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals are used to designate the same or similar components throughout the drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear. In addition, the preferred embodiments of the present invention will be described below, but it is needless to say that the technical idea of the present invention is not limited thereto and can be variously modified by those skilled in the art.

It is an object of the present invention to provide a more natural and familiar voice guidance service by analyzing voice characteristics of a driver in a car.

FIG. 1 is a conceptual diagram illustrating an internal configuration of a voice guidance system for a vehicle according to an embodiment of the present invention.

The vehicle voice guidance system 100 uses the driver's voice to provide voice guidance in a pattern similar to the voice of the current driver. As shown in FIG. 1, the system includes a noise eliminator 110, a voice feature information extractor 120, a speaker voice analyzer 130, a TTS DB extractor 140, a TTS DB 150, a speaker voice tuner 160, a GMM model extractor 170, and a speaker voice converter 180.

Generally, the navigation guidance voice or the voice recognition prompt voice in a vehicle uses a specific TTS DB that is fixed at the time of production. It therefore does not adequately satisfy consumers' needs for voice guidance differentiated by age, gender, and driver propensity. For example, a youthful voice in the style of a speaker in his or her twenties may be difficult for older people to understand, while a mild voice in the style of a speaker in his or her fifties may seem boring and dull to the younger generation.

The vehicle voice guidance system 100 according to the present invention aims to provide a familiar and easy-to-understand voice quality to all drivers, whether young, middle-aged, or elderly, male or female.

In addition, as voice recognition technology moves toward bidirectional communication, the vehicle voice guidance system 100 can distinguish drivers by using a speaker recognition function, so that a function suitable for the driver can be proactively proposed, in keeping with the trend toward artificial intelligence.

This will be described in more detail with reference to FIG. 1.

The noise eliminator 110 performs a function of removing a noise component from the voice information when the voice information of the speaker is input. The noise eliminator 110 removes noise in the vehicle to acquire a clearer driver's voice.
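A rough illustration of such noise removal is spectral subtraction: a noise spectrum is estimated from a short lead-in segment before the utterance and subtracted from the signal. The sketch below is an assumption-laden example (library choice, segment length, and frame sizes are illustrative), not the disclosed noise eliminator 110.

```python
import numpy as np
import librosa

def spectral_subtract(y, sr, noise_seconds=0.3, n_fft=512, hop_length=128):
    """Remove roughly stationary vehicle noise estimated from the leading noise-only frames."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    mag, phase = np.abs(S), np.angle(S)

    # Average magnitude spectrum of the first `noise_seconds` of audio (assumed noise only)
    noise_frames = max(1, int(noise_seconds * sr / hop_length))
    noise_profile = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate, floor at zero, and resynthesize with the original phase
    cleaned_mag = np.maximum(mag - noise_profile, 0.0)
    return librosa.istft(cleaned_mag * np.exp(1j * phase), hop_length=hop_length)
```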

The voice feature information extractor 120 extracts voice feature information of the speaker from the voice information from which the noise component is removed. The voice feature information extractor 120 extracts feature information of each individual voice to analyze the age, sex, inclinations, etc. of the speaker.

The voice feature information extractor 120 extracts, from the voice information, formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information, log spectrum information, and the like.
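For illustration, a minimal sketch of this kind of feature extraction follows, assuming the librosa library; the function name, sampling rate, LPC order, and f0 search range are illustrative choices, not values from the disclosure.

```python
import numpy as np
import librosa

def extract_voice_features(path, sr=16000, lpc_order=26):
    """Extract a few of the features named above: f0, energy, LPC, and log spectrum."""
    y, sr = librosa.load(path, sr=sr)

    # Fundamental frequency (f0) track via probabilistic YIN
    f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
    mean_f0 = float(np.nanmean(f0))          # average pitch over voiced frames

    # Frame-wise energy (RMS)
    energy = librosa.feature.rms(y=y)[0]

    # Linear predictive coefficients over the utterance
    lpc = librosa.lpc(y, order=lpc_order)

    # Log magnitude spectrum per frame
    log_spectrum = np.log(np.abs(librosa.stft(y)) + 1e-10)

    return {"mean_f0": mean_f0, "energy": energy, "lpc": lpc, "log_spectrum": log_spectrum}
```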

The speaker voice analyzer 130 performs a function of classifying the age, sex, propensity, and the like of the speaker using the voice feature information extracted by the voice feature information extractor 120. The speaker voice analyzer 130 can use the log f0 information when classifying gender: if the average f0 value is 120 Hz to 240 Hz, the speaker can be judged to be female, and if it is 0 Hz to 120 Hz, the speaker can be judged to be male.
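As a concrete rendering of the threshold rule above, a minimal sketch follows; it uses the 120 Hz boundary stated in the text and the hypothetical extract_voice_features helper from the previous sketch.

```python
def classify_gender_by_f0(mean_f0):
    """Coarse gender decision from the average f0, using the bands given above."""
    if 120.0 <= mean_f0 <= 240.0:
        return "female"
    if 0.0 < mean_f0 < 120.0:
        return "male"
    return "unknown"  # outside the stated ranges the rule does not decide

# Example usage (assumes the extract_voice_features sketch shown earlier):
# features = extract_voice_features("driver_utterance.wav")
# print(classify_gender_by_f0(features["mean_f0"]))
```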

When individual voice feature information is extracted by the voice feature information extractor 120, the speaker voice analyzer 130 performs modeling using an artificial neural network (ANN) algorithm and obtains weight information of the artificial neural network analyzed by age, gender, and propensity of the user. Based on the extracted generalized weight information (i.e., the modeling result data obtained using the artificial neural network algorithm), the speaker voice analyzer 130 can estimate the speaker's age, gender, and propensity from the voice feature information of the driver input in real time.

To estimate the age, sex, and propensity of a speaker, the speaker voice analyzer 130 can use, as artificial neural network algorithms, a neural network for age analysis, a neural network for gender analysis, a neural network for propensity analysis, and the like.

Hereinafter, the speaker voice analyzer 130 will be described in detail with reference to FIGS. 2 and 3.

FIGS. 2 and 3 are reference views for explaining the speaker voice analyzer constituting the vehicle voice guidance system shown in FIG. 1.

An artificial neural network (ANN) algorithm models human brain activity as connections between neurons and uses the model for classification. In the present embodiment, the speaker voice analyzer 130 performs the following two steps in sequence to implement the artificial neural network algorithm. FIG. 2 is a reference diagram for explaining the structure of a neuron (processing element) of the artificial neural network applied to the present invention.

1. Training (Modeling)

The speaker voice analyzer 130 inputs a large number of input vectors and target vectors into a given neural network in order to classify the patterns, and acquires optimized connection weights (weights 220).

2. Classification

In the classification step, the speaker voice analyzer 130 calculates the output value 240 through the equation 230 from the weights 220 learned in the training step and the input vector 210. The speaker voice analyzer 130 may compare the weights 220 with the input vector 210 and determine the closest output as the final result. In the equation 230, θ represents a threshold value.
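A minimal sketch of such a neuron's forward computation follows, assuming a sigmoid activation; the variable names and the choice of activation are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(x, w, theta):
    """Single processing element: weighted sum of inputs minus threshold, then activation."""
    net = np.dot(w, x) - theta   # corresponds to the form of equation 230
    return sigmoid(net)

# Example: a 3-dimensional input vector with learned weights and a threshold
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, -0.3, 0.8])
print(neuron_output(x, w, theta=0.1))
```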

The speaker voice analyzer 130 can apply a multi-layer perceptron as the artificial neural network algorithm to analyze the speaker's age, sex, and disposition from the speaker's voice feature information. In particular, the EBP (Error Back Propagation) algorithm can be applied. This will be described in more detail with reference to FIG. 3, which is a reference diagram showing the structure of the EBP algorithm applied to the present invention.

Conventionally, perceptron theory has been applied to speech for recognizing a voice (judging what the content of an input utterance is) or for discriminating human emotion.

A multilayer perceptron is a neural network with one or more intermediate layers between the input and output layers. The layers are connected in the direction from the input layer through the hidden layer to the output layer, and the network is a feedforward network in which there is no direct connection from the output layer back to the input layer.

In order to apply the multi-layer perceptron to the speech analyzer 130, the present invention adopts the EBP algorithm.

The EBP algorithm uses a network with one or more hidden layers between the input layer and the output layer. Using the generalized delta rule, a cost function is defined as the sum of squares of the error between the desired target value (D_pj) and the actual output value (O_pj), as shown in Equation (1), and learning then proceeds in the direction that minimizes this cost function value by the gradient descent method to obtain the desired weight values.

E_p = \frac{1}{2} \sum_{j} \left( D_{pj} - O_{pj} \right)^{2} \qquad (1)

In the above, p denotes the p-th learning pattern, and E_p denotes the error with respect to the p-th pattern. D_pj denotes the j-th element of the target output for the p-th pattern, and O_pj denotes the j-th element of the actual output.

Using the EBP algorithm described above, the speaker voice analyzer 130 calculates the hidden layer error from the error generated in the output layer and propagates it back toward the input layer, so that the weight values can be optimized until the output layer error reaches the desired level.

The speaker voice analyzer 130 can perform a training step using the EBP algorithm according to the following procedure.

In the first step, the weight and the threshold value are initialized.

In the second step, the input vector and the target vector are presented.

In the third step, the input value to the j-th neuron of the hidden layer is calculated using the presented input vector. Equation (2) can be used at this time.

\mathrm{net}_{pj} = \sum_{i} W_{ji} X_{pi} + \theta_{j} \qquad (2)

In the above, net_pj denotes the input value to the j-th neuron of the hidden layer, W_ji denotes the connection weight between the j-th hidden neuron and the i-th input neuron, X_pi denotes the input vector, and θ_j denotes a threshold value.

In the fourth step, the output O_pj of the hidden layer is calculated using a sigmoid function.

Then, in the fifth step, the input value to the output layer neuron k is calculated using the output of the hidden layer. Equation (3) can be used at this time.

\mathrm{net}_{pk} = \sum_{j} W_{kj} O_{pj} + \theta_{k} \qquad (3)

In the above, net_pk denotes the input value to the output layer neuron k.

Then, in the sixth step, the output O_pk of the output layer is calculated using the sigmoid function f'(·).

In the seventh step, an error between the target output and the actual output of the input pattern is calculated, and the sum of the output layer errors is accumulated as the error of the learning pattern. At this time, equation (4) can be used.

\delta_{pk} = d_{pk} - O_{pk}, \qquad E_p = \frac{1}{2} \sum_{k} \delta_{pk}^{2}, \qquad E = \sum_{p} E_p \qquad (4)

Here, d_pk denotes the target output of the input pattern, and O_pk denotes the actual output of the input pattern. δ_pk is the error between the target output and the actual output, E denotes the sum of the output layer errors, and E_p denotes the error of the learning pattern.

Then, in the eighth step, the error δ_pj of the hidden layer is calculated using the output layer error δ_pk, the weight W_kj between the hidden layer and the output layer, and the like. Equation (5) can be used at this time.

\delta_{pj} = f'(\mathrm{net}_{pj}) \sum_{k} \delta_{pk} W_{kj} \qquad (5)

Then, in the ninth step, the weight W_kj of the output layer is updated using the output value O_pj of the hidden layer neuron j and the error value δ_pk of the output layer obtained in the fourth and seventh steps; the threshold value is also adjusted. Equation (6) can be used at this time.

W_{kj}(t+1) = W_{kj}(t) + \eta \, \delta_{pk} \, O_{pj}, \qquad \theta_{k}(t+1) = \theta_{k}(t) + \beta \, \delta_{pk} \qquad (6)

In the above, η and β denote gain values, and t denotes time.

In the tenth step, the weight values and threshold values between the input layer and the hidden layer are updated in the same manner as for the output layer. Equation (7) can be used at this time.

W_{ji}(t+1) = W_{ji}(t) + \eta \, \delta_{pj} \, X_{pi}, \qquad \theta_{j}(t+1) = \theta_{j}(t) + \beta \, \delta_{pj} \qquad (7)

Thereafter, in the eleventh step, the procedure returns to the second step and is repeated until all the learning patterns have been presented.

In the twelfth step, if the error sum E of the output layer is less than the allowable value, or if the number of repetitions exceeds the maximum, the process is terminated; otherwise, the process returns to the second step and the procedure is repeated.
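The twelve-step procedure above can be condensed into a short NumPy sketch of a one-hidden-layer network trained by error back propagation with gradient descent, following Equations (2) through (7) with sigmoid activations (so f'(net) = O(1 − O)). The array shapes, learning rates, and stopping values are illustrative assumptions, not values from the disclosure. In the disclosed system, X would hold voice feature vectors (formants, f0, LPC coefficients, energy) and D the coded age, gender, and propensity targets.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_ebp(X, D, n_hidden=8, eta=0.1, beta=0.1, max_epochs=1000, tol=1e-3, seed=0):
    """Error back propagation for a one-hidden-layer perceptron.

    X : (P, n_in) input vectors, D : (P, n_out) target vectors.
    Returns the learned weights and thresholds.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], D.shape[1]

    # Step 1: initialize weights and thresholds
    W_ji = rng.uniform(-0.5, 0.5, (n_hidden, n_in))    # input -> hidden
    theta_j = np.zeros(n_hidden)
    W_kj = rng.uniform(-0.5, 0.5, (n_out, n_hidden))   # hidden -> output
    theta_k = np.zeros(n_out)

    for _ in range(max_epochs):
        E = 0.0
        for x_p, d_p in zip(X, D):                     # Step 2: present input/target vectors
            net_pj = W_ji @ x_p + theta_j              # Step 3: Eq. (2)
            O_pj = sigmoid(net_pj)                     # Step 4
            net_pk = W_kj @ O_pj + theta_k             # Step 5: Eq. (3)
            O_pk = sigmoid(net_pk)                     # Step 6

            err = d_p - O_pk                           # Step 7: output error, Eq. (4)
            E += 0.5 * np.sum(err ** 2)
            delta_pk = err * O_pk * (1.0 - O_pk)       # generalized delta rule

            # Step 8: hidden-layer error, Eq. (5)
            delta_pj = O_pj * (1.0 - O_pj) * (W_kj.T @ delta_pk)

            # Steps 9-10: weight and threshold updates, Eqs. (6) and (7)
            W_kj += eta * np.outer(delta_pk, O_pj)
            theta_k += beta * delta_pk
            W_ji += eta * np.outer(delta_pj, x_p)
            theta_j += beta * delta_pj

        # Steps 11-12: repeat over all patterns until the error sum is small enough
        if E < tol:
            break
    return W_ji, theta_j, W_kj, theta_k
```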

On the other hand, when there are a plurality of speakers, the speaker voice analyzer 130 can use a multilayer perceptron to analyze the age, sex, propensity, and the like of each speaker from that speaker's voice characteristic information. This will be described below.

According to the general noise filtering method, the utterance to be recognized begins a certain time after the speech recognition microphone is opened, so the signal coming in before the utterance is regarded as in-vehicle noise and only that noise is filtered out of the signal.

However, even though a directional microphone is aimed at the driver, only the signal received during the short time before the utterance is judged to be noise, so when speech also comes from other seats it becomes mixed into the recognized signal and the speech recognition rate is lowered.

Therefore, in the present invention, the directional microphones are respectively installed in the four seat areas in the vehicle, and the microphone signals of the different areas are discriminated as noise and filtered based on the input signal of the microphone in the driver area. In the process of processing the signal, the characteristics of the driver in the driver area are determined in real time, and information suitable for the driver is provided in the multimedia device.

In the following description, the driver's seat is defined as area A, the passenger seat is defined as area B, and the back of the driver's seat and the back of the passenger seat are defined as C area and D area, respectively.

When the driver starts the voice recognition function, the microphones of areas A, B, C, and D are opened simultaneously and receive the voice signals of all four areas. Vehicle noise, which is not a human voice, is nearly identical at the four microphones and can therefore be filtered out as a common component. The voices of the four areas are then analyzed. First, the speech vector values representing gender are analyzed for the four areas; when a vector value indicating a gender different from that of area A is detected, the corresponding component is filtered out of the area A signal. Once the gender analysis is complete, age, mood, and condition are analyzed in the same way.

Although the driver's voice signal is the largest in area A, this method is used because it is difficult to extract only the driver's voice perfectly in area A when there are voice signals in areas B, C, and D.

At this time, correlation, ICA (independent component analysis), beamforming, or other algorithms can be used to determine whether the signals are independent of or similar to one another.

The individual characteristics of the speaker can be grasped while filtering across the four microphones, and the recognition rate can be increased by noise filtering that uses the information obtained from those individual characteristics.
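As a rough illustration of comparing the four microphone signals, the sketch below uses frame-wise correlation against the driver-area (area A) signal to decide which parts of the other channels are treated as interfering speech and attenuates them. The framing, correlation threshold, and attenuation factor are illustrative assumptions; a production system could instead use the correlation, ICA, or beamforming methods mentioned above.

```python
import numpy as np

def suppress_other_areas(mic_a, other_mics, frame_len=1024, corr_thresh=0.6):
    """Keep the area A (driver) signal; attenuate frames of other-area signals that are
    weakly correlated with area A, i.e., likely independent speech from another seat."""
    cleaned = mic_a.astype(float).copy()
    n_frames = len(mic_a) // frame_len
    for m in other_mics:
        for f in range(n_frames):
            s = slice(f * frame_len, (f + 1) * frame_len)
            a, b = mic_a[s].astype(float), m[s].astype(float)
            if np.std(a) == 0.0 or np.std(b) == 0.0:
                continue
            corr = np.corrcoef(a, b)[0, 1]
            # Low correlation with area A suggests independent speech in another seat;
            # subtract a fraction of that channel from the driver-area estimate.
            if abs(corr) < corr_thresh:
                cleaned[s] -= 0.5 * b
    return cleaned
```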

A vehicle generally has four designated seats, and the in-vehicle voice recognition system is usually used by the driver. When an occupant of another seat speaks while the driver is using the voice recognition system, it is difficult to recognize the driver's command. In currently used speech recognition systems, a section with no speech is set in front of the speech recognition section, that section is treated as noise, and that noise is filtered out of the section in which speech is received.

The present invention is a technology for extracting speech characteristics using perceptron theory to identify the characteristics of a speaker and for providing information suited to that speaker in real time using the resulting data. Using the perceptron, it is possible (1) to provide customized information according to the characteristics of the speaker, and (2) to recognize the position of the speaker and provide the function that the speaker at that position wants. Items (1) and (2) are described in more detail below.

1. Provide customized information according to speaker characteristics

When a system is constructed using a multilayer perceptron, the driver's voice can be extracted even when several voices are mixed together. The method is not limited to the driver; it can be applied to the other occupants as well. For example, only the voice characteristic of area A is extracted, and the voice signals of the remaining areas B, C, and D are ignored.

In the case of the perceptron, it is assumed that an algorithm has already been established in advance by training with the back propagation technique on a large database.

For example, in perceptron modeling, the characteristics (formants, fundamental frequency, energy values, LPC values, etc.) of a woman in her twenties who is in good condition are extracted and supplied as input, and the output target is set to "woman in her twenties in good condition"; the structure then internally undergoes the back propagation process to determine appropriate weight values. By training on people with many such different characteristics, any voice can be traced through the trained structure. The LPC value is a linear predictive coding value, one of the speech coding methods based on the human speech production model, and is represented here as a 26-dimensional vector.

In other words, the speech formants, fundamental frequency, and 26-dimensional LPC vector are supplied as input, the back propagation process works backward from the target, and appropriate weight values are determined for the various targets (for example, a woman from Seoul in her twenties in good condition, a man from the Gyeongsang region in his thirties in bad condition, and so on).

Through this training process, no matter what speech is input, the characteristics of the speech can be known by inputting the feature vectors of the speech into a modeled perceptron structure.
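A minimal classification sketch of this idea follows, using scikit-learn's multilayer perceptron (which is trained by back propagation). The feature layout, class labels, network size, and the random toy data are illustrative assumptions, not values from the disclosure.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy training set: each row is a speaker feature vector (for example, mean f0, mean
# energy, and a 26-dimensional LPC vector -> 28 features); each label encodes an
# age/gender/condition class such as 0 = "20s female, good condition".
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 28))
y_train = rng.integers(0, 4, size=200)

clf = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic",
                    solver="sgd", learning_rate_init=0.1, max_iter=2000)
clf.fit(X_train, y_train)

# Classify the feature vector of a new utterance
x_new = rng.normal(size=(1, 28))
print(clf.predict(x_new))
```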

The criterion for seat selection is PTT (push-to-talk). If there are four PTT buttons, the voice input to the microphone located at the position where PTT was pressed is taken as the voice to be analyzed, and the remaining voices are judged to be noise and filtered out. For example, when the speaker gives a command to the multimedia device, such as searching for nearby restaurants, restaurants suited to the characteristics of the speaker are presented first.

The above procedure is summarized as follows.

First, the PTT position is discriminated and a vector according to the characteristics of each voice signal is extracted.

Then, the characteristic vectors of four signals are input to the multi-layer perceptron structure.

Then, the characteristics of each voice signal are extracted.

Thereafter, signals whose characteristics differ from those of the reference area A speech are determined to be noise and filtered out.

Thereafter, voice recognition is performed using only the extracted voice data of area A, and the meaning of the utterance is determined.

Thereafter, optimized information is provided for the command word from area A.

2. Recognize the position of the speaker and provide the desired function

The criterion for seat selection is PTT. If there are four PTT buttons, the voice input to the microphone located at the position where PTT was pressed is taken as the voice to be analyzed, and the remaining voices are judged to be noise and filtered out. For example, in the case of air conditioning, when a person sitting in area D issues an air-conditioner-related command, the air conditioning level can be changed according to that command for area D only.

Referring back to FIG. 1.

The TTS DB 150 includes reference feature information related to gender (male, female, etc.), reference feature information related to age (teens, twenties, thirties, forties, fifties, sixties, etc.), and reference feature information related to disposition (mild, active, etc.).

The TTS DB extractor 140 performs a function of detecting, from the TTS DB 150, information corresponding to the age, sex, and propensity of a speaker found by the speaker's voice analyzer 130.

The speaker voice tuner 160 performs a function of tuning the voice to be output for the TTS service based on the information detected from the TTS DB 150. The speaker voice tuner 160 may apply tuning to the voice to be output based on, for example, the pitch period information obtained from the driver's voice and the information about high and low frequencies (log f0).
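The kind of pitch tuning described here can be sketched with an off-the-shelf pitch-shift routine: the synthesized TTS waveform is shifted toward the mean f0 estimated from the driver's voice. The use of librosa and the semitone computation are illustrative assumptions, not the disclosed tuner.

```python
import numpy as np
import librosa

def tune_pitch_to_driver(tts_wave, sr, tts_f0, driver_f0):
    """Shift the TTS output so its average pitch approaches the driver's average f0."""
    # Number of semitones between the driver's mean f0 and the TTS voice's mean f0
    n_steps = 12.0 * np.log2(driver_f0 / tts_f0)
    return librosa.effects.pitch_shift(tts_wave, sr=sr, n_steps=n_steps)

# Example usage: raise a 140 Hz TTS voice toward a 190 Hz driver voice
# tuned = tune_pitch_to_driver(tts_wave, sr=16000, tts_f0=140.0, driver_f0=190.0)
```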

The Gaussian Mixture Model (GMM) model extractor 170 performs a function of generating a Gaussian mixture model based on the speech feature information of the speaker extracted by the speech feature information extractor 120.

The speaker voice converter 180 performs a function of further converting the voice by applying a Gaussian mixture model to the voice tuned by the speaker voice tuner 160. In the present invention, the voice tuned by the speaker voice tuner 160 can itself be provided as the voice for the TTS service. However, the present invention is not limited to this; the voice can additionally be converted through the GMM (Gaussian Mixture Model) so that the speaker's voice characteristics are appropriately reflected in real time.

Hereinafter, the speaker voice converter 180 using the Gaussian mixture model will be described in detail.

The Gaussian mixture density of a specific random vector x ∈ R^n is defined as shown in Equation (8).

p(x \mid \lambda) = \sum_{i=1}^{Q} \alpha_{i} \, b_{i}(x) \qquad (8)

In the above, Q denotes the total number of single Gaussian densities, α_i denotes the weight of the i-th single Gaussian density, and each b_i(·) is a Gaussian function having a mean and a covariance as its component parameters.

Here, when b_i(x) is expressed as a single Gaussian density, it is defined as in Equation (9).

Figure pat00009

Therefore, the complete Gaussian mixture density is specified by the following three parameters.

λ = {α_i, μ_i, C_i}, i = 1, ..., Q

If x ∈ R^n is defined as the voice selected by the TTS DB extractor 140 and y ∈ R^n as the driver's voice, then z = (x, y)^T represents the joint density between the voice selected by the TTS DB extractor 140 and the driver's voice. This can be expressed by Equation (10).

p(z \mid \lambda^{z}) = \sum_{i=1}^{Q} \alpha_{i}^{z} \, b_{i}^{z}(z), \qquad z = (x, y)^{T} \qquad (10)

Therefore, the speaker voice converter 180 finds a mapping function F(x) that minimizes the mean square error shown in Equation (11).

\varepsilon = E\left[ \left\| y - F(x) \right\|^{2} \right] \qquad (11)

In the above, E[·] denotes the expectation, and F(x) denotes the spectral vector of the estimated speech. The mapping function that minimizes Equation (11) is the conditional expectation of y given x under the joint Gaussian mixture model, and can be written as follows.

F(x) = E[\, y \mid x \,] = \sum_{i=1}^{Q} p(C_{i} \mid x) \left[ \mu_{i}^{y} + C_{i}^{yx} \left( C_{i}^{xx} \right)^{-1} \left( x - \mu_{i}^{x} \right) \right], \qquad p(C_{i} \mid x) = \frac{\alpha_{i} \, b_{i}(x)}{\sum_{j=1}^{Q} \alpha_{j} \, b_{j}(x)}
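A compact sketch of this GMM-based conversion follows: a joint GMM is fitted on paired spectral frames with scikit-learn, and the minimum-mean-square-error mapping above is applied frame by frame. It is an illustrative reconstruction of the standard GMM conversion scheme under assumed feature dimensions and component counts, not the disclosed converter 180.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=4, seed=0):
    """Fit a GMM on joint vectors z = (x, y): x = TTS-DB spectral frames, y = driver frames."""
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed)
    gmm.fit(Z)
    return gmm

def convert_frame(gmm, x, dim_x):
    """Minimum-MSE mapping F(x) = E[y | x] under the joint GMM."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    mu_x, mu_y = means[:, :dim_x], means[:, dim_x:]

    # Responsibilities p(C_i | x) computed from the marginal model over x
    px = np.array([w * multivariate_normal.pdf(x, m, c[:dim_x, :dim_x])
                   for w, m, c in zip(weights, mu_x, covs)])
    resp = px / px.sum()

    y_hat = np.zeros(mu_y.shape[1])
    for i, r in enumerate(resp):
        Cxx = covs[i][:dim_x, :dim_x]
        Cyx = covs[i][dim_x:, :dim_x]
        y_hat += r * (mu_y[i] + Cyx @ np.linalg.solve(Cxx, x - mu_x[i]))
    return y_hat
```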

Next, a method of operating the vehicle voice guidance system 100 described with reference to FIGS. 1 to 3 will be described. FIG. 4 is a flowchart illustrating an operation method of the vehicle voice guidance system according to an embodiment of the present invention.

When the driver utters a specific command (S405), the speech feature information extractor 120 extracts the feature information from the speech of the speaker (S410).

Then, the speaker's voice analyzer 130 analyzes gender, age, inclinations, etc. in real time from the feature information (S415).

Thereafter, the TTS DB extractor 140 selects information corresponding to each analysis result in the TTS DB 150 (S420).

Then, the speaker tuner 160 tunes the voice-converted information based on the information selected by the TTS DB extractor 140 (S425).

Then, the speaker voice converter 180 converts the tuned voice based on the GMM model obtained from the speaker's voice so that it approximates the actual voice of the driver (S430).

Thereafter, the TTS output unit (not shown) outputs the voice converted by the speaker voice converter 180 (S435).
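Taken together, steps S405 through S435 can be summarized as a single pipeline. The sketch below reuses the hypothetical helpers from the earlier sketches (extract_voice_features, classify_gender_by_f0, tune_pitch_to_driver) and assumes a simple dictionary-based TTS DB with a synthesize callable; all of these names are illustrative.

```python
def voice_guidance_pipeline(driver_wav_path, guidance_text, tts_db, sr=16000):
    """Illustrative end-to-end flow of FIG. 4 (S405 to S435)."""
    # S410: extract feature information from the driver's utterance
    features = extract_voice_features(driver_wav_path, sr=sr)

    # S415: analyze the speaker in real time (gender shown; age/propensity analogous)
    gender = classify_gender_by_f0(features["mean_f0"])

    # S420: select the matching reference entry from the TTS DB
    reference = tts_db.get(gender, tts_db["default"])

    # S425: synthesize the guidance text and tune it toward the driver's pitch
    tts_wave = reference["synthesize"](guidance_text)   # assumed TTS callable
    tuned = tune_pitch_to_driver(tts_wave, sr, reference["f0"], features["mean_f0"])

    # S430: a frame-wise GMM conversion (convert_frame above) could be applied here
    # S435: return the voice to be played back
    return tuned
```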

DESCRIPTION OF THE PREFERRED EMBODIMENTS

One embodiment of the present invention has been described above with reference to FIGS. 1 to 4. Hereinafter, preferred forms of the present invention that can be inferred from the above embodiment will be described.

A target information audio output control apparatus according to a preferred embodiment of the present invention includes a characteristic information generation unit, a target information generation unit, a target information output unit, a power source unit, and a main control unit.

The power supply unit supplies power to each configuration of the target information audio output control apparatus. The main control unit controls the overall operation of each of the components constituting the target information audio output control apparatus. When considering that the target information audio output control apparatus is applied to a vehicle, the power source section and the main control section may not be provided in the present embodiment.

The characteristic information generation unit performs a function of generating characteristic information of the user based on the user's voice information. The characteristic information generation unit is a concept corresponding to the voice feature information extractor 120 of FIG. 1.

The characteristic information generating unit may extract at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information, and log spectrum information from the voice information, and can generate the characteristic information in real time on the basis of the extracted information.

The characteristic information generation unit may generate at least one of the gender information of the user, the age information of the user, and the user's emotion information in real time as the characteristic information. In this case the characteristic information generating unit corresponds to the combination of the voice feature information extractor 120 and the speaker voice analyzer 130 of FIG. 1.

The characteristic information generating unit may generate the characteristic information after removing noise information from the voice information. In this case the characteristic information generation unit corresponds to the combination of the noise eliminator 110 and the voice feature information extractor 120 of FIG. 1.

The characteristic information generating unit may generate the characteristic information by applying, to the voice information, weight information obtained by training with input information corresponding to voice information and target information for each piece of input information.

The characteristic information generation unit may obtain weight information using an ANN (Artificial Neural Network) algorithm, an EBP (Error Back Propagation) algorithm, and a Gradient Descent Method.

The target information generating unit generates second target information in the form of speech from first target information in the form of text based on the characteristic information.

The target information generating unit extracts reference information corresponding to the characteristic information from a database, and generates the second target information by tuning, based on the reference information, the information obtained by converting the first target information into speech. This target information generating unit is a concept corresponding to the combined configuration of the TTS DB 150, the TTS DB extractor 140, and the speaker voice tuner 160 of FIG. 1.

The target information generating unit may generate second target information by tuning information obtained by converting the first target information into speech based on the pitch period information or the frequency f0 obtained from the reference information.

The target information generating unit may generate the second target information based on the speaker identification information obtained from the characteristic information together with the reference information. This target information generating unit is a concept corresponding to the combined configuration of the TTS DB 150, the TTS DB extractor 140, the speaker voice tuner 160, the GMM model extractor 170, and the speaker voice converter 180 of FIG. 1.

The target information generation unit can acquire the speaker identification information based on the Gaussian mixture model (GMM).

Next, an operation method of the target information audio output control apparatus will be described.

First, the characteristic information generating unit generates characteristic information of the user based on the user's voice information.

Then, the target information generating unit generates second target information in the form of speech from the first target information in text form based on the characteristic information.

Then, the target information output unit outputs the second target information.

Although all elements constituting the embodiment of the present invention described above have been described as being combined into one or operating in combination, the present invention is not necessarily limited to such an embodiment. That is, within the scope of the present invention, all of the components may be selectively combined with one or more of the others. In addition, although each of the components may be implemented as independent hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the functions in one or more pieces of hardware. Such a computer program may be stored in a computer readable medium such as a USB memory, a CD, or a flash memory, and read and executed by a computer to implement an embodiment of the present invention. The recording medium of the computer program may include a magnetic recording medium, an optical recording medium, a carrier wave medium, and the like.

Furthermore, all terms including technical or scientific terms have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, unless otherwise defined in the Detailed Description. Commonly used terms, such as predefined terms, should be interpreted to be consistent with the contextual meanings of the related art, and are not to be construed as ideal or overly formal, unless expressly defined to the contrary.

It will be apparent to those skilled in the art that various modifications, additions, and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims. Therefore, the embodiments disclosed in the present invention and the accompanying drawings are intended to illustrate and not to limit the technical spirit of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments and the accompanying drawings. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of equivalents thereof should be construed as falling within the scope of the present invention.

Claims (15)

  1. An apparatus for controlling a target information audio output using a voice characteristic of a user, the apparatus comprising:
    a characteristic information generating unit for generating characteristic information of the user based on voice information of the user;
    a target information generating unit for generating second target information in the form of speech from first target information in the form of text based on the characteristic information; and
    a target information output unit for outputting the second target information.
  2. The apparatus according to claim 1,
    wherein the characteristic information generating unit extracts at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information, and log spectrum information from the voice information, and generates the characteristic information in real time based on the extracted information.
  3. The apparatus according to claim 1,
    wherein the characteristic information generating unit generates at least one of the gender information of the user, the age information of the user, and the emotion information of the user in real time as the characteristic information.
  4. The apparatus according to claim 1,
    wherein the characteristic information generating unit generates the characteristic information after removing noise information from the voice information.
  5. The apparatus according to claim 1,
    wherein the characteristic information generating unit generates the characteristic information by applying, to the voice information, weight information obtained by training with input information corresponding to voice information and target information for each piece of input information.
  6. The apparatus according to claim 5,
    wherein the characteristic information generating unit obtains the weight information using an artificial neural network (ANN) algorithm, an error back propagation (EBP) algorithm, and a gradient descent method.
  7. The apparatus according to claim 1,
    wherein the target information generating unit extracts reference information corresponding to the characteristic information from a database, and generates the second target information by tuning, based on the reference information, the information obtained by converting the first target information into speech.
  8. The apparatus according to claim 7,
    wherein the target information generating unit generates the second target information by tuning the information obtained by converting the first target information into speech based on pitch period information or frequency (log f0) information obtained from the reference information.
  9. The apparatus according to claim 7,
    wherein the target information generating unit generates the second target information based on speaker identification information obtained from the characteristic information together with the reference information.
  10. The apparatus according to claim 9,
    wherein the target information generating unit obtains the speaker identification information based on a Gaussian mixture model (GMM).
  11. A method of controlling a target information audio output using a voice characteristic of a user, the method comprising:
    generating characteristic information of the user based on voice information of the user;
    generating second target information in the form of speech from first target information in the form of text based on the characteristic information; and
    outputting the second target information.
  12. The method according to claim 11,
    wherein the generating of the characteristic information comprises extracting at least one of formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, pitch period information, and log spectrum information from the voice information, and generating the characteristic information in real time based on the extracted information.
  13. The method according to claim 11,
    wherein the generating of the characteristic information generates at least one of the gender information of the user, the age information of the user, and the emotion information of the user in real time as the characteristic information.
  14. The method according to claim 11,
    wherein the generating of the second target information comprises extracting reference information corresponding to the characteristic information from a database, and generating the second target information by tuning, based on the reference information, the information obtained by converting the first target information into speech.
  15. The method according to claim 14,
    wherein the generating of the second target information generates the second target information based on speaker identification information obtained from the characteristic information together with the reference information.
KR1020140147474A 2014-10-28 2014-10-28 Apparatus and method for controlling outputting target information to voice using characteristic of user voice KR20160049804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020140147474A KR20160049804A (en) 2014-10-28 2014-10-28 Apparatus and method for controlling outputting target information to voice using characteristic of user voice

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020140147474A KR20160049804A (en) 2014-10-28 2014-10-28 Apparatus and method for controlling outputting target information to voice using characteristic of user voice
CN201510657714.4A CN105575383A (en) 2014-10-28 2015-10-13 Apparatus and method for controlling target information voice output through using voice characteristics of user

Publications (1)

Publication Number Publication Date
KR20160049804A true KR20160049804A (en) 2016-05-10

Family

ID=55885440

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020140147474A KR20160049804A (en) 2014-10-28 2014-10-28 Apparatus and method for controlling outputting target information to voice using characteristic of user voice

Country Status (2)

Country Link
KR (1) KR20160049804A (en)
CN (1) CN105575383A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180049689A (en) * 2016-11-03 2018-05-11 세종대학교산학협력단 Apparatus and method for reliability measurement of speaker
WO2020080615A1 (en) * 2018-10-16 2020-04-23 Lg Electronics Inc. Terminal

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504743B (en) * 2016-11-14 2020-01-14 北京光年无限科技有限公司 Voice interaction output method for intelligent robot and robot
CN108519870A (en) * 2018-03-29 2018-09-11 联想(北京)有限公司 A kind of information processing method and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101375329A (en) * 2005-03-14 2009-02-25 沃克索尼克股份有限公司 An automatic donor ranking and selection system and method for voice conversion
US20070174396A1 (en) * 2006-01-24 2007-07-26 Cisco Technology, Inc. Email text-to-speech conversion in sender's voice
US9105053B2 (en) * 2010-03-23 2015-08-11 Nokia Technologies Oy Method and apparatus for determining a user age range
WO2013187610A1 (en) * 2012-06-15 2013-12-19 Samsung Electronics Co., Ltd. Terminal apparatus and control method thereof
KR101987966B1 (en) * 2012-09-03 2019-06-11 현대모비스 주식회사 System for improving voice recognition of the array microphone for vehicle and method thereof
CN103236259B (en) * 2013-03-22 2016-06-29 乐金电子研发中心(上海)有限公司 Voice recognition processing and feedback system, voice replying method


Also Published As

Publication number Publication date
CN105575383A (en) 2016-05-11

Similar Documents

Publication Publication Date Title
Huang et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation
Ling et al. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends
Hansen et al. Speaker recognition by machines and humans: A tutorial review
Hu et al. An unsupervised approach to cochannel speech separation
El Ayadi et al. Survey on speech emotion recognition: Features, classification schemes, and databases
Yang et al. Emotion recognition from speech signals using new harmony features
Zen et al. Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005
Du et al. A regression approach to single-channel speech separation via high-resolution deep neural networks
Morgan et al. Neural networks and speech processing
Lugger et al. The relevance of voice quality features in speaker independent emotion recognition
O’Shaughnessy Automatic speech recognition: History, methods and challenges
JP4241736B2 (en) Speech processing apparatus and method
JP4516527B2 (en) Voice recognition device
Ling et al. Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis
Nakashika et al. Voice conversion in high-order eigen space using deep belief nets.
Kim et al. Real-time emotion detection system using speech: Multi-modal fusion of different timescale features
CA2609247C (en) Automatic text-independent, language-independent speaker voice-print creation and speaker recognition
Nakashika et al. Non-parallel training in voice conversion using an adaptive restricted boltzmann machine
JP4458321B2 (en) Emotion recognition method and emotion recognition device
US9009048B2 (en) Method, medium, and system detecting speech using energy levels of speech frames
Nicholson et al. Emotion recognition in speech using neural networks
JP4274962B2 (en) Speech recognition system
Schuller et al. Emotion recognition in the noise applying large acoustic feature sets
Veaux et al. Intonation conversion from neutral to expressive speech
Schuller et al. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture

Legal Events

Date Code Title Description
A201 Request for examination