CN1681002A

CN1681002A - Speech synthesis system, speech synthesis method, and program product

Info

Publication number: CN1681002A
Application number: CNA2005100693792A
Authority: CN
Inventors: 真锅宏幸; 平岩明; 杉村利明
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2002-03-04
Filing date: 2003-03-03
Publication date: 2005-10-12
Anticipated expiration: 2023-03-03
Also published as: US7369991B2; EP1345210B1; US20070100630A1; DE60330400D1; US7680666B2; CN1681002B; EP1667108B1; DE60321256D1; EP1345210A3; EP1345210A2; US20030171921A1; EP1667108A1; JP2003255993A; CN1442845A

Abstract

The object of the present invention is to keep a high success rate in recognition with a low-volume of sound signal, without being affected by noise. The speech recognition system comprises a sound signal processor configured to acquire a sound signal, and to calculate a sound signal parameter based on the acquired sound signal; an electromyographic signal processor configured to acquire potential changes on a surface of the object as an electromyographic signal, and to calculate an electromyographic signal parameter based on the acquired electromyographic signal; an image information processor configured to acquire image information by taking an image of the object, and to calculate an image information parameter based on the acquired image information; a speech recognizer configured to recognize a speech signal vocalized by the object, based on the sound signal parameter, the electromyographic signal parameter and the image information parameter; and a recognition result provider configured to provide a result recognized by the speech recognizer.

Description

Speech synthesis system and method and program product

Technical field

The present invention relates to be used for the speech recognition system and the method for recognition of speech signals, the program product that carries out the speech synthesis system of synthetic speech signal and method and use therein according to speech recognition.

Background technology

The application is that application number is P2002-057818, and the date of application is to propose on the basis for priority of Japan's patented claim formerly of proposing on March 4th, 2002, and the full content of this application is incorporated herein by reference.

Traditional speech detection device adopts speech recognition technology by the frequency analysis in the sounding voice signal is come voice signal is discerned and handled.Speech recognition technology obtains by using band envelope or similar techniques.

Yet,, can in conventional speech detection device, survey voice signal under the condition of the voice signal of input sounding for traditional speech detection device.In addition, the speech detection result for by using speech recognition technology to obtain requires voice signal to sound with certain volume.

Therefore, traditional speech detection device can not use under the noiseless condition of needs, these situations for example, in office, in the library and in places such as public organizations, when the speaker may to around he when making troubles.The problem that traditional speech detection device has is exactly under the condition of high noisy, can bring the problem of intersect speaking and the performance of speech detection function to reduce.

On the other hand, occurred obtain the Study on Technology of voice signal from the information except that voice signal.The technology of obtaining voice signal from the information except that voice signal makes that obtaining voice signal under the condition of the voice signal that does not have sounding becomes possibility, therefore can solve the above problems.

Carrying out image process method according to the image information of video camera input is a kind of method of carrying out recognition of speech signals according to the visual information of lip.

In addition, also carried out by handle along with around the mouth (near) the electromyogram that produces of muscular movement (below be referred to as EMG) signal discern the technical research of the vowel type of sending.This research exists " the A speech Employing a Speech Syntghesizer Vowel Discriminationfrom Perioral Muscles Activities and Vowel Production of NoboruSugie etc. '; ' IEEE tansactions onBiomedical Engineering; volume 32; the 7th phase; 485-490 page or leaf " in open, the technology of distinguishing five vowels " a; i; u; e, o " by the number of times that the EMG signal is passed threshold value by pass-band filter and the EMG signal that passes through of statistics is wherein disclosed.

As everyone knows, exist by using nervous system network to handle the method that the EMG signal is surveyed speaker's vowel and consonant.In addition, use an input channel just but the multimodal interface of the information of a plurality of input channel input is suggested and obtains.

On the other hand, the storage of traditional speech synthesis system is used to characterize the data of speaker's voice signal, and uses the data when speaker's sounding to come synthetic speech signal.

Yet, a problem that exists is that traditional speech detection method is used from information rather than obtained the technology of voice signal from voice signal, therefore compare from the speech detection method that voice signal obtains voice signal with use, this technology has low success ratio in identification.Particularly, be difficult in the mouth motion of muscle and discern the consonant that is sent.

In addition, the problem that traditional speech synthesis system exists is that voice signal is synthetic according to the data of the voice signal that characterizes the speaker, therefore synthetic voice signal sounds very stiff, expresses not nature, and can not express speaker's emotion definitely.

Summary of the invention

Eventually the above an object of the present invention is to provide a kind of speech recognition system and method, and it is not having under the condition of noise effect, and identification has high discrimination during than the voice signal of amount of bass.Another object of the present invention provides a kind of speech synthesis system and method, and it uses the voice signal of identification to come synthetic speech signal, thereby makes that synthetic voice signal is more natural and clear, and can express speaker's emotion definitely.

First aspect of the present invention can reduce a kind of speech recognition system, and it comprises that audio signal processor, electromyogram (EMG) signal processor, visual information processor, speech recognition device and recognition result provide device.

Audio signal processor is configured to from an object acquisition voice signal, and calculates the voice signal parameter according to the voice signal that obtains.The potential change that the EMG signal processor is configured to obtain object surface is with as the EMG signal, and according to the EMG calculated signals EMG signal parameter that obtains.Visual information processor is configured to obtain image information by the image of obtaining object, and comes the computed image information parameter according to the image information of obtaining.Speech recognition device is configured to according to voice signal parameter, EMG signal parameter and image information parameter, the voice signal that identification is sent by object.Recognition result provides device to be configured to provide the result of speech recognition device identification.

Of the present invention aspect first, speech recognition device can come recognition of speech signals according in voice signal parameter, EMG signal parameter and the image information parameter each, each voice signal of contrast identification and according to the comparing result recognition of speech signals.

Aspect first, speech recognition device can use voice signal parameter, EMG signal parameter and image information parameter to come recognition of speech signals simultaneously of the present invention.

Aspect first, speech recognition device can comprise a hierarchical network of the present invention, and a plurality of non-linear components that contain input block and output unit in this network are by layering location from top to bottom.The output unit of the non-linear component on upper strata is connected to the input block of the non-linear component of the lower floor in the contiguous non-linear component.Weighted value is assigned to the combination of this connection or connection.The connection that data outputed to that each non-linear component calculates from the data of output unit output and determines to calculate according to the weighted value that is input to the data of input block and is assigned to the combination that connects or connect.Voice signal parameter, EMG signal parameter and image information parameter are used as the input data and are input in the non-linear component of the superiors in the hierarchical network.The voice signal of identification is used as in the undermost non-linear component of output data from hierarchical network to be exported.Speech recognition device is according to the data identification voice signal of output.

Aspect first, speech recognition device can comprise learning functionality of the present invention, and it is configured to change the weighted value that is assigned to non-linear component according to the sample data that transmits to the upper strata from lower floor of input.

Aspect first, audio signal processor can comprise microphone of the present invention, and it is configured to obtain voice signal from sound source.Microphone is configured to communicate with communicator.The EMG signal processor can comprise electrode, and it is configured to obtain the potential change on the face around the sound source, with as the EMG signal.This electrode is installed in the surface of communicator.Visual information processor can comprise camera, and it is configured to obtain image information by the image that the shooting sound source moves.On the terminal that this camera is installed in communicator separates.Communicator uses this terminal to transmit and receive data.

Aspect first, terminal can comprise a main body that camera is housed, and the belt of stationary body of the present invention.Recognition result provides the device can be for being used for the display of display result, and this display is installed in the surface of main body.

Aspect first, system can comprise a positioning equipment and fastening of the present invention.Audio signal processor can comprise microphone, and it is configured to obtain voice signal from sound source.The EMG signal processor can comprise electrode, and it is configured to obtain around the sound source the potential change on the face with as the EMG signal.Visual information processor can comprise camera, and it is configured to obtain image information by the image that the shooting sound source moves.Positioning equipment can be fixed microphone and the electrode approaching with sound source.Fastening can support camera and positioning equipment.

Of the present invention aspect first, recognition result provide device can be in translucent display device display result.Recognition result provides device to be installed in the fastening.

Second aspect of the present invention can reduce a kind of speech synthesis system, and it comprises speech recognition device, voice signal getter, the first bands of a spectrum getter, the second bands of a spectrum generator, regulates bands of a spectrum generator and follower.

Speech recognition device is configured to recognition of speech signals.The voice signal getter is configured to obtain voice signal.The bands of a spectrum that the first bands of a spectrum getter is configured to obtain the voice signal that obtains are used as first bands of a spectrum.The second bands of a spectrum generator is configured to produce according to the voice signal of speech recognition device identification the secondary configuration bands of a spectrum of voice signal, and with it as second bands of a spectrum.Regulating the bands of a spectrum generator is configured to produce bands of a spectrum after the adjusting according to first bands of a spectrum and second bands of a spectrum.Follower is configured to export synthetic voice signal according to the bands of a spectrum after regulating.

Aspect second of the present invention, follower can comprise communicator, and it is configured to send the synthetic voice signal as data.

The 3rd aspect of the present invention can reduce a kind of audio recognition method, may further comprise the steps: (A) from the object acquisition voice signal, and according to the voice signal calculating voice signal parameter of obtaining; (B) the potential change on surface of obtaining object is as the EMG signal, and according to the EMG calculated signals EMG signal parameter that obtains; (C) image of obtaining object obtains image information, and comes the computed image information parameter according to the image information of obtaining; (D) according to voice signal parameter, EMG signal parameter and image information parameter, the voice signal that identifying object sends; And the result that speech recognition device identification (E) is provided.

In aspect the 3rd of the present invention, step (D) can may further comprise the steps: (D1) come recognition of speech signals according in voice signal parameter, EMG signal parameter and the image information parameter each; (D2) contrast the voice signal of each identification; And (D3) according to the comparing result recognition of speech signals.

In aspect the 3rd of the present invention, voice signal can be discerned by use voice signal parameter, EMG signal parameter and image information parameter simultaneously in step (D).

Aspect the 3rd of the present invention, a plurality of non-linear components that contain input block and output unit are in the network of layering from top to bottom by the position of layering.The output unit of the non-linear component on upper strata is connected to the input block of the non-linear component of the lower floor in the contiguous non-linear component.Weighted value is assigned to the combination of this connection or connection.Each non-linear component calculates data and definite connection that data outputed to of calculating of exporting from output unit according to being input to the data of input block and being assigned to the weighted value that connects or make up.Step (D) may further comprise the steps: (D11) voice signal parameter, EMG signal parameter and image information parameter are used as the input data and are input in the non-linear component of the superiors in the hierarchical network; (D12) with identification voice signal as output data by exporting in the undermost non-linear component in the hierarchical network; And (D13) come recognition of speech signals according to the data of output.

In aspect the 3rd of the present invention, described method can comprise that the sample data that transmits to the upper strata from lower floor according to input changes the step of the weighted value that is assigned to non-linear component.

The 4th aspect of the present invention can reduce a kind of phoneme synthesizing method, may further comprise the steps: (A) recognition of speech signals; (B) obtain voice signal; (C) obtain the bands of a spectrum of the voice signal that obtains as first bands of a spectrum; (D) produce the secondary configuration bands of a spectrum of voice signal according to the voice signal of speech recognition device identification, and with it as second bands of a spectrum; (E) produce bands of a spectrum after the adjusting according to first bands of a spectrum and second bands of a spectrum; And (F) export synthetic voice signal according to the bands of a spectrum after regulating.

In aspect the 4th of the present invention, step (F) can comprise the step of transmission as the synthetic voice signal of data.

The 5th aspect of the present invention can reduce the program product that is used for recognition of speech signals in computing machine.Computing machine is carried out following steps: (A) from the object acquisition voice signal, and according to the voice signal calculating voice signal parameter of obtaining; (B) the potential change on surface of obtaining object is as the EMG signal, and according to the EMG calculated signals EMG signal parameter that obtains; (C) image of obtaining object obtains image information, and comes the computed image information parameter according to the image information of obtaining; (D) according to voice signal parameter, EMG signal parameter and image information parameter, the voice signal that identifying object sends; And the result that speech recognition device identification (E) is provided.

In aspect the 5th of the present invention, step (D) can may further comprise the steps: (D1) according to each recognition of speech signals in voice signal parameter, EMG signal parameter and the image information parameter; (D2) contrast the voice signal of each identification; And (D3) according to the comparing result recognition of speech signals.

In the step (D) aspect the 5th of the present invention, voice signal can use voice signal parameter, EMG signal parameter and image information parameter to discern simultaneously.

Aspect the 5th of the present invention, a plurality of non-linear components that contain input block and output unit in the network of layering from top to bottom by the position of layering.The output unit of the non-linear component on upper strata is connected to the input block of the non-linear component of the lower floor in the contiguous non-linear component.Weighted value is assigned to the combination of this connection or connection.Each non-linear component calculates data and definite connection that data outputed to of calculating of exporting from output unit according to being input to the data of input block and being assigned to the weighted value that connects or make up.Step (D) may further comprise the steps: (D11) voice signal parameter, EMG signal parameter and image information parameter are input in the non-linear component of the superiors in the hierarchical network as the input data; (D12) voice signal of the output unit of the undermost non-linear component from hierarchical network output identification is as output data; And (D13) come recognition of speech signals according to the data of output.

Aspect the 5th of the present invention, computing machine can carry out changing according to the sample data that transmits from bottom to top of input the step of the weighted value that is assigned to non-linear component.

The 6th aspect of the present invention can reduce the program product that is used at the computing machine synthetic speech signal.Computing machine is carried out following step: (A) recognition of speech signals; (B) obtain voice signal; (C) obtain the bands of a spectrum of the voice signal that obtains as first bands of a spectrum; (D) produce the secondary configuration bands of a spectrum of voice signal according to the voice signal of speech recognition device identification, and with it as second bands of a spectrum; (E) produce bands of a spectrum after the adjusting according to first bands of a spectrum and second bands of a spectrum; And (F) export synthetic voice signal according to the bands of a spectrum after regulating.

In aspect the 6th of the present invention, step (F) can comprise the step of transmission as the synthetic voice signal of data.

Description of drawings

Fig. 1 is the function unit figure of speech recognition system according to an embodiment of the invention.

Fig. 2 A is the process example of winning voice signal and EMG signal according to embodiments of the invention in speech recognition system to 2D.

Fig. 3 A is the example of winning the process of image information according to an embodiment of the invention in speech recognition system to 3D.

Fig. 4 is the function unit figure of the speech recognition device in speech recognition system according to an embodiment of the invention.

Fig. 5 is the function unit figure of the speech recognition device in speech recognition system according to an embodiment of the invention.

Fig. 6 is to be the detail drawing of explaining speech recognition device according to an embodiment of the invention in speech recognition system.

Fig. 7 is the process flow diagram of the description speech recognition process in speech recognition system operation according to an embodiment of the invention.

Fig. 8 is the process flow diagram of the description learning process in speech recognition system operation according to an embodiment of the invention.

Fig. 9 is the function unit figure of speech synthesis system according to an embodiment of the invention.

Figure 10 A to 10D at the key drawing of removing noise operation in speech recognition system according to an embodiment of the invention.

Figure 11 is for describing the process flow diagram of phonetic synthesis process operation according to an embodiment of the invention in voice system.

Figure 12 is according to an embodiment of the invention to the entire arrangement of speech recognition system and speech synthesis system integral system.

Figure 13 is according to an embodiment of the invention to the complete configuration of speech recognition system and the incorporate system of speech synthesis system.

Figure 14 represents to have write down the computer-readable recording medium according to the embodiments of the invention program.

Embodiment

(according to the configuration of the speech recognition system of the first embodiment of the present invention)

Below will describe configuration in detail according to the speech recognition system of the first embodiment of the present invention.Fig. 1 has described the function unit figure according to the speech recognition system of present embodiment.

As shown in Figure 1, speech recognition system disposes audio signal processor 10, EMG signal processor 13, visual information processor 16, information score device/recognizer 19, speech recognition device 20 and recognition result device 21 is provided.

Audio signal processor 10 is arranged to the voice signal that processing is sent by the speaker.Audio signal processor 10 disposes voice signal acquiring unit 11 and voice signal processing unit 12.

Voice signal acquiring unit 11 is a kind of device that is used for obtaining from speaker's (target) mouth voice signal, for example microphone.Voice signal acquiring unit 11 is surveyed the voice signal that the speaker sends, and the voice signal that obtains is sent to sound signal processing unit 12.

Sound signal processing unit 12 is arranged in the voice signal that obtains from voice signal acquiring unit 11 and obtains the voice signal parameter by separating band envelope or microtexture.

Sound signal processing unit 12 is a kind of devices that are used to calculate the voice signal parameter, this voice signal parameter can be in speech recognition device 20 according to the voice signal that obtains by voice signal acquiring unit 11 and processed.Sound signal processing unit 12 cut off voice signal when the time one, window was provided with, and the analytical calculation voice signal parameter when being usually used in speech recognition, for example the voice signal that cuts off is carried out the short time spectral band analysis, the cepstrum analysis of spectrum, maximum likelihood spectrum method of estimation, covariance method, PARCOR are analyzed and LSP analyzes.

EMG signal processor 13 is arranged to detection and handles near the motion of the muscle of mouth of speaking when sounding signal.EMG signal processor 13 disposes EMG signal acquiring unit 14 and EMG signal processing unit 15.

EMG signal acquiring unit 14 is arranged to obtains near (winning) the speak motion of the muscle mouth when sounding signal.EMG signal acquiring unit 14 is surveyed near the possible variation of the skin surface of speaker's (target) mouth.That is to say, in order to discern near the motion that is accompanied by the polylith muscle of the signal of sounding the mouth, EMG signal acquiring unit 14 is surveyed a plurality of EMG signals by a plurality of electrodes that are positioned on the skin surface relevant with polylith muscle, and amplification EMG signal is transferred to EMG signal processing unit 15.

EMG signal processing unit 15 is arranged to by the frequency of power that calculates the EMG signal that is obtained by EMG acquiring unit 14 and analysis EMG signal and wins the EMG signal parameter.EMG signal processing unit 15 is a kind of devices that calculate the EMG signal parameter according to a plurality of EMG signals by 14 transmission of EMG signal acquiring unit.More specifically, EMG signal processing unit 15 is being provided with cut-out EMG signal every time one window, and by calculating the mean oscillatory feature, as RMS (root mean square), ARV (average correction value) or IEMG (integration EMG) calculate the EMG signal parameter.

, will be described in detail to 2D with reference to figure 2A voice signal acquiring unit 12 and EMG signal processing unit 15.

Voice signal that is detected by voice signal acquiring unit 11 or EMG signal acquiring unit 14 or EMG signal are cut off (S401 among Fig. 2 A) by audio signal processor 12 or EMG signal processor 15 during window in per time one.Then, extract bands of a spectrum (S402 among Fig. 2 B) by FFT by cutoff signal.Then, the bands of a spectrum of winning are carried out the third-octave analysis meter and calculate the power of each frequency (S403 among Fig. 2 C).That calculate be transferred to speech recognition device 20 as speech signal parameter or EMG signal parameter (S404 among Fig. 2 D) with power each frequency dependence.This speech signal parameter or EMG signal parameter are discerned by speech recognition device 20.

Sound signal processing unit 12 or EMG signal processing unit 15 can not be to win voice signal parameter or EMG signal parameter at Fig. 2 A to the method among the 2D by using yet.

Visual information processor 16 is arranged near speak the spatial variations mouth of detection when sounding signal.Visual information processor 16 disposes image information acquisition unit 17 and Image Information Processing unit 18.

The image that image information acquisition unit 17 is arranged to by obtain near the spatial variations the mouth of speaking when sounding signal obtains image information.Image information acquisition unit 17 disposes the camera that obtains near the spatial variations image of mouth of speaking when sounding signal, as video camera.Near the motion of mouth is surveyed as image information in image information acquisition unit 17, and transmits this image information to Image Information Processing unit 18.

Image Information Processing unit 18 is arranged to the image information of obtaining according to image information acquisition unit 17 and calculates mouth kinematic parameter (image information parameter) on every side in a minute.More specifically, mouth motion feature computed image information is on every side won in the 18 usefulness light streams of Image Information Processing unit.

, will be described in detail Image Information Processing unit 18 below to 3D with reference to figure 3A.

Near the image information of the feature locations mouth of speaking during according to time t0 won.(as the S501 among Fig. 3 A).Position that might be by obtaining mark is as feature locations, or searches feature locations and win feature locations around the mouth in captured image information.Picture information processing unit 18 can from image information, win feature locations and with it as the two-dimensional space position.Picture information processing unit 18 by use a plurality of cameras and obtain feature locations and with it as three-dimensional space position.

Similarly, through associate t0 to t1 during this period of time after, when time t1, win the feature locations (as the S502 among Fig. 3 B) around the mouth.Then, Image Information Processing unit 18 calculates the motion (as the S503 among Fig. 3 C) of each unique point by calculating unique point when time t0 and the difference between the unique point when time t1.Image Information Processing unit 18 produces image information parameter (as the S504 among Fig. 3 D) according to the difference that calculates.

For Image Information Processing unit 18, can use except that the additive method the method for Fig. 3 A in the 3D and obtain the image information parameter.

Image information integrator/recognizer 19 is configured to the various information of obtaining from audio signal processor 10, EMG signal processor 13 and visual information processor 16 are carried out integration and identification.Image information integrator/recognizer 19 is furnished with speech recognition device 20 and recognition result provides device 21.

Speech recognition device 20 compares and integration by the voice signal parameter that audio signal processor 10 is sent, the EMG signal parameter of EMG signal processor 13 transmissions and the image information parameter that visual information processor 16 sends, thus the processor that carries out speech recognition.

Speech recognition device 20 when around noise rank when less, when the volume of the voice signal that sends is big or in the time can carrying out speech recognition with enough ranks according to the voice signal parameter, speech recognition device 20 can only come recognizing voice according to the voice signal parameter.

On the other hand, when around noise rank when big, when the volume of the voice signal that sends hour or in the time can not carrying out speech recognition with enough ranks according to the voice signal parameter, speech recognition device 20 not only can also come recognizing voice according to EMG signal parameter and image information parameter according to the voice signal parameter.

In addition, speech recognition device 20 can only be discerned special phoneme etc. according to the voice signal parameter, and this special phoneme can not correctly be discerned by using EMG signal parameter and image information parameter, thereby can improve the success ratio of identification.

With reference to figure 4, will the example of speech recognition device 20 be specifically described below.In example shown in Figure 4, speech recognition device 20 comes recognition of speech signals according in voice signal parameter, EMG signal parameter and the image information parameter each, and the voice signal of each identification compared, and come recognition of speech signals according to the result of contrast.

As shown in Figure 4, more specifically, speech recognition device 20 only comes recognizing voice respectively according to voice signal parameter, EMG signal parameter or image information parameter respectively.Speech recognition device 20 carries out integration according to each parameter to the result who discerns then, thereby carries out speech recognition.

When (in all recognition results) a plurality of recognition results that obtain according to each parameter coincide mutually, speech recognition device 20 with this result as final recognition result.On the other hand, when (in all recognition results) of obtaining according to each parameter do not have recognition result to coincide mutually, recognizer 20 will have the recognition result of the highest discrimination as final result in identification.

For example, just known in front when special phoneme of identification or special tongue, the speech recognition of carrying out according to the EMG parameter has lower success ratio, yet, suppose that special phoneme or special tongue are issued, when then basis was carried out speech recognition by the parameter of non-EMG signal, speech recognition device 20 was ignored the recognition result that obtains according to the EMG signal parameter, thereby can improve recognition success rate.

Based on the speech recognition of voice signal parameter the time, when the noise rank around determining is big or the volume of the voice signal that sends hour, speech recognition device 20 reduces the influence to net result of the recognition result that obtains based on the voice signal parameter, and carries out speech recognition by focusing on the recognition result that obtains based on EMG signal parameter and image information parameter.Can adopt conventional audio recognition method according to the speech recognition that each parameter is carried out.

Can adopt the audio recognition method of traditional various voice signals of use based on the speech recognition of the voice signal in the speech recognition device 20.The speech recognition of carrying out based on the EMG signal can adopt technical literature " Noboru Sugie et a1.; ' and A speech Employing a Speech Synthesizer VowelDiscriminatgion from Perioral Muscles Activities and Vowel Production ' IEEEtransactions on Biomedial Enginnering; 32 volumes; the 7th phase, 485-490 page or leaf " in disclosed method or in JP-A-181888 etc. disclosed method.Can adopt disclosed method in JP-A-2001-51963 or JP-A-2000-206986 etc. based on the speech recognition that image information is carried out.

Speech recognition device 20 as shown in Figure 4, when any parameter in voice signal parameter, EMG signal parameter and the image information parameter is all nonsensical for speech recognition, for example when around noise rank when big, when the volume of the voice signal that sends hour maybe when not detecting the EMG signal, speech recognition device 20 can come recognizing voice according to significant parameter, thereby can improve the vulnerability to jamming to noise in whole speech recognition system fully.

With reference to figure 5, will the another one example of speech recognition device 20 be specifically described below.In example shown in Figure 5, speech recognition device 20 is simultaneously according to coming recognition of speech signals in voice signal parameter, EMG signal parameter and the image information parameter.

More concrete, speech recognition device 20 comprises a hierarchical network (for example, nervous system network 20a), wherein a plurality of non-linear components that comprise input block and output unit are hierarchically positioned from top to bottom.

In nervous system network 20a, the output unit of the non-linear component on upper strata is connected to the input block of the non-linear component of the lower floor in the contiguous non-linear component.Weighted value is assigned to the combination of this connection or connection.Each non-linear component calculates data and definite connection that data outputed to of calculating of exporting from output unit according to being input to the data of input block and being assigned to the weighted value that connects or make up.

Voice signal parameter, EMG signal parameter and image information parameter are used as the input data and are input in the non-linear component of the superiors in the hierarchical network.The voice signal (vowel and consonant) of identification is used as output data and outputs in the undermost non-linear component in the layering speech recognition device.Speech recognition device 20 comes recognition of speech signals according to the data of being exported by the output unit of undermost non-linear component.

By reference " Nishikawa and Kitamura, ' Neural network and control of measure ', Asakura Syoten, 18-50 page or leaf " as can be known, nervous system network can adopt three layers of nervous system network of full connecting-type.

Speech recognition device 20 comprises learning functionality, and it is configured to change the weighting that is assigned to non-linear component according to the sample data that transmits from bottom to top of input.

That is to say, be necessary method, the weighting among the prior learning nervous system network 20a by for example backpropagation.

In order to learn weighting, speech recognition device 20 obtains according to the voice signal parameter that operation produced, EMG signal parameter and the image information parameter of sending special mode, and learns weighting by using as the special mode of learning signal.

When the speaker pronounces, the EMG signal is input in the speech recognition system earlier than voice signal and image information, speech recognition device 20 only postpones the input of EMG signal parameter by neuralward network 20a, and do not postpone the input of voice signal parameter and image information parameter, thereby make speech recognition device 20 have the function of synchronous voice signal, EMG signal and image information.

Receive the nervous system network 20a output phoneme relevant with input parameter as the various parameters of input data.

Nervous system network 20a adopts recurrence nervous system network (RNN), and it is handled the recognition result that obtains with the next one and returns as the input data.According to present embodiment, speech recognition algorithm also can adopt various speech recognition algorithms, for example Hidden Markov Model (HMM) except that adopting nervous system network.

As shown in Figure 6, a plurality of EMG signals 1,2 that detected by EMG signal acquiring unit 14 are exaggerated in EMG processing unit 15 (S601) and are cut off every time one window.By the EMG signal that cuts off is carried out the calculating that FFT carries out bands of a spectrum.Before input nervous system network 20, the bands of a spectrum (S602) that calculate are carried out the third-octave analysis, carry out the calculating of EMG signal parameter.

The voice signal that voice signal acquiring unit 11 detects is exaggerated and cuts off every time one window in sound signal processing unit 12 (S611).By the voice signal that cuts off is carried out the calculating that FFT carries out bands of a spectrum.Before input nervous system network 20, the bands of a spectrum (S612) that calculate are carried out the third-octave analysis, carry out the voice signal CALCULATION OF PARAMETERS.

The image information that Image Information Processing unit 18 obtains according to image information acquisition unit 17 (S621) is obtained the motion of the feature locations around the mouth of speaking as light stream.The image information parameter of winning as light stream is imported among the nervous system network 20a.

In a series of time, can win mouth feature locations separately on every side in the captured image information, thus the motion of winning feature locations.Also sign can be placed on mouth unique point on every side, and place reference point, according to the displacement of surveying with respect to the unique point of reference point, thus the motion of winning unique point.

Be transfused to the nervous system network 20a output phoneme relevant of various parameters with input parameter.

In addition, when voice by can not discern according to any parameter the time as the audio recognition method among Fig. 4, can be configured to use according to the speech recognition device 20 of present embodiment and carry out speech recognition as the audio recognition method among Fig. 5.Compare or they are carried out integration by the result that the result of the audio recognition method among Fig. 4 identification and the audio recognition method among Fig. 5 are discerned, speech recognition device 20 can be configured to voice are discerned.

It is the equipment of a kind of providing (output) speech recognition device 20 recognition results that recognition result provides device 21.Recognition result provides device 21 can adopt the voice generator that speech recognition device 20 recognition results are outputed to the speaker as voice signal or outputs to as text message in the display of display result.Recognition result provides device 21 can comprise a communication interface, and it is except providing the result to the speaker, also transmit the result in the application program as data, this application program runs in the terminal as PC.

(according to the operation of the speech recognition system of embodiment)

Operation according to the speech recognition system of embodiment will be described below with reference to figure 7 and Fig. 8.At first, with reference to figure 7, according to the operation of carrying out speech recognition in the speech recognition system of embodiment.

In step S101, the speaker begins sounding.In S104, voice signal acquiring unit 11, EMG signal acquiring unit 14 and image information acquisition unit 17 are surveyed voice signal, EMG signal and the image information that is produced respectively when speaker's sounding at step S102.

In S107, sound signal processing unit 12, EMG signal processing unit 15 and Image Information Processing unit 18 calculate voice signal parameter, EMG signal parameter and image information parameter respectively according to voice signal, EMG signal and image information at step S105.

In step S108, speech recognition device 20 comes recognizing voice according to parameters calculated.In step 109, recognition result provides device 21 that the result who is obtained by speech recognition device 20 identifications is provided.Recognition result provides device 21 result of identification can be exported or show recognition result as voice signal.

Secondly, with reference to figure 8, be operation according to the learning process in speech recognition system of present embodiment.

For improving recognition success rate, the pronunciation character of learning each speaker is very important.In an embodiment, will operation that use nervous system network 20a among Fig. 5 carry out learning process be described below.Under the situation of the audio recognition method that does not use nervous system network 20a, speech recognition system according to the present invention has adopted the learning functionality relevant with audio recognition method.

As shown in Figure 8, in step S301 and S302, the speaker begins sounding.In step S305, the speaker that is to say with said contents of input such as keyboards, input learning signal (sample data) when pronunciation.In step S303, voice signal acquiring unit 11, EMG signal acquiring unit 14 and image information acquisition unit 17 are surveyed voice signal, EMG signal and image information respectively.In step S304, sound signal processing unit 12, EMG signal processing unit 15 and Image Information Processing unit 18 are won voice signal parameter, EMG signal parameter and image information parameter respectively.

In step S306, nervous system network 20a wins the parameter that obtains according to the learning signal study of keyboard input.That is to say that nervous system network 20a changes the weighting that is assigned to non-linear component by the learning signal (sample data) that input transmits from top to bottom.

In step S307, when the error rate of identification was lower than threshold value, nervous system network 20a determined that learning process finishes.EO (S308) then.

On the other hand, in step S307, when nervous system network 20a determines that learning process is not finished, then with the operation of repeating step S302 to S306.

(according to the function and the effect thereof of the speech recognition system of embodiment)

The speech recognition system of present embodiment can be come recognizing voice according to a plurality of parameters that calculate from voice signal, EMG signal and image information, thereby can improve abilities such as noise resistance interference fully.

That is to say that the speech recognition system of present embodiment comprises three types input interface (audio signal processor 10, EMG signal processor 13 and visual information processor 16) and is used to improve the noise resistance interference capability.When all input interfaces were all unavailable, speech recognition system can use available input interface to come recognizing voice, thereby improved recognition success rate.

Therefore, the present invention can provide a kind of speech recognition system, and the volume of the voice signal that maybe ought send when its noise rank around is big hour can be come recognizing voice with enough ranks.

(according to the speech synthesis system of second embodiment of the present invention)

With reference to figure 9 to 11, will the speech synthesis system according to second embodiment of the present invention be described.Speech recognition system described above is used to according to speech synthesis system of the present invention.

As shown in Figure 9, the speech synthesis system of phase disposes audio signal processor 10, EMG signal processor 13, visual information processor 16, speech recognition device 20 and voice operation demonstrator 55 according to the present invention.Voice operation demonstrator 55 disposes the first bands of a spectrum getter, 51, the second bands of a spectrum generators 52, regulates bands of a spectrum generator 53 and follower 54.

Speech recognition system among audio signal processor 10, EMG signal processor 13, visual information processor 16, speech recognition device 20 and first embodiment has identical functions.

The first bands of a spectrum getter 51 be configured to obtain the bands of a spectrum of voice signal and with it as first bands of a spectrum, wherein voice signal is obtained by voice signal acquiring unit 11.Contain noise (with reference to figure 10C) in first bands of a spectrum that obtain.

The second bands of a spectrum generator 52 be configured to voice signal (result) according to speech recognition device 20 identification produce through the bands of a spectrum of the voice signal that reconfiguring and with it as second bands of a spectrum.Shown in Figure 10 A, more specifically, the second bands of a spectrum generator 52 is according to the pronunciation phonemes of winning from the result of speech recognition device 20 identifications, and formant frequency for example reconfigures the bands of a spectrum of pronunciation phonemes.

Regulating bands of a spectrum generator 53 is configured to produce the bands of a spectrum of adjusting according to first bands of a spectrum and second bands of a spectrum.Shown in Figure 10 D, more specifically, regulate bands of a spectrum generator 53 by multiplying each other, thereby generation there are not the adjusting bands of a spectrum of noise with second bands of a spectrum (with reference to figure 10A) and first bands of a spectrum (with reference to figure 10C).

Follower 54 is configured to according to regulating the synthetic voice signal of bands of a spectrum output.Follower 54 comprises communicator, and it is configured to send the synthetic voice signal as data.Shown in Figure 10 C, more specifically, follower 54 obtains not contain the voice signal of noise by the adjusting bands of a spectrum that do not contain noise being carried out Fourier reverse transformation (with reference to figure 10D), and the voice signal that obtains is exported as synthetic voice signal.

That is to say, obtain not contain the voice signal of noise by filtrator by the voice signal that will contain noise according to the speech synthesis system of present embodiment, its middle filtrator has the frequecy characteristic by the bands of a spectrum representative that reconfigures, and the voice signal of output acquisition.

Speech synthesis system according to present embodiment ins all sorts of ways recognizing voice by making, the voice signal that the speaker can be sent and noise are on every side separated from the signal that recognition result is reconfigured the signal that obtains and voice signal acquiring unit 11 and surveyed, thus when around the noise rank can export synthetic speech clearly greatly the time.

Therefore, according to the speech synthesis system of present embodiment can be big at noise or the voice signal that sends hour, the voice signal that output is synthetic, this signal sounds just looking like that the speaker does not send in having the environment of noise.

Adopted speech recognition system according to the speech synthesis system of present embodiment, yet the present invention is not limited to this embodiment according to first embodiment.Speech synthesis system according to present embodiment can come recognizing voice according to the parameter except that the voice signal parameter.

With reference to Figure 11, will the operation according to the speech synthesis system of present embodiment be described below.

As shown in figure 11, at step S201 in S208, carry out with first embodiment in the identical identifying of identifying.

In step S209, the first bands of a spectrum getter 51 obtain the bands of a spectrum of voice signal by voice signal acquiring unit 11 and with it as first bands of a spectrum.The second bands of a spectrum generator 52 according to the result of speech recognition device 20 identification produce through the bands of a spectrum of the voice signal that reconfiguring and with it as second bands of a spectrum.Regulate bands of a spectrum generator 53 and produce bands of a spectrum after the adjusting according to first bands of a spectrum and second bands of a spectrum, noise in these bands of a spectrum (not being the voice signal that the speaker sent) is eliminated from the voice signal that voice signal acquiring unit 11 is obtained.

In step S201, follower 54 is exported synthetic speech signal clearly according to regulating bands of a spectrum.

(according to the system of the 3rd embodiment of the present invention)

With reference to Figure 12, will the system of integrating speech sound recognition system and speech synthesis system be described below.

As shown in figure 12, the Wristwatch-type terminal 31 that communicator 30 is arranged and be separated with it according to the system configuration of present embodiment.

Communicating terminal 30 is configured to add audio signal processor 10, EMG signal processor 13, speech recognition device 20 and voice operation demonstrator 55 in the portable terminal of routine.

EMG signal acquiring unit 14 comprises the skin surface electrodes 114 that can contact with speaker 32 skin of a plurality of installations, and it is configured to obtain around speaker's (sound source) 32 the mouth the potential change on the face with as the EMG signal.Voice signal acquiring unit 11 comprises microphone 111, and it is configured to obtain voice signal from speaker's (sound source) 32.Microphone 111 can be configured to communicate with communicator 30.For example, microphone 111 is installed in the surface of communicator 30.Microphone 111 can be for being installed near the wireless microphone speaker's 32 mouths.Skin surface electrodes 114 can be installed in the surface of communicator 30.

Communication terminal 30 have transmission based on the result of speech recognition device 20 identification and synthetic voice signal as the function of the voice signal that sends by speaker 32.

The terminal 31 of Wristwatch-type disposes visual information processor 16 and recognition result processor 21.The video camera 117 that is used to take speaker's (sound source) 32 mouth moving image is installed on the terminal 31 of Wristwatch-type as image information collecting unit 17.The display device 121 that is used to show recognition result is installed on the terminal 31 of Wristwatch-type and provides device 21 as recognition result.The terminal 31 of Wristwatch-type comprises that one is used for its belt of fixing 33.

System to speech recognition system and speech synthesis system integration obtains EMG signal and voice signal by EMG signal acquiring unit 14 and the voice signal acquiring unit 11 that is installed on the communicator 30, and obtains image information by the image information acquisition unit 17 on the terminal 31 that is installed in Wristwatch-type.

Communicator 30 uses 31 pairs of data of terminal of Wristwatch-type to send and receive by wire communication or radio communication.The terminal 31 of communicator 30 and Wristwatch-type is collected and is transmitted a signal on the speech recognition device 20 that is structured in the communicator 30, speech recognition device 20 comes recognizing voice according to collected signal, and being installed in recognition result in the terminal 31 of Wristwatch-type provides 21 pairs of devices to show from the recognition result that speech recognition device 20 sends by wire communication or radio communication.Communicator 30 can send do not contain noise synthetic speech signal clearly in the terminal 31 of Wristwatch-type.

In the present embodiment, speech recognition device 20 is structured in the communicator 30, and the recognition result that is structured in the terminal 31 of Wristwatch-type provides device 21 to show recognition results.But speech recognition device 20 also can be installed in the terminal 31 of Wristwatch-type, or other can with terminal that communicator 30 is communicated by letter in, the terminal 31 of this Wristwatch-type can be discerned and synthetic speech.

Recognition result can be exported as voice signal from communicator, may be displayed on the monitor of terminal 31 (or communicator 30) of Wristwatch-type, perhaps can be from another terminal that can communicate by letter with the terminal 31 of communicator 30 and Wristwatch-type output.

(according to the system of the 4th embodiment of the present invention)

With reference to Figure 13, will be described the system that speech recognition system and speech synthesis system according to present embodiment carry out integration being used for below.

As shown in figure 13, according to the system configuration of present embodiment stationary installation 41 is arranged, this installs as the glasses form; As the video camera 117 of image information acquisition unit 17, it can be conditioned the motion with the mouth of taking speaker's (sound source) 32; Locating device 42; The head suspension display device (HMD) 121 of device is provided as recognition result; And in be built in speech recognition device 20 in the stationary installation 41.Stationary installation 41 can be suspended on speaker 52 the head.

Be configured to obtain the potential change on the face around speaker's 32 (sound source) the mouth as the skin surface electrodes 114 of EMG signal acquiring unit 14; And as voice signal acquiring unit 11 and the microphone 111 that is configured to from the mouth of speaker 32 (sound source), to obtain voice signal be fixed on by adjustable ground around speaker 32 the mouth.

Wear with speaker 32 and can discern and synthetic speech,, his/her both hands can be freed owing to use the mode of wearing according to the system of embodiment.

Speech recognition device 20 can in the exterior terminal that is built in the fixed equipment device 41 or communicates with fixed equipment device 41.Recognition result may be displayed among the HMD (translucent display device), or exports from output device such as loudspeaker apparatus as voice signal.Output device such as loudspeaker apparatus can be according to the synthetic voice signals of recognition result output.

(according to the system of the 5th embodiment of the present invention)

According to speech recognition system, audio recognition method, speech synthesis system or the phoneme synthesizing method of the above embodiments can by the computing machine (for example, personal computer) 215 of general applications be included in IC chip in the communicator 30 or similar equipment on the program carried out with the predetermined program language description obtain.

In addition, program can be recorded on the storage medium, and this media can be read by the computing machine 215 of general applications.That is, as shown in figure 14, program can be stored on the equipment such as floppy disk 216, CD-ROM 217, RAM218, magnetic tape cassette 219.By using the storage medium to contain program to be inserted into computing machine 215 or the medium method of internal memory that program is installed to communicator 30 can being realized system of the present invention or method.

Can keep high success ratio with the corresponding speech recognition system of the present invention, method and program when not discerned by the voice signal than amount of bass of noise effect.

Can use the voice signal of identification to come synthetic speech signal with the corresponding speech synthesis system of the present invention, method and program, thereby make synthetic voice signal nature and clear more, and the speaker's that can express properly emotion etc.

Claims

1. a speech synthesis system comprises:

Configuration is used for the speech recognition device of recognition of speech signals;

Configuration is used for obtaining the voice signal getter of voice signal;

Configuration is used for obtaining the first bands of a spectrum getter of the bands of a spectrum of the voice signal that obtains as first bands of a spectrum;

Configuration is used for according to the secondary configuration bands of a spectrum of the voice signal generation voice signal of speech recognition device identification, and with its second bands of a spectrum generator as second bands of a spectrum;

Configuration is used for according to the adjusting bands of a spectrum generator of the bands of a spectrum after first bands of a spectrum and second bands of a spectrum generation adjusting; And

Configuration is used for according to the follower of the synthetic voice signal of the bands of a spectrum output after regulating.

2. speech synthesis system according to claim 1, wherein, follower comprises communicator, it is configured to transmit synthetic voice signal as data.

3. phoneme synthesizing method may further comprise the steps:

(A) recognition of speech signals;

(B) obtain voice signal;

(C) obtain the bands of a spectrum of the voice signal that obtains as first bands of a spectrum;

(D) produce the secondary configuration bands of a spectrum of voice signal according to the voice signal of speech recognition device identification, and with it as second bands of a spectrum;

(E) produce bands of a spectrum after the adjusting according to first bands of a spectrum and second bands of a spectrum; And

(F) export synthetic voice signal according to the bands of a spectrum after regulating.

4. program product that in computing machine, is used for synthetic speech signal, wherein, computing machine is carried out following steps:

(A) recognition of speech signals;

(B) obtain voice signal;

(B) obtain the bands of a spectrum of the voice signal that obtains as first bands of a spectrum;