CN1681002A - Speech synthesis system, speech synthesis method, and program product - Google Patents


Info

Publication number
CN1681002A
CN1681002A
Authority
CN
China
Prior art keywords
voice signal
bands
spectrum
speech
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005100693792A
Other languages
Chinese (zh)
Other versions
CN1681002B (en)
Inventor
Hiroyuki Manabe
Akira Hiraiwa
Toshiaki Sugimura
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Docomo Inc
Original Assignee
NTT Docomo Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NTT Docomo Inc filed Critical NTT Docomo Inc
Publication of CN1681002A publication Critical patent/CN1681002A/en
Application granted granted Critical
Publication of CN1681002B publication Critical patent/CN1681002B/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/25 — Fusion techniques
    • G06F18/254 — Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 — Fusion techniques of classification results, e.g. of results related to same input data, of results relating to different input data, e.g. multimodal recognition
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 — Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 — Fusion of classification results, the classifiers operating on different input data, e.g. multi-modal recognition
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/24 — Speech recognition using non-acoustical features
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 — Changing voice quality, e.g. pitch or formants
    • G10L21/007 — Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013 — Adapting to target pitch
    • G10L2021/0135 — Voice conversion or morphing

Abstract

An object of the present invention is to maintain a high recognition success rate for a low-volume sound signal, without being affected by noise. The speech recognition system comprises a sound signal processor configured to acquire a sound signal and to calculate a sound signal parameter based on the acquired sound signal; an electromyographic signal processor configured to acquire potential changes on a surface of the object as an electromyographic signal, and to calculate an electromyographic signal parameter based on the acquired electromyographic signal; an image information processor configured to acquire image information by taking images of the object, and to calculate an image information parameter based on the acquired image information; a speech recognizer configured to recognize the speech signal vocalized by the object, based on the sound signal parameter, the electromyographic signal parameter, and the image information parameter; and a recognition result provider configured to provide the result recognized by the speech recognizer.

Description

Speech synthesis system, speech synthesis method, and program product
Technical field
The present invention relates to a speech recognition system and method for recognizing speech signals, to a speech synthesis system and method for synthesizing speech signals based on the speech recognition, and to program products used therein.
Background art
This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. P2002-057818, filed on March 4, 2002, the entire contents of which are incorporated herein by reference.
A conventional speech detection apparatus detects and processes a speech signal by applying speech recognition technology, such as frequency analysis, to the vocalized sound signal. The speech recognition is performed using the spectrum envelope or similar techniques.
However, a conventional speech detection apparatus can detect a speech signal only under the condition that a vocalized sound signal is input. Moreover, to obtain a speech detection result using speech recognition technology, the speech signal must be vocalized at a certain volume.
Therefore, a conventional speech detection apparatus cannot be used where silence is required, for example in an office, in a library, or in public institutions, where the speaker might disturb the people around him. A further problem of the conventional speech detection apparatus is that, under high-noise conditions, cross-talk arises and the performance of the speech detection function degrades.
On the other hand, techniques for obtaining a speech signal from information other than the sound signal have been studied. Such techniques make it possible to obtain a speech signal without a vocalized sound signal, and can therefore solve the above problems.
One such method recognizes speech signals from visual information about the lips, by applying image processing to image information input from a video camera.
In addition, research has been conducted on recognizing the vocalized vowel type by processing the electromyogram (hereinafter, EMG) signals generated by the movements of the muscles around the mouth. This research is disclosed in Noboru Sugie et al., "A Speech Prosthesis Employing a Speech Synthesizer—Vowel Discrimination from Perioral Muscle Activities and Vowel Production", IEEE Transactions on Biomedical Engineering, Vol. 32, No. 7, pp. 485-490, which describes a technique of passing the EMG signals through band-pass filters and counting the number of times the signals cross a threshold, thereby discriminating the five vowels (a, i, u, e, o).
There is also a known method of detecting a speaker's vowels and consonants by processing the EMG signals with a neural network. Furthermore, multimodal interfaces that accept information not from a single input channel but from a plurality of input channels have been proposed.
Meanwhile, a conventional speech synthesis system stores data characterizing the speaker's speech signal, and synthesizes a speech signal using the stored data when the speaker vocalizes.
However, a problem with conventional speech detection methods that obtain the speech signal from information other than the sound signal is that they have a lower recognition success rate than methods that obtain the speech signal from the sound signal itself. In particular, it is difficult to recognize the vocalized consonants from the movements of the muscles around the mouth.
A further problem with the conventional speech synthesis system is that, because the speech signal is synthesized from data characterizing the speaker's speech signal, the synthesized speech sounds stiff and unnatural, and cannot accurately express the speaker's emotions.
Summary of the invention
In view of the above, an object of the present invention is to provide a speech recognition system and method that achieve a high recognition rate for low-volume speech signals without being affected by noise. Another object of the present invention is to provide a speech synthesis system and method that synthesize a speech signal using the recognized speech signal, so that the synthesized speech signal is more natural and clear and accurately expresses the speaker's emotions.
A first aspect of the present invention is summarized as a speech recognition system comprising a sound signal processor, an electromyographic (EMG) signal processor, an image information processor, a speech recognizer, and a recognition result provider.
The sound signal processor is configured to acquire a sound signal from an object (a speaker) and to calculate a sound signal parameter based on the acquired sound signal. The EMG signal processor is configured to acquire potential changes on the skin surface of the object as an EMG signal, and to calculate an EMG signal parameter based on the acquired EMG signal. The image information processor is configured to acquire image information by taking images of the object, and to calculate an image information parameter based on the acquired image information. The speech recognizer is configured to recognize the speech signal vocalized by the object, based on the sound signal parameter, the EMG signal parameter, and the image information parameter. The recognition result provider is configured to provide the result recognized by the speech recognizer.
In the first aspect of the present invention, the speech recognizer may recognize a speech signal based on each of the sound signal parameter, the EMG signal parameter, and the image information parameter individually, compare the individually recognized speech signals, and recognize the speech signal based on the comparison result.
In the first aspect of the present invention, the speech recognizer may recognize the speech signal using the sound signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
In the first aspect of the present invention, the speech recognizer may comprise a hierarchical network in which a plurality of nonlinear elements, each having input units and an output unit, are arranged in layers from top to bottom. The output unit of a nonlinear element in an upper layer is connected to the input units of the nonlinear elements in the adjacent lower layer. A weight is assigned to each connection or combination of connections. Each nonlinear element computes the data to be output from its output unit, according to the data input to its input units and the weights assigned to the connections, and determines the connections to which the computed data are output. The sound signal parameter, the EMG signal parameter, and the image information parameter are input as input data to the nonlinear elements of the top layer of the hierarchical network. The recognized speech signal is output as output data from the nonlinear elements of the bottom layer of the hierarchical network. The speech recognizer recognizes the speech signal based on the output data.
In the first aspect of the present invention, the speech recognizer may comprise a learning function configured to change the weights assigned to the nonlinear elements according to input sample data propagated from the lower layers to the upper layers.
In the first aspect of the present invention, the sound signal processor may comprise a microphone configured to acquire the sound signal from a sound source. The microphone is configured to communicate with a communication device. The EMG signal processor may comprise electrodes configured to acquire the potential changes on the skin surface around the sound source as the EMG signal. The electrodes are mounted on the surface of the communication device. The image information processor may comprise a camera configured to acquire the image information by taking images of the movements of the sound source. The camera is mounted on a terminal separate from the communication device. The communication device transmits data to and receives data from the terminal.
In the first aspect of the present invention, the terminal may comprise a main body on which the camera is mounted, and a belt for fixing the main body. The recognition result provider may be a display for displaying the result, the display being mounted on the surface of the main body.
In the first aspect of the present invention, the system may comprise a positioner and a support. The sound signal processor may comprise a microphone configured to acquire the sound signal from a sound source. The EMG signal processor may comprise electrodes configured to acquire the potential changes on the skin surface around the sound source as the EMG signal. The image information processor may comprise a camera configured to acquire the image information by taking images of the movements of the sound source. The positioner may hold the microphone and the electrodes close to the sound source. The support may hold the camera and the positioner.
In the first aspect of the present invention, the recognition result provider may display the result on a semi-transparent display device. The recognition result provider may be mounted on the support.
A second aspect of the present invention is summarized as a speech synthesis system comprising a speech recognizer, a sound signal acquirer, a first spectrum acquirer, a second spectrum generator, an adjusted spectrum generator, and an output unit.
The speech recognizer is configured to recognize a speech signal. The sound signal acquirer is configured to acquire a sound signal. The first spectrum acquirer is configured to acquire the spectrum of the acquired sound signal as a first spectrum. The second spectrum generator is configured to generate a reconstituted spectrum of the speech signal, based on the speech signal recognized by the speech recognizer, as a second spectrum. The adjusted spectrum generator is configured to generate an adjusted spectrum based on the first spectrum and the second spectrum. The output unit is configured to output a synthesized speech signal based on the adjusted spectrum.
In the second aspect of the present invention, the output unit may comprise a communication device configured to transmit the synthesized speech signal as data.
A third aspect of the present invention is summarized as a speech recognition method comprising the steps of: (A) acquiring a sound signal from an object, and calculating a sound signal parameter based on the acquired sound signal; (B) acquiring potential changes on the skin surface of the object as an EMG signal, and calculating an EMG signal parameter based on the acquired EMG signal; (C) acquiring image information by taking images of the object, and calculating an image information parameter based on the acquired image information; (D) recognizing the speech signal vocalized by the object, based on the sound signal parameter, the EMG signal parameter, and the image information parameter; and (E) providing the recognized result.
In the third aspect of the present invention, step (D) may comprise the steps of: (D1) recognizing a speech signal based on each of the sound signal parameter, the EMG signal parameter, and the image information parameter individually; (D2) comparing the individually recognized speech signals; and (D3) recognizing the speech signal based on the comparison result.
In the third aspect of the present invention, in step (D), the speech signal may be recognized using the sound signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
In the third aspect of the present invention, a plurality of nonlinear elements, each having input units and an output unit, are arranged in layers from top to bottom in a hierarchical network. The output unit of a nonlinear element in an upper layer is connected to the input units of the nonlinear elements in the adjacent lower layer. A weight is assigned to each connection or combination of connections. Each nonlinear element computes the data to be output from its output unit, according to the data input to its input units and the weights assigned to the connections, and determines the connections to which the computed data are output. Step (D) may comprise the steps of: (D11) inputting the sound signal parameter, the EMG signal parameter, and the image information parameter as input data to the nonlinear elements of the top layer of the hierarchical network; (D12) outputting the recognized speech signal as output data from the nonlinear elements of the bottom layer of the hierarchical network; and (D13) recognizing the speech signal based on the output data.
In the third aspect of the present invention, the method may comprise a step of changing the weights assigned to the nonlinear elements according to input sample data propagated from the lower layers to the upper layers.
A fourth aspect of the present invention is summarized as a speech synthesis method comprising the steps of: (A) recognizing a speech signal; (B) acquiring a sound signal; (C) acquiring the spectrum of the acquired sound signal as a first spectrum; (D) generating a reconstituted spectrum of the speech signal, based on the recognized speech signal, as a second spectrum; (E) generating an adjusted spectrum based on the first spectrum and the second spectrum; and (F) outputting a synthesized speech signal based on the adjusted spectrum.
In the fourth aspect of the present invention, step (F) may comprise a step of transmitting the synthesized speech signal as data.
A fifth aspect of the present invention is summarized as a program product for recognizing speech signals in a computer. The computer executes the steps of: (A) acquiring a sound signal from an object, and calculating a sound signal parameter based on the acquired sound signal; (B) acquiring potential changes on the skin surface of the object as an EMG signal, and calculating an EMG signal parameter based on the acquired EMG signal; (C) acquiring image information by taking images of the object, and calculating an image information parameter based on the acquired image information; (D) recognizing the speech signal vocalized by the object, based on the sound signal parameter, the EMG signal parameter, and the image information parameter; and (E) providing the recognized result.
In the fifth aspect of the present invention, step (D) may comprise the steps of: (D1) recognizing a speech signal based on each of the sound signal parameter, the EMG signal parameter, and the image information parameter individually; (D2) comparing the individually recognized speech signals; and (D3) recognizing the speech signal based on the comparison result.
In step (D) of the fifth aspect of the present invention, the speech signal may be recognized using the sound signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
In the fifth aspect of the present invention, a plurality of nonlinear elements, each having input units and an output unit, are arranged in layers from top to bottom in a hierarchical network. The output unit of a nonlinear element in an upper layer is connected to the input units of the nonlinear elements in the adjacent lower layer. A weight is assigned to each connection or combination of connections. Each nonlinear element computes the data to be output from its output unit, according to the data input to its input units and the weights assigned to the connections, and determines the connections to which the computed data are output. Step (D) may comprise the steps of: (D11) inputting the sound signal parameter, the EMG signal parameter, and the image information parameter as input data to the nonlinear elements of the top layer of the hierarchical network; (D12) outputting the recognized speech signal as output data from the output units of the nonlinear elements of the bottom layer of the hierarchical network; and (D13) recognizing the speech signal based on the output data.
In the fifth aspect of the present invention, the computer may execute a step of changing the weights assigned to the nonlinear elements according to input sample data propagated from the lower layers to the upper layers.
A sixth aspect of the present invention is summarized as a program product for synthesizing speech signals in a computer. The computer executes the steps of: (A) recognizing a speech signal; (B) acquiring a sound signal; (C) acquiring the spectrum of the acquired sound signal as a first spectrum; (D) generating a reconstituted spectrum of the speech signal, based on the recognized speech signal, as a second spectrum; (E) generating an adjusted spectrum based on the first spectrum and the second spectrum; and (F) outputting a synthesized speech signal based on the adjusted spectrum.
In the sixth aspect of the present invention, step (F) may comprise a step of transmitting the synthesized speech signal as data.
Brief description of the drawings
Fig. 1 is a functional block diagram of a speech recognition system according to an embodiment of the present invention.
Figs. 2A to 2D show an example of the process of extracting the sound signal and the EMG signal in the speech recognition system according to an embodiment of the present invention.
Figs. 3A to 3D show an example of the process of extracting the image information in the speech recognition system according to an embodiment of the present invention.
Fig. 4 is a functional block diagram of the speech recognizer in the speech recognition system according to an embodiment of the present invention.
Fig. 5 is a functional block diagram of another speech recognizer in the speech recognition system according to an embodiment of the present invention.
Fig. 6 is a detailed diagram for explaining the speech recognizer in the speech recognition system according to an embodiment of the present invention.
Fig. 7 is a flowchart describing the speech recognition process in the operation of the speech recognition system according to an embodiment of the present invention.
Fig. 8 is a flowchart describing the learning process in the operation of the speech recognition system according to an embodiment of the present invention.
Fig. 9 is a functional block diagram of a speech synthesis system according to an embodiment of the present invention.
Figs. 10A to 10D are diagrams explaining the noise removal operation in the speech synthesis system according to an embodiment of the present invention.
Fig. 11 is a flowchart describing the speech synthesis process in the operation of the speech synthesis system according to an embodiment of the present invention.
Fig. 12 shows the overall configuration of a system integrating a speech recognition system and a speech synthesis system according to an embodiment of the present invention.
Fig. 13 shows the overall configuration of another system integrating a speech recognition system and a speech synthesis system according to an embodiment of the present invention.
Fig. 14 shows a computer-readable recording medium on which a program according to an embodiment of the present invention is recorded.
Embodiment
(Configuration of the speech recognition system according to the first embodiment of the present invention)
The configuration of the speech recognition system according to the first embodiment of the present invention is described in detail below. Fig. 1 shows a functional block diagram of the speech recognition system according to the present embodiment.
As shown in Fig. 1, the speech recognition system comprises a sound signal processor 10, an EMG signal processor 13, an image information processor 16, an information integrator/recognizer 19, a speech recognizer 20, and a recognition result provider 21.
The sound signal processor 10 is configured to process the sound signal vocalized by the speaker. The sound signal processor 10 comprises a sound signal acquisition unit 11 and a sound signal processing unit 12.
The sound signal acquisition unit 11 is a device, such as a microphone, for acquiring the sound signal from the mouth of the speaker (the object). The sound signal acquisition unit 11 detects the sound signal vocalized by the speaker and transmits the acquired sound signal to the sound signal processing unit 12.
The sound signal processing unit 12 is configured to extract a sound signal parameter from the sound signal acquired by the sound signal acquisition unit 11, by separating the spectrum envelope or the fine structure.
The sound signal processing unit 12 is a device for calculating a sound signal parameter that can be processed by the speech recognizer 20, based on the sound signal acquired by the sound signal acquisition unit 11. The sound signal processing unit 12 cuts out the sound signal at every set time window, and calculates the sound signal parameter by applying analyses commonly used in speech recognition to the cut-out sound signal, for example short-time spectrum analysis, cepstrum analysis, maximum likelihood spectrum estimation, the covariance method, PARCOR analysis, or LSP analysis.
The EMG signal processor 13 is configured to detect and process the movements of the muscles around the mouth when the speaker vocalizes a speech signal. The EMG signal processor 13 comprises an EMG signal acquisition unit 14 and an EMG signal processing unit 15.
The EMG signal acquisition unit 14 is configured to capture the movements of the muscles around the mouth when the speaker vocalizes a speech signal. The EMG signal acquisition unit 14 detects the potential changes on the skin surface around the mouth of the speaker (the object). That is, in order to recognize the movements of the several muscles around the mouth that accompany the vocalization of a speech signal, the EMG signal acquisition unit 14 detects a plurality of EMG signals through a plurality of electrodes placed on the skin surface over the respective muscles, amplifies the EMG signals, and transmits them to the EMG signal processing unit 15.
The EMG signal processing unit 15 is configured to extract EMG signal parameters by calculating the power of the EMG signals acquired by the EMG signal acquisition unit 14 and analyzing the frequencies of the EMG signals. The EMG signal processing unit 15 is a device that calculates the EMG signal parameters from the plurality of EMG signals transmitted by the EMG signal acquisition unit 14. More specifically, the EMG signal processing unit 15 cuts out each EMG signal at every set time window, and calculates the EMG signal parameter by computing average amplitude features such as the RMS (root mean square), the ARV (average rectified value), or the IEMG (integrated EMG).
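For concreteness, the three amplitude features named above can be computed for one cut-out window as in the following sketch. This is our own illustrative code, not taken from the patent; the function and parameter names are assumptions.

```python
import numpy as np

def emg_window_features(window: np.ndarray, fs: float) -> dict:
    """Amplitude features of one cut-out EMG window (illustrative)."""
    rectified = np.abs(window)
    return {
        "rms": float(np.sqrt(np.mean(window ** 2))),  # root mean square
        "arv": float(np.mean(rectified)),             # average rectified value
        "iemg": float(np.sum(rectified) / fs),        # integrated EMG over the window
    }
```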
The sound signal processing unit 12 and the EMG signal processing unit 15 are described in detail below with reference to Figs. 2A to 2D.
The sound signal detected by the sound signal acquisition unit 11, or the EMG signal detected by the EMG signal acquisition unit 14, is cut out by the sound signal processing unit 12 or the EMG signal processing unit 15 at every set time window (S401 in Fig. 2A). The spectrum of the cut-out signal is then extracted by FFT (S402 in Fig. 2B). Next, a third-octave analysis is applied to the extracted spectrum to calculate the power in each frequency band (S403 in Fig. 2C). The calculated power values associated with the respective frequency bands are transmitted to the speech recognizer 20 as the sound signal parameter or the EMG signal parameter (S404 in Fig. 2D), and are recognized by the speech recognizer 20.
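The steps S401 to S404 can be sketched as follows. This is an illustrative reconstruction, not the patent's code; the Hann window and the lowest band edge are assumed values.

```python
import numpy as np

def band_power_parameter(signal: np.ndarray, fs: float, f_start: float = 25.0) -> np.ndarray:
    """S401-S404 sketch: window the signal, FFT it, and sum the power
    in each third-octave band; the result is the parameter vector."""
    windowed = signal * np.hanning(len(signal))          # S401: cut-out window
    power = np.abs(np.fft.rfft(windowed)) ** 2           # S402: spectrum by FFT
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    bands, lo = [], f_start
    while lo * 2 ** (1 / 3) <= fs / 2:                   # S403: third-octave bands
        hi = lo * 2 ** (1 / 3)
        bands.append(power[(freqs >= lo) & (freqs < hi)].sum())
        lo = hi
    return np.asarray(bands)                             # S404: one power per band
```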
The sound signal processing unit 12 or the EMG signal processing unit 15 may also extract the sound signal parameter or the EMG signal parameter by methods other than that shown in Figs. 2A to 2D.
The image information processor 16 is configured to detect the spatial changes around the mouth when the speaker vocalizes a speech signal. The image information processor 16 comprises an image information acquisition unit 17 and an image information processing unit 18.
The image information acquisition unit 17 is configured to acquire image information by taking images of the spatial changes around the mouth of the speaker during vocalization. The image information acquisition unit 17 comprises a camera, such as a video camera, that takes images of the spatial changes around the mouth. The image information acquisition unit 17 detects the movements around the mouth as image information and transmits the image information to the image information processing unit 18.
The image information processing unit 18 is configured to calculate a parameter of the movements around the speaking mouth (the image information parameter) from the image information acquired by the image information acquisition unit 17. More specifically, the image information processing unit 18 uses optical flow to extract the features of the movements around the mouth and computes the image information parameter from them.
The image information processing unit 18 is described in detail below with reference to Figs. 3A to 3D.
The positions of feature points around the speaking mouth are extracted from the image information at time t0 (S501 in Fig. 3A). The feature positions may be obtained from the positions of markers, or by searching the captured image information for feature points around the mouth. The image information processing unit 18 may extract a feature position from the image information as a two-dimensional spatial position, or may use a plurality of cameras to obtain the feature position as a three-dimensional spatial position.
Similarly, after the time interval from t0 to t1 has elapsed, the positions of the feature points around the mouth are extracted at time t1 (S502 in Fig. 3B). The image information processing unit 18 then calculates the motion of each feature point by computing the difference between its position at time t0 and its position at time t1 (S503 in Fig. 3C). The image information processing unit 18 generates the image information parameter from the calculated differences (S504 in Fig. 3D).
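A minimal sketch of steps S503 and S504, assuming the feature points have already been located (by markers or search) at both times:

```python
import numpy as np

def image_parameter(points_t0: np.ndarray, points_t1: np.ndarray) -> np.ndarray:
    """Per-point displacement of mouth feature points between times
    t0 and t1 (S503), flattened into one parameter vector (S504).
    Each input has shape (n_points, 2) for 2-D positions."""
    deltas = np.asarray(points_t1, float) - np.asarray(points_t0, float)  # S503
    return deltas.ravel()                                                 # S504
```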
The image information processing unit 18 may also obtain the image information parameter by methods other than that shown in Figs. 3A to 3D.
The information integrator/recognizer 19 is configured to integrate and recognize the various kinds of information acquired from the sound signal processor 10, the EMG signal processor 13, and the image information processor 16. The information integrator/recognizer 19 is equipped with the speech recognizer 20 and the recognition result provider 21.
The speech recognizer 20 is a processor that performs speech recognition by comparing and integrating the sound signal parameter transmitted by the sound signal processor 10, the EMG signal parameter transmitted by the EMG signal processor 13, and the image information parameter transmitted by the image information processor 16.
When the surrounding noise level is low, when the volume of the vocalized speech signal is large, or when speech recognition can be performed at a sufficient level from the sound signal parameter alone, the speech recognizer 20 may recognize speech from the sound signal parameter only.
On the other hand, when the surrounding noise level is high, when the volume of the vocalized speech signal is small, or when speech recognition cannot be performed at a sufficient level from the sound signal parameter, the speech recognizer 20 may recognize speech not only from the sound signal parameter but also from the EMG signal parameter and the image information parameter.
Furthermore, the speech recognizer 20 may recognize, from the sound signal parameter alone, particular phonemes and the like that cannot be recognized correctly using the EMG signal parameter and the image information parameter, thereby improving the recognition success rate.
An example of the speech recognizer 20 is described in detail below with reference to Fig. 4. In the example shown in Fig. 4, the speech recognizer 20 recognizes a speech signal based on each of the sound signal parameter, the EMG signal parameter, and the image information parameter individually, compares the individually recognized speech signals, and recognizes the speech signal based on the comparison result.
More specifically, as shown in Fig. 4, the speech recognizer 20 first recognizes speech separately from the sound signal parameter alone, the EMG signal parameter alone, and the image information parameter alone. The speech recognizer 20 then integrates the results recognized from the respective parameters and thereby performs speech recognition.
When several of the recognition results obtained from the respective parameters coincide, the speech recognizer 20 takes that result as the final recognition result. On the other hand, when none of the recognition results obtained from the respective parameters coincide, the speech recognizer 20 takes as the final result the recognition result with the highest recognition rate.
For example, when it is known in advance that speech recognition based on the EMG signal has a low success rate for a particular phoneme or a particular tongue movement, and that phoneme or tongue movement is detected when speech recognition is performed from the parameters other than the EMG signal, the speech recognizer 20 can ignore the recognition result obtained from the EMG signal parameter, thereby improving the recognition success rate.
In speech recognition based on the sound signal parameter, when the surrounding noise level is determined to be high or the volume of the vocalized speech signal is small, the speech recognizer 20 reduces the influence of the recognition result based on the sound signal parameter on the final result, and performs speech recognition by placing more weight on the recognition results based on the EMG signal parameter and the image information parameter. The recognition performed from each individual parameter may employ conventional speech recognition methods.
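The comparison logic of Fig. 4 can be sketched as a simple vote with a confidence fallback. The modality names and the confidence convention below are our own assumptions, not part of the patent:

```python
from collections import Counter

def fuse_results(results: dict) -> str:
    """Fig. 4 sketch: `results` maps modality -> (label, confidence).
    Agreeing labels win; otherwise the most confident label is taken."""
    labels = [label for label, _ in results.values()]
    top, count = Counter(labels).most_common(1)[0]
    if count > 1:                                        # two or more modalities agree
        return top
    return max(results.values(), key=lambda r: r[1])[0]  # fall back on confidence

# e.g. fuse_results({"sound": ("ka", 0.4), "emg": ("ta", 0.7), "image": ("ka", 0.5)})
# returns "ka": the sound and image results agree, so the EMG result is outvoted.
```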
The speech recognition based on the sound signal in the speech recognizer 20 may employ various conventional speech recognition methods that use sound signals. The speech recognition based on the EMG signal may employ the method disclosed in Noboru Sugie et al., "A Speech Prosthesis Employing a Speech Synthesizer—Vowel Discrimination from Perioral Muscle Activities and Vowel Production", IEEE Transactions on Biomedical Engineering, Vol. 32, No. 7, pp. 485-490, the method disclosed in JP-A-181888, or the like. The speech recognition based on the image information may employ the methods disclosed in JP-A-2001-51963, JP-A-2000-206986, and the like.
As shown in Fig. 4, when any one of the sound signal parameter, the EMG signal parameter, and the image information parameter becomes meaningless for speech recognition, for example when the surrounding noise level is high, when the volume of the vocalized speech signal is small, or when no EMG signal is detected, the speech recognizer 20 can recognize speech from the remaining meaningful parameters, so that the robustness of the whole speech recognition system against noise can be sufficiently improved.
Another example of the speech recognizer 20 is described in detail below with reference to Fig. 5. In the example shown in Fig. 5, the speech recognizer 20 recognizes a speech signal from the sound signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
More specifically, the speech recognizer 20 comprises a hierarchical network (for example, a neural network 20a) in which a plurality of nonlinear elements, each having input units and an output unit, are arranged hierarchically from top to bottom.
In the neural network 20a, the output unit of a nonlinear element in an upper layer is connected to the input units of the nonlinear elements in the adjacent lower layer. A weight is assigned to each connection or combination of connections. Each nonlinear element computes the data to be output from its output unit, according to the data input to its input units and the weights assigned to the connections, and determines the connections to which the computed data are output.
The sound signal parameter, the EMG signal parameter, and the image information parameter are input as input data to the nonlinear elements of the top layer of the hierarchical network. The recognized speech signal (vowels and consonants) is output as output data from the nonlinear elements of the bottom layer. The speech recognizer 20 recognizes the speech signal based on the data output from the output units of the nonlinear elements of the bottom layer.
As described in Nishikawa and Kitamura, "Neural Networks and Measurement Control", Asakura Shoten, pp. 18-50, the neural network may be a fully connected three-layer neural network.
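A minimal sketch of such a fully connected three-layer network follows; the sigmoid units, layer sizes, and initialization are illustrative assumptions, not the patent's specification:

```python
import numpy as np

class ThreeLayerNet:
    """Illustrative fully connected three-layer network: the input layer
    receives the concatenated sound, EMG, and image parameter vectors;
    the output layer gives one score per candidate phoneme."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_out))

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(self, x: np.ndarray) -> np.ndarray:
        self.h = self._sigmoid(x @ self.w1)        # hidden-layer activations
        self.y = self._sigmoid(self.h @ self.w2)   # output: phoneme scores
        return self.y
```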
The speech recognizer 20 comprises a learning function configured to change the weights assigned to the nonlinear elements according to input sample data propagated from the bottom to the top.
That is, the weights in the neural network 20a must be learned in advance, by a method such as backpropagation.
To learn the weights, the speech recognizer 20 acquires the sound signal parameter, the EMG signal parameter, and the image information parameter produced by the operation of vocalizing a specific pattern, and learns the weights using the specific pattern as a teaching signal.
When the speaker vocalizes, the EMG signal is input to the speech recognition system earlier than the sound signal and the image information. The speech recognizer 20 therefore delays only the input of the EMG signal parameter to the neural network 20a, without delaying the input of the sound signal parameter and the image information parameter, so that the speech recognizer 20 has the function of synchronizing the sound signal, the EMG signal, and the image information.
The neural network 20a, which receives the various parameters as input data, outputs the phoneme corresponding to the input parameters.
The neural network 20a may be a recurrent neural network (RNN), which feeds the recognition result back as input data for the next recognition. According to the present embodiment, besides the neural network, the speech recognition algorithm may also employ various other speech recognition algorithms, for example Hidden Markov Models (HMM).
As shown in Fig. 6, the plurality of EMG signals 1 and 2 detected by the EMG signal acquisition unit 14 are amplified in the EMG signal processing unit 15 (S601) and cut out at every set time window. The spectrum is calculated by applying an FFT to the cut-out EMG signal. Before being input to the neural network 20a, the calculated spectrum is subjected to a third-octave analysis (S602) to calculate the EMG signal parameters.
The sound signal detected by the sound signal acquisition unit 11 is amplified in the sound signal processing unit 12 (S611) and cut out at every set time window. The spectrum is calculated by applying an FFT to the cut-out sound signal. Before being input to the neural network 20a, the calculated spectrum is subjected to a third-octave analysis (S612) to calculate the sound signal parameter.
The image information processing unit 18 obtains, as optical flow, the movements of the feature positions around the speaking mouth from the image information acquired by the image information acquisition unit 17 (S621). The image information parameter extracted as optical flow is input to the neural network 20a.
The feature positions around the mouth may be extracted individually from the image information captured over a series of times, so as to extract the motion of the feature positions. Alternatively, markers may be placed on the feature points around the mouth together with a reference point, and the motion of the feature points may be extracted from the displacements of the feature points detected relative to the reference point.
The neural network 20a, to which the various parameters are input, outputs the phoneme corresponding to the input parameters.
In addition, when speech cannot be recognized from any single parameter by the speech recognition method of Fig. 4, the speech recognizer 20 according to the present embodiment can be configured to perform speech recognition using the speech recognition method of Fig. 5. The speech recognizer 20 can also be configured to recognize speech by comparing or integrating the result recognized by the speech recognition method of Fig. 4 and the result recognized by the speech recognition method of Fig. 5.
The recognition result provider 21 is a device for providing (outputting) the recognition result of the speech recognizer 20. The recognition result provider 21 may be a voice generator that outputs the recognition result of the speech recognizer 20 to the speaker as a speech signal, or a display that displays the result as text information. The recognition result provider 21 may also comprise a communication interface that, besides providing the result to the speaker, transmits the result as data to an application program running on a terminal such as a PC.
(Operation of the speech recognition system according to the embodiment)
The operation of the speech recognition system according to the embodiment is described below with reference to Figs. 7 and 8. First, the speech recognition operation of the speech recognition system according to the embodiment is described with reference to Fig. 7.
In step S101, the speaker starts vocalizing. In steps S102 to S104, the sound signal acquisition unit 11, the EMG signal acquisition unit 14, and the image information acquisition unit 17 respectively detect the sound signal, the EMG signal, and the image information produced when the speaker vocalizes.
In steps S105 to S107, the sound signal processing unit 12, the EMG signal processing unit 15, and the image information processing unit 18 respectively calculate the sound signal parameter, the EMG signal parameter, and the image information parameter from the sound signal, the EMG signal, and the image information.
In step S108, the speech recognizer 20 recognizes speech based on the calculated parameters. In step S109, the recognition result provider 21 provides the result recognized by the speech recognizer 20. The recognition result provider 21 may output the recognized result as a speech signal or display the recognition result.
Next, the operation of the learning process in the speech recognition system according to the present embodiment is described with reference to Fig. 8.
To improve the recognition success rate, it is important to learn the pronunciation characteristics of each speaker. In the embodiment, the operation of the learning process using the neural network 20a of Fig. 5 is described below. When a speech recognition method that does not use the neural network 20a is employed, the speech recognition system according to the present invention adopts the learning function associated with that speech recognition method.
As shown in Fig. 8, in steps S301 and S302, the speaker starts vocalizing. In step S305, the speaker inputs the vocalized content with a keyboard or the like, that is, inputs the teaching signal (sample data) for the vocalization. In step S303, the sound signal acquisition unit 11, the EMG signal acquisition unit 14, and the image information acquisition unit 17 detect the sound signal, the EMG signal, and the image information, respectively. In step S304, the sound signal processing unit 12, the EMG signal processing unit 15, and the image information processing unit 18 extract the sound signal parameter, the EMG signal parameter, and the image information parameter, respectively.
In step S306, the neural network 20a learns from the extracted parameters according to the teaching signal input from the keyboard. That is, the neural network 20a changes the weights assigned to the nonlinear elements according to the input teaching signal (sample data).
In step S307, when the recognition error rate is below a threshold, the neural network 20a determines that the learning process is finished, and the operation ends (S308).
On the other hand, when the neural network 20a determines in step S307 that the learning process is not finished, the operations of steps S302 to S306 are repeated.
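The loop of steps S302 to S307 can be sketched as follows, reusing the ThreeLayerNet sketch above and updating the weights by backpropagation until the mean squared error falls below the threshold; the learning rate, threshold, and epoch cap are assumed values:

```python
import numpy as np

def train(net, samples, targets, lr=0.5, threshold=0.01, max_epochs=10000):
    """Fig. 8 sketch: repeat S302-S306 until the error test of S307 passes."""
    for epoch in range(max_epochs):
        err = 0.0
        for x, t in zip(samples, targets):               # one vocalized pattern + teaching signal
            y = net.forward(x)
            delta_out = (y - t) * y * (1 - y)            # sigmoid output gradient
            delta_hid = (delta_out @ net.w2.T) * net.h * (1 - net.h)
            net.w2 -= lr * np.outer(net.h, delta_out)    # S306: weight updates
            net.w1 -= lr * np.outer(x, delta_hid)
            err += float(np.mean((y - t) ** 2))
        if err / len(samples) < threshold:               # S307: learning finished
            break
```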
(Functions and effects of the speech recognition system according to the embodiment)
The speech recognition system of the present embodiment can recognize speech from a plurality of parameters calculated from the sound signal, the EMG signal, and the image information, so that its robustness against noise and the like can be sufficiently improved.
That is, the speech recognition system of the present embodiment includes three types of input interfaces (the sound signal processor 10, the EMG signal processor 13, and the image information processor 16) to improve its robustness against noise. Even when some of the input interfaces are unavailable, the speech recognition system can recognize speech using the available input interfaces, thereby improving the recognition success rate.
Therefore, the present invention can provide a speech recognition system that can recognize speech at a sufficient level when the surrounding noise level is high or when the volume of the vocalized speech signal is small.
(Speech synthesis system according to the second embodiment of the present invention)
A speech synthesis system according to the second embodiment of the present invention is described with reference to Figs. 9 to 11. The speech recognition system described above is used in the speech synthesis system according to the present invention.
As shown in Fig. 9, the speech synthesis system according to the present embodiment comprises the sound signal processor 10, the EMG signal processor 13, the image information processor 16, the speech recognizer 20, and a speech synthesizer 55. The speech synthesizer 55 comprises a first spectrum acquirer 51, a second spectrum generator 52, an adjusted spectrum generator 53, and an output unit 54.
The sound signal processor 10, the EMG signal processor 13, the image information processor 16, and the speech recognizer 20 have the same functions as those in the speech recognition system of the first embodiment.
The first spectrum acquirer 51 is configured to acquire the spectrum of the sound signal acquired by the sound signal acquisition unit 11, as a first spectrum. The acquired first spectrum contains noise (see Fig. 10C).
The second spectrum generator 52 is configured to generate a reconstituted spectrum of the speech signal, based on the speech signal (the result) recognized by the speech recognizer 20, as a second spectrum. More specifically, as shown in Fig. 10A, the second spectrum generator 52 reconstitutes the spectrum of the vocalized phoneme extracted from the result recognized by the speech recognizer 20, using, for example, its formant frequencies.
The adjusted spectrum generator 53 is configured to generate an adjusted spectrum based on the first spectrum and the second spectrum. More specifically, as shown in Fig. 10D, the adjusted spectrum generator 53 multiplies the second spectrum (see Fig. 10A) by the first spectrum (see Fig. 10C), thereby generating an adjusted spectrum free of noise.
The output unit 54 is configured to output a synthesized speech signal based on the adjusted spectrum. The output unit 54 may comprise a communication device configured to transmit the synthesized speech signal as data. More specifically, the output unit 54 obtains a noise-free speech signal by applying an inverse Fourier transform to the noise-free adjusted spectrum (see Fig. 10D), and outputs the obtained speech signal as the synthesized speech signal.
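A minimal sketch of the Figs. 10A to 10D operation on one windowed signal; treating the second spectrum as a normalized magnitude envelope of the same length as the rfft output is our assumption:

```python
import numpy as np

def synthesize_clean(noisy_window: np.ndarray, second_spectrum: np.ndarray) -> np.ndarray:
    """Multiply the measured (first) spectrum by the reconstituted
    (second) spectrum and invert, yielding the noise-free signal.
    `second_spectrum` must have length len(noisy_window) // 2 + 1."""
    first = np.fft.rfft(noisy_window)                    # first spectrum, with noise
    adjusted = first * second_spectrum                   # adjusted spectrum (Fig. 10D)
    return np.fft.irfft(adjusted, n=len(noisy_window))   # inverse Fourier transform
```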
That is, the speech synthesis system according to the present embodiment obtains a noise-free speech signal by passing the noisy sound signal through a filter whose frequency characteristic is given by the reconstituted spectrum, and outputs the obtained speech signal.
By recognizing speech with the various methods described above, the speech synthesis system according to the present embodiment can separate the speech signal vocalized by the speaker from the surrounding noise, using the signal reconstituted from the recognition result and the signal detected by the sound signal acquisition unit 11, so that clear synthesized speech can be output even when the surrounding noise level is high.
Therefore, when the noise is loud or the vocalized speech signal is quiet, the speech synthesis system according to the present embodiment can output a synthesized speech signal that sounds as if the speaker had vocalized it in a noise-free environment.
The speech synthesis system according to the present embodiment employs the speech recognition system according to the first embodiment; however, the present invention is not limited to this embodiment. The speech synthesis system according to the present embodiment may recognize speech from parameters other than the sound signal parameter.
The operation of the speech synthesis system according to the present embodiment is described below with reference to Fig. 11.
As shown in Fig. 11, in steps S201 to S208, the same recognition process as in the first embodiment is performed.
In step S209, the first spectrum acquirer 51 acquires the spectrum of the sound signal acquired by the sound signal acquisition unit 11, as the first spectrum. The second spectrum generator 52 generates the reconstituted spectrum of the speech signal, based on the result recognized by the speech recognizer 20, as the second spectrum. The adjusted spectrum generator 53 generates the adjusted spectrum based on the first spectrum and the second spectrum; in this spectrum, the noise (the components that are not the speech signal vocalized by the speaker) has been removed from the sound signal acquired by the sound signal acquisition unit 11.
In step S210, the output unit 54 outputs a clear synthesized speech signal based on the adjusted spectrum.
(according to the system of the 3rd embodiment of the present invention)
With reference to Figure 12, will the system of integrating speech sound recognition system and speech synthesis system be described below.
As shown in figure 12, the Wristwatch-type terminal 31 that communicator 30 is arranged and be separated with it according to the system configuration of present embodiment.
Communicating terminal 30 is configured to add audio signal processor 10, EMG signal processor 13, speech recognition device 20 and voice operation demonstrator 55 in the portable terminal of routine.
EMG signal acquiring unit 14 comprises the skin surface electrodes 114 that can contact with speaker 32 skin of a plurality of installations, and it is configured to obtain around speaker's (sound source) 32 the mouth the potential change on the face with as the EMG signal.Voice signal acquiring unit 11 comprises microphone 111, and it is configured to obtain voice signal from speaker's (sound source) 32.Microphone 111 can be configured to communicate with communicator 30.For example, microphone 111 is installed in the surface of communicator 30.Microphone 111 can be for being installed near the wireless microphone speaker's 32 mouths.Skin surface electrodes 114 can be installed in the surface of communicator 30.
Communication terminal 30 have transmission based on the result of speech recognition device 20 identification and synthetic voice signal as the function of the voice signal that sends by speaker 32.
The terminal 31 of Wristwatch-type disposes visual information processor 16 and recognition result processor 21.The video camera 117 that is used to take speaker's (sound source) 32 mouth moving image is installed on the terminal 31 of Wristwatch-type as image information collecting unit 17.The display device 121 that is used to show recognition result is installed on the terminal 31 of Wristwatch-type and provides device 21 as recognition result.The terminal 31 of Wristwatch-type comprises that one is used for its belt of fixing 33.
System to speech recognition system and speech synthesis system integration obtains EMG signal and voice signal by EMG signal acquiring unit 14 and the voice signal acquiring unit 11 that is installed on the communicator 30, and obtains image information by the image information acquisition unit 17 on the terminal 31 that is installed in Wristwatch-type.
The communication terminal 30 sends data to and receives data from the wristwatch-type terminal 31 by wire or radio communication. The signals collected by the communication terminal 30 and the wristwatch-type terminal 31 are transmitted to the speech recognizer 20 built into the communication terminal 30; the speech recognizer 20 recognizes speech from the collected signals, and the recognition result provider 21 mounted in the wristwatch-type terminal 31 displays the recognition result sent from the speech recognizer 20 by wire or radio communication. The communication terminal 30 can also send a clear, noise-free synthesized voice signal to the wristwatch-type terminal 31. A hypothetical sketch of this exchange follows.
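The exchange between the two terminals could look like the following sketch, in which the communication terminal pushes each recognition result to the wristwatch-type terminal as one JSON line over a socket. The host name, port, and message format are hypothetical; the patent does not specify the transport.

```python
# Hypothetical sketch of the wired/wireless exchange between the
# communication terminal 30 and the wristwatch-type terminal 31.
# Host, port, and message format are inventions for illustration.
import json
import socket

def send_result(text, host="wristwatch.local", port=5000):
    """Run on terminal 30: push one recognition result to terminal 31."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall((json.dumps({"recognized": text}) + "\n").encode())

def serve_display(port=5000):
    """Run on terminal 31: receive results and show them on display 121."""
    with socket.create_server(("", port)) as srv:
        conn, _ = srv.accept()
        with conn, conn.makefile() as lines:
            for line in lines:
                print("display 121:", json.loads(line)["recognized"])
```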
In the present embodiment, the speech recognizer 20 is built into the communication terminal 30, and the recognition result provider 21 built into the wristwatch-type terminal 31 displays the recognition result. However, the speech recognizer 20 may also be installed in the wristwatch-type terminal 31, or in another terminal that can communicate with the communication terminal 30, so that the wristwatch-type terminal 31 can recognize and synthesize speech.
The recognition result may be output as a voice signal from the communication terminal 30, may be displayed on the monitor of the wristwatch-type terminal 31 (or of the communication terminal 30), or may be output from another terminal that can communicate with the communication terminal 30 and the wristwatch-type terminal 31.
(System according to the fourth embodiment of the present invention)
With reference to Figure 13, a system integrating the speech recognition system and the speech synthesis system according to the present embodiment will be described below.
As shown in Figure 13, the system according to the present embodiment comprises a fixing device 41 in the form of eyeglasses; a camera 117, serving as the image information acquiring unit 17, which can be adjusted to capture the motion of the mouth of the speaker (sound source) 32; a positioning device 42; a head-mounted display (HMD) 121 serving as the recognition result provider; and a speech recognizer 20 built into the fixing device 41. The fixing device 41 can be worn on the head of the speaker 32.
Skin surface electrodes 114, serving as the EMG signal acquiring unit 14, are configured to acquire potential changes on the skin around the mouth of the speaker (sound source) 32, and a microphone 111, serving as the voice signal acquiring unit 11 and configured to acquire a voice signal from the mouth of the speaker (sound source) 32, is adjustably fixed near the mouth of the speaker 32.
Because the speaker 32 wears the system according to the present embodiment, speech can be recognized and synthesized while both of the speaker's hands remain free.
The speech recognizer 20 can be built into the fixing device 41 or into an external terminal that communicates with the fixing device 41. The recognition result may be displayed on the HMD (a semi-transparent display device), or output as a voice signal from an output device such as a loudspeaker; such an output device can output a voice signal synthesized on the basis of the recognition result.
(System according to the fifth embodiment of the present invention)
The speech recognition system, speech recognition method, speech synthesis system, or speech synthesis method according to the above embodiments can be realized as a program, written in a predetermined programming language, that is executed on a general-purpose computer (for example, a personal computer) 215, on an IC chip included in the communication terminal 30, or on similar equipment.
In addition, the program can be recorded on a storage medium readable by the general-purpose computer 215. That is, as shown in Figure 14, the program can be stored on a medium such as a floppy disk 216, a CD-ROM 217, a RAM 218, or a cassette tape 219. The system or method of the present invention can be realized by inserting a storage medium containing the program into the computer 215, or by loading the program into the memory of the communication terminal 30.
The speech recognition system, method, and program according to the present invention can maintain a high recognition rate for low-volume voice signals without being affected by noise.
The speech synthesis system, method, and program according to the present invention can synthesize a voice signal using the recognized speech, so that the synthesized voice signal is more natural and clear and properly expresses the speaker's emotion and the like.

Claims (4)

1. A speech synthesis system, comprising:
a speech recognizer configured to recognize a voice signal;
a voice signal acquirer configured to acquire a voice signal;
a first spectrum acquirer configured to acquire a spectrum of the acquired voice signal as a first spectrum;
a second spectrum generator configured to generate a reconstructed spectrum of the voice signal from the voice signal recognized by the speech recognizer, as a second spectrum;
an adjusted-spectrum generator configured to generate an adjusted spectrum from the first spectrum and the second spectrum; and
an output unit configured to output a synthesized voice signal according to the adjusted spectrum.
2. The speech synthesis system according to claim 1, wherein the output unit comprises a communicator configured to transmit the synthesized voice signal as data.
3. A speech synthesis method, comprising the steps of:
(A) recognizing a voice signal;
(B) acquiring a voice signal;
(C) acquiring a spectrum of the acquired voice signal as a first spectrum;
(D) generating a reconstructed spectrum of the voice signal from the recognized voice signal, as a second spectrum;
(E) generating an adjusted spectrum from the first spectrum and the second spectrum; and
(F) outputting a synthesized voice signal according to the adjusted spectrum.
4. A program product for synthesizing a voice signal in a computer, wherein the computer executes the following steps:
(A) recognizing a voice signal;
(B) acquiring a voice signal;
(C) acquiring a spectrum of the acquired voice signal as a first spectrum;
(D) generating a reconstructed spectrum of the voice signal from the recognized voice signal, as a second spectrum;
(E) generating an adjusted spectrum from the first spectrum and the second spectrum; and
(F) outputting a synthesized voice signal according to the adjusted spectrum.
CN2005100693792A 2002-03-04 2003-03-03 Speech synthesis system, speech synthesis method Expired - Lifetime CN1681002B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2002057818 2002-03-04
JP2002057818A JP2003255993A (en) 2002-03-04 2002-03-04 System, method, and program for speech recognition, and system, method, and program for speech synthesis
JP2002-057818 2002-03-04

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN03105163A Division CN1442845A (en) 2002-03-04 2003-03-03 Speech recognition system and method, speech synthesis system and method and program product

Publications (2)

Publication Number Publication Date
CN1681002A true CN1681002A (en) 2005-10-12
CN1681002B CN1681002B (en) 2010-04-28

Family

ID=27764437

Family Applications (2)

Application Number Title Priority Date Filing Date
CN2005100693792A Expired - Lifetime CN1681002B (en) 2002-03-04 2003-03-03 Speech synthesis system, speech synthesis method
CN03105163A Pending CN1442845A (en) 2002-03-04 2003-03-03 Speech recognition system and method, speech synthesis system and method and program product

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN03105163A Pending CN1442845A (en) 2002-03-04 2003-03-03 Speech recognition system and method, speech synthesis system and method and program product

Country Status (5)

Country Link
US (2) US7369991B2 (en)
EP (2) EP1345210B1 (en)
JP (1) JP2003255993A (en)
CN (2) CN1681002B (en)
DE (2) DE60321256D1 (en)

Families Citing this family (95)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004016658A (en) 2002-06-19 2004-01-22 Ntt Docomo Inc Mobile terminal capable of measuring biological signal, and measuring method
US6910911B2 (en) 2002-06-27 2005-06-28 Vocollect, Inc. Break-away electrical connector
US8200486B1 (en) * 2003-06-05 2012-06-12 The United States of America as represented by the Administrator of the National Aeronautics & Space Administration (NASA) Sub-audible speech recognition based upon electromyographic signals
JP4713111B2 (en) * 2003-09-19 2011-06-29 株式会社エヌ・ティ・ティ・ドコモ Speaking section detecting device, speech recognition processing device, transmission system, signal level control device, speaking section detecting method
US20050154593A1 (en) * 2004-01-14 2005-07-14 International Business Machines Corporation Method and apparatus employing electromyographic sensors to initiate oral communications with a voice-based device
US20060129394A1 (en) * 2004-12-09 2006-06-15 International Business Machines Corporation Method for communicating using synthesized speech
JP4847022B2 (en) 2005-01-28 2011-12-28 京セラ株式会社 Utterance content recognition device
JP4632831B2 (en) * 2005-03-24 2011-02-16 株式会社エヌ・ティ・ティ・ドコモ Speech recognition method and speech recognition apparatus
US7792314B2 (en) * 2005-04-20 2010-09-07 Mitsubishi Electric Research Laboratories, Inc. System and method for acquiring acoustic signals using doppler techniques
US8417185B2 (en) 2005-12-16 2013-04-09 Vocollect, Inc. Wireless headset and method for robust voice data communication
US7773767B2 (en) * 2006-02-06 2010-08-10 Vocollect, Inc. Headset terminal with rear stability strap
US7885419B2 (en) 2006-02-06 2011-02-08 Vocollect, Inc. Headset terminal with speech functionality
US7571101B2 (en) * 2006-05-25 2009-08-04 Charles Humble Quantifying psychological stress levels using voice patterns
US8082149B2 (en) * 2006-10-26 2011-12-20 Biosensic, Llc Methods and apparatuses for myoelectric-based speech processing
USD626949S1 (en) 2008-02-20 2010-11-09 Vocollect Healthcare Systems, Inc. Body-worn mobile device
USD605629S1 (en) 2008-09-29 2009-12-08 Vocollect, Inc. Headset
US8386261B2 (en) 2008-11-14 2013-02-26 Vocollect Healthcare Systems, Inc. Training/coaching system for a voice-enabled work environment
GB2466242B (en) * 2008-12-15 2013-01-02 Audio Analytic Ltd Sound identification systems
CN102257561A (en) * 2008-12-16 2011-11-23 皇家飞利浦电子股份有限公司 Speech signal processing
US8160287B2 (en) 2009-05-22 2012-04-17 Vocollect, Inc. Headset with adjustable headband
US8438659B2 (en) 2009-11-05 2013-05-07 Vocollect, Inc. Portable computing device and headset interface
US9634855B2 (en) 2010-05-13 2017-04-25 Alexander Poltorak Electronic personal interactive device that determines topics of interest using a conversational agent
US8659397B2 (en) 2010-07-22 2014-02-25 Vocollect, Inc. Method and system for correctly identifying specific RFID tags
USD643400S1 (en) 2010-08-19 2011-08-16 Vocollect Healthcare Systems, Inc. Body-worn mobile device
USD643013S1 (en) 2010-08-20 2011-08-09 Vocollect Healthcare Systems, Inc. Body-worn mobile device
US8700392B1 (en) * 2010-09-10 2014-04-15 Amazon Technologies, Inc. Speech-inclusive device interfaces
US9274744B2 (en) 2010-09-10 2016-03-01 Amazon Technologies, Inc. Relative position-inclusive device interfaces
US8775341B1 (en) 2010-10-26 2014-07-08 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
US9015093B1 (en) 2010-10-26 2015-04-21 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
US9223415B1 (en) 2012-01-17 2015-12-29 Amazon Technologies, Inc. Managing resource usage for task performance
US9263044B1 (en) * 2012-06-27 2016-02-16 Amazon Technologies, Inc. Noise reduction based on mouth area movement recognition
KR101240588B1 (en) 2012-12-14 2013-03-11 주식회사 좋은정보기술 Method and device for voice recognition using integrated audio-visual
CN103338330A (en) * 2013-06-18 2013-10-02 腾讯科技(深圳)有限公司 Picture processing method and device, and terminal
US11921471B2 (en) 2013-08-16 2024-03-05 Meta Platforms Technologies, Llc Systems, articles, and methods for wearable devices having secondary power sources in links of a band for providing secondary power in addition to a primary power source
US10042422B2 (en) 2013-11-12 2018-08-07 Thalmic Labs Inc. Systems, articles, and methods for capacitive electromyography sensors
US20150124566A1 (en) 2013-10-04 2015-05-07 Thalmic Labs Inc. Systems, articles and methods for wearable electronic devices employing contact sensors
US11199906B1 (en) 2013-09-04 2021-12-14 Amazon Technologies, Inc. Global user input management
US9367203B1 (en) 2013-10-04 2016-06-14 Amazon Technologies, Inc. User interface techniques for simulating three-dimensional depth
WO2015081113A1 (en) 2013-11-27 2015-06-04 Cezar Morun Systems, articles, and methods for electromyography sensors
US9564128B2 (en) 2013-12-09 2017-02-07 Qualcomm Incorporated Controlling a speech recognition process of a computing device
KR20150104345A (en) * 2014-03-05 2015-09-15 삼성전자주식회사 Voice synthesys apparatus and method for synthesizing voice
JP2015212732A (en) * 2014-05-01 2015-11-26 日本放送協会 Sound metaphor recognition device and program
US9880632B2 (en) 2014-06-19 2018-01-30 Thalmic Labs Inc. Systems, devices, and methods for gesture identification
TWI576826B (en) * 2014-07-28 2017-04-01 jing-feng Liu Discourse Recognition System and Unit
US9390725B2 (en) 2014-08-26 2016-07-12 ClearOne Inc. Systems and methods for noise reduction using speech recognition and speech synthesis
US20160253996A1 (en) * 2015-02-27 2016-09-01 Lenovo (Singapore) Pte. Ltd. Activating voice processing for associated speaker
US20160284363A1 (en) * 2015-03-24 2016-09-29 Intel Corporation Voice activity detection technologies, systems and methods employing the same
JP6518134B2 (en) * 2015-05-27 2019-05-22 株式会社ソニー・インタラクティブエンタテインメント Pre-worn display device
US10032463B1 (en) * 2015-12-29 2018-07-24 Amazon Technologies, Inc. Speech processing with learned representation of user interaction history
US11331045B1 (en) 2018-01-25 2022-05-17 Facebook Technologies, Llc Systems and methods for mitigating neuromuscular signal artifacts
US11000211B2 (en) 2016-07-25 2021-05-11 Facebook Technologies, Llc Adaptive system for deriving control signals from measurements of neuromuscular activity
EP3487595A4 (en) 2016-07-25 2019-12-25 CTRL-Labs Corporation System and method for measuring the movements of articulated rigid bodies
EP3487395A4 (en) 2016-07-25 2020-03-04 CTRL-Labs Corporation Methods and apparatus for predicting musculo-skeletal position information using wearable autonomous sensors
WO2020112986A1 (en) 2018-11-27 2020-06-04 Facebook Technologies, Inc. Methods and apparatus for autocalibration of a wearable electrode sensor system
US10409371B2 (en) 2016-07-25 2019-09-10 Ctrl-Labs Corporation Methods and apparatus for inferring user intent based on neuromuscular signals
US11216069B2 (en) 2018-05-08 2022-01-04 Facebook Technologies, Llc Systems and methods for improved speech recognition using neuromuscular information
US10489986B2 (en) 2018-01-25 2019-11-26 Ctrl-Labs Corporation User-controlled tuning of handstate representation model parameters
JP6686977B2 (en) * 2017-06-23 2020-04-22 カシオ計算機株式会社 Sound source separation information detection device, robot, sound source separation information detection method and program
US11200882B2 (en) * 2017-07-03 2021-12-14 Nec Corporation Signal processing device, signal processing method, and storage medium for storing program
CN107221324B (en) * 2017-08-02 2021-03-16 上海智蕙林医疗科技有限公司 Voice processing method and device
WO2019079757A1 (en) 2017-10-19 2019-04-25 Ctrl-Labs Corporation Systems and methods for identifying biological structures associated with neuromuscular source signals
US10937414B2 (en) 2018-05-08 2021-03-02 Facebook Technologies, Llc Systems and methods for text input using neuromuscular information
US10504286B2 (en) 2018-01-25 2019-12-10 Ctrl-Labs Corporation Techniques for anonymizing neuromuscular signal data
EP3743790A4 (en) 2018-01-25 2021-03-17 Facebook Technologies, Inc. Handstate reconstruction based on multiple inputs
US11150730B1 (en) 2019-04-30 2021-10-19 Facebook Technologies, Llc Devices, systems, and methods for controlling computing devices via neuromuscular signals of users
US11961494B1 (en) 2019-03-29 2024-04-16 Meta Platforms Technologies, Llc Electromagnetic interference reduction in extended reality environments
US11481030B2 (en) 2019-03-29 2022-10-25 Meta Platforms Technologies, Llc Methods and apparatus for gesture detection and classification
US10970936B2 (en) 2018-10-05 2021-04-06 Facebook Technologies, Llc Use of neuromuscular signals to provide enhanced interactions with physical objects in an augmented reality environment
US11567573B2 (en) 2018-09-20 2023-01-31 Meta Platforms Technologies, Llc Neuromuscular text entry, writing and drawing in augmented reality systems
US11069148B2 (en) 2018-01-25 2021-07-20 Facebook Technologies, Llc Visualization of reconstructed handstate information
EP3743901A4 (en) 2018-01-25 2021-03-31 Facebook Technologies, Inc. Real-time processing of handstate representation model estimates
WO2019147996A1 (en) 2018-01-25 2019-08-01 Ctrl-Labs Corporation Calibration techniques for handstate representation modeling using neuromuscular signals
US11907423B2 (en) 2019-11-25 2024-02-20 Meta Platforms Technologies, Llc Systems and methods for contextualized interactions with an environment
US11493993B2 (en) 2019-09-04 2022-11-08 Meta Platforms Technologies, Llc Systems, methods, and interfaces for performing inputs based on neuromuscular control
CN108364660B (en) * 2018-02-09 2020-10-09 腾讯音乐娱乐科技(深圳)有限公司 Stress recognition method and device and computer readable storage medium
CN108957392A (en) * 2018-04-16 2018-12-07 深圳市沃特沃德股份有限公司 Sounnd source direction estimation method and device
CN112424859A (en) * 2018-05-08 2021-02-26 脸谱科技有限责任公司 System and method for improving speech recognition using neuromuscular information
US10592001B2 (en) 2018-05-08 2020-03-17 Facebook Technologies, Llc Systems and methods for improved speech recognition using neuromuscular information
US11687770B2 (en) 2018-05-18 2023-06-27 Synaptics Incorporated Recurrent multimodal attention system based on expert gated networks
CN112469469A (en) 2018-05-25 2021-03-09 脸谱科技有限责任公司 Method and apparatus for providing sub-muscular control
CN112261907A (en) 2018-05-29 2021-01-22 脸谱科技有限责任公司 Noise reduction shielding technology in surface electromyogram signal measurement and related system and method
WO2019241701A1 (en) 2018-06-14 2019-12-19 Ctrl-Labs Corporation User identification and authentication with neuromuscular signatures
US11045137B2 (en) 2018-07-19 2021-06-29 Facebook Technologies, Llc Methods and apparatus for improved signal robustness for a wearable neuromuscular recording device
WO2020036958A1 (en) 2018-08-13 2020-02-20 Ctrl-Labs Corporation Real-time spike detection and identification
EP3843617B1 (en) 2018-08-31 2023-10-04 Facebook Technologies, LLC. Camera-guided interpretation of neuromuscular signals
CN109087651B (en) * 2018-09-05 2021-01-19 广州势必可赢网络科技有限公司 Voiceprint identification method, system and equipment based on video and spectrogram
CN112771478A (en) 2018-09-26 2021-05-07 脸谱科技有限责任公司 Neuromuscular control of physical objects in an environment
JP6920361B2 (en) * 2019-02-27 2021-08-18 エヌ・ティ・ティ・コミュニケーションズ株式会社 Judgment device, judgment method, and program
US10905383B2 (en) 2019-02-28 2021-02-02 Facebook Technologies, Llc Methods and apparatus for unsupervised one-shot machine learning for classification of human gestures and estimation of applied forces
CN110232907B (en) * 2019-07-24 2021-11-02 出门问问(苏州)信息科技有限公司 Voice synthesis method and device, readable storage medium and computing equipment
WO2021076662A1 (en) 2019-10-16 2021-04-22 Invicta Medical, Inc. Adjustable devices for treating sleep apnea, and associated systems and methods
JP2021081527A (en) * 2019-11-15 2021-05-27 エヌ・ティ・ティ・コミュニケーションズ株式会社 Voice recognition device, voice recognition method, and voice recognition program
US20220134102A1 (en) 2020-11-04 2022-05-05 Invicta Medical, Inc. Implantable electrodes with remote power delivery for treating sleep apnea, and associated systems and methods
US20210104244A1 (en) * 2020-12-14 2021-04-08 Intel Corporation Speech recognition with brain-computer interfaces
US11868531B1 (en) 2021-04-08 2024-01-09 Meta Platforms Technologies, Llc Wearable device providing for thumb-to-finger-based input gestures detected based on neuromuscular signals, and systems and methods of use thereof

Family Cites Families (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3383466A (en) * 1964-05-28 1968-05-14 Navy Usa Nonacoustic measures in automatic speech recognition
US4885790A (en) * 1985-03-18 1989-12-05 Massachusetts Institute Of Technology Processing of acoustic waveforms
JPS62239231A (en) * 1986-04-10 1987-10-20 Kiyarii Rabo:Kk Speech recognition method by inputting lip picture
US4862503A (en) * 1988-01-19 1989-08-29 Syracuse University Voice parameter extractor using oral airflow
FR2632725B1 (en) * 1988-06-14 1990-09-28 Centre Nat Rech Scient METHOD AND DEVICE FOR ANALYSIS, SYNTHESIS, SPEECH CODING
JPH04273298A (en) 1991-02-28 1992-09-29 Fujitsu Ltd Voice recognition device
US5522013A (en) * 1991-04-30 1996-05-28 Nokia Telecommunications Oy Method for speaker recognition using a lossless tube model of the speaker's
DE4212907A1 (en) * 1992-04-05 1993-10-07 Drescher Ruediger Integrated system with computer and multiple sensors for speech recognition - using range of sensors including camera, skin and muscle sensors and brain current detection, and microphones to produce word recognition
US5586215A (en) 1992-05-26 1996-12-17 Ricoh Corporation Neural network acoustic and visual speech recognition system
JPH0612483A (en) 1992-06-26 1994-01-21 Canon Inc Method and device for speech input
US5457394A (en) * 1993-04-12 1995-10-10 The Regents Of The University Of California Impulse radar studfinder
US5536902A (en) * 1993-04-14 1996-07-16 Yamaha Corporation Method of and apparatus for analyzing and synthesizing a sound by extracting and controlling a sound parameter
US5454375A (en) * 1993-10-21 1995-10-03 Glottal Enterprises Pneumotachograph mask or mouthpiece coupling element for airflow measurement during speech or singing
JP3455921B2 (en) 1993-12-24 2003-10-14 日本電信電話株式会社 Voice substitute device
FR2715755B1 (en) * 1994-01-28 1996-04-12 France Telecom Speech recognition method and device.
JPH08187368A (en) 1994-05-13 1996-07-23 Matsushita Electric Ind Co Ltd Game device, input device, voice selector, voice recognizing device and voice reacting device
US5573012A (en) * 1994-08-09 1996-11-12 The Regents Of The University Of California Body monitoring and imaging apparatus and method
JP3536363B2 (en) 1994-09-02 2004-06-07 松下電器産業株式会社 Voice recognition device
DE69509555T2 (en) * 1994-11-25 1999-09-02 Fink METHOD FOR CHANGING A VOICE SIGNAL BY MEANS OF BASIC FREQUENCY MANIPULATION
US5864812A (en) * 1994-12-06 1999-01-26 Matsushita Electric Industrial Co., Ltd. Speech synthesizing method and apparatus for combining natural speech segments and synthesized speech segments
US5774846A (en) * 1994-12-19 1998-06-30 Matsushita Electric Industrial Co., Ltd. Speech coding apparatus, linear prediction coefficient analyzing apparatus and noise reducing apparatus
US5701390A (en) * 1995-02-22 1997-12-23 Digital Voice Systems, Inc. Synthesis of MBE-based coded speech using regenerated phase information
US5717828A (en) * 1995-03-15 1998-02-10 Syracuse Language Systems Speech recognition apparatus and method for learning
JP3647499B2 (en) 1995-03-31 2005-05-11 フオスター電機株式会社 Voice pickup system
US5729694A (en) * 1996-02-06 1998-03-17 The Regents Of The University Of California Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
US6006175A (en) * 1996-02-06 1999-12-21 The Regents Of The University Of California Methods and apparatus for non-acoustic speech characterization and recognition
US6377919B1 (en) * 1996-02-06 2002-04-23 The Regents Of The University Of California System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech
US5960395A (en) * 1996-02-09 1999-09-28 Canon Kabushiki Kaisha Pattern matching method, apparatus and computer readable memory medium for speech recognition using dynamic programming
JPH09326856A (en) 1996-06-03 1997-12-16 Mitsubishi Electric Corp Speech recognition reply device
JP3266819B2 (en) * 1996-07-30 2002-03-18 株式会社エイ・ティ・アール人間情報通信研究所 Periodic signal conversion method, sound conversion method, and signal analysis method
JPH10123450A (en) 1996-10-15 1998-05-15 Sony Corp Head up display device with sound recognizing function
GB2319379A (en) * 1996-11-18 1998-05-20 Secr Defence Speech processing system
US6161089A (en) * 1997-03-14 2000-12-12 Digital Voice Systems, Inc. Multi-subframe quantization of spectral parameters
JPH10260692A (en) * 1997-03-18 1998-09-29 Toshiba Corp Method and system for recognition synthesis encoding and decoding of speech
GB9714001D0 (en) * 1997-07-02 1997-09-10 Simoco Europ Limited Method and apparatus for speech enhancement in a speech communication system
JPH11296192A (en) * 1998-04-10 1999-10-29 Pioneer Electron Corp Speech feature value compensating method for speech recognition, speech recognizing method, device therefor, and recording medium recorded with speech recognision program
JP3893763B2 (en) 1998-08-17 2007-03-14 富士ゼロックス株式会社 Voice detection device
US6347297B1 (en) * 1998-10-05 2002-02-12 Legerity, Inc. Matrix quantization with vector quantization error compensation and neural network postprocessing for robust speech recognition
US6263306B1 (en) * 1999-02-26 2001-07-17 Lucent Technologies Inc. Speech processing technique for use in speech recognition and speech coding
US6604070B1 (en) * 1999-09-22 2003-08-05 Conexant Systems, Inc. System of encoding and decoding speech signals
US6862558B2 (en) * 2001-02-14 2005-03-01 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Empirical mode decomposition for analyzing acoustical signals

Also Published As

Publication number Publication date
US7369991B2 (en) 2008-05-06
EP1345210B1 (en) 2008-05-28
US20070100630A1 (en) 2007-05-03
DE60330400D1 (en) 2010-01-14
US7680666B2 (en) 2010-03-16
CN1681002B (en) 2010-04-28
EP1667108B1 (en) 2009-12-02
DE60321256D1 (en) 2008-07-10
EP1345210A3 (en) 2005-08-17
EP1345210A2 (en) 2003-09-17
US20030171921A1 (en) 2003-09-11
EP1667108A1 (en) 2006-06-07
JP2003255993A (en) 2003-09-10
CN1442845A (en) 2003-09-17

Similar Documents

Publication Publication Date Title
CN1681002A (en) Speech synthesis system, speech synthesis method, and program product
CN102779508B Sound bank generating apparatus and method therefor, speech synthesis system and method thereof
CN1158642C (en) Method and system for detecting and generating transient conditions in auditory signals
CN1229773C Speech recognition conversation device
CN1187734C (en) Robot control apparatus
CN1160699C (en) Tone features for speech recognition
US20190259388A1 (en) Speech-to-text generation using video-speech matching from a primary speaker
Tran et al. Improvement to a NAM-captured whisper-to-speech system
CN1703734A (en) Method and apparatus for determining musical notes from sounds
CN1932807A (en) Apparatus and method for translating speech and performing speech synthesis of translation result
CN1101446A Computerized system for teaching speech
CN1894740A (en) Information processing system, information processing method, and information processing program
CN1622200A (en) Method and apparatus for multi-sensory speech enhancement
CN1662018A (en) Method and apparatus for multi-sensory speech enhancement on a mobile device
CN1461463A (en) Voice synthesis device
Hansen et al. On the issues of intra-speaker variability and realism in speech, speaker, and language recognition tasks
CN1787076A Method for distinguishing speakers based on hybrid support vector machine
CN1534597A Speech recognition method using variational inference with switching state space models
KR20150104345A (en) Voice synthesys apparatus and method for synthesizing voice
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
Scheme et al. Myoelectric signal classification for phoneme-based speech recognition
Rudzicz Production knowledge in the recognition of dysarthric speech
WO2017008075A1 (en) Systems and methods for human speech training
CN1253851C Speaker verification and speaker identification system and method based on prior knowledge
Meister et al. New speech corpora at IoC

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term
CX01 Expiry of patent term

Granted publication date: 20100428