CN1681002A - Speech synthesis system, speech synthesis method, and program product - Google Patents
- Publication number
- CN1681002A (application numbers CNA2005100693792A, CN200510069379A)
- Authority
- CN
- China
- Prior art keywords
- voice signal
- bands
- spectrum
- speech
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Abstract
The object of the present invention is to maintain a high recognition success rate for low-volume sound signals, without being affected by noise. The speech recognition system comprises a sound signal processor configured to acquire a sound signal and to calculate a sound signal parameter based on the acquired sound signal; an electromyographic signal processor configured to acquire potential changes on a surface of the object as an electromyographic signal and to calculate an electromyographic signal parameter based on the acquired electromyographic signal; an image information processor configured to acquire image information by taking an image of the object and to calculate an image information parameter based on the acquired image information; a speech recognizer configured to recognize a speech signal vocalized by the object, based on the sound signal parameter, the electromyographic signal parameter, and the image information parameter; and a recognition result provider configured to provide a result recognized by the speech recognizer.
Description
Technical Field
The present invention relates to a speech recognition system and method for recognizing speech signals, a speech synthesis system and method for synthesizing a speech signal based on the recognized speech, and program products used therein.
Background Art
This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. P2002-057818, filed on March 4, 2002, the entire contents of which are incorporated herein by reference.
Conventional speech detection devices recognize and process a sound signal by applying speech recognition techniques based on frequency analysis of the vocalized sound signal. Such speech recognition is typically performed using the spectrum envelope or similar techniques.
However, a conventional speech detection device can detect a speech signal only when a vocalized sound signal is actually input. Moreover, to obtain a speech detection result using speech recognition techniques, the speech must be vocalized at a certain volume.
Conventional speech detection devices therefore cannot be used in situations that require silence, for example in offices, libraries, and public institutions, where speaking would disturb the people around the speaker. A further problem of conventional speech detection devices is that under high-noise conditions cross-talk occurs and the performance of the speech detection function degrades.
On the other hand, techniques for acquiring speech signals from information other than the sound signal have been studied. Such techniques make it possible to acquire a speech signal even when no sound is vocalized, and can therefore solve the problems described above.
One method of recognizing speech signals from the visual information of the lips is image processing based on image information input from a camera.
In addition, research has been conducted on recognizing the type of vowel uttered by processing the electromyogram (hereinafter, EMG) signals generated by the movement of the muscles around the mouth. This research is disclosed in Noboru Sugie et al., "A Speech Prosthesis Employing a Speech Synthesizer: Vowel Discrimination from Perioral Muscle Activities and Vowel Production", IEEE Transactions on Biomedical Engineering, Vol. 32, No. 7, pp. 485-490, which describes a technique for discriminating the five vowels "a, i, u, e, o" by passing the EMG signals through a band-pass filter and counting the number of times the filtered signals cross a threshold.
It is also known to detect a speaker's vowels and consonants by processing EMG signals with a neural network. Furthermore, multimodal interfaces that accept information not through a single input channel but through a plurality of input channels have been proposed.
Meanwhile, a conventional speech synthesis system stores data characterizing a speaker's voice, and uses that data to synthesize a speech signal when the speaker vocalizes.
However, conventional speech detection methods that use information other than the sound signal have a lower recognition success rate than methods that obtain the speech signal from the sound signal itself. In particular, it is difficult to recognize uttered consonants from the movement of the muscles around the mouth.
A further problem of conventional speech synthesis systems is that, because the speech signal is synthesized from data that merely characterizes the speaker's voice, the synthesized signal sounds stiff and unnatural and cannot accurately express the speaker's emotion.
Summary of the Invention
In view of the above, an object of the present invention is to provide a speech recognition system and method that achieve a high recognition rate for low-volume speech signals without being affected by noise. Another object of the present invention is to provide a speech synthesis system and method that use the recognized speech to synthesize a speech signal that is more natural and clear and can accurately express the speaker's emotion.
A first aspect of the present invention provides a speech recognition system comprising a sound signal processor, an electromyographic (EMG) signal processor, an image information processor, a speech recognizer, and a recognition result provider.
The sound signal processor is configured to acquire a sound signal from an object and to calculate a sound signal parameter based on the acquired sound signal. The EMG signal processor is configured to acquire potential changes on a surface of the object as an EMG signal and to calculate an EMG signal parameter based on the acquired EMG signal. The image information processor is configured to acquire image information by taking an image of the object and to calculate an image information parameter based on the acquired image information. The speech recognizer is configured to recognize the speech signal vocalized by the object, based on the sound signal parameter, the EMG signal parameter, and the image information parameter. The recognition result provider is configured to provide the result recognized by the speech recognizer.
In the first aspect of the present invention, the speech recognizer may recognize a speech signal based on each of the sound signal parameter, the EMG signal parameter, and the image information parameter individually, compare the speech signals recognized from each, and recognize the speech signal based on the comparison result.
In the first aspect, the speech recognizer may recognize the speech signal using the sound signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
In the first aspect, the speech recognizer may comprise a hierarchical network in which a plurality of non-linear elements, each having an input unit and an output unit, are arranged hierarchically from top to bottom. The output unit of a non-linear element in an upper layer is connected to the input units of the adjacent non-linear elements in the lower layer, and a weight is assigned to each connection or combination of connections. Each non-linear element computes the data output from its output unit, based on the data input to its input unit and the weights assigned to its connections, and determines the connections to which the computed data is output. The sound signal parameter, the EMG signal parameter, and the image information parameter are input as input data to the non-linear elements in the uppermost layer of the hierarchical network, and the recognized speech signal is output as output data from the non-linear elements in the lowermost layer. The speech recognizer recognizes the speech signal based on the output data.
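The hierarchical network of weighted non-linear elements described above corresponds to a conventional feed-forward neural network. A minimal sketch follows, with sigmoid units as the non-linear elements; the layer sizes, the concatenation of the three parameter vectors, and the random weights (standing in for trained ones) are all illustrative assumptions, not values from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recognize(sound_p, emg_p, image_p, w_hidden, w_out):
    """Forward pass through the hierarchical network: the uppermost layer
    receives the three parameter vectors, the lowermost layer outputs one
    score per phoneme candidate."""
    x = np.concatenate([sound_p, emg_p, image_p])  # input data
    h = sigmoid(w_hidden @ x)                      # upper-layer elements
    y = sigmoid(w_out @ h)                         # lowermost-layer outputs
    return int(np.argmax(y))                       # recognized class index

rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(8, 9))   # 9 inputs -> 8 hidden elements
w_out = rng.normal(size=(5, 8))      # 8 hidden -> 5 phoneme classes
cls = recognize(np.ones(3), np.ones(3), np.ones(3), w_hidden, w_out)
```

In a working system the weights would be set by the learning function described below rather than drawn at random.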
In the first aspect, the speech recognizer may include a learning function configured to change the weights assigned to the non-linear elements in accordance with input sample data propagated between the lower and upper layers.
In the first aspect, the sound signal processor may include a microphone configured to acquire the sound signal from a sound source, the microphone being configured to communicate with a communication device. The EMG signal processor may include electrodes configured to acquire potential changes on the skin around the sound source as the EMG signal, the electrodes being mounted on a surface of the communication device. The image information processor may include a camera configured to acquire image information by capturing images of the movement of the sound source, the camera being mounted on a terminal separate from the communication device. The communication device transmits and receives data using the terminal.
In the first aspect, the terminal may comprise a main body in which the camera is mounted and a band for fixing the main body. The recognition result provider may be a display for displaying the result, mounted on a surface of the main body.
In the first aspect, the system may comprise a positioning device and a support. The sound signal processor may include a microphone configured to acquire the sound signal from the sound source. The EMG signal processor may include electrodes configured to acquire potential changes on the skin around the sound source as the EMG signal. The image information processor may include a camera configured to acquire image information by capturing images of the movement of the sound source. The positioning device may hold the microphone and the electrodes close to the sound source, and the support may hold the camera and the positioning device.
In the first aspect, the recognition result provider may display the result on a translucent display device, the recognition result provider being mounted on the support.
A second aspect of the present invention provides a speech synthesis system comprising a speech recognizer, a sound signal acquirer, a first spectrum acquirer, a second spectrum generator, an adjusted spectrum generator, and an output unit.
The speech recognizer is configured to recognize a speech signal. The sound signal acquirer is configured to acquire a sound signal. The first spectrum acquirer is configured to acquire the spectrum of the acquired sound signal as a first spectrum. The second spectrum generator is configured to generate, based on the speech signal recognized by the speech recognizer, a secondarily configured spectrum of the speech signal as a second spectrum. The adjusted spectrum generator is configured to generate an adjusted spectrum based on the first spectrum and the second spectrum. The output unit is configured to output a synthesized speech signal based on the adjusted spectrum.
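One way to realize the adjusted-spectrum generator is to blend the magnitude spectrum of the acquired sound (the first spectrum) with a spectrum reconstructed from the recognition result (the second spectrum), then resynthesize by inverse FFT. This is a sketch under stated assumptions: the fixed blend weight `alpha` and the reuse of the original phase are illustrative choices, as the patent does not specify the adjustment rule.

```python
import numpy as np

def adjust_and_synthesize(frame, second_spectrum, alpha=0.5):
    """Blend the measured magnitude spectrum (first spectrum) with the
    spectrum generated from the recognition result (second spectrum),
    then resynthesize a time-domain frame from the adjusted spectrum."""
    first = np.fft.rfft(frame)
    mag = (1 - alpha) * np.abs(first) + alpha * second_spectrum
    phase = np.angle(first)              # keep the acquired signal's phase
    adjusted = mag * np.exp(1j * phase)  # adjusted spectrum
    return np.fft.irfft(adjusted, n=len(frame))

t = np.arange(256) / 8000.0
frame = np.sin(2 * np.pi * 440 * t)      # stand-in for an acquired frame
second = np.abs(np.fft.rfft(frame))      # stand-in for the second spectrum
out = adjust_and_synthesize(frame, second)
```

When the second spectrum equals the measured magnitude spectrum, as in this toy example, the output reproduces the input frame; in practice the second spectrum would carry the clean spectral shape of the recognized speech.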
In the second aspect of the present invention, the output unit may include a communication device configured to transmit the synthesized speech signal as data.
A third aspect of the present invention provides a speech recognition method comprising the steps of: (A) acquiring a sound signal from an object and calculating a sound signal parameter based on the acquired sound signal; (B) acquiring potential changes on a surface of the object as an EMG signal and calculating an EMG signal parameter based on the acquired EMG signal; (C) acquiring image information by taking an image of the object and calculating an image information parameter based on the acquired image information; (D) recognizing the speech signal vocalized by the object, based on the sound signal parameter, the EMG signal parameter, and the image information parameter; and (E) providing the recognized result.
In the third aspect of the present invention, step (D) may comprise the steps of: (D1) recognizing a speech signal based on each of the sound signal parameter, the EMG signal parameter, and the image information parameter; (D2) comparing the speech signals recognized from each; and (D3) recognizing the speech signal based on the comparison result.
In the third aspect, in step (D) the speech signal may be recognized using the sound signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
In the third aspect, a plurality of non-linear elements, each having an input unit and an output unit, are arranged hierarchically from top to bottom in a hierarchical network. The output unit of a non-linear element in an upper layer is connected to the input units of the adjacent non-linear elements in the lower layer, and a weight is assigned to each connection or combination of connections. Each non-linear element computes the data output from its output unit, based on the data input to its input unit and the assigned weights, and determines the connections to which the computed data is output. Step (D) may comprise the steps of: (D11) inputting the sound signal parameter, the EMG signal parameter, and the image information parameter as input data to the non-linear elements in the uppermost layer of the hierarchical network; (D12) outputting the recognized speech signal as output data from the non-linear elements in the lowermost layer of the hierarchical network; and (D13) recognizing the speech signal based on the output data.
In the third aspect, the method may comprise a step of changing the weights assigned to the non-linear elements in accordance with input sample data propagated between the lower and upper layers.
A fourth aspect of the present invention provides a speech synthesis method comprising the steps of: (A) recognizing a speech signal; (B) acquiring a sound signal; (C) acquiring the spectrum of the acquired sound signal as a first spectrum; (D) generating, based on the recognized speech signal, a secondarily configured spectrum of the speech signal as a second spectrum; (E) generating an adjusted spectrum based on the first spectrum and the second spectrum; and (F) outputting a synthesized speech signal based on the adjusted spectrum.
In the fourth aspect of the present invention, step (F) may include a step of transmitting the synthesized speech signal as data.
A fifth aspect of the present invention provides a program product for recognizing speech signals on a computer, the computer executing the steps of: (A) acquiring a sound signal from an object and calculating a sound signal parameter based on the acquired sound signal; (B) acquiring potential changes on a surface of the object as an EMG signal and calculating an EMG signal parameter based on the acquired EMG signal; (C) acquiring image information by taking an image of the object and calculating an image information parameter based on the acquired image information; (D) recognizing the speech signal vocalized by the object, based on the sound signal parameter, the EMG signal parameter, and the image information parameter; and (E) providing the recognized result.
In the fifth aspect of the present invention, step (D) may comprise the steps of: (D1) recognizing a speech signal based on each of the sound signal parameter, the EMG signal parameter, and the image information parameter; (D2) comparing the speech signals recognized from each; and (D3) recognizing the speech signal based on the comparison result.
In step (D) of the fifth aspect, the speech signal may be recognized using the sound signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
In the fifth aspect, a plurality of non-linear elements, each having an input unit and an output unit, are arranged hierarchically from top to bottom in a hierarchical network. The output unit of a non-linear element in an upper layer is connected to the input units of the adjacent non-linear elements in the lower layer, and a weight is assigned to each connection or combination of connections. Each non-linear element computes the data output from its output unit, based on the data input to its input unit and the assigned weights, and determines the connections to which the computed data is output. Step (D) comprises the steps of: (D11) inputting the sound signal parameter, the EMG signal parameter, and the image information parameter as input data to the non-linear elements in the uppermost layer of the hierarchical network; (D12) outputting the recognized speech signal as output data from the output units of the non-linear elements in the lowermost layer of the hierarchical network; and (D13) recognizing the speech signal based on the output data.
In the fifth aspect, the computer may execute a step of changing the weights assigned to the non-linear elements in accordance with input sample data propagated between the lower and upper layers.
A sixth aspect of the present invention provides a program product for synthesizing speech signals on a computer, the computer executing the steps of: (A) recognizing a speech signal; (B) acquiring a sound signal; (C) acquiring the spectrum of the acquired sound signal as a first spectrum; (D) generating, based on the recognized speech signal, a secondarily configured spectrum of the speech signal as a second spectrum; (E) generating an adjusted spectrum based on the first spectrum and the second spectrum; and (F) outputting a synthesized speech signal based on the adjusted spectrum.
In the sixth aspect of the present invention, step (F) may include a step of transmitting the synthesized speech signal as data.
Brief Description of the Drawings
Fig. 1 is a functional block diagram of a speech recognition system according to an embodiment of the present invention.
Figs. 2A to 2D show an example of the process of extracting the sound signal and the EMG signals in the speech recognition system according to an embodiment of the present invention.
Figs. 3A to 3D show an example of the process of extracting image information in the speech recognition system according to an embodiment of the present invention.
Fig. 4 is a functional block diagram of the speech recognizer in the speech recognition system according to an embodiment of the present invention.
Fig. 5 is a functional block diagram of the speech recognizer in the speech recognition system according to an embodiment of the present invention.
Fig. 6 is a detailed diagram for explaining the speech recognizer in the speech recognition system according to an embodiment of the present invention.
Fig. 7 is a flowchart describing the speech recognition process in the operation of the speech recognition system according to an embodiment of the present invention.
Fig. 8 is a flowchart describing the learning process in the operation of the speech recognition system according to an embodiment of the present invention.
Fig. 9 is a functional block diagram of a speech synthesis system according to an embodiment of the present invention.
Figs. 10A to 10D are diagrams explaining the noise-removal operation in the speech recognition system according to an embodiment of the present invention.
Fig. 11 is a flowchart describing the operation of the speech synthesis process in the speech synthesis system according to an embodiment of the present invention.
Fig. 12 shows the overall arrangement of a system integrating the speech recognition system and the speech synthesis system according to an embodiment of the present invention.
Fig. 13 shows the complete configuration of a system integrating the speech recognition system and the speech synthesis system according to an embodiment of the present invention.
Fig. 14 shows a computer-readable recording medium on which a program according to an embodiment of the present invention is recorded.
Detailed Description of the Embodiments
(Configuration of the speech recognition system according to the first embodiment of the present invention)
The configuration of the speech recognition system according to the first embodiment of the present invention is described in detail below. Fig. 1 shows a functional block diagram of the speech recognition system according to the present embodiment.
As shown in Fig. 1, the speech recognition system comprises a sound signal processor 10, an EMG signal processor 13, an image information processor 16, an information integrator/recognizer 19, a speech recognizer 20, and a recognition result provider 21.
The sound signal acquisition unit 11 is a device, such as a microphone, for acquiring a sound signal from the mouth of the speaker (the object). The sound signal acquisition unit 11 detects the sound signal vocalized by the speaker and transmits the acquired sound signal to the sound signal processing unit 12.
The sound signal processing unit 12 is configured to extract sound signal parameters, such as the spectrum envelope or the spectral fine structure, from the sound signal acquired by the sound signal acquisition unit 11.
The sound signal processing unit 12 is a device that calculates, from the sound signal acquired by the sound signal acquisition unit 11, the sound signal parameters to be processed by the speech recognizer 20. The sound signal processing unit 12 cuts out the sound signal with a time window and calculates the sound signal parameters from the cut-out signal using analyses commonly used in speech recognition, such as short-time spectrum analysis, cepstrum analysis, maximum likelihood spectrum estimation, the covariance method, PARCOR analysis, and LSP analysis.
The EMG signal acquisition unit 14 is configured to extract the signals of the movement of the muscles around the mouth during speech. The EMG signal acquisition unit 14 detects potential changes on the skin surface around the mouth of the speaker (the object). That is, in order to recognize the movement of the plurality of muscles around the mouth that accompanies speech, the EMG signal acquisition unit 14 detects a plurality of EMG signals through a plurality of electrodes placed on the skin surface over the respective muscles, amplifies the EMG signals, and transmits them to the EMG signal processing unit 15.
The EMG signal processing unit 15 is configured to extract the EMG signal parameters by calculating the power of the EMG signals acquired by the EMG signal acquisition unit 14 and analyzing their frequency. The EMG signal processing unit 15 is a device that calculates the EMG signal parameters from the plurality of EMG signals transmitted by the EMG signal acquisition unit 14. More specifically, the EMG signal processing unit 15 cuts out each EMG signal with a time window and calculates the EMG signal parameters by computing average amplitude features such as the RMS (root mean square), ARV (average rectified value), or IEMG (integrated EMG).
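The three window features named above can be computed directly from one windowed EMG channel; a brief sketch (the window contents are illustrative):

```python
import numpy as np

def emg_features(window):
    """Average amplitude features of one EMG time window."""
    window = np.asarray(window, dtype=float)
    rms = np.sqrt(np.mean(window ** 2))   # root mean square
    arv = np.mean(np.abs(window))         # average rectified value
    iemg = np.sum(np.abs(window))         # integrated EMG
    return rms, arv, iemg

rms, arv, iemg = emg_features([1.0, -1.0, 1.0, -1.0])
```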
The sound signal processing unit 12 and the EMG signal processing unit 15 are described in detail below with reference to Figs. 2A to 2D.
The sound signal detected by the sound signal acquisition unit 11, or the EMG signal detected by the EMG signal acquisition unit 14, is cut out per time window by the sound signal processing unit 12 or the EMG signal processing unit 15 (S401 in Fig. 2A). The spectrum is then extracted from the cut-out signal by FFT (S402 in Fig. 2B). Next, a third-octave analysis is applied to the extracted spectrum to calculate the power in each frequency band (S403 in Fig. 2C). The calculated power for each frequency band is transmitted to the speech recognizer 20 as the sound signal parameter or the EMG signal parameter (S404 in Fig. 2D), where it is recognized by the speech recognizer 20.
The sound signal processing unit 12 and the EMG signal processing unit 15 may also extract the sound signal parameters or the EMG signal parameters by methods other than that of Figs. 2A to 2D.
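Steps S401 to S403 can be sketched as follows. The sample rate, window function, lowest band edge, and number of bands are illustrative assumptions rather than values from the patent; the band edges approximate a one-third-octave spacing.

```python
import numpy as np

def band_powers(window, fs=8000.0, f_low=100.0, n_bands=12):
    """Windowed signal -> FFT (S402) -> per-band power (S403)."""
    spectrum = np.fft.rfft(window * np.hanning(len(window)))
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    edges = f_low * 2.0 ** (np.arange(n_bands + 1) / 3.0)  # 1/3-octave edges
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

t = np.arange(256) / 8000.0                        # one cut-out window (S401)
params = band_powers(np.sin(2 * np.pi * 440 * t))  # parameter vector (S404)
```

For a 440 Hz test tone the power concentrates in the band covering 400 to about 504 Hz, as expected.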
The image information acquisition unit 17 is configured to acquire image information by capturing images of the spatial variation around the mouth during speech. The image information acquisition unit 17 comprises a camera, such as a video camera, that captures images of the spatial variation around the mouth during speech. The image information acquisition unit 17 detects the movement around the mouth as image information and transmits the image information to the image information processing unit 18.
The image information processing unit 18 is configured to calculate parameters of the movement around the mouth during speech (image information parameters) from the image information acquired by the image information acquisition unit 17. More specifically, the image information processing unit 18 extracts the features of the movement around the mouth from the image information using optical flow.
The image information processing unit 18 is described in detail below with reference to Figs. 3A to 3D.
First, the image information of the feature positions around the mouth at time t0 is extracted (S501 in Fig. 3A). The feature positions may be obtained from the positions of markers placed around the mouth, or by searching for feature positions around the mouth in the captured image information. The image information processing unit 18 may extract the feature positions from the image information as two-dimensional positions, or may acquire them as three-dimensional positions by using a plurality of cameras.
Similarly, after the time from t0 to t1 has elapsed, the feature positions around the mouth at time t1 are extracted (S502 in Fig. 3B). The image information processing unit 18 then calculates the movement of each feature point by computing the difference between the feature points at time t0 and those at time t1 (S503 in Fig. 3C). The image information processing unit 18 generates the image information parameters from the calculated differences (S504 in Fig. 3D).
The image information processing unit 18 may also obtain the image information parameters by methods other than that of Figs. 3A to 3D.
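Steps S502 to S504 reduce to computing a displacement vector per feature point and flattening the result into a parameter vector; a minimal sketch with made-up marker coordinates:

```python
import numpy as np

def image_parameters(points_t0, points_t1):
    """2-D feature positions at t0 and t1 -> flattened displacement vector."""
    diff = np.asarray(points_t1, float) - np.asarray(points_t0, float)  # S503
    return diff.ravel()                                                 # S504

p0 = [[10.0, 20.0], [30.0, 20.0], [20.0, 28.0]]   # positions at t0 (S501)
p1 = [[10.0, 22.0], [31.0, 20.0], [20.0, 25.0]]   # positions at t1 (S502)
params = image_parameters(p0, p1)
```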
Image information integrator/recognizer 19 is configured to the various information of obtaining from audio signal processor 10, EMG signal processor 13 and visual information processor 16 are carried out integration and identification.Image information integrator/recognizer 19 is furnished with speech recognition device 20 and recognition result provides device 21.
On the other hand, when around noise rank when big, when the volume of the voice signal that sends hour or in the time can not carrying out speech recognition with enough ranks according to the voice signal parameter, speech recognition device 20 not only can also come recognizing voice according to EMG signal parameter and image information parameter according to the voice signal parameter.
In addition, speech recognition device 20 can only be discerned special phoneme etc. according to the voice signal parameter, and this special phoneme can not correctly be discerned by using EMG signal parameter and image information parameter, thereby can improve the success ratio of identification.
An example of the speech recognizer 20 will now be specifically described with reference to FIG. 4. In the example shown in FIG. 4, the speech recognizer 20 recognizes the speech signal from each of the speech signal parameter, the EMG signal parameter, and the image information parameter, compares the individually recognized results, and recognizes the speech signal from the comparison.
More specifically, as shown in FIG. 4, the speech recognizer 20 first recognizes speech separately from the speech signal parameter, the EMG signal parameter, and the image information parameter, and then integrates the individual recognition results to perform speech recognition.
When two or more of the recognition results obtained from the individual parameters coincide, the speech recognizer 20 adopts that result as the final recognition result. When none of the recognition results coincide, the speech recognizer 20 adopts as the final result the recognition result with the highest recognition rate.
For example, when it is known in advance that recognition from the EMG signal parameter has a low success rate for a particular phoneme or manner of articulation, and recognition from the other (non-EMG) parameters indicates that this phoneme or articulation was uttered, the speech recognizer 20 ignores the recognition result obtained from the EMG signal parameter, which improves the recognition success rate.
In recognition based on the speech signal parameter, when the ambient noise level is determined to be high or the volume of the uttered speech to be low, the speech recognizer 20 reduces the influence of the result obtained from the speech signal parameter on the final result, and instead emphasizes the results obtained from the EMG signal parameter and the image information parameter. Conventional recognition methods can be used for the recognition performed on each individual parameter.
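The integration rule of FIG. 4 can be sketched as follows; the rule (agreement first, then fall back on the most reliable source) follows the text above, while the per-parameter recognition-rate values are illustrative assumptions, not figures from the patent.

```python
from collections import Counter

def integrate(results, rates):
    """results: {parameter_name: recognized phoneme}; rates: {parameter_name: known success rate}."""
    counts = Counter(results.values())
    phoneme, votes = counts.most_common(1)[0]
    if votes >= 2:                                # two or more results coincide
        return phoneme
    best = max(results, key=lambda p: rates[p])   # no agreement: trust the best source
    return results[best]

rates = {"speech": 0.9, "emg": 0.6, "image": 0.5}
print(integrate({"speech": "a", "emg": "a", "image": "o"}, rates))  # "a" (agreement)
print(integrate({"speech": "a", "emg": "i", "image": "o"}, rates))  # "a" (highest rate)
```

Lowering `rates["speech"]` models the noisy-environment case in the text, shifting the fallback toward the EMG or image result.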
Recognition based on the speech signal in the speech recognizer 20 can use any of various conventional speech recognition methods. Recognition based on the EMG signal can use the method disclosed in the technical literature 'Noboru Sugie et al., A Speech Prosthesis Employing a Speech Synthesizer—Vowel Discrimination from Perioral Muscle Activities and Vowel Production, IEEE Transactions on Biomedical Engineering, Vol. 32, No. 7, pp. 485-490', the method disclosed in JP-A-181888, and the like. Recognition based on the image information can use the methods disclosed in JP-A-2001-51963, JP-A-2000-206986, and the like.
Another example of the speech recognizer 20 will now be specifically described with reference to FIG. 5. In the example shown in FIG. 5, the speech recognizer 20 recognizes the speech signal from the speech signal parameter, the EMG signal parameter, and the image information parameter simultaneously.
More specifically, the speech recognizer 20 comprises a hierarchical network (for example, a neural network 20a) in which a plurality of nonlinear elements, each having input units and an output unit, are arranged hierarchically from top to bottom.
In the neural network 20a, the output unit of a nonlinear element in an upper layer is connected to the input units of the adjacent nonlinear elements in the layer below, and a weight is assigned to each connection or combination of connections. Each nonlinear element computes the data to be output from its output unit from the data input to its input units and the weights assigned to the connections or combinations, and outputs the computed data on the determined connections.
The speech signal parameter, the EMG signal parameter, and the image information parameter are input as input data to the nonlinear elements of the topmost layer of the hierarchical network. The recognized speech (vowels and consonants) is output as output data from the nonlinear elements of the bottommost layer. The speech recognizer 20 recognizes the speech signal from the data output by the output units of the bottommost nonlinear elements.
As described in Nishikawa and Kitamura, 'Neural Networks and Measurement Control', Asakura Shoten, pp. 18-50, the neural network 20a may be a fully connected three-layer neural network.
That is, the weights in the neural network 20a must be learned in advance, for example by a method such as backpropagation.
To learn the weights, the speech recognizer 20 obtains the speech signal parameter, the EMG signal parameter, and the image information parameter produced by the operation of uttering a known pattern, and learns the weights by using the known pattern as the teaching signal.
When the speaker utters, the EMG signal is input to the speech recognition system earlier than the speech signal and the image information. The speech recognizer 20 therefore delays only the input of the EMG signal parameter to the neural network 20a, without delaying the input of the speech signal parameter or the image information parameter, thereby synchronizing the speech signal, the EMG signal, and the image information.
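The synchronization described above can be sketched with a simple delay line applied only to the EMG stream; the 2-frame delay is an illustrative assumption (the patent does not specify the delay length).

```python
from collections import deque

class DelayLine:
    """Fixed-length FIFO: a value pushed at frame t is emitted at frame t + delay."""
    def __init__(self, delay, fill=0.0):
        self.buf = deque([fill] * delay)

    def push(self, value):
        self.buf.append(value)
        return self.buf.popleft()

emg_delay = DelayLine(delay=2)   # only the EMG parameter stream is delayed
frames = []
for t in range(5):
    emg_t = float(t)             # EMG parameter measured at frame t
    frames.append((t, emg_delay.push(emg_t)))
print(frames)  # [(0, 0.0), (1, 0.0), (2, 0.0), (3, 1.0), (4, 2.0)]
```

The speech and image parameter streams would be passed to the network undelayed, so all three parameters describing the same utterance arrive at the same frame.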
Receiving the various parameters as input data, the neural network 20a outputs the phoneme corresponding to the input parameters.
The neural network 20a may be a recurrent neural network (RNN), which feeds each recognition result back as input data for processing the next one. According to this embodiment, the recognition algorithm is not limited to neural networks; various speech recognition algorithms, for example hidden Markov models (HMMs), may also be employed.
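A minimal sketch of the hierarchical network of FIG. 5: a fully connected three-layer network whose top layer receives the concatenated speech, EMG, and image parameters and whose bottom layer scores phonemes. The layer sizes, weights, and input values here are illustrative assumptions; in the patent the weights are learned beforehand by backpropagation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    """Each nonlinear element combines its inputs through its connection weights."""
    return [sigmoid(sum(w * i for w, i in zip(row, inputs))) for row in weights]

random.seed(0)
n_in, n_hidden, n_out = 6, 4, 3          # e.g. 3 parameter groups x 2 values -> 3 phonemes
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

params = [0.2, 0.8, 0.1, 0.4, 0.9, 0.3]  # speech + EMG + image parameters
scores = layer(layer(params, w1), w2)    # bottom-layer outputs, one per phoneme
phoneme = max(range(n_out), key=lambda k: scores[k])
print(len(scores), all(0.0 < s < 1.0 for s in scores))  # 3 True
```

The recurrent variant mentioned above would additionally append the previous `scores` to `params` at each step.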
As shown in FIG. 6, the EMG signals 1, 2, ... detected by the EMG signal acquiring unit 14 are amplified in the EMG signal processing unit 15 (S601) and cut out one time window at a time. The spectrum of each cut-out EMG signal is calculated by an FFT. Before input to the neural network 20a, the calculated spectrum (S602) is subjected to third-octave analysis to calculate the EMG signal parameter.
The speech signal detected by the voice signal acquiring unit 11 is amplified in the sound signal processing unit 12 (S611) and cut out one time window at a time. The spectrum of the cut-out speech signal is calculated by an FFT. Before input to the neural network 20a, the calculated spectrum (S612) is subjected to third-octave analysis to calculate the speech signal parameter.
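The windowing, spectrum, and third-octave stages of FIG. 6 can be sketched as follows. This is a hedged simplification: a plain DFT stands in for the FFT, the lowest band edge `f0` and the sampling rate are assumptions, and pooling power into bands whose edges grow by a factor of 2**(1/3) stands in for a full third-octave filter bank.

```python
import cmath
import math

def dft_power(frame):
    """Power spectrum of one cut-out window (positive-frequency bins only)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 for k in range(n // 2)]

def third_octave_bands(power, fs, f0=100.0):
    """Sum spectral power into bands [f0 * 2**(b/3), f0 * 2**((b+1)/3))."""
    n = 2 * len(power)
    bands = {}
    for k, p in enumerate(power):
        f = k * fs / n
        if f < f0:
            continue
        b = int(math.log(f / f0, 2) * 3)   # third-octave band index
        bands[b] = bands.get(b, 0.0) + p
    return bands

fs = 8000
frame = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(64)]  # one windowed signal
bands = third_octave_bands(dft_power(frame), fs)
print(max(bands, key=bands.get))  # index of the band containing the 1 kHz tone
```

The resulting band powers, one vector per window, would form the EMG or speech signal parameter fed to the network.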
The image information processing unit 18 obtains, as optical flow, the motion of the feature positions around the speaker's mouth from the image information acquired by the image information acquisition unit 17 (S621). The image information parameter extracted as optical flow is input to the neural network 20a.
The motion of the feature positions may be extracted by locating the respective feature positions around the mouth in images captured over a series of times. Alternatively, markers may be placed on the feature points around the mouth together with a reference point, and the motion of the feature points extracted from their measured displacement relative to the reference point.
Receiving the various parameters as inputs, the neural network 20a outputs the phoneme corresponding to the input parameters.
In addition, when the speech cannot be recognized from any parameter by the recognition method of FIG. 4, the speech recognizer 20 according to this embodiment can be configured to perform recognition by the recognition method of FIG. 5. The speech recognizer 20 can also be configured to recognize speech by comparing, or by integrating, the results recognized by the method of FIG. 4 and the method of FIG. 5.
The recognition result provider 21 is a device that provides (outputs) the recognition result of the speech recognizer 20. The recognition result provider 21 may be a speech generator that outputs the recognition result of the speech recognizer 20 to the speaker as a speech signal, or a display that presents the result as text information. The recognition result provider 21 may also comprise a communication interface that, besides providing the result to the speaker, transmits the result as data to an application program running on a terminal such as a PC.
(Operation of the speech recognition system according to the embodiment)
The operation of the speech recognition system according to the embodiment will now be described with reference to FIG. 7 and FIG. 8. First, the operation of performing speech recognition is described with reference to FIG. 7.
In step S101, the speaker begins to utter. In steps S102 to S104, the voice signal acquiring unit 11, the EMG signal acquiring unit 14, and the image information acquisition unit 17 detect, respectively, the speech signal, the EMG signal, and the image information produced while the speaker utters.
In steps S105 to S107, the sound signal processing unit 12, the EMG signal processing unit 15, and the image information processing unit 18 calculate the speech signal parameter, the EMG signal parameter, and the image information parameter from the speech signal, the EMG signal, and the image information, respectively.
In step S108, the speech recognizer 20 recognizes the speech from the calculated parameters. In step S109, the recognition result provider 21 provides the result recognized by the speech recognizer 20, outputting the recognition result as a speech signal or displaying it.
Next, the learning operation of the speech recognition system according to the embodiment is described with reference to FIG. 8.
To improve the recognition success rate, it is important to learn the pronunciation characteristics of each speaker. This embodiment describes the learning operation using the neural network 20a of FIG. 5. When a recognition method that does not use the neural network 20a is employed, the speech recognition system according to the present invention adopts the learning function associated with that recognition method.
As shown in FIG. 8, in steps S301 and S302, the speaker begins to utter. In step S305, the speaker inputs the uttered content with a keyboard or the like, that is, inputs a teaching signal (sample data) while uttering. In step S303, the voice signal acquiring unit 11, the EMG signal acquiring unit 14, and the image information acquisition unit 17 detect the speech signal, the EMG signal, and the image information, respectively. In step S304, the sound signal processing unit 12, the EMG signal processing unit 15, and the image information processing unit 18 extract the speech signal parameter, the EMG signal parameter, and the image information parameter, respectively.
In step S306, the neural network 20a learns from the extracted parameters according to the teaching signal input from the keyboard. That is, the neural network 20a changes the weights assigned to the nonlinear elements according to the input teaching signal (sample data).
In step S307, when the recognition error rate falls below a threshold, the neural network 20a determines that the learning process is finished, and the operation ends (S308).
Otherwise, when the neural network 20a determines in step S307 that the learning process is not finished, the operation of steps S302 to S306 is repeated.
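The learning loop of FIG. 8 can be sketched as follows: teaching signals drive weight updates until the recognition error rate drops below a threshold (S307). A single linear unit stands in for the neural network 20a, and the training data, learning rate, and threshold are illustrative assumptions.

```python
# Each sample pairs an extracted parameter vector with a teaching signal (S304, S305)
samples = [([0.0, 1.0], 1), ([1.0, 0.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
w = [0.0, 0.0]
bias = 0.0
threshold = 0.25          # S307: stop when the error rate falls below this
lr = 0.5

for epoch in range(100):                       # repeat S302 to S306
    errors = 0
    for x, teach in samples:
        out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0
        if out != teach:                       # wrong: adjust the weights
            errors += 1
            w = [wi + lr * (teach - out) * xi for wi, xi in zip(w, x)]
            bias += lr * (teach - out)
    if errors / len(samples) < threshold:      # S307: learning finished (S308)
        break

print(errors / len(samples))  # 0.0
```

In the patent the updates would be backpropagation through all layers; the stopping rule on the error rate is the part this sketch illustrates.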
(Functions and effects of the speech recognition system according to the embodiment)
The speech recognition system according to this embodiment can recognize speech from a plurality of parameters calculated from the speech signal, the EMG signal, and the image information, and can therefore substantially improve noise immunity and related properties.
That is, the speech recognition system according to this embodiment comprises three types of input interface (the audio signal processor 10, the EMG signal processor 13, and the visual information processor 16) in order to improve noise immunity. Even when some of the input interfaces are unavailable, the system can recognize speech using the interfaces that remain available, improving the recognition success rate.
Therefore, the present invention can provide a speech recognition system that can recognize speech at a sufficient level even when the ambient noise level is high or the volume of the uttered speech is low.
(Speech synthesis system according to the second embodiment of the present invention)
A speech synthesis system according to the second embodiment of the present invention will be described with reference to FIGS. 9 to 11. The speech synthesis system according to the present invention uses the speech recognition system described above.
As shown in FIG. 9, the speech synthesis system according to this embodiment comprises the audio signal processor 10, the EMG signal processor 13, the visual information processor 16, the speech recognizer 20, and a speech synthesizer 55. The speech synthesizer 55 comprises a first spectrum acquirer 51, a second spectrum generator 52, an adjusted spectrum generator 53, and an output unit 54.
The audio signal processor 10, the EMG signal processor 13, the visual information processor 16, and the speech recognizer 20 have the same functions as in the speech recognition system of the first embodiment.
The first spectrum acquirer 51 is configured to obtain, as the first spectrum, the spectrum of the speech signal acquired by the voice signal acquiring unit 11. The obtained first spectrum contains noise (see FIG. 10C).
The second spectrum generator 52 is configured to generate, as the second spectrum, a reconfigured spectrum of the speech signal from the speech signal (result) recognized by the speech recognizer 20. More specifically, as shown in FIG. 10A, the second spectrum generator 52 reconfigures the spectrum of the uttered phoneme, for example from its formant frequencies, according to the phoneme extracted from the recognition result of the speech recognizer 20.
The adjusted spectrum generator 53 is configured to generate an adjusted spectrum from the first spectrum and the second spectrum. More specifically, as shown in FIG. 10D, the adjusted spectrum generator 53 multiplies the second spectrum (see FIG. 10A) by the first spectrum (see FIG. 10C), thereby generating an adjusted spectrum free of noise.
The output unit 54 is configured to output a synthesized speech signal according to the adjusted spectrum, and may comprise a communicator configured to transmit the synthesized speech signal as data. More specifically, the output unit 54 obtains a noise-free speech signal by applying an inverse Fourier transform to the noise-free adjusted spectrum (see FIG. 10D), and outputs the obtained signal as the synthesized speech signal.
That is, the speech synthesis system according to this embodiment obtains a noise-free speech signal by passing the noise-containing speech signal through a filter whose frequency characteristic is represented by the reconfigured spectrum, and outputs the obtained speech signal.
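The multiply-and-invert filtering of FIGS. 10A-10D can be sketched as follows. This is a hedged sketch: a 0/1 mask around the recognized component stands in for the second spectrum (the patent derives it from the recognized phoneme's formant frequencies), and a plain DFT/inverse DFT stands in for the FFT pair.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

n = 64
speech_bin = 8                                     # recognized speech component
noisy = [math.sin(2 * math.pi * speech_bin * t / n) +
         0.5 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]  # speech + noise

first = dft(noisy)                                 # first spectrum (contains noise)
second = [1.0 if k in (speech_bin, n - speech_bin) else 0.0
          for k in range(n)]                       # reconfigured second spectrum (mask)
adjusted = [s * f for s, f in zip(second, first)]  # adjusted spectrum (multiplication)
clean = idft(adjusted)                             # synthesized output signal

residual = max(abs(c - math.sin(2 * math.pi * speech_bin * t / n))
               for t, c in enumerate(clean))
print(residual < 1e-6)  # True: the noise component has been removed
```

The multiplication keeps only the frequency content endorsed by the recognition result, which is exactly the filtering interpretation given in the paragraph above.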
By recognizing speech with the various methods described above, the speech synthesis system according to this embodiment can separate the speech signal uttered by the speaker from the ambient noise, using the signal reconfigured from the recognition result and the signal detected by the voice signal acquiring unit 11, and can therefore output clear synthesized speech even when the ambient noise level is high.
Therefore, even when the noise is loud or the uttered speech is quiet, the speech synthesis system according to this embodiment can output a synthesized speech signal that sounds as if the speaker had uttered it in a noise-free environment.
The speech synthesis system according to this embodiment uses the speech recognition system according to the first embodiment; however, the present invention is not limited to that embodiment, and the speech synthesis system may recognize speech from parameters other than the speech signal parameter.
The operation of the speech synthesis system according to this embodiment will now be described with reference to FIG. 11.
As shown in FIG. 11, in steps S201 to S208, the same recognition process as in the first embodiment is performed.
In step S209, the first spectrum acquirer 51 obtains, as the first spectrum, the spectrum of the speech signal acquired by the voice signal acquiring unit 11. The second spectrum generator 52 generates, as the second spectrum, a reconfigured spectrum of the speech signal from the recognition result of the speech recognizer 20. The adjusted spectrum generator 53 generates the adjusted spectrum from the first spectrum and the second spectrum; in the adjusted spectrum, the noise (the components not uttered by the speaker) is eliminated from the speech signal acquired by the voice signal acquiring unit 11.
In step S210, the output unit 54 outputs a clear synthesized speech signal according to the adjusted spectrum.
(System according to the third embodiment of the present invention)
A system integrating the speech recognition system and the speech synthesis system will now be described with reference to FIG. 12.
As shown in FIG. 12, the system according to this embodiment comprises a communicator 30 and a wristwatch-type terminal 31 separate from it.
The communicator 30 is configured by adding the audio signal processor 10, the EMG signal processor 13, the speech recognizer 20, and the speech synthesizer 55 to a conventional portable terminal.
The EMG signal acquiring unit 14 comprises a plurality of skin surface electrodes 114 mounted so as to contact the skin of the speaker 32, and is configured to acquire the potential changes on the skin around the mouth of the speaker (sound source) 32 as the EMG signal. The voice signal acquiring unit 11 comprises a microphone 111 configured to acquire the speech signal from the speaker (sound source) 32. The microphone 111 may be configured to communicate with the communicator 30; for example, the microphone 111 may be mounted on the surface of the communicator 30, or may be a wireless microphone mounted near the mouth of the speaker 32. The skin surface electrodes 114 may likewise be mounted on the surface of the communicator 30.
The communicator 30 has the function of transmitting the speech signal synthesized on the basis of the recognition result of the speech recognizer 20 as the speech signal uttered by the speaker 32.
The wristwatch-type terminal 31 comprises the visual information processor 16 and the recognition result provider 21. A camera 117 for capturing moving images of the mouth of the speaker (sound source) 32 is mounted on the wristwatch-type terminal 31 as the image information acquisition unit 17. A display 121 for showing the recognition result is mounted on the wristwatch-type terminal 31 as the recognition result provider 21. The wristwatch-type terminal 31 comprises a belt 33 for fastening it.
The system integrating the speech recognition system and the speech synthesis system acquires the EMG signal and the speech signal through the EMG signal acquiring unit 14 and the voice signal acquiring unit 11 mounted on the communicator 30, and acquires the image information through the image information acquisition unit 17 mounted on the wristwatch-type terminal 31.
The communicator 30 and the wristwatch-type terminal 31 exchange data by wire or radio communication. The signals they collect are transmitted to the speech recognizer 20 built into the communicator 30; the speech recognizer 20 recognizes the speech from the collected signals, and the recognition result provider 21 mounted in the wristwatch-type terminal 31 displays the recognition result sent from the speech recognizer 20 by wire or radio communication. The communicator 30 can also send a clear, noise-free synthesized speech signal to the wristwatch-type terminal 31.
In this embodiment, the speech recognizer 20 is built into the communicator 30, and the recognition result provider 21 built into the wristwatch-type terminal 31 displays the recognition result. However, the speech recognizer 20 may instead be installed in the wristwatch-type terminal 31, or in another terminal that can communicate with the communicator 30, so that the wristwatch-type terminal 31 can recognize and synthesize speech.
The recognition result may be output from the communicator 30 as a speech signal, displayed on the monitor of the wristwatch-type terminal 31 (or of the communicator 30), or output from another terminal that can communicate with the communicator 30 and the wristwatch-type terminal 31.
(System according to the fourth embodiment of the present invention)
A system integrating the speech recognition system and the speech synthesis system according to this embodiment will be described with reference to FIG. 13.
As shown in FIG. 13, the system according to this embodiment comprises a fixing device 41 in the form of eyeglasses; a camera 117 serving as the image information acquisition unit 17, which can be adjusted to capture the motion of the mouth of the speaker (sound source) 32; a positioning device 42; a head-mounted display (HMD) 121 serving as the recognition result provider; and the speech recognizer 20 built into the fixing device 41. The fixing device 41 can be worn on the head of the speaker 32.
The skin surface electrodes 114, which serve as the EMG signal acquiring unit 14 and are configured to acquire the potential changes on the skin around the mouth of the speaker (sound source) 32, and the microphone 111, which serves as the voice signal acquiring unit 11 and is configured to acquire the speech signal from the mouth of the speaker (sound source) 32, are adjustably fixed around the mouth of the speaker 32.
Since the system according to this embodiment is worn by the speaker 32 and can recognize and synthesize speech, it leaves both of the speaker's hands free.
(System according to the fifth embodiment of the present invention)
The speech recognition system, speech recognition method, speech synthesis system, and speech synthesis method according to the above embodiments can be realized as a program, written in a predetermined programming language, that is executed on a general-purpose computer (for example, a personal computer) 215 or on an IC chip included in the communicator 30 or similar equipment.
The program may also be recorded on a storage medium readable by the general-purpose computer 215. That is, as shown in FIG. 14, the program may be stored on a medium such as a floppy disk 216, a CD-ROM 217, a RAM 218, or a cassette tape 219. The system or method of the present invention can be realized by inserting the storage medium containing the program into the computer 215, or by installing the program into the memory of the communicator 30.
The speech recognition system, method, and program according to the present invention can maintain a high recognition success rate even for a low-volume speech signal that would otherwise go unrecognized because of noise.
The speech synthesis system, method, and program according to the present invention can synthesize a speech signal using the recognized speech, making the synthesized speech signal more natural and clear and properly expressing the speaker's emotion and the like.
Claims (4)
1. A speech synthesis system comprising:
a speech recognizer configured to recognize a speech signal;
a speech signal acquirer configured to acquire a speech signal;
a first spectrum acquirer configured to acquire a spectrum of the acquired speech signal as a first spectrum;
a second spectrum generator configured to generate a reconfigured spectrum of the speech signal, based on the speech signal recognized by the speech recognizer, as a second spectrum;
an adjusted spectrum generator configured to generate an adjusted spectrum based on the first spectrum and the second spectrum; and
an output unit configured to output a synthesized speech signal based on the adjusted spectrum.
2. The speech synthesis system according to claim 1, wherein the output unit comprises a communicator configured to transmit the synthesized speech signal as data.
3. A speech synthesis method comprising the steps of:
(A) recognizing a speech signal;
(B) acquiring a speech signal;
(C) acquiring a spectrum of the acquired speech signal as a first spectrum;
(D) generating a reconfigured spectrum of the speech signal, based on the recognized speech signal, as a second spectrum;
(E) generating an adjusted spectrum based on the first spectrum and the second spectrum; and
(F) outputting a synthesized speech signal based on the adjusted spectrum.
4. A program product for synthesizing a speech signal in a computer, wherein the computer executes the following steps:
(A) recognizing a speech signal;
(B) acquiring a speech signal;
(C) acquiring a spectrum of the acquired speech signal as a first spectrum;
(D) generating a reconfigured spectrum of the speech signal, based on the recognized speech signal, as a second spectrum;
(E) generating an adjusted spectrum based on the first spectrum and the second spectrum; and
(F) outputting a synthesized speech signal based on the adjusted spectrum.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2002057818 | 2002-03-04 | ||
JP2002057818A JP2003255993A (en) | 2002-03-04 | 2002-03-04 | System, method, and program for speech recognition, and system, method, and program for speech synthesis |
JP2002-057818 | 2002-03-04 |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN03105163A Division CN1442845A (en) | 2002-03-04 | 2003-03-03 | Speech recognition system and method, speech synthesis system and method and program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1681002A true CN1681002A (en) | 2005-10-12 |
CN1681002B CN1681002B (en) | 2010-04-28 |
Family
ID=27764437
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2005100693792A Expired - Lifetime CN1681002B (en) | 2002-03-04 | 2003-03-03 | Speech synthesis system, speech synthesis method |
CN03105163A Pending CN1442845A (en) | 2002-03-04 | 2003-03-03 | Speech recognition system and method, speech synthesis system and method and program product |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN03105163A Pending CN1442845A (en) | 2002-03-04 | 2003-03-03 | Speech recognition system and method, speech synthesis system and method and program product |
Country Status (5)
Country | Link |
---|---|
US (2) | US7369991B2 (en) |
EP (2) | EP1345210B1 (en) |
JP (1) | JP2003255993A (en) |
CN (2) | CN1681002B (en) |
DE (2) | DE60321256D1 (en) |
Families Citing this family (95)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004016658A (en) | 2002-06-19 | 2004-01-22 | Ntt Docomo Inc | Mobile terminal capable of measuring biological signal, and measuring method |
US6910911B2 (en) | 2002-06-27 | 2005-06-28 | Vocollect, Inc. | Break-away electrical connector |
US8200486B1 (en) * | 2003-06-05 | 2012-06-12 | The United States of America as represented by the Administrator of the National Aeronautics & Space Administration (NASA) | Sub-audible speech recognition based upon electromyographic signals |
JP4713111B2 (en) * | 2003-09-19 | 2011-06-29 | 株式会社エヌ・ティ・ティ・ドコモ | Speaking section detecting device, speech recognition processing device, transmission system, signal level control device, speaking section detecting method |
US20050154593A1 (en) * | 2004-01-14 | 2005-07-14 | International Business Machines Corporation | Method and apparatus employing electromyographic sensors to initiate oral communications with a voice-based device |
US20060129394A1 (en) * | 2004-12-09 | 2006-06-15 | International Business Machines Corporation | Method for communicating using synthesized speech |
JP4847022B2 (en) | 2005-01-28 | 2011-12-28 | 京セラ株式会社 | Utterance content recognition device |
JP4632831B2 (en) * | 2005-03-24 | 2011-02-16 | 株式会社エヌ・ティ・ティ・ドコモ | Speech recognition method and speech recognition apparatus |
US7792314B2 (en) * | 2005-04-20 | 2010-09-07 | Mitsubishi Electric Research Laboratories, Inc. | System and method for acquiring acoustic signals using doppler techniques |
US8417185B2 (en) | 2005-12-16 | 2013-04-09 | Vocollect, Inc. | Wireless headset and method for robust voice data communication |
US7773767B2 (en) * | 2006-02-06 | 2010-08-10 | Vocollect, Inc. | Headset terminal with rear stability strap |
US7885419B2 (en) | 2006-02-06 | 2011-02-08 | Vocollect, Inc. | Headset terminal with speech functionality |
US7571101B2 (en) * | 2006-05-25 | 2009-08-04 | Charles Humble | Quantifying psychological stress levels using voice patterns |
US8082149B2 (en) * | 2006-10-26 | 2011-12-20 | Biosensic, Llc | Methods and apparatuses for myoelectric-based speech processing |
USD626949S1 (en) | 2008-02-20 | 2010-11-09 | Vocollect Healthcare Systems, Inc. | Body-worn mobile device |
USD605629S1 (en) | 2008-09-29 | 2009-12-08 | Vocollect, Inc. | Headset |
US8386261B2 (en) | 2008-11-14 | 2013-02-26 | Vocollect Healthcare Systems, Inc. | Training/coaching system for a voice-enabled work environment |
GB2466242B (en) * | 2008-12-15 | 2013-01-02 | Audio Analytic Ltd | Sound identification systems |
CN102257561A (en) * | 2008-12-16 | 2011-11-23 | 皇家飞利浦电子股份有限公司 | Speech signal processing |
US8160287B2 (en) | 2009-05-22 | 2012-04-17 | Vocollect, Inc. | Headset with adjustable headband |
US8438659B2 (en) | 2009-11-05 | 2013-05-07 | Vocollect, Inc. | Portable computing device and headset interface |
US9634855B2 (en) | 2010-05-13 | 2017-04-25 | Alexander Poltorak | Electronic personal interactive device that determines topics of interest using a conversational agent |
US8659397B2 (en) | 2010-07-22 | 2014-02-25 | Vocollect, Inc. | Method and system for correctly identifying specific RFID tags |
USD643400S1 (en) | 2010-08-19 | 2011-08-16 | Vocollect Healthcare Systems, Inc. | Body-worn mobile device |
USD643013S1 (en) | 2010-08-20 | 2011-08-09 | Vocollect Healthcare Systems, Inc. | Body-worn mobile device |
US8700392B1 (en) * | 2010-09-10 | 2014-04-15 | Amazon Technologies, Inc. | Speech-inclusive device interfaces |
US9274744B2 (en) | 2010-09-10 | 2016-03-01 | Amazon Technologies, Inc. | Relative position-inclusive device interfaces |
US8775341B1 (en) | 2010-10-26 | 2014-07-08 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US9015093B1 (en) | 2010-10-26 | 2015-04-21 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US9223415B1 (en) | 2012-01-17 | 2015-12-29 | Amazon Technologies, Inc. | Managing resource usage for task performance |
US9263044B1 (en) * | 2012-06-27 | 2016-02-16 | Amazon Technologies, Inc. | Noise reduction based on mouth area movement recognition |
KR101240588B1 (en) | 2012-12-14 | 2013-03-11 | 주식회사 좋은정보기술 | Method and device for voice recognition using integrated audio-visual |
CN103338330A (en) * | 2013-06-18 | 2013-10-02 | 腾讯科技(深圳)有限公司 | Picture processing method and device, and terminal |
US11921471B2 (en) | 2013-08-16 | 2024-03-05 | Meta Platforms Technologies, Llc | Systems, articles, and methods for wearable devices having secondary power sources in links of a band for providing secondary power in addition to a primary power source |
US10042422B2 (en) | 2013-11-12 | 2018-08-07 | Thalmic Labs Inc. | Systems, articles, and methods for capacitive electromyography sensors |
US20150124566A1 (en) | 2013-10-04 | 2015-05-07 | Thalmic Labs Inc. | Systems, articles and methods for wearable electronic devices employing contact sensors |
US11199906B1 (en) | 2013-09-04 | 2021-12-14 | Amazon Technologies, Inc. | Global user input management |
US9367203B1 (en) | 2013-10-04 | 2016-06-14 | Amazon Technologies, Inc. | User interface techniques for simulating three-dimensional depth |
WO2015081113A1 (en) | 2013-11-27 | 2015-06-04 | Cezar Morun | Systems, articles, and methods for electromyography sensors |
US9564128B2 (en) | 2013-12-09 | 2017-02-07 | Qualcomm Incorporated | Controlling a speech recognition process of a computing device |
KR20150104345A (en) * | 2014-03-05 | 2015-09-15 | 삼성전자주식회사 | Voice synthesis apparatus and method for synthesizing voice |
JP2015212732A (en) * | 2014-05-01 | 2015-11-26 | 日本放送協会 | Sound metaphor recognition device and program |
US9880632B2 (en) | 2014-06-19 | 2018-01-30 | Thalmic Labs Inc. | Systems, devices, and methods for gesture identification |
TWI576826B (en) * | 2014-07-28 | 2017-04-01 | jing-feng Liu | Discourse Recognition System and Unit |
US9390725B2 (en) | 2014-08-26 | 2016-07-12 | ClearOne Inc. | Systems and methods for noise reduction using speech recognition and speech synthesis |
US20160253996A1 (en) * | 2015-02-27 | 2016-09-01 | Lenovo (Singapore) Pte. Ltd. | Activating voice processing for associated speaker |
US20160284363A1 (en) * | 2015-03-24 | 2016-09-29 | Intel Corporation | Voice activity detection technologies, systems and methods employing the same |
JP6518134B2 (en) * | 2015-05-27 | 2019-05-22 | 株式会社ソニー・インタラクティブエンタテインメント | Pre-worn display device |
US10032463B1 (en) * | 2015-12-29 | 2018-07-24 | Amazon Technologies, Inc. | Speech processing with learned representation of user interaction history |
US11331045B1 (en) | 2018-01-25 | 2022-05-17 | Facebook Technologies, Llc | Systems and methods for mitigating neuromuscular signal artifacts |
US11000211B2 (en) | 2016-07-25 | 2021-05-11 | Facebook Technologies, Llc | Adaptive system for deriving control signals from measurements of neuromuscular activity |
EP3487595A4 (en) | 2016-07-25 | 2019-12-25 | CTRL-Labs Corporation | System and method for measuring the movements of articulated rigid bodies |
EP3487395A4 (en) | 2016-07-25 | 2020-03-04 | CTRL-Labs Corporation | Methods and apparatus for predicting musculo-skeletal position information using wearable autonomous sensors |
WO2020112986A1 (en) | 2018-11-27 | 2020-06-04 | Facebook Technologies, Inc. | Methods and apparatus for autocalibration of a wearable electrode sensor system |
US10409371B2 (en) | 2016-07-25 | 2019-09-10 | Ctrl-Labs Corporation | Methods and apparatus for inferring user intent based on neuromuscular signals |
US11216069B2 (en) | 2018-05-08 | 2022-01-04 | Facebook Technologies, Llc | Systems and methods for improved speech recognition using neuromuscular information |
US10489986B2 (en) | 2018-01-25 | 2019-11-26 | Ctrl-Labs Corporation | User-controlled tuning of handstate representation model parameters |
JP6686977B2 (en) * | 2017-06-23 | 2020-04-22 | カシオ計算機株式会社 | Sound source separation information detection device, robot, sound source separation information detection method and program |
US11200882B2 (en) * | 2017-07-03 | 2021-12-14 | Nec Corporation | Signal processing device, signal processing method, and storage medium for storing program |
CN107221324B (en) * | 2017-08-02 | 2021-03-16 | 上海智蕙林医疗科技有限公司 | Voice processing method and device |
WO2019079757A1 (en) | 2017-10-19 | 2019-04-25 | Ctrl-Labs Corporation | Systems and methods for identifying biological structures associated with neuromuscular source signals |
US10937414B2 (en) | 2018-05-08 | 2021-03-02 | Facebook Technologies, Llc | Systems and methods for text input using neuromuscular information |
US10504286B2 (en) | 2018-01-25 | 2019-12-10 | Ctrl-Labs Corporation | Techniques for anonymizing neuromuscular signal data |
EP3743790A4 (en) | 2018-01-25 | 2021-03-17 | Facebook Technologies, Inc. | Handstate reconstruction based on multiple inputs |
US11150730B1 (en) | 2019-04-30 | 2021-10-19 | Facebook Technologies, Llc | Devices, systems, and methods for controlling computing devices via neuromuscular signals of users |
US11961494B1 (en) | 2019-03-29 | 2024-04-16 | Meta Platforms Technologies, Llc | Electromagnetic interference reduction in extended reality environments |
US11481030B2 (en) | 2019-03-29 | 2022-10-25 | Meta Platforms Technologies, Llc | Methods and apparatus for gesture detection and classification |
US10970936B2 (en) | 2018-10-05 | 2021-04-06 | Facebook Technologies, Llc | Use of neuromuscular signals to provide enhanced interactions with physical objects in an augmented reality environment |
US11567573B2 (en) | 2018-09-20 | 2023-01-31 | Meta Platforms Technologies, Llc | Neuromuscular text entry, writing and drawing in augmented reality systems |
US11069148B2 (en) | 2018-01-25 | 2021-07-20 | Facebook Technologies, Llc | Visualization of reconstructed handstate information |
EP3743901A4 (en) | 2018-01-25 | 2021-03-31 | Facebook Technologies, Inc. | Real-time processing of handstate representation model estimates |
WO2019147996A1 (en) | 2018-01-25 | 2019-08-01 | Ctrl-Labs Corporation | Calibration techniques for handstate representation modeling using neuromuscular signals |
US11907423B2 (en) | 2019-11-25 | 2024-02-20 | Meta Platforms Technologies, Llc | Systems and methods for contextualized interactions with an environment |
US11493993B2 (en) | 2019-09-04 | 2022-11-08 | Meta Platforms Technologies, Llc | Systems, methods, and interfaces for performing inputs based on neuromuscular control |
CN108364660B (en) * | 2018-02-09 | 2020-10-09 | 腾讯音乐娱乐科技(深圳)有限公司 | Stress recognition method and device and computer readable storage medium |
CN108957392A (en) * | 2018-04-16 | 2018-12-07 | 深圳市沃特沃德股份有限公司 | Sound source direction estimation method and device |
CN112424859A (en) * | 2018-05-08 | 2021-02-26 | 脸谱科技有限责任公司 | System and method for improving speech recognition using neuromuscular information |
US10592001B2 (en) | 2018-05-08 | 2020-03-17 | Facebook Technologies, Llc | Systems and methods for improved speech recognition using neuromuscular information |
US11687770B2 (en) | 2018-05-18 | 2023-06-27 | Synaptics Incorporated | Recurrent multimodal attention system based on expert gated networks |
CN112469469A (en) | 2018-05-25 | 2021-03-09 | 脸谱科技有限责任公司 | Method and apparatus for providing sub-muscular control |
CN112261907A (en) | 2018-05-29 | 2021-01-22 | 脸谱科技有限责任公司 | Noise reduction shielding technology in surface electromyogram signal measurement and related system and method |
WO2019241701A1 (en) | 2018-06-14 | 2019-12-19 | Ctrl-Labs Corporation | User identification and authentication with neuromuscular signatures |
US11045137B2 (en) | 2018-07-19 | 2021-06-29 | Facebook Technologies, Llc | Methods and apparatus for improved signal robustness for a wearable neuromuscular recording device |
WO2020036958A1 (en) | 2018-08-13 | 2020-02-20 | Ctrl-Labs Corporation | Real-time spike detection and identification |
EP3843617B1 (en) | 2018-08-31 | 2023-10-04 | Facebook Technologies, LLC. | Camera-guided interpretation of neuromuscular signals |
CN109087651B (en) * | 2018-09-05 | 2021-01-19 | 广州势必可赢网络科技有限公司 | Voiceprint identification method, system and equipment based on video and spectrogram |
CN112771478A (en) | 2018-09-26 | 2021-05-07 | 脸谱科技有限责任公司 | Neuromuscular control of physical objects in an environment |
JP6920361B2 (en) * | 2019-02-27 | 2021-08-18 | エヌ・ティ・ティ・コミュニケーションズ株式会社 | Judgment device, judgment method, and program |
US10905383B2 (en) | 2019-02-28 | 2021-02-02 | Facebook Technologies, Llc | Methods and apparatus for unsupervised one-shot machine learning for classification of human gestures and estimation of applied forces |
CN110232907B (en) * | 2019-07-24 | 2021-11-02 | 出门问问(苏州)信息科技有限公司 | Voice synthesis method and device, readable storage medium and computing equipment |
WO2021076662A1 (en) | 2019-10-16 | 2021-04-22 | Invicta Medical, Inc. | Adjustable devices for treating sleep apnea, and associated systems and methods |
JP2021081527A (en) * | 2019-11-15 | 2021-05-27 | エヌ・ティ・ティ・コミュニケーションズ株式会社 | Voice recognition device, voice recognition method, and voice recognition program |
US20220134102A1 (en) | 2020-11-04 | 2022-05-05 | Invicta Medical, Inc. | Implantable electrodes with remote power delivery for treating sleep apnea, and associated systems and methods |
US20210104244A1 (en) * | 2020-12-14 | 2021-04-08 | Intel Corporation | Speech recognition with brain-computer interfaces |
US11868531B1 (en) | 2021-04-08 | 2024-01-09 | Meta Platforms Technologies, Llc | Wearable device providing for thumb-to-finger-based input gestures detected based on neuromuscular signals, and systems and methods of use thereof |
Family Cites Families (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3383466A (en) * | 1964-05-28 | 1968-05-14 | Navy Usa | Nonacoustic measures in automatic speech recognition |
US4885790A (en) * | 1985-03-18 | 1989-12-05 | Massachusetts Institute Of Technology | Processing of acoustic waveforms |
JPS62239231A (en) * | 1986-04-10 | 1987-10-20 | Kiyarii Rabo:Kk | Speech recognition method by inputting lip picture |
US4862503A (en) * | 1988-01-19 | 1989-08-29 | Syracuse University | Voice parameter extractor using oral airflow |
FR2632725B1 (en) * | 1988-06-14 | 1990-09-28 | Centre Nat Rech Scient | METHOD AND DEVICE FOR ANALYSIS, SYNTHESIS, SPEECH CODING |
JPH04273298A (en) | 1991-02-28 | 1992-09-29 | Fujitsu Ltd | Voice recognition device |
US5522013A (en) * | 1991-04-30 | 1996-05-28 | Nokia Telecommunications Oy | Method for speaker recognition using a lossless tube model of the speaker's |
DE4212907A1 (en) * | 1992-04-05 | 1993-10-07 | Drescher Ruediger | Integrated system with computer and multiple sensors for speech recognition - using range of sensors including camera, skin and muscle sensors and brain current detection, and microphones to produce word recognition |
US5586215A (en) | 1992-05-26 | 1996-12-17 | Ricoh Corporation | Neural network acoustic and visual speech recognition system |
JPH0612483A (en) | 1992-06-26 | 1994-01-21 | Canon Inc | Method and device for speech input |
US5457394A (en) * | 1993-04-12 | 1995-10-10 | The Regents Of The University Of California | Impulse radar studfinder |
US5536902A (en) * | 1993-04-14 | 1996-07-16 | Yamaha Corporation | Method of and apparatus for analyzing and synthesizing a sound by extracting and controlling a sound parameter |
US5454375A (en) * | 1993-10-21 | 1995-10-03 | Glottal Enterprises | Pneumotachograph mask or mouthpiece coupling element for airflow measurement during speech or singing |
JP3455921B2 (en) | 1993-12-24 | 2003-10-14 | 日本電信電話株式会社 | Voice substitute device |
FR2715755B1 (en) * | 1994-01-28 | 1996-04-12 | France Telecom | Speech recognition method and device. |
JPH08187368A (en) | 1994-05-13 | 1996-07-23 | Matsushita Electric Ind Co Ltd | Game device, input device, voice selector, voice recognizing device and voice reacting device |
US5573012A (en) * | 1994-08-09 | 1996-11-12 | The Regents Of The University Of California | Body monitoring and imaging apparatus and method |
JP3536363B2 (en) | 1994-09-02 | 2004-06-07 | 松下電器産業株式会社 | Voice recognition device |
DE69509555T2 (en) * | 1994-11-25 | 1999-09-02 | Fink | METHOD FOR CHANGING A VOICE SIGNAL BY MEANS OF BASIC FREQUENCY MANIPULATION |
US5864812A (en) * | 1994-12-06 | 1999-01-26 | Matsushita Electric Industrial Co., Ltd. | Speech synthesizing method and apparatus for combining natural speech segments and synthesized speech segments |
US5774846A (en) * | 1994-12-19 | 1998-06-30 | Matsushita Electric Industrial Co., Ltd. | Speech coding apparatus, linear prediction coefficient analyzing apparatus and noise reducing apparatus |
US5701390A (en) * | 1995-02-22 | 1997-12-23 | Digital Voice Systems, Inc. | Synthesis of MBE-based coded speech using regenerated phase information |
US5717828A (en) * | 1995-03-15 | 1998-02-10 | Syracuse Language Systems | Speech recognition apparatus and method for learning |
JP3647499B2 (en) | 1995-03-31 | 2005-05-11 | フオスター電機株式会社 | Voice pickup system |
US5729694A (en) * | 1996-02-06 | 1998-03-17 | The Regents Of The University Of California | Speech coding, reconstruction and recognition using acoustics and electromagnetic waves |
US6006175A (en) * | 1996-02-06 | 1999-12-21 | The Regents Of The University Of California | Methods and apparatus for non-acoustic speech characterization and recognition |
US6377919B1 (en) * | 1996-02-06 | 2002-04-23 | The Regents Of The University Of California | System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech |
US5960395A (en) * | 1996-02-09 | 1999-09-28 | Canon Kabushiki Kaisha | Pattern matching method, apparatus and computer readable memory medium for speech recognition using dynamic programming |
JPH09326856A (en) | 1996-06-03 | 1997-12-16 | Mitsubishi Electric Corp | Speech recognition reply device |
JP3266819B2 (en) * | 1996-07-30 | 2002-03-18 | 株式会社エイ・ティ・アール人間情報通信研究所 | Periodic signal conversion method, sound conversion method, and signal analysis method |
JPH10123450A (en) | 1996-10-15 | 1998-05-15 | Sony Corp | Head up display device with sound recognizing function |
GB2319379A (en) * | 1996-11-18 | 1998-05-20 | Secr Defence | Speech processing system |
US6161089A (en) * | 1997-03-14 | 2000-12-12 | Digital Voice Systems, Inc. | Multi-subframe quantization of spectral parameters |
JPH10260692A (en) * | 1997-03-18 | 1998-09-29 | Toshiba Corp | Method and system for recognition synthesis encoding and decoding of speech |
GB9714001D0 (en) * | 1997-07-02 | 1997-09-10 | Simoco Europ Limited | Method and apparatus for speech enhancement in a speech communication system |
JPH11296192A (en) * | 1998-04-10 | 1999-10-29 | Pioneer Electron Corp | Speech feature value compensating method for speech recognition, speech recognizing method, device therefor, and recording medium recorded with speech recognition program |
JP3893763B2 (en) | 1998-08-17 | 2007-03-14 | 富士ゼロックス株式会社 | Voice detection device |
US6347297B1 (en) * | 1998-10-05 | 2002-02-12 | Legerity, Inc. | Matrix quantization with vector quantization error compensation and neural network postprocessing for robust speech recognition |
US6263306B1 (en) * | 1999-02-26 | 2001-07-17 | Lucent Technologies Inc. | Speech processing technique for use in speech recognition and speech coding |
US6604070B1 (en) * | 1999-09-22 | 2003-08-05 | Conexant Systems, Inc. | System of encoding and decoding speech signals |
US6862558B2 (en) * | 2001-02-14 | 2005-03-01 | The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration | Empirical mode decomposition for analyzing acoustical signals |
- 2002
  - 2002-03-04 JP JP2002057818A patent/JP2003255993A/en active Pending
- 2003
  - 2003-03-03 EP EP03004378A patent/EP1345210B1/en not_active Expired - Lifetime
  - 2003-03-03 CN CN2005100693792A patent/CN1681002B/en not_active Expired - Lifetime
  - 2003-03-03 DE DE60321256T patent/DE60321256D1/en not_active Expired - Lifetime
  - 2003-03-03 DE DE60330400T patent/DE60330400D1/en not_active Expired - Lifetime
  - 2003-03-03 EP EP06004029A patent/EP1667108B1/en not_active Expired - Lifetime
  - 2003-03-03 CN CN03105163A patent/CN1442845A/en active Pending
  - 2003-03-04 US US10/377,822 patent/US7369991B2/en active Active
- 2006
  - 2006-12-01 US US11/565,992 patent/US7680666B2/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
US7369991B2 (en) | 2008-05-06 |
EP1345210B1 (en) | 2008-05-28 |
US20070100630A1 (en) | 2007-05-03 |
DE60330400D1 (en) | 2010-01-14 |
US7680666B2 (en) | 2010-03-16 |
CN1681002B (en) | 2010-04-28 |
EP1667108B1 (en) | 2009-12-02 |
DE60321256D1 (en) | 2008-07-10 |
EP1345210A3 (en) | 2005-08-17 |
EP1345210A2 (en) | 2003-09-17 |
US20030171921A1 (en) | 2003-09-11 |
EP1667108A1 (en) | 2006-06-07 |
JP2003255993A (en) | 2003-09-10 |
CN1442845A (en) | 2003-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1681002A (en) | Speech synthesis system, speech synthesis method, and program product | |
CN102779508B (en) | Voice library generation apparatus and method thereof, speech synthesis system and method thereof | |
CN1158642C (en) | Method and system for detecting and generating transient conditions in auditory signals | |
CN1229773C (en) | Speech recognition dialogue device | |
CN1187734C (en) | Robot control apparatus | |
CN1160699C (en) | Tone features for speech recognition | |
US20190259388A1 (en) | Speech-to-text generation using video-speech matching from a primary speaker | |
Tran et al. | Improvement to a NAM-captured whisper-to-speech system | |
CN1703734A (en) | Method and apparatus for determining musical notes from sounds | |
CN1932807A (en) | Apparatus and method for translating speech and performing speech synthesis of translation result | |
CN1101446A (en) | Computerized system for teaching speech | |
CN1894740A (en) | Information processing system, information processing method, and information processing program | |
CN1622200A (en) | Method and apparatus for multi-sensory speech enhancement | |
CN1662018A (en) | Method and apparatus for multi-sensory speech enhancement on a mobile device | |
CN1461463A (en) | Voice synthesis device | |
Hansen et al. | On the issues of intra-speaker variability and realism in speech, speaker, and language recognition tasks | |
CN1787076A (en) | Speaker recognition method based on hybrid support vector machine | |
CN1534597A (en) | Method of speech recognition using variational inference with switching state space models | |
KR20150104345A (en) | Voice synthesis apparatus and method for synthesizing voice | |
CN114121006A (en) | Image output method, device, equipment and storage medium of virtual character | |
Scheme et al. | Myoelectric signal classification for phoneme-based speech recognition | |
Rudzicz | Production knowledge in the recognition of dysarthric speech | |
WO2017008075A1 (en) | Systems and methods for human speech training | |
CN1253851C (en) | Speaker verification and speaker identification system and method based on prior knowledge | |
Meister et al. | New speech corpora at IoC |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CX01 | Expiry of patent term | ||
Granted publication date: 20100428 |