WO2005013261A1 - Method for speech recognition and communication device (Verfahren zur Spracherkennung und Kommunikationsgerät) - Google Patents
Method for speech recognition and communication device
- Publication number
- WO2005013261A1 (PCT/EP2004/050687)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- language
- speaker
- dependent
- feature vector
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- The invention relates to a speech recognition method and to a communication device, in particular a mobile telephone or a portable computer with a speech recognition device.
- Communication devices such as mobile phones or portable computers have been undergoing progressive miniaturization in recent years to facilitate on-the-go use.
- In addition to better portability, this progressive miniaturization entails considerable problems with regard to convenience of operation. Because the housing surface is smaller than that of previous communication devices, it is no longer possible to provide them with a number of keys corresponding to the devices' range of functions.
- Some communication devices therefore offer speaker-independent voice control.
- The user inputs voice commands such as "dialing", "phonebook", "emergency call", "rejecting" or "accepting", for example.
- The telephone application associated with these command words can be used by a user immediately in the appropriate manner, without the user having previously trained the system with this vocabulary.
- For this vocabulary, speech samples have been collected in a database from many different people who form a representative cross-section of the speech recognition system's user base. To ensure that the cross-section is representative, people are selected to cover different dialects, ages and genders. With the aid of a so-called cluster method, for example an iterative algorithm, similar speech samples are combined into groups, or so-called clusters, as sketched below.
- Each group or cluster is assigned a phoneme, a phoneme cluster or possibly a whole word. Within each group or cluster there are thus several typical representatives of the phoneme, phoneme cluster or whole word in the form of corresponding model vectors. In this way, the speech of many different people is captured by a few representatives.
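- A minimal sketch of such a cluster method, assuming k-means as the iterative algorithm; the function name and parameters are illustrative, not taken from the patent:

```python
# Illustrative sketch: similar speech samples (feature vectors) are grouped
# iteratively, k-means style, and the cluster centroids serve as model vectors.
import numpy as np

def cluster_model_vectors(features: np.ndarray, n_clusters: int,
                          n_iter: int = 50, seed: int = 0) -> np.ndarray:
    """Group feature vectors into clusters; return one model vector (centroid) each."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from randomly chosen feature vectors as initial representatives.
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign every feature vector to its nearest representative.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each representative to the mean of its group.
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids
```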
- A speaker-dependent speech recognition system is optimized for the respective user, since it is trained on that user's voice. This is called "Einsagen", "say-in" or training. During training, a feature vector sequence consisting of at least one feature vector is created.
- Such a system, in which speaker-dependent and speaker-independent speech recognition are used in combination, is shown in operation, i.e. during speech recognition, in FIG. 1.
- A speech signal SS is first temporally subdivided into frames F (framing) and then undergoes preprocessing PP.
- In the preprocessing PP, it undergoes a Fourier transformation.
- A linear discriminant analysis LDA-L is then applied, so that a dimension-reduced feature vector F_S results. Since the dimension reduction LDA-L is carried out in a language-specific manner, the resulting dimension-reduced feature vector is also language-specific.
- A speaker-independent speech recognition HMM-SI is performed on the basis of a monolingual language resource HMM-L.
- A speaker-dependent speech recognition SD is performed on the basis of the feature vector F_IS.
- For the speaker-independent recognition, the distances D between the relevant dimension-reduced feature vector F_S and the model vectors present in the language resource HMM-L are calculated.
- In operation, an assignment to a model vector takes place, or rather the assignment of a feature vector sequence to a model vector sequence is determined.
- The information on the allowed model vector sequences is present in the speaker-independent vocabulary VOC-SI-L, which was created at the factory by the manufacturer.
- In a search algorithm, the appropriate assignment or sequence of model vectors is determined on the basis of the distance calculation using the vocabulary VOC-SI-L, roughly as in the sketch below.
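- A hedged sketch of this distance calculation and vocabulary-constrained assignment; the scoring is deliberately simplified (a real system would use the HMM transition probabilities and a proper search), and all names are illustrative:

```python
# Each dimension-reduced feature vector F_S is compared against the model
# vectors of the language resource HMM-L; the vocabulary restricts which model
# vector sequences (command words) may be recognized.
import numpy as np

def recognize(feature_seq, model_vectors, vocabulary):
    """feature_seq: (T, d) array; model_vectors: (M, d) array;
    vocabulary: dict mapping command word -> tuple of model vector indices."""
    # Distance D of every frame to every model vector.
    dists = np.linalg.norm(feature_seq[:, None, :] - model_vectors[None, :, :], axis=2)

    def score(indices):
        # Stretch/compress the allowed index sequence linearly onto the T frames.
        path = np.linspace(0, len(indices) - 1, len(feature_seq)).round().astype(int)
        return sum(dists[t, indices[path[t]]] for t in range(len(feature_seq)))

    # Result R: the command word whose model vector sequence fits best.
    return min(vocabulary, key=lambda word: score(vocabulary[word]))
```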
- The result R is a command word assigned to the model vectors.
- The additionally provided speaker-dependent speech recognition SD can, for example, be based on dynamic time warping (DTW), as sketched below, or on neural networks, i.e. on correlation-based or pattern-matching methods, or on other measures known to the person skilled in the art.
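- A minimal dynamic time warping distance between two feature vector sequences, as one possible basis for the speaker-dependent recognition SD; purely illustrative:

```python
# Classic DTW: aligns two sequences of feature vectors of possibly different
# lengths and returns the cost of the best warping path.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (Ta, d), b: (Tb, d) feature vector sequences."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # Allowed warping steps: match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[Ta, Tb])
```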
- In this system, speaker-dependent and speaker-independent speech recognition cannot be mixed, i.e. it must be known before speech recognition whether speaker-dependent or speaker-independent recognition is to take place.
- If, as shown in FIG. 2, the speaker-dependent vocabulary is also created on the basis of the language resource HMM-L1, namely by distance calculation D to the model vectors present therein, then the speaker-dependent vocabulary VOC-SD-L1 and the speaker-independent vocabulary VOC-SI-L1 can be used together in the system shown in FIG. 3 in the language L1, thereby eliminating the mixing problem that occurs in the system of FIG. 1.
- A speech signal, for example a word or a sentence, consists of at least one acoustic unit. An acoustic unit is understood to mean one or more syllables, phonemes, a group of phonemes, word segments or, in the case of a sentence, whole words.
- This speech signal is first broken down into time segments.
- The speech signal in each time segment can be described by a feature vector.
- From these, a feature vector sequence is formed; the sequence may also consist of only a single feature vector. In general, the number of feature vectors in the feature vector sequence is determined by the length, e.g. of the control command, and by the length of the time segments or time frames.
- For speech recognition, as described above, meaningful units of a language are modeled by means of model vectors and model vector sequences.
- A set of model vectors is contained in a language resource, that is, for example, in the representation of a particular language for purposes of speech recognition using these model vectors.
- A language resource can also represent the representation or the "operating mode" of a particular language in a defined environment, for example in a motor vehicle. The environment determines, for example, the ambient noise level.
- The assignment, or assignment information, of a feature vector sequence, which is generated for example by a say-in or training, to a model vector sequence is stored.
- This storage is done, for example, in a so-called vocabulary.
- Speech recognition is then performed using a language resource.
- A language resource also includes at least transition probabilities between two model vectors, as in the sketch below.
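- A minimal sketch of what such a language resource could hold; the field names are assumptions for illustration, not terms from the patent:

```python
# Sketch of a language resource: model vectors for the acoustic units,
# transition probabilities between them, and the language-specific LDA matrix.
from dataclasses import dataclass
import numpy as np

@dataclass
class LanguageResource:
    name: str                  # e.g. "L1" or "L2" (language or locale)
    model_vectors: np.ndarray  # (M, d) array, one row per model vector
    transitions: np.ndarray    # (M, M) transition probabilities between model vectors
    lda_matrix: np.ndarray     # (d_full, d) language-specific dimension reduction
```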
- The core of the invention is to also store the feature vector sequence itself, and not just the assignment of this feature vector sequence to a sequence of model vectors. This has the advantage that when the language resource is switched, i.e. for example when switching to another language, the speech signal from which the feature vector sequence is formed does not have to be spoken in again. This also holds if speaker-dependent and speaker-independent systems are used in parallel.
- The feature vector sequence can be reduced in its dimensions when it is assigned to a model vector sequence. This reduction can be carried out, for example, by means of a linear discriminant analysis. Performing a reduction has the advantage that the model vectors can also be stored in the reduced dimension, and thus less memory space is required for a language resource. It is important that the dimension reduction of the feature vector or feature vector sequence takes place only during the assignment, while a non-reduced representation of the feature vector or feature vector sequence is retained.
- A model vector sequence in another language resource can then be assigned directly.
- The underlying language resource may be constructed using a so-called Hidden Markov Model (HMM), in which at least one model vector is provided for each acoustic unit.
- An HMM has the advantage that it is well suited to speaker-independent speech recognition, so that in addition to speaker-dependent commands, a broad vocabulary that does not have to be trained by the user can be preset.
- A suitable communication device has at least one microphone, with which the speech signal is detected, and a processor unit, with which the speech signal is processed, that is, for example, decomposed into time frames or subjected to feature vector extraction for a time frame. Furthermore, a memory unit is provided for storing the processed speech signal and at least one language resource. For speech recognition itself, the microphone, memory unit and speech recognition device work together.
- FIG. 1 shows the sequence of a combined speaker-dependent and speaker-independent speech recognition according to the prior art, in which speaker-dependent and speaker-independent speech recognition cannot be mixed;
- FIG. 2 shows the course of a training or "say-in" in a system with a language resource HMM-L1 according to the prior art;
- the vocabulary VOC-SD-L1 created there is speaker- and language-dependent and can be used in a combined speaker-dependent and speaker-independent speech recognition (not shown);
- FIG. 3 shows the system of FIG. 2 in operation, i.e. during speech recognition, where a combination of speaker-dependent and speaker-independent speech recognition is realized, in which both techniques can be mixed but which is language-dependent;
- FIG. 4 shows the course of a training or "say-in" in a system according to an embodiment of the invention;
- FIG. 5 shows the sequence of a transcoding, undertaken without user interaction, of the user-specific vocabulary created according to FIG. 4 when the language resource is changed from a first language resource L1 to a second language resource L2, according to an embodiment of the invention;
- FIG. 6 shows the embodiment shown in FIGS. 4 and 5 in operation;
- FIG. 7 shows the sequence of individual steps in the context of the temporal subdivision and preprocessing of the speech signal;
- FIG. 8 shows a communication device for carrying out a speech recognition method.
- Each language can be divided into phonemes specific to that language.
- Phonemes are sound components or sounds that still differentiate meaning. For example, a vowel is such a phoneme.
- A phoneme can also be composed of several letters and correspond to a single sound.
- Speech recognition filters out information about the speaker's mood, gender, speech rate, variations in pitch, background noise, etc. This mainly serves to reduce the amount of data arising during speech recognition. This is necessary because the amount of data arising in speech recognition is so large that it cannot normally be processed in real time, in particular not by a compact arithmetic unit as found in communication devices.
- A Fourier transform is computed, in which the speech signal is decomposed into its frequencies.
- By means of window functions, which have values different from zero only in a limited time window, an increase in contrast and/or a reduction of the noise component of the speech signal is achieved.
- A sequence of feature vectors is obtained which represents the time course of the speech signal.
- The individual feature vectors can be assigned to different classes of feature vectors.
- The classes of feature vectors each comprise groups of similar feature vectors.
- The speech signal is thereby identified, i.e. it is present in a phonetic transcription.
- The phonetic transcription may be assigned a meaning content if the classes of feature vectors are associated with information about which sound is represented by feature vectors of the respective class.
- The classes of feature vectors on their own do not yet give clear information about which sound was spoken.
- For this, speech recordings are necessary, on the basis of which the classes of feature vectors are assigned individual sounds or phonemes, phoneme clusters or whole words.
- A phoneme cluster, which can also be called a phoneme segment, combines several individual phonemes into a single unit. As a result, the total amount of data to be processed in speech recognition can also be reduced.
- A language resource, i.e. a set of model vectors by means of which a specific language can be represented, is created at the factory by the manufacturer. Furthermore, a language resource defines transition probabilities between individual model vectors, so that, for example, words can be formed in a language.
- A speech signal SS is first subjected to a feature extraction FE.
- This feature extraction generally comprises a subdivision into time frames or frames F (framing), with downstream preprocessing PP of the speech signal SS subdivided into frames.
- This generally includes a Fourier transformation.
- In addition, noise suppression and/or channel compensation takes place. Channel is here understood to mean the path from the microphone to the A/D converter, whose noise is compensated.
- The channel may vary due to different microphones, for example in the car kit or in the mobile radio terminal itself. With different rooms, too, the channel will have different properties, since the impulse response to an acoustic event differs.
- The steps in the feature extraction FE for determining a feature vector could proceed as shown in FIG. 7:
- First, the preprocessing PP takes place. This may include the following steps: filtering FI of the signal with a finite impulse response (FIR) filter, and formation AA of so-called Hamming windows to achieve anti-aliasing, i.e. to avoid picking up frequencies that are not actually present. Subsequently, a fast Fourier transform FFT is performed. The result is a power spectrum, in which the power is plotted over frequency. A sketch of these steps follows below.
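- A hedged sketch of this preprocessing chain for a single time frame; the filter coefficients and window choice are illustrative assumptions, not values from the patent:

```python
# Preprocessing PP of FIG. 7: FIR filtering FI, Hamming windowing AA,
# fast Fourier transform FFT, and the resulting power spectrum.
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """frame: one time frame of the speech signal SS; returns its power spectrum."""
    # FI: a simple finite impulse response (pre-emphasis) filter.
    filtered = np.convolve(frame, [1.0, -0.95], mode="same")
    # AA: Hamming window to suppress spectral leakage at the frame edges.
    windowed = filtered * np.hamming(len(filtered))
    # FFT: fast Fourier transform of the windowed frame.
    spectrum = np.fft.rfft(windowed)
    # Power spectrum: squared magnitude, i.e. power over frequency.
    return np.abs(spectrum) ** 2
```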
- The result of this process is a language-independent feature vector F_IS.
- Sequences of at least one language-independent feature vector F_IS are stored in a collection or database FV-SD(F_IS) of speaker-dependent, language-independent feature vectors F_IS.
- In addition, the language-independent feature vector F_IS is processed into a speaker- and language-dependent vocabulary.
- This is done by means of a dimension reduction LDA-L1, which is specific for a language resource (L1) and produces a language-dependent feature vector F_S.
- The language-dependent feature vector F_S has a lower information content due to the lossy data reduction of the dimension reduction. It is therefore not possible to reconstruct the language-independent feature vector F_IS from the language-dependent feature vector F_S.
- By multiplying with an LDA matrix, primarily a diagonalization is carried out, in which the dimension of the feature vector can be reduced by choosing a suitable eigensystem of basis vectors.
- This LDA matrix is language-specific, since the eigenvectors differ between languages, speech modes or locales. It is already determined at the factory. How this matrix is determined, e.g. on the basis of so-called sub-phonemes and other subgroups such as "d-phones", by averaging and corresponding weighting, is known in the art and will not be explained here. A sketch of applying the matrix follows below.
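- An illustrative sketch of applying such a factory-determined LDA matrix; since the projection is lossy, F_IS is kept in full and F_S is computed only when needed for assignment:

```python
# The language-independent feature vector F_IS is multiplied by the
# language-specific LDA matrix, yielding the lower-dimensional,
# language-dependent feature vector F_S.
import numpy as np

def reduce_dimension(f_is: np.ndarray, lda_matrix: np.ndarray) -> np.ndarray:
    """f_is: (d_full,) language-independent vector; lda_matrix: (d_full, d_reduced)."""
    return f_is @ lda_matrix  # F_S, language-dependent

# During say-in, F_IS itself is stored; F_S is computed only for the assignment:
# f_s = reduce_dimension(f_is, resource_l1.lda_matrix)
```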
- A language resource is a set of model vectors by means of which a language can be represented.
- The language resource may also represent a language in a particular environment. This is relevant, for example, when communication devices are used in a vehicle, where hands-free operation produces a different noise level than in normal calls.
- These feature vectors are first assigned to existing groups of model vectors. This assignment is done via a distance calculation D to the model vectors, which can, for example, approximate a determination of the most similar model vector; the model vectors are present in a monolingual HMM language resource HMM-L.
- The assignment information between feature vector and model vector, or between feature vector sequences and model vector sequences, is stored in a so-called vocabulary.
- The speaker-dependent vocabulary VOC-SD-L1 for the language resource L1 is compiled via distance calculation D to model vectors from the language resource HMM-L1 and conversion D2I of the distances to assignment or index information.
- The language-independent feature vector, or the sequence of feature vectors by which a control command is described, is thus also stored. This has the principal advantage that when the language resource is switched over, the say-in need not be repeated. A sketch of such a say-in follows below.
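- A hedged sketch of the say-in of FIG. 4 under this scheme, reusing the illustrative LanguageResource fields from above; the distance calculation D and the conversion D2I to index information are simplified to a nearest-model-vector assignment:

```python
# Say-in: store both the vocabulary entry (model vector indices via D and D2I)
# and the raw language-independent feature vectors in FV-SD(F_IS).
import numpy as np

def say_in(f_is_seq, resource, voc_sd, fv_sd, command_id):
    """f_is_seq: list of language-independent feature vectors for one command."""
    indices = []
    for f_is in f_is_seq:
        f_s = f_is @ resource.lda_matrix                          # LDA-L1
        d = np.linalg.norm(resource.model_vectors - f_s, axis=1)  # distance D
        indices.append(int(d.argmin()))                           # D2I: distance -> index
    voc_sd[command_id] = indices   # speaker-dependent vocabulary VOC-SD-L1
    fv_sd[command_id] = f_is_seq   # stored F_IS: the say-in need not be repeated
```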
- The language-dependent dimension reduction LDA-L2 can then be performed on the basis of this language-independent vector F_IS.
- For example, the user switches from the language L1 to a language L2 via a user interface, or, when the car kit of a communication device is used, the system automatically switches from the quiet environment L1 to a loud environment L2.
- L1 or L2 thus designates a language or a locale.
- In this case a transcoding TC takes place, which is the assignment of a language-dependent feature vector F_S to the model vectors of the new language resource.
- The transcoding TC takes place offline, i.e. without interaction with the user, by means of the already factory-created language resource HMM-L2 and on the basis of the language-independent feature vectors F_IS stored in the database FV-SD(F_IS).
- The result of the transcoding is a speaker-dependent vocabulary VOC-SD-L2, which was created on the basis of the language resource HMM-L2 using the stored language-independent feature vectors.
- The speaker-dependent vocabulary contains associations between sequences of feature vectors and model vectors.
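- A sketch of the transcoding TC of FIG. 5 under the same illustrative assumptions: only the stored F_IS sequences are read, so no user interaction is needed:

```python
# Transcoding TC: re-reduce the stored language-independent feature vectors
# with LDA-L2 and re-assign them to the model vectors of HMM-L2, producing
# VOC-SD-L2 offline, without any new say-in.
import numpy as np

def transcode(fv_sd, resource_l2):
    """Rebuild the speaker-dependent vocabulary for a new language resource."""
    voc_sd_l2 = {}
    for command_id, f_is_seq in fv_sd.items():
        indices = []
        for f_is in f_is_seq:
            f_s = f_is @ resource_l2.lda_matrix                          # LDA-L2
            d = np.linalg.norm(resource_l2.model_vectors - f_s, axis=1)  # distance D
            indices.append(int(d.argmin()))                              # D2I
        voc_sd_l2[command_id] = indices
    return voc_sd_l2  # VOC-SD-L2
```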
- In FIG. 6, the speech recognition system shown in FIG. 4 during training and in FIG. 5 during transcoding is shown in operation. The same elements are again provided with the same reference signs.
- The language or locale L2, for which the transcoding in FIG. 5 took place, is selected.
- The distance calculation D is performed using the language resource HMM-L2 for the language or locale L2.
- The search S now takes place on the basis of the speaker-independent vocabulary VOC-SI-L2, which corresponds for the language environment L2 to the speaker-independent vocabulary VOC-SI-L1 from FIG. 3, and of the speaker-dependent vocabulary VOC-SD-L2.
- The factory-created vocabulary VOC-SI-L2 can be searched simultaneously, i.e. without choosing between speaker-dependent and speaker-independent speech recognition, together with the speaker-dependent vocabulary VOC-SD-L2.
- This has the advantage that speaker-dependent and speaker-independent vocabularies co-exist in such a way that speech recognition does not need to know whether a speaker-dependent or speaker-independent command occurs, which significantly increases the flexibility of, e.g., composite commands. Knowing whether a speaker-dependent or speaker-independent command occurs would be required, for example, if the speaker-dependent speech recognition proceeded on the basis of feature vectors and the speaker-independent recognition on the basis of mapping information.
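- A minimal sketch of such a mixed search over both vocabularies; score_fn stands for any distance-based scoring of a model vector index sequence, e.g. the simplified one sketched for FIG. 1:

```python
# Mixed search S of FIG. 6: speaker-dependent and speaker-independent
# vocabularies are merged into one candidate set, so the recognizer need not
# know in advance which kind of command occurs.
def combined_search(score_fn, voc_si_l2: dict, voc_sd_l2: dict) -> str:
    """score_fn: maps a model vector index sequence to a distance-based score."""
    candidates = {**voc_si_l2, **voc_sd_l2}  # one vocabulary, both origins
    return min(candidates, key=lambda word: score_fn(candidates[word]))
```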
- FIG. 8 shows a communication device which is suitable for carrying out the speech recognition described.
- The communication device CD has at least one microphone M, with which the speech signal is detected, and a processor unit CPU, with which the speech signal is processed, i.e. for example decomposed into time frames or subjected to feature vector extraction for a time frame.
- A memory unit SU is provided for storing the processed speech signal and at least one language resource.
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006521560A JP2007500367A (ja) | 2003-07-28 | 2004-05-04 | 音声認識方法およびコミュニケーション機器 |
EP04741506A EP1649450A1 (de) | 2003-07-28 | 2004-05-04 | Verfahren zur Spracherkennung und Kommunikationsgerät |
US10/566,293 US7630878B2 (en) | 2003-07-28 | 2004-05-04 | Speech recognition with language-dependent model vectors |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10334400.4 | 2003-07-28 | ||
DE10334400A DE10334400A1 (de) | 2003-07-28 | 2003-07-28 | Verfahren zur Spracherkennung und Kommunikationsgerät |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2005013261A1 true WO2005013261A1 (de) | 2005-02-10 |
Family
ID=34088898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2004/050687 WO2005013261A1 (de) | 2003-07-28 | 2004-05-04 | Verfahren zur spracherkennung und kommunikationsgerät |
Country Status (6)
Country | Link |
---|---|
US (1) | US7630878B2 (de) |
EP (1) | EP1649450A1 (de) |
JP (1) | JP2007500367A (de) |
CN (1) | CN1856820A (de) |
DE (1) | DE10334400A1 (de) |
WO (1) | WO2005013261A1 (de) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006033044A2 (en) * | 2004-09-23 | 2006-03-30 | Koninklijke Philips Electronics N.V. | Method of training a robust speaker-dependent speech recognition system with speaker-dependent expressions and robust speaker-dependent speech recognition system |
Families Citing this family (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102004008225B4 (de) * | 2004-02-19 | 2006-02-16 | Infineon Technologies Ag | Verfahren und Einrichtung zum Ermitteln von Merkmalsvektoren aus einem Signal zur Mustererkennung, Verfahren und Einrichtung zur Mustererkennung sowie computerlesbare Speichermedien |
US20060123398A1 (en) * | 2004-12-08 | 2006-06-08 | Mcguire James B | Apparatus and method for optimization of virtual machine operation |
US7697827B2 (en) | 2005-10-17 | 2010-04-13 | Konicek Jeffrey C | User-friendlier interfaces for a camera |
CN101727903B (zh) * | 2008-10-29 | 2011-10-19 | 中国科学院自动化研究所 | 基于多特征和多系统融合的发音质量评估和错误检测方法 |
WO2010086928A1 (ja) * | 2009-01-28 | 2010-08-05 | 三菱電機株式会社 | 音声認識装置 |
US9177557B2 (en) * | 2009-07-07 | 2015-11-03 | General Motors Llc. | Singular value decomposition for improved voice recognition in presence of multi-talker background noise |
US9026444B2 (en) | 2009-09-16 | 2015-05-05 | At&T Intellectual Property I, L.P. | System and method for personalization of acoustic models for automatic speech recognition |
DE102010054217A1 (de) * | 2010-12-11 | 2012-06-14 | Volkswagen Aktiengesellschaft | Verfahren zum Bereitstellen eines Sprachbediensystems in einem Fahrzeug und Sprachbediensystem dazu |
US20130297299A1 (en) * | 2012-05-07 | 2013-11-07 | Board Of Trustees Of Michigan State University | Sparse Auditory Reproducing Kernel (SPARK) Features for Noise-Robust Speech and Speaker Recognition |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
US9230550B2 (en) * | 2013-01-10 | 2016-01-05 | Sensory, Incorporated | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination |
CN104143330A (zh) * | 2013-05-07 | 2014-11-12 | 佳能株式会社 | 语音识别方法和语音识别系统 |
US9640186B2 (en) * | 2014-05-02 | 2017-05-02 | International Business Machines Corporation | Deep scattering spectrum in acoustic modeling for speech recognition |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10366158B2 (en) * | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
JP6300328B2 (ja) * | 2016-02-04 | 2018-03-28 | 和彦 外山 | 環境音生成装置及びそれを用いた環境音生成システム、環境音生成プログラム、音環境形成方法及び記録媒体 |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | USER INTERFACE FOR CORRECTING RECOGNITION ERRORS |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
DK201770428A1 (en) | 2017-05-12 | 2019-02-18 | Apple Inc. | LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | FAR-FIELD EXTENSION FOR DIGITAL ASSISTANT SERVICES |
US10964315B1 (en) * | 2017-06-30 | 2021-03-30 | Amazon Technologies, Inc. | Monophone-based background modeling for wakeword detection |
CN111345016A (zh) * | 2017-09-13 | 2020-06-26 | 深圳传音通讯有限公司 | 一种智能终端的启动控制方法及启动控制系统 |
US11158322B2 (en) * | 2019-09-06 | 2021-10-26 | Verbit Software Ltd. | Human resolution of repeated phrases in a hybrid transcription system |
KR20230130608A (ko) | 2020-10-08 | 2023-09-12 | 모듈레이트, 인크 | 콘텐츠 완화를 위한 멀티-스테이지 적응 시스템 |
CN114515138A (zh) * | 2022-01-06 | 2022-05-20 | 福州市星康朗语教育科技有限公司 | 一种语言障碍评估与矫正系统 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0797185A2 (de) * | 1996-03-19 | 1997-09-24 | Siemens Aktiengesellschaft | Verfahren und Vorrichtung zur Spracherkennung |
WO2002021513A1 (en) * | 2000-09-08 | 2002-03-14 | Qualcomm Incorporated | Combining dtw and hmm in speaker dependent and independent modes for speech recognition |
EP1220196A1 (de) * | 2000-12-27 | 2002-07-03 | Siemens Aktiengesellschaft | Verfahren zur Spracherkennung in einem kleinen elektronischen Gerät |
USRE38101E1 (en) * | 1996-02-29 | 2003-04-29 | Telesector Resources Group, Inc. | Methods and apparatus for performing speaker independent recognition of commands in parallel with speaker dependent recognition of names, words or phrases |
Family Cites Families (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FI86694C (fi) | 1990-03-19 | 1992-10-12 | Outokumpu Oy | Gjutmaskin |
JP2989211B2 (ja) * | 1990-03-26 | 1999-12-13 | 株式会社リコー | 音声認識装置における辞書制御方式 |
US5412716A (en) * | 1993-05-03 | 1995-05-02 | At&T Bell Laboratories | System for efficiently powering repeaters in small diameter cables |
US5623609A (en) * | 1993-06-14 | 1997-04-22 | Hal Trust, L.L.C. | Computer system and computer-implemented process for phonology-based automatic speech recognition |
EP0856175A4 (de) * | 1995-08-16 | 2000-05-24 | Univ Syracuse | System und verfahren zum wiederauffinden mehrsprachiger dokumente unter verwendung eines semantischer vektorvergleichs |
DE19533541C1 (de) * | 1995-09-11 | 1997-03-27 | Daimler Benz Aerospace Ag | Verfahren zur automatischen Steuerung eines oder mehrerer Geräte durch Sprachkommandos oder per Sprachdialog im Echtzeitbetrieb und Vorrichtung zum Ausführen des Verfahrens |
DE19624988A1 (de) * | 1996-06-22 | 1998-01-02 | Peter Dr Toma | Verfahren zur automatischen Erkennung eines gesprochenen Textes |
DE19635754A1 (de) * | 1996-09-03 | 1998-03-05 | Siemens Ag | Sprachverarbeitungssystem und Verfahren zur Sprachverarbeitung |
EP0925579B1 (de) * | 1996-09-10 | 2001-11-28 | Siemens Aktiengesellschaft | Verfahren zur anpassung eines hidden-markov-lautmodelles in einem spracherkennungssystem |
DE19636739C1 (de) * | 1996-09-10 | 1997-07-03 | Siemens Ag | Verfahren zur Mehrsprachenverwendung eines hidden Markov Lautmodelles in einem Spracherkennungssystem |
US6389393B1 (en) * | 1998-04-28 | 2002-05-14 | Texas Instruments Incorporated | Method of adapting speech recognition models for speaker, microphone, and noisy environment |
US6085160A (en) * | 1998-07-10 | 2000-07-04 | Lernout & Hauspie Speech Products N.V. | Language independent speech recognition |
US6253181B1 (en) * | 1999-01-22 | 2001-06-26 | Matsushita Electric Industrial Co., Ltd. | Speech recognition and teaching apparatus able to rapidly adapt to difficult speech of children and foreign speakers |
US6658385B1 (en) * | 1999-03-12 | 2003-12-02 | Texas Instruments Incorporated | Method for transforming HMMs for speaker-independent recognition in a noisy environment |
US6912499B1 (en) * | 1999-08-31 | 2005-06-28 | Nortel Networks Limited | Method and apparatus for training a multilingual speech model set |
WO2002005263A1 (de) * | 2000-07-07 | 2002-01-17 | Siemens Aktiengesellschaft | Verfahren zur spracheingabe und -erkennung |
DE10040063A1 (de) * | 2000-08-16 | 2002-02-28 | Philips Corp Intellectual Pty | Verfahren zur Zuordnung von Phonemen |
DE10041456A1 (de) * | 2000-08-23 | 2002-03-07 | Philips Corp Intellectual Pty | Verfahren zum Steuern von Geräten mittels Sprachsignalen, insbesondere bei Kraftfahrzeugen |
JP4244514B2 (ja) * | 2000-10-23 | 2009-03-25 | セイコーエプソン株式会社 | 音声認識方法および音声認識装置 |
US7043431B2 (en) * | 2001-08-31 | 2006-05-09 | Nokia Corporation | Multilingual speech recognition system using text derived recognition models |
US7072834B2 (en) * | 2002-04-05 | 2006-07-04 | Intel Corporation | Adapting to adverse acoustic environment in speech processing using playback training data |
US20040078191A1 (en) * | 2002-10-22 | 2004-04-22 | Nokia Corporation | Scalable neural network-based language identification from written text |
US7149688B2 (en) * | 2002-11-04 | 2006-12-12 | Speechworks International, Inc. | Multi-lingual speech recognition with cross-language context modeling |
US7496498B2 (en) * | 2003-03-24 | 2009-02-24 | Microsoft Corporation | Front-end architecture for a multi-lingual text-to-speech system |
US7516071B2 (en) * | 2003-06-30 | 2009-04-07 | International Business Machines Corporation | Method of modeling single-enrollment classes in verification and identification tasks |
2003
- 2003-07-28 DE DE10334400A patent/DE10334400A1/de not_active Withdrawn

2004
- 2004-05-04 EP EP04741506A patent/EP1649450A1/de not_active Withdrawn
- 2004-05-04 JP JP2006521560A patent/JP2007500367A/ja active Pending
- 2004-05-04 CN CNA2004800279419A patent/CN1856820A/zh active Pending
- 2004-05-04 US US10/566,293 patent/US7630878B2/en not_active Expired - Lifetime
- 2004-05-04 WO PCT/EP2004/050687 patent/WO2005013261A1/de active Search and Examination
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
USRE38101E1 (en) * | 1996-02-29 | 2003-04-29 | Telesector Resources Group, Inc. | Methods and apparatus for performing speaker independent recognition of commands in parallel with speaker dependent recognition of names, words or phrases |
EP0797185A2 (de) * | 1996-03-19 | 1997-09-24 | Siemens Aktiengesellschaft | Verfahren und Vorrichtung zur Spracherkennung |
WO2002021513A1 (en) * | 2000-09-08 | 2002-03-14 | Qualcomm Incorporated | Combining dtw and hmm in speaker dependent and independent modes for speech recognition |
EP1220196A1 (de) * | 2000-12-27 | 2002-07-03 | Siemens Aktiengesellschaft | Verfahren zur Spracherkennung in einem kleinen elektronischen Gerät |
Non-Patent Citations (1)
Title |
---|
VOS DE L ET AL: "ALGORITHM AND DSP-IMPLEMENTATION FOR A SPEAKER-INDEPENDENT SINGLE-WORD SPEECH RECOGNIZER WITH ADDITIONAL SPEAKER-DEPENDENT SAY-IN FACILITY", PROCEEDINGS IEEE WORKSHOP ON INTERACTIVE VOICE TECHNOLOGY FOR TELECOMMUNICATIONS APPLICATIONS, XX, XX, 30 September 1996 (1996-09-30), pages 53 - 56, XP000919045 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006033044A2 (en) * | 2004-09-23 | 2006-03-30 | Koninklijke Philips Electronics N.V. | Method of training a robust speaker-dependent speech recognition system with speaker-dependent expressions and robust speaker-dependent speech recognition system |
WO2006033044A3 (en) * | 2004-09-23 | 2006-05-04 | Koninkl Philips Electronics Nv | Method of training a robust speaker-dependent speech recognition system with speaker-dependent expressions and robust speaker-dependent speech recognition system |
JP2008513825A (ja) * | 2004-09-23 | 2008-05-01 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | 話者に依存しない堅牢な音声認識システム |
JP4943335B2 (ja) * | 2004-09-23 | 2012-05-30 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | 話者に依存しない堅牢な音声認識システム |
Also Published As
Publication number | Publication date |
---|---|
CN1856820A (zh) | 2006-11-01 |
JP2007500367A (ja) | 2007-01-11 |
US7630878B2 (en) | 2009-12-08 |
EP1649450A1 (de) | 2006-04-26 |
DE10334400A1 (de) | 2005-02-24 |
US20070112568A1 (en) | 2007-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2005013261A1 (de) | Verfahren zur spracherkennung und kommunikationsgerät | |
DE69311303T2 (de) | Sprachtrainingshilfe für kinder. | |
DE60125542T2 (de) | System und verfahren zur spracherkennung mit einer vielzahl von spracherkennungsvorrichtungen | |
DE69427083T2 (de) | Spracherkennungssystem für mehrere sprachen | |
DE60124559T2 (de) | Einrichtung und verfahren zur spracherkennung | |
DE69433593T2 (de) | Aufgeteiltes spracherkennungssystem | |
DE69822296T2 (de) | Mustererkennungsregistrierung in einem verteilten system | |
DE69321656T2 (de) | Verfahren zur Spracherkennung | |
DE20004416U1 (de) | Spracherkennungsvorrichtung unter Verwendung mehrerer Merkmalsströme | |
DE19847419A1 (de) | Verfahren zur automatischen Erkennung einer buchstabierten sprachlichen Äußerung | |
EP1264301B1 (de) | Verfahren zur erkennung von sprachäusserungen nicht-muttersprachlicher sprecher in einem sprachverarbeitungssystem | |
DE102006006069A1 (de) | Verteiltes Sprachverarbeitungssystem und Verfahren zur Ausgabe eines Zwischensignals davon | |
EP0987682B1 (de) | Verfahren zur Adaption von linguistischen Sprachmodellen | |
DE69512961T2 (de) | Spracherkennung auf Grundlage von "HMMs" | |
DE60018696T2 (de) | Robuste sprachverarbeitung von verrauschten sprachmodellen | |
DE60107072T2 (de) | Robuste merkmale für die erkennung von verrauschten sprachsignalen | |
EP1282897B1 (de) | Verfahren zum erzeugen einer sprachdatenbank für einen zielwortschatz zum trainieren eines spracherkennungssystems | |
EP1182646A2 (de) | Verfahren zur Zuordnung von Phonemen | |
EP1097447A1 (de) | Verfahren und vorrichtung zur erkennung vorgegebener schlüsselwörter in gesprochener sprache | |
DE102004017486A1 (de) | Verfahren zur Geräuschreduktion bei einem Sprach-Eingangssignal | |
DE60219030T2 (de) | Verfahren zur mehrsprachigen Spracherkennung | |
WO2001067435A9 (de) | Verfahren zum sprachgesteuerten initiieren von in einem gerät ausführbaren aktionen durch einen begrenzten benutzerkreis | |
WO1999005681A1 (de) | Verfahren zum abspeichern von suchmerkmalen einer bildsequenz und zugriff auf eine bildfolge in der bildsequenz | |
WO2005069278A1 (de) | Verfahren und vorrichtung zur bearbeitung eines sprachsignals für die robuste spracherkennung | |
DE60225536T2 (de) | Verfahren und Vorrichtung zur Spracherkennung |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase | Ref document number: 200480027941.9; Country of ref document: CN |
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
WWE | Wipo information: entry into national phase | Ref document number: 2004741506; Country of ref document: EP |
WWE | Wipo information: entry into national phase | Ref document number: 2007112568; Country of ref document: US; Ref document number: 2006521560; Country of ref document: JP; Ref document number: 10566293; Country of ref document: US |
WWP | Wipo information: published in national office | Ref document number: 2004741506; Country of ref document: EP |
DPEN | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101) | |
WWP | Wipo information: published in national office | Ref document number: 10566293; Country of ref document: US |