JP4468264B2

JP4468264B2 - Methods and systems for multilingual name speech recognition

Info

Publication number: JP4468264B2
Application number: JP2005228583A
Authority: JP
Inventors: シャオ−リンレン; シンホ; ファンスン; ヤシンチャン
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 2004-08-06
Filing date: 2005-08-05
Publication date: 2010-05-26
Anticipated expiration: 2025-08-05
Also published as: KR20060050277A; SG119358A1; JP2006048058A; CN1731511A; CN100592385C; KR100769029B1

Description

The present invention relates generally to speech recognition processes. It is particularly useful for, but not necessarily limited to, speech recognition of multilingual names using a personal electronic device.

Personal electronic devices such as mobile phones, personal digital assistants (PDAs), and small pagers have become widespread throughout the industrialized world. Millions of users now rely on such devices for quick and easy access to electronic information and for communication. The light weight and small size of these devices generally add to their convenience by making them easy to carry, for example in a pocket or purse. A disadvantage of this miniaturization, however, is that tactile interfaces on the devices, such as keypads and buttons, are often very small and awkward to use.

Speech recognition is therefore a valuable feature for many personal electronic devices. For example, speech recognition allows a car driver to issue simple commands to a personal electronic device without taking his or her eyes off the road. In addition, because spoken commands can be executed easily, and often faster than instructions can be typed on a small keypad, speech recognition can make it more convenient to access, for example, address book entries in a PDA.

Speech recognition systems are thus a popular means of executing computer programs and accessing databases. However, the small size of personal electronic devices can also limit the performance of embedded speech recognition systems. Effective speech recognition often requires relatively large databases and considerable processing speed, whereas the memory capacity and processing power of small electronic devices are generally limited. To overcome these limitations, speech recognition systems in personal electronic devices are usually customized for limited, specific circumstances only. For example, such systems are often speaker dependent, meaning that they are designed to interpret the speech patterns of a specific speaker only, as described in more detail below. Such systems are also often language dependent and designed for only a limited vocabulary. These design compromises allow the systems to perform reasonably well for specific purposes using the limited resources of personal electronic devices.

Speech recognition systems generally work by matching an input utterance against acoustic models stored in a database. The matched acoustic models are then matched against entries in a dictionary database to complete word and sentence recognition. The acoustic models often consist of Hidden Markov Models (HMMs). HMMs are statistical descriptions, including mean vectors and variance vectors, that describe speech units such as words and phonemes. HMM pattern matching is then used to determine whether an acoustic model in the speech recognition database matches the input utterance. HMMs are generally based on probability functions that consist of several combined Gaussian probability distribution functions (PDFs), referred to as Gaussian mixtures. Speech pattern matching is therefore a process of matching Gaussian mixtures against the input utterance. The achievable sophistication of the HMM pattern-matching acoustic models is thus an important variable that designers of speech recognition systems must consider when making the necessary trade-offs between performance and memory and processing resources.
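As an illustration of how a feature vector can be scored against a Gaussian mixture described by mean and variance vectors, the following is a minimal sketch. It assumes diagonal covariances and log-domain arithmetic, which are common choices but are not stated in the text; the function names `log_gaussian_diag` and `log_gmm` are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture.

    Assumption for this sketch: every mixture weight is strictly positive.
    The components are combined with the log-sum-exp trick for stability.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

In an HMM-based recognizer, a score of this kind would typically be computed for each state and each incoming frame, and the decoder would then compare the scores across states.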

Another compromise in speech recognition systems concerns the ability of a system to recognize the speech of multiple users. Speech recognition systems are accordingly further classified as either speaker independent or speaker dependent. A speaker-independent system is designed to recognize the speech of any speaker of a given language, whereas a speaker-dependent system is trained to recognize the speech of only one speaker. Speaker-independent systems usually include an acoustic database containing HMMs derived from multiple training speakers. The HMMs derived from the speech of the training speakers are Gaussian mixture parameters intended to represent the speech patterns found in a larger group of speakers. Such systems are generally less accurate than speaker-dependent systems, both because compromises must be made in the speech models to accommodate a wide variety of speech attributes and because a speaker-independent system is not tuned to the unique speech attributes of any particular speaker who uses it.

Speaker-dependent systems are tuned to recognize the specific speech patterns of an individual speaker. Typically, during a training routine, the speaker reads a script containing a variety of speech patterns into the speaker-dependent system. The training speech is then aligned with the script so that the system can be tuned to the speaker's unique speech attributes and will therefore recognize the speaker's speech more accurately during speech recognition. However, speaker-dependent systems are often undesirable in situations where many people may need to use the speech recognition system. For example, a speech recognition system embedded in a mobile phone allows a user to operate the device by speaking commands that are then recognized by the phone. The primary user of the mobile phone may, however, want numerous friends, colleagues, or family members to be able to use the phone's speech recognition feature as well. Because such secondary users may need the feature for only a brief period, it is inconvenient to require them first to train the phone to recognize their speech before using it.

Finally, speech recognition acoustic models are usually designed for a single language only. A speech recognition system that can recognize speech in multiple languages therefore requires multiple acoustic models, which again increases the memory requirements and sophistication of the system.

Recently, bilingual speech recognition devices have been developed for personal electronic devices. For example, a bilingual user of a mobile phone can recall names from an address book stored on the phone using either of two languages, such as English and Mandarin. Because these devices use separate language-specific acoustic models and vocabulary databases, the user must generally first switch the language mode of the phone to one particular language before using the speech recognition feature. Having to preselect a particular language is inconvenient, however, when for example the address book contains a mixture of names or other contact information in both languages. Moreover, because a particular language must be preselected, the system cannot use speech recognition to identify a mixed-language, two-part name, for example a name whose given name is English and whose surname is Mandarin.

There is therefore a need for an improved method and system for speaker-independent speech recognition that can recognize multilingual names without requiring manual switching between language modes, and that makes efficient use of the limited resources of a personal electronic device.

According to one aspect, the present invention is therefore an improved method of speech recognition of multilingual names, comprising the steps of: storing in an electronic device text representing a plurality of names composed of characters; identifying at least one language for each of the names; converting each name into an ordered series of pronunciation units using a plurality of language-specific character-to-sound converters; receiving a spoken utterance at a microphone associated with the electronic device; converting the utterance into feature vectors; and matching the feature vectors against the ordered series of pronunciation units of at least one name.

Preferably, the multiple languages include Mandarin, and the step of identifying at least one language for each of the names comprises determining whether the name is composed of Chinese characters or of characters of the Roman alphabet, and determining whether a name in the Roman alphabet is Chinese Pinyin.

Preferably, the multiple languages consist of a Western language and Chinese.
Preferably, the plurality of language-specific character-to-sound converters consists of a Chinese character-to-sound converter and a Western-language character-to-sound converter.

Preferably, the Chinese character-to-sound converter is context dependent and the Western-language character-to-sound converter is context independent.
Preferably, the step of matching the feature vectors against the ordered series of pronunciation units of at least one name comprises decoding the feature vectors in an automatic speech recognition engine by comparing the feature vectors with the ordered series of pronunciation units and with Gaussian mixture parameters.

Preferably, the automatic speech recognition engine uses a Viterbi beam search algorithm.
Preferably, the names are entries in a contact list stored in the electronic device.

According to another aspect, the present invention is a method of speech recognition of multilingual names, comprising the steps of: receiving a spoken utterance at a microphone associated with an electronic device; converting the utterance into feature vectors; and matching the feature vectors against an ordered series of pronunciation units of at least one name that is stored in the electronic device as a representation of characters. At least one language of the name has been identified from the characters, and the name has then been converted into the ordered series of pronunciation units using a plurality of language-specific character-to-sound converters.

According to yet another aspect, the present invention is a system for speech recognition of multilingual names, comprising a microprocessor, at least one memory operatively connected to the microprocessor, and a microphone operatively connected to the microprocessor. The microprocessor is operative to execute code stored in the memory so as to receive a spoken utterance at the microphone, convert the utterance into feature vectors, and match the feature vectors against an ordered series of pronunciation units of at least one name that is stored in the memory as a representation of characters. At least one language of the name has been identified from the characters, and the name has then been converted into the ordered series of pronunciation units using a plurality of language-specific character-to-sound converters operatively connected to the microprocessor.

Preferably, the names are entries in a contact list stored in the system.
Preferably, the system is operatively connected to either a mobile phone or a personal digital assistant.

In this specification, including the claims, the terms "comprising", "including", "consisting of", and similar terms denote a non-exclusive inclusion, so that a method or apparatus that comprises a number of elements does not include those elements only, but may well include other elements that are not listed.

To make the present invention easy to understand and put into practice, preferred embodiments are described below with reference to the accompanying drawings, in which like reference numerals denote like elements.
FIG. 1 is a schematic diagram illustrating the functional components of a system 100 for speech recognition of multilingual names according to one embodiment of the present invention. The system 100 operates as follows. A character-to-sound converter 105 converts the text of a name into an ordered series of pronunciation units. The name is usually one of many names stored as representations of individual characters on a personal electronic device such as a mobile phone or personal digital assistant (PDA). For example, the names may be stored as part of an address book or contact list of the electronic device. The character-to-sound converter 105 first identifies at least one language for a name that is input to the system 100. The name is then converted into an ordered series of pronunciation units, which is stored in an open vocabulary dictionary 110. The system 100 also includes a mixed-language Hidden Markov Model (HMM) set 115. The HMM set 115 includes Gaussian mixture parameters representing selected speech patterns of at least two languages.

After a plurality of names and their associated ordered series of pronunciation units have been entered in the open vocabulary dictionary 110, the system 100 can recognize a spoken representation of any of those names when it is spoken into an input such as a microphone 120. The microphone 120 can be operatively connected to a voice activated device (VAD). A feature extractor 125 then extracts feature vectors of the spoken name according to conventional speech recognition techniques that are well known in the art. The feature vectors are then decoded by an automatic speech recognition (ASR) engine 130, which compares the feature vectors with the Gaussian mixture parameters. The ASR engine 130 is further assisted by a dynamic grammar network 135, which is built from the open vocabulary dictionary 110 and guides the search of the pronunciation models during the speech recognition process. Finally, the matching name from the open vocabulary dictionary 110 is output from the system 100. The electronic device can then use the matched name, for example to retrieve a person's telephone number or other contact information from a contact list.
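The text does not describe the internal structure of the dynamic grammar network 135. One plausible way to build a structure from the dictionary that can guide a search is a prefix tree over pronunciation units, sketched below. The function names, the dictionary layout, and the ARPAbet-like phoneme labels are all assumptions made for illustration only.

```python
def build_grammar_network(lexicon):
    """Build a prefix tree over pronunciation units from {name: [unit, ...]}."""
    root = {"children": {}, "names": []}
    for name, units in lexicon.items():
        node = root
        for unit in units:
            node = node["children"].setdefault(
                unit, {"children": {}, "names": []}
            )
        # A name ends at this node, so a decoder could emit it here.
        node["names"].append(name)
    return root


def allowed_next_units(network, prefix):
    """Units that may follow `prefix` in the network (empty set if none)."""
    node = network
    for unit in prefix:
        node = node["children"].get(unit)
        if node is None:
            return set()
    return set(node["children"])
```

With a structure of this kind, a decoder only needs to consider the units returned by `allowed_next_units` at each point, and the tree can be rebuilt whenever names are added to the dictionary.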

The present invention is therefore useful in applications that require speech recognition of mixed-language words or names. In China, for example, ASR-capable mobile phones with speaker-independent Chinese (for example, Mandarin or Cantonese) and English recognition have appeared. These prior art systems, however, can generally operate with only a single language model at a time. For example, a user who wants to use the ASR feature to retrieve information from the address book using an English name must first set the ASR feature to English. If the same user then wants to retrieve information from the address book using a Mandarin name, the user must first set the ASR feature to Mandarin before the Mandarin name can be looked up. It is observed, however, that many mobile phone users in China have bilingual two-part names in their phone address books, in which the first part of the name is English and the second part is Mandarin. Prior art ASR systems therefore cannot automatically recognize spoken representations of such bilingual two-part names. The present invention, by contrast, can recognize such bilingual two-part names without requiring the user to switch the ASR manually from one language to the other.

FIG. 2 is a table showing various names in two different languages together with their pronunciations, each consisting of an ordered series of pronunciation units. For example, the first name in the table consists only of Mandarin (Chinese characters) and is followed by its pronunciation, which is made up of an ordered series of pronunciation units comprising individual Chinese phonemes 205. The next name, "John Stone", consists only of English and is followed by its pronunciation, which comprises individual English phonemes 210. The third name in the table is a bilingual two-part name, because it contains a surname in Mandarin (Chinese characters) and the English given name "Jacky". The method and system of the present invention can nevertheless also define a pronunciation for that name, comprising both English phonemes 210 and Chinese phonemes 205. The features of the present invention that enable such pronunciation parsing of bilingual two-part names, without requiring the user to switch languages manually, are described below.

FIG. 3 is a schematic diagram illustrating the operation and components of the mixed character-to-sound converter 105 introduced in FIG. 1. As an example, the mixed character-to-sound converter 105 shown in FIG. 3 operates to convert characters written in either English or Mandarin. First, the mixed character-to-sound converter 105 includes an alphabet identifier 305 that identifies the alphabet used to define at least part of a written name stored in the device. If the stored part of the name is composed of Chinese characters 310, the Chinese characters 310 are input directly to a language-specific Mandarin character-to-sound converter 315. If, however, the stored part of the name is composed of English letters 320, the name may be written in either Chinese Pinyin or English. That part of the name is therefore classified further by a Pinyin identifier 325. The Pinyin identifier 325 uses a Pinyin dictionary of 408 syllables that identifies essentially all Chinese names written in Pinyin (tones excluded). If the English letters 320 are Chinese Pinyin, they are input to the Mandarin character-to-sound converter 315. If, however, the English letters 320 form an English word, they are input to a language-specific English character-to-sound converter 330. Both the Mandarin character-to-sound converter 315 and the English character-to-sound converter 330 are operative to convert a name into its own ordered series of language-specific pronunciation units. It will be apparent to those skilled in the art that other character-to-sound converters 105, which convert characters of various other languages, are also enabled by the present disclosure. The character-to-sound converter 105 of the present invention can thus parse a bilingual two-part name into a single ordered series of pronunciation units.
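The routing performed by the alphabet identifier 305 and the Pinyin identifier 325 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: `PINYIN_SYLLABLES` is a tiny hypothetical stand-in for the 408-syllable dictionary, Chinese characters are detected with the CJK Unified Ideographs range U+4E00 to U+9FFF, and all function names and example names are invented for this sketch.

```python
import re

# Assumption: a tiny stand-in for the 408-syllable toneless Pinyin dictionary.
PINYIN_SYLLABLES = {"zhang", "wei", "li", "ming", "wang", "xiao", "lin"}


def is_chinese_character(ch):
    """True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def is_pinyin(word):
    """True if word splits entirely into known toneless Pinyin syllables."""
    word = word.lower()
    # reachable[i] is True when word[:i] can be segmented into syllables.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            if reachable[start] and word[start:end] in PINYIN_SYLLABLES:
                reachable[end] = True
                break
    return reachable[len(word)]


def identify_language(part):
    """Classify one non-empty name part as 'mandarin' or 'english'."""
    if all(is_chinese_character(ch) for ch in part):
        return "mandarin"  # Chinese characters go to the Mandarin converter
    if is_pinyin(part):
        return "mandarin"  # Roman letters that form Pinyin also go there
    return "english"


def route_name(name):
    """Split a stored name into parts and tag each with its target converter."""
    parts = re.findall(r"[\u4e00-\u9fff]+|[A-Za-z]+", name)
    return [(part, identify_language(part)) for part in parts]
```

For a hypothetical two-part entry such as "Mary 李", `route_name` tags the first part for the English converter and the second for the Mandarin converter, so the two pronunciations can be concatenated into a single series without any language switch by the user.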

To enable the present invention to function without requiring the user to switch the language model of the system 100 manually, the mixed-language HMM set 115 includes at least two sets of acoustic models, one for each of the two languages. For example, according to the above embodiment of the present invention, which recognizes both English and Mandarin, the HMM set 115 combines two monolingual sets of acoustic models: context-dependent Mandarin models and context-independent English models. Here, context refers to the pronunciation units immediately adjacent to the right, to the left, or on both sides of a given pronunciation unit. In Chinese, these units are called "initials" and "finals", as described in more detail below. A triphone model is a pronunciation model that takes both the left-adjacent and the right-adjacent pronunciation units into account. If two pronunciation units have the same identity but different left or right contexts, they are considered to be different triphones.
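To make the notion of context concrete, the short sketch below labels each unit in a sequence with its left and right neighbours. The `left-unit+right` label format and the `sil` boundary symbol are notational assumptions borrowed from common speech toolkits, and the function name is hypothetical.

```python
def to_triphones(units, boundary="sil"):
    """Label each unit with its neighbours in the form left-unit+right."""
    padded = [boundary] + list(units) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]
```

Under this labelling, the unit "a" in the sequence `["t", "a"]` becomes `t-a+sil`, while the same unit in `["ch", "a"]` becomes `ch-a+sil`, so the two count as different triphones even though the unit identity is the same.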

One feature that distinguishes Chinese from Western languages such as English is that every Chinese character is a single syllable with a consonant/vowel (C/V) structure plus a tone. Syllable recognition is therefore fundamental to the design of most Chinese speech recognition systems. Chinese has 1,254 syllables in total (408 toneless syllables), which are formed from various combinations of 22 "initials" (that is, the consonant before the vowel in a syllable) and 38 "finals" (that is, the vowel portion that follows the initial in a syllable). The initials comprise 21 true initials and one so-called "zero initial". According to a preferred embodiment of the present invention, the zero initial is treated as a true initial. Given that only limited training data is available, it is generally observed that, for Chinese speech, coarticulation effects within a syllable are significantly greater than coarticulation effects between syllables. This is due to the monosyllabic structure of Chinese. Furthermore, within a syllable, the acoustic characteristics of the initial depend heavily on the final, whereas the characteristics of the final depend very little on the initial. For example, the initial "t" in the syllable "ta" is pronounced very differently from the same initial in another syllable, "tu", whereas the final "a" in the syllable "ta" is pronounced almost the same as the "a" in "cha". A reasonable approach in Chinese speech recognition is therefore to assume that both coarticulation effects between syllables and the dependence of a final on the preceding initial within the syllable are negligible, and to treat initials as right-context dependent on the starting phoneme of the following final and finals as context independent. The preferred embodiment of the present invention accordingly uses 155 sub-syllables, comprising 117 initials and 38 finals. Each syllable is then decomposed into a pair of sub-syllables. Table 1 shows examples of such syllable decomposition as used in the Chinese acoustic model of the preferred embodiment of the present invention.
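The decomposition of a syllable into a right-context-dependent initial and a context-independent final might look like the sketch below. Table 1 is not reproduced here, so the label format (`initial+first letter of the final`), the `0` symbol for the zero initial, and the function name are assumptions. The sketch also uses the first letter of the Pinyin spelling as a stand-in for the starting phoneme of the final and ignores spelling conventions such as "y", "w", and "ü".

```python
# The 21 true initials in Pinyin spelling. Two-letter initials come first
# so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]
ZERO_INITIAL = "0"  # hypothetical label for the zero initial


def decompose(syllable):
    """Split a toneless Pinyin syllable into (initial label, final)."""
    syllable = syllable.lower()
    initial = next(
        (i for i in INITIALS if syllable.startswith(i)), ZERO_INITIAL
    )
    final = syllable if initial == ZERO_INITIAL else syllable[len(initial):]
    if not final:
        raise ValueError(f"no final in syllable {syllable!r}")
    # Right-context dependency: tag the initial with the start of its final.
    return f"{initial}+{final[0]}", final
```

Here "ta" and "tu" yield different initial labels (`t+a` and `t+u`), whereas "ta" and "cha" share the same final "a", which mirrors the asymmetry described above.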

To reduce the size of the English acoustic models in the HMM set 115, and hence the overall complexity and computational demands of the system 100, the preferred Chinese/English embodiment of the present invention uses context-independent English acoustic models. In addition, 40 monophones are used as the basic English modeling units. One source of such monophones is the Carnegie Mellon University (CMU) pronunciation dictionary. The CMU pronunciation dictionary contains about 127,000 English words together with their corresponding pronunciations, and it also defines 39 individual English phonemes. Other dictionaries may be used instead.

The operation of the ASR engine 130 in matching feature vectors against ordered series of pronunciation units is now described in more detail. The engine 130 uses a Viterbi beam search algorithm to analyze the series of feature vectors of a spoken word received by the system 100. Guided by the grammar network 135, the engine 130 aims to find the ordered series of pronunciation units whose state sequence, through its corresponding Gaussian parameters (Gaussian mixtures), best matches the input spoken word. The Viterbi search is a time-synchronous search algorithm that completely processes time t before moving on to time t+1. At time t, each state is updated with the best score from all states at time t-1 (rather than with the sum over all incoming paths). Each update also records a backtracking pointer to the best predecessor state. At the end of the search, the most likely state sequence can be recovered by following these backtracking pointers. With effective pruning techniques, the entire search space or lattice need not be explored; only the most promising part of the search state space is explored. A comprehensive HMM set is then created for the system 100. This set relates to the acoustic models of the final elements of the dynamic grammar, which is generated online each time the open vocabulary dictionary is updated. Further details regarding the above algorithm can be found in Jelinek, Frederick, "Statistical Methods for Speech Recognition" (MIT Press 1999, ISBN 0-262-10066-5).
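The search described above can be sketched as follows. This is a minimal illustration, not the engine 130 itself: it assumes the per-frame state log-likelihoods have already been computed (for example from Gaussian mixtures), it ignores the grammar network, and the names `viterbi_beam`, `obs_loglik`, `log_trans`, `log_init`, and `beam` are hypothetical.

```python
import math

NEG_INF = -math.inf


def viterbi_beam(obs_loglik, log_trans, log_init, beam=10.0):
    """Time-synchronous Viterbi search with beam pruning.

    obs_loglik[t][s]: log-likelihood of frame t given state s
    log_trans[p][s]:  log transition probability from state p to state s
    log_init[s]:      log initial probability of state s
    Returns (best_log_score, best_state_sequence).
    """
    n_states = len(log_init)
    scores = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backpointers = []  # one list of best predecessors per time step t >= 1

    for t in range(1, len(obs_loglik)):
        # Beam pruning: only predecessors close to the current best survive.
        best = max(scores)
        active = [p for p in range(n_states) if scores[p] >= best - beam]

        new_scores = [NEG_INF] * n_states
        pointers = [-1] * n_states
        for s in range(n_states):
            for p in active:
                # Keep the best incoming path (max), not the sum over paths.
                candidate = scores[p] + log_trans[p][s]
                if candidate > new_scores[s]:
                    new_scores[s] = candidate
                    pointers[s] = p
            if new_scores[s] > NEG_INF:
                new_scores[s] += obs_loglik[t][s]
        scores = new_scores
        backpointers.append(pointers)

    # Recover the most likely state sequence by following backtracking pointers.
    state = max(range(n_states), key=lambda s: scores[s])
    best_score = scores[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return best_score, path


# Toy two-state, left-to-right model with three frames (made-up numbers).
log = math.log
toy_init = [0.0, NEG_INF]
toy_trans = [[log(0.5), log(0.5)], [NEG_INF, 0.0]]
toy_obs = [[log(0.9), log(0.1)], [log(0.8), log(0.2)], [log(0.1), log(0.9)]]
toy_score, toy_path = viterbi_beam(toy_obs, toy_trans, toy_init)
```

A narrower `beam` explores fewer states per frame at the risk of pruning the path that would eventually have scored best, which is the usual trade-off on devices with limited processing resources.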

For further illustration of the present invention, FIG. 4 is a generalized flowchart summarizing an exemplary method 400 for converting stored text into pronunciation units, according to an embodiment of the present invention that includes a Mandarin/English open vocabulary dictionary 110. The method 400 begins at step 405, where text representing a plurality of names composed of characters is stored in an electronic device. At step 410, it is determined whether each name consists of Chinese characters or of Roman alphabet characters. If the characters making up the name are Chinese characters, the language of the name is identified as Mandarin at step 415. If the characters are Roman, however, the language of the name is not yet determined, because the characters may be Chinese pinyin. Therefore, at step 420, a dictionary of 408 pinyin syllables, which essentially covers all Chinese names written in pinyin (ignoring tone), is used to determine whether the characters are Chinese pinyin. If they are, the method 400 again proceeds to step 415, where the language of the name is identified as Mandarin. Otherwise, at step 425, the language of the name is identified as English.
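The decision logic of steps 410 to 425 can be sketched as below. This is an illustrative sketch under several assumptions: the syllable set is a tiny toy subset standing in for the 408-syllable pinyin dictionary, Chinese characters are detected with the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), a name containing any Chinese character is treated as Mandarin, and all function names are hypothetical.

```python
# Toy subset standing in for the 408-syllable pinyin dictionary (assumption).
PINYIN_SYLLABLES = {"zhang", "wei", "li", "xiao", "lin", "wang", "ming"}
MAX_SYLLABLE_LEN = 6  # longest toneless pinyin syllables, e.g. "zhuang"


def is_chinese_char(ch: str) -> bool:
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_pinyin_token(token: str, syllables=PINYIN_SYLLABLES) -> bool:
    """True if token can be split entirely into known pinyin syllables."""
    reachable = [True] + [False] * len(token)
    for end in range(1, len(token) + 1):
        for start in range(max(0, end - MAX_SYLLABLE_LEN), end):
            if reachable[start] and token[start:end] in syllables:
                reachable[end] = True
                break
    return reachable[-1]


def identify_language(name: str) -> str:
    """Return 'mandarin' or 'english' for a stored name."""
    # Step 410: Chinese characters or Roman alphabet characters?
    if any(is_chinese_char(ch) for ch in name):
        return "mandarin"                      # step 415
    # Step 420: Roman characters may still be Chinese pinyin.
    tokens = name.lower().split()
    if tokens and all(is_pinyin_token(tok) for tok in tokens):
        return "mandarin"                      # step 415 again
    return "english"                           # step 425


for example in ["张伟", "Zhang Wei", "Xiaolin", "Steven"]:
    print(example, "->", identify_language(example))
```

Some Roman-alphabet names are valid in both languages, so a dictionary lookup of this kind resolves such names to Mandarin; how a real system handles those ambiguous cases is not specified here.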

If the language is identified as Mandarin at step 415, the method 400 continues at step 430, where the Mandarin character/sound converter 315 converts the name into an ordered series of pronunciation units. If the language is identified as English at step 425, however, the method 400 continues at step 435, where the English character/sound converter 330 converts the name into an ordered series of pronunciation units. The ordered series of pronunciation units is then stored in the open vocabulary dictionary 110.

FIG. 5 is a generalized flowchart illustrating a method 500 for matching a spoken word against the names stored in the open vocabulary dictionary 110, according to a preferred embodiment of the present invention. The method 500 begins at step 505, where the spoken word is received at the microphone 120 of an electronic device. The device includes the system 100 for multilingual name speech recognition. At step 510, the spoken word is converted into feature vectors. Then, at step 515, the feature vectors of the spoken word are matched, according to the method described above, against the ordered series of pronunciation units of at least one name stored in the open vocabulary dictionary 110.

FIG. 6 is a schematic diagram illustrating an example of a personal electronic device on which the speech recognition system 100 of the present invention can be implemented. The example is a wireless communication device in the form of a radiotelephone 600 that includes the system 100 for multilingual name speech recognition according to an embodiment of the present invention. The radiotelephone 600 includes a radio frequency communication unit 602 coupled to communicate with a processor 603. The radiotelephone 600 also has a keypad 606 and a display screen 605 coupled to communicate with the processor 603. As will be apparent to those skilled in the art, the screen 605 can be a touch screen, in which case the keypad 606 is optional.

The processor 603 includes an encoder/decoder 611 with an associated code read-only memory (ROM) 612 that stores data for encoding and decoding voice or other signals that may be transmitted or received by the radiotelephone 600. The processor 603 also includes a microprocessor 613 coupled to the encoder/decoder 611 by a common data and address bus 617, a character read-only memory (ROM) 614, a random access memory (RAM) 604, a programmable static memory 616, and a SIM interface 618. The programmable static memory 616 and a SIM (often called a SIM card) operatively coupled to the SIM interface 618 can each store, among other things, selected incoming text messages and a telephone number database TND (or address/phone book) comprising a number field for telephone numbers and a name field for identifiers, each identifier being associated with one of the numbers. For example, one entry in the telephone number database TND may be 91999111111 (entered in the number field) with the associated identifier "Steven C! at work" in the name field. The SIM card and the static memory 616 can also store passwords that allow access to password-protected functions of the radiotelephone 600. The components of the present invention, such as the character/sound converter 105, the open vocabulary dictionary 110, the mixed-language HMM set 115, the feature extractor 125, the ASR engine 130, and the dynamic grammar network 135, can all be stored, partially or entirely, in one or more of the code ROM 612, the character ROM 614, the RAM 604, the static memory 616, and the SIM card. The microprocessor 613 has ports for coupling to the keypad 606, the screen 605, and an alert module 615 that typically contains an alert speaker, a vibrator motor, and associated drivers. The microprocessor 613 also has ports for coupling to the microphone 120 and a communication speaker 640. The character ROM 614 stores code for decoding or encoding text messages that may be received by the communication unit 602. In this embodiment, the character ROM 614 also stores operating code (OC) for the microprocessor 613 and code for performing functions associated with the radiotelephone 600.

The radio frequency communication unit 602 is a combined receiver and transmitter having a common antenna 607. The communication unit 602 has a transceiver 608 coupled to the antenna 607 via a radio frequency amplifier 609. The transceiver 608 is also coupled to a combined modulator/demodulator 610 that couples the communication unit 602 to the processor 603.

An example of the performance of one embodiment of the present invention for English and Mandarin follows. The test database consists of feature vectors of spoken words drawn from a 50-word vocabulary that includes confusable, similar-sounding words such as "cancel" and "castle". The database contains 9494 Mandarin utterances from approximately 200 speakers and 6827 English utterances from 25 speakers. To approximate real-world conditions, the utterances were recorded in six different mobile environments, such as offices, cars, shopping malls, and streets. The test results are summarized in Table 2. The monolingual results show the recognition accuracy obtained with dedicated single-language speech recognition systems. The mixed-language results show the recognition accuracy obtained with the mixed-language speech recognition system 100 of the present invention.

The present invention is therefore an improved speech recognition system 100 that can recognize spoken names in multiple languages without requiring the user to manually switch the language mode of the system 100. It is thus useful, for example, in multilingual environments where a user may have an electronic address book containing names in several languages. Because the user does not need to switch language modes, the system 100 can even recognize a compound name made up of a first name in a first language and a second name in a second language. In addition, the memory and processing requirements of the system 100 are reduced by using a combined acoustic model that includes both context-dependent and context-independent components. The system 100 can therefore operate on personal electronic devices with limited memory and processing resources, such as mobile phones and PDAs.

The above detailed description provides preferred exemplary embodiments only and is not intended to limit the scope, applicability, or configuration of the present invention. Rather, the detailed description of the preferred exemplary embodiments provides those skilled in the art with an enabling description for implementing the preferred exemplary embodiments of the invention. It will be apparent that various changes may be made in the function and arrangement of elements and steps without departing from the spirit and scope of the invention as set forth in the claims.

FIG. 1 is a schematic diagram illustrating the functional components of a system for multilingual name speech recognition, according to an embodiment of the present invention.
FIG. 2 is a table showing various names in two different languages and their associated pronunciations, each consisting of an ordered series of pronunciation units, according to an embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating the operation and components of a character/sound converter, according to an embodiment of the present invention.
FIG. 4 is a generalized flowchart summarizing a method for converting stored text into pronunciation units, according to an embodiment of the present invention that includes a Mandarin/English open vocabulary dictionary.
FIG. 5 is a generalized flowchart illustrating a method for matching a spoken word against names stored in the open vocabulary dictionary, according to an embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating a personal electronic device, in the form of a radiotelephone, on which the speech recognition system can be implemented, according to an embodiment of the present invention.

Claims

A speech recognition method for recognizing names in Chinese and English using a speech recognition system (100), the speech recognition method comprising:
a pronunciation unit conversion step of receiving non-speech character text representing a plurality of names and converting the character text into a pronunciation unit;
a feature vector conversion step of receiving a spoken word and converting the spoken word into a feature vector; and
a collating step of recognizing speech by collating the pronunciation unit with the feature vector,
wherein the pronunciation unit conversion step includes:
a character determination step (410) of determining, by an alphabet identifier (305), whether each of the names is made up of Chinese alphabet characters or Roman alphabet characters;
a pinyin determination step (420) of determining, by a pinyin identifier (325), whether the Roman alphabet characters are Chinese pinyin;
a Chinese character/sound conversion step (430) of converting, by a Chinese character/sound converter (315), the Chinese alphabet characters and the Roman alphabet characters that are Chinese pinyin into an ordered series of pronunciation units; and
an English character/sound conversion step (435) of converting, by an English character/sound converter (330), the Roman alphabet characters determined not to be Chinese pinyin into the pronunciation unit,
and wherein the feature vector conversion step includes:
a spoken word receiving step (505) of receiving the spoken word by a microphone (120); and
a conversion step (510) of converting the spoken word into a feature vector.

The speech recognition method according to claim 1, wherein the collating step further includes:
comparing the feature vector with the pronunciation unit with reference to Gaussian mixture parameters; and
decoding the feature vector using a Viterbi beam search algorithm, following backtracking pointers at the end of the search.

The speech recognition method according to claim 1, wherein the speech recognition system is a personal electronic device, the names are part of an address book or contact list of the electronic device stored in the speech recognition system (100) in association with personal phone numbers or other contact information, and the personal electronic device operates when the feature vector and the pronunciation unit match in the collating step.

A speech recognition system (100) for recognizing names in Chinese and English, the speech recognition system (100) comprising:
at least one of a keypad (606) and a touch screen (605) through which non-speech character text is input, the character text representing a plurality of names;
an alphabet identifier (305) for determining whether each of the names is made up of Chinese alphabet characters or Roman alphabet characters;
a pinyin identifier (325) for determining whether the Roman alphabet characters are Chinese pinyin;
a Chinese character/sound converter (315) for converting the Chinese alphabet characters and the Roman alphabet characters that are Chinese pinyin into an ordered series of pronunciation units;
an English character/sound converter (330) for converting the Roman alphabet characters determined not to be Chinese pinyin into the pronunciation unit;
a microphone (120) through which a spoken word is input;
a feature extractor (125) for converting the spoken word into a feature vector; and
an automatic speech recognition engine (130) for collating the feature vector with the pronunciation unit.

The speech recognition system (100) according to claim 4, wherein the automatic speech recognition engine (130) compares the feature vector with the pronunciation unit with reference to Gaussian mixture parameters, and decodes the feature vector using a Viterbi beam search algorithm, following backtracking pointers at the end of the search.

The speech recognition system (100) according to claim 4 or 5, wherein the automatic speech recognition engine (130) further includes:
an open vocabulary dictionary (110) storing the pronunciation units; and
a hidden Markov model set (115) including the Gaussian mixture parameters, which represent selected speech patterns of Chinese and of English, respectively,
wherein the selected Chinese speech patterns depend on the context of the pronunciation units, and
the selected English speech patterns do not depend on the context of the pronunciation units.

The speech recognition system (600) according to any one of claims 4 to 6, wherein the speech recognition system is a personal electronic device such as a mobile phone or a personal digital assistant.