JP2015087544A

JP2015087544A - Voice recognition device and voice recognition program

Info

Publication number: JP2015087544A
Application number: JP2013225805A
Authority: JP
Inventors: Hiroshi Kirita (桐田 洋); Takashi Nakagiri (中桐 隆)
Original assignee: Koto Co Ltd
Current assignee: Koto Co Ltd
Priority date: 2013-10-30
Filing date: 2013-10-30
Publication date: 2015-05-07

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device and a program capable of identifying the phoneme string of a pronounced word even when the speaker's pronunciation is ambiguous.
SOLUTION: A voice recognition device comprises: a storage section 5 having candidate phoneme storage means 13 for storing a candidate phoneme group, which includes correct phonemes contained in the correct phoneme string of a word and/or similar phonemes that are similar in pronunciation to, and associated with, the correct phonemes, together with the priority of each element of the candidate phoneme group; voice information input means 17 for generating voice information from a voice signal; assumed phoneme detection means 15 for detecting, in order of pronunciation, the phonemes assumed from the voice information to obtain an assumed phoneme group; and pronounced phoneme string identification means 19 for identifying the phoneme string of the pronounced word from the assumed phoneme group. The pronounced phoneme string identification means 19 searches whether consecutive elements of the assumed phoneme group are included in the candidate phoneme group and, when they are, identifies the element of the assumed phoneme group with the higher priority as the one that was pronounced.

Description

The present invention relates to a speech recognition device and a speech recognition program.

Speech recognition techniques are known that identify the words spoken by a speaker from speech information obtained by sampling and quantizing the speech signal of those words. As an application of such techniques, Patent Literature 1 describes a pronunciation learning device that points out a speaker's pronunciation errors. This device converts the speech signal output from a microphone into speech information with an analog-to-digital converter and detects the phoneme string of the pronounced word from that speech information. It then collates this phoneme string with the correct phoneme string of the word and with erroneous phoneme strings anticipated in advance, thereby detecting the speaker's pronunciation errors.

This pronunciation learning device extracts acoustic feature parameters from the speech information and identifies the phonemes corresponding to those parameters as the phonemes of the pronounced word. However, it is difficult to identify a phoneme string from the ambiguous pronunciation of a speaker for whom the word is not in their native language. Specifically, speech information obtained from ambiguous pronunciation may contain acoustic feature parameters for which several corresponding phonemes are detected, as well as acoustic feature parameters for which no corresponding phoneme is detected. In such cases the phoneme string of the pronounced word cannot be identified, and the device has had to ask the speaker to pronounce the word again or to interrupt speech recognition.

Patent Literature 1: Japanese Patent Laid-Open No. 06-110494

An object of the present invention is to provide a speech recognition device and a speech recognition program that can identify the phoneme string of a pronounced word even when the speaker's pronunciation is ambiguous.

The speech recognition device of the present invention comprises: a storage unit having candidate phoneme group storage means for storing a candidate phoneme group, which includes correct phonemes contained in the correct phoneme string of a word and/or similar phonemes that are similar in pronunciation to a correct phoneme and associated with that correct phoneme, together with the priority of each element of the candidate phoneme group; an input unit having speech information input means for generating speech information from the speech signal of the pronounced word; a processing unit having assumed phoneme detection means for detecting, in order of pronunciation, the assumed phonemes assumed from the speech information to obtain an assumed phoneme group, and pronounced phoneme string identification means for identifying, from the assumed phoneme group, the pronounced phoneme string of the pronounced word; and an output unit for outputting the pronounced phoneme string.
The pronounced phoneme string identification means searches whether consecutive elements of the assumed phoneme group are included in the candidate phoneme group and, when they are, identifies the element of the assumed phoneme group with the higher priority as the one that was pronounced.

In the speech recognition device of the present invention, the storage unit may further have phoneme feature storage means for storing phoneme features, which are feature information of the correct phonemes and the similar phonemes, in association with those phonemes. The speech information input means samples and quantizes the speech signal to generate the speech information. The assumed phoneme detection means calculates a partial speech feature from part of the speech information and detects, as an assumed phoneme, the correct phoneme or similar phoneme corresponding to the phoneme feature that is at least partially in common with that partial speech feature.

In the speech recognition device of the present invention, the candidate phoneme group may be stored in association with the word.

Further, in the speech recognition device of the present invention, the priority may be the occurrence frequency of each candidate phoneme, determined from statistics of the speech information, the speech signal, and/or the assumed phoneme group of the word as pronounced by speakers for whom the word is not in their native language.

In the speech recognition device of the present invention, the priority may also be the degree of approximation of each similar phoneme to the correct phoneme.

Furthermore, in the speech recognition device of the present invention, the priority may comprise both the occurrence frequency of the candidate phonemes and the degree of approximation of the similar phonemes to the correct phoneme. In this case, when consecutive elements of the assumed phoneme group are included in the candidate phoneme group, the pronounced phoneme string identification means identifies the element of the assumed phoneme group with the highest occurrence frequency as the one that was pronounced and, when several candidate phonemes share the highest occurrence frequency, identifies the element with the higher degree of approximation as the one that was pronounced.
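The two-level priority described above can be illustrated with a minimal Python sketch. The names `Priority`, `pick_pronounced`, and `group`, the phoneme symbols, and all numeric values are hypothetical and chosen only for illustration; the specification does not prescribe a data layout.

```python
from typing import NamedTuple


class Priority(NamedTuple):
    frequency: int     # occurrence frequency of the candidate phoneme (hypothetical counts)
    similarity: float  # degree of approximation to the correct phoneme (hypothetical scale)


def pick_pronounced(group: dict[str, Priority], assumed: list[str]) -> str:
    """Choose one phoneme among consecutive assumed phonemes of the same candidate group."""
    # Tuples compare element by element, so `similarity` only decides
    # the result when two phonemes have the same `frequency`.
    return max(assumed, key=lambda phoneme: group[phoneme])


# Hypothetical candidate phoneme group for the first phoneme of "apple".
group = {
    "ae": Priority(frequency=40, similarity=1.0),
    "a": Priority(frequency=40, similarity=0.7),
    "e": Priority(frequency=15, similarity=0.5),
}
```

Here `pick_pronounced(group, ["a", "ae"])` selects `"ae"`: both phonemes have the same occurrence frequency, so the degree of approximation decides.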

The speech recognition program of the present invention causes a computer to function as: candidate phoneme group storage means for storing a candidate phoneme group, which includes correct phonemes contained in the correct phoneme string of a word and/or similar phonemes that are similar in pronunciation to a correct phoneme and associated with that correct phoneme, together with the priority of each element of the candidate phoneme group; speech information input means for generating speech information from the speech signal of the pronounced word; assumed phoneme detection means for detecting, in order of pronunciation, the assumed phonemes assumed from the speech information to obtain an assumed phoneme group; pronounced phoneme string identification means for identifying, from the assumed phoneme group, the pronounced phoneme string of the pronounced word; and output means for outputting the pronounced phoneme string.
The pronounced phoneme string identification means searches whether consecutive elements of the assumed phoneme group are included in the candidate phoneme group and, when they are, identifies the element of the assumed phoneme group with the higher priority as the one that was pronounced.

With the speech recognition device and speech recognition program of the present invention, a word can be recognized on the basis of the priorities even when it is pronounced ambiguously by a speaker for whom the word is not in their native language. That is, when an ambiguous pronunciation is received, the assumed phoneme detection means detects the assumed phoneme group assumed from the speech information. When elements whose pronunciations are similar to one another occur consecutively in this assumed phoneme group, the element with the higher priority is identified as the phoneme that was pronounced. Consequently, there is no need to ask the speaker to pronounce the word again or to abort speech recognition.

FIG. 1 is a block diagram of the speech recognition device of the present invention.
FIG. 2 is a flowchart of speech recognition.
FIG. 3 is a flowchart of the pronounced phoneme string identification step.
FIG. 4 is a block diagram showing another embodiment of the speech recognition device of the present invention.
FIG. 5(a) shows another example of the candidate phoneme group storage means, FIG. 5(b) shows an example of an assumed phoneme string set, and FIG. 5(c) shows an example of an assumed phoneme group.

The speech recognition device of the present invention is described below, taking recognition of the English word "apple" as an example. In this specification, the same reference numerals used across the drawings denote the same or similar elements.

As shown in FIG. 1, the speech recognition device 1 of the present invention has an input unit 3, a storage unit 5, a processing unit 7, and an output unit 9. The speech recognition device 1 is a computer such as a tablet terminal, a personal computer, a mobile phone terminal, or a dedicated speech recognition device.

The storage unit 5 stores the programs required for speech recognition and can store, hold, and retrieve data relating to speech recognition. Typically it is an auxiliary storage device provided in the computer, such as a hard disk, flash memory, or dynamic random-access memory, that stores the program for causing the computer to function as the means described below. The storage unit 5 comprises correct phoneme string storage means 11 and candidate phoneme storage means 13.

The correct phoneme string storage means 11 stores the correct phoneme string, that is, the correct sequence of phonemes of a word, in association with that word. A correct phoneme string is a plurality of correct phonemes arranged in order of pronunciation, and a correct phoneme is a single phoneme that forms part of the correct pronunciation. In this application, a phoneme is not limited to the phoneme symbols shown in the tables and includes any information defined as representing a phoneme.

Table 1 shows an example of the correct phoneme string storage means 11.

The correct phoneme string storage means 11 is a table in which one record stores one item of word information, "apple", and the correct phoneme string of that word (correct phoneme 1, correct phoneme 2, correct phoneme 3, and correct phoneme 4). Each field of the correct phoneme string holds one correct phoneme.
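As a rough illustration of this record layout, the following Python sketch models one record per word with one phoneme per field. The class name, the lookup function, and the phoneme symbols `"ae"`, `"p"`, `"ah"`, `"l"` are placeholders assumed for this sketch and are not the values of Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectPhonemeRecord:
    word: str                  # word information, e.g. "apple"
    phonemes: tuple[str, ...]  # correct phoneme 1, 2, 3, ... in order of pronunciation


# One record per word; the phoneme symbols are hypothetical placeholders.
CORRECT_PHONEME_TABLE = {
    record.word: record
    for record in [CorrectPhonemeRecord("apple", ("ae", "p", "ah", "l"))]
}


def correct_phoneme_string(word: str) -> tuple[str, ...]:
    """Look up the correct phoneme string stored for a word."""
    return CORRECT_PHONEME_TABLE[word].phonemes
```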

The candidate phoneme storage means 13 stores candidate phoneme groups and priorities. A candidate phoneme group is typically the set of phonemes that the assumed phoneme detection means 15, described later, may detect for one correct phoneme of a word, and consists of that correct phoneme and its similar phonemes. A similar phoneme is a phoneme whose pronunciation resembles that of the correct phoneme. Whether a pronunciation resembles the correct phoneme can be decided, for example, from proximity in the phoneme chart of the International Phonetic Alphabet. Alternatively, it may be decided from the pronunciation tendencies of multiple speakers for whom the word (English) is not in their native language (for example, Japanese speakers) and/or of speakers who are native speakers of the language but do not speak its standard variety (speakers of so-called dialects). In this application, the correct phonemes and similar phonemes stored in the candidate phoneme storage means 13 are called candidate phonemes. A priority is stored in association with each candidate phoneme and indicates the degree of preference with which that candidate phoneme is selected from its candidate phoneme group. To enhance the learning effect, the priority of the correct phoneme may be set lower than that of the other candidate phonemes.

Table 2 shows an example of the candidate phoneme storage means 13.

The candidate phoneme storage means 13 is a table in which one record stores one candidate phoneme group (candidate phoneme 1, candidate phoneme 2, candidate phoneme 3, and candidate phoneme 4) and the priority of each candidate phoneme (priority 1, priority 2, priority 3, and priority 4). Each field stores a candidate phoneme or its priority. The number of candidate phonemes making up one candidate phoneme group is not particularly limited, and the number of priorities increases or decreases with the number of candidate phonemes.
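A minimal sketch of such a table in Python is given below, assuming each record is held as a mapping from candidate phoneme to priority and that a larger number means a higher priority. The phoneme symbols, the priority values, and the function name `find_candidate_group` are hypothetical and are not taken from Table 2.

```python
# Each record is one candidate phoneme group: candidate phoneme -> priority.
# Records may differ in size, since the number of candidates per group is not limited.
CANDIDATE_PHONEME_TABLE = [
    {"ae": 3, "a": 2, "e": 1},
    {"p": 2, "b": 1},
    {"ah": 2, "u": 1},
    {"l": 2, "r": 1},
]


def find_candidate_group(phoneme, table=CANDIDATE_PHONEME_TABLE):
    """Return the first record that contains `phoneme`, or None if no record does."""
    for group in table:
        if phoneme in group:
            return group
    return None
```

With this layout, checking whether two phonemes belong to the same candidate phoneme group reduces to a membership test on the record returned for one of them.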

The input unit 3 has speech information input means 17. The speech information input means 17 generates speech information from the speech signal of the pronounced word and inputs it to the processing unit 7. The speech information is a plurality of digital data items obtained by sampling and quantizing the analog speech signal. The speech information input means 17 typically consists of a microphone that receives the speech and an analog-to-digital converter electrically connected to the microphone. An amplifier and/or an automatic gain control may be provided between the microphone and the analog-to-digital converter.

The processing unit 7 identifies the pronounced phoneme string of the pronounced word on the basis of the speech information acquired from the input unit 3 and is typically the central processing unit of the terminal. As shown in FIG. 1, the processing unit 7 is communicably connected to the input unit 3, the storage unit 5, and the output unit 9, and has assumed phoneme detection means 15 and pronounced phoneme string identification means 19.

The assumed phoneme detection means 15 analyzes the speech information acquired from the input unit 3 and detects an assumed phoneme group. An assumed phoneme group is, for example, a set whose elements are assumed phonemes arranged in order of pronunciation, and an assumed phoneme is a phoneme corresponding to a phoneme feature that is at least partially in common with a partial speech feature. An assumed phoneme may also be a phoneme corresponding to a phoneme feature that shares a predetermined common element with a partial speech feature. A partial speech feature is a feature calculated from part of the data contained in the speech information, for example a feature obtained by frequency analysis of that part of the data, or its frequency components. The phoneme feature compared with the partial speech feature is typically an acoustic model or the like representing the frequency components of a phoneme, and is stored in the storage unit 5 in association with the phoneme. Each detected assumed phoneme is stored in its own variable. The assumed phoneme detection means 15 is, for example, a library installed on an operating system such as iOS (registered trademark), Android (registered trademark), or Windows (registered trademark).
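The idea of matching partial speech features against stored phoneme features can be sketched as follows. This is only one possible reading: it assumes, purely for illustration, that a feature is a set of frequency-bin indices and that "at least partially in common" means a non-empty intersection. The table `PHONEME_FEATURES`, its contents, and the function name are hypothetical.

```python
# Hypothetical phoneme features: phoneme -> set of characteristic frequency-bin indices.
PHONEME_FEATURES = {
    "ae": {7, 17},
    "a": {7, 12},
    "p": {2, 30},
    "l": {4, 11},
}


def detect_assumed_phonemes(partial_features):
    """Return assumed phonemes in order of pronunciation.

    `partial_features` is a list of sets, one per partial speech feature.
    Every phoneme whose stored feature overlaps a partial feature is emitted,
    so an ambiguous partial feature can yield several consecutive phonemes.
    """
    soutei = []
    for feature in partial_features:
        for phoneme, model in PHONEME_FEATURES.items():
            if feature & model:  # at least one bin in common
                soutei.append(phoneme)
    return soutei
```

For the input `[{7}, {2, 30}, {4}]`, the first partial feature overlaps both `"ae"` and `"a"`, so the result is `["ae", "a", "p", "l"]`, with two assumed phonemes for a single partial speech feature.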

Table 3 shows an example of the assumed phoneme group that is output.

This assumed phoneme group is obtained from the word "apple" as pronounced by the speaker and is stored in the array variable Soutei[]. In this example, two assumed phonemes for one partial speech feature are stored in Soutei[0] and Soutei[1], an assumed phoneme for another partial speech feature is stored in Soutei[2], an assumed phoneme for yet another partial speech feature is stored in Soutei[3], and an assumed phoneme for a further partial speech feature is stored in Soutei[4].

The pronounced phoneme string identification means 19 identifies, from the assumed phoneme group acquired from the assumed phoneme detection means 15, the pronounced phoneme string, that is, the phoneme string of the word as pronounced. When consecutive elements (assumed phonemes) of the assumed phoneme group are included in the same candidate phoneme group, those consecutive elements are regarded as a set of plausible phonemes for a single partial speech feature, and the element with the highest priority among them is identified as the phoneme that was pronounced. For example, when the assumed phoneme group Soutei[] shown in Table 3 is acquired, the values of the consecutive elements Soutei[0] and Soutei[1] are included in the same candidate phoneme group, so the priorities corresponding to these values are compared and Soutei[1], which has the higher priority, is identified as one pronounced phoneme. The values of Soutei[2] to Soutei[4], on the other hand, are not included in the same candidate phoneme group, so each is identified as a separate pronounced phoneme without any priority comparison. Soutei[1], Soutei[2], Soutei[3], and Soutei[4] are therefore identified as the pronounced phoneme string, which is stored, for example, in the array variable Hatsuone[].
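The behaviour described in this paragraph can be sketched compactly in Python by grouping consecutive assumed phonemes that fall in the same candidate phoneme group and keeping the highest-priority member of each run. The sketch assumes that each phoneme appears in at most one record, that a larger number means a higher priority, and that a phoneme found in no record stands alone; the table contents and function name are hypothetical.

```python
from itertools import groupby


def identify_pronounced_string(soutei, table):
    """Collapse runs of same-group assumed phonemes to their highest-priority member."""

    def group_key(phoneme):
        for index, group in enumerate(table):
            if phoneme in group:
                return index
        return ("unlisted", phoneme)  # assumption: unlisted phonemes form their own group

    hatsuone = []
    for key, run in groupby(soutei, key=group_key):
        run = list(run)
        if isinstance(key, int):
            hatsuone.append(max(run, key=table[key].__getitem__))
        else:
            hatsuone.append(run[0])
    return hatsuone


# Hypothetical data mirroring the shape of the example: five assumed phonemes,
# the first two of which belong to the same candidate phoneme group.
table = [{"ae": 2, "a": 1}, {"p": 2, "b": 1}, {"u": 1}, {"l": 2, "r": 1}]
soutei = ["a", "ae", "p", "u", "l"]
hatsuone = identify_pronounced_string(soutei, table)  # ["ae", "p", "u", "l"]
```

As in the example above, the second element wins over the first because of its higher priority, and the remaining three elements pass through unchanged.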

The output unit 9 outputs the pronounced phoneme string acquired from the pronounced phoneme string identification means 19. The output unit 9 is, for example, display means 21 provided in the terminal, which converts the pronounced phoneme string into text information and displays it. The output unit 9 is not limited to the display means 21 and may instead output to rating means (not shown) that rates the speaker's pronunciation on the basis of the pronounced phoneme string. The output unit 9 may further include audio playback means for reproducing and outputting sound. The audio playback means is, for example, a voice guide that outputs the sound of the correct phoneme string from a speaker, so that the user can hear a model of the correct pronunciation of the word to be recognized. The audio playback means may also output the acquired speech information from the speaker, so that the user can compare the voice guide with his or her own voice.

Next, the speech recognition method is described. As shown in FIG. 2, the speech recognition method includes a speech information input step (s10), an assumed phoneme detection step (s20), and a pronounced phoneme string identification step (s30).

The speech information input step (s10) is the step in which the speech information input means 17 generates speech information from the speech signal of the pronounced word and inputs this speech information to the processing unit. For example, the microphone receives the sound of the pronounced word "apple" and converts it into an electrical signal, and the analog-to-digital converter samples and quantizes the electrical signal to generate the speech information.
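Sampling and quantization can be illustrated in software with the following sketch, which stands in for the analog-to-digital converter. The 16 kHz sampling rate, the 16-bit depth, the 440 Hz test tone, and the function name are assumptions made for this illustration; the specification does not state these values.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz=16_000, bits=16):
    """Sample a continuous signal(t) with values in [-1, 1] and quantize it to signed integers."""
    max_level = 2 ** (bits - 1) - 1
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                  # sampling instant
        x = max(-1.0, min(1.0, signal(t)))      # clip to the converter's input range
        samples.append(round(x * max_level))    # quantize to the nearest integer level
    return samples


def tone(t):
    """Stand-in for the microphone's electrical signal: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)


speech_info = sample_and_quantize(tone, duration_s=0.01)  # 160 integer samples
```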

The assumed phoneme detection step (s20) is the step in which the assumed phoneme detection means 15 acquires the speech information obtained in the speech information input step (s10) and compares it with the phoneme features to detect an assumed phoneme group. For example, a plurality of partial speech features contained in the speech information are detected first. The method of detecting each partial speech feature is not particularly limited; for example, the speech information is divided into data sets of a predetermined frame length to generate a plurality of pieces of divided speech information, and frequency analysis or the like is performed on each piece. When consecutive partial speech features are the same, they may be merged into a single partial speech feature. Next, the phonemes corresponding to phoneme features that have something in common with a partial speech feature are extracted from the storage unit and stored as assumed phonemes in the elements of the array variable Soutei[] (Table 3).
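The frame splitting, per-frame frequency analysis, and merging of identical consecutive features might look like the sketch below. It assumes, for illustration only, that the partial speech feature of a frame is the index of its strongest frequency bin, computed with a naive discrete Fourier transform; a real system would use a richer feature. All function names are hypothetical.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Divide the speech information into consecutive frames; a trailing partial frame is dropped."""
    return [samples[k:k + frame_len]
            for k in range(0, len(samples) - frame_len + 1, frame_len)]


def dominant_bin(frame):
    """Index of the strongest frequency bin of one frame (naive DFT, bins 1 .. n/2 - 1)."""
    n = len(frame)
    magnitudes = [
        abs(sum(x * cmath.exp(-2j * math.pi * k * m / n) for m, x in enumerate(frame)))
        for k in range(1, n // 2)
    ]
    return 1 + magnitudes.index(max(magnitudes))


def partial_speech_features(samples, frame_len):
    """One feature per frame, with identical consecutive features merged into one."""
    features = []
    for frame in split_frames(samples, frame_len):
        feature = dominant_bin(frame)
        if not features or features[-1] != feature:
            features.append(feature)
    return features
```

For a signal whose first two 32-sample frames each contain two sine cycles and whose third frame contains five, `partial_speech_features(samples, 32)` yields two features rather than three, because the first two frames share the same feature.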

The pronounced phoneme string identification step (s30) is the step in which the pronounced phoneme string identification means 19 identifies, from the assumed phoneme group, the pronounced phoneme string, that is, the phoneme string of the word as pronounced. When consecutive elements (assumed phonemes) of the assumed phoneme group are included in the same candidate phoneme group, the element with the higher priority is identified as the phoneme that was pronounced. An example of the pronounced phoneme string identification step (s30) is shown in FIG. 3. The step (s30) comprises an assumed phoneme group acquisition step (s301), an initialization step (s302), a candidate phoneme search step (s303), a first element identification step (s304), a reference value resetting step (s305), a priority comparison step (s306), repetition determination steps (s307, s308), and a second element identification step (s309).

The assumed phoneme group acquisition step (s301) acquires the assumed phoneme group detected in the assumed phoneme detection step (s20). For example, the array variable Soutei[] is acquired.

The initialization step (s302) stores initial values in the variables used in the subsequent steps, namely the reference value and the array index variables. The reference value is a variable that holds one of the two elements (assumed phonemes) being compared when checking whether consecutive elements are included in the same candidate phoneme group and when comparing their priorities. Soutei[0] is stored as its initial value. The array index variables are the index variable i of the array variable Soutei[], which holds the assumed phoneme group, and the index variable j of the array variable Hatsuone[], which holds the pronounced phoneme string. The initial value of i is 0 and the initial value of j is 1.

The candidate phoneme search step (s303) searches whether consecutive elements (assumed phonemes) of the assumed phoneme group are included in the same candidate phoneme group. In this example, the consecutive elements are the reference value and the value stored in Soutei[i]. To perform the search, the candidate phoneme group (record) containing the reference value is extracted from the candidate phoneme storage means 13, and it is determined whether the extracted candidate phoneme group contains the value of the other element being compared, Soutei[i].

If the candidate phoneme search step (s303) finds that the consecutive elements are not included in the same candidate phoneme group, the process moves to the first element identification step (s304). The first element identification step (s304) identifies the reference value as an element of the pronounced phoneme string. In this example, the reference value is stored in element Hatsuone[j] of the array variable that holds the pronounced phoneme string, and the index variable j is incremented.

The reference value resetting step (s305) sets the other element (assumed phoneme), Soutei[i], as the new reference value.

If, on the other hand, the candidate phoneme search step (s303) finds that the consecutive elements are included in the same candidate phoneme group, the process moves to the priority comparison step (s306). The priority comparison step (s306) compares the priorities of the consecutive elements. For the comparison, a first priority corresponding to the reference value and a second priority corresponding to the other element, Soutei[i], are extracted from the candidate phoneme storage means 13, and a magnitude comparison of the two priorities is performed.

If the second priority is larger in the priority comparison step (s306), the process proceeds to the reference value resetting step (s305). Through this step, the other element (assumed phoneme), which has the higher priority, is set as the new reference value.

If, on the other hand, the first priority is larger in the priority comparison step (s306), or once the reference value resetting step (s305) has finished, the process proceeds to the repetition determination step (s307, s308). The repetition determination step (s307, s308) decides whether to repeat steps (s303) to (s305) so that they are carried out for all assumed phonemes. In this example, the element variable i is incremented (s307), and it is determined whether the array variable Soutei[i] holds a value (s308). If Soutei[i] holds a value, the process returns to the candidate phoneme search step (s303).

If, on the other hand, Soutei[i] holds no value in the repetition determination step (s307, s308), the repetition ends and the process proceeds to the second element identification step (s309). The second element identification step (s309) identifies the element (assumed phoneme) currently set as the reference value as the last element of the pronounced phoneme string, and stores the reference value in element Hatsuone[j] of the array variable.

Through the above steps, the elements of the pronounced phoneme string are stored in the elements of Hatsuone[], and the pronounced phoneme string is thereby identified.
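Steps (s303) to (s309) can be sketched as follows. This is only an illustrative reading of the flow described above, not the patent's implementation: the function names, the representation of a candidate phoneme group as a dictionary from phoneme to priority, and the sample data for the word "very" are all assumptions made for the example.

```python
def find_shared_group(reference, other, candidate_groups):
    """s303: return the first candidate group containing both phonemes, or None."""
    return next(
        (g for g in candidate_groups if reference in g and other in g),
        None,
    )


def identify_pronounced_phonemes(soutei, candidate_groups):
    """Collapse an assumed phoneme group (soutei) into a pronounced phoneme string."""
    if not soutei:
        return []
    hatsuone = []
    reference = soutei[0]              # initial reference value
    for other in soutei[1:]:           # s307/s308: iterate while Soutei[i] exists
        group = find_shared_group(reference, other, candidate_groups)
        if group is None:
            hatsuone.append(reference)  # s304: fix the reference as an element
            reference = other           # s305: reset the reference value
        elif group[other] > group[reference]:  # s306: compare priorities
            reference = other           # s305: higher priority becomes reference
    hatsuone.append(reference)          # s309: last element
    return hatsuone


# Hypothetical candidate phoneme groups for "very" (phoneme -> priority).
VERY_GROUPS = [{"v": 3, "b": 2}, {"e": 3}, {"r": 3, "l": 2}, {"i": 3}]

print(identify_pronounced_phonemes(["v", "b", "e", "r", "l", "i"], VERY_GROUPS))
# ['v', 'e', 'r', 'i']
```

In this sketch, an ambiguous position is represented by two consecutive assumed phonemes from the same group, and only the one with the higher priority survives.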

The speech recognition apparatus 1 of the present invention can recognize a pronounced word on the basis of priority, even when the pronunciation is ambiguous because the speaker's native language is not that of the word to be recognized. That is, when the speech recognition apparatus 1 receives an ambiguous pronunciation, the assumed phoneme detection means 15 detects a group of plausible assumed phonemes, and the apparatus then decides, on the basis of priority, which of the detected assumed phonemes to adopt. Thus, even when it is difficult to determine a phoneme from phoneme features such as an acoustic model, the phoneme can be identified on the basis of priority. Consequently, there is no need to ask the speaker to repeat the word or to abort speech recognition.

The priority may be the occurrence frequency of a candidate phoneme. The occurrence frequency is determined from statistics of the pronunciation information of a plurality of speakers, where the pronunciation information is the speech signal, the speech information, and/or the assumed phoneme group. The plurality of speakers are speakers whose native language is not that of the word to be recognized. This makes it possible to identify the pronounced phoneme string while taking into account the pronunciation tendencies of speakers (for example, Japanese speakers) whose native language is not that of the word to be recognized (English). The plurality of speakers may also be speakers for whom the word to be recognized is in their native language but who do not speak the standard variety (that is, speakers of a dialect). Furthermore, the occurrence frequency of candidate phonemes may be determined for each region, which enables speech recognition that reflects regional pronunciation tendencies arising from dialects. The occurrence frequency may also be determined from statistics of the accumulated pronunciation information of one specific user, which enables speech recognition that reflects that user's characteristic errors.
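As a rough illustration of deriving occurrence frequencies from pronunciation statistics, the sketch below counts which phoneme was observed at one position of a word across several speakers and normalizes the counts. The function name and the sample observations are hypothetical, and the patent does not specify how the statistics are computed.

```python
from collections import Counter


def occurrence_frequencies(observed_phonemes):
    """Return the relative frequency of each observed phoneme."""
    counts = Counter(observed_phonemes)
    total = sum(counts.values())
    return {phoneme: count / total for phoneme, count in counts.items()}


# Hypothetical observations of the third phoneme of "very" from four speakers.
print(occurrence_frequencies(["r", "l", "r", "r"]))
# {'r': 0.75, 'l': 0.25}
```

The same function could be applied per region or per user by restricting the observations to that region or user.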

The priority may also be the degree of approximation of a similar phoneme to the correct phoneme. This degree of approximation is determined by how close the phonemes are to each other on the International Phonetic Alphabet distribution chart. Alternatively, it may be determined from the degree of agreement between the speech signals of the correct phoneme and the similar phoneme. This degree of agreement is determined, for example, by observing the speech signal waveform of each phoneme with an oscilloscope.

Furthermore, the priority may be a priority ranking of the candidate phonemes.

The priority of the present invention may also consist of both the occurrence frequency and the degree of approximation described above. In that case, when consecutive elements (assumed phonemes) in the assumed phoneme group are included in a candidate phoneme group, the pronounced phoneme string identification means 19 identifies the element (assumed phoneme) with the highest occurrence frequency as the one that was pronounced and, when several elements (assumed phonemes) share the highest occurrence frequency, identifies the element (assumed phoneme) with the higher degree of approximation as the one that was pronounced. This allows the pronounced phoneme string to be identified in consideration of both the speaker's pronunciation tendencies and the closeness of the sounds.
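The two-level priority can be sketched as a comparison on a pair of values: occurrence frequency first, degree of approximation as the tie-breaker. The function name and the numeric values below are assumptions chosen only for illustration.

```python
def pick_pronounced(elements, frequency, approximation):
    """Choose the element with the highest frequency; break ties by approximation."""
    return max(elements, key=lambda p: (frequency[p], approximation[p]))


# Hypothetical values: "r" and "l" are equally frequent, so approximation decides.
frequency = {"r": 0.4, "l": 0.4, "w": 0.2}
approximation = {"r": 1.0, "l": 0.7, "w": 0.5}

print(pick_pronounced(["l", "r", "w"], frequency, approximation))
# r
```

Python compares tuples element by element, so the degree of approximation is consulted only when the frequencies are equal.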

The speech recognition apparatus 1 of the present invention is not limited to the above example.

The candidate phoneme storage means 13 may, for example as shown in Table 4, store candidate phoneme groups and priorities in association with words. It may also store a plurality of candidate phoneme groups together with the priority of each element of each candidate phoneme group.

In this candidate phoneme storage means 13, a plurality of words, the candidate phoneme group associated with each word, and the priority of each element (each candidate phoneme) constituting the candidate phoneme group are stored in a record for each word.

As shown in FIG. 4, the input unit 3 of the present invention may also include word input means 23 for designating, to the pronounced phoneme string identification means 19, the word that is to be recognized. The word input means 23 is, for example, a character input device such as a keyboard for entering a word, or a coordinate input device such as a mouse, touch panel, or digitizer for selecting a word displayed on the display means 21.

In this speech recognition apparatus 10, the word to be recognized is entered through the word input means 23. The pronounced phoneme string identification means 19 then identifies the candidate phoneme group (record) using the entered word as a key. This makes speech recognition of a plurality of words possible. In addition, because a candidate phoneme group is defined for each word, similar phonemes that occur frequently only in a particular word can be included in that word's candidate phoneme group. This improves the accuracy with which the pronounced phoneme string is detected.

Furthermore, the candidate phoneme storage means 13 may be a set of tables, one per word. That is, one table (an apple table, not shown) may store the candidate phoneme groups and priorities for the word "apple", and another table (a very table, not shown) may store the candidate phoneme groups and priorities for the word "very". In this case, when the word to be recognized is entered through the word input means 23, the pronounced phoneme string identification means 19 searches the table for the entered word and identifies the candidate phoneme groups.
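A word-keyed store of this kind might look like the sketch below, in which each word maps to its own list of candidate phoneme groups. The table contents, priorities, and names are hypothetical and are not taken from Table 4.

```python
# Hypothetical per-word tables: word -> list of {phoneme: priority} groups.
CANDIDATE_TABLES = {
    "apple": [{"æ": 3, "a": 2}, {"p": 3}, {"l": 3, "r": 2}],
    "very": [{"v": 3, "b": 2}, {"e": 3}, {"r": 3, "l": 2}, {"i": 3}],
}


def candidate_groups_for(word, tables):
    """Look up the candidate phoneme groups for the word entered by the user."""
    return tables.get(word, [])


print(candidate_groups_for("very", CANDIDATE_TABLES)[0])
# {'v': 3, 'b': 2}
```

Returning an empty list for an unknown word is a design choice of this sketch; the text does not say how an unregistered word is handled.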

As shown in FIG. 5, the candidate phoneme group 33 may be a candidate phoneme string set 27 whose elements are candidate phoneme strings 25a and 25b, and the assumed phoneme group detected by the assumed phoneme detection means 15 may be an assumed phoneme string set 31 whose elements are assumed phoneme strings 29.

As shown in FIG. 5(a), the candidate phoneme string set 27 is stored in association with a word. The candidate phoneme strings 25a and 25b are, respectively, the correct phoneme string 25a of the word and an erroneous phoneme string 25b that contains a similar phoneme. Each of the candidate phoneme strings 25a and 25b is stored in the candidate phoneme storage means 13 in association with its priority 35.

As shown in FIG. 5(b), an assumed phoneme string 29 is a sequence of assumed phonemes arranged in order of pronunciation. The pronounced phoneme string identification means 19 may determine whether consecutive elements of the assumed phoneme string set 31 (the assumed phoneme strings Soutei[0] and Soutei[1]) are included in the same candidate phoneme string set 27 and, if they are, identify the element (assumed phoneme string 29) with the higher priority among those consecutive elements as the pronounced phoneme string.
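The string-level variant can be sketched as follows. For brevity this example considers all assumed phoneme strings found in the candidate phoneme string set at once, rather than comparing them pairwise as consecutive elements; that simplification, the function name, and the sample priorities are assumptions of the sketch.

```python
def identify_pronounced_string(assumed_strings, candidate_string_set):
    """Return the assumed phoneme string with the highest priority, or None."""
    known = [s for s in assumed_strings if s in candidate_string_set]
    if not known:
        return None
    return max(known, key=lambda s: candidate_string_set[s])


# Hypothetical candidate phoneme string set for "very" (string -> priority).
VERY_STRINGS = {
    ("v", "e", "r", "i"): 3,  # correct phoneme string
    ("b", "e", "r", "i"): 2,  # erroneous string with a similar phoneme
    ("v", "e", "l", "i"): 1,
}

assumed = [("b", "e", "r", "i"), ("v", "e", "r", "i")]
print(identify_pronounced_string(assumed, VERY_STRINGS))
# ('v', 'e', 'r', 'i')
```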

As shown in FIG. 5(c), the assumed phoneme group may also be a two-dimensional array. In this assumed phoneme group, assumed phonemes are stored in the columns so that each row defines one assumed phoneme string. In this case, the pronounced phoneme string identification means 19 treats Soutei[0][0] and Soutei[1][0] as the consecutive elements and determines whether they are included in the same candidate phoneme group (Tables 2 and 4). Alternatively, the pronounced phoneme string identification means 19 may first generate the assumed phoneme string set 31 by concatenating the phonemes of the columns row by row, and then identify the pronounced phoneme string as described above.
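One possible reading of the two-dimensional case is sketched below: each row is one assumed phoneme string, and the phonemes at the same column position in different rows (such as Soutei[0][0] and Soutei[1][0]) are compared by priority. The text does not spell out the loop, so the column-wise traversal, the assumption that all rows have equal length, and the sample data are all hypothetical.

```python
def resolve_columns(soutei_2d, candidate_groups):
    """Pick, for each column, the phoneme with the highest priority across rows."""
    result = []
    for column in zip(*soutei_2d):       # assumes rows of equal length
        best = column[0]
        for other in column[1:]:
            group = next(
                (g for g in candidate_groups if best in g and other in g),
                None,
            )
            if group is not None and group[other] > group[best]:
                best = other
        result.append(best)
    return result


def rows_to_strings(soutei_2d):
    """Alternative: join each row into one assumed phoneme string."""
    return [tuple(row) for row in soutei_2d]


GROUPS = [{"v": 3, "b": 2}, {"e": 3}, {"r": 3, "l": 2}, {"i": 3}]
SOUTEI_2D = [["b", "e", "l", "i"], ["v", "e", "r", "i"]]

print(resolve_columns(SOUTEI_2D, GROUPS))
# ['v', 'e', 'r', 'i']
```

The `rows_to_strings` helper corresponds to the alternative of building the assumed phoneme string set first and then comparing whole strings.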

Although the present invention has been described using English speech recognition as an example, it is not limited to English and may be applied to speech recognition of other languages.

The speech recognition apparatus 1 described above represents a preferred embodiment of the present invention, and the correct phoneme string storage means 11 of the storage unit 5 is not an essential component of the present invention.

The candidate phoneme storage means 13 may also be updatable. That is, the speech recognition apparatus of the present invention may include a transmission/reception unit connectable to a network, and the processing unit may include registration means for registering, in the candidate phoneme storage means 13, candidate phonemes and their priorities acquired through the transmission/reception unit, and notification means for reporting the pronounced phoneme string through the transmission/reception unit.

With this configuration, a server that can communicate with the speech recognition apparatus 1 of the present invention over the network can collect pronounced phoneme strings from a plurality of speech recognition apparatuses 1 and store them in the server's storage unit. The server's processing unit can generate candidate phonemes from statistics of the collected pronounced phoneme strings and transmit the generated candidate phonemes to each speech recognition apparatus. Each speech recognition apparatus 1 can then newly register the acquired candidate phonemes and priorities. Such a speech recognition apparatus 1 can perform speech recognition based on the pronunciation tendencies of a plurality of geographically separated speakers.

The speech recognition program of the present invention causes a computer to function as the correct phoneme storage means 11, the candidate phoneme storage means 13, the assumed phoneme detection means 15, the pronounced phoneme string identification means 19, the speech information input means 17, the word input means 23, and the display means 21 described above.

The speech recognition apparatus and speech recognition program of the present invention have been described above. The present invention can, however, be carried out in forms incorporating various improvements, corrections, and modifications based on the knowledge of those skilled in the art without departing from its spirit, and all such forms fall within the scope of the present invention.

The speech recognition apparatus of the present invention is used for word-level speech recognition in a pronunciation practice machine for practicing the pronunciation of a non-native language. It may also be used for speech recognition in a terminal equipped with voice input.

1, 10 ... Speech recognition apparatus
3 ... Input unit
5 ... Storage unit
7 ... Processing unit
9 ... Output unit
11 ... Correct phoneme storage means
13 ... Candidate phoneme storage means
15 ... Assumed phoneme detection means
17 ... Speech information input means
19 ... Pronounced phoneme string identification means
21 ... Display means
23 ... Word input means

Claims

A speech recognition apparatus comprising:
a storage unit having candidate phoneme group storage means for storing a candidate phoneme group, which includes correct phonemes contained in the correct phoneme string of a word and/or similar phonemes that are similar in pronunciation to the correct phonemes and are associated with the correct phonemes, and a priority of each element of the candidate phoneme group;
an input unit having speech information input means for generating speech information from a speech signal of the pronounced word;
a processing unit having assumed phoneme detection means for detecting assumed phonemes from the speech information in order of pronunciation to obtain an assumed phoneme group, and pronounced phoneme string identification means for identifying a pronounced phoneme string of the pronounced word from the assumed phoneme group; and
an output unit for outputting the pronounced phoneme string,
wherein the pronounced phoneme string identification means searches whether consecutive elements in the assumed phoneme group are included in the candidate phoneme group and, when consecutive elements in the assumed phoneme group are included in the candidate phoneme group, identifies the element of the assumed phoneme group having the higher priority as the one that was pronounced.

The speech recognition apparatus according to claim 1, wherein:
the storage unit includes phoneme feature storage means for storing phoneme features, which are feature information of the correct phonemes and the similar phonemes, in association with the correct phonemes and the similar phonemes;
the speech information input means samples and quantizes the speech signal to generate the speech information; and
the assumed phoneme detection means calculates a partial speech feature from a part of the speech information and takes, as the assumed phoneme, the correct phoneme or similar phoneme corresponding to a phoneme feature that is at least partially in common with the partial speech feature.

The speech recognition apparatus according to claim 1, wherein the candidate phoneme group is stored in association with the word.

The speech recognition apparatus according to claim 3, wherein the priority is an occurrence frequency of the elements of the candidate phoneme group, determined from statistics of the speech information, the speech signal, and/or the assumed phoneme group of the word as pronounced by a plurality of speakers.

The speech recognition apparatus according to claim 1, wherein the priority is an approximation degree of the similar phoneme to the regular phoneme.

The speech recognition apparatus, wherein:
the priority comprises an occurrence frequency of the elements of the candidate phoneme group, determined from statistics of the speech information, the speech signal, and/or the assumed phoneme group of the word as pronounced by speakers whose native language is not that of the word, and a degree of approximation of the similar phonemes to the correct phonemes; and
the pronounced phoneme string identification means, when consecutive elements in the assumed phoneme group are included in the candidate phoneme group, identifies the element of the assumed phoneme group having the highest occurrence frequency as the one that was pronounced and, when there are a plurality of candidate phonemes having the highest occurrence frequency, identifies the element of the assumed phoneme group having the higher degree of approximation as the one that was pronounced.

A speech recognition program that causes a computer to function as:
candidate phoneme group storage means for storing a candidate phoneme group, which includes correct phonemes contained in the correct phoneme string of a word and/or similar phonemes that are similar in pronunciation to the correct phonemes and are associated with the correct phonemes, and a priority of each element of the candidate phoneme group;
speech information input means for generating speech information from a speech signal of the pronounced word;
assumed phoneme detection means for detecting assumed phonemes from the speech information in order of pronunciation to obtain an assumed phoneme group;
pronounced phoneme string identification means for identifying a pronounced phoneme string of the pronounced word from the assumed phoneme group; and
output means for outputting the pronounced phoneme string,
wherein the pronounced phoneme string identification means searches whether consecutive elements in the assumed phoneme group are included in the candidate phoneme group and, when consecutive elements in the assumed phoneme group are included in the candidate phoneme group, identifies the element of the assumed phoneme group having the higher priority as the one that was pronounced.