JP6221267B2

JP6221267B2 - Speech recognition apparatus and method, and semiconductor integrated circuit device

Info

Publication number: JP6221267B2
Application number: JP2013042664A
Authority: JP
Inventors: 勉野中
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2013-03-05
Filing date: 2013-03-05
Publication date: 2017-11-01
Anticipated expiration: 2033-03-05
Also published as: JP2014170163A

Description

The present invention relates to a speech recognition apparatus and a speech recognition method that recognize speech and perform a response or processing corresponding to the recognition result, as part of human interface technology in vending machines, home appliances, housing equipment, in-vehicle devices (navigation devices and the like), portable terminals, and the like. The present invention further relates to a semiconductor integrated circuit device and the like used in such a speech recognition apparatus.

Speech recognition is a technique for obtaining a recognition result by analyzing an input speech signal and collating the resulting feature pattern with standard patterns (also called "templates") that are prepared in a speech recognition database on the basis of prerecorded speech signals. However, when no limit is placed on the range to be collated, the number of combinations of feature patterns and standard patterns to be compared becomes enormous. Obtaining a recognition result then takes a long time, and the recognition rate tends to fall because the number of words or sentences having similar standard patterns also increases.

In addition, in speech recognition, the strictness or ambiguity of the recognition accuracy required when recognizing a word or sentence from a speech signal is set to a constant level regardless of the number of words or sentences having similar display patterns.

As related prior art, Patent Document 1 discloses a speech recognition apparatus intended to recognize a user's utterance accurately even when the utterance is ambiguous. This apparatus determines the control content of a controlled object on the basis of the recognition result for input speech. It comprises task type determination means for determining the type of task representing the control content on the basis of a predetermined determination input, and speech recognition means for recognizing the input speech with tasks of the type determined by the task type determination means as the judgment targets.

When the user's words are recognized well from the speech signal, the speech recognition apparatus of Patent Document 1 can limit the recognition targets according to an indication of how control is to be performed, even if the user's words do not specify what is to be controlled, and can thereby determine the control content of the controlled object. However, the strictness or ambiguity of the recognition accuracy required when recognizing the user's words from the speech signal is constant, so the recognition rate in speech recognition cannot be improved.

JP 2008-64885 A (paragraphs 0006-0010)

As described above, in speech recognition, the strictness or ambiguity of the recognition accuracy required when recognizing a word or sentence from a speech signal is set to a constant level regardless of the number of words or sentences having similar display patterns. Speech recognition is therefore performed under the same recognition conditions whether the number of options is large or small, and whether the options contain many or few similar words, so the recognition rate in speech recognition does not improve.

In view of the above, one object of the present invention is to improve the recognition rate in speech recognition by appropriately limiting the number of options in speech recognition and by varying the strictness or ambiguity of the recognition accuracy required for speech recognition according to the options. The present invention has been made to solve at least one of the problems described above.

A semiconductor integrated circuit device according to a first aspect of the present invention comprises: a speech recognition database storage unit that stores a speech recognition database containing standard patterns representing the distribution of frequency components of a plurality of phonemes used in a predetermined language; a conversion information setting unit that receives, together with a command, text data representing words or sentences that are conversion candidates and a recognition accuracy parameter representing the strictness of the recognition accuracy applied when recognizing those words or sentences, and that sets the text data in a conversion list according to the command; a conversion list storage unit that stores the conversion list; a standard pattern extraction unit that extracts, from the speech recognition database, standard patterns corresponding to at least a part of each word or sentence represented by the text data set in the conversion list; a recognition accuracy adjustment unit that adjusts the range of the spread of the standard patterns extracted from the speech recognition database according to the recognition accuracy parameter; a signal processing unit that extracts the frequency components of an input speech signal by applying a Fourier transform to it and generates a feature pattern representing the distribution of those frequency components; and a match detection unit that detects a match when the feature pattern generated from at least a part of the speech signal falls within the range of the spread of a standard pattern, and outputs a speech recognition result identifying, among the words or sentences that are conversion candidates, the word or sentence for which the match was detected.

A speech recognition apparatus according to the first aspect of the present invention comprises the semiconductor integrated circuit device according to the first aspect of the present invention, and a control unit that transmits, to the semiconductor integrated circuit device together with a command, text data representing a plurality of words or sentences that are conversion candidates and a recognition accuracy parameter selected according to those words or sentences.

Furthermore, a speech recognition method according to the first aspect of the present invention comprises: step (a) of receiving, together with a command, text data representing words or sentences that are conversion candidates and a recognition accuracy parameter representing the strictness of the recognition accuracy applied when recognizing those words or sentences, and setting the text data in a conversion list according to the command; step (b) of extracting, from a speech recognition database containing standard patterns representing the distribution of frequency components of a plurality of phonemes used in a predetermined language, standard patterns corresponding to at least a part of each word or sentence represented by the text data set in the conversion list; step (c) of adjusting the range of the spread of the standard patterns extracted from the speech recognition database according to the recognition accuracy parameter; step (d) of extracting the frequency components of an input speech signal by applying a Fourier transform to it and generating a feature pattern representing the distribution of those frequency components; and step (e) of detecting a match when the feature pattern generated from at least a part of the speech signal falls within the range of the spread of a standard pattern, and outputting a speech recognition result identifying, among the words or sentences that are conversion candidates, the word or sentence for which the match was detected.

According to the first aspect of the present invention, when speech recognition is performed by following a deep hierarchical menu, the recognition rate in speech recognition can be improved by appropriately limiting the number of options, setting a recognition accuracy parameter suited to each combination of options, and adjusting the range of the spread of the standard patterns according to that parameter.

A semiconductor integrated circuit device according to a second aspect of the present invention further comprises a speech signal synthesis unit that receives response data representing the content of a response to the speech recognition result and synthesizes an output speech signal on the basis of the response data. This makes it possible to create a situation in which the user's answer to a question or message issued on the basis of the response data is expected to be one of several words or sentences.

A speech recognition apparatus according to the second aspect of the present invention comprises the semiconductor integrated circuit device according to the second aspect of the present invention, and a control unit that selects one response content from among a plurality of response contents according to the speech recognition result output from the semiconductor integrated circuit device and transmits, to the semiconductor integrated circuit device together with a command, response data representing the selected response content, text data representing words or sentences that are conversion candidates as answers to that response content, and a recognition accuracy parameter selected according to those words or sentences. As a result, text data representing a plurality of words or sentences corresponding to the question or message issued on the basis of the response data can be set in the conversion list, and the recognition accuracy parameter selected according to those words or sentences can be set in the recognition accuracy adjustment unit.

In a semiconductor integrated circuit device according to a third aspect of the present invention, the signal processing unit activates a speech detection signal when the level of the speech signal exceeds a predetermined value. This makes it possible to determine whether a request or answer from the user is present.

A speech recognition apparatus according to the third aspect of the present invention comprises the semiconductor integrated circuit device according to the third aspect of the present invention, and a control unit that, when no speech recognition result indicating a match between the feature pattern and a standard pattern is obtained within a predetermined period after the speech detection signal is activated, transmits a new recognition accuracy parameter to the semiconductor integrated circuit device together with a new command and controls the semiconductor integrated circuit device so that it performs match detection. As a result, when no such speech recognition result is obtained within the predetermined period, match detection can be performed again with a changed recognition accuracy parameter.

A diagram showing a configuration example of a speech recognition apparatus according to an embodiment of the present invention.
A flowchart showing the speech recognition method carried out by the speech recognition apparatus shown in FIG. 1.
A diagram showing conversion list A, which contains a plurality of food names displayed on a food menu.
A diagram showing conversion list B, which contains a plurality of answers to a question.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing a configuration example of a speech recognition apparatus according to an embodiment of the present invention. This speech recognition apparatus is mounted on, for example, a vending machine, a home appliance, housing equipment, an in-vehicle device (a navigation device or the like), or a portable terminal. It recognizes the user's speech and performs a response or processing corresponding to the recognition result.

As shown in FIG. 1, the speech recognition apparatus includes a speech input unit 10, an A/D converter 20, a semiconductor integrated circuit device 30 for speech recognition, a D/A converter 40, a speech output unit 50, and a control unit 60. At least some of the speech input unit 10, the A/D converter 20, the D/A converter 40, and the speech output unit 50 may be built into the semiconductor integrated circuit device 30.

The control unit 60 includes a host CPU (central processing unit) 61 and a storage unit 62. The host CPU 61 operates according to software (a speech recognition control program) recorded on the recording medium of the storage unit 62. A hard disk, flexible disk, MO, MT, CD-ROM, DVD-ROM, or the like can be used as the recording medium. The host CPU 61 controls the speech recognition operation of the semiconductor integrated circuit device 30 by supplying control signals to it.

The speech input unit 10 includes a microphone that converts speech into an electrical signal (speech signal), an amplifier that amplifies the speech signal output from the microphone, and a low-pass filter that limits the band of the amplified speech signal. The A/D converter 20 samples the analog speech signal output from the speech input unit 10 and converts it into a digital speech signal (speech data). For example, the audio frequency band of the speech data is 12 kHz and the bit depth is 16 bits.

The semiconductor integrated circuit device 30 includes a signal processing unit 31, a speech recognition DB (database) storage unit 32, a conversion information setting unit 33, a conversion list storage unit 34, a standard pattern extraction unit 35, a recognition accuracy adjustment unit 36, and a match detection unit 37. The semiconductor integrated circuit device 30 may further include a speech signal synthesis unit 38 and/or a speech synthesis DB (database) storage unit 39.

The signal processing unit 31 extracts a plurality of frequency components of the input speech signal by applying a Fourier transform to it, and generates a feature pattern representing the distribution of those frequency components. The generated feature pattern is output to the match detection unit 37. In addition, when the level of the input speech signal exceeds a predetermined value, the signal processing unit 31 activates a speech detection signal and outputs it to the match detection unit 37 and the host CPU 61. This makes it possible to determine whether a request or answer from the user is present.

An example of a method for obtaining a feature pattern from a speech signal is described here. The signal processing unit 31 filters the input speech signal to emphasize its high-frequency components. Next, the signal processing unit 31 applies a Hamming window to the speech waveform represented by the speech signal, thereby dividing the time-series speech signal at predetermined time intervals into a plurality of frames. The signal processing unit 31 then extracts a plurality of frequency components by applying a Fourier transform to the speech signal frame by frame. Because each frequency component is a complex number, the signal processing unit 31 obtains the absolute value of each frequency component.

The signal processing unit 31 applies frequency-domain windows defined on the basis of the mel scale to those frequency components and integrates over each window, thereby obtaining as many numerical values as there are windows. The signal processing unit 31 then takes the logarithm of those values and applies a discrete cosine transform to the logarithmic values. With 20 frequency-domain windows, for example, 20 numerical values are obtained.

Among the numerical values obtained in this way, the low-order ones (for example, 12 values) are called MFCCs (mel-frequency cepstral coefficients). The signal processing unit 31 calculates the MFCCs for each frame, concatenates them according to an HMM (hidden Markov model), and obtains, as the feature pattern, the MFCCs corresponding to each phoneme contained in the speech signal input in time series.
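The per-frame steps described above (high-frequency emphasis, Hamming windowing, Fourier transform, magnitude, mel-scale windows, logarithm, discrete cosine transform, keeping the low-order values) can be sketched roughly as follows. This is only an illustrative sketch using the Python standard library. The function names, the pre-emphasis coefficient 0.97, the frame length, the hop size, the triangular window shape, and the choice of keeping the first 12 DCT outputs are all assumptions that the text does not specify, and the HMM-based concatenation step is omitted.

```python
import cmath
import math

def pre_emphasize(samples, coeff=0.97):
    # Emphasize high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value).
    return [samples[0]] + [samples[n] - coeff * samples[n - 1]
                           for n in range(1, len(samples))]

def hamming_frames(samples, frame_len, hop):
    # Cut the signal into overlapping frames and apply a Hamming window to each.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[samples[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    # Naive DFT (kept simple for clarity); absolute value of bins 0..N/2.
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, num_bins, sample_rate):
    # Triangular windows whose edges are equally spaced on the mel scale (assumed shape).
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(spectrum, bank, num_coeffs=12):
    # Integrate over each mel window, take the log, apply a DCT-II, keep low-order values.
    logs = [math.log(sum(w * a for w, a in zip(row, spectrum)) + 1e-10) for row in bank]
    m = len(logs)
    dct = [sum(logs[j] * math.cos(math.pi * i * (j + 0.5) / m) for j in range(m))
           for i in range(m)]
    return dct[:num_coeffs]

def mfcc_features(samples, sample_rate, frame_len=64, hop=32,
                  num_filters=20, num_coeffs=12):
    frames = hamming_frames(pre_emphasize(samples), frame_len, hop)
    bank = mel_filterbank(num_filters, frame_len // 2 + 1, sample_rate)
    return [mfcc(magnitude_spectrum(f), bank, num_coeffs) for f in frames]
```

Calling `mfcc_features(samples, sample_rate)` returns one 12-value vector per frame. A practical implementation would use a fast Fourier transform rather than the naive DFT shown here.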

Here, a "phoneme" means an element of sound that is regarded as the same within a given language. The following description assumes that Japanese is the language used. The phonemes of Japanese comprise the vowels "a", "i", "u", "e", and "o"; consonants such as "k", "s", "t", and "n"; the semivowels "j" and "w"; and the special morae "N", "Q", and "H".

The speech recognition database storage unit 32 stores a speech recognition database containing standard patterns that represent the distribution of frequency components for the various phonemes used in a predetermined language. In the speech recognition database, text data representing each phoneme is associated with a standard pattern serving as option information.

The standard patterns are created in advance from speech uttered by a large number of speakers (for example, about 200). In creating the standard patterns, MFCCs are obtained from the speech signal representing each phoneme. However, the MFCCs created from speech uttered by many speakers vary in their individual values.

The standard pattern for each phoneme therefore has a spread, reflecting this variation, in a multidimensional space (for example, a 12-dimensional space). If the feature pattern generated from the speech signal input to the signal processing unit 31 falls within the range of the spread of a standard pattern, the two phonemes are determined to match.

A plurality of speech recognition databases may be used instead of a single one. For example, the speech recognition database storage unit 32 may store a plurality of speech recognition databases generated from speech signals obtained by recording the speech of several groups of speakers differing in age and gender. In that case, the match detection unit 37 can select and use, from among the plurality of speech recognition databases, one with which phoneme matches can be detected well.

Alternatively, when the age and gender of the user of the speech recognition apparatus can be identified, the speech recognition database storage unit 32 may store a plurality of speech recognition databases, generated from speech data obtained by recording the speech of several groups of speakers differing in age and gender, in association with information specifying age and gender. In that case, the match detection unit 37 can select and use one speech recognition database from among those stored in the speech recognition database storage unit 32 according to the information specifying the age and gender of the user of the speech recognition apparatus.

The conversion information setting unit 33 receives from the host CPU 61, together with a command, text data representing a plurality of words or sentences that are conversion candidates and a recognition accuracy parameter representing the strictness of the recognition accuracy applied when recognizing a word or sentence from the speech signal. According to the received command, the conversion information setting unit 33 sets the text data in the conversion list and sets the recognition accuracy parameter in the recognition accuracy adjustment unit 36. The conversion list storage unit 34 stores the conversion list.

The commands used include, for example, a set command for newly setting all the text data in the conversion list and the recognition accuracy parameter, an add command for adding some text data to the conversion list, and a delete command for deleting some text data from the conversion list. Part of the conversion list can therefore be changed as desired without replacing the whole list. A change command for changing only the recognition accuracy parameter may also be used. Predetermined text data may be set in the conversion list in advance.
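As a rough illustration of how the four commands described above could act on the conversion list, the following sketch models the list and the recognition accuracy parameter in one small class. The class name, the command strings "set", "add", "delete", and "change", and the romanized food names in the usage note are hypothetical. The text does not define a command encoding.

```python
class ConversionList:
    """Hypothetical model of the conversion list plus the recognition accuracy parameter."""

    def __init__(self, initial=()):
        # Predetermined text data may already be present in the list.
        self.entries = list(initial)
        self.accuracy_rank = None

    def apply(self, command, texts=(), accuracy_rank=None):
        if command == "set":
            # Replace all text data and the recognition accuracy parameter.
            self.entries = list(texts)
            self.accuracy_rank = accuracy_rank
        elif command == "add":
            # Add some text data without replacing the whole list.
            for text in texts:
                if text not in self.entries:
                    self.entries.append(text)
        elif command == "delete":
            # Remove some text data without replacing the whole list.
            self.entries = [t for t in self.entries if t not in texts]
        elif command == "change":
            # Change only the recognition accuracy parameter.
            self.accuracy_rank = accuracy_rank
        else:
            raise ValueError(f"unknown command: {command}")
```

For example, applying "set" with `["soba", "udon"]` and rank 3, then "add" with `["curry"]`, then "delete" with `["udon"]`, then "change" with rank 1 leaves the entries `["soba", "curry"]` and the rank 1.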

When new text data is set in the conversion list in the conversion list storage unit 34, the standard pattern extraction unit 35 extracts from the speech recognition database the standard patterns corresponding to at least a part of each word or sentence represented by the text data set in the conversion list.

The recognition accuracy adjustment unit 36 adjusts the range of the spread of the standard patterns extracted from the speech recognition database according to the recognition accuracy parameter set by the conversion information setting unit 33. The standard patterns extracted from the speech recognition database by the standard pattern extraction unit 35 have a spread, reflecting variation, in the multidimensional space, and the recognition accuracy adjustment unit 36 adjusts the range of that spread.

In the following example, the strictness or ambiguity of the recognition accuracy represented by the recognition accuracy parameter is classified into M ranks, from the most ambiguous rank 1 to the strictest rank M (M is a natural number of 2 or more). When the spread of a standard pattern A in an N-dimensional space (N is a natural number) is expressed as A1(i) to A2(i) (i = 1, 2, ..., N), the range A1a(i) to A2a(i) of the spread of standard pattern A after adjustment by the recognition accuracy adjustment unit 36 is expressed using the rank R (1 ≤ R ≤ M) by, for example, the following equations:
A1a(i) = A1(i) - k·(M - R)·(A2(i) - A1(i))
A2a(i) = A2(i) + k·(M - R)·(A2(i) - A1(i))
Here, k is a constant.
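The two equations above can be written as a short function. At R = M the factor (M - R) is zero and the range is left unchanged, while lower ranks widen the range on both sides in proportion to its original width. The function name and argument names are illustrative assumptions, and the value of k used in any real device is not given in the text.

```python
def adjust_range(a1, a2, rank, max_rank, k):
    """Return (A1a, A2a) for a standard pattern whose spread is a1[i]..a2[i].

    rank is R (1 = most ambiguous, max_rank = M = strictest); k is a constant.
    """
    if not 1 <= rank <= max_rank:
        raise ValueError("rank must satisfy 1 <= R <= M")
    widen = k * (max_rank - rank)
    # A1a(i) = A1(i) - k*(M - R)*(A2(i) - A1(i))
    lower = [lo - widen * (hi - lo) for lo, hi in zip(a1, a2)]
    # A2a(i) = A2(i) + k*(M - R)*(A2(i) - A1(i))
    upper = [hi + widen * (hi - lo) for lo, hi in zip(a1, a2)]
    return lower, upper
```

With `a1 = [0.0, 2.0]`, `a2 = [1.0, 4.0]`, M = 3, and an assumed k = 0.5, rank 3 returns the original range, and rank 1 returns `([-1.0, 0.0], [2.0, 6.0])`.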

For example, when speech recognition is applied to the control of an automobile, the recognition accuracy parameter "M", representing the strictest rank M, is set in order to prevent erroneous control. On the other hand, when one of two words contained in the conversion list is to be selected, errors in speech recognition are unlikely, so the recognition accuracy parameter "1", representing the most ambiguous rank 1, is set.

Alternatively, a different recognition accuracy parameter may be set depending on whether the number of options in the conversion list is larger or smaller than a predetermined number. A different recognition accuracy parameter may also be set depending on whether the number of similar words among the options in the conversion list is larger or smaller than a predetermined number.
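One way a host-side program might choose the rank from these criteria is sketched below. The text states only that the parameter may differ depending on these counts, so the direction of the rule (more options or more similar words leading to a stricter rank), the threshold values, and the function name are all assumptions made for illustration.

```python
def choose_rank(options, max_rank, similar_count=0, safety_critical=False,
                option_limit=5, similar_limit=2):
    """Pick a recognition accuracy rank R in 1..max_rank (hypothetical policy)."""
    if safety_critical:
        # For example, automobile control: use the strictest rank M.
        return max_rank
    if len(options) <= 2:
        # Choosing between two words: the most ambiguous rank 1 is enough.
        return 1
    rank = 1
    if len(options) > option_limit:
        rank += 1  # assumed: many options call for stricter matching
    if similar_count > similar_limit:
        rank += 1  # assumed: many similar words call for stricter matching
    return min(rank, max_rank)
```

The returned value would then be sent to the semiconductor integrated circuit device as the recognition accuracy parameter together with a command.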

The match detection unit 37 operates while the speech detection signal is activated, and compares the feature pattern generated by the signal processing unit 31 with the standard patterns whose range of spread has been adjusted by the recognition accuracy adjustment unit 36. The match detection unit 37 then determines whether the feature pattern generated from at least a part of the input speech signal falls within the range of the spread of a standard pattern as adjusted by the recognition accuracy adjustment unit 36.

The comparison is performed for each component in the N-dimensional space. If the following inequality is satisfied for i = 1, 2, ..., N, the feature pattern B is determined to fall within the range of the spread of standard pattern A:
A1a(i) ≤ B(i) ≤ A2a(i)
When the feature pattern generated from at least a part of the input speech signal falls within the range of the spread of a standard pattern, the match detection unit 37 detects a match between the two.
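The component-wise test above amounts to checking that every component of the feature pattern lies inside the adjusted interval for that dimension. A minimal sketch follows, with the function and argument names chosen only for illustration.

```python
def within_range(feature, lower, upper):
    """True if A1a(i) <= B(i) <= A2a(i) holds for every component i."""
    if not (len(feature) == len(lower) == len(upper)):
        raise ValueError("all vectors must have the same number of components")
    return all(lo <= b <= hi for b, lo, hi in zip(feature, lower, upper))
```

A single component outside its interval is enough to reject the match, so widening the range at a more ambiguous rank directly increases the set of feature patterns that are accepted.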

For example, the match detection unit 37 compares the feature pattern generated from the first syllable of the input speech signal with the standard pattern corresponding to the first syllable of each word or sentence represented by the text data set in the conversion list. If the conversion list contains only one conversion candidate that begins with the syllable for which a match was detected, that conversion candidate becomes the converted word or sentence. If the conversion list contains a plurality of conversion candidates that begin with the syllable for which a match was detected, the match detection unit 37 extends the range of syllables for which matches are to be detected until the conversion candidates are narrowed down to one.
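The narrowing procedure described above can be sketched as follows. For simplicity the sketch assumes that each spoken syllable has already been matched to a syllable label, so pattern comparison is reduced to label equality. The function name, the dictionary layout, and the romanized candidate words in the usage note are hypothetical.

```python
def narrow_candidates(candidates, heard_syllables):
    """Return the single remaining candidate, or None if zero or several remain.

    candidates maps each candidate text to its list of syllables;
    heard_syllables is the sequence of syllables matched so far.
    """
    remaining = dict(candidates)
    for pos, syllable in enumerate(heard_syllables):
        # Keep only candidates whose syllable at this position matches.
        remaining = {text: syls for text, syls in remaining.items()
                     if pos < len(syls) and syls[pos] == syllable}
        if len(remaining) <= 1:
            break  # narrowed to one candidate (or none), so stop extending
    if len(remaining) == 1:
        return next(iter(remaining))
    return None
```

With the candidates "katsudon" (ka, tsu, do, n), "kareeraisu" (ka, re, e, ra, i, su), and "soba" (so, ba), hearing "so" selects "soba" immediately, while hearing "ka" leaves two candidates and the second syllable "tsu" is needed to select "katsudon".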

Here, a "syllable" means a unit of sound that has one vowel as its nucleus and consists of that vowel alone or of that vowel preceded or followed by one or more consonants. Semivowels and special morae can also constitute syllables. In other words, one syllable is composed of one or more phonemes. Examples of Japanese syllables include "a", "i", "u", "e", "o", "ka", "ki", "ku", "ke", and "ko".

For example, the standard pattern corresponding to the syllable "a" is the standard pattern for the phoneme "a" that constitutes that syllable. The standard patterns corresponding to the syllable "ka" are the standard pattern for its first phoneme "k" and the standard pattern for its second phoneme "a".

When a syllable of the input speech signal consists of a single phoneme, a match for that phoneme means a match for the syllable. When a syllable consists of multiple phonemes, a match for all of those phonemes means a match for the syllable.
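As a rough sketch of this rule, a syllable can be treated as a sequence of phonemes, each with its own feature vector and its own lower and upper bounds. The names `phoneme_matches` and `syllable_matches` and the sample numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Bounds = Tuple[Sequence[float], Sequence[float]]  # (lower, upper) per phoneme


def phoneme_matches(feature: Sequence[float], bounds: Bounds) -> bool:
    """True if every component of the feature lies within its bounds."""
    lower, upper = bounds
    if not (len(feature) == len(lower) == len(upper)):
        return False
    return all(lo <= b <= hi for lo, b, hi in zip(lower, feature, upper))


def syllable_matches(features: List[Sequence[float]],
                     standard: List[Bounds]) -> bool:
    """A syllable matches only if it has the same number of phonemes as the
    standard syllable and every phoneme matches in order."""
    if len(features) != len(standard):
        return False
    return all(phoneme_matches(f, b) for f, b in zip(features, standard))


# Made-up two-dimensional bounds for the phonemes "k" and "a" of "ka".
ka_standard = [([0.1, 0.7], [0.3, 0.9]), ([0.5, 0.3], [0.7, 0.5])]
```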

When a match between the feature pattern and the standard pattern is detected, the match detection unit 37 outputs, as the speech recognition result, information that identifies the word or sentence containing the matched syllable, for example text data representing that word or sentence. This allows the host CPU 61 to recognize the word or sentence corresponding to at least a part of the speech signal input to the semiconductor integrated circuit device 30.

The host CPU 61 selects one response content from among a plurality of response contents (questions or messages) according to the speech recognition result output from the semiconductor integrated circuit device 30, and transmits response data representing the selected response content to the semiconductor integrated circuit device 30.

The speech signal synthesis unit 38 of the semiconductor integrated circuit device 30 receives from the host CPU 61 the response data representing the response content for the speech recognition result and, based on the received response data, synthesizes a speech signal representing the speech to be output. The speech synthesis database stored in the speech synthesis database storage unit 39 may be used for this synthesis, but the speech signal can also be synthesized using the speech recognition database stored in the speech recognition database storage unit 32.

In that case, for example, the speech signal synthesis unit 38 obtains a frequency spectrum for each phoneme in the response content from the standard pattern included in the speech recognition database. It then applies an inverse Fourier transform to each frequency spectrum to obtain a speech waveform, and concatenates the waveforms for the phonemes in the response content to synthesize the speech signal corresponding to the response content.

The D/A converter 40 converts the digital speech signal output from the speech signal synthesis unit 38 into an analog speech signal. The speech output unit 50 includes a power amplifier that amplifies the analog speech signal output from the D/A converter 40 and a speaker that emits sound according to the amplified signal. The speaker outputs, as speech, the response content represented by the response data supplied from the host CPU 61. This creates a situation in which the user's answer to the question or message issued based on the response data is predicted to be one of several words or sentences.

The host CPU 61 also transmits to the semiconductor integrated circuit device 30, together with a setting command, text data representing a plurality of words or sentences that are conversion candidates for the answer to the selected question or message, and a recognition accuracy parameter selected according to those words or sentences.

On receiving the text data and the recognition accuracy parameter together with the setting command from the host CPU 61, the conversion information setting unit 33 of the semiconductor integrated circuit device 30 sets the text data in the conversion list and sets the recognition accuracy parameter in the recognition accuracy adjustment unit 36, in accordance with the received setting command. In this way, text data representing the words or sentences that correspond to the question or message issued based on the response data is set in the conversion list, and the recognition accuracy parameter selected according to those words or sentences is set in the recognition accuracy adjustment unit 36.

Next, a speech recognition method according to an embodiment of the present invention will be described with reference to FIGS. 1 and 2. FIG. 2 is a flowchart showing the speech recognition method performed by the speech recognition apparatus shown in FIG. 1.

In step S1 of FIG. 2, when the semiconductor integrated circuit device 30 is powered on or after it is reset, the host CPU 61 transmits to the semiconductor integrated circuit device 30, together with a setting command, communication data representing one question or message, text data representing a plurality of words or sentences that are conversion candidates for the answer to that question or message, and a recognition accuracy parameter selected according to those words or sentences.

In step S2, the conversion information setting unit 33 of the semiconductor integrated circuit device 30 receives the text data and the recognition accuracy parameter together with the setting command from the host CPU 61. In accordance with the received setting command, it sets the text data in the conversion list and sets the recognition accuracy parameter in the recognition accuracy adjustment unit 36.

When new text data is set in the conversion list, in step S3 the standard pattern extraction unit 35 extracts, from the speech recognition database containing standard patterns that represent the distribution of frequency components of a plurality of phonemes used in a predetermined language, the standard patterns corresponding to at least a part of each word or sentence represented by the text data set in the conversion list. In step S4, the recognition accuracy adjustment unit 36 adjusts the spread range of the extracted standard patterns according to the recognition accuracy parameter.

In step S5, the speech signal synthesis unit 38 synthesizes a speech signal based on the received communication data, so that the question or message is issued from the speech output unit 50. When the user speaks in reply, in step S6 the signal processing unit 31 applies a Fourier transform to the input speech signal to extract its frequency components and generates a feature pattern representing the distribution of those frequency components. The signal processing unit 31 also activates the voice detection signal.

When the voice detection signal is activated, in step S7 the match detection unit 37 detects a match if the feature pattern generated from at least a part of the input speech signal falls within the spread range of a standard pattern, and outputs a speech recognition result identifying the word or sentence, among the plurality of conversion candidates, for which the match was detected.

If no speech recognition result indicating a match between the feature pattern and a standard pattern is obtained within a predetermined period after the voice detection signal is activated, the host CPU 61 may transmit a new, lower-rank recognition accuracy parameter to the semiconductor integrated circuit device 30 together with a change command and control the device to perform match detection again. In this way, when no match is obtained within the predetermined period, match detection can be repeated with the strictness of the recognition accuracy relaxed.
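The retry behavior on the host side might look like the following Python sketch. It assumes a hypothetical callback `try_match(rank)` that stands in for sending the parameter with a change command and waiting for the predetermined period, returning the recognized text or `None` on timeout. The rank values and their ordering (a higher rank meaning stricter matching) are also assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple


def recognize_with_relaxation(
    try_match: Callable[[int], Optional[str]],
    ranks: Iterable[int],
) -> Tuple[Optional[str], Optional[int]]:
    """Try match detection at each rank in turn, strictest first.

    Returns (result, rank_used), or (None, None) if every rank timed out.
    """
    for rank in ranks:
        result = try_match(rank)  # None models "no match within the period"
        if result is not None:
            return result, rank
    return None, None
```

If every rank fails, the caller can fall back to asking the user to repeat the utterance.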

Alternatively, the host CPU 61 may transmit to the semiconductor integrated circuit device 30 response data representing a message such as "Please say that again", or response data representing a question rephrased to be easier to understand. The speech signal synthesis unit 38 synthesizes a speech signal based on the response data supplied from the host CPU 61, and the new message or question is issued from the speech output unit 50.

If a speech recognition result indicating a match between the feature pattern and a standard pattern is obtained within the predetermined period after the voice detection signal is activated, in step S8 the host CPU 61 determines whether the series of speech recognition operations has been completed. If it has, the process ends. If not, the process proceeds to step S9.

In step S9, the host CPU 61 selects one response content from among a plurality of response contents according to the speech recognition result output from the semiconductor integrated circuit device 30. It then transmits to the semiconductor integrated circuit device 30, together with a setting command, response data representing the selected response content, text data representing a plurality of words or sentences that are conversion candidates for the answer to the selected response content, and a recognition accuracy parameter selected according to those words or sentences. The processing from step S2 onward is then repeated.

According to this embodiment of the present invention, using a conversion list that follows a speech recognition scenario narrows the standard patterns compared against the feature pattern of the input speech signal to those corresponding to at least a part of each word or sentence represented by the text data set in the conversion list. Here, a speech recognition scenario means performing speech recognition by creating a situation in which the user's answer to a given question or message is predicted to be one of several words or sentences.

In doing so, the strictness or looseness of the recognition accuracy can be set freely along the speech recognition scenario by transmitting a command and a recognition accuracy parameter from the host CPU 61 to the semiconductor integrated circuit device 30. As a result, the recognition accuracy can be tightened to prevent misrecognition, or relaxed to improve the recognition rate.

Next, a specific example of the speech recognition operation of the speech recognition apparatus according to an embodiment of the present invention will be described. Here, the case in which the speech recognition apparatus shown in FIG. 1 is applied to a meal ticket vending machine in a cafeteria is described.

The vending machine displays a food menu containing a plurality of food names. Assume that the menu shows items such as "soba", "udon", "curry", and "katsudon". In that case, the first word the user speaks is predicted to be one of the displayed items, such as "soba", "udon", "curry", or "katsudon".

Accordingly, when the vending machine is powered on or after it is reset, the host CPU 61 transmits text data representing the food names displayed on the food menu to the semiconductor integrated circuit device 30 together with a recognition accuracy parameter and a setting command. In doing so, the host CPU 61 may set the recognition accuracy parameter so that recognition is strict when the number of food names on the menu is larger than a predetermined number and loose when it is smaller than the predetermined number.
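One way the host could pick the parameter from the menu size is sketched below. The function name `select_accuracy_rank`, the two rank values, and the treatment of a menu whose size equals the threshold (handled here as loose) are assumptions; the text specifies only the "larger than" and "smaller than" cases.

```python
def select_accuracy_rank(num_candidates: int,
                         threshold: int,
                         strict_rank: int = 2,
                         loose_rank: int = 1) -> int:
    """Choose a recognition accuracy rank from the number of candidates.

    More candidates than the threshold -> strict matching, to avoid
    confusing similar entries. Otherwise -> loose matching, to raise the
    recognition rate. The equal case is an assumption (treated as loose).
    """
    if num_candidates > threshold:
        return strict_rank
    return loose_rank
```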

In accordance with the received setting command, the conversion information setting unit 33 of the semiconductor integrated circuit device 30 sets the received text data in the conversion list and sets the received recognition accuracy parameter in the recognition accuracy adjustment unit 36.

In this way, conversion list A shown in FIG. 3 is created. FIG. 3 shows, for each food name, a corresponding number, its Japanese notation, and the romanized notation of its phonemes, but the conversion list only needs to contain at least a romanized or kana notation from which the phonemes in each food name can be identified.

Once conversion list A is created, the standard pattern extraction unit 35 extracts from the speech recognition database the standard patterns for each of the phonemes "s·o", "u", "k·a", "k·a", and so on, contained in the first syllables "so", "u", "ka", "ka", and so on, of the food names "soba", "udon", "curry", "katsudon", and so on in conversion list A. The recognition accuracy adjustment unit 36 then adjusts the spread range of the extracted standard patterns according to the recognition accuracy parameter.

The host CPU 61 also transmits to the semiconductor integrated circuit device 30 communication data representing the question or message "Which food would you like? Please say the food name." The speech signal synthesis unit 38 of the semiconductor integrated circuit device 30 synthesizes a speech signal based on this communication data and outputs it to the D/A converter 40, which converts the digital speech signal into an analog speech signal and outputs it to the speech output unit 50. As a result, the speech output unit 50 issues the question or message "Which food would you like? Please say the food name."

When, in response to the question or message issued from the speech output unit 50, the user looks at the displayed food menu and says "Katsudon o kudasai." ("Katsudon, please."), the signal processing unit 31 generates a feature pattern representing the distribution of frequency components for each of the phonemes "k·a·t·u·d·o·N···".

The match detection unit 37 compares the feature pattern of the first phoneme "k" of the first syllable, generated by the signal processing unit 31, with the standard patterns of the first phonemes "s", "u", "k", "k", and so on, of the first syllables extracted from the speech recognition database, and thereby detects a match for the phoneme "k".

When the matched phoneme is a consonant, the match detection unit 37 goes on to compare the second phoneme of the first syllable. It compares the feature pattern of the second phoneme "a" of the first syllable, generated by the signal processing unit 31, with the standard patterns of the second phonemes "o", "a", "a", and so on, of the first syllables extracted from the speech recognition database, and thereby detects a match for the phoneme "a".

A match for the syllable "ka" is thus detected. If only one food name matched, the speech recognition result would be obtained at this point. However, because the conversion list contains both "curry" (kare) and "katsudon", it is not yet possible to tell which one applies. In such a case, the match detection unit 37 expands the range of syllables to be matched.

That is, the match detection unit 37 outputs to the standard pattern extraction unit 35 a signal requesting extraction of the standard patterns corresponding to the second syllables of those food names in the conversion list. In response, the standard pattern extraction unit 35 extracts from the speech recognition database the standard patterns representing the distribution of frequency components for each of the phonemes "r·e" and "t·u" contained in the second syllables "re" and "tsu" of the food names "curry" and "katsudon". The recognition accuracy adjustment unit 36 then adjusts the spread range of the extracted standard patterns according to the recognition accuracy parameter.

The match detection unit 37 compares the feature pattern of the first phoneme "t" of the second syllable, generated by the signal processing unit 31, with the standard patterns of the first phonemes "r" and "t" of the second syllables extracted from the speech recognition database, and thereby detects a match for the phoneme "t".

The match detection unit 37 further compares the feature pattern of the second phoneme "u" of the second syllable, generated by the signal processing unit 31, with the standard patterns of the second phonemes "e" and "u" of the second syllables extracted from the speech recognition database, and thereby detects a match for the phoneme "u".

A match for the syllable "tsu" is thus detected. If other food names also had the first syllable "ka" and the second syllable "tsu", the match detection unit 37 would simply expand the range of syllables to be matched further. The match detection unit 37 outputs to the host CPU 61 a speech recognition result identifying the food name "katsudon", which has the matched first syllable "ka" and second syllable "tsu".

Information identifying the food name "katsudon" may be, for example, the number shown in FIG. 3, the Japanese notation "カツ丼" or its part "カツ", or the romanized phoneme notation "katudoN" or its part "katu". This allows the host CPU 61 to recognize the food name "katsudon" corresponding to at least a part of the input speech signal.

When the first speech recognition operation ends in this way, the host CPU 61 starts the second speech recognition operation. According to the received speech recognition result, the host CPU 61 selects one appropriate response content from among the plurality of response contents represented by the response data stored in the storage unit 62. It then transmits to the semiconductor integrated circuit device 30, together with a recognition accuracy parameter and a setting command, response data representing the selected response content and text data representing a plurality of conversion candidates for the answer to the selected response content.

In accordance with the received setting command, the conversion information setting unit 33 of the semiconductor integrated circuit device 30 deletes all of the current text data from the conversion list, then sets the received text data in the conversion list and sets the received recognition accuracy parameter in the recognition accuracy adjustment unit 36.
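The replace-on-set behavior described here can be modeled with a small Python class. The class name `ConversionSettings`, its attributes, and the method `apply_setting_command` are hypothetical; they only illustrate that each setting command discards the previous candidates before storing the new ones along with the new accuracy parameter.

```python
from typing import Iterable, List, Optional


class ConversionSettings:
    """Toy model of the conversion list plus the recognition accuracy parameter."""

    def __init__(self) -> None:
        self.conversion_list: List[str] = []
        self.accuracy_rank: Optional[int] = None

    def apply_setting_command(self, text_data: Iterable[str],
                              accuracy_rank: int) -> None:
        # Delete all current entries first, then store the new candidates.
        self.conversion_list.clear()
        self.conversion_list.extend(text_data)
        self.accuracy_rank = accuracy_rank


settings = ConversionSettings()
settings.apply_setting_command(["soba", "udon", "curry", "katsudon"], 2)
settings.apply_setting_command(["hitotsu", "futatsu", "mittsu"], 1)
```

After the second command only the three quantity words remain, so answers to the second question are never compared against the food names from the first.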

For example, the host CPU 61 supplies response data representing the question "How many?" to the speech signal synthesis unit 38. In that case, the first word the user speaks in reply is predicted to be one of a plurality of answers such as "hitotsu" (one), "futatsu" (two), or "mittsu" (three). The host CPU 61 therefore transmits text data representing these answers to the semiconductor integrated circuit device 30 together with a recognition accuracy parameter and a setting command.

In this way, conversion list B shown in FIG. 4 is created. Once conversion list B is created, the standard pattern extraction unit 35 extracts from the speech recognition database the standard patterns representing the distribution of frequency components for each of the phonemes "h·i", "h·u", "m·i", and so on, contained in the first syllables "hi", "fu", "mi", and so on, of the words "hitotsu" (one), "futatsu" (two), "mittsu" (three), and so on, represented by the text data in conversion list B. The recognition accuracy adjustment unit 36 then adjusts the spread range of the extracted standard patterns according to the recognition accuracy parameter.

The speech signal synthesis unit 38 synthesizes a speech signal based on the response data supplied from the host CPU 61 and outputs it to the D/A converter 40, which converts the digital speech signal into an analog speech signal and outputs it to the speech output unit 50. As a result, the speech output unit 50 asks the user "How many?"

When the user answers the question issued from the speech output unit 50 by saying "Hitotsu desu." ("One, please."), the signal processing unit 31 generates a feature pattern representing the distribution of frequency components for each of the phonemes "h·i·t·o·t·u···".

The match detection unit 37 compares the feature pattern of the first phoneme "h" of the first syllable, generated by the signal processing unit 31, with the standard patterns of the first phonemes "h", "h", "m", and so on, of the first syllables extracted from the speech recognition database, and thereby detects a match for the phoneme "h".

When the matched phoneme is a consonant, the match detection unit 37 further compares the feature pattern of the second phoneme "i" of the first syllable, generated by the signal processing unit 31, with the standard patterns of the second phonemes "i", "u", "i", and so on, of the first syllables extracted from the speech recognition database, and thereby detects a match for the phoneme "i".

A match for the syllable "hi" is thus detected. The match detection unit 37 outputs to the host CPU 61 a speech recognition result identifying the word "hitotsu" (one), which begins with the matched syllable "hi". This allows the host CPU 61 to recognize the word "hitotsu" (one) corresponding to at least a part of the input speech signal.

The host CPU 61 then supplies response data representing the message "Please insert ○○○ yen." to the speech signal synthesis unit 38. The speech signal synthesis unit 38 synthesizes a speech signal based on the response data supplied from the host CPU 61 and outputs it to the D/A converter 40, which converts the digital speech signal into an analog speech signal and outputs it to the speech output unit 50. As a result, the speech output unit 50 issues the message "Please insert ○○○ yen." to the user.

The above embodiment describes a specific example in which the present invention is applied to a vending machine. However, the present invention is not limited to this embodiment. It is applicable to electronic devices in general, and many modifications are possible within the technical idea of the present invention by those having ordinary knowledge in the art.

Description of symbols: 10: speech input unit, 20: A/D converter, 30: semiconductor integrated circuit device, 31: signal processing unit, 32: speech recognition database storage unit, 33: conversion information setting unit, 34: conversion list storage unit, 35: standard pattern extraction unit, 36: recognition accuracy adjustment unit, 37: match detection unit, 38: speech signal synthesis unit, 39: speech synthesis database storage unit, 40: D/A converter, 50: speech output unit, 60: control unit, 61: host CPU, 62: storage unit

Claims

A speech recognition database storage unit for storing a speech recognition database including a standard pattern representing a distribution state of frequency components of a plurality of phonemes used in a predetermined language;
A conversion information setting unit that receives, together with a command, text data representing a word or sentence as a conversion candidate and a recognition accuracy parameter representing the strictness of the recognition accuracy applied when recognizing the word or sentence as a conversion candidate, and that sets the text data in the conversion list according to the command;
A conversion list storage unit for storing the conversion list;
A standard pattern extraction unit for extracting the standard pattern corresponding to at least a part of each word or sentence represented by the text data set in the conversion list from the speech recognition database;
A recognition accuracy adjustment unit that adjusts a range of spread of the standard pattern extracted from the speech recognition database according to the recognition accuracy parameter;
A signal processing unit that extracts a frequency component of the audio signal by performing a Fourier transform on the input audio signal, and generates a feature pattern representing a distribution state of the frequency component of the audio signal;
A match detection unit that detects a match between the feature pattern generated from at least a part of the audio signal and the standard pattern when the feature pattern is within the range of spread of the standard pattern, and that outputs a speech recognition result identifying, among the words or sentences serving as conversion candidates, the word or sentence for which the match was detected;
A semiconductor integrated circuit device comprising:
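
The interplay between the recognition accuracy adjustment unit and the match detection unit can be illustrated with a minimal sketch. The claim does not say how the recognition accuracy parameter maps onto the range of spread, so the multiplicative `scale` used here, the `StandardPattern` class, and the per-component comparison are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class StandardPattern:
    """Hypothetical model of one standard pattern (not the patent's data format)."""
    center: Tuple[float, ...]  # typical frequency-component values
    spread: Tuple[float, ...]  # allowed deviation per component


def adjust_spread(pattern: StandardPattern, scale: float) -> StandardPattern:
    """Widen (scale > 1) or narrow (scale < 1) the range of spread.

    `scale` stands in for the recognition accuracy parameter (assumed mapping).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return StandardPattern(pattern.center, tuple(s * scale for s in pattern.spread))


def matches(feature: Sequence[float], pattern: StandardPattern) -> bool:
    """A feature pattern matches when every component lies within the spread."""
    return all(
        abs(f - c) <= s
        for f, c, s in zip(feature, pattern.center, pattern.spread)
    )


base = StandardPattern(center=(1.0, 2.0), spread=(0.1, 0.1))
feature = (1.15, 2.05)
print(matches(feature, adjust_spread(base, 1.0)))  # strict range
print(matches(feature, adjust_spread(base, 2.0)))  # relaxed range
```

With these illustrative numbers the same feature pattern falls outside the strict range and inside the relaxed one, which is the effect the adjustable range of spread is meant to provide.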

The semiconductor integrated circuit device according to claim 1, further comprising: a speech signal synthesis unit that receives response data representing response content for the speech recognition result and synthesizes an output speech signal based on the response data.

The semiconductor integrated circuit device according to claim 1, wherein the signal processing unit activates a voice detection signal when a level of the audio signal exceeds a predetermined value.
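
As a rough illustration of the level check in this claim, the sketch below treats the level as the peak absolute sample value. The claim does not define how the level is measured, so the peak measure, the function name `voice_detected`, and the threshold value are assumptions.

```python
from typing import Sequence


def voice_detected(samples: Sequence[float], threshold: float) -> bool:
    """Return True when the signal level exceeds the predetermined value.

    The level is taken to be the peak absolute amplitude (an assumption).
    """
    peak = max((abs(s) for s in samples), default=0.0)
    return peak > threshold


print(voice_detected([0.0, 0.3, -0.6], threshold=0.5))
print(voice_detected([0.1, -0.2], threshold=0.5))
```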

A semiconductor integrated circuit device according to claim 1;
A control unit that transmits the text data representing the word or sentence as the conversion candidate and the recognition accuracy parameter to the semiconductor integrated circuit device together with the command;
A speech recognition apparatus comprising:

A semiconductor integrated circuit device according to claim 2;
A control unit that selects one response content from a plurality of response contents according to the speech recognition result output from the semiconductor integrated circuit device, and that transmits to the semiconductor integrated circuit device, together with the command, the response data representing the selected response content, the text data representing a word or sentence as a conversion candidate for a reply to the selected response content, and the recognition accuracy parameter selected according to that word or sentence;
A speech recognition apparatus comprising:

A semiconductor integrated circuit device according to claim 3;
A control unit that, when the speech recognition result indicating a match between the feature pattern and the standard pattern is not obtained within a predetermined period after the voice detection signal is activated, transmits a new recognition accuracy parameter to the semiconductor integrated circuit device together with a new command and controls the semiconductor integrated circuit device to perform match detection;
A speech recognition apparatus comprising:
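
The retry behaviour of this control unit can be sketched as a simple loop. Here `try_recognize` is a hypothetical stand-in for sending a command to the semiconductor integrated circuit device and waiting for the predetermined period; it returns `None` when no match arrives in time. The claim only requires a new recognition accuracy parameter, so the idea that each new value relaxes the range is an assumption of this sketch.

```python
from typing import Callable, Iterable, Optional, Sequence


def recognize_with_retry(
    try_recognize: Callable[[Sequence[str], float], Optional[str]],
    candidates: Sequence[str],
    accuracy_levels: Iterable[float],
) -> Optional[str]:
    """Try each recognition accuracy parameter in turn until a match is reported."""
    for level in accuracy_levels:
        result = try_recognize(candidates, level)  # None models a timeout
        if result is not None:
            return result
    return None


def fake_device(candidates: Sequence[str], level: float) -> Optional[str]:
    # Hypothetical device: the utterance only matches once the range is relaxed to 2.0.
    return candidates[0] if level >= 2.0 else None


print(recognize_with_retry(fake_device, ["yes", "no"], [1.0, 1.5, 2.0]))  # prints "yes"
```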

(a) Receiving, together with a command, text data representing a word or sentence as a conversion candidate and a recognition accuracy parameter representing the strictness of the recognition accuracy applied when recognizing the word or sentence as a conversion candidate, and setting the text data in a conversion list according to the command;
(b) Extracting, from a speech recognition database including a standard pattern representing a distribution state of frequency components of a plurality of phonemes used in a predetermined language, the standard pattern corresponding to at least a part of each word or sentence represented by the text data set in the conversion list;
(c) Adjusting a range of spread of the standard pattern extracted from the speech recognition database according to the recognition accuracy parameter;
(d) Extracting a frequency component of an input audio signal by performing a Fourier transform on the audio signal, and generating a feature pattern representing a distribution state of the frequency component of the audio signal;
(e) Detecting a match between the feature pattern generated from at least a part of the audio signal and the standard pattern when the feature pattern is within the range of spread of the standard pattern, and outputting a speech recognition result identifying, among the words or sentences serving as conversion candidates, the word or sentence for which the match was detected;
A speech recognition method comprising:
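
The flow of the method, from Fourier transform to match detection, can be illustrated with a toy sketch. It is deliberately simplified: each conversion candidate is represented by a single whole-signal spectrum rather than by per-phoneme standard patterns, a naive discrete Fourier transform replaces whatever transform the device uses, and a single `tolerance` value stands in for the adjusted range of spread. All function names are hypothetical.

```python
import cmath
import math
from typing import Dict, List, Optional, Sequence


def magnitude_spectrum(samples: Sequence[float]) -> List[float]:
    """Naive discrete Fourier transform; returns magnitudes of bins 0..N/2."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def feature_pattern(samples: Sequence[float]) -> List[float]:
    """Normalize the spectrum so the pattern reflects distribution, not loudness."""
    spectrum = magnitude_spectrum(samples)
    total = sum(spectrum)
    return [m / total for m in spectrum] if total else spectrum


def recognize(
    samples: Sequence[float],
    conversion_list: Dict[str, List[float]],
    tolerance: float,
) -> Optional[str]:
    """Return the first candidate whose pattern lies within `tolerance` of the feature."""
    feature = feature_pattern(samples)
    for text, standard in conversion_list.items():
        if all(abs(f - s) <= tolerance for f, s in zip(feature, standard)):
            return text
    return None


def tone(bin_index: int, n: int = 16) -> List[float]:
    """Test signal: a sine wave centered on one frequency bin."""
    return [math.sin(2 * math.pi * bin_index * t / n) for t in range(n)]


# Two toy "words", each characterized by a single tone.
conversion_list = {"low": feature_pattern(tone(2)), "high": feature_pattern(tone(5))}
quiet_high = [0.5 * x for x in tone(5)]  # same tone at half the amplitude
print(recognize(quiet_high, conversion_list, tolerance=0.05))  # prints "high"
```

Restricting the comparison to the entries in `conversion_list` mirrors the point of steps (a) and (b): only the standard patterns for the current conversion candidates are examined.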