JP2014206642A

JP2014206642A - Voice recognition device and voice recognition program

Info

Publication number: JP2014206642A
Application number: JP2013084104A
Authority: JP
Inventors: Yasuto Arakane (荒金 康人)
Original assignee: RayTron Inc
Current assignee: RayTron Inc
Priority date: 2013-04-12
Filing date: 2013-04-12
Publication date: 2014-10-30
Anticipated expiration: 2033-04-12
Also published as: JP6276513B2

Abstract

PROBLEM TO BE SOLVED: To reduce erroneous recognition among keywords.

SOLUTION: A voice recognition device (1) includes: a similar word database (DB) (203) that pre-stores similar word information on keywords that are acoustically similar to each other; a judgment part (106) that refers to the similar word information to judge whether there is a second word similar to a first word estimated by a first recognition processing part (100); an extraction part (108) that extracts a partial voice signal in a predetermined specific section from a voice signal if it is judged that the second word exists; and a second recognition processing part (110) that includes a determination part (114) for performing determination processing in which whichever of the first and second words has the higher likelihood is taken as the recognition result, on the basis of a feature amount of the partial voice signal and second model parameters (202) corresponding to the specific section for each of the first and second words.

Description

The present invention relates to a speech recognition apparatus and a speech recognition program, and more particularly to a speech recognition apparatus and a speech recognition program that perform speech recognition by an isolated word recognition method.

Techniques for recognizing words that are acoustically similar to each other have existed for some time.

Japanese Patent Laid-Open No. 10-274994 (Patent Document 1) discloses processing for recognizing similar words such as “nanji” and “nando” in a speech recognition technique that uses DRNN (Dynamic Recurrent Neural Networks) word models. Specifically, it describes, for example, setting a section t1 that contains, out of the DRNN output corresponding to “nanji”, the DRNN output corresponding to the characteristic part of the similar words (the phoneme parts “ji” and “do”), and examining which vowel is present in the section t1.

Patent Document 1: Japanese Patent Laid-Open No. 10-274994 (JP-A-10-274994)

In the above document, when a special acoustic model called a DRNN word model is used, a similar word that resembles a recognition target word (keyword) but is not itself a recognition target may receive a likelihood above a certain level. To deal with this problem, the vowel in the characteristic part of the word is examined.

In speech recognition using a general acoustic model such as an HMM (Hidden Markov Model), on the other hand, the recognition rate of a given keyword (registered word) tends to be lower when a word acoustically similar to that keyword is also registered than when no such word is registered. Therefore, in speech recognition using a general acoustic model, reducing misrecognition between keywords leads to an improvement in the overall recognition rate.

The present invention has been made to solve the problem described above, and its object is to provide a speech recognition apparatus and a speech recognition program capable of reducing misrecognition between keywords.

A speech recognition apparatus according to one aspect of the present invention includes: first recognition processing means for estimating a first word from a plurality of keywords by executing recognition processing based on a feature amount of a speech signal and a first model parameter for each of the plurality of keywords; a storage unit for storing in advance similar word information on keywords that are acoustically similar to each other; judgment means for judging, by referring to the similar word information, whether there is a second word, that is, a keyword similar to the first word estimated by the first recognition processing means; extraction means for extracting, when the judgment means judges that the second word exists, a partial speech signal in a predetermined specific section from the speech signal; and second recognition processing means for executing recognition processing using a feature amount of the partial speech signal extracted by the extraction means. The second recognition processing means includes determination means for executing determination processing that takes, as the recognition result, whichever of the first word and the second word has the higher likelihood, based on the feature amount of the partial speech signal and second model parameters corresponding to the specific section for each of the first word and the second word.

Preferably, the similar word information includes, for each specific keyword that may be misrecognized as an acoustically similar keyword, identification information on the similar keyword, and the storage unit further stores, for each specific keyword, section information defining the specific section. The extraction means divides the speech signal into a plurality of sections by a predetermined algorithm and extracts the partial speech signal based on the divided sections and the section information.

Preferably, the first recognition processing means includes first analysis means that cuts the speech signal into frames of a first time length and analyzes each frame to calculate the feature amount of the speech signal, and the second recognition processing means further includes second analysis means that cuts the partial speech signal into frames of a second time length shorter than the first time length and analyzes each frame to calculate the feature amount of the partial speech signal.

Preferably, the apparatus further includes output means for outputting the keyword determined as the recognition result by the determination means, and the output means outputs the first word estimated by the first recognition processing means as the recognition result when the judgment means judges that the second word does not exist.

Preferably, when the judgment means judges that there are a plurality of keywords similar to the first word, the determination means executes the determination processing using, as the second word, the similar keyword that had the higher likelihood in the recognition processing by the first recognition processing means.

Preferably, when the judgment means judges that there are a plurality of keywords similar to the first word, the determination means uses the second model parameters for each of the first word and the plurality of similar keywords and determines, as the recognition result, the keyword with the highest likelihood among the first word and the plurality of similar keywords.

Preferably, when the judgment means judges that there are a plurality of keywords similar to the first word, the determination means first uses the second model parameters for each of the plurality of similar keywords to determine which of the similar keywords has the higher likelihood, and then executes the determination processing using that keyword as the second word.

Preferably, the specific section is determined at training time, by computational simulation, as the section that gives the highest recognition rate between keywords that are acoustically similar to each other.

A speech recognition program according to another aspect of the present invention causes a computer to execute the steps of: estimating a first word from a plurality of keywords by executing recognition processing based on a feature amount of a speech signal and a first model parameter for each of the plurality of keywords; judging, by referring to similar word information stored in advance on keywords that are acoustically similar to each other, whether there is a second word, that is, a keyword similar to the estimated first word; extracting, when it is judged that the second word exists, a partial speech signal in a predetermined specific section from the speech signal; and executing determination processing that takes, as the recognition result, whichever of the first word and the second word has the higher likelihood, based on a feature amount of the extracted partial speech signal and second model parameters corresponding to the specific section for each of the first word and the second word.

According to the present invention, misrecognition between keywords can be reduced.

A block diagram showing an example of the hardware configuration of the speech recognition apparatus according to the embodiment of the present invention.
A functional block diagram showing the functional configuration of the speech recognition apparatus according to the embodiment of the present invention.
A diagram showing an example of the data structure of the similar word database in the embodiment of the present invention.
A flowchart showing the speech recognition processing in the embodiment of the present invention.
A diagram showing a specific example of a difference section.
A diagram showing a specific example of a difference section.

An embodiment of the present invention will be described in detail with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals, and their description will not be repeated.

The speech recognition apparatus according to the present embodiment employs an isolated word recognition method: it analyzes a speech signal, estimates from a plurality of keywords the word that the speech signal represents, and outputs that word. In the present embodiment, a “keyword” means a registered word, that is, a word to be recognized.

The speech recognition apparatus according to the present embodiment uses HMMs as the acoustic model and calculates likelihoods at recognition time by, for example, the Viterbi algorithm. With the Viterbi algorithm, the HMM states (state numbers) are associated with sections of the speech while the likelihood is being calculated. Through this association (hereinafter “Viterbi alignment”), the speech signal is divided into at least phoneme sections and coarticulation sections, and a state number is assigned to each section. An acoustic model other than an HMM, for example DTW (dynamic time warping), may also be used as long as it allows the speech signal to be aligned in a comparable way.
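The Viterbi alignment described above can be illustrated with a minimal sketch. It assumes a strictly left-to-right HMM in which each frame either stays in the current state or advances by one, takes precomputed per-frame log-likelihoods as input, and ignores transition probabilities. The function name `viterbi_align` and these simplifications are assumptions made for illustration and are not taken from the patent.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def viterbi_align(log_emis: List[List[float]]) -> Tuple[float, List[int]]:
    """Align T frames to S left-to-right states (stay, or advance by one).

    log_emis[t][s] is the log-likelihood of frame t under state s.
    Transition log-probabilities are treated as 0 (assumption of this sketch).
    Returns (log-score of the best path, state number assigned to each frame).
    If there are fewer frames than states the score is -inf and the returned
    path is not meaningful.
    """
    n_frames, n_states = len(log_emis), len(log_emis[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emis[0][0]  # the path must start in the first state

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else NEG_INF
            if advance > stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best > NEG_INF:
                score[t][s] = best + log_emis[t][s]

    # The path must end in the last state; follow the back-pointers.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

The returned per-frame state numbers are what allow a section of the signal to be addressed by a start state and an end state later in the processing.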

The configuration and operation of the speech recognition apparatus according to the present embodiment are described in detail below.

<Configuration>
(Hardware configuration)
The speech recognition apparatus according to the present embodiment can be realized by a general-purpose computer such as a PC (Personal Computer).

FIG. 1 is a block diagram showing an example of the hardware configuration of a speech recognition apparatus 1 according to the embodiment of the present invention. Referring to FIG. 1, the speech recognition apparatus 1 includes a CPU (Central Processing Unit) 11 for performing various kinds of arithmetic processing, a ROM (Read Only Memory) 12 for storing various data and programs, a RAM (Random Access Memory) 13 for storing work data and the like, a hard disk 14 serving as a nonvolatile storage device, an operation unit 15 including a keyboard and the like, a display unit 16 for displaying various kinds of information, a drive device 17 capable of reading data and programs from and writing them to a recording medium 17a, and a communication I/F (interface) 18 for Internet communication. The recording medium 17a may be, for example, a CD-ROM (Compact Disc-ROM) or a memory card.

The speech recognition apparatus 1 may further include an input unit 19 for receiving a speech signal from a microphone 20. When the speech recognition apparatus 1 does not have the input unit 19, recognition processing is executed on, for example, a speech signal obtained through the communication I/F 18 or a speech signal read from the recording medium 17a.

(Functional configuration)
FIG. 2 is a functional block diagram showing the functional configuration of the speech recognition apparatus 1 according to the embodiment of the present invention. Referring to FIG. 2, the speech recognition apparatus 1 includes, like a general speech recognition apparatus, a primary recognition processing unit 100, primary HMM data 201, and an output unit 116. The primary HMM data 201 contains one HMM for each of the keywords and is used for primary recognition by the primary recognition processing unit 100. Each primary HMM is a set of model parameters generated from the speech of the entire keyword. In the present embodiment it is called a “primary HMM” to distinguish it from the HMMs used in the secondary recognition processing unit 110 (secondary HMM data 202). Each primary HMM is associated with an identification number.

The primary recognition processing unit 100 executes recognition processing based on the feature amount of the speech signal and the primary HMM data 201. The primary recognition processing unit 100 includes an analysis unit 102 and an estimation unit 104 as its functions. The analysis unit 102 cuts the speech signal into frames of a first time length and analyzes the speech signal frame by frame to calculate the feature amount. For example, each extracted frame of the speech signal is converted into MFCC (Mel-frequency cepstral coefficient) features. The estimation unit 104 calculates, for each primary HMM, the likelihood that it generates the calculated sequence of feature amounts, and estimates the keyword indicated by the primary HMM with the highest likelihood as the recognition result. The keyword estimated here is hereinafter also referred to as the “estimated word”.

The output unit 116 outputs the recognition result. The output unit 116 is realized by, for example, the display unit 16.

In a general speech recognition apparatus, the recognition result of the primary recognition processing unit 100 (the primary recognition result), that is, the estimated word, is output as it is. However, when a keyword acoustically similar to the estimated word is registered, the possibility of misrecognition increases as the ambient noise increases.

The speech recognition apparatus 1 according to the present embodiment therefore further includes, as its functions, a judgment unit 106, an extraction unit 108, and a secondary recognition processing unit 110. In addition, secondary HMM data 202 and a similar word database (DB) 203 are stored, for example on the hard disk 14, together with the primary HMM data 201.

The secondary HMM data 202 contains one HMM for each of the keywords that may be misrecognized; each is a set of model parameters generated from the speech in a specific section of the keyword's speech. Each secondary HMM is associated with an identification number. The “specific section” is a section in which the similarity between acoustically similar keywords is likely to be low, that is, a section in which the feature amounts tend to differ. In that sense, the specific section can also be called a “difference section”. At training time, the HMM states (state numbers) that make up the section from which the speech signal is extracted are varied in many ways, and the section that gives the highest recognition rate between similar keywords in computational simulation is determined as the difference section. The secondary HMMs are generated from the speech signal in that difference section. In the present embodiment, the speech signal within the difference section is referred to as the “partial speech signal”.

The similar word database 203 stores information on keywords that are acoustically similar to each other (hereinafter “similar word information”). The similar word information can also be described as information on keywords that may be misrecognized. The similar word information includes at least, for each keyword that may be misrecognized, identification information on the similar keywords, for example the identification numbers of their primary HMMs. The similar word database 203 also stores, for each keyword that may be misrecognized, section information on the difference section to be used in secondary recognition and identification information on the secondary HMMs. An example of the data structure of the similar word database 203 is described later.

The judgment unit 106 refers to the similar word information in the similar word database 203 to judge whether there is a keyword similar to the estimated word (hereinafter a “similar word”). In other words, it judges whether the estimated word obtained as the primary recognition result may be a misrecognition.

When the judgment unit 106 judges that a similar word exists, the extraction unit 108 refers to the section information in the similar word database 203 and extracts the partial speech signal in the difference section from the entire speech signal. Specifically, it divides the speech signal into a plurality of sections by the Viterbi algorithm and extracts the partial speech signal based on the divided sections and the section information.

The secondary recognition processing unit 110 executes recognition processing using the feature amount of the partial speech signal extracted by the extraction unit 108. The secondary recognition processing unit 110 includes an analysis unit 112 and a determination unit 114 as its functions. The analysis unit 112 cuts the partial speech signal into frames of a second time length shorter than the first time length and analyzes the partial speech signal frame by frame to calculate the feature amount. Based on the calculated feature amount of the partial speech signal and the secondary HMMs for the estimated word and the similar word, the determination unit 114 determines whichever of the estimated word and the similar word has the higher likelihood as the recognition result. Specifically, it calculates, for each secondary HMM, the likelihood that it generates the calculated sequence of feature amounts, and determines the keyword from which the secondary HMM with the higher likelihood was built as the recognition result (the secondary recognition result).

The first time length used for primary recognition is predetermined, as in general speech recognition methods, within a range of, for example, 20 to 30 ms. The second time length used for secondary recognition may be, for example, about half the first time length and is predetermined within a range of 10 to 15 ms. In general, the first time length is chosen so that a frame can contain several periods of a vowel. Keywords that are similar to each other, by contrast, often differ in the feature amounts of their consonant parts and coarticulation parts rather than their vowels, and those parts are shorter in duration than vowels. Making the second time length shorter than the first can therefore improve the recognition rate between similar keywords. In both primary and secondary recognition, the frames are cut out at shifted positions so that adjacent frames overlap. It is desirable that this frame shift also be shorter in secondary recognition than in primary recognition.
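The difference between the two framings can be sketched as follows. The 25 ms frame and 10 ms shift for primary recognition are the example values given for step S4; the 12.5 ms frame and 5 ms shift for secondary recognition, the 16 kHz sampling rate, and the function name `split_frames` are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[float], sample_rate: int,
                 frame_ms: float, shift_ms: float) -> List[Sequence[float]]:
    """Cut a signal into overlapping frames of frame_ms, advancing by shift_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and shift must be at least one sample")
    # Only full-length frames are kept; a trailing partial frame is dropped.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]


signal = [0.0] * 1600  # 100 ms of silence at an assumed 16 kHz sampling rate

primary_frames = split_frames(signal, 16000, frame_ms=25, shift_ms=10)
secondary_frames = split_frames(signal, 16000, frame_ms=12.5, shift_ms=5)
```

With these assumed values the same 100 ms of signal yields 8 primary frames of 400 samples and 18 secondary frames of 200 samples, so the shorter consonant and coarticulation parts are covered by more, finer-grained frames.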

The output unit 116 outputs either the primary recognition result or the secondary recognition result. When no keyword similar to the estimated word exists, the primary recognition result, that is, the estimated word, is output. When a keyword similar to the estimated word exists, the secondary recognition result, that is, either the estimated word or the similar word, is output.
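The choice between the primary and the secondary result can be summarized in a short sketch. The function `recognize`, its parameters, and the callback `secondary_loglik` are hypothetical names. The scores are assumed to be log-likelihoods already computed elsewhere, and when several similar words are registered the sketch follows the variant in which the similar word with the higher primary likelihood is taken as the rival.

```python
from typing import Callable, Dict, List


def recognize(primary_loglik: Dict[str, float],
              similar_words: Dict[str, List[str]],
              secondary_loglik: Callable[[str, str], Dict[str, float]]) -> str:
    """Two-pass decision: primary estimate, then an optional secondary check.

    primary_loglik   : keyword -> log-likelihood under its primary HMM
    similar_words    : keyword -> acoustically similar registered keywords
    secondary_loglik : (estimated, rival) -> log-likelihoods of the partial
                       speech signal under the two secondary HMMs
    """
    estimated = max(primary_loglik, key=primary_loglik.get)
    candidates = similar_words.get(estimated, [])
    if not candidates:
        return estimated  # no similar word: output the primary result as is

    # Several similar words: keep the one that scored best in the first pass.
    rival = max(candidates, key=lambda w: primary_loglik[w])
    scores = secondary_loglik(estimated, rival)
    # On a tie, max() keeps the estimated word because it is listed first.
    return max((estimated, rival), key=lambda w: scores[w])
```

Passing the secondary scoring as a callback keeps the sketch independent of how the partial speech signal is extracted and scored.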

The functional blocks shown in FIG. 2 other than the output unit 116 may be realized by the CPU 11 shown in FIG. 1 executing software stored in, for example, the ROM 12, or at least one of them may be realized by hardware.

An example of the data structure of the similar word database 203 is described next.

(Data structure example)
FIG. 3 is a diagram showing an example of the data structure of the similar word database 203 in the embodiment of the present invention. Referring to FIG. 3, the similar word database 203 is made up of a plurality of rows and a plurality of columns 31 to 45. In the present embodiment, there is one row for each keyword, covering all keywords.

Columns 31 and 32 hold, as information on each keyword itself, the keyword (its kana reading) and its primary HMM number. Columns 33 to 36 and 41 hold, as similar word information, the number of similar words (column 33), the kana readings of similar words A and B (columns 34 and 35), and the primary HMM numbers of similar words A and B (columns 36 and 41). For example, when the keyword is “kyu”, the database records that there are two similar words, “ju” (similar word A) and “chu” (similar word B). Columns 31, 34, and 35, which hold the kana readings, are provided only to make the table easier to understand and may be omitted.

In the similar word database 203, a start state number (columns 37 and 42) and an end state number (columns 38 and 43) are also stored as section information for each similar word to be compared, and two secondary HMM numbers (columns 39 and 40, and columns 44 and 45) are stored as identification information on the secondary HMMs. Data 360 in columns 37 to 40 is used when deciding between the estimated word and similar word A. Data 410 in columns 42 to 45 is used when deciding between the estimated word and similar word B.

In the present embodiment, the similar word database 203 contains the section information and the secondary HMM identification information in addition to the similar word information, but these may be stored in a separate database. Likewise, the similar word database 203 contains similar word information for all keywords and records the presence or absence of similar words in column 33 (the number of similar words), but it may instead store similar word information only for keywords that may be misrecognized.
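One row of such a database could be modeled as in the sketch below. The class and field names are hypothetical, and every HMM number and state number is invented for illustration; only the “kyu” / “ju” / “chu” relationship comes from the text. The sketch also assumes that, of the two secondary HMM numbers stored per similar word, one belongs to the estimated word and the other to the similar word.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimilarEntry:
    reading: str                 # reading of the similar word (columns 34, 35)
    primary_hmm: int             # its primary HMM number (columns 36, 41)
    start_state: int             # start of the difference section (columns 37, 42)
    end_state: int               # end of the difference section (columns 38, 43)
    secondary_hmm_keyword: int   # secondary HMM of the row's keyword (assumed)
    secondary_hmm_similar: int   # secondary HMM of the similar word (assumed)


@dataclass
class KeywordRow:
    reading: str                 # column 31
    primary_hmm: int             # column 32
    similar: List[SimilarEntry] = field(default_factory=list)


# Rows keyed by primary HMM number; all numbers below are made up.
SIMILAR_WORD_DB: Dict[int, KeywordRow] = {
    3: KeywordRow("kyu", 3, [
        SimilarEntry("ju", 4, start_state=1, end_state=2,
                     secondary_hmm_keyword=101, secondary_hmm_similar=102),
        SimilarEntry("chu", 5, start_state=1, end_state=3,
                     secondary_hmm_keyword=103, secondary_hmm_similar=104),
    ]),
    7: KeywordRow("san", 7),     # hypothetical keyword with no similar word
}


def has_similar_word(db: Dict[int, KeywordRow], primary_hmm: int) -> bool:
    """Counterpart of checking column 33 (number of similar words)."""
    return len(db[primary_hmm].similar) > 0
```

Here the number of similar words is derived from the list length instead of being stored in a separate column.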

<Operation>
(Speech recognition processing)
FIG. 4 is a flowchart showing the speech recognition processing in the embodiment of the present invention. The processing procedure shown in the flowchart of FIG. 4 is stored in advance in the ROM 12 as a program, and the speech recognition function is realized by the CPU 11 reading and executing the program.

Referring to FIG. 4, when a speech signal is input from the input unit 19 (step S2; “step” is hereinafter abbreviated as “S”), the input speech signal is stored in time series in, for example, the RAM 13. The speech signal input in S2 is assumed to contain a human voice. The analysis unit 102 of the primary recognition processing unit 100 cuts frames out of the stored speech signal (S4). That is, the speech signal is cut out in frames of, for example, 25 ms. The frames are cut out with a shift of, for example, 10 ms so that adjacent frames overlap.

Once the frames have been cut out, the analysis unit 102 calculates the feature amount of the speech signal for each frame (S6).

Next, the estimation unit 104 estimates the word (keyword) represented by the speech signal from the feature amounts calculated in S6, based on the primary HMM data 201 (S8). Specifically, it first obtains, for each primary HMM, the likelihood that it generates the calculated sequence of feature amounts. It then compares the likelihood values of the primary HMMs and takes the keyword corresponding to the primary HMM with the maximum likelihood as the primary recognition result.

When the primary recognition processing is finished, the determination unit 106 refers to the similar word database 203 and determines whether a similar word exists for the estimated word obtained as the recognition result (S10). Specifically, the determination unit 106 refers to the row of the similar word database 203 whose primary HMM number (column 32) is that of the keyword with the maximum likelihood in S8, and determines whether “1” or “2” is recorded in column 33, “number of similar words”. If it is determined that a similar word exists (YES in S10), the processing proceeds to S12. If it is determined that no similar word exists (NO in S10), the output unit 116 outputs the keyword estimated in S8 as the formal recognition result (S20).

In S12, the extraction unit 108 reads section information from the similar word database 203 and extracts, from the speech signal stored in the RAM 13, the speech signal of the difference section, that is, the partial speech signal. When there is one similar word, that keyword is “similar word A” in the similar word database 203, so the extraction unit 108 reads start state number A (column 37) and end state number A (column 38). When there are two similar words, it reads the start state number and end state number of the similar word that had the higher likelihood in the primary recognition processing.

The extraction unit 108 performs Viterbi alignment of the entire speech signal with the primary HMM of the estimated word. It then extracts the partial speech signal of the difference section delimited by the start and end state numbers that were read. In the present embodiment, the speech signal is Viterbi-aligned with the primary HMM of the estimated word; however, if the secondary HMMs were generated during learning by Viterbi alignment with the primary HMM of the similar word, the alignment may instead be performed with the primary HMM of the similar word.
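
The alignment and extraction step can be illustrated with a small sketch. It assumes a left-to-right HMM in which each frame either stays in the current state or advances by one, a path that must start in the first state and end in the last, at least as many frames as states, and emission log-probabilities that are already computed. These constraints, the function names, and the example alignment are assumptions for illustration; the source only states that Viterbi alignment is used and that the section is delimited by state numbers.

```python
import math


def viterbi_align(log_emit, log_stay, log_next):
    """Align frames to a left-to-right HMM; return a 1-based state per frame.

    log_emit[t][j] is the log-probability that state j emits frame t.
    log_stay / log_next are the (shared) self-loop and advance
    log-probabilities, a simplification made for this sketch.
    """
    n_frames, n_states = len(log_emit), len(log_emit[0])
    score = [[-math.inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit[0][0]          # the path must start in state 1
    for t in range(1, n_frames):
        for j in range(n_states):
            stay = score[t - 1][j] + log_stay
            move = score[t - 1][j - 1] + log_next if j > 0 else -math.inf
            best, back[t][j] = (move, j - 1) if move > stay else (stay, j)
            score[t][j] = best + log_emit[t][j]
    path = [n_states - 1]                 # the path must end in the last state
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [state + 1 for state in path]


def difference_section(states, start_state, end_state):
    """Return (first, last) frame indices whose state lies in the range."""
    hits = [t for t, s in enumerate(states) if start_state <= s <= end_state]
    return (hits[0], hits[-1]) if hits else None


# Made-up alignment of 14 frames to an 11-state HMM, difference states 2 to 6.
example_states = [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
section = difference_section(example_states, 2, 6)
```

The frame indices returned by `difference_section` would then be mapped back to sample positions to cut the partial speech signal out of the stored signal; how that mapping is done is not specified in the text.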

Once the partial speech signal has been extracted, the analysis unit 112 cuts frames of, for example, 10 ms out of the partial speech signal (S14). Here too, the frames are cut out with a shift of, for example, 5 ms so that adjacent frames overlap.

Once the frames have been cut out, the analysis unit 112 calculates a feature amount of the speech signal for each frame (S16). The determination unit 114 determines the recognition result from these feature amounts, based on the secondary HMM data 202 for the estimated word and for the similar word (S18). That is, it determines whether the keyword represented by the input speech signal is the estimated word or the similar word. Specifically, the determination unit 114 obtains the likelihood that each of the two secondary HMMs generates the sequence of feature amounts calculated in S16. It then determines the keyword from which the secondary HMM with the larger likelihood originates as the formal recognition result. The formal recognition result is output by the output unit 116 (S20). The speech recognition processing then ends.
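
The overall decision flow from S8 to S20 can be summarized in a short sketch. It assumes that all likelihoods have already been computed and are passed in as log-likelihood tables; the function name `recognize`, the table layouts, and the numeric scores are hypothetical and serve only to show the order of the decisions.

```python
def recognize(primary_scores, similar_words, secondary_scores):
    """Two-pass decision over precomputed log-likelihoods (cf. S8 to S20).

    primary_scores:   {keyword: log-likelihood under its primary HMM}
    similar_words:    {keyword: [acoustically similar keywords]}
    secondary_scores: {(estimated, rival): {keyword: log-likelihood of its
                       secondary HMM on that pair's difference section}}
    """
    estimated = max(primary_scores, key=primary_scores.get)       # S8
    rivals = similar_words.get(estimated, [])
    if not rivals:                                                # S10: NO
        return estimated                                          # S20
    # With several similar words, keep the one that scored higher in pass one.
    rival = max(rivals, key=lambda w: primary_scores[w])
    pair_scores = secondary_scores[(estimated, rival)]            # S12 to S16
    return max((estimated, rival), key=lambda w: pair_scores[w])  # S18


# Hypothetical scores: pass one prefers "kyu", pass two overturns it.
primary = {"kyu": -41.0, "ju": -43.5, "chu": -47.2}
similar = {"kyu": ["ju", "chu"]}
secondary = {("kyu", "ju"): {"kyu": -12.0, "ju": -9.5}}
final_result = recognize(primary, similar, secondary)
```

In this made-up case the primary pass ranks “kyu” first, but the secondary HMMs, which look only at the difference section, favor “ju”, so “ju” becomes the formal recognition result.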

The speech recognition processing described above is now explained in more detail using a concrete example.

(Concrete example)
For example, assume that the estimated word obtained by the primary recognition processing (S4 to S8) is “kyu” (きゅう). Assume also that the primary HMM of “kyu” has 11 states.

Because the primary HMM number of “kyu” is “9”, the determination unit 106 accesses the row of the similar word database 203 shown in FIG. 3 in which “9” is recorded in column 32. Column 33 of that row shows that there are two similar words (YES in S10), so the secondary recognition processing is executed. The two similar words are “ju” (じゅう) and “chu” (ちゅう); in the present embodiment, the secondary recognition processing is executed for the estimated word and whichever similar word had the higher likelihood in the primary recognition processing. In this concrete example, assume that “ju” had a higher likelihood than “chu”.

The extraction unit 108 searches for the column that matches the HMM number of “ju” and finds that it is column 36, “primary HMM number A of similar word”. Accordingly, “2” and “6” are read as start state number A (column 37) and end state number A (column 38), respectively. The extraction unit 108 performs Viterbi alignment of the speech signal with the primary HMM of the estimated word “kyu” and, of states 1 to 11, extracts the speech signal of the difference section delimited by states 2 to 6 (S12).
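
The database lookup in this example can be sketched as a small in-memory table. The layout below is an assumption made for illustration: the class `SimilarEntry`, the dictionary `SIMILAR_WORD_DB`, and the function `find_section` are hypothetical names, and the actual table of FIG. 3 is not reproduced here. The values 9, 10, 2, and 6 come from the example above, and the secondary HMM numbers 9109 and 9110 are the ones this concrete example associates with columns 39 and 40. The entry for “chu” is left out because this passage does not give its primary HMM number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimilarEntry:
    similar_hmm: int       # primary HMM number of the similar word (col. 36 or 41)
    start_state: int       # start state number of the difference section (col. 37 or 42)
    end_state: int         # end state number of the difference section (col. 38 or 43)
    secondary_hmms: tuple  # secondary HMM numbers (col. 39-40 or 44-45)


# Rows are keyed by the primary HMM number in column 32.
# Only the "kyu" row (9) with its entry A for "ju" (10) is filled in.
SIMILAR_WORD_DB = {
    9: [SimilarEntry(similar_hmm=10, start_state=2, end_state=6,
                     secondary_hmms=(9109, 9110))],
}


def find_section(db, estimated_hmm, similar_hmm):
    """Return the entry for the given word pair, or None if there is none."""
    for entry in db.get(estimated_hmm, []):
        if entry.similar_hmm == similar_hmm:
            return entry
    return None


entry = find_section(SIMILAR_WORD_DB, 9, 10)   # "kyu" against "ju"
```

Keying the rows by primary HMM number mirrors the access pattern in the text, where the row is selected by column 32 and the matching similar-word column is then searched.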

The difference section is now described concretely with reference to FIGS. 5 and 6. The upper part of FIG. 5 shows an example in which the speech signal V1 of “kyu” is Viterbi-aligned with the primary HMM of “kyu”. The lower part of FIG. 5 shows an example in which the partial speech signal V2 of the difference section is extracted from the entire speech signal of “kyu”. The upper part of FIG. 6 shows an example in which the speech signal VA1 of “ju” is Viterbi-aligned with the primary HMM of “kyu”. The lower part of FIG. 6 shows an example in which the partial speech signal VA2 of the difference section is extracted from the entire speech signal of “ju”.

Both speech signals V1 and VA1 in FIGS. 5 and 6 are divided into 11 states by the primary HMM of “kyu”. When deciding between “kyu” and “ju”, only the partial speech signals V2 and VA2 in the difference sections 50 and 60, which span state numbers 2 to 6, are used for the secondary recognition.

Once the partial speech signal of the difference section has been extracted, the analysis unit 112 analyzes the partial speech signal to obtain the feature amounts in the difference section (S14, S16). The determination unit 114 refers to columns 39 and 40, which record secondary HMM numbers A-1 and A-2, and acquires the parameters of the two corresponding secondary HMMs (9109 and 9110). The parameters with secondary HMM number “9109” belong to the HMM that was created during learning for the partial speech signal in the difference section (state numbers 2 to 6) of training speech for “kyu”. The parameters with secondary HMM number “9110” belong to the HMM that was created during learning for the partial speech signal in the difference section (state numbers 2 to 6) of training speech for “ju”. During learning, too, the feature amounts calculated for each frame of the second time length are used.

The determination unit 114 calculates, for each of the two secondary HMMs, the likelihood of generating the sequence of feature amounts in the difference section. Once the likelihoods have been calculated, it outputs the number of the keyword from which the secondary HMM with the higher likelihood originates as the recognition result using the secondary HMMs, and this becomes the final result. For example, if the secondary HMM with secondary HMM number “9110” has the higher likelihood, the recognition result is determined to be “ju” (S18).

When deciding between “kyu” and “chu”, columns 42 and 43 of the “kyu” row in the similar word database 203 store “3” and “6”. This indicates that, for deciding between “kyu” and “chu”, the difference section is the section delimited by state numbers 3 to 6.

As described above, according to the present embodiment, even when the primary recognition processing using the primary HMMs may have produced a misrecognition, recognition is performed again among the acoustically similar keywords. Misrecognition between keywords can therefore be reduced, and as a result the overall recognition rate can be improved.

In the present embodiment, secondary recognition is performed whenever a keyword similar to the estimated word exists, regardless of the likelihood of the estimated word in the primary recognition. Alternatively, the processing from the determination of whether a similar keyword exists (S10) onward may be performed only when the likelihood of the estimated word in the primary recognition is at or below a predetermined value. As a further alternative, the likelihood of the estimated word in the primary recognition may be taken into account after the secondary recognition has been performed.

<Modification>
In the above embodiment, when two similar words exist, secondary recognition is performed for the similar word with the higher likelihood in the primary recognition; however, secondary recognition may also include the similar word with the lower likelihood.

For example, when the primary recognition result is “kyu”, the partial speech signal of the difference section may be cut out based on the HMM of “kyu”, after which the likelihoods of the secondary HMMs of “kyu”, “ju”, and “chu” are compared using the feature amounts of the partial speech signal, and the word from which the secondary HMM with the highest likelihood originates is determined as the recognition result. In this case, the similar word database 203 is assumed to record the three secondary HMM numbers to be used in the secondary recognition.

Alternatively, when the primary recognition result is “kyu”, it may first be determined which of the similar words “ju” and “chu” has the higher likelihood, and then which of that similar word and “kyu” has the higher likelihood. In this case, the similar word database 203 shown in FIG. 3 can be used as it is.

Specifically, the determination unit 106 first accesses the row of the similar word database 203 shown in FIG. 3 in which “10”, the HMM number of “ju”, is recorded in column 32. The extraction unit 108 searches for the column that matches the HMM number of “chu” and finds that it is column 41, “primary HMM number B of similar word”. Accordingly, “3” and “5” are read as start state number B (column 42) and end state number B (column 43), respectively. The extraction unit 108 performs Viterbi alignment of the speech signal with the primary HMM of “ju” and extracts the speech signal of the difference section delimited by states 3 to 5 (S12).

Once the partial speech signal of the difference section has been extracted, the analysis unit 112 analyzes the partial speech signal to obtain the feature amounts in the difference section (S14, S16). The determination unit 114 refers to columns 44 and 45, which record secondary HMM numbers B-1 and B-2, and acquires the parameters of the two corresponding secondary HMMs (10210 and 10212). The parameters with secondary HMM number “10210” belong to the HMM that was created during learning for the partial speech signal in the difference section (state numbers 3 to 5) of training speech for “ju”. The parameters with secondary HMM number “10212” belong to the HMM that was created during learning for the partial speech signal in the difference section (state numbers 3 to 5) of training speech for “chu”.

The determination unit 114 calculates, for each of the two secondary HMMs, the likelihood of generating the sequence of feature amounts in the difference section. Once the likelihoods have been calculated, the processing from S12 onward is performed again for the estimated word “kyu” and the similar word (“ju” or “chu”) from which the secondary HMM with the higher likelihood originates. The secondary recognition processing between that similar word and the estimated word is the same as described above, so the detailed description is not repeated.
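
The two modifications can be contrasted in a short sketch. It assumes that the secondary log-likelihoods are already available as tables; the function names `decide_joint` and `decide_staged`, the table layouts, the tie-breaking rule, and all numeric scores are hypothetical and are not taken from the source.

```python
def decide_joint(section_scores):
    """First variant: compare all secondary HMMs on one difference section.

    section_scores maps each candidate keyword to the log-likelihood of its
    secondary HMM on the section cut out with the estimated word's HMM.
    """
    return max(section_scores, key=section_scores.get)


def decide_staged(estimated, rivals, pair_scores):
    """Second variant: similar words first, then the winner against the estimate.

    pair_scores[(a, b)] maps a and b to the log-likelihoods of their
    secondary HMMs on the difference section defined for that pair.
    Ties go to the first-named word (an arbitrary choice for this sketch).
    """
    first, second = rivals
    scores = pair_scores[(first, second)]
    winner = first if scores[first] >= scores[second] else second
    final = pair_scores[(estimated, winner)]
    return estimated if final[estimated] >= final[winner] else winner


joint_result = decide_joint({"kyu": -12.0, "ju": -9.5, "chu": -15.0})
staged_result = decide_staged(
    "kyu",
    ("ju", "chu"),
    {("ju", "chu"): {"ju": -7.0, "chu": -8.0},
     ("kyu", "ju"): {"kyu": -12.0, "ju": -9.5}},
)
```

The staged variant only ever compares two secondary HMMs at a time, which matches the statement that the similar word database of FIG. 3 can be reused unchanged, whereas the joint variant needs three secondary HMM numbers per row.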

In the present embodiment, the description assumed that there are at most two keywords similar to the estimated word, but the invention is also applicable when there are three or more.

The speech recognition method executed by the speech recognition apparatus 1 according to the present embodiment can also be provided as a program. Such a program can be provided recorded on a computer-readable non-transitory recording medium, such as an optical medium like a CD-ROM (Compact Disc-ROM) or a memory card. The program can also be provided by download over a network.

The program according to the present invention may be one that calls the necessary modules, from among the program modules provided as part of a computer's operating system (OS), in a predetermined sequence at predetermined timing to execute the processing. In that case, the program itself does not include those modules, and the processing is executed in cooperation with the OS. Such a program that does not include the modules can also be included in the program according to the present invention.

The program according to the present invention may also be provided incorporated into part of another program. In that case as well, the program itself does not include the modules contained in the other program, and the processing is executed in cooperation with the other program. Such a program incorporated into another program can also be included in the program according to the present invention.

The embodiment disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

DESCRIPTION OF SYMBOLS: 1 speech recognition apparatus, 11 CPU, 12 ROM, 13 RAM, 14 hard disk, 15 operation unit, 16 display unit, 17 drive device, 18 communication I/F, 19 input unit, 20 microphone, 50, 60 difference section, 100 primary recognition processing unit, 102 analysis unit, 104 estimation unit, 106 determination unit, 108 extraction unit, 110 secondary recognition processing unit, 112 analysis unit, 114 determination unit, 116 output unit, 201 primary HMM data, 202 secondary HMM data, 203 similar word database.

Claims

A speech recognition apparatus comprising:
first recognition processing means for estimating a first word from among a plurality of keywords by executing recognition processing based on a feature amount of a speech signal and a first model parameter for each of the plurality of keywords;
storage means for storing in advance similar word information about keywords that are acoustically similar to each other;
judging means for judging, by referring to the similar word information, whether there is a second word, which is a keyword similar to the first word estimated by the first recognition processing means;
extracting means for extracting a partial speech signal in a predetermined specific section from the speech signal when the judging means judges that the second word exists; and
second recognition processing means for executing recognition processing using a feature amount of the partial speech signal extracted by the extracting means,
wherein the second recognition processing means includes determining means for executing determination processing that takes, as a recognition result, whichever of the first word and the second word has the higher likelihood, based on the feature amount of the partial speech signal and a second model parameter corresponding to the specific section for each of the first word and the second word.

The speech recognition apparatus according to claim 1, wherein the similar word information includes, for each specific keyword that may be misrecognized as an acoustically similar keyword, identification information of that similar keyword,
the storage means further stores section information defining the specific section for each specific keyword, and
the extracting means divides the speech signal into a plurality of sections by a predetermined algorithm and extracts the partial speech signal based on the plurality of divided sections and the section information.

The speech recognition apparatus according to claim 1, wherein the first recognition processing means includes first analysis means for cutting out the speech signal in units of frames having a first time length and calculating the feature amount of the speech signal by performing analysis for each frame, and
the second recognition processing means further includes second analysis means for cutting out the partial speech signal in units of frames having a second time length shorter than the first time length and calculating the feature amount of the partial speech signal by performing analysis for each frame.

The speech recognition apparatus according to any one of claims 1 to 3, further comprising output means for outputting the keyword determined as the recognition result by the determining means,
wherein the output means outputs, as the recognition result, the first word estimated by the first recognition processing means when the judging means judges that the second word does not exist.

The speech recognition apparatus according to claim 1, wherein, when the judging means judges that there are a plurality of keywords similar to the first word, the determining means executes the determination processing using, as the second word, the keyword having the higher likelihood in the recognition processing by the first recognition processing means.

The speech recognition apparatus according to claim 1, wherein, when the judging means judges that there are a plurality of keywords similar to the first word, the determining means uses the second model parameter for each of the first word and the plurality of similar keywords to determine, as the recognition result, the keyword having the highest likelihood among the first word and the plurality of similar keywords.

The speech recognition apparatus according to claim 1, wherein, when the judging means judges that there are a plurality of keywords similar to the first word, the determining means uses the second model parameter for each of the plurality of similar keywords to determine which of the plurality of similar keywords has the higher likelihood, and executes the determination processing using the keyword with the higher likelihood as the second word.

The speech recognition apparatus according to any one of claims 1 to 7, wherein the specific section is determined, by simulation during learning, as the section that yields the highest recognition rate between keywords that are acoustically similar to each other.

A speech recognition program causing a computer to execute the steps of:
estimating a first word from among a plurality of keywords by executing recognition processing based on a feature amount of a speech signal and a first model parameter for each of the plurality of keywords;
judging, by referring to similar word information stored in advance about keywords that are acoustically similar to each other, whether there is a second word, which is a keyword similar to the estimated first word;
extracting a partial speech signal in a predetermined specific section from the speech signal when it is judged that the second word exists; and
executing determination processing that takes, as a recognition result, whichever of the first word and the second word has the higher likelihood, based on a feature amount of the extracted partial speech signal and a second model parameter corresponding to the specific section for each of the first word and the second word.