JPH11288293A - Voice recognition device and storage medium

Info

Publication number: JPH11288293A
Application number: JP10105636A
Authority: JP
Inventors: Shigeaki Komatsu; 慈明小松
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 1998-03-31
Filing date: 1998-03-31
Publication date: 1999-10-19

Abstract

PROBLEM TO BE SOLVED: To realize a voice recognition device and a storage medium which improve recognition precision. SOLUTION: A recognition start time setting part 14 sets, as the recognition start time, a prescribed point of time within a range after the start time of an input signal and before the start time of a sound section. This resolves the problem, found in methods that take the start time of the sound section as the recognition start time, that erroneous recognition occurs when noise such as breathing or lip noise causes the detected start time of the sound section and the recognition start time not to coincide. A discrimination part 16 refers to a parameter 22 of a sound hidden Markov model and a parameter 24 of a silence hidden Markov model to perform discrimination by the hidden Markov model method. Thus, the recognition precision is not affected by breathing or lip noise.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device that recognizes speech, and to a storage medium storing a computer program with which the speech recognition device performs speech recognition, and makes it possible to improve the accuracy of speech recognition.

【０００２】[0002]

[Prior Art] A known speech recognition device of this kind has, for example, the configuration shown in FIG. 4. This speech recognition device 40 comprises a sound section start time detection unit 42, a sound section end time detection unit 12, a storage unit 46, and an identification unit 44. The sound section start time detection unit 42 receives, as the input signal, the digitized signal supplied by the voice input device 30, and sets the start time of the sound section (hereinafter, the sound section start time) by detecting the power or a similar property of that input signal. Likewise, the sound section end time detection unit 12 sets the end time of the sound section (hereinafter, the sound section end time) by detecting the power or a similar property of the input signal. The storage unit 46 stores reference parameters 22 of the sound hidden Markov models corresponding to the sound section of each identification target. The identification unit 44 takes the interval from the sound section start time detected by the sound section start time detection unit 42 to the sound section end time detected by the sound section end time detection unit 12 as the recognition interval, and identifies the input signal within that interval while referring to the reference parameters 22 of the sound hidden Markov models stored in the storage unit 46.

【０００３】[0003]

Here, a hidden Markov model is a model that describes speech statistically and consists of parameters such as transition probabilities and output probabilities. As shown in FIG. 5, a sound section is a section of the input signal in which the speech to be identified is present, and a silent section is any section other than a sound section, that is, a section unrelated to the identification target. A sound hidden Markov model is a hidden Markov model trained in advance, for each identification target, on the signals in the sound sections of the training data; one such model is prepared for each syllable and stored in the storage unit 46.
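As an illustration of what "parameters such as transition probabilities and output probabilities" can look like in practice, the following is a minimal sketch of a discrete hidden Markov model. The class name `DiscreteHMM`, the two-state layout, and all numeric values are assumptions made for this example; the publication does not specify the model topology or the form of the output distributions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiscreteHMM:
    """Toy HMM described only by transition and output probabilities."""
    transitions: List[List[float]]  # transitions[i][j] = P(next state j | state i)
    outputs: List[List[float]]      # outputs[i][k] = P(symbol k | state i)

    def path_probability(self, states: Sequence[int], symbols: Sequence[int]) -> float:
        """Joint probability of emitting `symbols` along `states`, starting in states[0]."""
        if not states or len(states) != len(symbols):
            raise ValueError("states and symbols must be non-empty and equally long")
        prob = self.outputs[states[0]][symbols[0]]
        for prev, cur, sym in zip(states, states[1:], symbols[1:]):
            prob *= self.transitions[prev][cur] * self.outputs[cur][sym]
        return prob


# Hypothetical two-state, left-to-right model for one syllable (illustrative numbers).
syllable_model = DiscreteHMM(
    transitions=[[0.5, 0.5], [0.0, 1.0]],
    outputs=[[0.75, 0.25], [0.25, 0.75]],
)
```

Training such a model "in advance" would mean estimating these two tables from labelled sound-section data; that step is outside the scope of this sketch.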

【０００４】[0004]

[Problem to Be Solved by the Invention] The sound section end time can be detected using, for example, the method described in JP-A-62-237498. However, when recognizing speech that begins with an unvoiced consonant, some unvoiced consonants are difficult to distinguish from noise such as breathing sounds or lip noise, so a sound that is actually an unvoiced consonant may be judged to be noise, and the sound section start time is then detected incorrectly. Such an error in detecting the sound section start time could in turn cause misrecognition when the identification unit 44 performs its identification processing.

【０００５】[0005]

To solve this problem, a speech recognition device 50 with the configuration shown in FIG. 6 has been proposed. This speech recognition device 50 comprises a storage unit 54 and an identification unit 52. In addition to the reference parameters 22 of the sound hidden Markov models corresponding to the sound portion of each identification target, the storage unit 54 stores reference parameters 24 of a silence hidden Markov model corresponding to silent sections, and a hidden Markov model network 56. Here, the silence hidden Markov model is a hidden Markov model trained in advance on the signals in the silent sections of the training data. The hidden Markov model network 56 is an example for identifying single syllables: a sound hidden Markov model is prepared for each syllable, and the models are concatenated in the order silence hidden Markov model + sound hidden Markov model + silence hidden Markov model. Based on the hidden Markov model network 56, the identification unit 52 computes the well-known Viterbi score while referring to the reference parameters 22 of each sound hidden Markov model and the reference parameters 24 of the silence hidden Markov model, finds the network path that maximizes the computed Viterbi score, and outputs the utterance content corresponding to the sound hidden Markov model on that path to the display device 32 or the like as the recognition result. In other words, the speech recognition device 50 of FIG. 6 attempts to solve the above problem, which was caused by errors in detecting the sound section start time, by recognizing the utterance content without detecting the sound section at all.

【０００６】[0006]

As shown in FIG. 7, however, the input signal may contain a signal caused by a breathing sound or lip noise after the sound section. When speech recognition of such an input signal is performed with the speech recognition device 50 of FIG. 6, the transition from the sound hidden Markov model to the silence hidden Markov model, found when detecting the network path that maximizes the Viterbi score, sometimes falls at the time marked P in FIG. 7. This is equivalent to performing identification with the sound hidden Markov model on a section that includes the lip noise or breathing sound, and means that identification is not being performed on the true sound section. That is, the speech recognition device 50 of FIG. 6 has the problem that misrecognition can occur because identification is not performed on the true sound section. As described above, both the method described in JP-A-62-237498 and the speech recognition device 50 of FIG. 6 suffer from low recognition accuracy.

【０００７】[0007]

Accordingly, an object of the present invention is to realize a speech recognition device capable of improving recognition accuracy, and a storage medium storing a computer program for performing speech recognition with that device.

【０００８】[0008]

[Means for Solving the Problem] To achieve the above object, the invention according to claim 1 adopts the following technical means: storage means storing silence reference parameters corresponding to signals in the silent sections of a speech signal and sound reference parameters corresponding to signals in the sound sections of the speech signal; sound section end time detection means for detecting the end time of the sound section of an input signal as the sound section end time; recognition start time setting means for setting, as the recognition start time at which speech recognition begins, a predetermined time within the range no earlier than the start time of the input signal and no later than the start time of the sound section; and identification means for identifying the utterance content of the input signal by a time-normalized identification method, applied to the input signal from the recognition start time set by the recognition start time setting means to the sound section end time detected by the sound section end time detection means, while referring to the silence reference parameters and sound reference parameters stored in the storage means.

【０００９】[0009]

The invention according to claim 2 adopts the technical means that, in the speech recognition device according to claim 1, the time-normalized identification method is the hidden Markov model method.

【００１０】[0010]

Further, the invention according to claim 3 adopts the technical means of a storage medium that stores silence reference parameters created by learning from signals in the silent sections of a speech signal and sound reference parameters created by learning from signals in the sound sections of the speech signal, and that further stores a computer program including a computer program for: detecting the end time of the sound section of an input signal as the sound section end time; setting, as the recognition start time at which speech recognition begins, a predetermined time within the range no earlier than the start time of the input signal and no later than the start time of the sound section; and identifying the utterance content of the input signal by a time-normalized identification method, applied to the input signal from the recognition start time thus set to the detected sound section end time, while referring to the stored silence reference parameters and sound reference parameters.

【００１１】[0011]

[Operation] In the inventions according to claims 1 and 2, the storage means stores silence reference parameters corresponding to signals in the silent sections of a speech signal and sound reference parameters corresponding to signals in the sound sections of the speech signal; the sound section end time detection means detects the end time of the sound section of the input signal as the sound section end time; the recognition start time setting means sets, as the recognition start time at which speech recognition begins, a predetermined time within the range no earlier than the start time of the input signal and no later than the start time of the sound section; and the identification means identifies the utterance content of the input signal by a time-normalized identification method, applied to the input signal from the recognition start time set by the recognition start time setting means to the sound section end time detected by the sound section end time detection means, while referring to the silence reference parameters and sound reference parameters stored in the storage means. Because the recognition start time setting means sets the recognition start time to a predetermined time within the range no earlier than the start of the input signal and no later than the start of the sound section, that is, because the recognition start time can be set to any time within that range, the recognition start time need not coincide with the sound section start time. Consequently, the problem of the conventional method that takes the sound section start time as the recognition start time (JP-A-62-237498) does not arise, namely that the recognition start time fails to coincide with the sound section start time, for example because noise such as lip noise is confused with an unvoiced consonant, and misrecognition results.

【００１２】[0012]

Moreover, because the identification means identifies the utterance content of the input signal by a time-normalized identification method, applied to the input signal from the recognition start time set by the recognition start time setting means to the sound section end time detected by the sound section end time detection means, while referring to the silence reference parameters and sound reference parameters stored in the storage means, it can distinguish noise such as lip noise from unvoiced consonants. That is, since the recognition start time setting means causes recognition to begin before the sound section start time, the recognition range may contain noise such as lip noise, but the identification means can identify such noise as silence. Therefore, misrecognition due to breathing sounds, lip noise, and the like is less likely than with the conventional speech recognition device 50 (FIG. 6), which has no means of detecting the sound section start time and end time and performs recognition by finding the network path that maximizes the Viterbi score, and recognition accuracy can be improved.

【００１３】[0013]

In particular, the invention according to claim 2 adopts the technical means that the time-normalized identification method is the hidden Markov model method, that is, it uses a statistical processing method, so a speech recognition device with higher identification accuracy can be realized than when another time-normalized identification method is used.

【００１４】[0014]

Further, the invention according to claim 3 is a storage medium that stores silence reference parameters created by learning from signals in the silent sections of a speech signal and sound reference parameters created by learning from signals in the sound sections of the speech signal, and that further stores a computer program including a computer program for detecting the end time of the sound section of an input signal as the sound section end time, setting as the recognition start time a predetermined time within the range no earlier than the start time of the input signal and no later than the start time of the sound section, and identifying the utterance content of the input signal by a time-normalized identification method, applied to the input signal from the recognition start time thus set to the detected sound section end time, while referring to the stored silence reference parameters and sound reference parameters. By using this storage medium, the speech recognition device according to claim 1 can be realized. This is because the speech recognition device is controlled, as described later in the embodiment, by a CPU built into the speech recognition device or by a computer connected to it; the invention according to claim 1 can therefore be implemented by providing the speech recognition device with a storage unit serving as the storage medium, or by installing the computer program stored on the storage medium on the computer.

【００１５】[0015]

[Embodiments of the Invention] An embodiment of the speech recognition device of the present invention is described below with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the schematic configuration of the speech recognition device of this embodiment, FIG. 2 is an explanatory diagram showing the recognition start time and the sound section end time, and FIG. 3 is a flowchart showing the processing flow of the speech recognition device of this embodiment. Components identical to those of the prior art are given the same reference numerals, and their description is omitted.

【００１６】[0016]

First, the main configuration of the speech recognition device of this embodiment is described with reference to FIGS. 1 and 2. As shown in FIG. 1, the speech recognition device of this embodiment comprises a sound section end time detection unit 12, which constitutes the sound section end time detection means of the invention; a storage unit 20, which constitutes the storage means; a recognition start time setting unit 14, which constitutes the recognition start time setting means; and an identification unit 16, which constitutes the identification means. The sound section end time detection unit 12 receives, as the input signal, the signal that is input to the voice input device 30, digitized, and output, and detects the sound section end time shown in FIG. 2 using the power or a similar property of that input signal. As stated in the prior art section, a sound section is a section of the input signal in which the speech to be identified is present, and a silent section is any section other than a sound section, that is, a section unrelated to the identification target.
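The publication states only that the end of the sound section is detected "using the power of the input signal or the like" and elsewhere points to JP-A-62-237498 for a concrete technique. The sketch below is therefore not that method; it is a simple power-threshold detector given as an assumption, with hypothetical names (`frame_powers`, `detect_sound_end`) and hypothetical parameters (`threshold`, `hangover`).

```python
from typing import List, Optional, Sequence


def frame_powers(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_sound_end(powers: Sequence[float], threshold: float,
                     hangover: int) -> Optional[int]:
    """Index of the last loud frame before the first run of `hangover`
    consecutive quiet frames that follows speech; None if no such run exists."""
    last_loud: Optional[int] = None
    quiet = 0
    for i, power in enumerate(powers):
        if power >= threshold:
            last_loud = i
            quiet = 0          # a loud frame cancels any pending quiet run
        elif last_loud is not None:
            quiet += 1
            if quiet >= hangover:
                return last_loud
    return None
```

The `hangover` count keeps a short dip in power inside the utterance from being mistaken for its end; the returned frame index would be converted to a time (t3 in FIG. 2) by multiplying by the frame duration.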

【００１７】[0017]

The storage unit 20 stores parameters 22 of the sound hidden Markov models corresponding to the sound portion of each identification target, parameters 24 of the silence hidden Markov model corresponding to silent sections, and a hidden Markov model network 26. The parameters 22 of the sound hidden Markov models constitute the sound reference parameters of the invention, and the parameters 24 of the silence hidden Markov model constitute the silence reference parameters. Here, a hidden Markov model is a well-known model that describes speech statistically and consists of parameters such as transition probabilities and output probabilities.

【００１８】[0018]

A sound hidden Markov model is a hidden Markov model trained in advance, for each identification target, on the signals in the sound sections of the training data; in this embodiment, a hidden Markov model for each syllable is stored in the storage unit 20. The silence hidden Markov model is a hidden Markov model trained in advance on the signals in the silent sections of the training data. The hidden Markov model network 26 shown in FIG. 1 is an example for recognizing single syllables: in this embodiment, a sound hidden Markov model is prepared for each syllable, and the models are concatenated in the order silence hidden Markov model + sound hidden Markov model.

The recognition start time setting unit 14 sets, as the recognition start time, a suitable time within the interval that begins no earlier than the start of the input signal and ends no later than the start of the sound section. Because this embodiment uses a hidden Markov model network, the recognition start time need not coincide with the sound section start time; it is sufficient that the sound section start time falls at or after the recognition start time. As shown in FIG. 2, the unit sets as the recognition start time the time t1 obtained by going back from the sound section end time t3, detected by the sound section end time detection unit 12, by a time a that is sufficiently longer than the sound section of a single syllable, about 500 ms in this embodiment.
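The rule for the recognition start time amounts to t1 = t3 − a with a ≈ 500 ms. A minimal sketch follows; the function name `recognition_start_ms` is hypothetical, and the clamp to the start of the input signal is an assumption added here to honour the requirement that t1 not precede the input signal when the utterance ends less than 500 ms into the recording.

```python
def recognition_start_ms(sound_end_ms: float,
                         lookback_ms: float = 500.0,
                         signal_start_ms: float = 0.0) -> float:
    """Return t1 = t3 - a, never earlier than the start of the input signal."""
    return max(signal_start_ms, sound_end_ms - lookback_ms)
```

For example, a sound section ending at 1200 ms gives a recognition start time of 700 ms, while one ending at 300 ms is clamped to 0 ms.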

【００１９】[0019]

For the input signal from the recognition start time t1 set by the recognition start time setting unit 14 to the sound section end time t3 detected by the sound section end time detection unit 12, the identification unit 16 computes the Viterbi score by the well-known Viterbi method, based on the hidden Markov model network 26 stored in the storage unit 20 and while referring to the parameters 22 of the sound hidden Markov models and the parameters 24 of the silence hidden Markov model. It then finds the network path that maximizes the computed Viterbi score and outputs the identification target corresponding to the sound hidden Markov model on that path to the display device 32 or the like as the recognition result.

【００２０】[0020]

Next, the processing performed by the speech recognition device 10 of this embodiment is described with reference to the flowchart of FIG. 3. The computer program with which the speech recognition device 10 executes the processing shown in FIG. 3 is stored in the storage unit 20. First, the sound section end time detection unit 12 detects the sound section end time of the input signal using the power or a similar property of the input signal (step 2). This detection is performed using, for example, the technique described in the aforementioned JP-A-62-237498. Next, the recognition start time setting unit 14 sets, as the recognition start time t1, the time 500 ms before the sound section end time t3 detected in step 2 (step 4).

【００２１】[0021]

Next, the identification unit 16 extracts features from the input signal in the range from the recognition start time t1 set in step 4 to the sound section end time t3 detected in step 2, using well-known linear prediction or a filter bank (step 6), and reads the hidden Markov model network 26 stored in the storage unit 20 (step 8). Based on the network 26 thus read, and while referring to the parameters 22 of the sound hidden Markov models and the parameters 24 of the silence hidden Markov model stored in the storage unit 20, it extracts by the well-known Viterbi method the network path that maximizes the Viterbi score of the hidden Markov model network 26 (step 10), and outputs the identification target corresponding to the sound hidden Markov model on the extracted path as the recognition result (step 12). The output recognition result is displayed on the display device 32 or the like.
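Steps 8 to 12 can be sketched as follows, under strong simplifying assumptions that are not taken from the publication: the features of step 6 are replaced by discrete symbols, every model is a left-to-right chain of states with a self-loop probability, and the network is scored one syllable at a time as "silence model + syllable model". All names (`viterbi_log_score`, `recognize`, `SILENCE`, `SYLLABLES`) and all probabilities are hypothetical.

```python
import math
from typing import Dict, List, Mapping, Sequence, Tuple

# One state = (self-loop probability, {symbol: output probability}).
State = Tuple[float, Dict[str, float]]


def viterbi_log_score(chain: Sequence[State], observations: Sequence[str]) -> float:
    """Best log-probability of a left-to-right path that starts in the first
    state of `chain` and ends in its last state."""
    if not chain or not observations:
        raise ValueError("chain and observations must be non-empty")
    neg_inf = float("-inf")

    def log(p: float) -> float:
        return math.log(p) if p > 0.0 else neg_inf

    scores = [neg_inf] * len(chain)
    scores[0] = log(chain[0][1].get(observations[0], 0.0))
    for symbol in observations[1:]:
        new_scores = [neg_inf] * len(chain)
        for j, (stay, outputs) in enumerate(chain):
            best = scores[j] + log(stay)                      # remain in state j
            if j > 0:                                         # or advance from j-1
                best = max(best, scores[j - 1] + log(1.0 - chain[j - 1][0]))
            new_scores[j] = best + log(outputs.get(symbol, 0.0))
        scores = new_scores
    return scores[-1]


def recognize(observations: Sequence[str], silence: Sequence[State],
              syllables: Mapping[str, Sequence[State]]) -> str:
    """Return the syllable whose 'silence + syllable' path scores highest."""
    return max(
        syllables,
        key=lambda label: viterbi_log_score(
            list(silence) + list(syllables[label]), observations),
    )


# Symbols: "q" quiet, "n" noise (e.g. lip noise), "k" and "a" speech sounds.
SILENCE: List[State] = [(0.8, {"q": 0.7, "n": 0.3})]
SYLLABLES: Dict[str, List[State]] = {
    "ka": [(0.5, {"k": 0.8, "n": 0.2}), (0.5, {"a": 0.9, "q": 0.1})],
    "a": [(0.5, {"a": 0.9, "n": 0.1})],
}
```

With these toy models, `recognize(["q", "n", "a", "a"], SILENCE, SYLLABLES)` selects "a": the best path assigns the noise symbol to the silence model rather than to the unvoiced consonant of "ka", which is the behaviour the description attributes to including the silence model in the network.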

【００２２】[0022]

As described above, with the speech recognition device 10 of this embodiment the recognition start time t1 can be set to a predetermined time within the range no earlier than the start of the input signal and no later than the sound section start time t2. Misrecognition caused by a failure to make the sound section start time and the recognition start time coincide, as in the conventional method (JP-A-62-237498), therefore does not occur. In addition, based on the hidden Markov model network 26 and while referring to the parameters 22 of the sound hidden Markov models and the parameters 24 of the silence hidden Markov model, the device extracts by the well-known Viterbi method the network path that maximizes the Viterbi score of the hidden Markov model network 26, and outputs the identification target corresponding to the sound hidden Markov model on the extracted path as the recognition result. Compared with the conventional speech recognition device 50 (FIG. 6), this lowers the probability of misrecognizing breathing sounds, lip noise, and the like as speech, so recognition accuracy can be improved.

【００２３】[0023]

In the embodiment above, the silence hidden Markov model and the sound hidden Markov models are connected by means of a hidden Markov model network, but other processing methods may be used, such as directly concatenating the silence hidden Markov model and a sound hidden Markov model. A sound hidden Markov model may also be built by directly concatenating hidden Markov models of finer units, such as phonemes, or by connecting them through a hidden Markov model network. Although the embodiment uses the hidden Markov model method as the time-normalized identification method, dynamic programming (DP matching) may be used instead. The embodiment describes a device that performs single-syllable recognition as a representative example, but the invention can also be applied to speech recognition devices that perform word recognition, sentence recognition, and the like. Finally, the embodiment treats breathing sounds as part of the silent section, but because a breathing sound is adjacent to the sound section and has less influence than lip noise, it may instead be treated as part of the sound section.
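For the DP matching alternative mentioned above, a minimal sketch of dynamic time warping over one-dimensional feature sequences is given below. The publication gives no details of how DP matching would be configured, so the local cost, the allowed steps, and the names `dtw_distance` and `nearest_template` are assumptions; in a device following the claims, the reference templates would presumably also have to cover the leading silent portion, which this sketch does not model.

```python
from typing import Mapping, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW with steps (i-1, j), (i, j-1), (i-1, j-1) and |x - y| local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def nearest_template(features: Sequence[float],
                     templates: Mapping[str, Sequence[float]]) -> str:
    """Label of the reference template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))
```

Time normalization shows up here as the warping path: a sequence that lingers on one value, such as `[1, 2, 2, 3]`, still matches the template `[1, 2, 3]` at zero cost.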

[0024] In the above embodiment, the computer program is stored in the storage unit 20. Alternatively, the computer program may be stored on a CD-ROM, a floppy disk, or the like, and the device may be operated by installing the program using a reading device provided in the device. In this case, the CD-ROM, FD, or the like functions as the storage medium according to claim 3. The device can also be operated by reading the computer program from an external information processing device via wired or wireless communication means.

【００２５】[0025]

[Effects of the Invention] As described above, the inventions of claims 1 and 2 include recognition start time setting means that sets, as the recognition start time at which speech recognition starts, a predetermined time within a range after the start time of the input signal and before the start time of the sound section. This eliminates the erroneous recognition that arises when the sound section start time and the recognition start time do not coincide. In addition, they include identification means that identifies the utterance content of the input signal by a time-normalized identification method, referring to the silent reference parameter and the sound reference parameter, for the input signal from the recognition start time set by the recognition start time setting means to the sound section end time detected by the sound section end time detection means. As a result, recognition accuracy is not affected by lip noise, breathing sounds, and the like. In other words, the inventions of claims 1 and 2 make it possible to realize a speech recognition device with improved recognition accuracy.
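The relationship between the input start, the recognition start time, and the sound section can be sketched in Python as follows. This is only an illustration under stated assumptions: the power threshold, the fixed `margin` in frames, and the function names `sound_section` and `recognition_start` are hypothetical and do not come from the patent, which only requires that the recognition start time lie between the start of the input signal and the start of the sound section.

```python
def sound_section(powers, threshold):
    """Return (first, last) indices of frames whose power exceeds the threshold."""
    loud = [i for i, p in enumerate(powers) if p > threshold]
    if not loud:
        return None  # no sound section detected
    return loud[0], loud[-1]

def recognition_start(sound_start, margin):
    """Pick a frame no earlier than the input start (0) and no later than sound_start."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return max(0, sound_start - margin)

# Hypothetical per-frame power values for one input signal.
powers = [0.1, 0.1, 0.2, 3.0, 4.0, 3.5, 0.2, 0.1]
start, end = sound_section(powers, threshold=1.0)
rec_start = recognition_start(start, margin=2)
segment = powers[rec_start:end + 1]  # frames handed to the identification step
print(start, end, rec_start)  # prints "3 5 1"
```

The frames between `rec_start` and `start` are the ones that the silent reference parameter is expected to account for during identification.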

[0026] In particular, according to the invention of claim 2, the time-normalized identification method is the hidden Markov model method, which is a statistical processing method, so a speech recognition device with higher identification accuracy can be realized than when other time-normalized identification methods are used.

[0027] Further, the invention of claim 3 is a storage medium that stores a silent reference parameter created by learning from signals in the silent section of a speech signal and a sound reference parameter created by learning from signals in the sound section of the speech signal, together with a computer program. The computer program detects the end time of the sound section of an input signal as the sound section end time; sets, as the recognition start time at which speech recognition starts, a predetermined time within a range after the start time of the input signal and before the start time of the sound section; and identifies the utterance content of the input signal by a time-normalized identification method, referring to the stored silent reference parameter and sound reference parameter, for the input signal from the set recognition start time to the detected sound section end time. Therefore, by providing the storage medium as the storage unit in a speech recognition device, or by installing the computer program stored on the storage medium in a speech recognition device or in a computer connected to the speech recognition device, the speech recognition device according to claim 1 can be realized.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram showing, in block form, the schematic configuration of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing the recognition start time, the sound section end time, and the like in the embodiment of the present invention.

FIG. 3 is a flowchart showing the processing flow of the speech recognition device according to the embodiment of the present invention.

FIG. 4 is an explanatory diagram showing, in block form, the schematic configuration of a conventional speech recognition device.

FIG. 5 is an explanatory diagram showing a sound section, a silent section, and the like.

FIG. 6 is an explanatory diagram showing, in block form, the schematic configuration of a conventional speech recognition device.

FIG. 7 is an explanatory diagram showing the recognition section of the speech recognition device shown in FIG. 6.

[Explanation of symbols]

10 Speech recognition device
12 Sound section end time detection unit (sound section end time detection means)
14 Recognition start time setting unit (recognition start time setting means)
16 Identification unit (identification means)
20 Storage unit (storage means)
22 Parameters of the sound hidden Markov model (sound reference parameters)
24 Parameters of the silent hidden Markov model (silent reference parameters)
26 Hidden Markov model network

Claims

[Claims]

1. A speech recognition device comprising: storage means for storing a silent reference parameter corresponding to a signal in a silent section of a speech signal and a sound reference parameter corresponding to a signal in a sound section of the speech signal; sound section end time detection means for detecting the end time of the sound section of an input signal as a sound section end time; recognition start time setting means for setting, as a recognition start time at which speech recognition starts, a predetermined time within a range after the start time of the input signal and before the start time of the sound section; and identification means for identifying the utterance content of the input signal by a time-normalized identification method, while referring to the silent reference parameter and the sound reference parameter stored in the storage means, for the input signal from the recognition start time set by the recognition start time setting means to the sound section end time detected by the sound section end time detection means.

2. The speech recognition apparatus according to claim 1, wherein the time-normalized identification method is a hidden Markov model method.

3. A storage medium storing: a silent reference parameter created by learning from signals in a silent section of a speech signal; a sound reference parameter created by learning from signals in a sound section of the speech signal; and a computer program for detecting the end time of the sound section of an input signal as a sound section end time, setting, as a recognition start time at which speech recognition starts, a predetermined time within a range after the start time of the input signal and before the start time of the sound section, and identifying the utterance content of the input signal by a time-normalized identification method, while referring to the stored silent reference parameter and sound reference parameter, for the input signal from the set recognition start time to the detected sound section end time.