JPH07230293A - Voice recognition device

Info

Publication number: JPH07230293A
Application number: JP6020044A
Authority: JP
Inventors: Kazuo Ishii; 和夫石井; Masao Watari; 雅男渡; Yasuhiko Kato; 靖彦加藤; Hiroaki Ogawa; 浩明小川; Masanori Omote; 雅則表; Kazuo Watanabe; 一夫渡辺; Katsuki Minamino; 活樹南野
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1994-02-17
Filing date: 1994-02-17
Publication date: 1995-08-29

Abstract

PURPOSE: To improve the speaker's freedom of utterance and the voice recognition rate. CONSTITUTION: In a control section 5, each silent interval of the voice detected by a voice interval detection section 4 is compared with prescribed set times T1 and T2 (where T1<T2). According to the comparison results, the voice is divided into prescribed segment units and into sentence units composed of segments. Then, in a voice recognition section 3, each sentence is recognized in units of the segments that constitute it. Here, the handling of succeeding segments is varied according to the recognition results of the temporally preceding segments in the sentence.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice recognition device suitable for use in, for example, recognizing voice.

[0002]

[Prior Art] In a conventional voice recognition device, when voice is input, a silent portion (non-voice section) lasting at least a predetermined time, which is presumed to mark the boundary of one sentence, is detected in the voice. Continuous word voice recognition is then performed on the input voice in units delimited by such silent portions, that is, in units of sentences (which here include, besides ordinary sentences, word strings with a dependency relation such as compound words, and word strings connected in some sense, such as addresses).

[0003]

[Problems to Be Solved by the Invention] A speaker, however, may utter the words of a single sentence with pauses between them or without pauses, so the words delimited by silent portions do not always form sentence units. One remedy is to restrict the speaker's utterance so that it is delimited sentence by sentence, but this reduces the freedom of utterance and therefore degrades the operability of the device.

[0004] One approach is therefore to set the above-mentioned predetermined time to a relatively large value, which increases the freedom of the pauses that the speaker inserts between words.

[0005] In this case, however, the speaker is forced to remain silent for at least the predetermined time at each boundary between sentences, which the speaker finds bothersome.

[0006] Furthermore, after the voice that should be recognized, a speaker often coughs or utters an unnecessary word that is not a recognition target word, such as "Huh?". Consequently, when the predetermined time is set to a large value as described above, the cough or unnecessary word is also subjected to voice recognition, which lowers the recognition rate.

[0007] The present invention has been made in view of these circumstances and makes it possible to improve both the operability and the recognition rate of the device.

[0008]

[Means for Solving the Problems] The voice recognition device according to claim 1 comprises: detection means (for example, the voice section detection unit 4 shown in FIG. 1) for detecting voice sections of input voice; dividing means (for example, the control unit 5 shown in FIG. 1) for comparing a non-voice section, which is the section between a voice section detected by the detection means and the voice section detected next by the detection means, with first and second set times and, according to the comparison result, dividing the voice into predetermined segment units and into sentence units composed of segments; and recognition means (for example, the voice recognition unit 3 shown in FIG. 1) for performing voice recognition on each sentence produced by the dividing means, in units of the segments that constitute it.

[0009] In the voice recognition device according to claim 2, the first set time is shorter than the second set time, and the dividing means operates as follows. When the non-voice section is shorter than the first set time, the voice sections before and after it are combined into one segment. When the non-voice section has a value between the first and second set times, the voice sections before and after it are each treated as one segment, and those two segments are combined into one sentence. When the non-voice section is longer than the second set time, the voice sections before and after it are each treated as one sentence.

[0010] In the voice recognition device according to claim 3, when a sentence is composed of a plurality of segments and at least one word has been recognized from a preceding segment, the recognition means ignores the subsequent segments in that sentence.

[0011] In the voice recognition device according to claim 4, when a sentence is composed of a plurality of segments and at least one word has been recognized from a preceding segment, the recognition means performs further voice recognition, treating all the subsequent segments in that sentence as one new sentence.

[0012] In the voice recognition device according to claim 5, when a sentence is composed of a plurality of segments and a word has been recognized from a preceding segment, the recognition means performs voice recognition on the segment following that preceding segment under a constraint corresponding to the word.

[0013] In the voice recognition device according to claim 6, when a sentence is composed of a plurality of segments, the recognition means performs voice recognition on all of the segments and takes, from among the resulting recognition results, that of the segment with the highest likelihood as the final voice recognition result.

[0014]

[Operation] In the voice recognition device according to claim 1, the non-voice sections of the input voice are compared with the first and second set times, and according to the comparison result the voice is divided into predetermined segment units and into sentence units composed of segments. Each sentence is then recognized in units of the segments that constitute it. Therefore, by setting the first and second set times to appropriate values, the speaker's utterance need not be restricted, and the operability of the device can be improved.

[0015] In the voice recognition device according to claim 2, the first set time is shorter than the second set time. When the non-voice section is shorter than the first set time, the dividing means combines the voice sections before and after it into one segment. When the non-voice section has a value between the first and second set times, the dividing means treats the voice sections before and after it as one segment each and combines those two segments into one sentence. When the non-voice section is longer than the second set time, the dividing means treats the voice sections before and after it as one sentence each. Therefore, by setting the first and second set times to appropriate values, the speaker's utterance need not be restricted, and the operability of the device can be improved.

[0016] In the voice recognition device according to claim 3, when a sentence is composed of a plurality of segments and at least one word has been recognized from a preceding segment, the subsequent segments in that sentence are ignored. Portions that follow the voice to be recognized but are not recognition target words, such as coughs and unnecessary words, are therefore ignored, and the voice recognition rate can be improved.

[0017] In the voice recognition device according to claim 4, when a sentence is composed of a plurality of segments and at least one word has been recognized from a preceding segment, further voice recognition is performed with all the subsequent segments in that sentence treated as one new sentence. Therefore, when the voice to be recognized is followed by further voice that should also be recognized, that voice is recognized as well, and the recognition rate can be improved.

[0018] In the voice recognition device according to claim 5, when a sentence is composed of a plurality of segments and a word has been recognized from a preceding segment, the segment following that preceding segment is recognized under a constraint corresponding to the word. The voice recognition rate can therefore be improved further.

[0019] In the voice recognition device according to claim 6, when a sentence is composed of a plurality of segments, all of the segments are recognized, and from among the resulting recognition results, that of the segment with the highest likelihood is taken as the final recognition result. The recognition rate can therefore be improved for, for example, voice that contains many unnecessary words.
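The selection rule of claim 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `recognize_best_segment` and the shape of the `recognize` callback, which is assumed to return a (text, likelihood) pair for one segment, are hypothetical.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Segment = TypeVar("Segment")


def recognize_best_segment(
    segments: Sequence[Segment],
    recognize: Callable[[Segment], Tuple[str, float]],
) -> str:
    """Recognize every segment and keep the result with the highest likelihood."""
    if not segments:
        raise ValueError("a sentence must contain at least one segment")
    # Recognize all segments of the sentence, not only the first one.
    results = [recognize(segment) for segment in segments]
    # The final result is the text of the segment whose likelihood is highest.
    best_text, _best_likelihood = max(results, key=lambda result: result[1])
    return best_text
```

With a recognizer that scores coughs and fillers low, the segment carrying the intended word wins regardless of its position in the sentence.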

[0020]

[Embodiment] FIG. 1 is a block diagram showing the configuration of one embodiment of the voice recognition device of the present invention. The microphone 1 converts input voice into a voice signal in the form of an electric signal and outputs it to the A/D converter 2. The A/D converter 2 samples (A/D converts) the voice signal from the microphone 1 at a predetermined sampling interval and outputs the resulting digital voice signal to the voice recognition unit 3 and the voice section detection unit 4.

[0021] The voice section detection unit 4 calculates, for example, the short-time power of the voice signal from the A/D converter 2 and detects the sections in which that power is at or above a predetermined value as voice sections. It detects the remaining sections as non-voice sections and outputs them to the control unit 5.

[0022] The control unit 5 compares each non-voice section from the voice section detection unit 4 with the predetermined times T₁ and T₂ set in advance and, based on the comparison results, divides the input voice into segment sections and into sentence sections composed of segments. It then outputs a control signal corresponding to the division result to the voice recognition unit 3. The control unit 5 also controls the voice recognition unit 3 according to the recognition results (recognition result candidates) output from the voice recognition unit 3, and outputs the voice recognition result from the voice recognition unit 3 to a downstream device (not shown).

[0023] The voice recognition unit 3 performs, for example, continuous word voice recognition on the voice signal from the A/D converter 2, based on the control signal supplied from the control unit 5. That is, it recognizes the portion of the voice signal corresponding to a sentence section, in units of the sections corresponding to segments. Recognition in the voice recognition unit 3 is based on, for example, the HMM (Hidden Markov Models) method, which is effective for continuous word voice recognition, and the recognition result candidate with the highest probability (likelihood) is output as the recognition result. As the symbols (codes) observed from the HMM, the voice recognition unit 3 uses vector-quantized voice features, for example LPC cepstrum coefficients, computed by acoustic analysis of the input voice signal.
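To make the phrase "the candidate with the highest likelihood is output" concrete, here is a minimal sketch of scoring a sequence of vector-quantized symbols against discrete HMMs with the standard forward algorithm. The patent does not specify its HMM topology or scoring procedure, so the `Model` layout, the function names, and the one-model-per-word arrangement are all assumptions made for illustration; the symbols are assumed to be codebook indices already produced by vector quantization.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed layout of one discrete HMM:
# (initial state probabilities, state transition matrix, emission matrix)
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def forward_likelihood(model: Model, symbols: Sequence[int]) -> float:
    """Return P(symbols | model) computed with the forward algorithm."""
    initial, transition, emission = model
    if not symbols:
        raise ValueError("at least one observed symbol is required")
    states = range(len(initial))
    # alpha[s]: probability of the symbols seen so far, ending in state s.
    alpha = [initial[s] * emission[s][symbols[0]] for s in states]
    for symbol in symbols[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in states) * emission[s][symbol]
            for s in states
        ]
    return sum(alpha)


def recognize(models: Dict[str, Model], symbols: Sequence[int]) -> str:
    """Return the word whose model gives the observed symbols the highest likelihood."""
    return max(models, key=lambda word: forward_likelihood(models[word], symbols))
```

A practical recognizer would work with log probabilities to avoid underflow on long utterances; plain probabilities are used here only to keep the sketch short.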

[0024] The operation is described next. When voice is input to the microphone 1, it is converted into a voice signal, which is an electric signal, and output to the A/D converter 2. In the A/D converter 2, the voice signal from the microphone 1 is A/D converted and output to the voice recognition unit 3 and the voice section detection unit 4.

[0025] In the voice section detection unit 4, the short-time power of the voice signal from the A/D converter 2 is detected, and the sections in which its value is at or above a predetermined value Th are successively detected as voice sections. For example, for a voice signal with the short-time power shown in FIG. 2(a), the voice sections shown in FIG. 2(b) are detected. Furthermore, of the non-voice sections (the sections other than voice sections), the sections (durations) t₀, t₁, t₂, t₃, t₄ sandwiched between voice sections are output successively to the control unit 5. The control unit 5 compares each of the non-voice sections t₀ to t₄ from the voice section detection unit 4 with each of the preset predetermined times T₁ and T₂ in turn.
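The detection step just described can be sketched as follows. This is an illustration under stated assumptions rather than the device's actual processing: power is taken as the mean squared amplitude over non-overlapping frames, durations are counted in frames, and the names `short_time_power`, `detect_voice_sections`, and `silences_between` are hypothetical.

```python
from typing import List, Sequence, Tuple


def short_time_power(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (assumed definition)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_voice_sections(power: Sequence[float], th: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, where power >= th."""
    sections: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(power):
        if p >= th and start is None:
            start = i                      # a voice section begins
        elif p < th and start is not None:
            sections.append((start, i))    # the voice section ends
            start = None
    if start is not None:
        sections.append((start, len(power)))
    return sections


def silences_between(sections: Sequence[Tuple[int, int]]) -> List[int]:
    """Lengths (in frames) of the non-voice sections sandwiched between voice sections."""
    return [nxt[0] - cur[1] for cur, nxt in zip(sections, sections[1:])]
```

Only the silences between two voice sections are returned, matching the text: leading and trailing silence is not passed on for comparison with T₁ and T₂.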

[0026] Here, the predetermined time T₁ is set to the upper limit of a silent portion within a word; specifically, in Japanese for example, to the length of a geminate consonant (sokuon), such as the statistical maximum of sokuon length. It serves, so to speak, as the reference value for determining the breaks between words in the input voice. The predetermined time T₂ is larger than the time allowed as a pause between words when words are uttered in succession (for example, a pause for breath) and no greater than the minimum length of the deliberate silence that the user of the device inserts to separate sentences. It serves, so to speak, as the reference value for determining the boundaries between sentences in the input voice. Accordingly, T₁ is a relatively small value and T₂ a relatively large one, and the two satisfy T₁ < T₂.
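One way to read these constraints is as a small calibration routine over measured durations. The sketch below is an assumption throughout: the patent gives no procedure for choosing the values, the function name `choose_set_times` is hypothetical, and taking the midpoint for T₂ is just one choice that satisfies "larger than the allowed pause, no greater than the shortest deliberate silence".

```python
from typing import Sequence, Tuple


def choose_set_times(
    sokuon_lengths: Sequence[float],
    inter_word_pauses: Sequence[float],
    sentence_silences: Sequence[float],
) -> Tuple[float, float]:
    """Pick (T1, T2) in seconds from measured durations (illustrative only)."""
    # T1: upper limit of silence inside a word, here the longest observed sokuon.
    t1 = max(sokuon_lengths)
    longest_pause = max(inter_word_pauses)
    shortest_silence = min(sentence_silences)
    if not longest_pause < shortest_silence:
        raise ValueError("pauses and sentence silences overlap; no valid T2 exists")
    # T2: assumed midpoint, which is > longest_pause and <= shortest_silence.
    t2 = (longest_pause + shortest_silence) / 2
    if not t1 < t2:
        raise ValueError("T1 must be smaller than T2")
    return t1, t2
```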

[0027] When the control unit 5 determines that a non-voice section is shorter than the predetermined time T₁, the voice sections before and after that non-voice section are taken to form one segment together. When it determines that the non-voice section is at least T₁ and at most T₂, the voice sections before and after it are each taken to form one segment, and those two segments are taken to form one sentence. When it determines that the non-voice section is longer than the predetermined time T₂, the voice sections before and after it are each taken to form one sentence (independent sentences).
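The three-way rule above can be sketched as a single pass over the gaps between consecutive voice sections. The function name `divide_into_sentences` and the nested-list representation (sentences containing segments containing voice-section indices) are hypothetical choices for illustration.

```python
from typing import List, Sequence


def divide_into_sentences(
    gaps: Sequence[float], t1: float, t2: float
) -> List[List[List[int]]]:
    """Group voice sections 0..len(gaps) into sentences made of segments.

    gaps[i] is the non-voice section between voice sections i and i + 1.
    """
    if not t1 < t2:
        raise ValueError("t1 must be smaller than t2")
    # The first voice section opens the first segment of the first sentence.
    sentences: List[List[List[int]]] = [[[0]]]
    for index, gap in enumerate(gaps, start=1):
        if gap < t1:
            sentences[-1][-1].append(index)   # same segment
        elif gap <= t2:
            sentences[-1].append([index])     # new segment, same sentence
        else:
            sentences.append([[index]])       # new sentence
    return sentences
```

For example, with hypothetical values T₁ = 0.2 s and T₂ = 1.0 s, gaps of 0.5, 1.5, 2.0, 0.4, and 0.6 s yield three sentences containing two, one, and three segments.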

[0028] Accordingly, if in the case shown in FIG. 2(b) we have T₁ ≤ t₀ ≤ T₂, T₂ < t₁, T₂ < t₂, T₁ ≤ t₃ ≤ T₂, and T₁ ≤ t₄ ≤ T₂, the input voice is divided, as shown in FIG. 2(c), into three sentence sections: sentence A, composed of (divided into) two segments; sentence B, consisting of one segment; and sentence C, composed of (divided into) three segments.

[0029] That is, in FIG. 2(b), the voice sections before and after the non-voice section t₀ (T₁ ≤ t₀ ≤ T₂) are each treated as an independent segment and together form sentence A. The voice sections before and after the non-voice section t₁ (T₂ < t₁) belong to separate sentences A and B, respectively. Likewise, the voice sections before and after the non-voice section t₂ (T₂ < t₂) belong to separate sentences B and C. The voice sections before and after the non-voice section t₃ (T₁ ≤ t₃ ≤ T₂) are each treated as one segment and form part of sentence C. The voice sections before and after the non-voice section t₄ (T₁ ≤ t₄ ≤ T₂) are likewise each treated as one segment and form part of sentence C.

[0030] Similarly, when a voice signal with the short-time power shown in FIG. 3(a) is input, the voice section detection unit 4 successively detects the voice sections shown in FIG. 3(b), and of the remaining non-voice sections, the sections (durations) t₅, t₆, t₇, t₈ sandwiched between voice sections are output successively to the control unit 5.

[0031] Now suppose that t₅ < T₁, T₁ ≤ t₆ ≤ T₂, t₇ < T₁, and T₂ < t₈. In the control unit 5, the voice sections before and after the non-voice section t₅ (t₅ < T₁) are combined into one segment (FIG. 3(c)). The voice sections before and after the non-voice section t₆ (T₁ ≤ t₆ ≤ T₂) are each treated as an independent segment and together form sentence D. (As described above, the voice section immediately before t₆ has already been combined with the voice section preceding it into one segment, so "the voice section before t₆" can here be regarded as that segment.) The voice sections before and after the non-voice section t₇ (t₇ < T₁) are combined into one segment. Finally, the voice sections before and after the non-voice section t₈ (T₂ < t₈) belong to separate sentences D and E, respectively.

[0032] As a result, the input voice is divided, as shown in FIG. 3(c), into two sentence sections: sentence D, composed of (divided into) two segments, and sentence E, consisting of one segment.

[0033] The control unit 5 outputs a control signal corresponding to this division result to the voice recognition unit 3 and thereby controls it. In the voice recognition unit 3, the voice signal from the A/D converter 2 is successively divided into sentences according to the control signal from the control unit 5, and each sentence undergoes continuous word voice recognition, segment by segment, based on, for example, the HMM method.

[0034] For example, suppose that the target vocabulary for voice recognition consists of the names of the 23 wards of Tokyo and the names of their towns, and that the permitted utterance patterns are "ward name" + "town name" (a "town name" uttered after a "ward name") as well as a "ward name" alone or a "town name" alone. When voice is uttered, the processing described with reference to FIG. 2 or FIG. 3 is performed first. Then, as shown in the flowchart of FIG. 4, in step S1, when the control unit 5 detects, based on the output of the voice section detection unit 4, a non-voice section longer than the predetermined time T₂, the voice section immediately before that non-voice section (the section in which the short-time power is at or above the predetermined value Th) is detected as the section corresponding to a sentence.

[0035] As described above, the predetermined time T₂ is a relatively large value (compared with the conventional predetermined time mentioned earlier). Even if the speaker inserts a somewhat long pause between words that belong to one sentence, the voice is not split into different sentences at that point, and the speaker's freedom of utterance is therefore improved.

[0036] After step S1, the process proceeds to step S2, where it is determined how many segments the voice section corresponding to the sentence detected in step S1 contains.

[0037] If it is determined in step S2 that the voice section corresponding to the sentence consists of one segment, that is, if only a "ward name" or only a "town name" was uttered, or if both were uttered in succession with almost no pause, the process proceeds to step S3. There the control unit 5 outputs a control signal to the voice recognition unit 3 so that continuous word voice recognition is performed on the voice signal corresponding to that segment (which in this case equals the sentence). The voice recognition unit 3 accordingly performs continuous word voice recognition on that voice signal. The process proceeds to step S4, where the recognition result is held in a variable R, and then to step S12.

[0038] If, on the other hand, it is determined in step S2 that the voice section corresponding to the sentence contains two or more segments, the process proceeds to step S5, where the control unit 5 outputs a control signal to the voice recognition unit 3 so that continuous word voice recognition is performed on the voice signal corresponding to the leading segment (the first segment (FIGS. 2 and 3)). The voice recognition unit 3 accordingly performs continuous word voice recognition on that voice signal, and the process proceeds to step S6, where the recognition result is held in a variable R₁.

[0039] The process then proceeds to step S7, where it is determined whether the recognition result held in the variable R₁ in step S6 represents a "ward name". If it does not, that is, if the recognition result held in R₁ represents, for example, a "town name" alone or a "ward name" + "town name", the process proceeds to step S8, where the recognition result held in R₁ is stored in the variable R, and then to step S12.

[0040] If it is determined in step S7 that the recognition result held in the variable R₁ represents a "ward name", the variable R₁ is output from the voice recognition unit 3 to the control unit 5. The control unit 5 then outputs a control signal to the voice recognition unit 3 so that the next segment (in this case the second segment (FIGS. 2 and 3)) is recognized with the recognition targets restricted to the "town names" of towns that actually exist in the ward indicated by the "ward name" held in R₁.

[0041] Accordingly, in step S9 the voice recognition unit 3 performs continuous word voice recognition (isolated word recognition would also suffice here) on the voice signal corresponding to the next segment, with the recognition vocabulary restricted.

[0042] Because the recognition vocabulary is narrowed down in this case, the voice recognition rate can be improved.

[0043] The process then proceeds to step S10, where the "town name" obtained as the recognition result of the voice signal corresponding to the next segment is held in a variable R₂, and then to step S11. In step S11, the character string "ward name" + "town name" (for example, "Shinagawa-ku Kita-Shinagawa"), formed by appending the recognition result held in R₂ (in this case a "town name", for example "Kita-Shinagawa") to the recognition result held in R₁ (in this case a "ward name", for example "Shinagawa-ku"), is held in the variable R as the recognition result, and the process proceeds to step S12.

[0044] In step S12, the variable R is supplied from the voice recognition unit 3 to the control unit 5, the control unit 5 outputs the recognition result held in R to the downstream device, and the processing ends.

[0045] For the next sentence, the process is performed again from step S1.

[0046] When a sentence consists of three or more segments, the segments after the second segment, that is, the third segment onward, are ignored as portions corresponding to coughs or unnecessary words. In this case, therefore, if a silent interval of at least the predetermined time T₁ immediately follows the second segment, whatever follows it is ignored, so the speech recognition rate can be improved simply by having the speaker take care with the beginning of the utterance. That is, both the operability of the device and the speech recognition rate can be improved.

[0047] As described above, the voice recognition unit 3 outputs, as the recognition result, the candidate with the highest probability (likelihood; hereinafter referred to as the score) among the recognition result candidates obtained by the speech recognition processing. It can, however, also be made to output the candidates with the second-highest and lower scores.

[0048] For example, when the voice section corresponding to the sentence detected in step S1 contains two or more segments and "ward name" recognition result candidates are obtained from the first segment, the second segment can be recognized under the restriction of the highest-scoring "ward name" candidate. Several of the higher-scoring candidates obtained for the second segment can then be output together with the highest-scoring candidate (the recognition result) for the first segment.

[0049] Alternatively, the second segment can be recognized under the restriction of each of several "ward names" that are the higher-scoring candidates for the first segment, and the higher-scoring candidates obtained as a result can be output together with the corresponding higher-scoring candidates for the first segment.
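The N-best variant described above can be sketched as follows. This is only an illustration under stated assumptions: text stands in for the voice signal, the similarity ratio from `difflib.SequenceMatcher` stands in for the recognition score, and `TOWNS_BY_WARD`, `n_best`, and `n_best_addresses` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Tuple

# Hypothetical table: ward name -> town names in that ward.
TOWNS_BY_WARD: Dict[str, List[str]] = {
    "Shinagawa-ku": ["Kita-Shinagawa", "Osaki", "Gotanda"],
    "Setagaya-ku": ["Sangenjaya", "Kitazawa", "Okusawa"],
    "Minato-ku": ["Shiba", "Akasaka", "Roppongi"],
}


def n_best(observed: str, vocabulary: Sequence[str], n: int) -> List[Tuple[str, float]]:
    """Return the n highest-scoring (word, score) candidates.

    The string similarity ratio is a stand-in for the recognition score.
    """
    scored = [(word, SequenceMatcher(None, observed, word).ratio()) for word in vocabulary]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


def n_best_addresses(seg1: str, seg2: str, n_wards: int, n_towns: int) -> List[Tuple[str, str]]:
    """For each top ward candidate, recognize the second segment under that
    ward's restriction and pair the top town candidates with the ward."""
    pairs: List[Tuple[str, str]] = []
    for ward, _ward_score in n_best(seg1, list(TOWNS_BY_WARD), n_wards):
        for town, _town_score in n_best(seg2, TOWNS_BY_WARD[ward], n_towns):
            pairs.append((ward, town))
    return pairs
```

Setting `n_wards` to 1 corresponds to restricting the second segment by only the best ward candidate, while a larger value corresponds to repeating the restricted recognition for each of several ward candidates.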

[0050] Furthermore, when the voice section corresponding to the sentence detected in step S1 consists of a single segment, the higher-scoring recognition result candidates for that segment can be output.

[0051] Next, FIG. 5 is a flowchart illustrating another operation example of the voice recognition device of FIG. 1. In this case, steps S21 to S25 perform the same processing as steps S1 to S5 described for FIG. 4, respectively.

[0052] In step S26, the continuous word speech recognition result obtained in step S25 (which is the same as step S5 of FIG. 4) for the voice signal corresponding to the first segment is held in variable R₁, and its score is held in variable S₁.

[0053] The process then proceeds to step S27, where it is determined whether the recognition result held in variable R₁ in step S26 represents a "ward name". If it is determined in step S27 that the recognition result held in variable R₁ represents a "ward name", the process proceeds through steps S29 to S32 in order, where the same processing as in steps S9 to S12 described for FIG. 4 is performed, and the process ends.

[0054] On the other hand, if it is determined in step S27 that the recognition result held in variable R₁ does not represent a "ward name", that is, if it represents, for example, only a "town name" or a "ward name" + "town name", the process proceeds to step S33. There, continuous word speech recognition is performed on the voice signal corresponding to the next segment, that is, the second segment, and the process proceeds to step S34.

[0055] In step S34, the recognition result obtained in step S33 for the voice signal corresponding to the second segment is held in variable R₂, and its score is held in variable S₂. The process then proceeds to step S35, where it is determined whether the score held in variable S₁ is smaller than the score held in variable S₂. If it is determined in step S35 that the score held in variable S₁ is smaller than the score held in variable S₂, the process proceeds to step S36, where the recognition result of the second segment held in variable R₂ is held in variable R, and then to step S32.

[0056] Therefore, when the score for the recognition result of the second segment is higher than the score for the recognition result of the first segment, the first segment and the third and subsequent segments are ignored, and the recognition result of the second segment is output from the voice recognition unit 3 to the control unit 5 as the final recognition result.

[0057] On the other hand, if it is determined in step S35 that the score held in variable S₁ is not smaller than the score held in variable S₂, the process proceeds to step S37, where the recognition result of the first segment held in variable R₁ is held in variable R, and then to step S32.

[0058] Therefore, when the score for the recognition result of the first segment is equal to or higher than the score for the recognition result of the second segment, the second and subsequent segments are ignored, and the recognition result of the first segment is output from the voice recognition unit 3 to the control unit 5 as the final recognition result.

[0059] Utterances that are not recognition target words, such as unnecessary words, can normally be expected to receive low scores in the speech recognition processing. Therefore, when the recognition result of the first segment is not a "ward name", taking whichever of the first and second segment recognition results has the higher score as the final recognition result causes unnecessary words and the like to be ignored, which improves the recognition rate.
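The branch structure of FIG. 5 (steps S27 to S37) can be summarized in a short sketch. It assumes a sentence with at least two segments, and the recognizers are passed in as callables. The function name `select_result_fig5` and its parameters are hypothetical and serve only to make the control flow explicit.

```python
from typing import Callable, Tuple


def select_result_fig5(
    r1: str,
    s1: float,
    r1_is_ward_name: bool,
    recognize_second_restricted: Callable[[str], str],
    recognize_second_free: Callable[[], Tuple[str, float]],
) -> str:
    """Sketch of the FIG. 5 decision for a sentence with two or more segments.

    r1, s1: recognition result and score of the first segment (step S26).
    recognize_second_restricted: recognizes the second segment with the
        vocabulary restricted by the given ward name (steps S29 and S30).
    recognize_second_free: recognizes the second segment without that
        restriction and returns (result, score) (steps S33 and S34).
    """
    if r1_is_ward_name:  # S27
        r2 = recognize_second_restricted(r1)
        return f"{r1} {r2}"  # S31: "ward name" + "town name"
    r2, s2 = recognize_second_free()
    if s1 < s2:  # S35
        return r2  # S36: the second segment wins
    return r1  # S37: the first segment wins (also on a tie)
```

For example, if the first segment is a filler word recognized with a low score and the second segment is a town name recognized with a high score, the function returns the town name and the filler is dropped.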

[0060] In the case shown in FIG. 5 as well, as in the case of FIG. 4, not only the recognition result candidate with the highest score (the recognition result) but also the candidates with the second-highest and lower scores can be output.

[0061] Furthermore, although speech recognition is performed only up to the second segment in this case, speech recognition may instead be performed on all the segments contained in one sentence, and the recognition result of the segment with the highest score may be taken as the final recognition result.

[0062] For the next sentence, the process is performed again from step S21.

[0063] Next, the voice recognition device of FIG. 1 can also be made to operate according to the flowchart shown in FIG. 6. First, steps S41 to S46 perform the same processing as steps S21 to S26 of FIG. 5.

[0064] The process then proceeds to step S47, where it is determined whether the score for the recognition result of the first segment is equal to or less than a predetermined value C (for example, the minimum value that can be regarded as the score of a recognition result for a word in the recognition target vocabulary). If it is determined in step S47 that the score for the recognition result of the first segment is not equal to or less than the predetermined value C, the process proceeds to step S48, where the recognition result of the first segment held in variable R₁ is held in variable R, and then to step S49.

[0065] If it is determined in step S47 that the score for the recognition result of the first segment is equal to or less than the predetermined value C, the process proceeds to step S50, where continuous word speech recognition is performed on the voice signal corresponding to the next segment, that is, the second segment, and then to step S51. In step S51, the recognition result obtained in step S50 for the voice signal corresponding to the second segment is held in variable R, and the process proceeds to step S49.

[0066] In step S49, variable R is supplied from the voice recognition unit 3 to the control unit 5, the control unit 5 outputs the recognition result held in variable R to the downstream device, and the process ends.

[0067] Therefore, in FIG. 6, when the score for the recognition result of the first segment is greater than the predetermined value C, the second and subsequent segments are ignored and the recognition result of the first segment is taken as the final recognition result. When the score for the recognition result of the first segment is equal to or less than the predetermined value C, the first segment and the third and subsequent segments are ignored, and the recognition result of the second segment is taken as the final recognition result.

[0068] As described above, unnecessary words and the like can be expected to receive low scores in the speech recognition processing. Therefore, adopting the recognition result of the first segment when its score is high, and ignoring it in favor of the recognition result of the second segment when its score is low, causes unnecessary words and the like to be ignored, which improves the recognition rate.
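The threshold test of FIG. 6 (steps S47 to S51) reduces to a few lines. The sketch below assumes a sentence with at least two segments; `select_result_fig6` and `recognize_second` are hypothetical names, and the value of C is left to the caller because the text gives no concrete value for it.

```python
from typing import Callable


def select_result_fig6(
    r1: str,
    s1: float,
    c: float,
    recognize_second: Callable[[], str],
) -> str:
    """Sketch of FIG. 6: keep the first segment unless its score is <= C.

    r1, s1: recognition result and score of the first segment.
    c: the predetermined value C (supplied by the caller).
    recognize_second: performs recognition on the second segment (step S50);
        it is called only when the first segment is rejected.
    """
    if s1 > c:  # S47: score is not equal to or less than C
        return r1  # S48: later segments are ignored
    return recognize_second()  # S50, S51: the first segment is ignored
```

Passing the second-segment recognizer as a callable reflects the flowchart, in which the second segment is not recognized at all when the first segment's score exceeds C.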

[0069] In this case as well, as in the case of FIG. 4, not only the recognition result candidate with the highest score (the recognition result) but also the candidates with the second-highest and lower scores can be output.

[0070] For the next sentence, the process is performed again from step S41.

[0071] In the above, the utterance patterns were limited to three: "ward name" + "town name", "ward name" alone, and "town name" alone. A sentence therefore consists of at most two segments unless the utterance is quite unusual, which is why the third and subsequent segments are ignored in FIGS. 4 to 6. However, the third and subsequent segments may contain recognition target vocabulary, for example when the speaker restates the "ward name" + "town name", "ward name", or "town name" just uttered, or utters unnecessary words between recognition target words.

[0072] The voice recognition device of FIG. 1 can therefore be made to operate according to, for example, the flowchart shown in FIG. 7. In FIG. 7, steps S61 to S71 perform the same processing as steps S41 to S51 of FIG. 6, respectively, except that the processing of step S72 is performed between steps S67 and S68.

[0073] That is, in FIG. 7, when the score for the recognition result of the first segment is greater than the predetermined value C, step S72 treats the second and subsequent segments as the next sentence. After the processing of steps S68 and S69, the processing from step S62 onward is performed on the portion newly made into a sentence in step S72. In this case, therefore, recognition target vocabulary contained in the third and subsequent segments can still be recognized. However, since recognition target vocabulary is considered to occur only rarely in the third and subsequent segments, in practice it is more appropriate to ignore those segments as unnecessary words.
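The re-queuing behavior of step S72 can be sketched as a loop. This is an illustration under assumptions, not the flowchart itself: `recognize` is a hypothetical callable returning a (result, score) pair for a segment, and the handling of a low-scoring segment with nothing after it (emit nothing) is a choice made for this sketch, since the text does not describe that case here.

```python
from typing import Callable, List, Sequence, Tuple


def recognize_sentence_fig7(
    segments: Sequence[str],
    recognize: Callable[[str], Tuple[str, float]],
    c: float,
) -> List[str]:
    """Sketch of FIG. 7: an accepted first segment causes the remaining
    segments to be processed again as a new sentence (step S72)."""
    results: List[str] = []
    pending = list(segments)
    while pending:
        r1, s1 = recognize(pending[0])
        if s1 > c:
            # Accept the first segment, then treat the second and later
            # segments as the next sentence (S72) and start over.
            results.append(r1)
            pending = pending[1:]
        else:
            # As in FIG. 6: ignore the first segment, take the second,
            # and ignore anything after it.
            if len(pending) >= 2:
                r2, _s2 = recognize(pending[1])
                results.append(r2)
            break
    return results
```

With this loop, an utterance such as a ward name, a filler, and then a town name yields both the ward name and the town name, whereas the FIG. 6 procedure would stop after the ward name.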

[0074] In this case as well, as in the case of FIG. 4, not only the recognition result candidate with the highest score (the recognition result) but also the candidates with the second-highest and lower scores can be output.

[0075] For the sentence that was originally next, the process is performed again from step S61.

[0076] As described above, two predetermined times are used: T₁, for cutting out word-like portions of the speech, and T₂, for cutting out sentences, which relates, so to speak, to the user interface. Two voice sections separated by an interval shorter than T₁ are treated as one segment. Two voice sections (segments) separated by an interval in the range from T₁ to T₂ are included in one sentence. Two voice sections (segments) separated by an interval longer than T₂ are included in separate sentences. In addition, when a sentence contains a plurality of segments, the handling of the subsequent segments is determined according to the recognition result of the first segment. The operability of the device and the speech recognition rate can thereby be improved.
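The two-threshold rule summarized above can be sketched as a single pass over the detected voice sections. The function name `split_into_sentences` and the representation of a voice section as a (start, end) pair are assumptions made for this illustration. Time values only need to share one unit.

```python
from typing import List, Sequence, Tuple

Section = Tuple[int, int]  # (start, end) of a voice section, in one common time unit


def split_into_sentences(sections: Sequence[Section], t1: int, t2: int) -> List[List[Section]]:
    """Group voice sections into segments and sentences (requires t1 < t2).

    gap < t1        -> both sections belong to the same segment
    t1 <= gap <= t2 -> new segment in the same sentence
    gap > t2        -> new sentence
    `sections` must be sorted by start time and must not overlap.
    """
    sentences: List[List[Section]] = []
    for start, end in sections:
        if not sentences:
            sentences.append([(start, end)])
            continue
        prev_start, prev_end = sentences[-1][-1]
        gap = start - prev_end  # length of the silent interval
        if gap < t1:
            # Merge with the previous section into one segment.
            sentences[-1][-1] = (prev_start, end)
        elif gap <= t2:
            # Start a new segment within the current sentence.
            sentences[-1].append((start, end))
        else:
            # Start a new sentence.
            sentences.append([(start, end)])
    return sentences
```

As a usage example with illustrative thresholds (not values taken from the text), the sections (0, 1000), (1100, 2000), (2500, 3000), (5000, 6000) with `t1=300` and `t2=1000` yield one sentence with the two segments (0, 2000) and (2500, 3000), followed by a second sentence containing the single segment (5000, 6000).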

[0077] In this embodiment, the voice recognition unit 3 uses LPC cepstrum coefficients as the speech features, but the features are not limited to these. The speech recognition in the voice recognition unit 3 can also be performed according to an algorithm other than the HMM method.

[0078] Furthermore, in this embodiment the voice sections are detected on the basis of the short-time power of the voice signal, but they can also be detected on the basis of, for example, the number of zero crossings of the voice signal per unit time. Alternatively, the voice sections may be detected on the basis of both the short-time power of the voice signal and the number of zero crossings per unit time.
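The two features named above can be computed per frame as in the following sketch. The frame length, the power threshold, and the function names are assumptions for illustration. The text does not say how the features are thresholded or combined, so only a simple power-based decision is shown, with the zero-crossing count provided as a separate feature.

```python
from typing import List, Sequence


def short_time_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame: Sequence[float]) -> int:
    """Number of sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def voiced_frames(samples: Sequence[float], frame_len: int, power_threshold: float) -> List[bool]:
    """Mark each full frame as voice (True) when its short-time power
    exceeds the threshold. Trailing samples that do not fill a frame are
    dropped. The threshold is a hypothetical tuning parameter."""
    flags: List[bool] = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        flags.append(short_time_power(frame) > power_threshold)
    return flags
```

Runs of consecutive `True` frames would then form the voice sections, and runs of `False` frames the silent intervals whose lengths are compared with T₁ and T₂.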

[0079] Although the actual values of the predetermined times T₁ and T₂ were not mentioned in this embodiment, experiments have shown that values of approximately 0.3 ms and 1.0 s, respectively, are preferable.

[0080] Furthermore, in this specification, "equal to or greater than" and "equal to or less than" may be read as "greater than" and "less than", respectively, and "greater than" and "less than" may be read as "equal to or greater than" and "equal to or less than", respectively.

【００８１】[0081]

[Effects of the Invention] As described above, according to the present invention, the operability of the device can be improved and the speech recognition rate can also be improved.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the voice recognition device of the present invention.

FIG. 2 is a diagram illustrating the operation of the control unit 5 in the embodiment of FIG. 1.

FIG. 3 is a diagram illustrating the operation of the control unit 5 in the embodiment of FIG. 1.

FIG. 4 is a flowchart illustrating a first operation example of the embodiment of FIG. 1.

FIG. 5 is a flowchart illustrating a second operation example of the embodiment of FIG. 1.

FIG. 6 is a flowchart illustrating a third operation example of the embodiment of FIG. 1.

FIG. 7 is a flowchart illustrating a fourth operation example of the embodiment of FIG. 1.

[Explanation of symbols]

1 Microphone, 2 A/D converter, 3 Voice recognition unit, 4 Voice section detection unit, 5 Control unit

Continuation of the front page

(72) Inventor: Hiroaki Ogawa, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo
(72) Inventor: Masanori Omote, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo
(72) Inventor: Kazuo Watanabe, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo
(72) Inventor: Katsuki Minamino, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo

Claims

[Claims]

1. A voice recognition device comprising: detection means for detecting voice sections of an input voice; dividing means for comparing a non-voice section, which is the section between a voice section detected by the detection means and the voice section detected next by the detection means, with first and second set times, and for dividing the voice, according to the comparison result, into predetermined segment units and into sentence units composed of the segments; and recognition means for recognizing each sentence divided by the dividing means in units of the segments that compose the sentence.

2. The voice recognition device according to claim 1, wherein the first set time is shorter than the second set time, and the dividing means: when the non-voice section is shorter than the first set time, combines the voice sections before and after the non-voice section into one segment; when the non-voice section has a length between the first and second set times, makes each of the voice sections before and after the non-voice section one segment and combines the two segments into one sentence; and when the non-voice section is longer than the second set time, makes each of the voice sections before and after the non-voice section one sentence.

3. The voice recognition device according to claim 1 or 2, wherein, when the sentence is composed of a plurality of segments and at least one word is recognized from a preceding segment, the recognition means ignores the subsequent segments in the sentence.

4. The voice recognition device according to claim 1 or 2, wherein, when the sentence is composed of a plurality of segments and at least one word is recognized from a preceding segment, the recognition means treats all subsequent segments in the sentence as one new sentence and performs further speech recognition on it.

5. The voice recognition device according to claim 1 or 2, wherein, when the sentence is composed of a plurality of segments and a word is recognized from a preceding segment, the recognition means performs speech recognition on the segment following the preceding segment under a constraint corresponding to that word.

6. The voice recognition device according to claim 1 or 2, wherein, when the sentence is composed of a plurality of segments, the recognition means performs speech recognition on all the segments and takes, from the recognition results thus obtained, the one with the highest likelihood as the final recognition result.