JP6972287B2

JP6972287B2 - Speech recognition device, speech recognition method and speech recognition program

Info

Publication number: JP6972287B2
Application number: JP2020200894A
Authority: JP
Inventors: 直樹関根
Original assignee: Toshiba TEC Corp
Current assignee: Toshiba TEC Corp
Priority date: 2016-09-15
Filing date: 2020-12-03
Publication date: 2021-11-24
Anticipated expiration: 2036-09-15
Also published as: JP2021043465A

Description

Embodiments of the present invention relate to a speech recognition method, a speech recognition device that performs speech recognition by this method, and a speech recognition program for causing a computer to function as the speech recognition device.

In recent years, some electronic devices have been equipped with a speech recognition device: when the user speaks a desired operation, the device recognizes the speech and performs the corresponding operation. Such an electronic device usually has an utterance button, and the speech recognition device starts accepting speech input when the user operates this button. However, if the user starts speaking before speech input is being accepted, the head portion of the speech signal is not recorded, and the speech recognition device may misrecognize the utterance.

To prevent this problem, a technique is known in which a predetermined image is displayed on a display screen from the time the utterance button is operated until speech input can be accepted, thereby informing the user when to start speaking. However, this technique is applicable only to electronic devices that have a display screen, and displaying the image increases the processing load on the processor that controls the electronic device.

Japanese Unexamined Patent Publication No. 2003-177789

The problem to be solved by the embodiments of the present invention is to provide a speech recognition technique that can reduce misrecognition caused by failure to record the head portion of a speech signal, without informing the user when to start speaking.

In one embodiment, a speech recognition device includes a recording unit, a receiving means, a recognition means, and a correction means. The recording unit records a speech signal input via a speech input means. The receiving means receives an instruction to start speech input from the speech input means. The recognition means recognizes an utterance from the speech signal recorded in the recording unit after the start instruction is received. When the leading word of the utterance recognized by the recognition means is a vowel, the correction means calculates probabilities for the connection patterns between the words obtained by successively prepending consonants to that vowel and the second and subsequent words of the utterance, and corrects the utterance to the one whose connection pattern has the maximum probability.

FIG. 1 is a block diagram of a speech recognition device according to one embodiment.
FIG. 2 is a diagram showing an example of the word dictionary file held by the speech recognition device.
FIG. 3 is a diagram showing an example of the language dictionary file held by the speech recognition device.
FIG. 4 is a flow chart showing the information processing procedure that the processor of the speech recognition device executes according to the speech recognition program.
FIG. 5 is a diagram showing an example of a speech signal waveform.
FIG. 6 is a diagram showing another example of a speech signal waveform.
FIG. 7 is a state transition diagram for speech recognition.
FIG. 8 is a state transition diagram for utterance correction.

An embodiment of a speech recognition device that can reduce misrecognition caused by failure to record the head portion of a speech signal, without informing the user when to start speaking, is described below with reference to the drawings.

FIG. 1 is a block diagram showing the main configuration of the speech recognition device 10 of this embodiment. On receiving an instruction to start speech input, the speech recognition device 10 recognizes an utterance from the speech signal input after the instruction. The speech recognition device 10 then decides, based on the time from receipt of the start instruction until the speech signal is input, whether to correct the recognition result, and if so, corrects the recognized utterance. Such a speech recognition device 10 is built into electronic devices such as portable ordering terminals used in restaurants and maintenance recording terminals used in maintenance work on industrial equipment, where it supports input by the user's speech.

As shown in FIG. 1, the speech recognition device 10 includes a processor 11, a main memory 12, an auxiliary storage device 13, a clock unit 14, a digitizing unit 15, an input port 16, a plurality of device interfaces 17 and 18, an output unit 19, and so on. The speech recognition device 10 also has a bus line BL including an address bus, a data bus, and the like, to which the processor 11, the main memory 12, the auxiliary storage device 13, the clock unit 14, the digitizing unit 15, the input port 16, the device interfaces 17 and 18, and the output unit 19 are connected.

The digitizing unit 15 is connected to a microphone 20, which serves as the speech input means, and converts the analog speech signal input via the microphone 20 into a digital speech signal. The microphone 20 may be built into the electronic device equipped with the speech recognition device 10, or may be detachably connected externally. If the microphone 20 is of a type that outputs the speech signal as digital data, the digitizing unit 15 can be omitted.

The input port 16 is connected to an utterance button 30, which serves as the means for instructing the start of speech input, and receives the on signal of the utterance button 30. The user holds down the utterance button 30 while speaking into the microphone 20. The utterance button 30 outputs the on signal while it is pressed. Alternatively, the utterance button 30 may be of a type that starts outputting the on signal on the first press and stops it on the second press.

The device interface 17 is connected to an input device 40 and takes in input data from the input device 40 according to a predetermined protocol. The input device 40 is a keyboard, a touch panel, a pointing device, or the like. The device interface 18 is connected to a display device 50 and outputs display data to the display device 50 according to a predetermined protocol. The display device 50 is a liquid crystal display, a plasma display, an EL (electroluminescent) display, or the like. The devices connected to the device interfaces 17 and 18 are not limited to the input device 40 and the display device 50. For example, a printer may be connected instead of the display device 50, and a barcode reader, an RFID reader/writer, a card reader/writer, or the like may be connected instead of the input device 40.

Incidentally, the microphone 20 serving as the speech input means, the utterance button 30 serving as the start instruction means, the input device 40, and the display device 50 are provided on the electronic device equipped with the speech recognition device 10. In that case, the utterance button 30 may be provided on a keyboard or touch panel, each of which is a kind of input device 40.

In the speech recognition device 10, the processor 11, the main memory 12, the auxiliary storage device 13, and the bus line BL connecting them constitute a computer.
The processor 11 corresponds to the central part of this computer. In accordance with an operating system and application programs, the processor 11 controls each unit so as to realize the functions of the speech recognition device 10.

The main memory 12 corresponds to the main storage part of the computer. The main memory 12 includes a non-volatile memory area and a volatile memory area. It stores the operating system and application programs in the non-volatile memory area. It also stores, in the non-volatile or volatile memory area, the data the processor 11 needs in order to execute the processing that controls each unit.

The main memory 12 uses the volatile memory area as a recording unit for the speech signal input via the microphone 20. That is, the main memory 12 has an area in which the speech signal converted into digital data by the digitizing unit 15 is repeatedly overwritten and stored in a predetermined buffering unit. This recording area may instead be formed in the auxiliary storage device 13.
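The recording area described above can be pictured as a fixed-size buffer whose contents are replaced each time a new buffering unit arrives. The following is only a minimal sketch under that reading; the class name `RecordingArea`, the method names, and the byte-oriented layout are hypothetical and do not come from the patent.

```python
class RecordingArea:
    """Fixed-size area that holds one buffering unit of digitized speech.

    Hypothetical illustration: each call to store() overwrites the
    previously stored unit, as the description says the area is
    "repeatedly overwritten".
    """

    def __init__(self, unit_size: int) -> None:
        self._buffer = bytearray(unit_size)  # storage for one buffering unit
        self._filled = 0                     # number of valid bytes stored

    def store(self, chunk: bytes) -> None:
        """Overwrite the area with a new chunk of sound data."""
        if len(chunk) > len(self._buffer):
            raise ValueError("chunk is larger than the buffering unit")
        self._buffer[: len(chunk)] = chunk
        self._filled = len(chunk)

    def read(self) -> bytes:
        """Return the most recently stored chunk."""
        return bytes(self._buffer[: self._filled])
```

A caller would create the area once and pass each digitized chunk to `store()` before the threshold determination and recognition steps read it back.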

The auxiliary storage device 13 corresponds to the auxiliary storage part of the computer. For example, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like is used as the auxiliary storage device 13. The auxiliary storage device 13 stores data that the processor 11 uses in performing various kinds of processing and data generated by the processing of the processor 11. The auxiliary storage device 13 may also store the application programs mentioned above.

The auxiliary storage device 13 stores a word dictionary file 131 and a language dictionary file 132, both required for speech recognition. As the example in FIG. 2 shows, the word dictionary file 131 is a data file in which various words and their readings are recorded in advance. For example, the word dictionary file 131A records, for the words 「焼き」 (grilled), 「秋」 (autumn), 「肉」 (meat), 「行く」 (go), 「柿」 (persimmon), 「咲き」 (blooming), 「滝」 (waterfall), 「泣き」 (crying), 「破棄」 (discard), 「薪」 (firewood), and 「脇」 (side), the readings "yaki", "aki", "niku", "iku", "kaki", "saki", "taki", "naki", "haki", "maki", and "waki", respectively.

As the examples in FIGS. 3(a) and 3(b) show, the language dictionary file 132 is a data file in which the probabilities of connections between various words are recorded in advance. For example, the language dictionary file 132A records, for words following the word 「焼き」, a probability of "0.1" that 「焼き」 follows, "0.1" that 「秋」 follows, "0.5" that 「肉」 follows, and "0.1" that 「行く」 follows. Similarly, for words following the word 「秋」, the language dictionary file 132A records a probability of "0.1" that 「焼き」 follows, "0.1" that 「秋」 follows, "0.1" that 「肉」 follows, and "0.2" that 「行く」 follows.

The language dictionary file 132B, on the other hand, records, for words preceding the word 「行く」, a probability of "0.2" that 「柿」 precedes it, "0.1" for 「咲き」, "0.1" for 「滝」, "0.1" for 「泣き」, "0.1" for 「破棄」, "0.1" for 「薪」, and "0.1" for 「脇」. Similarly, for words preceding the word 「肉」, the language dictionary file 132B records a probability of "0.3" that 「柿」 precedes it, "0.1" for 「咲き」, "0.1" for 「滝」, "0.1" for 「泣き」, "0.1" for 「破棄」, "0.1" for 「薪」, and "0.2" for 「脇」.
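As an illustration of the two dictionary files, the example entries quoted above can be held in two lookup tables. This is a sketch only: the patent does not specify a file format, and the names `WORD_DICT`, `CONNECTION_PROB`, `words_for_reading`, and `connection_probability` are hypothetical. Only the values come from the examples for files 131A, 132A, and 132B.

```python
# Word dictionary (example 131A): word -> reading.
WORD_DICT = {
    "焼き": "yaki", "秋": "aki", "肉": "niku", "行く": "iku",
    "柿": "kaki", "咲き": "saki", "滝": "taki", "泣き": "naki",
    "破棄": "haki", "薪": "maki", "脇": "waki",
}

# Language dictionary: (preceding word, following word) -> probability.
CONNECTION_PROB = {
    # Example 132A: words following 焼き and 秋.
    ("焼き", "焼き"): 0.1, ("焼き", "秋"): 0.1,
    ("焼き", "肉"): 0.5, ("焼き", "行く"): 0.1,
    ("秋", "焼き"): 0.1, ("秋", "秋"): 0.1,
    ("秋", "肉"): 0.1, ("秋", "行く"): 0.2,
    # Example 132B: words preceding 行く and 肉.
    ("柿", "行く"): 0.2, ("咲き", "行く"): 0.1, ("滝", "行く"): 0.1,
    ("泣き", "行く"): 0.1, ("破棄", "行く"): 0.1, ("薪", "行く"): 0.1,
    ("脇", "行く"): 0.1,
    ("柿", "肉"): 0.3, ("咲き", "肉"): 0.1, ("滝", "肉"): 0.1,
    ("泣き", "肉"): 0.1, ("破棄", "肉"): 0.1, ("薪", "肉"): 0.1,
    ("脇", "肉"): 0.2,
}


def words_for_reading(reading: str) -> list:
    """Return every dictionary word that has the given reading."""
    return [word for word, r in WORD_DICT.items() if r == reading]


def connection_probability(first: str, second: str) -> float:
    """Probability that `second` follows `first`; 0.0 if not recorded."""
    return CONNECTION_PROB.get((first, second), 0.0)
```

Keying the probability table by word pair keeps the "following" view of 132A and the "preceding" view of 132B in one structure, which is a design choice of this sketch rather than something stated in the source.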

The description now returns to FIG. 1.
The clock unit 14 functions as the time information source of the speech recognition device 10. The processor 11 keeps the current date and time based on the time information provided by the clock unit 14. The clock unit of the electronic device in which the speech recognition device 10 is installed may double as the clock unit 14.

The output unit 19 outputs the utterance data that results from recognition by the speech recognition device 10 to the outside. The output destination is, for example, the control unit of the electronic device equipped with the speech recognition device 10.

In the speech recognition device 10 configured in this way, the processor 11 functions as a press detection unit 111, a threshold determination unit 112, a speech recognition unit 113, a correction unit 114, and an output control unit 115. These functions are realized by the processor 11 performing information processing according to the speech recognition program. The speech recognition program is stored in the main memory 12 or the auxiliary storage device 13. The speech recognition program need not be stored there in advance, however. A speech recognition program supplied separately from the electronic device may be written, in response to an operation by the user or the like, into a writable storage device of the electronic device equipped with the speech recognition device 10. The speech recognition program can be supplied on a removable recording medium or by communication over a network. The recording medium may take any form, such as a CD-ROM or a memory card, as long as it can store the program and the device can read it.

FIG. 4 is a flow chart showing the information processing procedure that the processor 11 executes according to the speech recognition program. The processing shown in FIG. 4 and described below is only an example; the procedure and its contents are not particularly limited as long as similar results can be obtained.

When the speech recognition program starts, the processor 11 waits, as Act 1, for the utterance button 30 to be pressed. When the on signal is input via the input port 16, the processor 11 detects that the utterance button 30 has been pressed (YES in Act 1). Then, as Act 2, the processor 11 stores the time kept by the clock unit 14 in a predetermined area of the main memory 12 as the detection time P (first time acquisition means). By executing Acts 1 and 2, the processor 11 functions as the press detection unit (receiving means) 111.

After storing the detection time P, the processor 11 waits, as Act 3, for a speech signal to be input. When a digitized speech signal, referred to here as sound data, is input via the digitizing unit 15 (YES in Act 3), the processor 11 stores, as Act 4, the time kept by the clock unit 14 in a predetermined area of the main memory 12 as the voice start time D (second time acquisition means). As Act 5, the processor 11 also records the sound data in the recording unit of the main memory 12.

As Act 6, the processor 11 performs a threshold determination on the sound data. The threshold determination excludes sound that is constantly present in the surroundings from recognition, so that only the speech uttered by the user is recognized. Specifically, the processor determines whether the sound data of the predetermined buffering unit recorded in the recording unit is at or above a predetermined volume THP, and if it is, treats that sound data as a recognition target.

As Act 7, the processor 11 checks the result of the threshold determination. If the sound data is not a recognition target (NO in Act 7), the processor 11 returns to Act 3 and repeats the processing from Act 3 onward. If the sound data is a recognition target (YES in Act 7), the processor 11 proceeds to Act 8. By executing Acts 6 and 7, the processor 11 functions as the threshold determination unit 112.
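The threshold determination of Acts 6 and 7 can be sketched as a loudness check on one buffering unit. The patent does not say how the volume is measured; the sketch below assumes, purely for illustration, that the buffering unit is a sequence of signed PCM sample values and that volume is their root-mean-square amplitude. The function names and the parameter `thp` (standing for the predetermined volume THP) are hypothetical.

```python
import math
from typing import Sequence


def rms_volume(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one buffering unit (assumed measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_recognition_target(samples: Sequence[int], thp: float) -> bool:
    """Acts 6-7: the unit is a recognition target if its volume is >= THP."""
    return rms_volume(samples) >= thp
```

In the flow of FIG. 4, a `False` result corresponds to returning to Act 3 and waiting for the next sound data, while `True` corresponds to proceeding to Act 8.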

In Act 8, the processor 11 performs speech recognition. That is, the processor 11 calculates the speech feature values of the sound data recorded in the recording unit, taking its frequency characteristics into account. The processor 11 then performs probabilistic pattern recognition using the data of the word dictionary file 131 and the language dictionary file 132, thereby producing from the sound data a character string recognized as the utterance. The resulting character string is temporarily stored in the main memory 12. Since such speech recognition methods are well known, a detailed description is omitted here. The speech recognition method is not particularly limited, and another method may be used to recognize a character string as the utterance from the sound data. By executing Act 8, the processor 11 functions as the speech recognition unit (recognition means) 113.

When speech recognition of the sound data is finished, the processor 11 determines, as Act 9, whether to correct the recognition result (determination means). Specifically, the processor 11 checks whether the elapsed time (D − P) from the detection time P acquired in Act 2 to the voice start time D acquired in Act 4 is shorter than a preset threshold time T. If it is shorter, the processor 11 determines that correction is necessary. If it is not shorter, the processor 11 determines that correction is unnecessary. When correction is determined to be necessary (NO in Act 9), the processor 11 executes Act 10 and then proceeds to Act 11. When correction is determined to be unnecessary (YES in Act 9), the processor 11 proceeds to Act 11 without executing Act 10.
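The Act 9 decision reduces to a single comparison of D − P against T. The sketch below assumes, for illustration only, that all three values are integer millisecond counts; the function and parameter names are hypothetical.

```python
def needs_correction(detection_time_p: int, voice_start_time_d: int,
                     threshold_time_t: int) -> bool:
    """Act 9: correct only when the elapsed time D - P is shorter than T."""
    elapsed = voice_start_time_d - detection_time_p
    return elapsed < threshold_time_t
```

Note that an elapsed time exactly equal to T counts as "not shorter", so no correction is performed in that case.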

In Act 10, the processor 11 corrects the utterance recognized in Act 8 (correction means). The correction method is described later. By executing Acts 9 and 10, the processor 11 functions as the correction unit 114.

In Act 11, the processor 11 outputs the data of the utterance recognized in Act 8, or of the utterance corrected in Act 10, to the outside via the output unit 19. Alternatively, the processor 11 may output the utterance data to the display device 50 so that the recognition result is shown on its screen. By executing Act 11, the processor 11 functions as the output control unit 115.
This completes the processing of the processor 11 based on the speech recognition program.
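The flow of Acts 3 through 11 can be summarized as a short control loop. This is a rough sketch under several assumptions: sound data arrives as (arrival time, samples) pairs, and the loudness measure, recognizer, and corrector are supplied by the caller as plain functions. The function `run_recognition` and all of its parameter names are hypothetical; the patent's recognizer itself is not reproduced here.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def run_recognition(
    press_time: float,                        # detection time P (Act 2)
    chunks: Iterable[Tuple[float, List[int]]],  # (time D, sound data) pairs
    volume_threshold: float,                  # predetermined volume THP
    threshold_time: float,                    # threshold time T
    loudness: Callable[[List[int]], float],
    recognize: Callable[[List[int]], str],
    correct: Callable[[str], str],
) -> Optional[str]:
    """Return the utterance to output, or None if no chunk was loud enough."""
    for start_time, samples in chunks:            # Acts 3-5: sound data arrives
        if loudness(samples) < volume_threshold:  # Acts 6-7: below THP, wait again
            continue
        utterance = recognize(samples)            # Act 8: speech recognition
        if start_time - press_time < threshold_time:  # Act 9: D - P shorter than T
            utterance = correct(utterance)        # Act 10: correction
        return utterance                          # Act 11: output
    return None
```

Passing the recognizer and corrector in as callables keeps the sketch independent of any particular recognition method, in line with the statement that the method is not limited.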

FIGS. 5 and 6 show specific examples of the speech signal (analog data) when the user utters 「や・き・に・く」 (ya-ki-ni-ku). In the example of FIG. 5, the press detection time P of the utterance button 30 is denoted "P1" and the voice start time D is denoted "D1". The elapsed time from the press detection time P to the voice start time D is therefore T1. Similarly, in the example of FIG. 6, the press detection time P is denoted "P2" and the voice start time D is denoted "D2", so the elapsed time from the press detection time P to the voice start time D is T2.

In the example of FIG. 5, the elapsed time T1 is sufficiently long, so nothing is missing from the head of the sound data recorded in the recording unit. As a result, the sound data is "ya·ki·ni·ku" and the recognized utterance is 「焼き肉」 (yakiniku, grilled meat). In the example of FIG. 6, by contrast, the elapsed time T2 is short, so the head portion "y" of the sound data recorded in the recording unit is missing. As a result, the sound data is "a·ki·ni·ku", and the recognized utterance, following the state transition diagram of FIG. 7, is 「秋行く」 (aki iku). That is, the word 「秋」 is recognized from the first sound data "a" and the next sound data "ki"; since the probability is 0.1 when the sound data following 「秋」 is "niku" and 0.2 when it is "iku", the utterance is recognized as 「秋行く」.

In the speech recognition device 10 of this embodiment, the processor 11 performs utterance correction when the elapsed time from the press detection time P to the voice start time D is shorter than the preset threshold time T. Assume now that the threshold time T satisfies T1 > T > T2. In this case, the processor 11 performs no correction in the example of FIG. 5 but does perform correction in the example of FIG. 6.

Specifically, when the leading "a" of the sound data "a·ki·ni·ku" is a vowel, the processor 11 first prepends to this vowel, in turn, each of the consonants "k, s, t, n, h, m, y, r, w". That is, the processor 11 creates the sound data "ka·ki·ni·ku", "sa·ki·ni·ku", "ta·ki·ni·ku", "na·ki·ni·ku", "ha·ki·ni·ku", "ma·ki·ni·ku", "ya·ki·ni·ku", "ra·ki·ni·ku", and "wa·ki·ni·ku". The processor 11 then runs the pattern recognition process again on each of these, using the word dictionary file 131 and the language dictionary file 132. Assuming that the state transition diagram shown in FIG. 8 is obtained as a result, the processor 11 selects from it the sound data with the highest connection probability, "ya·ki·ni·ku". The processor 11 then corrects the utterance 「秋行く」 to 「焼き肉」.

Thus, with the speech recognition device 10 of this embodiment, even when the time from the user pressing the utterance button 30 to the start of speech is short, so that the head of the speech signal recorded in the recording unit is missing and the utterance is misrecognized, the utterance can be corrected with high probability. Misrecognition caused by failure to record the head portion of the speech signal can therefore be reduced without informing the user when to start speaking, and a speech recognition device with high recognition accuracy can be provided.

In addition, with the speech recognition device 10 there is no need to display a predetermined image on a display screen to inform the user when to start speaking. The device can therefore be installed in electronic devices that have no display screen, and there is no concern that the processing load on the processor 11 will increase.
Furthermore, the processor 11 of the speech recognition device 10 performs correction when the time from the user pressing the utterance button 30 to the start of speech is shorter than the predetermined threshold time T, and does not perform correction when that time is equal to or longer than the threshold time T. Because the correction process is needed only when the user speaks immediately after pressing the utterance button 30, the processing load on the processor 11 does not increase significantly in this respect either.

Further, the processor 11 includes a first time acquisition means for acquiring a first time at which the start instruction is received, and a second time acquisition means for acquiring a second time at which input of the voice signal starts. The time from the user pressing the utterance button 30 to the start of the utterance can therefore be determined accurately, so setting an appropriate threshold time T prevents unnecessary correction processing from being performed.
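The timing check described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the names `UtteranceTiming` and `should_correct` are hypothetical, times are assumed to be in seconds, and the threshold value passed in is chosen by the caller because this section does not state a concrete value for T.

```python
from dataclasses import dataclass


@dataclass
class UtteranceTiming:
    """Hypothetical container for the two acquired times (in seconds)."""
    press_time: float        # first time: start instruction (button press) received
    voice_start_time: float  # second time: input of the voice signal started


def should_correct(timing: UtteranceTiming, threshold_t: float) -> bool:
    """Return True only when the elapsed time is shorter than threshold T.

    Elapsed times equal to or longer than T mean no correction,
    matching the rule stated in the description.
    """
    elapsed = timing.voice_start_time - timing.press_time
    return elapsed < threshold_t
```

With an assumed threshold of 0.25 s, an utterance that starts 0.1 s after the press would be flagged for correction, while one that starts exactly 0.25 s after the press would not.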

Further, when correcting a recognized voice utterance, the processor 11 replaces the first word of the voice utterance with another word that connects with the second and subsequent words of the voice utterance. The correction process is therefore relatively simple and can be executed in a short time, so there is no concern that the processing load of the processor 11 will increase significantly and lower the recognition speed.
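One way to picture the first-word replacement is the following Python sketch, which follows the vowel case recited in the claims: consonants are added in turn to a leading vowel, and the candidate whose connection with the remaining words has the highest probability is kept. Everything concrete here is an assumption for illustration: the romanized `VOWELS` and `CONSONANTS` lists, the function name `correct_first_word`, and the `connection_prob` table, which stands in for whatever the language dictionary file provides.

```python
# Hypothetical romanized symbol sets; the actual dictionaries are not given here.
VOWELS = {"a", "i", "u", "e", "o"}
CONSONANTS = ["k", "s", "t", "n", "h", "m", "y", "r", "w"]


def correct_first_word(words, connection_prob):
    """Replace a leading vowel with the best consonant+vowel candidate.

    words: recognized utterance as a list of words.
    connection_prob: hypothetical table mapping a tuple of words
        (a connection pattern) to its probability.
    Returns a new list; the input is returned unchanged (as a copy) when the
    first word is not a vowel or no candidate has a known probability.
    """
    if not words or words[0] not in VOWELS:
        return list(words)
    rest = tuple(words[1:])
    best, best_p = None, 0.0
    for consonant in CONSONANTS:
        candidate = (consonant + words[0],) + rest
        p = connection_prob.get(candidate, 0.0)
        if p > best_p:
            best, best_p = candidate, p
    return list(best) if best is not None else list(words)
```

For example, with a table that assigns 0.6 to `("ha", "wo", "migaku")` and 0.01 to `("ka", "wo", "migaku")`, the recognized sequence `["a", "wo", "migaku"]` would be corrected to `["ha", "wo", "migaku"]`. Because only a handful of candidates are scored, the work per utterance stays small, which is consistent with the low processing load noted above.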

Other embodiments are described below.
In the above embodiment, the processor 11 stores the detection time P in Act 2 of FIG. 4 and stores the voice start time D in Act 4. In another embodiment, the processor 11 starts a timer upon detecting in Act 1 that the utterance button 30 has been pressed, and stops the timer upon detecting the input of sound data in Act 3. Then, in Act 9, the processor 11 compares the time measured by the timer with the threshold time T and determines whether to perform the correction process. This configuration provides the same effects as the above embodiment.
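The timer-based variant can be sketched in Python as follows. This is only an illustration under assumptions: the class name `UtteranceTimer`, the function `needs_correction`, and the injectable `clock` parameter are hypothetical, and `time.monotonic` is used as the default clock simply because it is not affected by system clock adjustments.

```python
import time


class UtteranceTimer:
    """Hypothetical timer started on button press and stopped on sound input."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so the timer can be tested
        self._started_at = None
        self.elapsed = None        # measured time in seconds, set by stop()

    def start(self):
        """Corresponds to Act 1: the utterance button press is detected."""
        self._started_at = self._clock()
        self.elapsed = None

    def stop(self):
        """Corresponds to Act 3: input of sound data is detected."""
        if self._started_at is None:
            raise RuntimeError("timer was not started")
        self.elapsed = self._clock() - self._started_at
        return self.elapsed


def needs_correction(timer, threshold_t):
    """Corresponds to Act 9: compare the measured time with threshold T."""
    if timer.elapsed is None:
        raise RuntimeError("timer was not stopped")
    return timer.elapsed < threshold_t
```

Compared with storing the two times P and D, this variant keeps only a single measured duration, but the decision it feeds into is the same comparison against the threshold time T.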

In addition, although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.
The inventions recited in the claims as originally filed in the present application are appended below.
[1] A voice recognition device comprising: a recording unit that records a voice signal input via a voice input means; a reception means that receives a voice input start instruction; a recognition means that recognizes a voice utterance from the voice signal recorded in the recording unit after the reception means receives the start instruction; a determination means that determines whether to correct the recognition result of the voice utterance based on the time from when the reception means receives the start instruction until a voice signal is input via the voice input means; and a correction means that corrects the voice utterance that the determination means has determined should be corrected.
[2] The voice recognition device according to appendix [1], wherein the determination means determines that correction is to be performed when the time is shorter than a predetermined threshold time.
[3] The voice recognition device according to appendix [2], further comprising: a first time acquisition means that acquires a first time at which the reception means receives the start instruction; and a second time acquisition means that acquires a second time at which input of a voice signal via the voice input means starts, wherein the determination means determines that correction is to be performed when the elapsed time from the first time to the second time is shorter than the threshold time.
[4] The voice recognition device according to any one of appendices [1] to [3], wherein the correction means performs the correction by replacing the first word of the voice utterance recognized by the recognition means with another word that connects with the second and subsequent words of the voice utterance.
[5] A voice recognition method comprising: receiving a voice input start instruction; recognizing a voice utterance from a voice signal input via a voice input means after the start instruction is received; determining whether to correct the recognition result of the voice utterance based on the time from when the start instruction is received until the voice signal is input; and, when correction is to be performed, correcting the recognized voice utterance.
[6] A voice recognition program for causing a computer, to which a voice input means is connected and which includes a recording unit that records a voice signal input via the voice input means, to realize: a function of receiving a voice input start instruction; a function of recognizing a voice utterance from the voice signal recorded in the recording unit after the start instruction is received; a function of determining whether to correct the recognition result of the voice utterance based on the time from when the start instruction is received until a voice signal is input via the voice input means; and a function of correcting the voice utterance determined to be corrected.

10 ... voice recognition device, 11 ... processor, 12 ... main memory, 13 ... auxiliary storage device, 14 ... clock unit, 19 ... output unit, 20 ... microphone, 30 ... utterance button, 111 ... press detection unit, 112 ... threshold determination unit, 113 ... voice recognition unit, 114 ... correction unit, 115 ... output control unit, 131, 131A ... word dictionary file, 132, 132A, 132B ... language dictionary file.

Claims

A voice recognition device comprising:
a recording unit that records a voice signal input via a voice input means;
a reception means that receives a voice input start instruction;
a recognition means that recognizes a voice utterance from the voice signal recorded in the recording unit after the reception means receives the start instruction; and
a correction means that, when the first word of the voice utterance recognized by the recognition means is a vowel, calculates a probability for each connection pattern between a word formed by sequentially adding consonants to the vowel and the second and subsequent words of the voice utterance, and corrects the voice utterance to the connection pattern with the maximum probability.

The voice recognition device according to claim 1, further comprising:
a first time acquisition means that acquires a first time at which the reception means receives the start instruction; and
a second time acquisition means that acquires a second time at which input of a voice signal via the voice input means starts,
wherein the correction means performs the correction when the elapsed time from the first time to the second time is shorter than a predetermined threshold time.

A voice recognition method comprising:
receiving a voice input start instruction;
recognizing a voice utterance from a voice signal input via a voice input means after the start instruction is received; and
when the first word of the recognized voice utterance is a vowel, calculating a probability for each connection pattern between a word formed by sequentially adding consonants to the vowel and the second and subsequent words of the voice utterance, and correcting the voice utterance to the connection pattern with the maximum probability.

A voice recognition program for causing a computer, to which a voice input means is connected and which includes a recording unit that records a voice signal input via the voice input means, to realize:
a function of receiving a voice input start instruction;
a function of recognizing a voice utterance from the voice signal recorded in the recording unit after the start instruction is received; and
a function of, when the first word of the recognized voice utterance is a vowel, calculating a probability for each connection pattern between a word formed by sequentially adding consonants to the vowel and the second and subsequent words of the voice utterance, and correcting the voice utterance to the connection pattern with the maximum probability.