JP2014092627A

JP2014092627A - Voice recognition device, voice recognition method and program for the same

Info

Publication number: JP2014092627A
Application number: JP2012242050A
Authority: JP
Inventors: Kazuyoshi Katsuta; 和義割田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-11-01
Filing date: 2012-11-01
Publication date: 2014-05-19

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device capable of suppressing, with higher precision, false recognition of the voice of a person who is not speaking to the voice recognition device. SOLUTION: The voice recognition device 1 includes: a voice recognition unit 131 for converting voice information, obtained by collecting voice in front of the voice recognition device 1, into character information and transmitting it; a person detection unit 110 for detecting a person in front of the voice recognition device 1 and outputting person detection information; a face detection unit 121 for outputting face detection information if image information capturing the area in front of the voice recognition device 1 includes the face of a person looking directly at the voice recognition device 1; a response possibility determination unit 100 for transmitting the character information upon receipt of the character information from the voice recognition unit 131, the person detection information, and the face detection information; a response control unit 101 for determining a response message corresponding to the character information received from the response possibility determination unit 100 by selecting one or a plurality of phrases from a response database 141, which stores response messages for a person phrase by phrase; and a response output unit 140 for outputting the response message as voice.

Description

The present invention relates to a voice recognition device, a voice recognition method, and a program therefor that accurately detect and respond to utterances directed at the device.

In robot voice recognition, techniques are known for accurately detecting and responding to utterances addressed to a robot. Such a technique distinguishes whether or not an utterance is directed at the robot.

One example of such a technique is described in Patent Document 1. The technique described in Patent Document 1 senses that a person is nearby based on person detection information output from a camera, a contact sensor, or a human detection sensor, and extracts face information from the camera's image information. When a face is detected, the technique treats the utterance as directed at the robot and responds.

Patent Document 1: JP 2007-155985 A

However, in the technique described in Patent Document 1, sensors and cameras cover all 360 degrees around the robot, so the robot reacts to a person in any direction. The technique therefore has the problem that it misrecognizes utterances of people who are not speaking to the robot.

One example of an object of the present invention is to provide a voice recognition device, a voice recognition method, and a program that can solve the problem described above.

A first voice recognition device according to one aspect of the present invention includes: a voice input unit that collects voice in front of the voice recognition device and outputs it as voice information; a voice recognition unit that converts the voice information received from the voice input unit into character information; a person detection unit that detects a person in front of the voice recognition device and outputs person detection information; a person photographing unit that photographs the area in front of the voice recognition device and outputs image information; a face detection unit that detects, from the image information received from the person photographing unit, the face of a person looking directly at the voice recognition device and outputs face detection information; a response availability determination unit that outputs the character information in response to receiving the character information from the voice recognition unit and receiving the person detection information from the person detection unit and the face detection information from the face detection unit; a response database that stores response messages from the voice recognition device to a person, organized phrase by phrase; a response control unit that determines a response message corresponding to the character information received from the response availability determination unit by selecting one or more phrases from the response database; and a response output unit that outputs the response message determined by the response control unit as voice.

In a first voice recognition method according to one aspect of the present invention: a voice input unit collects voice in front of a voice recognition device as voice information; a voice recognition unit converts the voice information received from the voice input unit into character information; a response availability determination unit outputs the character information in response to receiving the voice information from the voice recognition unit and receiving person detection information from a person detection unit and face detection information from a face detection unit; a response database stores response messages from the voice recognition device to a person, organized phrase by phrase; a response control unit determines a response message corresponding to the voice information received from the response availability determination unit by selecting one or more phrases from the response database; and a response output unit outputs the response message as voice.

A first program according to one aspect of the present invention causes a computer to execute processing in which: a voice input unit collects voice in front of a voice recognition device as voice information; a voice recognition unit converts the voice information received from the voice input unit into character information; a response availability determination unit outputs the character information in response to receiving the voice information from the voice recognition unit and receiving person detection information from a person detection unit and face detection information from a face detection unit; a response database stores response messages from the voice recognition device to a person, organized phrase by phrase; a response control unit determines a response message corresponding to the voice information received from the response availability determination unit by selecting one or more phrases from the response database; and a response output unit outputs the response message as voice.

According to the present invention, erroneous recognition of utterances by a person who is not speaking to the voice recognition device can be prevented with higher accuracy.

FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.
FIG. 2 is a diagram showing a hardware configuration in which the voice recognition device 1 according to the first embodiment is realized by a computer device and its peripheral devices.
FIG. 3 is a diagram showing an example of the appearance of the voice recognition device 1 according to the first embodiment.
FIG. 4 is a flowchart showing the voice information input operation of the voice recognition device 1 according to the first embodiment.
FIG. 5 is a flowchart showing the person detection information input operation of the voice recognition device 1 according to the first embodiment.
FIG. 6 is a flowchart showing the face detection information input operation of the voice recognition device 1 according to the first embodiment.
FIG. 7 is a flowchart showing the response availability determination operation of the voice recognition device 1 according to the first embodiment.
FIG. 8 is a flowchart showing the response message output operation of the voice recognition device 1 according to the first embodiment.

Next, embodiments of the present invention will be described in detail with reference to the drawings.

[First Embodiment]
FIG. 1 is a block diagram showing the configuration of the voice recognition device 1 according to the first embodiment of the present invention. Referring to FIG. 1, the voice recognition device 1 includes a response availability determination unit 100, a response control unit 101, a person detection unit 110, a person photographing unit 120, a face detection unit 121, a voice input unit 130, a voice recognition unit 131, a response output unit 140, and a response database 141.

Next, the configuration of the voice recognition device 1 according to the first embodiment will be described.

The voice input unit 130 collects voice in front of the voice recognition device 1 and transmits it to the voice recognition unit 131.

The voice recognition unit 131 includes a voice recognition engine. Specifically, the voice recognition unit 131 includes a preprocessing circuit and a voice recognition conversion circuit. The preprocessing circuit discriminates and removes noise information, such as everyday background noise unrelated to a person's utterance, from the input voice. The voice recognition conversion circuit receives the voice information from which the preprocessing circuit has removed the noise information, recognizes the characters corresponding to the uttered sounds, converts them into character information, and outputs that character information to the response availability determination unit 100.

The person detection unit 110 detects a person in front of the voice recognition device 1 and outputs person detection information to the response availability determination unit 100. Any method may be used to detect a person, such as detection by an infrared or ultrasonic sensor, or detection by processing the image information from the person photographing unit 120. The person detection information is information indicating, or giving notice, that a person has been detected.

The person photographing unit 120 photographs the area in front of the voice recognition device 1 with a camera or the like and transmits the image information to the face detection unit 121.

The face detection unit 121 performs processing to detect a person's face in the received image information. If the face detection unit 121 determines that the face of a person looking directly at the voice recognition device 1 can be extracted from the received image, it outputs face detection information to the response availability determination unit 100. The face detection information is information indicating, or giving notice, that a face has been detected; it need not be image information of the face, nor need it include information identifying whose face it is.

When the response availability determination unit 100 receives character information corresponding to voice information from the voice recognition unit 131, it checks whether it is receiving the person detection information and the face detection information at the same time. If it is, the response availability determination unit 100 transmits the character information corresponding to the voice information to the response control unit 101.

The response database 141 stores, organized phrase by phrase, the response messages with which the voice recognition device 1 responds to speech uttered by a person recognized through the person detection information and the face detection information. The phrase data stored in the response database 141 may be either voice data or character information.

The response control unit 101 determines a response message corresponding to the character information received from the response availability determination unit 100 by selecting one or more phrases from the response database 141, and outputs the message to the response output unit 140 as a voice signal. When the phrase data stored in the response database 141 is character information, the response control unit 101 converts the character information of the selected phrases into a voice signal by speech synthesis and outputs that voice signal to the response output unit 140.
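The phrase selection and voice conversion described above can be sketched as follows. This is a minimal illustration, not the method of the publication: the keyword-matching rule, the contents of `RESPONSE_DB`, and the `synthesize` callable are all hypothetical, since the text does not say how phrases are matched to the recognized text or which speech synthesizer is used.

```python
from typing import Callable, Dict, List, Union

# A stored phrase is either character information (str) or voice data (bytes).
Phrase = Union[str, bytes]

# Hypothetical database contents: keyword -> phrase.
RESPONSE_DB: Dict[str, Phrase] = {
    "hello": "Hello.",
    "weather": "I cannot check the weather.",
    "bye": b"\x00\x01",  # placeholder bytes standing in for recorded audio
}


def select_phrases(text: str, db: Dict[str, Phrase]) -> List[Phrase]:
    """Pick every phrase whose keyword occurs in the recognized text.

    Keyword matching is an assumption made for this sketch only.
    """
    words = text.lower().split()
    return [phrase for key, phrase in db.items() if key in words]


def to_voice_signal(phrases: List[Phrase],
                    synthesize: Callable[[str], bytes]) -> bytes:
    """Text phrases go through speech synthesis; voice data is used as is."""
    return b"".join(
        p if isinstance(p, bytes) else synthesize(p) for p in phrases
    )
```

Passing the synthesizer in as a parameter keeps the sketch independent of any particular text-to-speech engine.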

The response output unit 140 outputs the received response message as voice.
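The connections between the units described so far can be written down as a small directed graph. The sketch below is only an illustration: the edge list is read off the unit descriptions, while the dictionary layout and the `path` helper are hypothetical. The edge from the response database 141 represents the phrases that the response control unit 101 selects from it.

```python
from collections import deque
from typing import Dict, List, Optional

# sender -> receivers, as stated in the descriptions of each unit
DATAFLOW: Dict[str, List[str]] = {
    "voice input unit 130": ["voice recognition unit 131"],
    "voice recognition unit 131": ["response availability determination unit 100"],
    "person detection unit 110": ["response availability determination unit 100"],
    "person photographing unit 120": ["face detection unit 121"],
    "face detection unit 121": ["response availability determination unit 100"],
    "response availability determination unit 100": ["response control unit 101"],
    "response database 141": ["response control unit 101"],
    "response control unit 101": ["response output unit 140"],
    "response output unit 140": [],
}


def path(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for one route from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        route = queue.popleft()
        if route[-1] == goal:
            return route
        for nxt in DATAFLOW.get(route[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(route + [nxt])
    return None
```

Tracing from the voice input unit 130 to the response output unit 140 shows that every utterance has to pass through the response availability determination unit 100 before a response can be produced.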

FIG. 3 is a diagram showing an example of the appearance of the voice recognition device 1 according to the first embodiment of the present invention. As shown in FIG. 3, the voice recognition device 1 is, as one example, a robot.

As shown in FIG. 3, the voice input unit 130 of the voice recognition device 1 consists mainly of a microphone. The person photographing unit 120 consists mainly of a camera. The response output unit 140 consists mainly of a speaker. The person detection unit 110 consists mainly of a sensor. When a person is detected by processing the image information from the person photographing unit 120, the person detection unit 110 can be omitted. Each component may be provided singly or in plurality and may be installed anywhere, provided that the person photographing unit 120 and the person detection unit 110 photograph and detect the area in front of the voice recognition device 1.

FIG. 2 is a diagram showing a hardware configuration in which the voice recognition device 1 according to the first embodiment of the present invention is realized by a computer and its peripheral devices. As shown in FIG. 2, the voice recognition device 1 includes a CPU 11, an output device 12, an input device 13, a main storage device 14, and a secondary storage device 15.

The CPU 11 runs an operating system and controls the whole of the voice recognition device 1 according to the first embodiment of the present invention. The CPU 11 also reads programs and data, for example from the secondary storage device 15 into the main storage device 14. The CPU 11 functions as the response availability determination unit 100, the response control unit 101, the face detection unit 121, and the voice recognition unit 131 of the first embodiment, and executes various kinds of processing under program control.

The output device 12 is realized by, for example, a speaker, and functions as the response output unit 140 of the first embodiment.

The input device 13 is realized by, for example, a sensor, a camera, and a microphone, and functions as the person detection unit 110, the person photographing unit 120, and the voice input unit 130 of the first embodiment.

The secondary storage device 15 is, for example, an optical disk, a flexible disk, a magneto-optical disk, an external hard disk, or a semiconductor memory, and records a computer program in a computer-readable manner. The computer program may also be downloaded from an external computer (not shown) connected to a communication network. The secondary storage device 15 also functions as the response database 141 of the first embodiment.

Note that the block diagram (FIG. 1) used in the description of the first embodiment shows blocks in units of functions, not a configuration in units of hardware. These functional blocks are realized by the hardware configuration shown in FIG. 2. However, the means of realizing each unit of the voice recognition device 1 is not particularly limited. That is, the voice recognition device 1 may be realized by a single physically integrated device, or by two or more physically separate devices connected by wire or wirelessly.
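The correspondence between the functional blocks of FIG. 1 and the hardware of FIG. 2, as given in the preceding paragraphs, can be tabulated in a few lines. The table contents come from the description; the dictionary layout and the `hardware_for` helper are hypothetical conveniences for this sketch. The main storage device 14 is not listed because the description assigns it no functional block; it only holds the programs and data that the CPU 11 loads.

```python
from typing import Dict, Set

# hardware element of FIG. 2 -> reference numerals of the FIG. 1 units it realizes
HARDWARE: Dict[str, Set[int]] = {
    "CPU 11": {100, 101, 121, 131},
    "output device 12": {140},
    "input device 13": {110, 120, 130},
    "secondary storage device 15": {141},
}

# every functional unit shown in FIG. 1
FIG1_UNITS: Set[int] = {100, 101, 110, 120, 121, 130, 131, 140, 141}


def hardware_for(unit: int) -> str:
    """Return the hardware element that realizes the given functional unit."""
    for device, units in HARDWARE.items():
        if unit in units:
            return device
    raise KeyError(unit)
```

Checking that the union of the four sets equals `FIG1_UNITS` confirms that every functional block has exactly one hardware home in this configuration.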

The CPU 11 may also read a computer program recorded in the secondary storage device 15 and, in accordance with that program, operate as the response availability determination unit 100, the response control unit 101, the face detection unit 121, and the voice recognition unit 131.

A recording medium (or storage medium) on which the code of the above program is recorded may also be supplied to the voice recognition device 1, and the voice recognition device 1 may read and execute the program code stored on the recording medium. That is, the present invention also includes the secondary storage device 15 that temporarily or non-temporarily stores the software (voice recognition program) executed by the voice recognition device 1 according to the first embodiment.

The operation of the voice recognition device 1 configured as described above will be described with reference to the flowcharts of FIGS. 4 to 8.

FIG. 4 is a flowchart outlining the voice information input operation of the voice recognition device 1 according to the first embodiment. The processing in this flowchart may be executed under the program control of the CPU described above.

As shown in FIG. 4, the voice input unit 130 first collects voice in front of the voice recognition device 1 and outputs it to the voice recognition unit 131 (step S101).

Next, the voice recognition unit 131 processes the received voice. Specifically, the preprocessing circuit discriminates and removes everyday background noise and other sounds unrelated to a person's utterance (step S102).

Next, the voice recognition unit 131 checks whether the voice information processed by the preprocessing circuit contains a person's utterance (step S103). If it does (YES in step S103), the voice recognition conversion circuit of the voice recognition unit 131 converts the voice information into character information, for example text data, and outputs it to the response availability determination unit 100 (step S104).

The voice recognition device 1 then ends the voice information input operation.
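Steps S102 to S104 form a short pipeline, sketched below. Everything here is illustrative: the function names are hypothetical, the noise removal is a toy amplitude gate rather than the preprocessing circuit of the publication, and the recognizer is passed in as a callable because the text does not name a specific voice recognition engine.

```python
from typing import Callable, List, Optional, Sequence

Audio = Sequence[float]  # assumed representation: a sequence of samples


def noise_gate(audio: Audio, floor: float = 0.1) -> List[float]:
    """Toy stand-in for S102: zero out samples quieter than `floor`."""
    return [s if abs(s) >= floor else 0.0 for s in audio]


def has_signal(audio: Audio) -> bool:
    """Toy stand-in for S103: is anything left after noise removal?"""
    return any(s != 0.0 for s in audio)


def voice_input_operation(
    audio: Audio,
    remove_noise: Callable[[Audio], Audio],
    contains_speech: Callable[[Audio], bool],
    to_text: Callable[[Audio], str],
) -> Optional[str]:
    """Return character information, or None when no utterance is found."""
    cleaned = remove_noise(audio)      # S102: discriminate and remove noise
    if not contains_speech(cleaned):   # S103: NO, so nothing is output
        return None
    return to_text(cleaned)            # S104: convert to character information
```

Returning `None` on the NO branch mirrors the flowchart, in which nothing is sent to the response availability determination unit 100 unless an utterance is present.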

FIG. 5 is a flowchart outlining the person detection information input operation of the voice recognition device 1 according to the first embodiment. The processing in this flowchart may also be executed under the program control of the CPU described above.

As shown in FIG. 5, when the person detection unit 110 detects a person in front of the voice recognition device 1 (YES in step S201), it outputs person detection information to the response availability determination unit 100 (step S202).

The voice recognition device 1 then ends the person detection information input operation.

FIG. 6 is a flowchart outlining the face detection information input operation of the voice recognition device 1 according to the first embodiment. The processing in this flowchart may also be executed under the program control of the CPU described above.

As shown in FIG. 6, the person photographing unit 120 first photographs the area in front of the voice recognition device 1 and outputs the image information to the face detection unit 121 (step S301).

Next, the face detection unit 121 performs image processing on the received image information (step S302).

Next, the face detection unit 121 checks whether, as a result of the image processing, the face of a person looking directly at the voice recognition device 1 can be extracted from the image (step S303). If it can (YES in step S303), the face detection unit 121 outputs face detection information to the response availability determination unit 100 (step S304).

The voice recognition device 1 then ends the face detection information input operation.
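The decision in steps S303 and S304 can be sketched as below. The publication does not state how "looking directly at the device" is judged, so the head-pose angles, the 15-degree threshold, and all names here are assumptions made purely for illustration. What the sketch does take from the text is that the output is only a notification flag, carrying neither the face image nor the person's identity.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FaceCandidate:
    """Hypothetical result of the image processing in S302."""
    yaw_deg: float    # left/right head rotation relative to the camera axis
    pitch_deg: float  # up/down head rotation relative to the camera axis


def is_facing_device(face: FaceCandidate, max_angle_deg: float = 15.0) -> bool:
    """Assumed criterion: both angles are within a small threshold."""
    return (abs(face.yaw_deg) <= max_angle_deg
            and abs(face.pitch_deg) <= max_angle_deg)


def face_detection_info(faces: Iterable[FaceCandidate],
                        max_angle_deg: float = 15.0) -> bool:
    """S303/S304: True if at least one face is looking at the device."""
    return any(is_facing_device(f, max_angle_deg) for f in faces)
```

A person standing in front of the device but looking away (large yaw) produces no face detection information under this criterion, which is the case the device is meant to filter out.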

FIG. 7 is a flowchart outlining the response availability determination operation of the voice recognition device 1 according to the first embodiment. The processing in this flowchart may also be executed under the program control of the CPU described above.

As shown in FIG. 7, the response availability determination unit 100 first receives, from the voice recognition unit 131, character information corresponding to the collected voice information (step S401).

Next, the response availability determination unit 100 checks whether it is receiving, at the same time, the person detection information from the person detection unit 110 and the face detection information from the face detection unit 121 (step S402). If it is (YES in step S402), it determines that a response is permitted and transmits the character information to the response control unit 101 (step S403).

The voice recognition device 1 then ends the response availability determination operation.
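Steps S401 to S403 can be sketched as a small gate object. One point is an interpretation rather than something the text specifies: "at the same time" is modelled here as both notifications having arrived within a freshness window (`window_s`) before the text. The class name, method names, and window length are hypothetical, and timestamps are passed in explicitly so the behavior is easy to check.

```python
from typing import Optional


class ResponseGate:
    """Illustrative stand-in for the response availability determination unit 100."""

    def __init__(self, window_s: float = 1.0) -> None:
        self.window_s = window_s  # assumed freshness window, not from the source
        self._person_at: Optional[float] = None
        self._face_at: Optional[float] = None

    def on_person_detected(self, now: float) -> None:
        """Record when person detection information last arrived."""
        self._person_at = now

    def on_face_detected(self, now: float) -> None:
        """Record when face detection information last arrived."""
        self._face_at = now

    def _fresh(self, stamp: Optional[float], now: float) -> bool:
        return stamp is not None and 0.0 <= now - stamp <= self.window_s

    def on_text(self, text: str, now: float) -> Optional[str]:
        """S401: text arrives. S402: are both notifications current?
        S403: if so, forward the text; otherwise drop it."""
        if self._fresh(self._person_at, now) and self._fresh(self._face_at, now):
            return text
        return None
```

For example, with detections recorded at 10.0 s and 10.2 s, text arriving at 10.5 s is forwarded, while the same text arriving at 12.0 s is dropped because both notifications are stale.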

FIG. 8 is a flowchart outlining the response message output operation of the voice recognition device 1 according to the first embodiment. The processing in this flowchart may also be executed under the program control of the CPU described above.

As shown in FIG. 8, the response control unit 101 first determines a response message corresponding to the received character information by selecting one or more phrases from the response database 141 (step S501).

Next, the response control unit 101 transmits the determined response message to the response output unit 140 as a voice signal (step S502). If the response message is character information, the response control unit 101 converts the character information into a voice signal by speech synthesis and outputs it. If the response message is voice data, the response control unit 101 converts the voice data into a voice signal and outputs it to the response output unit 140.

Next, the response output unit 140 outputs the received response message as voice (step S503).

The voice recognition device 1 then ends the response message output operation.

Next, the effects of the first embodiment of the present invention will be described.

The voice recognition device 1 according to the embodiment described above can prevent, with higher accuracy, erroneous recognition of utterances by a person who is not speaking to the voice recognition device 1.

The reason is that the device includes the following configuration. First, when the person detection unit 110 detects a person in front of the voice recognition device 1, it outputs person detection information to the response availability determination unit 100. Second, the person photographing unit 120 photographs the area in front of the voice recognition device 1; if the image contains the face of a person looking directly at the voice recognition device 1, the face detection unit 121 transmits face detection information to the response availability determination unit 100. Third, the response availability determination unit 100 determines that a response is permitted when it is receiving the person detection information and the face detection information at the same time. The voice recognition device 1 thus permits a response when a person is in front of the voice recognition device 1 and is looking directly at it, so erroneous recognition of utterances by a person who is not speaking to the voice recognition device 1 can be prevented with higher accuracy.

Each component in each embodiment of the present invention described above can be realized not only in hardware but also by a computer device or firmware operating under program control. The program is provided recorded on a computer-readable recording medium such as a magnetic disk or a semiconductor memory, and is read by the computer, for example when the computer starts up. The loaded program controls the operation of the computer, causing it to function as the components in each of the embodiments described above.

以上、各実施の形態を参照して本発明を説明したが、本発明は上記実施の形態に限定されるものではない。本発明の構成や詳細には、本発明のスコープ内で当業者が理解しえる様々な変更をすることができる。 Although the present invention has been described with reference to each embodiment, the present invention is not limited to the above embodiment. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

たとえば、以上の各実施形態で説明した各構成要素は、必ずしも個々に独立した存在である必要はない。例えば、各構成要素は、複数の構成要素が１個のモジュールとして実現されたり、一つの構成要素が複数のモジュールで実現されたりしてもよい。また、各構成要素は、ある構成要素が他の構成要素の一部であったり、ある構成要素の一部と他の構成要素の一部とが重複していたり、といったような構成であってもよい。 For example, each component described in each of the above embodiments does not necessarily have to be individually independent. For example, for each component, a plurality of components may be realized as one module, or one component may be realized as a plurality of modules. Each component is configured such that a component is a part of another component, or a part of a component overlaps a part of another component. Also good.

Further, in each of the embodiments described above, a plurality of operations are described in order in the form of a flowchart, but the order of description does not limit the order in which the operations are executed. Therefore, when implementing each embodiment, the order of the operations can be changed to the extent that the change does not interfere with their substance.

Furthermore, in each of the embodiments described above, the plurality of operations are not limited to being executed at mutually different timings. For example, another operation may occur while a certain operation is being executed, or the execution timings of a certain operation and another operation may overlap partially or entirely.

Furthermore, in each of the embodiments described above, a certain operation is described as the trigger for another operation, but that description does not limit all of the relationships between the one operation and the other. Therefore, when implementing each embodiment, the relationships among the operations can be changed to the extent that the change does not interfere with their substance. Likewise, the specific description of each operation of each component does not limit that operation. Therefore, the specific operations of each component may be changed to the extent that the change does not impair the functional, performance, or other characteristics of the embodiment being implemented.

The present invention can be applied to a voice recognition device that suppresses, with higher accuracy, erroneous recognition of utterances by people who are not speaking to the voice recognition device.

DESCRIPTION OF SYMBOLS
1 Voice recognition device
11 CPU
12 Output device
13 Input device
14 Main storage device
15 Secondary storage device
100 Response availability determination unit
101 Response control unit
110 Person detection unit
120 Person photographing unit
121 Face detection unit
130 Voice input unit
131 Voice recognition unit
140 Response output unit
141 Response database

Claims

A voice recognition device comprising:
a voice input unit that collects voice in front of the voice recognition device and outputs it as voice information;
a voice recognition unit that converts the voice information received from the voice input unit into character information;
a person detection unit that detects a person in front of the voice recognition device and outputs person detection information;
a person photographing unit that photographs the area in front of the voice recognition device and outputs image information;
a face detection unit that detects, from the image information received from the person photographing unit, the face of a person looking directly at the voice recognition device, and outputs face detection information;
a response availability determination unit that outputs the character information in response to receiving the character information from the voice recognition unit, the person detection information from the person detection unit, and the face detection information from the face detection unit;
a response database that stores response messages from the voice recognition device to a person, organized in units of phrases;
a response control unit that determines a response message corresponding to the character information received from the response availability determination unit by selecting one or more phrases from the response database; and
a response output unit that outputs, as voice, the response message determined by the response control unit.

The voice recognition device according to claim 1, wherein the voice recognition unit includes a preprocessing circuit that identifies and removes, from the voice information, everyday noise unrelated to a person's utterance.

The voice recognition device according to claim 1, wherein the face detection unit performs image processing for face detection on the image information received from the person photographing unit.

The voice recognition device according to claim 1, wherein the response messages stored in the response database are voice data or character information.

A voice recognition method in which:
a voice input unit collects voice in front of a voice recognition device as voice information;
a voice recognition unit converts the voice information received from the voice input unit into character information;
a response availability determination unit outputs the character information in response to receiving the voice information from the voice recognition unit, person detection information from a person detection unit, and face detection information from a face detection unit;
a response database stores response messages from the voice recognition device to a person, organized in units of phrases;
a response control unit determines a response message corresponding to the voice information received from the response availability determination unit by selecting one or more phrases from the response database; and
a response output unit outputs the response message as voice.

A program that causes a computer to execute processing in which:
a voice input unit collects voice in front of a voice recognition device as voice information;
a voice recognition unit converts the voice information received from the voice input unit into character information;
a response availability determination unit outputs the character information in response to receiving the voice information from the voice recognition unit, person detection information from a person detection unit, and face detection information from a face detection unit;
a response database stores response messages from the voice recognition device to a person, organized in units of phrases;
a response control unit determines a response message corresponding to the voice information received from the response availability determination unit by selecting one or more phrases from the response database; and
a response output unit outputs the response message as voice.