JP5849761B2 - Speech recognition system, speech recognition method, and speech recognition program

Info

Publication number: JP5849761B2
Application number: JP2012036555A
Authority: JP
Inventors: 昌史関野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-02-22
Filing date: 2012-02-22
Publication date: 2016-02-03
Anticipated expiration: 2032-02-22
Also published as: JP2013172411A

Description

The present invention relates to a speech recognition system that uses image recognition technology.

Speech recognition is a promising technology that is beginning to be used in smart devices and elsewhere. With speech recognition, tasks that have so far required tedious manual operation, such as creating notes, searching, or operating a system, can be performed more easily.

Various measures have been taken to improve the accuracy of speech recognition, such as applying noise processing and maintaining speech dictionaries, but reaching 100% recognition accuracy is difficult. Customers who use speech recognition nevertheless expect close to 100% accuracy, and that need cannot be met at present. Research therefore continues on technologies and solutions for improving recognition accuracy.

A typical configuration for speech recognition is broadly divided into a microphone that collects speech and a recognition engine that analyzes the obtained data and converts it into text. Recognition accuracy depends heavily on the quality of the collected voice data and on the speech dictionary used by the recognition engine. Focusing on speech collection, one problem is that the microphone also picks up noise around the recognition target person and the voices of other people, which lowers the quality of the voice data and therefore the recognition accuracy.

As a technique for addressing this problem, Patent Document 1, for example, describes using a directional microphone to emphasize the voice of the recognition target person and improve sound collection, with the directivity of the voice input unit controlled according to the range that the subject occupies in a captured image. Patent Document 1 also describes judging that a person whose face is recognized as having an open mouth is likely to be speaking, and inputting sound with a directivity that emphasizes sound arriving from the range occupied by that person's face.

Patent Document 1: JP 2011-61461 A

However, the way people move their mouths when speaking differs from person to person, so even with the technique described in Patent Document 1 the speaker who is the recognition target cannot always be identified reliably. As a result, collection of the speaker's voice may fail and ambient noise may be picked up instead, lowering the quality of the voice data and the accuracy of speech recognition.

Accordingly, an object of the present invention is to provide a speech recognition system that can improve the accuracy of speech recognition.

A speech recognition system according to the present invention includes: face recognition means that acquires, from a camera, an image of a user who is a target of speech recognition and identifies the user using the image; mouth movement determination means that has a mouth movement database storing mouth movement feature quantities for each individual in advance, detects the state of the user's mouth from the image, compares the mouth movement feature quantity corresponding to the user stored in the mouth movement database with the user's mouth movement feature quantity obtained from the image, and determines whether the user is speaking; directivity direction determination means that, when the user is determined to be speaking, notifies voice input means for acquiring the user's voice of the user's position; and speech recognition means that acquires the voice and performs speech recognition.

A speech recognition method according to the present invention includes: acquiring, from a camera, an image of a user who is a target of speech recognition and identifying the user using the image; using a mouth movement database that stores mouth movement feature quantities for each individual in advance, detecting the state of the user's mouth from the image, comparing the mouth movement feature quantity corresponding to the user stored in the mouth movement database with the user's mouth movement feature quantity obtained from the image, and determining whether the user is speaking; when the user is determined to be speaking, notifying voice input means for acquiring the user's voice of the user's position; and acquiring the voice and performing speech recognition.

A speech recognition program according to the present invention causes a computer to execute: face recognition processing of acquiring, from a camera, an image of a user who is a target of speech recognition and identifying the user using the image; mouth movement determination processing of using a mouth movement database that stores mouth movement feature quantities for each individual in advance, detecting the state of the user's mouth from the image, comparing the mouth movement feature quantity corresponding to the user stored in the mouth movement database with the user's mouth movement feature quantity obtained from the image, and determining whether the user is speaking; directivity direction determination processing of notifying, when the user is determined to be speaking, voice input means for acquiring the user's voice of the user's position; and speech recognition processing of acquiring the voice and performing speech recognition.

According to the present invention, the accuracy of speech recognition can be improved.

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition system according to the present invention.
FIG. 2 is a sequence diagram showing the operation of the embodiment of the speech recognition system according to the present invention.
FIG. 3 is a flowchart showing the operation of the embodiment of the speech recognition system according to the present invention.
FIG. 4 is an explanatory diagram showing the operation of the mouth movement determination function for a specific person.
FIG. 5 is an explanatory diagram showing the operation of the speech recognition function for a specific person.
FIG. 6 is an explanatory diagram showing the operation of the sentence end determination function.
FIG. 7 is a block diagram showing the main part of the speech recognition system according to the present invention.

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition system according to the present invention. As shown in FIG. 1, the speech recognition system of this embodiment includes a face recognition function (face recognition unit) 30, a mouth movement determination function (mouth movement determination unit) 40, a voice data input presence/absence determination function (voice data input presence/absence determination unit) 50, a directivity direction determination function (directivity direction determination unit) 60, a speech recognition function (speech recognition unit) 70, and a sentence end determination function (sentence end determination unit) 80. Each of these functions other than the sentence end determination function 80 is connected via a network 20 to a device 10 used by the user, and exchanges data with it. The device 10 includes a camera 11 and a directional microphone.

The face recognition function 30 includes a feature quantity extraction means 31, a DB (database) matching means 32, and a face DB 33. The face recognition function 30 receives the image obtained from the camera 11 as input. The feature quantity extraction means 31 extracts a feature quantity from the input image. The DB matching means 32 matches this feature quantity against the face feature quantities stored in advance in the face DB 33 to identify the user.
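The matching performed by the DB matching means 32 can be illustrated with a minimal sketch. The text does not specify the feature representation or the matching rule, so the numeric feature vectors, the cosine similarity measure, the 0.9 threshold, and the names `identify_user` and `face_db` are all assumptions made for illustration only.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_user(
    features: Sequence[float],
    face_db: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumed value, not from the source
) -> Optional[str]:
    """Return the best-matching registered user, or None if nobody is similar enough.

    A None result corresponds to the case where the person cannot be identified
    and face recognition has to be attempted again.
    """
    best_name, best_score = None, threshold
    for name, stored in face_db.items():
        score = cosine_similarity(features, stored)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical face DB with two registered users.
face_db = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
print(identify_user([0.9, 0.1], face_db))
print(identify_user([1.0, 1.0], face_db))
```

Returning `None` rather than the nearest entry keeps an unregistered face from being mistaken for a registered user, which matters because the later steps select person-specific data based on this result.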

The mouth movement determination function 40 includes a mouth state detection means 41, a specific user DB (database) setting means 42, an individual mouth movement DB 43, and an in-speech mouth movement determination means 44. The mouth state detection means 41 detects the state of the mouth from the image obtained from the camera 11. The in-speech mouth movement determination means 44 compares the data of the individual mouth movement DB 43 set by the specific user DB setting means 42 with the image obtained from the camera 11.

The voice data input presence/absence determination function 50 includes a voice data acquisition means 51 and a voice data presence/absence determination means 52. The voice data acquisition means 51 acquires voice data from the device 10. The voice data presence/absence determination means 52 determines whether voice data is present. For example, it determines that voice data is present if the volume of the acquired voice data exceeds a predetermined value.

The directivity direction determination function 60 includes a search result acquisition means 61, a target person image position determination means 62, and a direction determination means 63. The search result acquisition means 61 acquires the results output by the face recognition function 30 and the mouth movement determination function 40. The target person image position determination means 62 determines the position of the person to be recognized based on the acquired results. The direction determination means 63 determines the direction from the position of the person.

The speech recognition function 70 includes a specific user dictionary setting means 71, a speech recognition (text conversion) means 72, an individual user dictionary DB (database) 73, and a recognition result DB 74. The specific user dictionary setting means 71 selects, from the individual user dictionary DB 73, the DB dedicated to the person to be recognized. The speech recognition (text conversion) means 72 uses the selected dictionary DB to recognize the voice data obtained from the microphone 12 and convert it into text.

The sentence end determination function 80 includes a text acquisition means 81 and a sentence end determination means 82. The text acquisition means 81 acquires the text information transmitted from the speech recognition function 70. The sentence end determination means 82 analyzes the obtained text information and determines whether the sentence is complete.

Note that the face recognition function 30, the mouth movement determination function 40, the voice data input presence/absence determination function 50, the directivity direction determination function 60, the speech recognition function 70, and the sentence end determination function 80 of the speech recognition system of this embodiment can be implemented by a CPU that executes processing based on a program. The DBs (databases) included in these functions are stored in a storage device such as an HDD (hard disk drive).

The operation of the speech recognition system of this embodiment is described below. FIG. 2 is a sequence diagram showing the operation of the embodiment of the speech recognition system according to the present invention. FIG. 3 is a flowchart showing the operation of the embodiment of the speech recognition system according to the present invention.

When the user turns on the power of the device 10 (step S100), the camera 11 of the device 10 captures an image of the user and transmits the image to the face recognition function 30 and the mouth movement determination function 40 (step S1). The face recognition function 30 takes the image obtained from the camera 11 as input and performs face recognition (step S101, step S2). Specifically, the DB matching means 32 matches the image feature quantity extracted by the feature quantity extraction means 31 against the face feature quantities stored in advance in the face DB 33 to identify the user. If the person cannot be identified by face recognition (NO in step S102), face recognition is performed again (step S101).

If the person is identified by face recognition (YES in step S102), the face recognition function 30 transmits the recognition result to the mouth movement determination function 40 and the speech recognition function 70 (step S3) and suspends face recognition.

Images obtained from the camera 11 are transmitted periodically to the mouth movement determination function 40 (step S5). The mouth state detection means 41 detects the state of the mouth from the image obtained from the camera 11. The specific user DB setting means 42 uses the individual mouth movement DB 43 to set the mouth movement feature quantity corresponding to the person recognized by the face recognition function 30 (step S103, step S4). The in-speech mouth movement determination means 44 compares the feature quantity of the image with the feature quantity stored in the individual mouth movement DB 43 and determines whether the user is speaking (step S104, step S6). If the user is determined not to be speaking, the face recognition function 30 is activated and face recognition is performed again (step S101).

FIG. 4 is an explanatory diagram showing the operation of the mouth movement determination function 40 for a specific person. As shown in FIG. 4, the individual mouth movement DB 43 contains a separate DB for each individual. In the example of FIG. 4, the individual mouth movement DB 43 includes a database 43a dedicated to Person A and a database 43b dedicated to Person B. When the image obtained from the camera 11 is recognized as Person A in step S101, the in-speech mouth movement determination means 44 compares the feature quantity of the image with the feature quantity stored in Person A's database 43a to determine whether Person A is speaking.

Because the mouth movement determination function 40 stores mouth movement feature quantity data for each individual in advance and uses that data to determine whether the user is speaking, it can make the determination more accurately than when judging from the transmitted image alone.
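To make the per-person comparison concrete, the following is a minimal sketch under strong assumptions. The text does not define the mouth movement feature quantity, so the example invents one (the mean frame-to-frame change in a measured mouth opening) and a relative tolerance of 0.5; `mouth_movement_feature`, `is_speaking`, and `mouth_db` are hypothetical names.

```python
from typing import Dict, Sequence


def mouth_movement_feature(openings: Sequence[float]) -> float:
    """Mean absolute frame-to-frame change in mouth opening (assumed feature)."""
    if len(openings) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(openings, openings[1:])]
    return sum(diffs) / len(diffs)


def is_speaking(
    user: str,
    openings: Sequence[float],
    mouth_db: Dict[str, float],
    tolerance: float = 0.5,  # assumed relative tolerance, not from the source
) -> bool:
    """Compare the observed feature with the value stored for this specific user."""
    reference = mouth_db[user]  # the user's typical feature value while speaking
    observed = mouth_movement_feature(openings)
    return abs(observed - reference) <= tolerance * reference


# Hypothetical DB: Person A opens the mouth widely when speaking, Person B only slightly.
mouth_db = {"A": 4.0, "B": 1.0}
small_movement = [1.0, 2.0, 1.0, 2.0]
print(is_speaking("A", small_movement, mouth_db))
print(is_speaking("B", small_movement, mouth_db))
```

The same small movement is treated as speech for Person B but not for Person A, which is the benefit the paragraph above attributes to using person-specific data rather than the image alone.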

If the mouth movement determination function 40 determines in step S104 that the user is speaking, it transmits the face recognition result obtained from the face recognition function 30 and the mouth movement determination result to the directivity direction determination function 60 (step S7). Based on these results, the directivity direction determination function 60 determines how far away the user is and in which direction the directivity of the microphone 12 should be pointed (step S105). Specifically, the search result acquisition means 61 acquires the results of the face recognition function 30 and the mouth movement determination function 40. Based on the acquired results, the target person image position determination means 62 determines the position and distance of the person to be recognized, and the direction determination means 63 determines the direction from that position. The determined directivity direction is notified to the device 10 (step S8). The device 10 automatically controls the microphone 12 so that its directivity is increased in that direction (step S106).
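One way to derive a direction from the person's position in the image is sketched below. It is an illustration only: it assumes a pinhole camera model, a known horizontal field of view, and a microphone located at the camera, none of which is stated in the text. The function name `directivity_angle_deg` is hypothetical, and distance estimation is omitted.

```python
import math


def directivity_angle_deg(
    face_center_x: float, image_width: int, horizontal_fov_deg: float
) -> float:
    """Horizontal angle of the face relative to the camera axis, in degrees.

    Negative values are left of center, positive values are right of center.
    """
    # Focal length in pixels implied by the image width and field of view.
    focal_px = (image_width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    offset_px = face_center_x - image_width / 2
    return math.degrees(math.atan2(offset_px, focal_px))


# A face centered at x = 480 in a 640-pixel-wide image from a 60-degree camera.
print(round(directivity_angle_deg(480, 640, 60.0), 1))
```

A face at the image center yields 0 degrees and a face at the image edge yields half the field of view, so the result can be passed directly to a microphone whose steering angle is expressed relative to the camera axis.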

In addition, if the user is determined to be speaking in step S104, the speech recognition function 70 sets the speech recognition dictionary for the person who is the target of speech recognition (step S107). Specifically, since the user identified by the face recognition function 30 in this example is Person A, the specific user dictionary setting means 71 selects the dictionary dedicated to Person A from the individual user dictionary DB 73.

The device 10 transmits the voice data collected by the microphone 12 to the voice data input presence/absence determination function 50 (step S9). The voice data acquisition means 51 acquires the voice data from the device 10. The voice data presence/absence determination means 52 determines whether voice data is present (step S108, step S10). For example, it determines that voice data is present if the volume of the acquired voice data exceeds a predetermined value. If it is determined in step S108 that no voice data is present, the face recognition function 30 is activated and the face recognition of step S101 is performed again.

If it is determined in step S108 that voice data is present, the voice data input presence/absence determination function 50 transmits the voice data to the speech recognition function 70 (step S11). The speech recognition function 70 performs speech recognition on the transmitted voice data (step S109, step S12).

When the voice data input presence/absence determination function 50 determines that no voice data is present, speech recognition is not performed, which reduces the chance that the speech recognition function 70 carries out unnecessary processing.
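The volume check described for the voice data presence/absence determination means 52 could look like the following sketch. The text only says that the volume is compared with a predetermined value; measuring volume as the root mean square (RMS) of normalized samples and using 0.02 as the threshold are assumptions, as is the name `has_voice_data`.

```python
import math
from typing import Sequence


def has_voice_data(samples: Sequence[float], threshold: float = 0.02) -> bool:
    """Return True if the RMS volume of the samples exceeds the threshold.

    Samples are assumed to be normalized to the range -1.0 to 1.0.
    """
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold


silence = [0.0] * 100
speech_like = [0.1, -0.1] * 50
print(has_voice_data(silence))
print(has_voice_data(speech_like))
```

Only buffers that pass this gate would be forwarded for recognition, which is how the unnecessary processing mentioned above is avoided.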

FIG. 5 is an explanatory diagram showing the operation of the speech recognition function 70 for a specific person. The individual user dictionary DB 73 stores speech-related characteristics of each individual, such as voice quality and frequently used words. In the example of FIG. 5, the individual user dictionary DB 73 includes a dictionary DB 73a dedicated to Person A and a dictionary DB 73b dedicated to Person B. In step S107, the specific user dictionary setting means 71 has selected the dictionary dedicated to Person A. The speech recognition (text conversion) means 72 therefore uses Person A's dictionary 73a to recognize the voice data obtained from the microphone 12 and convert it into text (step S109, step S12).

Because the speech recognition function 70 performs speech recognition using speech-related characteristics stored in advance for each individual, such as voice quality and frequently used words, it can recognize speech more accurately than when recognition is performed without identifying the individual.

The speech recognition function 70 stores the recognized (text-converted) data (text information) in the recognition result DB 74 and transmits it to the sentence end determination function 80 (step S13). The text acquisition means 81 acquires the text information transmitted from the speech recognition function 70. The sentence end determination means 82 then analyzes the obtained text information and determines whether the sentence is complete (step S14).

FIG. 6 is an explanatory diagram showing the operation of the sentence end determination function 80. As shown in FIG. 6, the sentence end determination function 80 analyzes the text information obtained from the speech recognition function 70 and determines where a sentence ends. For example, when it obtains a phrase that is often used at the end of a sentence, such as 「です」 (desu) or 「しましょう」 (shimashou), it judges that phrase to be the end of the sentence.
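The sentence-final phrase check can be sketched as a simple suffix test. The two phrases are the examples given in the text; a practical list would presumably be longer. Stripping trailing punctuation and the name `is_sentence_complete` are assumptions made for this illustration.

```python
# Phrases that often end a sentence; these two are the examples from the text.
SENTENCE_FINAL_PHRASES = ("です", "しましょう")

# Trailing characters ignored before the check (assumed; a recognizer may emit none).
TRAILING_CHARS = " 。、.!?！？\n"


def is_sentence_complete(text: str) -> bool:
    """Return True if the recognized text ends with a sentence-final phrase."""
    stripped = text.rstrip(TRAILING_CHARS)
    return stripped.endswith(SENTENCE_FINAL_PHRASES)


print(is_sentence_complete("今日は晴れです"))
print(is_sentence_complete("今日は"))
```

A suffix test is cheap enough to run on every chunk of recognized text, but it cannot tell a sentence-final phrase from the same characters appearing mid-sentence, so it should be read as the simplest possible reading of the description.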

If the sentence end determination function 80 determines that the sentence is complete (YES in step S110), it judges that Person A has finished speaking for the moment, and the face recognition function 30 is restarted and face recognition is performed again to identify the next speaker (step S101, step S15). If the sentence end determination function 80 determines that the sentence is not complete (NO in step S110), the process returns to step S106. In addition, if no voice is input for a certain period after speech has been recognized, the person's utterance may likewise be judged to have ended, and the face recognition function 30 may be restarted and face recognition performed again to identify the next speaker (step S101, step S15).

Because the sentence end determination function 80 can determine the end of a sentence, it can judge whether the speaker has finished speaking. When it judges that the speaker has finished, face recognition is performed again, and when another person begins to speak the directivity is pointed at that person. Therefore, even when several people speak in turn, for example in a meeting, the directivity can be pointed at the current speaker more accurately.
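The branching among steps S101 to S110 described in this section can be summarized as a transition table. This is a reading aid rather than the patented implementation: the branch targets follow the text, while the strictly sequential order of steps S105, S106, and S107 is inferred from the step numbering, and the names `TRANSITIONS` and `next_step` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (current step, decision outcome) -> next step.
# The outcome is None for steps that are not decisions.
TRANSITIONS: Dict[Tuple[str, Optional[bool]], str] = {
    ("S101", None): "S102",   # face recognition
    ("S102", False): "S101",  # person not identified: recognize again
    ("S102", True): "S103",   # person identified: set that person's mouth movement data
    ("S103", None): "S104",
    ("S104", False): "S101",  # not speaking: back to face recognition
    ("S104", True): "S105",   # speaking: determine the directivity direction
    ("S105", None): "S106",   # device steers the microphone
    ("S106", None): "S107",   # set the person's dictionary (order inferred)
    ("S107", None): "S108",
    ("S108", False): "S101",  # no voice data: back to face recognition
    ("S108", True): "S109",   # voice data present: speech recognition
    ("S109", None): "S110",
    ("S110", True): "S101",   # sentence complete: look for the next speaker
    ("S110", False): "S106",  # sentence not complete: keep listening
}


def next_step(step: str, outcome: Optional[bool] = None) -> str:
    """Return the step that follows `step` given a decision outcome."""
    return TRANSITIONS[(step, outcome)]


print(next_step("S110", True))
print(next_step("S110", False))
```

Every negative outcome except the one at S110 leads back to face recognition, which reflects the design described above: the system re-identifies the speaker whenever it loses evidence that the current person is talking.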

According to the present invention, whether a person is speaking is determined using mouth movement feature quantity data stored in advance for each individual, and voice data is collected with increased directivity toward the person who is likely to be speaking, so the accuracy of speech recognition can be improved. Speech recognition is also performed using a dedicated dictionary stored in advance for each individual, which further improves accuracy. Finally, the end of the speaker's utterance is determined, and when the speaker is judged to have finished, face recognition is performed again and the directivity is controlled accordingly, which also improves the accuracy of speech recognition.

FIG. 7 is a block diagram showing the main part of the speech recognition system according to the present invention. As shown in FIG. 7, the speech recognition system includes, as its main components: face recognition means 1 that acquires, from a camera, an image of a user who is a target of speech recognition and identifies the user using the image; mouth movement determination means 2 that has a mouth movement database storing mouth movement feature quantities for each individual in advance, detects the state of the user's mouth from the image, compares the mouth movement feature quantity corresponding to the user stored in the mouth movement database with the user's mouth movement feature quantity obtained from the image, and determines whether the user is speaking; directivity direction determination means 3 that, when the user is determined to be speaking, notifies voice input means for acquiring the user's voice of the user's position; and speech recognition means 4 that acquires the voice and performs speech recognition.

The above embodiments also disclose speech recognition systems as described in (1) to (4) below.

(1) A speech recognition system including: face recognition means (for example, the face recognition function 30) that acquires, from a camera (for example, the camera 11), an image of a user (for example, user X) who is a target of speech recognition and identifies the user using the image; mouth movement determination means (for example, the mouth movement determination function 40) that has a mouth movement database (for example, the individual mouth movement DB 43) storing mouth movement feature quantities for each individual in advance, detects the state of the user's mouth from the image, compares the mouth movement feature quantity corresponding to the user stored in the mouth movement database with the user's mouth movement feature quantity obtained from the image, and determines whether the user is speaking; directivity direction determination means (for example, the directivity direction determination function 60) that, when the user is determined to be speaking, notifies voice input means (for example, the microphone 12) for acquiring the user's voice of the user's position and causes it to increase its directivity toward the user; and speech recognition means (for example, the speech recognition function 70) that acquires the voice and performs speech recognition.

(2) The speech recognition system may be configured such that the speech recognition means has an individual user dictionary database storing speech-related characteristics of each individual in advance, and recognizes the user's voice acquired from the voice input means based on the speech-related characteristics corresponding to the user stored in the individual user dictionary database.

(3) The speech recognition system may include sentence end determination means (for example, the sentence end determination function 80) that acquires the text information obtained by the speech recognition means, analyzes the text information to determine whether the sentence is complete, and, when the sentence is determined to be complete, causes the face recognition means to perform face recognition.

(4) The speech recognition system may include voice input presence/absence determination means (for example, the voice data input presence/absence determination function 50) that determines whether voice data has been acquired from the voice input means; when no voice data has been acquired, the speech recognition means does not perform speech recognition and the face recognition means identifies the user using the image.
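
The control flow described in items (1) to (4) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the implementation disclosed here: the feature vectors, the Euclidean distance test and its threshold, the punctuation-based sentence check, and every name in the code (`MOUTH_DB`, `is_speaking`, `Microphone.steer`, `step`, the keys of `frame`) are hypothetical. The numbers in the comments refer to the functions named in the items above.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional

# Hypothetical stand-in for the individual mouth movement DB (43):
# user ID -> reference feature vector registered in advance for that user.
MOUTH_DB = {"user_x": [0.8, 0.6, 0.7]}


def is_speaking(user_id, observed, threshold=0.25):
    """Mouth movement determination (40): compare observed features with the
    user's stored features. Distance metric and threshold are assumptions."""
    reference = MOUTH_DB.get(user_id)
    if reference is None or len(reference) != len(observed):
        return False
    distance = sqrt(sum((r - o) ** 2 for r, o in zip(reference, observed)))
    return distance <= threshold


def sentence_complete(text):
    """Sentence end determination (80); the punctuation rule is a placeholder."""
    return text.rstrip().endswith(("。", ".", "?", "!", "？", "！"))


@dataclass
class Microphone:
    """Stand-in for the voice input means (12)."""
    target: Optional[tuple] = None

    def steer(self, position):
        # Record the position toward which directivity should be increased.
        self.target = position


def step(frame, mic, audio, recognize):
    """One pass of the loop. `frame` is a dict with the hypothetical keys
    'user_id', 'mouth_features', and 'position' produced by image analysis."""
    user_id = frame["user_id"]                 # result of face recognition (30)
    if not is_speaking(user_id, frame["mouth_features"]):
        return None
    mic.steer(frame["position"])               # directivity direction determination (60)
    if not audio:                              # voice data input presence check (50)
        return None                            # no recognition; caller identifies the user again
    text = recognize(user_id, audio)           # speech recognition (70), per-user dictionary
    return text, sentence_complete(text)
```

In this sketch, a return value of `None` means the caller goes back to face recognition on the next image, and a returned flag of `True` corresponds to the sentence end determination triggering face recognition again. The `recognize` callable is passed in so that a recognizer using a per-user dictionary, as in item (2), can be substituted.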

The present invention is applicable to speech recognition using a smartphone, speech recognition in video conferences, speech recognition in meetings or lectures, and the like.

DESCRIPTION OF SYMBOLS
10 Apparatus
20 Network
30 Face recognition function
40 Mouth movement determination function
50 Voice data input presence/absence determination function
60 Directivity direction determination function
70 Speech recognition function
80 Sentence end determination function

Claims

1. A speech recognition system comprising:
face recognition means for acquiring, from a camera, an image of a user who is a target of speech recognition and identifying the user using the image;
mouth movement determination means that has a mouth movement database storing in advance mouth movement feature values for each individual, detects the state of the user's mouth from the image, compares the mouth movement feature values stored in the mouth movement database for the user with the mouth movement feature values of the user obtained from the image, and determines whether the user is speaking;
directivity direction determination means for notifying, when the user is determined to be speaking, voice input means for acquiring the user's voice of the user's position; and
speech recognition means for acquiring the voice and performing speech recognition.

2. The speech recognition system according to claim 1, wherein the speech recognition means has an individual user dictionary database storing in advance voice-related characteristics for each individual, and performs speech recognition on the user's voice acquired from the voice input means based on the voice-related characteristics stored in the individual user dictionary database for the user.

3. The speech recognition system according to claim 1, further comprising sentence end determination means that acquires the text information obtained by the speech recognition means, analyzes the text information to determine whether the sentence is complete, and, when the sentence is determined to be complete, causes the face recognition means to perform face recognition.

4. The speech recognition system according to any one of claims 1 to 3, further comprising voice input presence/absence determination means for determining whether voice data has been acquired from the voice input means,
wherein, when no voice data has been acquired, the speech recognition means does not perform speech recognition and the face recognition means identifies the user using an image.

5. A speech recognition method comprising:
acquiring, from a camera, an image of a user who is a target of speech recognition and identifying the user using the image;
using a mouth movement database storing in advance mouth movement feature values for each individual, detecting the state of the user's mouth from the image, comparing the mouth movement feature values stored in the mouth movement database for the user with the mouth movement feature values of the user obtained from the image, and determining whether the user is speaking;
when the user is determined to be speaking, notifying voice input means for acquiring the user's voice of the user's position; and
acquiring the voice and performing speech recognition.

6. A speech recognition program for causing a computer to execute:
a face recognition process of acquiring, from a camera, an image of a user who is a target of speech recognition and identifying the user using the image;
a mouth movement determination process of, using a mouth movement database storing in advance mouth movement feature values for each individual, detecting the state of the user's mouth from the image, comparing the mouth movement feature values stored in the mouth movement database for the user with the mouth movement feature values of the user obtained from the image, and determining whether the user is speaking;
a directivity direction determination process of notifying, when the user is determined to be speaking, voice input means for acquiring the user's voice of the user's position; and
a speech recognition process of acquiring the voice and performing speech recognition.