JP2007079624A

JP2007079624A - Utterance detection device, method and program

Info

Publication number: JP2007079624A
Application number: JP2005262751A
Authority: JP
Inventors: Takashi Naito; 貴志内藤; Yoshihisa Matsumoto; 吉央松本; Tsukasa Ogasawara; 司小笠原
Original assignee: Nara Institute of Science and Technology NUC; Toyota Motor Corp; Toyota Central R&D Labs Inc
Current assignee: Nara Institute of Science and Technology NUC; Toyota Motor Corp; Toyota Central R&D Labs Inc
Priority date: 2005-09-09
Filing date: 2005-09-09
Publication date: 2007-03-29
Anticipated expiration: 2025-09-09
Also published as: JP4650888B2

Abstract

PROBLEM TO BE SOLVED: To accurately detect a speaker's utterances without being affected by sudden factors such as noise.

SOLUTION: A lip pattern is segmented from a speaker's image (Step 108), and correlation values are calculated between lip pattern f(t) and a lip-inclusive pattern F(t-i) (i = 1, 2, ..., N) (Step 109). A maximum correlation value s_max(t, t-i) is calculated between f(t) and F(t-i), and such correlation values s_max with i from 1 to N are summed as a lip variation E(t) (Step 110). A lip variation E(t) that is not less than the threshold (Step 111: YES) identifies utterance duration (Step 112), and a lip variation E(t) less than the threshold (Step 111: NO) identifies non-utterance duration (Step 113).

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to an utterance detection device, method, and program, and in particular to an utterance detection device, method, and program that detect utterances from images of a speaker's lips.

A speech recognition system operating in a general environment is affected by ambient noise: even when the speaker is not talking, it runs recognition on the noise and produces erroneous results. One effective way to reduce such misrecognition is to detect the speaker's utterance sections and perform speech recognition only within them.

Accordingly, detecting utterance sections from the movement of the speaker's lips has been studied as a way to improve the speech recognition rate. Patent Document 1 describes detecting the opening and closing of the mouth from the difference between the vertical distance of the lip contour and a reference value, or from the curvature of the lip contour, and identifying the speaker among a plurality of subjects. Non-Patent Document 1 describes determining the utterance state from the difference between the current lip pattern and the lip pattern N frames earlier.
Patent Document 1: JP 2000-338987 A
Non-Patent Document 1: Murai and Nakamura, "Noise-robust conversation detection using images of the mouth region", Spoken Language Information Processing 37-10, 2001

In general, when a speaker is talking, the position of the head varies within some range, so the speaker's appearance in the image also varies. It is therefore difficult to detect utterance sections robustly from lip images.

In Patent Document 1, the lip contour changes during an utterance, so the reference value itself fluctuates and the utterance state cannot be detected accurately. In Non-Patent Document 1, the utterance state is determined from the difference between the current lip pattern and the lip pattern N frames earlier alone, so the method is susceptible to sudden fluctuation factors such as noise.

The present invention has been proposed to solve the problems described above. Its object is to provide an utterance detection device, method, and program that detect a speaker's utterances with high accuracy without being affected by sudden factors such as noise.

An utterance detection device according to the present invention comprises: imaging means that images at least the lips of a speaker; lip feature region specifying means that specifies, in the image frames continuously captured by the imaging means, a lip feature region from which the shape of the lips can be identified; lip inclusion region specifying means that specifies, in those image frames, a lip inclusion region containing the lips whose shape is identified by the lip feature region; correlation value calculating means that compares the lip shape identified by the lip feature region in a specific image frame with the lip shape contained in the lip inclusion region of one or more consecutive image frames captured immediately before the specific image frame, and calculates correlation values between them; variation calculating means that calculates the amount of lip variation from the results of the correlation value calculating means; and utterance section detecting means that detects, from the variation calculated by the variation calculating means, whether the frame belongs to an utterance section.

A speaker detection method according to the present invention images at least the lips of a speaker; specifies, in the continuously captured image frames, a lip feature region from which the shape of the lips can be identified; specifies, in those image frames, a lip inclusion region containing the lips whose shape is identified by the lip feature region; compares the lip shape identified by the lip feature region in a specific image frame with the lip shape contained in the lip inclusion region of one or more consecutive image frames captured immediately before the specific image frame, and calculates correlation values between them; calculates the amount of lip variation from the calculated correlation values; and detects, from the calculated variation, whether the frame belongs to an utterance section.

A speaker detection program according to the present invention causes a computer to image at least the lips of a speaker; to specify, in the continuously captured image frames, a lip feature region from which the shape of the lips can be identified; to specify, in those image frames, a lip inclusion region containing the lips whose shape is identified by the lip feature region; to compare the lip shape identified by the lip feature region in a specific image frame with the lip shape contained in the lip inclusion region of one or more consecutive image frames captured immediately before the specific image frame, and to calculate correlation values between them; to calculate the amount of lip variation from the calculated correlation values; and to detect, from the calculated variation, whether the frame belongs to an utterance section.

The imaging means is installed so that it can image the speaker's lips. The lip feature region specifying means specifies, in the continuously captured image frames, a lip feature region from which the shape of the lips can be identified. The lip inclusion region specifying means specifies, in the continuously captured image frames, a lip inclusion region containing the lips whose shape is identified by the lip feature region.

The correlation value calculating means compares the lip shape identified by the lip feature region in a specific image frame among the continuously captured image frames with the lip shape contained in the lip inclusion region of one or more consecutive image frames captured immediately before that frame, and calculates correlation values between them. The "one or more consecutive image frames captured immediately before the specific image frame" need not be all of the frames that immediately precede the specific image frame; they may be arbitrarily selected frames.

The variation calculating means calculates the amount of lip variation from the results of the correlation value calculating means. When the speaker is not speaking, there is almost no variation between the lip shape identified by the lip feature region in a specific image frame and the lip shape contained in the lip inclusion region of the one or more consecutive image frames captured immediately before it; when the speaker is speaking, the variation is large. The amount of lip variation can therefore be obtained from these variations.

The utterance section detecting means can thus detect, from the calculated variation, whether the frame belongs to an utterance section.

The invention may further comprise cutout means that cuts out, from the image frames continuously captured by the imaging means, a density pattern that represents the features of the speaker's lips and is used to track and identify the shape of the lips. In that case, the lip feature region specifying means may specify, as the lip feature region, a rectangular region of the captured image frame that contains the density patterns cut out by the cutout means. The density pattern is preferably the gray-level values of at least one of the speaker's mouth corners, upper lip, and lower lip.

The utterance detection device, method, and program according to the present invention compare the lip shape identified by the lip feature region in a specific image frame among the continuously captured image frames with the lip shape contained in the lip inclusion region of one or more consecutive image frames captured immediately before that frame, calculate correlation values between them, and detect whether the frame belongs to an utterance section from the variation calculated from those values. Utterance sections can thereby be detected with high accuracy without being affected by sudden factors such as noise.

Preferred embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 shows the configuration of a speech recognition system according to an embodiment of the present invention. The speech recognition system comprises a CCD image sensor 1 that images the speaker at, for example, 30 frames per second; an A/D converter 2 that converts the image captured by the CCD image sensor 1 from analog to digital; an image processing device 10 that performs image processing on the image data from the A/D converter 2 to detect the speaker's utterances; a microphone 21 that receives the speaker's voice; an A/D converter 22 that converts the audio from the microphone 21 from analog to digital; and a speech recognition device 30 that performs speech recognition based on the utterances detected by the image processing device 10 and on the audio data.

The CCD image sensor 1 is installed so that it can capture an image of the speaker's face or of the area around the speaker's mouth. The microphone 21 is installed at a position where it picks up the speaker's voice.

The image processing device 10 comprises a CPU 11 that performs image processing, a RAM 12 that serves as a work area, and a ROM 13 that stores the control programs for the CPU 11. The ROM 13 stores the program for the utterance section detection routine as well as other programs.

FIG. 2 is a flowchart of the utterance section detection routine of the image processing device 10. To detect the speaker's utterance sections, the image processing device 10 executes the following processing from step 101 onward.

In step 101, when an image of the speaker is input from the CCD image sensor 1, the CPU 11 of the image processing device 10 proceeds to step 102.

In step 102, the CPU 11 determines whether the speaker's lip feature pattern has been registered (stored in the RAM 12).

FIGS. 3(a) and 3(b) show examples of lip feature patterns. A lip feature pattern is a density pattern used to track and identify the shape of the lips in the image. A density pattern is a pattern of the luminance levels of the individual pixels of an image. The lip feature pattern may be the density patterns at the two ends of the mouth (the mouth corners), as shown in FIG. 3(a), or the density patterns at two locations on the upper and lower lips, as shown in FIG. 3(b). The lip feature pattern is not limited to two locations: it may combine FIGS. 3(a) and 3(b), or consist of density patterns at three or more locations.

If the lip feature pattern has already been registered, the process proceeds to step 106 to track the lip feature pattern in the currently captured image. If it has not been registered, the process proceeds to step 103 to register it. The lip feature pattern need not be stored in the RAM 12; it may be stored in another storage medium (not shown), such as a non-volatile RAM or a magnetic disk.

In step 103, the CPU 11 detects a characteristic pattern of the lips (the lip feature pattern) in the image supplied from the A/D converter 2 and proceeds to step 104. Various image processing techniques can be used to detect the lip feature pattern.

For example, a neural network that responds to the mouth-corner lip feature pattern shown in FIG. 3(a) can be prepared in advance and applied to the input image, which makes the lip feature pattern of FIG. 3(a) easy to detect. Alternatively, the mouth corners may be located from the edge histogram distribution of the image and the lip feature pattern registered from them.

In step 104, the CPU 11 determines whether the lip feature pattern was detected. If it was, the process proceeds to step 105; if not, the process returns to step 101.

In step 105, the CPU 11 registers the lip feature pattern in the RAM 12 and proceeds to step 106.

In step 106, the CPU 11 tracks the lip feature pattern in the current frame image, based on the position of the lip feature pattern in the previous frame image, and proceeds to step 107.

FIG. 4 shows the search range for the lip feature pattern. First, the CPU 11 sets the search range, indicated by the dotted rectangle, based on the position of the lip feature pattern in the previous frame image. It then detects, within this search range, the region whose pattern is most similar to the registered lip feature pattern. An image processing technique such as normalized correlation can be used here.
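The search range in FIG. 4 can be sketched as a rectangle around the pattern's previous position. This is a minimal illustration only: the function name `search_window`, the `margin` parameter, and the clamping to the image bounds are assumptions, since the text does not state how large the search range is or how the image border is handled.

```python
def search_window(prev_x, prev_y, pat_w, pat_h, margin, img_w, img_h):
    """Return (left, top, right, bottom) of the search range in pixels.

    (prev_x, prev_y) is the top-left corner of the lip feature pattern in
    the previous frame; `margin` (hypothetical) is how far the pattern is
    allowed to move between frames. The range is clamped to the image.
    """
    left = max(0, prev_x - margin)
    top = max(0, prev_y - margin)
    right = min(img_w, prev_x + pat_w + margin)
    bottom = min(img_h, prev_y + pat_h + margin)
    return left, top, right, bottom


# A 16x16 pattern near the left and bottom edges of a 640x320 image.
print(search_window(10, 300, 16, 16, margin=20, img_w=640, img_h=320))
# → (0, 280, 46, 320)
```

A larger margin tolerates faster head motion at the cost of more correlation work per frame.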

In step 107, the CPU 11 determines whether tracking of the lip feature pattern succeeded. If it did, the process proceeds to step 108; if not, the process returns to step 101. Tracking can fail, for example, when the speaker moves the head violently. Whether tracking failed can be determined by examining the correlation value obtained from the normalized correlation processing.

In step 108, the CPU 11 cuts out the lip pattern from the image in which tracking succeeded and proceeds to step 109. Unlike the lip feature pattern, which is a gray-level image pattern, the lip pattern is a rectangular pattern that surrounds the lips. In other words, the lip pattern is a rectangular region of lip features from which the shape of the lips can be identified.

FIG. 5 illustrates how the lip pattern is cut out. Let the center coordinates of the two tracked lip feature patterns (in this embodiment, the density patterns of the two mouth corners) be (x1, y1) and (x2, y2), where x2 > x1. The CPU 11 then takes as the lip pattern f(t) the rectangular region of width (x2-x1)·r1 and height (x2-x1)·r2 centered at ((x1+x2)/2, (y1+y2)/2). Here r1 and r2 are coefficients set in advance for cutting out the lip pattern f(t).
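The rectangle described for FIG. 5 follows directly from the two mouth-corner centers. The sketch below only restates that geometry; the function name `lip_pattern_rect` and the example values of r1 and r2 are illustrative assumptions, not values given in the text.

```python
def lip_pattern_rect(corner_left, corner_right, r1, r2):
    """Return (cx, cy, width, height) of the lip pattern f(t).

    corner_left = (x1, y1) and corner_right = (x2, y2) are the centers of
    the two tracked mouth-corner patterns, with x2 > x1.
    """
    (x1, y1), (x2, y2) = corner_left, corner_right
    if x2 <= x1:
        raise ValueError("expected x2 > x1")
    span = x2 - x1                      # distance between the mouth corners
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return cx, cy, span * r1, span * r2


# r1 = 1.5 and r2 = 0.5 are made-up coefficients for illustration.
print(lip_pattern_rect((100, 200), (160, 204), r1=1.5, r2=0.5))
# → (130.0, 202.0, 90.0, 30.0)
```

Because both the width and the height scale with the mouth-corner distance, the rectangle grows and shrinks with the apparent size of the mouth in the image.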

When the lip feature pattern shown in FIG. 3(b) is tracked, and the center coordinates of its two patterns are, for example, (x1, y1) and (x2, y2), the rectangular region of width W and height H centered at ((x1+x2)/2, (y1+y2)/2) may be used as the lip pattern f(t). That is, f(t) is set according to the lip feature pattern being tracked.

Next, the CPU 11 cuts out from the image a lip inclusion pattern F(t) that contains the lip pattern f(t). The lip inclusion pattern F(t) is thus a rectangular region that contains the lips whose shape is identified by the lip pattern f(t).

FIG. 6 illustrates how the lip inclusion pattern F(t) is cut out. Let the center of the lip pattern f(t) be (x0, y0) and its width and height be W0 and H0. The CPU 11 then obtains the lip inclusion pattern F(t) centered at (x0, y0) with width W0·r0 and height H0·r0. Here r0 is a coefficient (> 1) for cutting out the lip inclusion pattern F(t).
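The enlargement described for FIG. 6 can be written in a few lines. The function name `lip_inclusion_rect` and the example coefficient are assumptions made for illustration; only the scaling rule itself comes from the text.

```python
def lip_inclusion_rect(x0, y0, w0, h0, r0):
    """Return (cx, cy, width, height) of the lip inclusion pattern F(t).

    The rectangle shares its center with the lip pattern f(t) and is
    enlarged by the factor r0 (> 1) in both directions.
    """
    if r0 <= 1:
        raise ValueError("r0 must be greater than 1")
    return x0, y0, w0 * r0, h0 * r0


# r0 = 1.5 is a made-up coefficient for illustration.
print(lip_inclusion_rect(130.0, 202.0, 90.0, 30.0, r0=1.5))
# → (130.0, 202.0, 135.0, 45.0)
```

The extra border around f(t) is what later lets the lip pattern of the current frame be searched for inside the inclusion patterns of earlier frames.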

In step 109, the CPU 11 calculates correlation values between the cut-out lip pattern f(t) and the lip inclusion patterns F(t-i) (i = 1, 2, ..., N) of the N frames immediately preceding the current frame. The lip shape identified by the lip pattern f(t) is thereby compared with the lip shape identified by each lip inclusion pattern F(t-i) (i = 1, 2, ..., N), and correlation values are calculated.

Specifically, the CPU 11 performs image correlation processing with the lip pattern f(t) as the reference (template) image and the lip inclusion pattern F(t-i) as the search image, and calculates the correlation value image s(f(t), F(t-i)). If the image size of the lip pattern f(t) is W0·H0 and the image size of the lip inclusion pattern F(t) is W1·H1 (where W1 > W0 and H1 > H0), the image size of the correlation value image s(f(t), F(t-i)) is (W1-W0)·(H1-H0). A known technique such as normalized correlation may be used for the image correlation processing. The correlation value s is normalized so that it takes a value from 0 to 1 and is larger the higher the correlation (the more similar the images).

In what follows, the correlation value at coordinates (x, y) of the correlation value image s(f(t), F(t-i)) is written s_i(x, y), where x = 0, 1, ..., (W1-W0-1) and y = 0, 1, ..., (H1-H0-1).
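The correlation value image can be illustrated with a small pure-Python sketch. It uses normalized cross-correlation without mean subtraction, which is one way to get scores between 0 and 1 for non-negative gray levels; the text names normalized correlation only as an example, so this choice and the function name `correlation_image` are assumptions. The sketch also evaluates every placement of the template, so its output is (W1-W0+1) by (H1-H0+1), one row and column more than the size stated above.

```python
import math


def correlation_image(template, search):
    """Slide `template` over `search` and return a 2D list of scores.

    Both arguments are 2D lists of non-negative gray levels. Each score is
    dot(t, p) / (|t| * |p|) for the patch p under the template, so it lies
    in [0, 1] and equals 1 when the patch is proportional to the template.
    """
    h0, w0 = len(template), len(template[0])
    h1, w1 = len(search), len(search[0])
    t_vals = [v for row in template for v in row]
    t_norm = math.sqrt(sum(v * v for v in t_vals))
    scores = []
    for y in range(h1 - h0 + 1):
        row = []
        for x in range(w1 - w0 + 1):
            p_vals = [search[y + j][x + i] for j in range(h0) for i in range(w0)]
            p_norm = math.sqrt(sum(v * v for v in p_vals))
            dot = sum(a * b for a, b in zip(t_vals, p_vals))
            denom = t_norm * p_norm
            row.append(dot / denom if denom > 0 else 0.0)
        scores.append(row)
    return scores


# Toy data: the template is the 2x2 patch at x=1, y=1 of the search image.
search = [[1, 2, 3, 4],
          [2, 9, 1, 3],
          [3, 1, 8, 2],
          [4, 3, 2, 1]]
template = [[9, 1],
            [1, 8]]
corr = correlation_image(template, search)
print(len(corr), len(corr[0]))  # → 3 3
```

A production system would more likely use an optimized routine for this step; the nested loops here are only meant to make the definition of s_i(x, y) concrete.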

Next, the CPU 11 calculates the maximum value s_max(t, t-i) of s_i(x, y) and the coordinates (sx(t, i), sy(t, i)) at which it occurs. That is,
s_max(t, t-i) = s_i(sx(t, i), sy(t, i))
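Finding the maximum of the correlation value image and its coordinates is a plain scan over the scores. The helper name `best_match` is hypothetical, and the tie-breaking rule (the first maximum in row-major order wins) is an assumption, since the text does not say how ties are resolved.

```python
def best_match(corr):
    """Return (s_max, sx, sy): the largest score and its (x, y) position.

    `corr` is a 2D list indexed as corr[y][x]. On ties, the first maximum
    in row-major order is kept.
    """
    s_max, sx, sy = -1.0, 0, 0
    for y, row in enumerate(corr):
        for x, s in enumerate(row):
            if s > s_max:
                s_max, sx, sy = s, x, y
    return s_max, sx, sy


print(best_match([[0.2, 0.5, 0.1],
                  [0.3, 0.9, 0.4]]))
# → (0.9, 1, 1)
```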

FIG. 7(a) shows an example of a reference image and a search image, FIG. 7(b) shows the reference image being scanned across the search image, FIG. 7(c) shows the position at which the reference image and the search image correlate most strongly, and FIG. 7(d) shows an example of a correlation value image. In these figures an image of a quadrangular pyramid is used in place of an image of the speaker's lips.

The CPU 11 calculates the correlation value between the reference image and the search image and looks for the position where the correlation is highest. In doing so, the reference image is scanned across the search image as shown in FIG. 7(b), and the matching position between the reference image and the search image is found as shown in FIG. 7(c). As shown in FIG. 7(d), the value of the correlation value image at the matching position is obtained as s_max. The search images (lip inclusion patterns F(t-i)) used in the image correlation processing preferably span N frames, as follows.

FIG. 8 illustrates the image correlation processing between the lip pattern f(t) and the lip inclusion patterns F(t-i) (i = 1, ..., N). First, image correlation processing is performed between f(t) and F(t-1), and the highest correlation value s_max(t, t-1) is obtained. Next, image correlation processing is performed between f(t) and F(t-2), and the highest correlation value s_max(t, t-2) is obtained. Likewise, image correlation processing is performed between f(t) and F(t-3), and the highest correlation value s_max(t, t-3) is obtained. The N correlation values s_max obtained in this way are used in the next step, step 110.
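One way to organize the N-frame comparison of FIG. 8 is to keep the most recent lip inclusion patterns in a fixed-length buffer. The class `InclusionHistory`, the function `max_correlations`, and the `match` callback are all hypothetical names introduced for this sketch; `match` stands in for whatever routine returns the maximum correlation value between f(t) and one stored pattern.

```python
from collections import deque


class InclusionHistory:
    """Keeps the N most recent lip inclusion patterns F(t-1) ... F(t-N)."""

    def __init__(self, n_frames):
        # A bounded deque silently drops the oldest pattern when full.
        self._frames = deque(maxlen=n_frames)

    def push(self, inclusion_pattern):
        # Call once per frame, after that frame has been matched.
        self._frames.append(inclusion_pattern)

    def recent_first(self):
        # Index 0 is F(t-1), index 1 is F(t-2), and so on.
        return list(reversed(self._frames))


def max_correlations(lip_pattern, history, match):
    """Apply match(f, F) to each stored pattern; return the s_max values."""
    return [match(lip_pattern, past) for past in history.recent_first()]


# Toy demonstration with strings standing in for image patches.
history = InclusionHistory(n_frames=3)
for pattern in ["A", "B", "C", "D"]:
    history.push(pattern)          # "A" is dropped once "D" arrives

toy_match = lambda f, F: 1.0 if f == F else 0.5
print(max_correlations("C", history, toy_match))
# → [0.5, 1.0, 0.5]
```

Until N frames have been seen, the buffer simply returns fewer than N values; how the routine should behave during that warm-up period is not specified in the text.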

If the lip inclusion patterns F(t) and F(t-i) are exactly the same image, then
s_max(t, t-i) = 1
sx(t, i) = (W1-W0)/2
sy(t, i) = (H1-H0)/2

In step 110, the CPU 11 calculates the lip variation E(t) by equation (1) from the values {s_max(t, t-i), sx(t, i), sy(t, i); i = 1, 2, ..., N} obtained by looking back from the current frame over the preceding N frames.

FIG. 9 shows the lip feature patterns, the lip pattern f, and the lip inclusion pattern F obtained at (a) time t, (b) time t+1, (c) time t+2, and (d) time t+3. In FIG. 9 the processing looks back two frames from the current frame, but the number of frames looked back is not particularly limited. The steps described above are explained below with reference to FIG. 9.

First, the lip inclusion pattern F(t) at time t shown in FIG. 9(a) is obtained as follows. The lip feature patterns used to detect the lip pattern f(t), for example rectangular patterns at the left and right mouth corners, are registered in advance as patterns characteristic of the lips (steps 102 to 105). The lip pattern f(t), indicated by the dotted rectangle, is then detected from the positions of the lip feature patterns (steps 107 and 108). Next, the lip inclusion pattern F(t), which contains the lip pattern f(t), is set. The images F(t-2) and F(t-1) are the lip inclusion patterns obtained at times t-2 and t-1, respectively. For each of F(t-2) and F(t-1), correlation processing searches for the image region most similar to the lip pattern f(t) at time t.

FIG. 10 illustrates the search for the pattern most similar to the lip pattern f(t).

As shown in the figure, correlation processing is performed between the lip pattern f(t) and the lip inclusion pattern F(t-1), and the pattern indicated by the solid rectangle is assumed to have the highest correlation. The correlation value at this position (a value from 0 to 1; the closer to 1, the higher the correlation) is written s(t, t-1). The displacement of that position relative to the lip pattern f(t-1) is written (Δx(t, t-1), Δy(t, t-1)). With the upper-left corners of the two rectangles in the figure at (xp, yp) and (xq, yq), the displacement is |xp-xq| and |yp-yq|.

If the speaker does not move the lips at all, F(t) and F(t-1) are ideally exactly the same image pattern, so
s(t, t-1) = 1.0
(Δx(t, t-1), Δy(t, t-1)) = (0, 0)

The lip variation E(t) is defined by equation (1) above. By this definition, E(t) is small when the mouth is considered not to be moving (the speaker is not speaking), and large when the mouth is considered to be moving substantially (the speaker is speaking) from time t-N to time t. Whether the frame belongs to an utterance section is therefore determined from the lip variation E(t) as follows.

In step 111, the CPU 11 compares the lip variation E(t) calculated by equation (1) with a preset threshold Eth. If E(t) > Eth, the process proceeds to step 112; if E(t) ≤ Eth, it proceeds to step 113.

In step 112, the CPU 11 determines that the current frame belongs to an utterance section and proceeds to step 114.

In step 113, the CPU 11 determines that the current frame does not belong to an utterance section and proceeds to step 114.
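The decision in steps 111 to 113 can be sketched as a simple threshold test. Equation (1) is not reproduced in this text, so `lip_variation` below is only a stand-in assumption: it sums (1 - s_max) over the N frames, which is small when the lips are still (every s_max near 1) and grows as the correlations drop. The displacement terms sx and sy, which the text says also enter E(t), are ignored here, and the threshold value is made up.

```python
def lip_variation(s_max_values):
    """Stand-in for equation (1): ASSUMED to be the sum of (1 - s_max).

    This is not the patent's formula. It only reproduces the described
    behavior: near 0 for still lips, larger when the lips move.
    """
    return sum(1.0 - s for s in s_max_values)


def is_utterance(e_t, e_th):
    """Steps 111 to 113: utterance section if and only if E(t) > Eth."""
    return e_t > e_th


E_TH = 0.5  # hypothetical threshold, chosen only for this example

still = lip_variation([1.0, 1.0, 1.0])   # identical frames
moving = lip_variation([0.5, 0.5])       # lips changed in both past frames
print(is_utterance(still, E_TH), is_utterance(moving, E_TH))
# → False True
```

Because E(t) aggregates over N past frames rather than a single frame pair, one noisy frame shifts the total only slightly, which is the robustness argument made in the text.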

In step 114, the CPU 11 transmits the utterance-section determination result to the speech recognition device 30 and ends the process. The processing from step 101 onward is then executed again. The speech recognition device 30 can thus perform speech recognition on the speech data supplied from the A/D converter 22 while taking into account the determination result of the image processing device 10, that is, whether or not the current frame is in an utterance section, which improves the recognition rate.

As described above, the speech recognition system according to the embodiment of the present invention detects whether or not the current frame is in an utterance section based on the lip variation calculated from the correlation values between the lip pattern f(t), obtained from the registered lip feature pattern, and the lip inclusion patterns F(t−i) going back up to N frames. Because the amount of change over the preceding N frames is used to detect utterance sections, the speech recognition system can detect them reliably without being affected by minute changes of the speaker's lips.

Furthermore, rather than simply computing a correlation value between patterns, the speech recognition system finds the position of highest correlation through correlation processing between the current lip pattern f(t) and the past lip inclusion pattern F(t−i). This reduces the influence of lip pattern cut-out errors that arise when the lip feature pattern fluctuates.

In addition, the speech recognition system does not rely on complicated image processing techniques for obtaining lip color information, contour information, end-point information, and the like. Instead, it derives the rectangular lip pattern f(t) and lip inclusion pattern F(t) from the density pattern and its position information, which reduces the computational load and allows utterance sections to be detected at high speed.

Finally, because the speech recognition system performs speech recognition on the speaker's speech data while taking into account the determination of whether or not the current frame is in an utterance section, it improves the recognition rate and achieves highly accurate speech recognition.

The present invention is not limited to the embodiment described above and is, of course, also applicable to designs modified within the scope of the claims.

For example, the CPU 11 of the image processing device 10 may use any of the following equations (2), (3), and (4) instead of equation (1).

Equations (2) and (3) compute the lip variation E(t) without using the distance information Δd. Equations (3) and (4) use a weighting coefficient α(i) that depends on the frame. Alternatively, the following equation (5) may be used.

FIG. 11 illustrates another image correlation process between the lip pattern f(t) and the lip inclusion pattern F(t−i). As shown in the figure and in equation (5), image correlation processing may be performed between f(t) and F(t−1), between f(t−1) and F(t−2), and between f(t−2) and F(t−3). That is, the highest correlation value may be computed between each lip pattern and the lip inclusion pattern of the frame immediately preceding it, and the lip variation E(t) may be calculated from the sum of these highest correlation values.
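The difference between the pairing used in the main embodiment and the FIG. 11 variant can be made concrete by listing which frames are compared. The sketch below only enumerates frame-index pairs (lip pattern frame, lip inclusion pattern frame); the function names are hypothetical, and how the resulting correlation values are combined is left to the equations of the specification.

```python
def pairs_fixed_current(t, N):
    """Main embodiment: f(t) is compared with F(t-i) for i = 1..N."""
    return [(t, t - i) for i in range(1, N + 1)]

def pairs_consecutive(t, N):
    """FIG. 11 variant: f(t-i+1) is compared with F(t-i) for i = 1..N."""
    return [(t - i + 1, t - i) for i in range(1, N + 1)]

# Example with current frame t = 10 and N = 3.
fixed = pairs_fixed_current(10, 3)      # f(10) vs F(9), F(8), F(7)
chained = pairs_consecutive(10, 3)      # f(10) vs F(9), f(9) vs F(8), f(8) vs F(7)
```

In the variant, every comparison spans exactly one frame, so each pair captures only frame-to-frame motion rather than motion accumulated since an older frame.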

The CPU 11 may also thin out the frames used in calculating the lip variation E(t), for example by using i = 1, 3, 5, ..., N. This shortens the processing time required to detect an utterance section.
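Frame thinning can be sketched as a choice of which offsets i are visited. The helper name `thinned_offsets` and its `step` parameter are hypothetical; the sequence 1, 3, 5, ..., N in the text corresponds to a step of 2 with an odd N.

```python
def thinned_offsets(N, step=2):
    """Frame offsets i used for E(t) when every `step`-th past frame is kept."""
    return list(range(1, N + 1, step))

all_offsets = thinned_offsets(7, step=1)   # no thinning: every past frame
odd_offsets = thinned_offsets(7)           # thinned: every other past frame
```

For N = 7 the number of correlation searches per frame drops from 7 to 4, which is where the reduction in processing time comes from.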

FIG. 1 is a diagram showing the configuration of the speech recognition system according to the embodiment of the present invention.

FIG. 2 is a flowchart showing the utterance section detection routine of the image processing device 10.

FIG. 3 is a diagram showing an example of a lip feature pattern.

FIG. 4 is a diagram showing the search range of the lip feature pattern.

FIG. 5 is a diagram explaining how the lip pattern is cut out.

FIG. 6 is a diagram explaining how the lip inclusion pattern F(t) is cut out.

FIG. 7(a) is a diagram showing an example of a reference image and a search image, FIG. 7(b) is a diagram showing the reference image being scanned over the search image, FIG. 7(c) is a diagram showing the position of highest correlation between the reference image and the search image, and FIG. 7(d) is a diagram showing an example of a correlation value image.

FIG. 8 is a diagram explaining the image correlation processing between the lip pattern f(t) and the lip inclusion pattern F(t−i).

FIG. 9 is a diagram showing the lip feature pattern, the lip pattern f, and the lip inclusion pattern F obtained at (a) time t, (b) time t+1, (c) time t+2, and (d) time t+3.

FIG. 10 is a diagram explaining the search for the pattern most similar to the lip pattern f(t).

FIG. 11 is a diagram explaining another correlation process between the lip pattern f(t) and the lip inclusion pattern F(t−i).

Explanation of symbols

1 CCD image sensor
2, 22 A/D converter
10 Image processing device
11 CPU
12 RAM
13 ROM
21 Microphone
30 Speech recognition device

Claims

An utterance detection device comprising:
imaging means for imaging at least the lips of a speaker;
lip feature region specifying means for specifying, in image frames continuously captured by the imaging means, a lip feature region from which the shape of the lips can be identified;
lip inclusion region specifying means for specifying, in the image frames continuously captured by the imaging means, a lip inclusion region that includes the lips whose shape is identified by the lip feature region;
correlation value calculating means for comparing the lip shape identified by the lip feature region in a specific image frame among the image frames continuously captured by the imaging means with the lip shape contained in the lip inclusion region in one or more consecutive image frames captured immediately before the specific image frame, and calculating correlation values between them;
variation calculating means for calculating a lip variation based on the calculation result of the correlation value calculating means; and
utterance section detecting means for detecting whether or not the specific image frame is in an utterance section based on the variation calculated by the variation calculating means.

The utterance detection device according to claim 1, further comprising cutting-out means for cutting out, from the image frames continuously captured by the imaging means, a density pattern that represents a feature of the speaker's lips and is used to track and identify the shape of the lips,
wherein the lip feature region specifying means specifies, as the lip feature region, a rectangular region that lies in the image frames continuously captured by the imaging means and includes the density pattern cut out by the cutting-out means.

The utterance detection device according to claim 1, wherein the correlation value calculating means calculates the highest correlation value between the lip shape identified by the lip feature region in the specific image frame and each lip shape contained in the lip inclusion region of any image frame among a plurality of consecutive image frames captured immediately before the specific image frame.

The utterance detection device according to claim 2, wherein the density pattern is a gray value of at least one of the speaker's mouth corner, upper lip, and lower lip.

An utterance detection method comprising:
imaging at least the lips of a speaker;
specifying, in the continuously captured image frames, a lip feature region from which the shape of the lips can be identified;
specifying, in the continuously captured image frames, a lip inclusion region that includes the lips whose shape is identified by the lip feature region;
comparing the lip shape identified by the lip feature region in a specific image frame among the continuously captured image frames with the lip shape contained in the lip inclusion region in one or more consecutive image frames captured immediately before the specific image frame, and calculating correlation values between them;
calculating a lip variation based on the result of the correlation value calculation; and
detecting whether or not the specific image frame is in an utterance section based on the calculated variation.

The utterance detection method according to claim 5, further comprising cutting out, from the continuously captured image frames, a density pattern that represents a feature of the speaker's lips and is used to track and identify the shape of the lips,
wherein the lip feature region is specified by specifying, as the lip feature region, a rectangular region that lies in the continuously captured image frames and includes the cut-out density pattern.

The utterance detection method according to claim 5, wherein the correlation value calculation calculates the highest correlation value between the lip shape identified by the lip feature region in the specific image frame and each lip shape contained in the lip inclusion region of any image frame among a plurality of consecutive image frames captured immediately before the specific image frame.

The utterance detection method according to claim 6, wherein the density pattern is a gray value of at least one of a speaker's mouth corner, upper lip, and lower lip.

An utterance detection program for causing a computer to execute processing comprising:
imaging at least the lips of a speaker;
specifying, in the continuously captured image frames, a lip feature region from which the shape of the lips can be identified;
specifying, in the continuously captured image frames, a lip inclusion region that includes the lips whose shape is identified by the lip feature region;
comparing the lip shape identified by the lip feature region in a specific image frame among the continuously captured image frames with the lip shape contained in the lip inclusion region in one or more consecutive image frames captured immediately before the specific image frame, and calculating correlation values between them;
calculating a lip variation based on the result of the correlation value calculation; and
detecting whether or not the specific image frame is in an utterance section based on the calculated variation.