JP6127811B2 - Image discrimination device, image discrimination method, and image discrimination program

Info

Publication number: JP6127811B2
Application number: JP2013157645A
Authority: JP
Inventors: 馬場 孝之; 石原 正樹; 杉村 昌彦; 遠藤 進; 上原 祐介; 内藤 宏久; 宮崎 あきら
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-07-30
Filing date: 2013-07-30
Publication date: 2017-05-17
Anticipated expiration: 2033-07-30
Also published as: JP2015028691A

Description

The present invention relates to an image discrimination device, an image discrimination method, and an image discrimination program.

In recent years, lip-reading technology, which estimates utterance content from images showing a person's mouth, has attracted attention. Lip reading uses, for example, dictionary information in which character information of a predetermined unit, such as a word, is associated with shape information indicating the shape of the mouth when that character information is pronounced. During lip-reading processing, an input image showing the mouth region is compared with the shape information included in the dictionary information, and the character information associated with shape information that has a high degree of match with the mouth region in the input image is determined to be the utterance content.

As the shape information included in the dictionary information, an image of the mouth region, measured values of shape features of the mouth region, or the like can be used.
Using lip reading to operate devices such as mobile terminals has also been considered. For example, a mobile phone has been proposed that identifies an uttered word from an image of the operator's mouth and executes a process associated with the identification result.

As an example of a technique for locating the mouth region, template matching using a template that carries color information and shape information is known.

JP 2012-59017 A
JP 2012-118679 A
JP 2006-12093 A

One conceivable method of generating such dictionary information is to have many subjects pronounce character strings such as words, capture images of the mouth region during pronunciation, and generate the dictionary information from the captured images. However, this method has the problem that the image capture is laborious.

Alternatively, various video content that is published on a network or distributed on recording media could be collected and used as the basis for generating dictionary information. However, because the content of video collected in this way varies widely, the work of extracting portions suitable for generating dictionary information from the video content becomes enormous.

In one aspect, an object of the present invention is to provide an image discrimination device, an image discrimination method, and an image discrimination program that can make the work of generating dictionary information for lip-reading processing more efficient.

In one proposal, an image discrimination device having a detection unit and a discrimination unit is provided. The detection unit detects a mouth region from each frame of an input moving image that shows a scene during the period in which a character string is pronounced. When the number of frames of the input moving image in which no mouth region was detected is equal to or less than a predetermined number, the discrimination unit discriminates the input moving image as a moving image for generating dictionary information that indicates the shape of the mouth when the character string is pronounced.

In another proposal, an image discrimination method is provided in which processing similar to that of the above image discrimination device is executed.
In yet another proposal, an image discrimination program is provided that causes a computer to execute processing similar to that of the above image discrimination device.

According to one aspect, the work of generating dictionary information for lip-reading processing can be made more efficient.

FIG. 1 is a diagram showing a configuration example and a processing example of the image discrimination device according to the first embodiment.
FIG. 2 is a diagram showing a hardware configuration example of the work support apparatus according to the second embodiment.
FIG. 3 is a diagram showing an outline of the lip-reading processing procedure.
FIG. 4 is a block diagram showing an example of the functions of the work support apparatus.
FIG. 5 is a diagram showing a processing example of the utterance image file generation unit.
FIG. 6 is a diagram showing another processing example of the utterance image file generation unit.
FIG. 7 is a diagram showing a processing example of the word division unit and the word section extraction unit.
FIG. 8 is a diagram showing a processing example of the mouth region detection unit.
FIG. 9 is a diagram showing a processing example of the determination unit.
FIG. 10 is a flowchart showing a processing example of the work support apparatus.
FIG. 11 is a diagram showing an outline of mouth region search processing using templates prepared for each character.
FIG. 12 is a diagram showing an example of the timing for switching between templates.
FIG. 13 is a diagram showing an example that uses a template obtained by interpolation.
FIG. 14 is a diagram showing an example of interpolation processing when feature vectors are used.
FIG. 15 is a diagram showing a processing example for the case where a plurality of mouth regions are detected.
FIG. 16 is a diagram showing another processing example for the case where a plurality of mouth regions are detected.
FIG. 17 is a diagram showing an example of detection processing according to the size of the mouth region.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
FIG. 1 is a diagram showing a configuration example and a processing example of the image discrimination device according to the first embodiment.

The image discrimination device 1 supports the work of generating dictionary information for lip-reading processing. Dictionary information is, for example, information in which a meaningful character string such as a word is associated with shape information indicating the shape of the mouth when that character string is pronounced. In lip-reading processing, for example, an input image showing the mouth region is compared with the shape information included in the dictionary information, and the character string associated with shape information that has a high degree of match with the mouth region in the input image is determined to be the utterance content.

The image discrimination device 1 has a detection unit 2 and a discrimination unit 3. The processing of the detection unit 2 and the discrimination unit 3 is realized, for example, by a processor of the image discrimination device 1 executing a predetermined program.

The detection unit 2 detects a mouth region from each frame of an input moving image that shows a scene during the period in which a certain character string is pronounced. Various methods, such as template matching, can be used to detect the mouth region. The input moving image can be, for example, one extracted from a moving image published on a network such as the Internet. It is assumed that, at the time of detection by the detection unit 2, the character string pronounced in the scene shown in the input moving image is already known.

In the example of FIG. 1, the input moving image 4 shows a scene during the period in which the character string "こんにちは" ("hello") is pronounced. To keep the explanation simple, the input moving image 4 is assumed here to contain five frames.

The discrimination unit 3 counts the number of frames of the input moving image in which the detection unit 2 did not detect a mouth region. When the number of frames in which no mouth region was detected is equal to or less than a predetermined determination threshold, the discrimination unit 3 discriminates the input moving image as a moving image for generating dictionary information. In that case, the discrimination unit 3 may also, for example, register the input moving image as mouth shape information in the dictionary information. The determination threshold is an integer of 0 or more.

In the example of FIG. 1, it is assumed that a mouth region was detected in the first, second, fourth, and fifth frames of the input moving image 4 but not in the third frame (step S1). If the determination threshold is "0", the discrimination unit 3 determines that the number of frames in which no mouth region was detected is "1", which is greater than the determination threshold "0" (step S2). In this case, the discrimination unit 3 discriminates the input moving image 4 as not being a moving image for generating dictionary information.
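The counting rule applied by the discrimination unit 3 can be illustrated with a minimal Python sketch. The names `MouthRegion`, `count_missing_frames`, and `is_dictionary_material`, as well as the coordinate values, are hypothetical and chosen only for illustration; the per-frame detection result is assumed to be either a bounding box or `None`, and the detection method itself is outside the scope of the sketch.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representation: one entry per frame, holding the detected
# mouth region as (x, y, width, height), or None when detection failed.
MouthRegion = Tuple[int, int, int, int]


def count_missing_frames(detections: Sequence[Optional[MouthRegion]]) -> int:
    """Count the frames in which no mouth region was detected."""
    return sum(1 for region in detections if region is None)


def is_dictionary_material(
    detections: Sequence[Optional[MouthRegion]], threshold: int = 0
) -> bool:
    """Return True when the number of frames without a mouth region is
    equal to or less than the determination threshold."""
    if threshold < 0:
        raise ValueError("the determination threshold must be 0 or more")
    return count_missing_frames(detections) <= threshold


# The FIG. 1 example: five frames, with no mouth region in the third frame.
# The coordinate values are made up for illustration.
fig1_detections = [
    (40, 60, 32, 16),
    (41, 60, 32, 17),
    None,
    (42, 61, 31, 16),
    (42, 61, 32, 16),
]
```

With a determination threshold of 0, `is_dictionary_material(fig1_detections, 0)` rejects the five-frame example because one frame lacks a mouth region; a threshold of 1 would accept it.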

A moving image used to generate the mouth shape information included in the dictionary information preferably shows the mouth region in as many frames as possible. This improves the accuracy of the match determination when the generated shape information is compared with an image showing a mouth during lip-reading processing.

Through the above processing, the image discrimination device 1 can accurately determine whether an input moving image is suitable as a moving image for generating dictionary information that enables highly accurate lip-reading processing. The efficiency of the dictionary information generation work can therefore be increased.

In particular, when the input moving image is extracted from the large volume of moving images already in circulation, such as those published on a network, the mouth of the speaker is not necessarily visible throughout, even though the scene is one in which the corresponding character string is pronounced. In some cases not only the mouth but also the face, or the person altogether, may not appear. The above processing of the image discrimination device 1 makes the dictionary information generation work markedly more efficient when such input moving images are used.

[Second Embodiment]
FIG. 2 is a diagram showing a hardware configuration example of the work support apparatus according to the second embodiment. The work support apparatus 100 is an apparatus for supporting the work of generating dictionary information used for lip-reading processing. The work support apparatus 100 is realized, for example, as a computer such as that shown in FIG. 2.

The work support apparatus 100 as a whole is controlled by a processor 101. The processor 101 may be a multiprocessor. The processor 101 is, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or a PLD (Programmable Logic Device). The processor 101 may also be a combination of two or more of a CPU, an MPU, a DSP, an ASIC, and a PLD.

A RAM (Random Access Memory) 102 and a plurality of peripheral devices are connected to the processor 101 via a bus 108.
The RAM 102 is used as the main storage device of the work support apparatus 100. The RAM 102 temporarily stores at least part of the OS (Operating System) program and the application programs to be executed by the processor 101. The RAM 102 also stores various data needed for processing by the processor 101.

The peripheral devices connected to the bus 108 include an HDD (Hard Disk Drive) 103, a graphics processing device 104, an input interface 105, a reading device 106, and a communication interface 107.

The HDD 103 is used as an auxiliary storage device of the work support apparatus 100. The HDD 103 stores the OS program, application programs, and various data. Another type of nonvolatile storage device, such as an SSD (Solid State Drive), can also be used as the auxiliary storage device.

A display device 104a is connected to the graphics processing device 104. The graphics processing device 104 displays images on the screen of the display device 104a in accordance with instructions from the processor 101. Examples of the display device include a display device using a CRT (Cathode Ray Tube) and a liquid crystal display device.

An input device 105a is connected to the input interface 105. The input interface 105 transmits signals output from the input device 105a to the processor 101. Examples of the input device 105a include a keyboard and a pointing device. Examples of pointing devices include a mouse, a touch panel, a tablet, a touch pad, and a trackball.

A portable recording medium 106a can be attached to and detached from the reading device 106. The reading device 106 reads data recorded on the portable recording medium 106a and transmits it to the processor 101. Examples of the portable recording medium 106a include an optical disc, a magneto-optical disk, and a semiconductor memory.

The communication interface 107 transmits data to and receives data from other devices via a network 107a. The network 107a may be connected to, for example, the Internet.

The processing functions of the work support apparatus 100 can be realized with the hardware configuration described above.
Next, FIG. 3 is a diagram showing an outline of the lip-reading processing procedure. FIG. 3 is used to explain the dictionary information whose generation is the subject of this embodiment and how that dictionary information is used in lip-reading processing.

In the dictionary information 200, for example, records 201 are registered, each corresponding to one word. In each record 201, text information indicating a word (or identification information for the word) is registered in association with a moving image that captures the mouth region when the word is pronounced. A plurality of moving images can be registered in association with a single record 201.

The lip-reading engine 210 performs, for example, the following lip-reading processing using the dictionary information 200. A moving image 221 to be processed is input to the lip-reading engine 210. The moving image 221 shows the mouth region of an unknown person. The lip-reading engine 210 matches the image of the mouth region in the moving image 221 against the moving images registered in the dictionary information 200 and calculates a similarity for each. The lip-reading engine 210 then identifies the moving image with the highest similarity among the moving images in the dictionary information 200, and outputs the text information of the word associated with that moving image as an estimation result 222 of what the person shown in the moving image 221 uttered.
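The selection step performed by the lip-reading engine 210 can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual implementation: a "clip" is reduced to one number per frame, and `toy_similarity` is a made-up similarity measure introduced only so the example runs. The names `Record`, `Clip`, `toy_similarity`, and `estimate_word` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a moving image of a mouth region: a sequence of
# numbers, one per frame (for example, a mouth-opening measurement).
Clip = Sequence[float]


@dataclass
class Record:
    """One record 201: a word and the moving images registered for it."""
    word: str
    clips: List[Clip] = field(default_factory=list)


def toy_similarity(a: Clip, b: Clip) -> float:
    """Illustrative measure only: the negative mean absolute difference over
    the frames both clips share. The source does not specify a measure."""
    n = min(len(a), len(b))
    if n == 0:
        return float("-inf")
    return -sum(abs(x - y) for x, y in zip(a, b)) / n


def estimate_word(
    target: Clip,
    dictionary: Sequence[Record],
    similarity: Callable[[Clip, Clip], float] = toy_similarity,
) -> Tuple[str, float]:
    """Return the word whose registered clip is most similar to the target,
    together with that similarity. Returns ("", -inf) for an empty dictionary."""
    best_word, best_score = "", float("-inf")
    for record in dictionary:
        for clip in record.clips:
            score = similarity(target, clip)
            if score > best_score:
                best_word, best_score = record.word, score
    return best_word, best_score
```

Because every clip registered for a word is compared with the target, registering several clips per record gives each word more chances to match, which is the motivation for associating many moving images with one word.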

As described above, a plurality of moving images can be registered in association with one word in the dictionary information 200. For example, moving images capturing the mouth regions of different people pronouncing the same word can be registered in one record 201. Moving images obtained by capturing, from different angles, the mouth region of the same person pronouncing the same word can also be registered in one record 201.

By associating many moving images with each word in the dictionary information 200 in this way and using them for matching in lip-reading processing, lip-reading accuracy can be improved.

One method of preparing the moving images to be registered in the dictionary information 200 is, for example, to have many subjects pronounce words and capture images of the mouth region during pronunciation, but this method has the problem that the capture work is laborious. In particular, the more moving images are registered in the dictionary information 200 to improve lip-reading accuracy as described above, the greater the capture effort becomes.

In response, the present inventors studied a method of collecting various moving images that have already been captured and using them to generate the moving images to be registered in the dictionary information 200. Moving images to be collected could include, for example, those published on a network such as the Internet and those distributed on recording media such as optical discs. Such a method eliminates the capture work otherwise needed to generate the dictionary information 200.

However, even if a moving image collected in this way shows a scene in which the target word is pronounced, it does not necessarily show the mouth region of the person who pronounced the word. Even if the mouth region does appear, it does not necessarily appear throughout the entire period from the start to the end of the pronunciation of the word.

To be used as a moving image in the dictionary information 200, a moving image should show the speaker's mouth region for as much as possible of the period from the start to the end of the pronunciation of the desired word, and most preferably for the entire period. This increases the accuracy of lip-reading processing. However, extracting moving images suitable for use in the dictionary information 200 from the collected moving images has been extremely laborious.

The work support apparatus 100 of this embodiment therefore automates at least part of the work of extracting, from the moving images collected in this way, moving images suitable for use in the dictionary information 200, and thereby improves the efficiency of that work. This makes the development of the lip-reading engine 210 more efficient, increases its lip-reading accuracy, and reduces development cost.

In the description of FIG. 3 above, one or more moving images are registered in each record 201 of the dictionary information 200, but it is sufficient for each record 201 to hold information indicating the shape of the mouth region when the corresponding word is pronounced. A moving image is one example of such information. Another example of information indicating the shape of the mouth region is a feature amount of the mouth region.

However, information indicating the shape of the mouth region is generally generated from a moving image that captures the mouth region. Therefore, even if information other than a moving image is registered in the dictionary information 200 as the information indicating the shape of the mouth region, a moving image capturing the mouth region is still needed to generate that dictionary information 200.

In the lip-reading processing of this embodiment, recognition is performed word by word, as described with reference to FIG. 3. This approach can achieve higher recognition accuracy than, for example, recognition performed character by character (syllable by syllable).

The term "word" as used in this embodiment also covers, for example, a "phrase" that consists of a word and any ancillary words that follow it.
This embodiment can also be modified so that recognition is performed in units larger than a word (or phrase), for example in units of multiple phrases. In that case, in the dictionary information 200, one or more pieces of information indicating the shape of the mouth region may be registered in association with, for example, each piece of character information that spans multiple phrases.

FIG. 4 is a block diagram showing an example of the functions of the work support apparatus. The work support apparatus 100 has a moving image collection unit 111, an utterance image file generation unit 112, a word division unit 113, a word section extraction unit 114, a mouth region detection unit 115, and a determination unit 116. The processing of each of these processing blocks is realized, for example, by the processor 101 of the work support apparatus 100 executing a predetermined program.

The moving image collection unit 111 collects moving images from the network 107a. For example, the moving image collection unit 111 accesses a Web server that hosts a Web site distributing moving image content free of charge, such as a video posting site, and downloads moving image data. The collected moving image data is stored, for example, in the HDD 103 of the work support apparatus 100.

The utterance image file generation unit 112 extracts, from each of the moving images collected by the moving image collection unit 111, sections of scenes in which a person speaks (hereinafter referred to as "utterance sections"). The utterance image file generation unit 112 cuts out the moving image of each extracted utterance section and generates an utterance image file 130. In the utterance image file 130, utterance content text 131 indicating the utterance content and moving image data 132 are stored in association with each other. The generated utterance image file 130 is temporarily stored, for example, in the RAM 102 of the work support apparatus 100.

As described later, this embodiment basically assumes that subtitle text is attached to the moving images collected by the moving image collection unit 111. For example, subtitle text is attached so that one sentence (subtitle) is displayed over a range of adjacent frames in a collected moving image.

The word division unit 113 divides the utterance content text 131 of each generated utterance image file 130 into words using morphological analysis or a similar technique.
The word section extraction unit 114 cuts out, from the moving image data 132 of each generated utterance image file 130, the section in which each word obtained by the word division unit 113 is pronounced (hereinafter referred to as a "word section"), and generates a word image file 140. In the word image file 140, word text 141 indicating the pronounced word and moving image data 142 are stored in association with each other. The generated word image file 140 is temporarily stored, for example, in the RAM 102 of the work support apparatus 100.

The mouth region detection unit 115 detects, by image processing, a mouth region from each frame of the image based on the moving image data 142 of each generated word image file 140.
The determination unit 116 counts, for the moving image data 142 of each word image file 140, the number of frames in which the mouth region detection unit 115 did not detect a mouth region. When the number of frames in which no mouth region was detected is equal to or less than a predetermined determination threshold, the determination unit 116 determines that the moving image data 142 under examination can be used to generate the dictionary information 200 described above. In this case, the determination unit 116 generates, based on the corresponding word image file 140, a dictionary candidate file 150 to be used in generating the dictionary information 200.

The dictionary candidate file 150 contains word text 151, moving image data 152, and mouth area coordinates 153 indicating the area in which the mouth area was detected in each frame of the moving image data 152. Of these, the word text 151 and the moving image data 152 are the same as the word text 141 and the moving image data 142 of the corresponding word image file 140. The generated dictionary candidate file 150 is stored, for example, in the HDD 103 of the work support apparatus 100.

Note that the work support apparatus 100 need not include the moving image collection unit 111 and the utterance image file generation unit 112. In this case, the utterance image files 130 are stored in the work support apparatus 100 via a network or a portable recording medium.

Furthermore, the work support apparatus 100 may omit the word dividing unit 113 and the word section extraction unit 114 in addition to the moving image collection unit 111 and the utterance image file generation unit 112. In this case, the word image files 140 are stored in the work support apparatus 100 via a network or a portable recording medium.

Next, the processing of each of the above processing blocks is described.
First, FIG. 5 is a diagram illustrating a processing example of the utterance image file generation unit.
As described above, it is assumed that subtitle text is basically added to the moving images collected by the moving image collection unit 111. The subtitle text is attached as text data, for example in a header area of the collected moving image data, so that the utterance image file generation unit 112 can recognize the content of the subtitle text and its display period.

The utterance image file generation unit 112 identifies, in the collected moving image, a period during which the same subtitle text is displayed (hereinafter referred to as a “same subtitle display period”). The utterance image file generation unit 112 cuts out the moving image data of the identified period as the moving image data 132 of an utterance image file 130 and associates it with utterance content text 131 describing the displayed subtitle text. The utterance image file 130 is generated in this way.

In the example of FIG. 5, a period T1 during which the subtitle text “ohayou, kyou wa” (“Good morning, today”) is displayed is identified, and the frames in the period T1 are cut out to generate the moving image data 132. An utterance image file 130 is then output that contains the cut-out moving image data 132 and utterance content text 131 describing the text “ohayou, kyou wa”.
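The identification of same subtitle display periods can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the subtitle shown on each frame is already available as a list with one string per frame and that an empty string means no subtitle is displayed; the function name `same_caption_periods` is hypothetical and does not come from the description.

```python
from itertools import groupby


def same_caption_periods(frame_captions):
    """Return (caption, first_frame, last_frame) for each run of identical captions.

    frame_captions holds one subtitle string per frame (a hypothetical
    representation); frames with an empty string are treated as having
    no subtitle and are skipped.
    """
    periods = []
    index = 0
    for caption, run in groupby(frame_captions):
        length = len(list(run))
        if caption:
            periods.append((caption, index, index + length - 1))
        index += length
    return periods
```

Each returned tuple corresponds to one candidate utterance image file: the caption plays the role of the utterance content text 131, and the frame range delimits the moving image data 132.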

Note that the utterance image file generation unit 112 can also generate an utterance image file 130 from a moving image that was collected by the moving image collection unit 111 and has no subtitle text added. An example of this case is described with reference to FIG. 6.

FIG. 6 is a diagram illustrating another processing example of the utterance image file generation unit.
The utterance image file generation unit 112 detects, for example, utterance sections in which a person speaks, based on the audio signal of the moving image collected by the moving image collection unit 111. Utterance sections can be detected, for example, by analyzing the frequency spectrum of the audio signal. The utterance image file generation unit 112 then performs speech recognition on the audio signal of each detected utterance section to obtain text indicating the utterance content. With this method, an utterance image file 130 can be generated even when no subtitle text is added to the collected moving image.

In the example of FIG. 6, it is assumed that utterance sections A1 and A2 are detected from the audio signal of the collected moving image, and that the utterance contents “ohayou” (“good morning”) and “kyou wa” (“today”) are recognized by speech recognition from the audio signals of the utterance sections A1 and A2, respectively. In this case, the utterance image file generation unit 112 outputs an utterance image file 130 containing moving image data 132 obtained by cutting out the frames in the utterance section A1 and utterance content text 131 describing the text “ohayou”. The utterance image file generation unit 112 also outputs an utterance image file 130 containing moving image data 132 obtained by cutting out the frames in the utterance section A2 and utterance content text 131 describing the text “kyou wa”.

Note that at least part of the work of extracting utterance sections and utterance content, either from the utterance image file 130 or from the moving image collected by the moving image collection unit 111, may be performed through operations by an operator. In addition, for example, for a word image file 140 generated by the word dividing unit 113 and the word section extraction unit 114, the start time and end time of the moving image and the utterance content may be corrected through operations by the operator.

Next, FIG. 7 is a diagram illustrating a processing example of the word dividing unit and the word section extraction unit.
The word dividing unit 113 divides the utterance content text 131 of each generated utterance image file 130 into words by using morphological analysis or the like.

The word section extraction unit 114 cuts out, from the moving image data 132 of each generated utterance image file 130, the section in which each word obtained by the word dividing unit 113 is pronounced (hereinafter referred to as a “word section”), and generates a word image file 140. In the word image file 140, word text 141 indicating the pronounced word and the cut-out moving image data 142 are stored in association with each other. The generated word image file 140 is temporarily stored, for example, in the RAM 102 of the work support apparatus 100.

The word section extraction unit 114, for example, performs speech recognition on the audio signal included in the moving image data 132 and detects the start time and end time of the section in which each word is pronounced. The word section extraction unit 114 then cuts out, from the moving image data 132, the frames displayed from the start time to the end time to generate the moving image data 142 of the word image file 140.

Note that, for example, when the utterance image file generation unit 112 has generated the utterance image file 130 by using speech recognition, the pronunciation start time and pronunciation end time of each word or syllable in the moving image data 132 of the utterance image file 130 may already have been detected at the time of that speech recognition. In this case, the word section extraction unit 114 can detect the word sections in the moving image data 132 based on the pronunciation start times and end times detected by the utterance image file generation unit 112.

FIG. 7 shows an example in which word image files 140 are generated from an utterance image file 130 describing the utterance content “ohayou, kyou wa” (“Good morning, today”). The word dividing unit 113 performs morphological analysis on the utterance content “ohayou, kyou wa” and divides it into the words (phrases) “ohayou” and “kyou wa”.

The word section extraction unit 114 cuts out, from the moving image data 132 of the utterance image file 130, the word section in which “ohayou” (“good morning”) is pronounced and the word section in which “kyou wa” (“today”) is pronounced. In the example of FIG. 7, the period from time T11 to time T12 is determined to be the word section in which “ohayou” is pronounced, and the frames displayed from time T11 to time T12 in the moving image data 132 of the utterance image file 130 are cut out as the moving image data 142 of a word image file 140. Likewise, the period from time T13 to time T14 is determined to be the word section in which “kyou wa” is pronounced, and the frames displayed from time T13 to time T14 in the moving image data 132 are cut out as the moving image data 142 of another word image file 140.
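The cutting of word sections can be sketched as follows. This is a minimal illustration only: it assumes that each frame carries a display time in milliseconds and that the start and end times of each word have already been obtained (for example, by speech recognition). The names `WordImageFile` and `cut_word_sections` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordImageFile:
    word_text: str  # plays the role of the word text 141
    frames: list    # plays the role of the moving image data 142


def cut_word_sections(frames, frame_times, sections):
    """Cut one clip per word section.

    frames      : list of frame objects from the utterance image file
    frame_times : display time of each frame (milliseconds, assumed)
    sections    : list of (word, start_time, end_time) tuples
    """
    files = []
    for word, start, end in sections:
        # Keep the frames displayed from the start time to the end time.
        clip = [f for f, t in zip(frames, frame_times) if start <= t <= end]
        files.append(WordImageFile(word, clip))
    return files
```

With two sections such as (T11, T12) and (T13, T14), the function returns two word image files, one per word.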

Next, FIG. 8 is a diagram illustrating a processing example of the mouth area detection unit.
The mouth area detection unit 115 detects, by image processing, a mouth area from each frame of the moving image data 142 included in the word image file 140. A template matching method, for example, can be used to detect the mouth area.

When the template matching method is used, the mouth area detection unit 115 compares each frame of the moving image data 142 with a template of the mouth area. A template is image information containing the shape pattern of the mouth area during utterance. It is, for example, an image in which the mouth area during utterance was captured, or an image in which only the characteristic shape pattern of the mouth area was extracted from such a captured image.

For example, as illustrated in FIG. 8, the mouth area detection unit 115 aligns the upper left corner of the template 232 with the upper left corner of the frame 231 to be examined. Starting from this position, it moves the template 232 to the right one pixel at a time while calculating the similarity between the template 232 and the area of the frame 231 that the template 232 overlaps. When the right edge of the template 232 reaches the right edge of the frame 231, the mouth area detection unit 115 moves the template 232 down by one pixel and performs the same processing. In this way, the entire area of the frame 231 is searched for the mouth area.

The mouth area detection unit 115 further changes the size of the template 232 (for example, makes it one step smaller) and searches the entire area of the frame 231 for the mouth area in the same manner as above. The size of the template 232 may be changed in three or more steps.

Among the similarities calculated in this way, the mouth area detection unit 115 determines the area 233 of the frame 231 corresponding to the position of the template 232 at which the highest similarity was calculated to be the mouth area. However, when the maximum calculated similarity is lower than a predetermined determination threshold, the mouth area detection unit 115 determines that no mouth area was detected in that frame.
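The search described with reference to FIG. 8 can be sketched in plain Python as follows. The description does not fix a similarity measure, so this sketch assumes one for illustration: a score in (0, 1] derived from the mean absolute difference of pixel values. Frames and templates are assumed to be two-dimensional lists of luminance values, and the differently sized templates are assumed to be prepared in advance and passed in as a list. The names `similarity` and `detect_mouth_area` are hypothetical.

```python
def similarity(frame, template, top, left):
    """Assumed similarity in (0, 1]; 1.0 means the region equals the template."""
    h, w = len(template), len(template[0])
    diff = sum(abs(frame[top + y][left + x] - template[y][x])
               for y in range(h) for x in range(w))
    return 1.0 / (1.0 + diff / (h * w))


def detect_mouth_area(frame, templates, threshold):
    """Return (top, left, height, width) of the best match, or None.

    templates is a list of 2-D lists, e.g. one template at several sizes.
    None is returned when the highest similarity is below the threshold.
    """
    best_score, best_area = 0.0, None
    for template in templates:
        h, w = len(template), len(template[0])
        for top in range(len(frame) - h + 1):            # move down one pixel at a time
            for left in range(len(frame[0]) - w + 1):    # move right one pixel at a time
                score = similarity(frame, template, top, left)
                if score > best_score:
                    best_score, best_area = score, (top, left, h, w)
    return best_area if best_score >= threshold else None
```

An exhaustive search like this is slow on real images; it is shown only to make the scan order and the threshold test concrete.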

Note that the template may be, for example, image data containing only luminance information. In that case, the luminance information of the frame is compared with the template.
When calculating the similarity between the template and the area of the frame that the template overlaps, the similarity between the feature quantities of the two may be calculated, for example. One example of a feature quantity is a feature vector. For example, if X is the feature vector of the template and Y is the feature vector of the area of the frame that the template overlaps, a distance D indicating the similarity can be obtained as D = |X − Y|.
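As a small illustration of D = |X − Y|, the following sketch computes the distance between two feature vectors. The description does not state which norm is meant; the Euclidean norm is assumed here, and the function name `feature_distance` is hypothetical. Because D is a distance, a smaller value corresponds to a closer match.

```python
import math


def feature_distance(x, y):
    """Euclidean distance between feature vectors x and y (norm is an assumption)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    return math.dist(x, y)
```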

Note that face detection can also be used to detect the mouth area. For example, a face is detected in the frame, and part of the area in which the face was detected is used as the search area for the template-based mouth area search described above. This method limits the search area for the template-based mouth area search and thus reduces the processing load. However, face detection can be used only when almost the entire face appears in the frame.
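The restriction of the search area can be sketched as follows. The description says only that part of the detected face area is used; taking the lower 40 percent of the face bounding box is an assumption made for this illustration, as are the function name `mouth_search_area` and the (top, left, height, width) box layout.

```python
def mouth_search_area(face_box, lower_fraction=0.4):
    """Return the lower part of a face bounding box as the mouth search area.

    face_box is (top, left, height, width); lower_fraction is an assumed
    tuning value, not a figure taken from the description.
    """
    top, left, height, width = face_box
    search_height = max(1, round(height * lower_fraction))
    return (top + height - search_height, left, search_height, width)
```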

FIG. 9 is a diagram illustrating a processing example of the determination unit.
The determination unit 116 counts, among the frames included in the moving image data 142 of the word image file 140, the number of frames in which the mouth area detection unit 115 did not detect a mouth area. Let N be the number of frames included in the moving image data 142, R be the number of those frames in which a mouth area was detected, and K be the determination threshold (where K is an integer of 0 or more). The determination unit 116 then determines that the moving image data 142 can be used for generating the dictionary information 200 when (N − R) ≤ K. In this case, the determination unit 116 generates, based on the corresponding word image file 140, a dictionary candidate file 150 to be used for generating the dictionary information 200.

FIG. 9(A) shows an example of determination based on the word image file 140 for the word section in which “ohayou” (“good morning”) is pronounced, and FIG. 9(B) shows an example of determination based on the word image file 140 for the word section in which “kyou wa” (“today”) is pronounced. To simplify the description, it is assumed that the moving image data 142 in FIG. 9(A) contains four frames and the moving image data 142 in FIG. 9(B) contains five frames. The rectangular area drawn with a thick line in each frame is the area in which the mouth was detected.

Here, as an example, the determination threshold K is set to “0”. In the example of FIG. 9(A), it is assumed that a mouth area is detected in all frames. In this case, the number of frames in which no mouth area was detected (“N − R” above) is equal to or less than the determination threshold “0”, so the determination unit 116 determines that the target moving image data 142 can be used for generating the dictionary information 200.

In this case, the determination unit 116 generates a dictionary candidate file 150 in which “ohayou” is described as the word text 151 and the moving image data 142 examined for the mouth area is stored as the moving image data 152, and saves the file in the HDD 103 or the like. In the mouth area coordinates 153 of the generated dictionary candidate file 150, information indicating the coordinates of the detected mouth area is registered for each frame of the moving image data 152.

The moving image registered in the dictionary candidate file 150 is, for example, a moving image in which only the mouth area appears in each frame. The mouth area coordinates 153 of the dictionary candidate file 150 are used, for example, to extract the mouth area from each frame when a moving image to be registered in the dictionary information 200 is generated from the dictionary candidate file 150. Alternatively, the determination unit 116 may directly generate the moving image to be registered in the dictionary information 200.

In the example of FIG. 9(B), on the other hand, a mouth area is detected in the first, second, fourth, and fifth frames but not in the third frame. In this example, the number of frames in which no mouth area was detected (“N − R” above) is greater than the determination threshold “0”, so the determination unit 116 determines that the target moving image data 142 cannot be used for generating the dictionary information 200. In this case, no dictionary candidate file 150 is generated.
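The determination (N − R) ≤ K and the two cases of FIG. 9 can be reproduced with a short sketch. It assumes that the detection results for one word image file are given as a list with one entry per frame, holding either the mouth area coordinates or `None` when no mouth area was detected. The function name `is_usable_for_dictionary` is hypothetical.

```python
def is_usable_for_dictionary(detections, k=0):
    """Return True when the number of undetected frames (N - R) is at most K.

    detections holds one entry per frame: mouth-area coordinates, or None
    when no mouth area was detected in that frame.
    """
    n = len(detections)                                # N: total frames
    r = sum(1 for d in detections if d is not None)    # R: frames with a mouth area
    return (n - r) <= k


box = (0, 0, 2, 2)                       # placeholder coordinates
fig9a = [box, box, box, box]             # four frames, all detected
fig9b = [box, box, None, box, box]       # five frames, third frame undetected
print(is_usable_for_dictionary(fig9a))   # True
print(is_usable_for_dictionary(fig9b))   # False
```

With K = 0 the first clip is accepted and the second is rejected, which matches the outcomes described for FIG. 9(A) and FIG. 9(B).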

As described above, it is desirable that the moving image registered in the dictionary information 200 show the mouth area of the speaker for as long as possible (most preferably for the entire period) between the start and end of the pronunciation of the corresponding word. This increases the accuracy of the lip reading process. For example, when the mouth area is filmed specifically for generating the dictionary information 200, the mouth area naturally appears reliably in the captured moving image. However, in the moving image data 142 of a word image file 140 generated from moving images collected by the moving image collection unit 111, the mouth area does not necessarily appear in every frame.

Through the processing of the determination unit 116 described above, moving images in which the mouth area was detected in most of the section in which a word is pronounced can be selected and saved for use in generating the dictionary information 200. This makes the generated dictionary information 200 more appropriate and improves the accuracy of the lip reading process using that dictionary information 200. It also improves the efficiency of the work of extracting moving images suitable for use as the moving images of such dictionary information 200.

Note that the determination threshold K is desirably “0”. On the other hand, when the determination threshold K is set to “1” or more, a correction may be made, for example, when the dictionary candidate file 150 is generated from a moving image that includes frames in which no mouth area was detected, so that those frames also contain a mouth area. One possible correction method is, for example, to replace a frame in which no mouth area was detected with the immediately preceding or following frame in which a mouth area was detected.
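The replacement-based correction can be sketched as follows. This sketch makes two assumptions that the description leaves open: the immediately preceding frame is tried first and the immediately following frame second, and a frame is left unchanged when neither neighbour has a detected mouth area. The function name `fill_undetected` is hypothetical.

```python
def fill_undetected(frames, detections):
    """Replace frames without a detected mouth area by an adjacent detected frame.

    frames and detections are parallel lists; detections[i] is None when no
    mouth area was detected in frames[i]. Returns corrected copies of both.
    """
    fixed_frames, fixed_detections = list(frames), list(detections)
    for i, det in enumerate(detections):
        if det is not None:
            continue
        # Try the frame immediately before, then the frame immediately after.
        for j in (i - 1, i + 1):
            if 0 <= j < len(detections) and detections[j] is not None:
                fixed_frames[i] = frames[j]
                fixed_detections[i] = detections[j]
                break
    return fixed_frames, fixed_detections
```

Applied to a five-frame clip whose third frame has no detected mouth area, the third frame is replaced by a copy of the second.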

Next, the processing of the work support apparatus 100 described above is explained with reference to a flowchart. FIG. 10 is a flowchart illustrating a processing example of the work support apparatus.
[Step S11] The utterance image file generation unit 112 extracts, from the moving images collected by the moving image collection unit 111, the sections of scenes in which a person speaks (utterance sections). The utterance image file generation unit 112 cuts out the moving image of each extracted utterance section and generates an utterance image file 130. In the utterance image file 130, utterance content text 131 indicating the utterance content and moving image data 132 are stored in association with each other.

[Step S12] The word dividing unit 113 divides the utterance content text 131 of the generated utterance image file 130 into words by using morphological analysis or the like.
[Step S13] The word section extraction unit 114 cuts out, from the moving image data 132 of the utterance image file 130, the section in which each word obtained in step S12 is pronounced (word section), and generates a word image file 140. In the word image file 140, word text 141 indicating the pronounced word and moving image data 142 are stored in association with each other.

[Step S14] The determination unit 116 determines whether the processing in steps S15 to S18 has been executed for all the word image files 140 generated in step S13. If all the word image files 140 have been processed, the processing ends. If there is an unprocessed word image file 140, one of the unprocessed word image files 140 is selected as the processing target and the processing in step S15 is executed.

[Step S15] The mouth area detection unit 115 determines whether the mouth area detection processing in step S16 has been executed for all the frames of the moving image data 142 in the word image file 140 being processed. If the detection processing has finished for all frames, the processing in step S17 is executed. If there is a frame for which the detection processing has not been performed, the next frame in the moving image data 142 is selected as the target of the detection processing and the processing in step S16 is executed.

[Step S16] The mouth area detection unit 115 detects, by image processing, a mouth area in the frame being searched and temporarily stores the detection result in the RAM 102 or the like. When a mouth area is detected, the detection result includes coordinate information of the detected mouth area. When no mouth area is detected, the detection result states, for example, that no mouth area was detected.

[Step S17] The determination unit 116 counts, among the frames of the moving image data 142 of the word image file 140 being processed, the number of frames in which no mouth area was detected by the processing in step S16. When that number is equal to or less than a predetermined determination threshold, the determination unit 116 executes the processing in step S18. When that number is greater than the determination threshold, the processing in step S18 is skipped and the processing in step S14 is executed.

[Step S18] The determination unit 116 generates, based on the word image file 140 being processed, a dictionary candidate file 150 to be used for generating the dictionary information 200, and saves it in the HDD 103.
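The loop formed by steps S14 to S18 can be summarized in a short sketch. It assumes that each word image file is represented as a (word text, frames) pair and that mouth area detection is supplied as a callable returning coordinates or `None`. The function name `build_dictionary_candidates` and the dictionary keys are hypothetical, and candidates are returned as a list instead of being written to the HDD 103.

```python
def build_dictionary_candidates(word_image_files, detect, k=0):
    """Sketch of steps S14 to S18 for a collection of word image files.

    word_image_files : iterable of (word_text, frames) pairs
    detect           : callable(frame) -> mouth-area coordinates or None
    k                : determination threshold
    """
    candidates = []
    for word_text, frames in word_image_files:             # S14: next unprocessed file
        detections = [detect(frame) for frame in frames]   # S15, S16: every frame
        undetected = sum(1 for d in detections if d is None)
        if undetected <= k:                                 # S17: compare with threshold
            candidates.append({                             # S18: build candidate
                "word_text": word_text,
                "frames": frames,
                "mouth_coords": detections,
            })
    return candidates
```

Any detector with the same call shape can be plugged in as `detect`, for instance a template-matching routine bound to a fixed template and threshold.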

Next, several modifications based on the second embodiment described above are described.
<Modification 1: Example of mouth area detection processing using a template for each character>
FIG. 8 illustrates mouth area detection processing that uses only one type of template. However, the shape of the mouth area differs depending on the character being uttered. By detecting the mouth area with a template appropriate for each character, the calculated similarity becomes higher when there is a match, and as a result the accuracy of the mouth area determination can be increased.

FIG. 11 is a diagram showing an outline of mouth area search processing using templates prepared for each character. The template database (DB) 160 illustrated in FIG. 11 is stored, for example, in the HDD 103 of the work support apparatus 100. For each character (kana), the template database 160 holds a template of the mouth area as it appears when that character is pronounced.

The moving image data 142 that the mouth area detection unit 115 examines is associated with word text 141 indicating the utterance content. The mouth area detection unit 115 can therefore easily identify, from among the character-specific templates, the templates to be used for the mouth area detection processing. In the example of FIG. 11, the word “ohayou” (“good morning”) is described in the word text 141. In this case, the mouth area detection unit 115 reads from the template database 160 the templates corresponding to the kana “o”, “ha”, “yo”, and “u” (templates #5, #26, #38, and #3 in FIG. 11) and uses these templates to detect the mouth area.

FIG. 12 is a diagram illustrating an example of the timing at which the template in use is switched. The mouth area detection unit 115 detects the mouth area by using the templates read from the template database 160 in the order in which the characters appear. However, it is necessary to decide at what timing the template in use is switched.

One example of a method for determining the switching timing is to divide the frames evenly by the number of characters and assign one template to each resulting group of frames. As a variation of this method, the same number of frames is basically assigned to each template, but the last character alone is assigned fewer frames than the other characters. This variation exploits the fact that the last character of a word (or phrase) may be pronounced for a shorter period than each of the preceding characters.

In the example of FIG. 12, among the frames of the moving image to be processed, the templates corresponding to “o”, “ha”, and “yo” are each used for four frames to detect the mouth area, whereas the template corresponding to the last character “u” is used only for the final frame. With this method, the mouth area can be detected by appropriately using templates that match the characters being uttered.
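The two allocation methods can be sketched as follows. The sketch assumes that the caller states how many frames the last character should receive; when that number is omitted, the frames are split evenly, with earlier characters receiving any leftover frames. These rules and the names `split_evenly` and `assign_templates` are illustrative assumptions.

```python
def split_evenly(n_frames, n_parts):
    """Split n_frames into n_parts counts that differ by at most one."""
    base, extra = divmod(n_frames, n_parts)
    return [base + (1 if i < extra else 0) for i in range(n_parts)]


def assign_templates(chars, n_frames, last_frames=None):
    """Return, for each frame, the character whose template is used.

    chars       : characters of the word in order, e.g. ["o", "ha", "yo", "u"]
    n_frames    : number of frames in the word section
    last_frames : frames given to the last character (assumed parameter);
                  None means an even split over all characters
    """
    if last_frames is None or len(chars) == 1:
        counts = split_evenly(n_frames, len(chars))
    else:
        counts = split_evenly(n_frames - last_frames, len(chars) - 1) + [last_frames]
    schedule = []
    for ch, count in zip(chars, counts):
        schedule.extend([ch] * count)
    return schedule
```

For a 13-frame clip, `assign_templates(["o", "ha", "yo", "u"], 13, last_frames=1)` gives four frames each to "o", "ha", and "yo" and one frame to "u", which is the allocation described for FIG. 12.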

Note that the ratio of the number of frames assigned to each character in a word to the number of frames for the whole word may be determined for each word.
FIG. 13 is a diagram illustrating an example of using a template obtained by interpolation.

During the transition from uttering one character to uttering a different, subsequent character, the shape of the mouth area changes from the shape for the preceding character to the shape for the next character. Consequently, whichever of the two characters' templates is used, the mouth area may not be detected accurately.

Therefore, during the period of transition from uttering one character to uttering a different, subsequent character, the mouth area detection unit 115 may generate a template by interpolation from the template for the preceding character and the template for the next character, and detect the mouth area using the generated template.

The example of FIG. 13 shows the templates used when "o" and "ha" are pronounced in succession. In FIG. 13, for the frame corresponding to the time (or predetermined period) between the period in which "o" is pronounced and the period in which "ha" is pronounced, the mouth area is detected using an interpolated template 163 generated by interpolation from the template 161 corresponding to "o" and the template 162 corresponding to "ha". This can improve the detection accuracy of the mouth area.

Instead of interpolating the templates themselves, the template feature values used in the similarity calculation may be interpolated. The following describes the case where a feature vector is used as the feature value.

FIG. 14 is a diagram illustrating an example of interpolation using feature vectors. In the example of FIG. 14, when detecting the mouth area in the two frames immediately before and immediately after the boundary at which the pronounced character changes, the mouth area detection unit 115 uses an interpolated feature vector obtained by interpolating the feature vectors of the templates corresponding to the characters before and after the change.

For example, as shown in FIG. 14, suppose that the frames with frame numbers f = 0, 1 fall within the period in which "o" is pronounced and the frames with frame numbers f = 2, 3 fall within the period in which "ha" is pronounced. In this case, when the mouth area is detected in the frames with frame numbers f = 1, 2, an interpolated feature vector is used that is obtained by interpolation from the feature vector X_a of the template corresponding to "o" and the feature vector X_b of the template corresponding to "ha".

Correction coefficients w_a and w_b are used to calculate the interpolated feature vector. The correction coefficient w_a is calculated as "-f/u + 1" and the correction coefficient w_b as "f/u". Here, the variable u is the value obtained by adding the number of frames immediately before and after, "2", to the number of frames in which the interpolated feature vector is used, and is "3" in the example of FIG. 14.

This calculation yields the correction coefficients w_a and w_b for each frame position. The interpolated feature vector to be used is then calculated as "w_a × X_a + w_b × X_b". For each of the frames with frame numbers f = 1, 2, the similarity is calculated using the interpolated feature vector obtained in this way and the feature vector of the frame, which can improve the detection accuracy of the mouth area.
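The weighting above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, feature vectors are represented as plain lists of floats, and u is passed in directly (u = 3 for the FIG. 14 example) rather than derived from the frame counts.

```python
def interpolation_weights(f, u):
    """Correction coefficients for frame number f, as given in the text:
    w_a = -f/u + 1 and w_b = f/u."""
    w_a = -f / u + 1
    w_b = f / u
    return w_a, w_b


def interpolate_feature(x_a, x_b, f, u):
    """Interpolated feature vector w_a * X_a + w_b * X_b.

    x_a, x_b -- template feature vectors for the preceding and next
                character (lists of floats; representation is an assumption)
    """
    if len(x_a) != len(x_b):
        raise ValueError("feature vectors must have the same length")
    w_a, w_b = interpolation_weights(f, u)
    return [w_a * a + w_b * b for a, b in zip(x_a, x_b)]


# FIG. 14 example: interpolation is used for f = 1 and f = 2, with u = 3.
x_o = [1.0, 0.0]   # hypothetical feature vector for "o"
x_ha = [0.0, 1.0]  # hypothetical feature vector for "ha"
vec_f1 = interpolate_feature(x_o, x_ha, f=1, u=3)
vec_f2 = interpolate_feature(x_o, x_ha, f=2, u=3)
```

With these formulas the weights always sum to 1, and f = 0 and f = u reproduce X_a and X_b unchanged, so the interpolated vectors move linearly from the "o" template toward the "ha" template.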

In Modification 1 described above, an individual template is prepared for each character (kana). However, the same template may be prepared for characters whose mouth shapes are similar when pronounced. Alternatively, templates may be prepared for each vowel rather than for each character (kana).

<Modification 2: Processing example (1) when a plurality of mouth areas are detected>
FIG. 15 is a diagram illustrating a processing example for the case where a plurality of mouth areas are detected. When, for example, a plurality of mouth areas are detected in the same frame, the mouth area detection unit 115 determines the detection area with the highest calculated similarity to be the mouth area. This can improve the detection accuracy of the mouth area.

This method is particularly suitable when, as shown in FIG. 11, the mouth area is detected using the template corresponding to the character being pronounced. This is because, within a frame, the more closely a region resembles the mouth shape for the corresponding character, the more likely it is that the character was pronounced in that region.

In the example of FIG. 15, two mouth areas whose similarity is equal to or greater than a predetermined value are detected in the same frame; the similarity calculated for one is "0.8" and for the other "0.9". Suppose, however, that the detection area with similarity "0.9" is smaller than the detection area with similarity "0.8".

In this case, the mouth area detection unit 115 determines the detection area with the higher similarity to be the mouth area. Thus, when a plurality of mouth areas are detected, the detection area with the maximum similarity is determined to be the mouth area even if it is smaller than at least one other detection area.
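The selection rule of Modification 2 amounts to taking the candidate with the highest similarity and ignoring its size. The sketch below illustrates this; the `Detection` record and its field names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """A candidate mouth area found by template matching (hypothetical record)."""
    x: int
    y: int
    width: int
    height: int
    similarity: float


def select_mouth_area(detections):
    """Return the candidate with the highest similarity, or None if there
    are no candidates. The size of the area is deliberately not considered."""
    if not detections:
        return None
    return max(detections, key=lambda d: d.similarity)


# FIG. 15 style example: the smaller area has the higher similarity and wins.
large = Detection(x=10, y=20, width=80, height=40, similarity=0.8)
small = Detection(x=200, y=30, width=40, height=20, similarity=0.9)
chosen = select_mouth_area([large, small])
```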

<Modification 3: Processing example (2) when a plurality of mouth areas are detected>
FIG. 16 is a diagram illustrating another processing example for the case where a plurality of mouth areas are detected. Unlike the processing example of FIG. 15, in the processing example of FIG. 16, if even one frame of the moving image data 142 of the word image file 140 contains a plurality of detected mouth areas, the determination unit 116 determines that the moving image data 142 is unsuitable for generating the dictionary information 200 and does not save a dictionary candidate file 150 based on that moving image data 142.

When the mouths of a plurality of persons appear in the same frame, it can be difficult to identify which person is pronouncing the target character. Therefore, the quality of the dictionary information 200 can be improved by excluding from its generation any moving image data 142 that includes a frame showing the mouths of a plurality of persons.

Alternatively, the determination unit 116 may determine that the moving image data 142 is unsuitable for generating the dictionary information 200 when the number of its frames in which a plurality of mouth areas are detected is equal to or greater than a predetermined determination threshold of 2 or more. In that case, when the number of frames in which a plurality of mouth areas are detected is smaller than the determination threshold, the detection area with the maximum similarity may be determined to be the mouth area in each such frame, as shown in FIG. 15.
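The two variants of Modification 3 differ only in the threshold applied to the number of frames containing more than one mouth area. A minimal sketch, assuming that the detector reports the number of mouth areas found in each frame; the function name and the `multi_threshold` parameter are hypothetical:

```python
def is_suitable_for_dictionary(mouth_counts_per_frame, multi_threshold=1):
    """Decide whether a moving image may be used to generate dictionary
    information.

    mouth_counts_per_frame -- number of mouth areas detected in each frame
    multi_threshold        -- the moving image is rejected when the number
                              of frames with two or more mouth areas reaches
                              this value (1 corresponds to the FIG. 16
                              behavior; 2 or more to the relaxed variant)
    """
    multi_frames = sum(1 for count in mouth_counts_per_frame if count >= 2)
    return multi_frames < multi_threshold


counts = [1, 1, 2, 1, 1]  # one frame shows two mouth areas
strict = is_suitable_for_dictionary(counts)                      # FIG. 16 rule
relaxed = is_suitable_for_dictionary(counts, multi_threshold=2)  # relaxed rule
```

Under the relaxed rule the moving image is kept, and the frame with two candidates would then be resolved by choosing the candidate with the highest similarity.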

<Modification 4: Example of detection processing according to the size of the mouth area>
FIG. 17 is a diagram illustrating an example of detection processing according to the size of the mouth area. Even when a mouth area is detected in a frame, the image may be unsuitable for generating the dictionary information 200 if the detected mouth area is too small. This is because, when the detected mouth area is too small, changes in the shape of the mouth between frames become difficult to discern.

Therefore, the mouth area detection unit 115 determines that no mouth area was detected when the ratio of the area of the detected mouth area to the area of the frame is equal to or less than a predetermined determination threshold. This can improve the quality of the generated dictionary information 200. The area ratio is calculated as "area of mouth area (number of pixels) / area of frame (number of pixels)".

In the example of FIG. 17, suppose that a mouth area is detected in each of the first to fourth frames, and that the area ratios of the mouth areas detected in these frames are, in order from the first, "0.05", "0.04", "0.03", and "0.01". If the determination threshold is "0.02", the mouth area detection unit 115 determines that a mouth area was detected in each of the first to third frames but that no mouth area was detected in the fourth frame. In this case, the determination unit 116 counts the number of frames of the moving image in which no mouth area was detected as "1".
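The FIG. 17 example can be reproduced with a short sketch. The function names are hypothetical, and using `None` for frames in which the detector found nothing at all is an assumption added for the example.

```python
def area_ratio(mouth_pixels, frame_pixels):
    """Area of the mouth area divided by the area of the frame (in pixels)."""
    return mouth_pixels / frame_pixels


def count_undetected_frames(area_ratios, threshold):
    """Count frames treated as having no mouth area.

    area_ratios -- area ratio per frame, or None where nothing was detected
                   (the None convention is an assumption for this sketch)
    threshold   -- a detection whose ratio is at or below this value is
                   treated as no detection
    """
    return sum(1 for ratio in area_ratios if ratio is None or ratio <= threshold)


# FIG. 17 example: only the fourth frame falls at or below the threshold.
ratios = [0.05, 0.04, 0.03, 0.01]
undetected = count_undetected_frames(ratios, threshold=0.02)
```

The resulting count is the quantity that is compared with the predetermined number when judging whether the moving image is used to generate dictionary information.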

Note that the mouth area detection unit 115 may determine whether a mouth area has been detected based on the area (number of pixels) of the detected mouth area itself, instead of the area ratio.
The processing functions of the devices described in the above embodiments (the image discrimination device 1 and the work support device 100) can be realized by a computer. In that case, a program describing the processing contents of the functions that each device should have is provided, and the processing functions are realized on the computer by executing the program on the computer. The program describing the processing contents can be recorded on a computer-readable recording medium. Examples of computer-readable recording media include magnetic storage devices, optical discs, magneto-optical recording media, and semiconductor memories. Magnetic storage devices include hard disk drives (HDD), flexible disks (FD), and magnetic tapes. Optical discs include DVD (Digital Versatile Disc), DVD-RAM, CD-ROM (Compact Disc-Read Only Memory), and CD-R (Recordable)/RW (Rewritable). Magneto-optical recording media include MO (Magneto-Optical disk).

When the program is distributed, for example, portable recording media such as DVDs and CD-ROMs on which the program is recorded are sold. The program can also be stored in a storage device of a server computer and transferred from the server computer to another computer via a network.

The computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. In addition, each time a program is transferred from a server computer connected via a network, the computer can execute processing according to the received program.

DESCRIPTION OF SYMBOLS
1 Image discrimination device
2 Detection unit
3 Discrimination unit
4 Input moving image

Claims

1. An image discrimination device comprising:
a detection unit that detects a mouth area from each frame of an input moving image showing a scene during a period in which a character string is pronounced; and
a discrimination unit that, when the number of frames of the input moving image in which the mouth area is not detected is equal to or less than a predetermined number, discriminates the input moving image as a moving image for generating dictionary information indicating the shape of the mouth when the character string is pronounced.

2. The image discrimination device according to claim 1, wherein the discrimination unit:
refers to a moving image including a scene in which an utterance is made and to character information that is associated with the moving image and indicates the utterance content, and divides the character information into character strings in units of words or phrases;
extracts from the moving image, for each divided character string, the pronunciation period in which that character string is pronounced; and
selects, from among the moving images corresponding to the respective pronunciation periods, a moving image corresponding to a pronunciation period for which the number of frames in which the mouth area is not detected by the detection unit is equal to or less than the predetermined number, as a moving image for generating the dictionary information.

3. The image discrimination device according to claim 2, wherein the process of extracting the pronunciation period for each of the divided character strings is performed by speech recognition processing based on audio synchronized with the moving image corresponding to the pronunciation period.

4. The image discrimination device according to claim 1, wherein the detection unit determines, for each frame of the moving image, which of the characters constituting the character string is being pronounced, selects for each frame the template corresponding to the pronounced character from among a plurality of templates containing mouth shape patterns for when each of a plurality of characters is pronounced, and detects the mouth area by template matching using the selected template.

5. The image discrimination device according to claim 1, wherein the detection unit detects the mouth area by template matching and, when a plurality of mouth areas are detected in one frame, determines the detected area with the maximum similarity to the template to be the mouth area.

6. The image discrimination device according to claim 1, wherein the discrimination unit discriminates that the input moving image is not a moving image for generating the dictionary information when, among the frames of the input moving image, the number of frames in which a plurality of mouth areas are detected is equal to or greater than a predetermined threshold.

7. The image discrimination device according to claim 1, wherein, when a mouth area is detected in a frame, the detection unit determines that no mouth area was detected in that frame if the ratio of the size of the detected mouth area to the size of the frame is equal to or less than a predetermined ratio.

8. An image discrimination method in which a computer:
detects a mouth area from each frame of an input moving image showing a scene during a period in which a character string is pronounced; and
when the number of frames of the input moving image in which the mouth area is not detected is equal to or less than a predetermined number, discriminates the input moving image as a moving image for generating dictionary information indicating the shape of the mouth when the character string is pronounced.

9. An image discrimination program that causes a computer to execute a process comprising:
detecting a mouth area from each frame of an input moving image showing a scene during a period in which a character string is pronounced; and
when the number of frames of the input moving image in which the mouth area is not detected is equal to or less than a predetermined number, discriminating the input moving image as a moving image for generating dictionary information indicating the shape of the mouth when the character string is pronounced.