JP7026812B2

JP7026812B2 - Information processing equipment and information processing method

Info

Publication number: JP7026812B2
Application number: JP2020547621A
Authority: JP
Inventors: 拓也村上; 雄大中村; 正博内藤
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2018-09-25
Filing date: 2018-09-25
Publication date: 2022-02-28
Anticipated expiration: 2038-09-25
Also published as: JPWO2020065706A1; WO2020065706A1; DE112018008012T5

Description

The present invention relates to an information processing apparatus and an information processing method.

Many recent car navigation systems have a voice recognition function for voice operation. The voice recognition function is used for tasks such as setting a destination; it recognizes the content of an utterance and converts it into a command to carry out the operation.

However, inside an automobile, voice recognition can become difficult because there are many noise sources, such as noise from outside the vehicle, or because several occupants speak at the same time. The noise component therefore needs to be reduced. The technique described in Patent Document 1 is known as a conventional utterance recognition technique.

The technique described in Patent Document 1 calculates the degree of mouth opening for the purpose of facial expression recognition. In Patent Document 1, Dr is the right-eye-to-mouth distance from the right end of the right eye to the right end of the mouth, Dl is the left-eye-to-mouth distance from the left end of the left eye to the left end of the mouth, and We is the distance from the right end of the right eye to the left end of the left eye. The feature amount V3 indicating the mouth-open state is calculated by the following equation (1).
V3 = (Dr + Dl) ÷ We   (1)
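As a rough illustration of equation (1), the following Python sketch computes V3 from four landmark coordinates. The function name, the argument names, and the use of (x, y) pixel tuples are assumptions made for this example; only the formula itself comes from the description above.

```python
import math


def feature_v3(right_eye_outer, left_eye_outer, mouth_right, mouth_left):
    """Equation (1): V3 = (Dr + Dl) / We, with every point given as (x, y)."""
    dr = math.dist(right_eye_outer, mouth_right)     # Dr: right eye end to right mouth end
    dl = math.dist(left_eye_outer, mouth_left)       # Dl: left eye end to left mouth end
    we = math.dist(right_eye_outer, left_eye_outer)  # We: right eye end to left eye end
    if we == 0:
        # Hypothetical guard: the text does not discuss degenerate landmarks.
        raise ValueError("eye end points coincide")
    return (dr + dl) / we
```

With eye end points at (0, 0) and (8, 0) and mouth end points at (3, 4) and (5, 4), Dr and Dl are both 5 and We is 8, so the sketch yields 1.25.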

Patent Document 1: Japanese Unexamined Patent Publication No. 2005-293539

With the conventional technique, the feature amount V3 changes when the orientation of the face changes, for example when the driver turns their head, so V3 is not sufficient as an indicator of the mouth-open state.

Therefore, an object of one or more aspects of the present invention is to make it possible to calculate an opening degree that is robust against turning of the face.

An information processing apparatus according to one aspect of the present invention includes: a face detection unit that detects, from an image indicated by image data, a face image that is a portion corresponding to a person's face; an opening width acquisition unit that acquires, from the face image, the opening width of the person's lips in the vertical direction; a nose length acquisition unit that acquires, from the face image, a nose length that is the distance between the midpoint of the person's left and right eyes and the lower end point of the nose; and an opening degree calculation unit that calculates an opening degree indicating the length of the opening width relative to the nose length.

An information processing method according to one aspect of the present invention includes: detecting, from an image indicated by image data, a face image that is a portion corresponding to a person's face; acquiring, from the face image, the opening width of the person's lips in the vertical direction; acquiring, from the face image, a nose length that is the distance between the midpoint of the person's left and right eyes and the lower end point of the nose; and calculating an opening degree indicating the length of the opening width relative to the nose length.

According to one or more aspects of the present invention, an opening degree that is robust against turning of the face can be calculated.

FIG. 1 is a block diagram schematically showing the configuration of the speaker determination device according to the first embodiment.
FIG. 2 is a schematic diagram showing an arrangement example of the speaker determination device, the image pickup device, and the microphone device in the first embodiment.
FIG. 3 is a schematic diagram showing an example of calculating the opening width and the nose length from a face image.
FIG. 4 is a block diagram showing a hardware configuration example of the speaker determination device.
FIG. 5 is a flowchart showing the speaker determination method according to the first embodiment.
FIG. 6 is a block diagram schematically showing the configuration of the speaker determination device according to the second embodiment.
FIG. 7 is a schematic diagram showing an arrangement example of the speaker determination device, the image pickup device, and the microphone device in the second embodiment.
FIG. 8 is a flowchart showing the speaker determination method according to the second embodiment.

Embodiment 1.
FIG. 1 is a block diagram schematically showing the configuration of a speaker determination device 100, which is an information processing apparatus according to the first embodiment.
The speaker determination device 100 includes a face detection unit 110, an opening degree acquisition unit 120, and a speaker determination unit 130.
The speaker determination device 100 determines the speaker from the image indicated by the image data input from the image pickup device 160, and gives the determination result to the microphone device 170.

The image pickup device 160 includes an image sensor such as a CCD (Charge Coupled Device) image sensor and outputs image data of the captured image. For example, the image pickup device 160 can be realized by a camera.
The microphone device 170 is a device including a microphone that converts sound into an electrical signal. In the first embodiment, the microphone device 170 can turn the microphone power on or off, and does so according to the determination result from the speaker determination device 100. In other words, the microphone device 170 can turn the function of converting sound into an electrical signal on or off. The microphone device 170 may also include, for example, a unidirectional microphone and be able to change the sound collecting direction according to the determination result from the speaker determination device 100.

FIG. 2 is a schematic diagram showing an arrangement example of the speaker determination device 100, the image pickup device 160, and the microphone device 170.
As shown in FIG. 2, the speaker determination device 100, the image pickup device 160, and the microphone device 170 are arranged inside a vehicle 180.

The image pickup device 160 captures the face of an occupant, who is the target person. The image pickup device 160 is connected to the speaker determination device 100 and sends the image data of the captured image to it. The connection may be made through wiring such as a wire harness, or wirelessly.

The image pickup device 160 may be any type of device capable of detecting the shape of the mouth, such as an RGB (Red Green Blue) camera, an IR (Infrared) camera, a stereo camera, or a TOF (Time of Flight) camera. An "RGB camera" is a camera that transmits signals for the three colors red, green, and blue over three separate cables or the like, and generally uses three independent CCD sensors. An "IR camera" is an infrared camera, that is, a camera sensitive to wavelengths in the infrared region.

The microphone device 170 is a device that collects the speaker's voice. It collects the voice of the target person whom the speaker determination device 100 has identified as being in the speaking state. To provide directivity, the microphone device 170 is configured as a directional microphone or a microphone array.

Returning to FIG. 1, the face detection unit 110 detects a face image, which is a portion corresponding to a person's face, from the image indicated by the image data given by the image pickup device 160. The face detection unit 110 gives position information indicating the position of the detected face image, together with the image data, to the opening degree acquisition unit 120 and the speaker determination unit 130. When a plurality of face images are detected in the image, the position information indicates the positions of all of them.

The opening degree acquisition unit 120 acquires an opening degree indicating how far the mouth is open in the face image detected from the image indicated by the image data. When a plurality of face images are detected, the opening degree acquisition unit 120 acquires an opening degree for each of them.
The opening degree acquisition unit 120 includes an opening width acquisition unit 121 and a nose length acquisition unit 124.

The opening width acquisition unit 121 acquires the opening width of the lips in the vertical direction from the face image detected by the face detection unit 110. For example, the opening width acquisition unit 121 can detect, from the face image, the lower end point on the inner side of the upper lip and the upper end point on the inner side of the lower lip, and take the distance between them as the opening width.
The opening width acquisition unit 121 includes a lip detection unit 122 and an opening width calculation unit 123.

From the face image, which is the part of the image indicated by the image data that corresponds to the face detected by the face detection unit 110, the lip detection unit 122 detects an upper end point, which is the end point on the inner side of the upper lip, and a lower end point, which is the end point on the inner side of the lower lip.

As shown in FIG. 3, in the face image Fim, the upper end point P1 of the lips is the end point on the inner side of the upper lip on the center line of the face, and the lower end point P2 of the lips is the end point on the inner side of the lower lip on the center line of the face.
For example, the lip detection unit 122 can detect a lip image, which is the image portion of the lips, from the face image Fim by pattern matching, and then detect the edges of the lip image to obtain a line L1 indicating the inner side of the upper lip and a line L2 indicating the inner side of the lower lip. The upper end point P1 and the lower end point P2 of the lips can then be detected as the intersections of the vertical center line of the face image Fim with the detected lines L1 and L2, respectively.
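The intersection step described above can be sketched as follows, assuming each lip line (L1 or L2) is available as an ordered list of (x, y) pixel coordinates and that the center line of the face image is the vertical line x = x_c. The function name and the polyline representation are hypothetical; the text does not specify how the detected edge is stored.

```python
def intersect_with_vertical(polyline, x_c):
    """Return the point where a polyline crosses the line x = x_c, or None.

    polyline: ordered list of (x, y) points along an inner-lip edge.
    The crossing is found by linear interpolation on the first segment
    whose x-range contains x_c.
    """
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        if min(x0, x1) <= x_c <= max(x0, x1):
            if x0 == x1:
                return (x_c, y0)  # vertical segment: use its first vertex
            t = (x_c - x0) / (x1 - x0)
            return (x_c, y0 + t * (y1 - y0))
    return None
```

Calling this once with the points of L1 and once with the points of L2 would give candidate coordinates for P1 and P2 under these assumptions.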

The lip detection unit 122 gives lip information indicating the positions of the detected upper and lower end points of the lips to the opening width calculation unit 123. When a plurality of face images are detected, the lip information indicates the positions of the upper and lower end points of the lips detected from each of them.

The opening width calculation unit 123 calculates, from the positions of the upper and lower end points of the lips indicated by the lip information, the distance between those two points as the opening width.
For example, as shown in FIG. 3, the opening width calculation unit 123 calculates the Euclidean distance between the upper end point P1 and the lower end point P2 of the lips as the opening width D1.
The opening width calculation unit 123 then gives opening width information indicating the calculated opening width to the opening degree calculation unit 128. When a plurality of face images are detected, the opening width information indicates the opening width calculated from each of them.
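A minimal sketch of this calculation for one or more faces is shown below. It assumes the lip information is a list of (P1, P2) coordinate pairs, one pair per detected face; that layout and the function name are illustrative choices, not something the description prescribes.

```python
import math


def opening_widths(lip_info):
    """Return the opening width D1 for each face.

    lip_info: list of (P1, P2) pairs, where P1 and P2 are the (x, y)
    coordinates of the upper and lower inner-lip end points.
    """
    return [math.dist(p1, p2) for p1, p2 in lip_info]
```

For example, a face with P1 = (100, 200) and P2 = (103, 204) has an opening width of 5 pixels, and a closed mouth whose two end points coincide has an opening width of 0.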

The nose length acquisition unit 124 acquires the nose length from the face image detected by the face detection unit 110. For example, the nose length acquisition unit 124 detects, from the face image, a right end point, which is the end point at the outer or inner corner of the person's right eye, and a left end point, which is the end point at the outer or inner corner of the person's left eye, and identifies the midpoint between the right end point and the left end point. The nose length acquisition unit 124 also detects the lower end point of the nose from the face image. It then takes the distance between the identified midpoint and the detected lower end point as the nose length.
The nose length acquisition unit 124 includes an eye detection unit 125, a nose detection unit 126, a nose length calculation unit 127, and an opening degree calculation unit 128.

The eye detection unit 125 detects the end points at the outer corners of the left and right eyes from the face image and, from those end points, detects the midpoint of the left and right eyes.
As shown in FIG. 3, the midpoint P5 of the left and right eyes is the point midway between the end point P3 of the left eye and the end point P4 of the right eye.

For example, the eye detection unit 125 can detect a left eye image, which is the image portion of the left eye, and a right eye image, which is the image portion of the right eye, from the face image Fim by pattern matching, and then detect edges in each of them to obtain the outline of the left eye and the outline of the right eye. By taking the leftmost point of the left eye outline as the end point P3 of the left eye and the rightmost point of the right eye outline as the end point P4 of the right eye, the midpoint P5 between them can be detected.
The eye detection unit 125 gives eye information indicating the position of the detected midpoint to the nose length calculation unit 127. When a plurality of face images are detected, the eye information indicates the position of the midpoint detected from each of them.
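The selection of P3 and P4 and the midpoint P5 can be sketched as below. The sketch assumes each eye outline is a list of (x, y) pixel coordinates with x increasing to the right of the image; the function name and return layout are hypothetical.

```python
def eye_midpoint(left_eye_outline, right_eye_outline):
    """Return (P3, P4, P5) from the two eye outlines.

    P3: leftmost point of the left eye outline.
    P4: rightmost point of the right eye outline.
    P5: midpoint between P3 and P4.
    """
    p3 = min(left_eye_outline, key=lambda p: p[0])
    p4 = max(right_eye_outline, key=lambda p: p[0])
    p5 = ((p3[0] + p4[0]) / 2, (p3[1] + p4[1]) / 2)
    return p3, p4, p5
```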

The nose detection unit 126 detects, from the face image, the nose lower end point, which lies between the left and right nostrils.
As shown in FIG. 3, the nose lower end point P6 is the lowest point of the nose included in the face image Fim.

For example, the nose detection unit 126 can detect a nose image, which is the image portion of the nose, from the face image Fim by pattern matching, and take the lowest point in the nose image as the nose lower end point P6.
The nose detection unit 126 gives nose information indicating the position of the detected nose lower end point to the nose length calculation unit 127. When a plurality of face images are detected, the nose information indicates the position of the nose lower end point detected from each of them.

The nose length calculation unit 127 calculates the distance between the midpoint of the left and right eyes, indicated by the eye information, and the nose lower end point, indicated by the nose information, as the nose length.
For example, as shown in FIG. 3, the nose length calculation unit 127 calculates the Euclidean distance between the midpoint P5 of the left and right eyes and the nose lower end point P6 as the nose length D2.
The nose length calculation unit 127 then gives nose length information indicating the calculated nose length to the opening degree calculation unit 128. When a plurality of face images are detected, the nose length information indicates the nose length calculated from each of them.

The opening degree calculation unit 128 calculates the opening degree, which is the length of the opening width indicated by the opening width information relative to the nose length indicated by the nose length information. For example, the opening degree calculation unit 128 calculates the opening degree by dividing the opening width by the nose length.
The opening degree calculation unit 128 gives opening degree information indicating the calculated opening degree to the speaker determination unit 130. When a plurality of face images are detected, the opening degree information indicates the opening degree calculated from each of them.
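Putting the two distances together, the division described above might look like the following sketch. The point names follow FIG. 3 (P1, P2, P5, P6); the function name and the handling of a zero nose length are assumptions added for the example.

```python
import math


def opening_degree(p1, p2, p5, p6):
    """Opening degree = opening width D1 / nose length D2.

    p1, p2: upper and lower inner-lip end points.
    p5, p6: midpoint of the eyes and nose lower end point.
    """
    d1 = math.dist(p1, p2)  # opening width D1
    d2 = math.dist(p5, p6)  # nose length D2
    if d2 == 0:
        # Hypothetical guard: the description does not cover this case.
        return None
    return d1 / d2
```

With an opening width of 10 pixels and a nose length of 40 pixels, the opening degree is 0.25. Because it is a ratio of two lengths measured in the same image, it does not depend on the scale at which the face appears.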

The speaker determination unit 130 determines whether or not the target person is speaking from the position at which the face image was detected, indicated by the position information given by the face detection unit 110, and the opening degree calculated for each face image, indicated by the opening degree information given by the opening degree calculation unit 128. When a plurality of face images are detected, the speaker determination unit 130 uses the opening degrees calculated from those face images to determine whether or not each of the target persons is speaking.

For example, the speaker determination unit 130 determines that there is no speaker when none of the opening degrees indicated by the opening degree information exceeds a predetermined threshold value.
When any of the opening degrees indicated by the opening degree information exceeds the predetermined threshold value, the speaker determination unit 130 determines that the target person at the position where the corresponding face image was detected is the speaker.
When the speaker determination unit 130 determines that there is no speaker, it uses that fact as the determination result; when it identifies a speaker, it uses the position at which the corresponding face image was detected as the determination result. It then sends determination result information indicating the determination result to the microphone device 170.
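The threshold rule above can be sketched as a small function. The function name, the parallel-list layout, and the threshold value used in the example are assumptions; the description only states that the threshold is predetermined.

```python
def determine_speakers(face_positions, opening_degrees, threshold):
    """Return the positions of faces whose opening degree exceeds the threshold.

    An empty list corresponds to the determination "there is no speaker".
    A degree equal to the threshold does not count as speaking.
    """
    return [
        position
        for position, degree in zip(face_positions, opening_degrees)
        if degree > threshold
    ]
```

For instance, with positions ["driver", "passenger"], opening degrees [0.05, 0.30], and a hypothetical threshold of 0.2, only "passenger" is reported as the speaker.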

The microphone device 170 controls the microphone according to the determination result information from the speaker determination unit 130. For example, when the determination result indicates that there is no speaker, the microphone device 170 turns the microphone power off. When the determination result indicates the position of a speaker, the microphone device 170 controls the position of the microphone or the gains of the microphone array so that the sound collecting direction points toward that position.
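The control behavior described here can be modeled in a few lines. The `MicState` class, its fields, and the choice to steer toward the first reported speaker are all hypothetical simplifications; a real microphone device would drive hardware or beamforming gains instead of returning a record.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MicState:
    powered: bool                      # whether the microphone is on
    target: Optional[Tuple[int, int]]  # position to steer toward, if any


def control_microphone(speaker_positions):
    """Map a determination result to a microphone state.

    speaker_positions: list of positions judged to be speaking; an empty
    list means no speaker was found.
    """
    if not speaker_positions:
        return MicState(powered=False, target=None)
    # Assumption for this sketch: steer toward the first reported speaker.
    return MicState(powered=True, target=speaker_positions[0])
```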

FIG. 4 is a block diagram showing the hardware configuration of the speaker determination device 100 according to the first embodiment.
The speaker determination device 100 includes a CPU (Central Processing Unit) 11, a program memory 12, a data memory 13, an interface (I/F) 14, and a bus 15 connecting them.
The I/F 14 is a connection interface for connecting the image pickup device 160 and the microphone device 170 and communicating with them. Although not shown in FIG. 1, the I/F 14 therefore functions as an input unit that receives image data from the image pickup device 160. Also not shown in FIG. 1, the I/F 14 functions as an output unit that outputs determination result information or commands to the microphone device 170.

The CPU 11 operates according to the program stored in the program memory 12 and stores various data in the data memory 13 in the course of operation.
For example, the face detection unit 110, the opening degree acquisition unit 120, and the speaker determination unit 130 can be realized by the CPU 11 executing a program.

Such a program may be provided over a network or recorded on a recording medium. That is, such a program may be provided, for example, as a program product.

The hardware configuration of the speaker determination device 100 is not limited to the configuration shown in FIG. 4. For example, some or all of the face detection unit 110, the opening degree acquisition unit 120, and the speaker determination unit 130 may be implemented by a processing circuit such as a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array).

FIG. 5 is a flowchart showing a speaker determination method, which is an information processing method according to the first embodiment.
First, the face detection unit 110 acquires, from the image pickup device 160, image data indicating an image captured with the occupants of the car as target persons (S10). Here the imaging range covers the occupants of the car, but it may be any place where a target person may appear, such as inside an elevator, inside a train car, on the street, or inside a store.

On acquiring the image data, the face detection unit 110 detects a face image, which is the portion corresponding to the target person's face, from the image indicated by the image data (S11). The face detection method is not particularly limited. For example, an AdaBoost-based detector using common Haar-like features may be used. A Haar-like feature captures characteristics from local differences in image brightness and is less susceptible to changes in lighting conditions and to noise than using the pixel values directly as features. AdaBoost, short for Adaptive Boosting, is a machine learning algorithm that improves performance by combining multiple weak classifiers.
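To make the idea of a brightness-difference feature concrete, the sketch below computes one two-rectangle Haar-like feature using an integral image (summed-area table). It illustrates the feature only; it is not a face detector and does not include AdaBoost training. All function names are chosen for this example, and the image is assumed to be a list of rows of grey levels.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) summed-area table for a list-of-rows image."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_vertical(ii, x, y, w, h):
    """Upper-half sum minus lower-half sum of a w x h window (h even)."""
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, half)
```

Because the feature is a difference of two equally sized regions, adding the same brightness offset to every pixel leaves its value unchanged, which is one way to see why such features tolerate uniform lighting changes.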

The lip detection unit 122 detects, from the face image corresponding to the position of the face detected in step S11, the position (coordinates) of the upper end point, which is the end point on the inner side of the upper lip (S12).
The lip detection unit 122 also detects, from the same face image, the position (coordinates) of the lower end point, which is the end point on the inner side of the lower lip (S13).
The coordinates of the lip end points in steps S12 and S13 may be obtained using the RGB color space or the HSV (Hue, Saturation, Value) color space, as shown for example in Tsutomu Kuroda and Tomio Watanabe, "Lip extraction method of facial image based on HSV expression method", Transactions of the Japan Society of Mechanical Engineers, Vol. 61, No. 592, December 1995, No. 95-021. Alternatively, a model-based algorithm that detects feature points of a face image, such as AAM (Active Appearance Model), may be used. Deep Learning may also be used. Deep Learning is a machine learning technique based on multilayer neural networks with four or more layers, and it has shown high performance in recent years thanks to faster computers for training and progress in research on training methods.

Then, the opening width calculation unit 123 calculates the Euclidean distance between the upper end point detected in step S12 and the lower end point detected in step S13 as the opening width (S14).
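For the color-space approach to lip detection mentioned above, a very rough sketch is to convert each pixel to HSV and keep pixels whose hue is close to red and whose saturation is high enough. This is not the method of the cited paper; the function name and all threshold values are hypothetical and would need tuning on real images.

```python
import colorsys


def lip_mask(rgb_rows, hue_low=0.05, hue_high=0.9, sat_min=0.3):
    """Return a boolean mask marking reddish, saturated pixels.

    rgb_rows: list of rows of (r, g, b) tuples with components in 0..255.
    Hue wraps around at red, so "near red" means h <= hue_low or h >= hue_high.
    """
    mask = []
    for row in rgb_rows:
        out = []
        for r, g, b in row:
            h, s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            out.append((h <= hue_low or h >= hue_high) and s >= sat_min)
        mask.append(out)
    return mask
```

The edges of the resulting mask could then serve as candidates for the inner lip lines from which the end points are taken.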

The eye detection unit 125 detects, from the face image corresponding to the position of the face detected in step S11, the positions (coordinates) of the end points at the outer corners of the left and right eyes, and detects the position (coordinates) of the midpoint between them (S15).
The nose detection unit 126 detects the nose lower end point from the face image corresponding to the position of the face detected in step S11 (S16).

Then, the nose length calculation unit 127 calculates the Euclidean distance between the midpoint of the left and right outer eye corner end points obtained in step S15 and the nose lower end point obtained in step S16 as the nose length (S17).

Next, the opening degree calculation unit 128 calculates the opening degree by dividing the opening width calculated in step S14 by the nose length calculated in step S17 (S18). Because the opening width and the nose length lie along the same axis, the opening degree calculated from them is not easily affected by changes in face orientation.
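The following sketch illustrates this robustness claim under strongly idealized assumptions that are not part of the description: the landmarks lie on a flat plane, the camera is orthographic, and the landmark coordinates are invented for the example. It compares the opening degree with the feature V3 of equation (1) when the head turns.

```python
import math


def project(p, yaw, pitch):
    """Rotate p = (x, y, z) by yaw (about y), then pitch (about x); drop z."""
    x, y, z = p
    x, z = x * math.cos(yaw) + z * math.sin(yaw), -x * math.sin(yaw) + z * math.cos(yaw)
    y, z = y * math.cos(pitch) - z * math.sin(pitch), y * math.sin(pitch) + z * math.cos(pitch)
    return (x, y)


# Hypothetical flat-face landmarks (z = 0), arbitrary units.
P1, P2 = (0, -30, 0), (0, -40, 0)        # inner upper / lower lip, on the center line
P5, P6 = (0, 30, 0), (0, -10, 0)         # eye midpoint, nose lower end point
EYE_R, EYE_L = (-45, 30, 0), (45, 30, 0)
MOUTH_R, MOUTH_L = (-25, -35, 0), (25, -35, 0)


def opening_degree(yaw, pitch):
    d1 = math.dist(project(P1, yaw, pitch), project(P2, yaw, pitch))
    d2 = math.dist(project(P5, yaw, pitch), project(P6, yaw, pitch))
    return d1 / d2


def feature_v3(yaw, pitch):
    dr = math.dist(project(EYE_R, yaw, pitch), project(MOUTH_R, yaw, pitch))
    dl = math.dist(project(EYE_L, yaw, pitch), project(MOUTH_L, yaw, pitch))
    we = math.dist(project(EYE_R, yaw, pitch), project(EYE_L, yaw, pitch))
    return (dr + dl) / we
```

In this model the opening degree stays at 0.25 for any yaw or pitch, because both lengths lie on the center line and shrink by the same factor, whereas V3 changes as soon as the yaw shortens the horizontal eye distance We. On a real face the landmarks have different depths and the camera is perspective, so the invariance is only approximate.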

Next, the speaker determination unit 130 determines whether the opening degree calculated in step S18 exceeds a threshold value (S19). The microphone device 170 controls the microphone according to the determination result of step S19. For example, when no speaker is detected, turning the microphone off limits the influence of ambient noise and of utterances by anyone other than the specific speaker.
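A minimal sketch of the threshold determination and the resulting microphone control is given below. The threshold value of 0.2, the `Microphone` class, and its `apply` method are hypothetical stand-ins; the description does not specify a threshold or a control interface.

```python
def is_speaking(opening_degree, threshold=0.2):
    """True when the opening degree exceeds the (assumed) threshold."""
    return opening_degree > threshold


class Microphone:
    """Hypothetical stand-in for the microphone device."""

    def __init__(self):
        self.enabled = False

    def apply(self, speaking):
        # Switch the microphone on only while a speaker is detected.
        self.enabled = bool(speaking)


mic = Microphone()
mic.apply(is_speaking(0.6))   # mouth clearly open: microphone on
mic.apply(is_speaking(0.05))  # mouth nearly closed: microphone off
```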

As described above, according to the first embodiment, the utterance state of a specific subject can be determined. Controlling the microphone according to the determination result prevents malfunction when the subject is not speaking.

Embodiment 2
FIG. 6 is a block diagram schematically showing the configuration of the speaker determination device 200, which is the information processing device according to the second embodiment.
The speaker determination device 200 includes a face detection unit 110, an opening degree acquisition unit 120, a speaker determination unit 230, and a voice recognition unit 240.
The face detection unit 110 and the opening degree acquisition unit 120 in the second embodiment are the same as the face detection unit 110 and the opening degree acquisition unit 120 of the first embodiment.

FIG. 7 is a schematic view showing an arrangement example of the speaker determination device 200, the image pickup device 160, and the microphone device 270.
As shown in FIG. 7, the speaker determination device 200, the image pickup device 160, and the microphone device 270 are arranged in a room.
The image pickup device 160 in the second embodiment is the same as the image pickup device 160 in the first embodiment.

The microphone device 270 is a device that collects the voice of a speaker. The microphone device 270 may include a single microphone, or it may include an array microphone in which a plurality of microphones are combined. The microphone device 270 is connected to the speaker determination device 200, is controlled by the speaker determination device 200, and gives the speaker determination device 200 a voice signal representing the collected voice.

Returning to FIG. 6, the voice recognition unit 240 recognizes the voice indicated by the voice signal input from the microphone device 270. For example, the voice recognition unit 240 recognizes the voice indicated by the voice signal by converting it into natural language. The voice recognition unit 240 gives voice information indicating the recognized voice to the speaker determination unit 230.

The speaker determination unit 230 determines the subject who emitted the voice indicated by the voice information given by the voice recognition unit 240. It does so based on the position at which each face image was detected, which is indicated by the position information given by the face detection unit 110, and on the opening degree calculated for each face, which is indicated by the opening degree information given by the opening degree calculation unit 128.

For example, assume that there are three subjects in the room: subject A, subject B, and subject C.
In the speaker determination unit 230, determination rules that map the opening degree to a vowel are defined in advance. For example, an opening degree of 0% indicates silence, an opening degree of 30% indicates a sound whose vowel is "o", and an opening degree of 70% indicates a sound whose vowel is "a".

Therefore, when the speaker determination unit 230 determines from the position information and the opening degree information that the opening degree of subject A is 0%, that of subject B is 30%, and that of subject C is 70%, it can determine that subject A is silent, that subject B is emitting the voice in the voice information whose vowel is "o", and that subject C is emitting the voice in the voice information whose vowel is "a".
As described above, the voice indicated by the voice information can be assigned to one of a plurality of speakers.
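The rule-based assignment in this example can be sketched as follows. The rule table reproduces the three values given in the text; matching an observed opening degree to the nearest rule, and the function names, are assumptions made for illustration, since the text does not say how intermediate values are handled.

```python
# Opening degree (as a fraction) -> vowel; None means silence.
VOWEL_RULES = {0.0: None, 0.3: "o", 0.7: "a"}


def vowel_for_opening(degree):
    """Pick the rule whose opening degree is nearest (assumed matching)."""
    nearest = min(VOWEL_RULES, key=lambda level: abs(level - degree))
    return VOWEL_RULES[nearest]


def assign_voices(opening_by_subject, recognized_vowels):
    """Map each subject to the recognized vowel that fits the mouth shape."""
    assignment = {}
    for subject, degree in opening_by_subject.items():
        vowel = vowel_for_opening(degree)
        # A subject is credited only with a vowel that was actually heard.
        assignment[subject] = vowel if vowel in recognized_vowels else None
    return assignment


result = assign_voices({"A": 0.0, "B": 0.3, "C": 0.7}, {"o", "a"})
```

Here `result` maps subject A to `None` (silent), subject B to "o", and subject C to "a", matching the example in the text.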

Like the first embodiment, the speaker determination device 200 according to the second embodiment can be configured by hardware as shown in FIG. 4. For example, the voice recognition unit 240 can also be realized by the CPU 11 executing a program. The voice recognition unit 240 can also be realized by a processing circuit (not shown).
In the second embodiment, the I/F 14 also functions as an input unit that receives the voice signal from the microphone device 270.

FIG. 8 is a flowchart showing a speaker determination method, which is the information processing method according to the second embodiment.
Of the steps shown in FIG. 8, steps that are the same as those shown in FIG. 5 are given the same reference signs as in FIG. 5, and their detailed description is omitted.

The processing of steps S10 to S18 in FIG. 8 is the same as that of steps S10 to S18 in FIG. 5. However, after step S18, the process proceeds to step S22.

Separately from the processing in steps S10 to S18, the voice recognition unit 240 acquires from the microphone device 270 a voice signal representing, for example, the voice of a subject in the room (S20). The place where the voice is acquired is not particularly limited and may be, for example, inside a vehicle or outdoors.

Next, the voice recognition unit 240 performs recognition processing on the voice indicated by the voice signal acquired in step S20 (S21). Then, the process proceeds to step S22.
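One way to picture the two branches that meet at step S22 is sketched below. Running the branches in two threads is an assumption made for this example; the description only says that the voice processing is performed separately from steps S10 to S18. Both branch functions are hypothetical placeholders that pass their input through unchanged.

```python
from concurrent.futures import ThreadPoolExecutor


def image_branch(frame):
    """Placeholder for S10 to S18: opening degree per detected face."""
    return dict(frame)


def audio_branch(signal):
    """Placeholder for S20 and S21: recognized voice information."""
    return list(signal)


def run_once(frame, signal):
    """Run both branches and return their results for use in S22."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        openings = pool.submit(image_branch, frame)
        recognized = pool.submit(audio_branch, signal)
        return openings.result(), recognized.result()
```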

The voice recognition method may be, for example, one based on a hidden Markov model, a probabilistic model that has long been studied, or one that uses a recurrent neural network.
A hidden Markov model represents time-varying data with a probabilistic model. A recurrent neural network is a neural network extended so that it can handle preceding and following time-series information. Because it can make use of sequential information, it is considered well suited to the field of natural language processing.
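To make the hidden Markov model idea concrete, the sketch below implements the standard forward algorithm, which computes the probability of an observation sequence under a given model. It is a generic textbook illustration, not the recognizer used in the embodiment; the function name and the list-based parameter layout are assumptions.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Probability of an observation sequence under a discrete HMM.

    initial[s]        : probability of starting in state s
    transition[p][s]  : probability of moving from state p to state s
    emission[s][o]    : probability of emitting symbol o in state s
    """
    if not observations:
        return 1.0
    states = range(len(initial))
    # alpha[s] = P(observations so far, current state = s)
    alpha = [initial[s] * emission[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in states) * emission[s][obs]
            for s in states
        ]
    return sum(alpha)
```

In a speech recognizer of this kind, such a likelihood would typically be evaluated for competing word or phoneme models, and the best-scoring model chosen.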

In step S22, the speaker determination unit 230 determines the subject who emitted the voice recognized in step S21, based on the positions at which the face images were detected in step S11 and the opening degree calculated for each face image in step S18.

As described above, according to the second embodiment, it is possible to determine who uttered the voice recognized by the voice recognition processing. Specifically, when a plurality of subjects input voice commands at the same time, the commands can be told apart and each applied to the person who issued it.

In the first embodiment described above, the speaker determination unit 130 sends the determination result to the microphone device 170, and the microphone device 170 controls the microphone according to that result. However, the first embodiment is not limited to such an example. For example, the speaker determination unit 130 may send the microphone device 170 a command for controlling the microphone in accordance with the determination result, and the microphone device 170 may control the microphone in response to that command.

100, 200: speaker determination device; 110: face detection unit; 120: opening degree acquisition unit; 122: lip detection unit; 123: opening width calculation unit; 125: eye detection unit; 126: nose detection unit; 127: nose length calculation unit; 128: opening degree calculation unit; 130, 230: speaker determination unit; 240: voice recognition unit; 160: image pickup device; 170, 270: microphone device.

Claims

An information processing apparatus comprising:
a face detection unit that detects, from an image represented by image data, a face image, which is the portion corresponding to a person's face;
an opening width acquisition unit that acquires, from the face image, an opening width in the vertical direction of the person's lips;
a nose length acquisition unit that acquires, from the face image, the length of the nose, which is the distance between the midpoint of the person's left and right eyes and the lower end point of the nose; and
an opening degree calculation unit that calculates an opening degree indicating the relative length of the opening width with respect to the length of the nose.

The information processing apparatus according to claim 1, wherein the opening width acquisition unit detects, from the face image, the inner lower end point of the upper lip and the inner upper end point of the lower lip, and sets the distance between the lower end point and the upper end point as the opening width.

The information processing apparatus according to claim 1 or 2, wherein the nose length acquisition unit detects, from the face image, a right end point, which is an end point corresponding to the outer corner or the inner corner of the person's right eye, and a left end point, which is an end point corresponding to the outer corner or the inner corner of the person's left eye, specifies the midpoint between the right end point and the left end point, detects the lower end point of the nose from the face image, and acquires the length of the nose by calculating the distance between the midpoint and the lower end point of the nose.

The information processing apparatus according to any one of claims 1 to 3, wherein the opening degree calculation unit calculates the opening degree by dividing the opening width by the length of the nose.

The information processing apparatus according to any one of claims 1 to 4, further comprising a speaker determination unit that determines whether or not the person is speaking using the opening degree.

The information processing apparatus according to any one of claims 1 to 4, wherein
the face detection unit detects, from the image, a plurality of face images, one corresponding to the face of each of a plurality of persons,
the opening width acquisition unit acquires a plurality of the opening widths from the plurality of face images,
the nose length acquisition unit acquires a plurality of the nose lengths from the plurality of face images, and
the opening degree calculation unit calculates a plurality of the opening degrees, one corresponding to each of the plurality of persons.

The information processing apparatus according to claim 6, further comprising a speaker determination unit that determines, using the plurality of opening degrees, whether or not each of the plurality of persons is speaking.

The information processing device according to claim 5 or 7, wherein the speaker determination unit controls a microphone device that collects voice from the person according to the result of the determination.

The information processing apparatus according to claim 6, further comprising:
a voice recognition unit that recognizes the voice indicated by a voice signal; and
a speaker determination unit that determines, from the plurality of opening degrees, which of the plurality of persons emitted the recognized voice.

An information processing method comprising:
detecting, from an image represented by image data, a face image, which is the portion corresponding to a person's face;
acquiring, from the face image, an opening width in the vertical direction of the person's lips;
acquiring, from the face image, the length of the nose, which is the distance between the midpoint of the person's left and right eyes and the lower end point of the nose; and
calculating an opening degree indicating the relative length of the opening width with respect to the length of the nose.