JP2021110996A

JP2021110996A - Speaker discrimination method, speaker discrimination program, and speaker discrimination device

Info

Publication number: JP2021110996A
Application number: JP2020000673A
Authority: JP
Inventors: 陽一景山; Yoichi Kageyama; 悦郎中村; Etsuro Nakamura; 礎成白須; Motonari Shirasu
Original assignee: Akita University NUC; Japan Business Systems Inc
Current assignee: Akita University NUC; Japan Business Systems Inc
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2021-08-02
Anticipated expiration: 2040-01-07
Also published as: JP7396590B2

Abstract

To provide a speaker discrimination method that requires only simple equipment and can accurately discriminate a speaker from among a plurality of subjects. SOLUTION: A method of discriminating a speaker based on video and audio data includes the steps of obtaining a plurality of time-series first lip behavior feature amounts based on the distance between the upper lip and the lower lip of a subject in the obtained video, obtaining a plurality of time-series voice feature amounts based on the obtained audio data, and discriminating the speaker. The step of discriminating the speaker includes obtaining a plurality of time-series second lip behavior feature amounts from the plurality of voice feature amounts, obtaining a discrimination difference, which is the difference between the first lip behavior feature amount and the second lip behavior feature amount, and setting the subject with the smallest discrimination difference among the subjects included in the video as the speaker. SELECTED DRAWING: Figure 1

Description

The present invention relates to the discrimination of a speaker based on video and audio information.

In recent years, efforts to improve operational efficiency and to review working environments have been under way in order to realize work style reform. One such workplace measure is making work and meetings more efficient.
Meeting minutes record what was discussed and agreed, with the aim of sharing decisions and the circumstances behind them, and the resulting minutes contribute to improving the quality of subsequent meetings and the efficiency of other work. An automatic minutes creation system built on speech recognition technology can reduce human error in creating minutes as well as the personnel and time required. Furthermore, in such a system, technology that automatically discriminates the speaker of each utterance and technology that improves speech recognition accuracy reduce the man-hours needed to create minutes and contribute to more efficient meetings and operations.

Patent Document 1 discloses a web conferencing system that can improve the visibility of the lips of a participant who is speaking, and that identifies the speaker using voice and lip movement. Specifically, voice information is used to identify the client terminal at which the speaker is located, and the person whose lips are moving the most at the identified terminal is determined to be the speaker. With this technique, however, it becomes difficult to discriminate the speaker when the lips of several persons at the same terminal are moving at the same time.

Patent Document 2 discloses a speaker estimation system that can estimate at least one of the next speaker and the timing at which the next speaker will speak in a multi-party conversation among a plurality of participants. However, although this document describes using lip movement to determine the next speaker, it does not describe a technique for discriminating the current speaker.

Patent Document 3 discloses a terminal device that can create minutes of a meeting without requiring special equipment such as a large number of microphones. However, this technique requires the voice information of the user to be registered in advance for each terminal, so preparation is needed before use.

Patent Document 4 discloses a system for recognizing the utterance state of conference participants. In this technique, a lip-vicinity region is set for each participant in an image acquired with a fisheye lens, and the utterance state is estimated using a feature amount indicating the luminance or color within that region. However, because this technique focuses only on lip movement, it becomes difficult to discriminate the speaker when the lip regions of several participants are moving at the same time.

Patent Document 5 discloses a camera and microphone device for video conferencing that displays an image above the faces (overhead) of conference attendees. However, because this technique identifies the speaker using the direction of arrival of the voice, discrimination becomes difficult when the persons are close to each other.

Patent Document 6 discloses a technique that makes it possible to synthesize emotionally expressive speech while taking into account settings specific to each speaker. It describes implementing a process that builds a network for generating the facial feature points of a frame from voice feature data extracted frame by frame from the speaker's utterance. However, the feature amounts it uses are limited, and the accuracy needs to be improved.

Patent Document 1: Japanese Unexamined Patent Publication No. 2019-117997
Patent Document 2: Japanese Unexamined Patent Publication No. 2018-077791
Patent Document 3: Japanese Unexamined Patent Publication No. 2016-029468
Patent Document 4: Japanese Unexamined Patent Publication No. 2015-019162
Patent Document 5: Japanese Unexamined Patent Publication No. 2012-147420
Patent Document 6: Japanese Patent No. 6582157

In view of the above, an object of the present invention is to provide a speaker discrimination method that requires only simple equipment and can accurately discriminate a speaker from among a plurality of subjects. A program and a device for that purpose are also provided.

One aspect of the present invention is a speaker discrimination method for discriminating a speaker from video and audio data, the method comprising: a process of obtaining a plurality of first lip behavior feature amounts in time series based on the distance between the upper lip and the lower lip of a subject in the acquired video; a process of obtaining a plurality of voice feature amounts in time series based on the acquired audio data; and a process of discriminating the speaker. The process of discriminating the speaker includes a process of obtaining a plurality of second lip behavior feature amounts in time series from the plurality of voice feature amounts, a process of obtaining a discrimination difference, which is the difference between the first lip behavior feature amount and the second lip behavior feature amount, and a process of setting the subject with the smallest discrimination difference among the subjects included in the video as the speaker.
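The final selection step can be sketched as follows. This is only an illustration under stated assumptions: the passage does not say how the difference between the two time series is measured, so the mean absolute difference used here is an assumption, the second lip behavior feature series is treated as already available, and the names `discrimination_difference` and `pick_speaker` are hypothetical.

```python
from statistics import mean


def discrimination_difference(first, second):
    """Mean absolute difference between two equally long feature series.

    Assumption: the source only says "difference"; the mean absolute
    difference is one plausible choice, not necessarily the patented one.
    """
    if len(first) != len(second) or not first:
        raise ValueError("series must be non-empty and of equal length")
    return mean(abs(a - b) for a, b in zip(first, second))


def pick_speaker(first_by_subject, second):
    """Return the subject whose video-derived series is closest to `second`.

    first_by_subject: dict mapping a subject id to that subject's first
    lip behavior feature series (from video).
    second: second lip behavior feature series (derived from audio).
    """
    return min(
        first_by_subject,
        key=lambda sid: discrimination_difference(first_by_subject[sid], second),
    )


if __name__ == "__main__":
    first = {"p1": [0.1, 0.5, 0.9], "p2": [0.8, 0.2, 0.1]}
    print(pick_speaker(first, [0.2, 0.5, 0.8]))
```

Because the audio yields a single series while the video yields one series per subject, the comparison reduces to a minimum search over subjects.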

The first lip behavior feature amount may be obtained as the ratio of the distance between the upper lip and the lower lip to the distance between two points on the bridge of the subject's nose.

The process of discriminating the speaker may include creating a plurality of sections that have different start times and a predetermined time range, obtaining, for each of the sections, a section difference between the first lip behavior feature amount and the second lip behavior feature amount, and using the average of the section differences as the discrimination difference.

Among the plurality of sections, the start times may be set so that adjacent sections partly overlap in time.
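The sectioning described above can be sketched as follows. The section length, the step between start times, and the use of a mean absolute difference inside each section are all assumptions made for illustration, and `make_sections` and `averaged_section_difference` are hypothetical names. Choosing a step smaller than the section length makes adjacent sections overlap.

```python
def make_sections(n_frames, length, step):
    """Return (start, end) index pairs for sections of `length` frames.

    Start times differ by `step` frames; step < length gives overlap.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    return [(start, start + length)
            for start in range(0, n_frames - length + 1, step)]


def averaged_section_difference(first, second, length, step):
    """Average of the per-section differences between two series.

    Assumption: each section difference is a mean absolute difference.
    """
    sections = make_sections(min(len(first), len(second)), length, step)
    if not sections:
        raise ValueError("series shorter than one section")
    section_diffs = []
    for start, end in sections:
        total = sum(abs(a - b)
                    for a, b in zip(first[start:end], second[start:end]))
        section_diffs.append(total / length)
    return sum(section_diffs) / len(section_diffs)


if __name__ == "__main__":
    # Sections of 4 frames starting every 2 frames overlap by 2 frames.
    print(make_sections(10, 4, 2))
```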

The plurality of first lip behavior feature amounts and the plurality of voice feature amounts may be normalized to the range of 0.0 to 1.0 inclusive.

Another aspect of the present invention is a speaker discrimination program for discriminating a speaker from video and audio data, the program comprising: a step of obtaining a plurality of first lip behavior feature amounts in time series based on the distance between the upper lip and the lower lip of a subject in the acquired video; a step of obtaining a plurality of voice feature amounts in time series based on the acquired audio data; and a step of discriminating the speaker. The step of discriminating the speaker includes a step of obtaining a plurality of second lip behavior feature amounts in time series from the plurality of voice feature amounts, a step of obtaining a discrimination difference, which is the difference between the first lip behavior feature amount and the second lip behavior feature amount, and a step of setting the subject with the smallest discrimination difference among the subjects included in the video as the speaker.

In the speaker discrimination program, the first lip behavior feature amount may be obtained as the ratio of the distance between the upper lip and the lower lip to the distance between two points on the bridge of the subject's nose.

In the speaker discrimination program, the step of discriminating the speaker may include a step of creating a plurality of sections that have different start times and a predetermined time range, and a step of obtaining, for each of the sections, a section difference between the first lip behavior feature amount and the second lip behavior feature amount and using the average of the section differences as the discrimination difference.

Among the plurality of sections in the speaker discrimination program, the start times may be set so that adjacent sections partly overlap in time.

In the speaker discrimination program, the plurality of first lip behavior feature amounts and the plurality of voice feature amounts may be normalized to the range of 0.0 to 1.0 inclusive.

Also provided is a speaker discrimination device for discriminating a speaker from video and audio data, the device comprising a camera that acquires video, a microphone that acquires audio data, storage means in which the speaker discrimination program is stored, and calculation means that performs calculations based on the speaker discrimination program. The calculation means receives the video acquired by the camera and the audio data acquired by the microphone and executes the calculations of the speaker discrimination program on them.

According to the present invention, a speaker can be accurately discriminated from among a plurality of subjects using simple equipment.

FIG. 1 is a diagram showing the flow of the speaker discrimination method S1.
FIG. 2(a) is a diagram schematically illustrating part of a video, and FIG. 2(b) is a diagram illustrating part of the audio data.
FIG. 3 is a diagram showing the flow of the first lip behavior feature amount calculation process S20.
FIG. 4 is a diagram illustrating the arrangement of feature points.
FIG. 5 is an enlarged view of the lip portion of FIG. 4.
FIG. 6 is a diagram illustrating the change in the vertical lip feature amount.
FIG. 7 is an enlarged view of the nose portion of FIG. 4.
FIG. 8 is a diagram illustrating the first lip behavior feature amount.
FIGS. 9(a) and 9(b) are diagrams illustrating audio data.
FIG. 10 is a diagram illustrating the MFCC obtained from the audio data.
FIG. 11 is a diagram illustrating the flow of the speaker discrimination process S40.
FIG. 12 is a diagram illustrating the speaker discrimination process S40.
FIG. 13 is a diagram illustrating the configuration of the speaker discrimination device.
FIGS. 14(a) and 14(b) are diagrams illustrating the test procedure.

{Speaker discrimination method}
FIG. 1 is a diagram showing the flow of a speaker discrimination method S1 according to one embodiment. As FIG. 1 shows, the speaker discrimination method S1 of this embodiment includes a video/audio data acquisition process S10, a first lip behavior feature amount calculation process S20, a voice feature amount calculation process S30, and a speaker discrimination process S40. Each process is described below.

[Video/audio data acquisition process S10]
In the video/audio data acquisition process S10, video and audio data of the subjects are acquired. Video can be acquired with a camera and audio data with a microphone. According to this embodiment, the speaker can be discriminated using a camera that can capture several subjects simultaneously (for example, an omnidirectional camera or a wide-angle camera) and a microphone that can pick up the subjects' voices. That is, a single video camera and a single microphone suffice as long as information on all subjects can be acquired. Multiple video cameras and microphones may be used, but their number can be kept to the minimum needed to acquire the video and audio data of the subjects.
The camera and the microphone may be separate devices or a single integrated device. A microphone built into the camera can therefore also be used.

Through the video/audio data acquisition process S10, video of the subject's face can be acquired, for example as shown schematically in FIG. 2(a). Audio data can be acquired as a waveform with time on the horizontal axis, as shown schematically in FIG. 2(b).

[First lip behavior feature amount calculation process S20]
In the first lip behavior feature amount calculation process S20, a feature amount representing lip behavior (the first lip behavior feature amount) is calculated from the video acquired in the video/audio data acquisition process S10. FIG. 3 shows the flow of the first lip behavior feature amount calculation process S20.
As FIG. 3 shows, the first lip behavior feature amount calculation process S20 comprises a feature point placement process S21, a vertical lip feature amount calculation process S22, a nose feature amount calculation process S23, and a first lip behavior feature amount calculation process S24. Each process is described below.

<Feature point placement process S21>
In the feature point placement process S21, feature points are placed on the face of the subject in the video acquired in the video/audio data acquisition process S10. FIG. 4 shows an example, in which feature points A, marked "●", are placed on the face in the image of FIG. 2(a) (for legibility, the reference sign A is attached to only some of the feature points). In this embodiment, a plurality of feature points A are placed along the contours of the lower half of the face (cheeks to chin), the eyebrows, the eyes, the nose (bridge and lower end), and the lips (upper and lower).
The method of placing the feature points is not particularly limited. For example, the luminance difference between adjacent pixels can be used, with positions whose luminance difference is at or above a predetermined threshold judged to be the contour of each part. Commercially available or publicly released software, such as Dlib, may also be used.
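The luminance-difference criterion can be illustrated on a single row of pixels. This is a deliberately simplified one-dimensional sketch, not the method of Dlib or of any particular library; the function name `contour_positions` and the sample values are hypothetical.

```python
def contour_positions(row, threshold):
    """Indices i where the luminance step between pixel i and i+1
    is at least `threshold` (a candidate contour position)."""
    return [i for i in range(len(row) - 1)
            if abs(row[i + 1] - row[i]) >= threshold]


if __name__ == "__main__":
    # Luminance values along one image row (0 to 255).
    row = [10, 12, 11, 80, 82, 20]
    print(contour_positions(row, 30))
```

A real implementation would work on two-dimensional images and would still need to assign the detected contours to facial parts, which this sketch does not attempt.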

In this embodiment, as described later, the vertical position of the lips and the distance between the nasal root and the nasal tip are tracked in time series, so the feature points A need only be placed at the positions and in the number necessary for this. Accordingly, in this embodiment feature points A are placed at least on the lips and the nose.
Feature points A may nevertheless be placed elsewhere for other reasons. For example, feature points A along the contour of the face may be used to obtain the position and size of the subject's face or to remove information other than the face.

Feature points A are placed in this way for each image in the video. That is, feature points A are placed on each of the time-sequential images that make up the video. FIG. 4 illustrates one such image.

<Vertical lip feature amount calculation process S22>
In the vertical lip feature amount calculation process S22, the vertical feature amount of the lips is calculated. FIG. 5 is an enlarged view of the lip portion of FIG. 4.
Here, the "vertical lip feature amount" is a feature amount in the direction in which the upper lip and the lower lip are aligned. A specific example is the distance in that direction between a feature point A_1 belonging to the upper lip and a feature point A_2 belonging to the lower lip. The choice of feature points A_1 and A_2 is not particularly limited, as long as the vertical lip feature amount can be seen to change over time as the subject speaks. In the example shown in FIG. 5, the pair of feature points farthest apart from each other was selected from among the feature points A closest to the median line of the face. This makes the time-series change in the vertical lip feature amount easier to see.

Accordingly, this process obtains the distance B between feature point A_1 and feature point A_2 in the direction in which the upper lip and the lower lip are aligned. The distance B may be expressed in any unit, such as coordinates, length, or number of pixels. In this embodiment it is expressed as a number of pixels, and this distance B is the "vertical lip feature amount".

The vertical lip feature amount is calculated for each of a plurality of time-sequential images (frames). This yields the time-series change in the vertical lip feature amount with the passage of time (frame number), as shown for example in FIG. 6.
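A minimal sketch of the per-frame calculation of B follows. It assumes that each frame's landmarks are given as (x, y) pixel coordinates under the hypothetical keys "A1" and "A2", and that the face is upright so that the image y axis matches the direction in which the lips are aligned.

```python
def vertical_lip_distance(upper, lower):
    """Distance B in pixels between A_1 (upper lip) and A_2 (lower lip),
    measured along the image y axis (assumes an upright face)."""
    return abs(lower[1] - upper[1])


def lip_distance_series(frames):
    """Vertical lip feature amount for each frame, in frame order."""
    return [vertical_lip_distance(f["A1"], f["A2"]) for f in frames]


if __name__ == "__main__":
    frames = [
        {"A1": (100, 200), "A2": (100, 204)},  # mouth almost closed
        {"A1": (101, 199), "A2": (101, 215)},  # mouth open
    ]
    print(lip_distance_series(frames))
```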

<Nose feature amount calculation process S23>
In the nose feature amount calculation process S23, the vertical feature amount of the nose is calculated. FIG. 7 is an enlarged view of the nose portion of FIG. 4.
Here, the "nose feature amount" is a feature amount in the direction along the bridge of the nose. A specific example is the distance between two feature points (A_3, A_4) arranged along the bridge of the nose. The choice of the two feature points A_3 and A_4 is not particularly limited. In the example shown in FIG. 7, one is the feature point A_3 closest to the nasal root and the other is the feature point A_4 closest to the nasal tip. Because the two points are far apart, the time-series change in the nose feature amount is easier to see.

Accordingly, this process obtains the distance C between the nose feature points A_3 and A_4 in the direction in which the upper lip and the lower lip are aligned. The distance C may be expressed in any unit, such as coordinates, length, or number of pixels, but it uses the same unit as the lip distance B described above. In this embodiment it is therefore expressed as a number of pixels.
In this embodiment, this distance C is the "nose feature amount".
The nose feature amount is calculated for each of a plurality of time-sequential images (frames). Although not illustrated, this yields the change in the nose feature amount with the passage of time (frame number) in the same manner as in FIG. 6.

<First lip behavior feature amount calculation process S24>
In the first lip behavior feature amount calculation process S24, the first lip behavior feature amount is calculated from the vertical lip feature amount obtained in the vertical lip feature amount calculation process S22 and the nose feature amount obtained in the nose feature amount calculation process S23. Specifically, it is obtained by the following equation.
First lip behavior feature amount = vertical lip feature amount / nose feature amount
With the distances B and C exemplified above, the first lip behavior feature amount is obtained by the following equation.
First lip behavior feature amount = B / C

The first lip behavior feature amount is a dimensionless quantity representing the ratio of the vertical lip feature amount to the nose feature amount, which reduces the influence of the distance between the subject and the imaging means.
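The effect of taking the ratio can be checked with a small example. The pixel values are made up for illustration, and `first_lip_feature` is a hypothetical name; the point is only that scaling B and C by the same factor, as happens when the subject moves away from the camera, leaves B / C unchanged.

```python
def first_lip_feature(b, c):
    """First lip behavior feature amount B / C (dimensionless)."""
    if c <= 0:
        raise ValueError("nose feature amount C must be positive")
    return b / c


if __name__ == "__main__":
    near = first_lip_feature(12, 60)  # subject close to the camera
    far = first_lip_feature(6, 30)    # same mouth opening, half the scale
    print(near, far)
```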

The nose feature amount therefore does not necessarily have to be taken into account, and the vertical lip feature amount itself may be used as the first lip behavior feature amount. The nose feature amount is used in this embodiment for the following reason.
If the distance between the camera and the subject changes while the video is being acquired, the measured vertical width of the lips changes, so the vertical lip feature amount differs even for the same lip movement. Taking the ratio to the nose feature amount, which changes little with lip movement, and using that ratio as the index reduces the influence of changes in the distance between the camera and the subject.
The reference quantity therefore does not have to be the nose feature amount. A feature amount satisfying the following two conditions may be extracted instead, and its ratio to the vertical lip feature amount used as the first lip behavior feature amount.
The first condition is that the distance between the feature points varies little with speech movements and changes in facial expression. The distance between the nasal root and the nasal tip used for the nose feature amount meets this condition: the region is not easily affected by speech movements or changes in facial expression, and the distance between its feature points moves little.
The second condition is that, when the angle of the face changes vertically or horizontally, the distance between the feature points varies in a way similar to the vertical width of the lips. When the face points straight at the camera, the ratio of the "distance between feature points" to the "distance between the camera and the subject" is constant whichever pair of feature points is used. When the angle of the face changes, however, this ratio changes. For example, even if the distance between the camera and the subject does not change, the width of the face (the length of the straight line connecting the feature points at the left and right edges of the face) changes when the subject turns sideways. In contrast, the vertical lip feature amount and the nose feature amount of this embodiment are roughly parallel to each other and both lie at the center of the face. Their distances therefore tend to change in a similar way when the angle of the face changes, so using the nose feature amount reduces the influence of changes in the angle of the face.
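The second condition can be illustrated with a simplified model that is not taken from the source: assume an orthographic projection, treat the lip segment B and the nose segment C as parallel segments lying in one plane, and write W for the width of the face, φ for a vertical (pitch) rotation, and θ for a horizontal (yaw) rotation. Under these assumptions a pitch rotation shortens B and C by the same factor, while a yaw rotation shortens W but not B.

```latex
\[
B' = B\cos\varphi,\qquad C' = C\cos\varphi
\quad\Longrightarrow\quad
\frac{B'}{C'} = \frac{B}{C}
\]
\[
W' = W\cos\theta,\qquad B' = B
\quad\Longrightarrow\quad
\frac{B'}{W'} = \frac{B}{W\cos\theta} \neq \frac{B}{W}
\quad (0 < |\theta| < \pi/2)
\]
```

In this idealized model the ratio B / C is unaffected by the rotation, whereas a ratio against the face width is not. A real face is not planar and a real camera is not orthographic, so the cancellation is only approximate.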

In this embodiment, the first lip behavior feature amount is calculated from the vertical lip feature amount and the nose feature amount of the same image (frame), and it is calculated for each of a plurality of time-sequential images (frames). This yields the change in the first lip behavior feature amount with the passage of time (frame number), as shown in FIG. 8.
Portions without utterance may be excluded so that only portions with utterance are processed.

The first lip behavior feature amount may also be normalized to the range of 0.0 to 1.0 inclusive in order to reduce individual differences in the lip movements of speakers.
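One common way to map a series onto the range 0.0 to 1.0 is min-max scaling, sketched below. The source states only the target range, so the choice of min-max scaling, the handling of a constant series, and the name `min_max_normalize` are assumptions.

```python
def min_max_normalize(series):
    """Scale a series linearly so its minimum maps to 0.0 and its maximum to 1.0."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no movement; map it to 0.0 (assumption).
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]


if __name__ == "__main__":
    print(min_max_normalize([2, 4, 6]))
```

Normalizing each subject's series separately means a speaker who opens the mouth only slightly and one who opens it widely are compared on the same scale.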

[Voice feature amount calculation process S30]
In the voice feature amount calculation process S30, a voice feature amount is calculated from the audio data (for example, FIG. 2(b)) obtained in the video/audio data acquisition process S10. This extracts the information needed for speaker discrimination from audio data that contains a large amount of complex information, making the data easier to handle while maintaining accuracy.

The voice feature amount is not particularly limited as long as it allows the information needed for speaker discrimination to be extracted from the voice data as described above and makes the data easier to handle while maintaining accuracy. Among such features, it is preferable to use mel-frequency cepstrum coefficients (MFCC), and more preferable to use the 0th dimension of the MFCC. This is because the MFCC carry the low-frequency characteristics that are useful as features for speech recognition, and because the low-dimensional (0th-dimension) component changes with the acoustic characteristics of the vocal tract and the shape of the oral cavity.

As a more specific example, the voice feature amount is obtained as follows. FIGS. 9 and 10 are diagrams for explanation.
First, a portion (portion E_1) having a predetermined time length D is extracted, as shown in FIG. 9(b), from the voice data obtained in the video/audio data acquisition process S10 and shown in FIG. 9(a). The value of D is not particularly limited, but is 20 ms in this example.
Next, the MFCC are computed for the voice data of portion E_1 to obtain MFCC data such as that shown in FIG. 10. The MFCC can be obtained by a known method, for example the one described in Tatsuya Kawahara (ed.), "Speech Recognition System" (音声認識システム), revised 2nd edition, Ohmsha, 2016.
For example, the calculation proceeds as follows. First, the voice data (voice waveform) is Fourier transformed to obtain its frequency components, and the power spectrum (the loudness of each frequency component) is calculated from them. Next, a mel filter bank is applied to this power spectrum. Because the resolution of human hearing decreases as frequency increases, applying a mel filter bank makes it possible to extract features that match human auditory characteristics. The cepstrum features are then calculated from the result, which separates the harmonic component of the voiceprint wave (a feature that varies from person to person) from the envelope component due to the vocal tract (a feature that varies with the utterance content). The low-dimensional components of the cepstrum features (0th to 14th dimensions) are the ones mainly used for speech recognition, and, as described above, it is preferable to use the 0th dimension in this embodiment.
The cepstrum features extracted in this way are called MFCC and are used as the voice feature amount.
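For reference, the final cepstrum step is often written in the following textbook form, in which a discrete cosine transform is applied to the logarithm of the mel filter bank outputs. The symbols S_m (output of the m-th mel filter), M (number of mel filters), and c_n (n-th coefficient) are introduced here for illustration only and do not appear in the original; scaling constants differ between implementations, so this is a sketch rather than the exact formula used in the embodiment.

```latex
c_n = \sum_{m=1}^{M} \log(S_m)\,
      \cos\!\left(\frac{\pi n \left(m - \tfrac{1}{2}\right)}{M}\right),
\qquad n = 0, 1, \dots, 14

% For n = 0 the cosine factor equals 1, so
c_0 = \sum_{m=1}^{M} \log(S_m)
```

In this form the 0th dimension is simply the sum of the log filter bank energies, which is why it tracks the overall loudness of the frame.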

After that, as shown in FIG. 9(a), the voice feature amount is obtained in the same way for portion E_2 (time length D), which is delayed by a time d relative to portion E_1. Repeating this in sequence yields a plurality of voice feature amounts in time series. The value of the delay d is not particularly limited and may satisfy D > d, D = d, or D < d, but D > d is preferable from the viewpoint of improving accuracy. In this example, D is 20 ms and d is 10 ms.
It is also possible to exclude the portions without utterance and target only the portions with utterance.
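The way the portions E_1, E_2, ... are laid out in time can be sketched as follows, using the example values D = 20 ms and d = 10 ms. The function name is hypothetical, and dropping a trailing portion that would run past the end of the recording is an assumption; the text does not say how the end of the data is handled.

```python
def frame_bounds(total_ms, length_ms=20, shift_ms=10):
    """Return (start, end) times in ms of every full-length portion E_k."""
    if length_ms <= 0 or shift_ms <= 0:
        raise ValueError("length_ms and shift_ms must be positive")
    bounds = []
    start = 0
    # Keep only portions that fit entirely inside the recording (assumption).
    while start + length_ms <= total_ms:
        bounds.append((start, start + length_ms))
        start += shift_ms
    return bounds


# A 100 ms recording yields portions starting every 10 ms, each 20 ms long.
portions = frame_bounds(total_ms=100)
```

With D > d, consecutive portions overlap by D - d (10 ms here), so one voice feature amount is produced every 10 ms.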

As described above, the time-series change of the voice feature amount, which is the value of the 0th dimension of the MFCC, can be acquired.
This voice feature amount may be normalized to the range of 0.0 to 1.0 inclusive within the utterance section in order to reduce individual differences in the loudness of speakers' voices.

[Speaker discrimination process S40]
In the speaker discrimination process S40, the speaker is identified from the first lip behavior feature amount obtained in the first lip behavior feature amount calculation process S20 and the voice feature amount obtained in the voice feature amount calculation process S30. The flow is shown in FIG. 11. The speaker discrimination process S40 of this embodiment includes a section setting process S41, a second lip behavior feature amount calculation process S42, a section difference calculation process S43, and a speaker discrimination process S44. Each process is described below.

<Section setting process S41>
In the section setting process S41, the video and voice data obtained in the video/audio data acquisition process S10 are divided into a plurality of "sections". FIG. 12 is a diagram for explanation.
A "section" is the minimum unit for calculating the section difference between the first lip behavior feature amount described above and the second lip behavior feature amount obtained later. The section length can be set to, for example, 1000 ms. A plurality of sections are set with slightly shifted start times, giving section 1 to section M as shown in FIG. 12.
Using sections in this way makes it possible to discriminate the speaker in consideration of the time-series change of lip movement rather than frame by frame.

The shift in start time between adjacent sections is not particularly limited; it may be equal to, shorter than, or longer than the section length. However, this shift is preferably shorter than the section length, as in FIG. 12, and can be set to, for example, about 0.1 times the section length (100 ms in this example). That is, the start times can be chosen so that adjacent sections partly overlap in time.
Setting the sections so that adjacent sections partly overlap in time makes it possible to acquire more feature patterns than when the sections do not overlap.

The section length is set so that each section contains a plurality of first lip behavior feature amounts and a plurality of voice feature amounts. For example, when one section is 1000 ms long and a voice feature amount is produced every 10 ms as in the above example (d in FIG. 9(a) is 10 ms), the section contains 100 voice feature amount data points. For the first lip behavior feature amount, on the other hand, an ordinary camera records 30 frames per second (1000 ms) and one first lip behavior feature amount is obtained per frame, so the section contains 30 data points.
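The section layout and the resulting data counts can be checked with a short sketch using the example values (1000 ms sections, 100 ms shift, one voice feature amount every 10 ms, 30 frames per second). The function names and the 3000 ms recording length are hypothetical, and keeping only sections that fit entirely inside the recording is an assumption.

```python
def section_starts(total_ms, section_ms=1000, shift_ms=100):
    """Start times in ms of sections 1..M that fit inside the recording."""
    if section_ms <= 0 or shift_ms <= 0:
        raise ValueError("section_ms and shift_ms must be positive")
    starts = []
    start = 0
    while start + section_ms <= total_ms:
        starts.append(start)
        start += shift_ms
    return starts


def samples_per_section(section_ms, voice_shift_ms=10, fps=30):
    """Number of voice and lip feature data points contained in one section."""
    voice_count = section_ms // voice_shift_ms
    lip_count = section_ms * fps // 1000
    return voice_count, lip_count


# A hypothetical 3000 ms utterance gives M = 21 overlapping sections.
starts = section_starts(total_ms=3000)
voice_count, lip_count = samples_per_section(1000)
```

Because the shift is one tenth of the section length, each section shares 900 ms with its neighbor, which is how the overlapping layout yields many more sections than a non-overlapping split of the same recording.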

<Second lip behavior feature amount calculation process S42>
In the second lip behavior feature amount calculation process S42, for each section set in the section setting process S41, the voice feature amounts obtained in the voice feature amount calculation process S30 are input to a neural network that has been trained in advance, and the second lip behavior feature amounts are calculated. This process therefore yields lip behavior feature amounts based on the voice data.

The number of data points of the second lip behavior feature amount is preferably the same as that of the first lip behavior feature amount. That is, when one section contains 100 voice feature amount data points and 30 first lip behavior feature amount data points as described above, 30 second lip behavior feature amount data points are calculated from the 100 voice feature amount data points. This makes the section difference described later easier to obtain.

This calculation of the second lip behavior feature amount is performed for each section.

The conditions and method for training the neural network in advance are not particularly limited as long as the voice feature amounts can be mapped to lip behavior feature amounts, but in this embodiment the training was performed as follows.

Training is performed using a neural network with a three-layer structure consisting of an input layer, an intermediate layer, and an output layer, under conditions matched to the section length described above and to the numbers of voice feature amount data points and first lip behavior feature amount data points contained in a section. In the above example, the section length is 1000 ms, the input layer has 100 dimensions to match the number of voice feature amount data points, the intermediate layer has 50 dimensions, and the output layer has 30 dimensions so that 30 data points are output to match the number of first lip behavior feature amount data points. Adam can be used as the gradient method.
Training proceeds by comparing the output of this output layer with the teacher data.
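The shape of this network can be illustrated with a forward pass only. The sketch below is not the embodiment's implementation: the sigmoid activations, the random placeholder weights, and all function names are assumptions, and training with Adam against teacher data is omitted. It only shows how 100 voice feature amounts for one section map to 30 lip behavior feature amounts through a 50-unit intermediate layer.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases; placeholders for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer):
    """Affine map: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(values):
    """Assumed activation; the original does not name one."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def predict_lip_features(voice_features, hidden_layer, output_layer):
    """Input layer (100) -> intermediate layer (50) -> output layer (30)."""
    hidden = sigmoid(dense(voice_features, hidden_layer))
    return sigmoid(dense(hidden, output_layer))


rng = random.Random(0)
hidden_layer = make_layer(100, 50, rng)
output_layer = make_layer(50, 30, rng)

# Hypothetical normalized voice feature amounts for one 1000 ms section.
voice_section = [rng.random() for _ in range(100)]
second_lip_features = predict_lip_features(voice_section, hidden_layer, output_layer)
```

A sigmoid output keeps every predicted value between 0 and 1, which matches the normalized range of the first lip behavior feature amount, but any activation that suits the teacher data could be used instead.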

<Section difference calculation process S43>
In the section difference calculation process S43, for each section, the difference between the first lip behavior feature amounts belonging to that section, obtained in the first lip behavior feature amount calculation process S20, and the second lip behavior feature amounts belonging to that section, obtained in the second lip behavior feature amount calculation process S42, is taken to obtain the section difference. More specifically, this is done as follows.

As described above, in this embodiment the video-based lip behavior feature amount (first lip behavior feature amount) and the voice-based lip behavior feature amount (second lip behavior feature amount) have the same number of data points, so the differences can be taken pairwise in order from the earliest time. In one section, for example, 30 difference values are obtained.
In this process, the absolute value of each difference value is taken, and the absolute values are averaged to give the section difference δ for that section. Thus, as shown in FIG. 12, the section differences δ_1, δ_2, δ_3, ..., δ_M are obtained for the respective sections.
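The section difference δ described here is a mean absolute difference, which can be sketched as follows. The function name and the four-point sample sequences are hypothetical; a real section in the example would hold 30 values per sequence.

```python
def section_difference(first_features, second_features):
    """Mean absolute difference between video-based and voice-based features."""
    if not first_features or len(first_features) != len(second_features):
        raise ValueError("both sequences must be non-empty and of equal length")
    total = sum(abs(a - b) for a, b in zip(first_features, second_features))
    return total / len(first_features)


# Hypothetical short section with four data points per sequence.
delta = section_difference([0.0, 0.5, 1.0, 0.5], [0.25, 0.5, 0.5, 0.0])
```

Taking absolute values before averaging prevents positive and negative deviations from cancelling, so δ is small only when the two sequences track each other closely over the whole section.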

<Speaker discrimination process S44>
In the speaker discrimination process S44, the plurality of section differences obtained in the section difference calculation process S43 are averaged to calculate the discrimination difference δ_ave, and the discrimination target person with the smallest discrimination difference δ_ave is taken to be the speaker.
In this way the speaker can be discriminated.
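The final decision can be sketched as an average followed by a minimum over the discrimination target persons. The function names, the labels "A" and "B", and the section difference values are hypothetical.

```python
def discrimination_difference(section_differences):
    """delta_ave: the mean of the section differences delta_1..delta_M."""
    if not section_differences:
        raise ValueError("at least one section difference is required")
    return sum(section_differences) / len(section_differences)


def pick_speaker(section_differences_by_person):
    """Return the person with the smallest delta_ave, plus every delta_ave."""
    averages = {
        person: discrimination_difference(deltas)
        for person, deltas in section_differences_by_person.items()
    }
    speaker = min(averages, key=averages.get)
    return speaker, averages


# Hypothetical section differences for two people visible in the video.
speaker, averages = pick_speaker({
    "A": [0.25, 0.5, 0.75],
    "B": [0.125, 0.25, 0.375],
})
```

Here person B is selected because the lip movement predicted from the voice matches B's observed lip movement more closely on average.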

According to the method described above, the speaker can be discriminated by processing the acquired utterance video even with only one omnidirectional camera and one microphone, so the number of devices does not need to be increased with the number of people, which is convenient.
In addition, because the lip behavior feature amount is calculated from the voice feature amount and the speaker is discriminated by taking its difference from the video-based lip behavior feature amount, the speaker can be discriminated appropriately even when the lips of several people are moving at the same time.

{Speaker discrimination program and speaker discrimination device}
FIG. 13 is a diagram conceptually showing the configuration of a speaker discrimination device 50 according to one embodiment, which performs the specific calculations of the speaker discrimination method S1 described above. The speaker discrimination device 50 includes an input device 57, an arithmetic unit 51, and a display means 58. The arithmetic unit 51 includes a calculation means 52, a RAM 53, a storage means 54, a receiving means 55, and an output means 56.

The calculation means 52 is constituted by a so-called CPU (central processing unit), and is connected to each of the components described above and can control them. The calculation means 52 also executes the various programs stored in the storage means 54 or the like, which functions as a storage medium, and on that basis performs the calculations that create the data for each process of the speaker discrimination method S1 described above.

The RAM 53 is a component that functions as a work area for the calculation means 52 and as a temporary data storage means. The RAM 53 can be composed of SRAM, DRAM, flash memory, or the like, and is the same as a known RAM.

The storage means 54 is a member that functions as a storage medium in which the programs and data underlying the various calculations are stored. The storage means 54 may also be able to store the various intermediate and final results obtained by executing the programs. More specifically, the program is stored in the storage means 54, and other information may be stored together with it.

The stored programs include the program underlying the calculation of each process of the speaker discrimination method S1 described above. That is, the speaker discrimination program includes steps that correspond to the processes of the speaker discrimination method S1 shown in FIG. 1 (including the processes shown in FIGS. 3 and 11), with each process replaced by a step. The specific calculations of this program are as explained for the speaker discrimination method S1 above.
The storage means 54 may also store a database based on the training result of the neural network, which is the basis for calculating the second lip behavior feature amount from the voice feature amount. In that case, the program proceeds while referring to this database as needed.

The receiving means 55 is a component that has the function of appropriately taking information from the outside into the arithmetic unit 51, and the input device 57 is connected to it. So-called input ports, input connectors, and the like are also included.

The output means 56 is a component that has the function of appropriately outputting to the outside the information, among the obtained results, that should be output externally; the display means 58, such as a monitor, and various other devices are connected to it. So-called output ports, output connectors, and the like are also included.

The input device 57 is, for example, a device that acquires the video and voice of the speaker. Typical devices are a microphone, a camera, or a video camera with a microphone. However, the input device is not limited to these and may be another type of device that acquires the video and voice of the speaker. The information input from this device is taken into the arithmetic unit 51, and the program is executed using this information.

In addition, information may be provided to the arithmetic unit 51 through the receiving means 55 via a network or other communication. Similarly, information may be transmitted to an external device through the output means 56 via a network or other communication.

With such a speaker discrimination device 50, the speaker discrimination method S1 described above can be carried out efficiently and accurately. A computer, for example, can be used as the speaker discrimination device 50.

{Speaker discrimination test}
The inventor conducted a test in which speakers were actually discriminated. The conditions and test method are shown below.
- Camera: omnidirectional camera, THITA V, manufactured by RICOH
- Microphone: TA-1, manufactured by RICOH
- Lighting: fluorescent lamps, illuminance 700 lx to 900 lx
- Discrimination target persons: 2 (A and B)
- Arrangement of the discrimination target persons: 50 cm away from the camera, facing the camera

Based on the above conditions, the test was conducted as follows.
- The two discrimination target persons (A and B) each read the same sentences aloud separately, and this was recorded with the above camera and microphone.
- The sentences read aloud were 11 excerpts from news articles. A total of 22 video and voice data items were therefore obtained.
- Of these 22 data items, 20 were used as teacher data for training the neural network and the remaining 2 were used as test data. The same procedure was carried out for every different split obtained by changing the combination of teacher data and test data, giving 231 patterns in total.

With the above preparations, the test was conducted as follows. FIG. 14 is a diagram for explanation.
(1) From the 22 data items (discrimination target person A: 11 data items; discrimination target person B: 11 data items), any 2 data items are selected as test data and the remaining 20 as teacher data.
(2) The neural network is trained using the teacher data (20 data items) to construct a trained neural network.
(3) Using this trained neural network, speaker discrimination is performed on the test data. As shown in FIG. 14, speaker discrimination is performed for each of the two test data items (test data 1 in FIG. 14(a) and test data 2 in FIG. 14(b)).
That is, for test data 1, as shown in FIG. 14(a), the second lip behavior feature amount is obtained from the voice data of test data 1. It is compared with the first lip behavior feature amount obtained from the video of test data 1 and with the first lip behavior feature amount obtained from the video of test data 2, and δ_ave is calculated for each. In this case, the discrimination is successful when the δ_ave computed against the video of test data 1 is the smaller of the two.
Similarly, for test data 2, as shown in FIG. 14(b), the second lip behavior feature amount is obtained from the voice data of test data 2. It is compared with the first lip behavior feature amount obtained from the video of test data 1 and with the first lip behavior feature amount obtained from the video of test data 2, and δ_ave is calculated for each. In this case, the discrimination is successful when the δ_ave computed against the video of test data 2 is the smaller of the two.
(4) The combination of test data and teacher data was changed and steps (1) to (3) above were repeated for all 231 patterns.
(5) Using the 462 discrimination results (231 × 2) obtained in (1) to (4) above, the discrimination success rate and the average value of δ_ave were calculated.
In this example, the test was carried out both with and without normalization of the first lip behavior feature amount and the voice feature amount to the range of 0.0 to 1.0 inclusive.
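The counts in steps (1), (4), and (5) follow from choosing 2 test items out of 22, which the sketch below enumerates. The data identifiers such as "A1" and the function name success_rate are hypothetical, and the training and discrimination steps themselves are omitted.

```python
from itertools import combinations
from math import comb

# 11 recordings each for discrimination target persons A and B.
data_ids = [f"{person}{n}" for person in "AB" for n in range(1, 12)]

# Every way of holding out 2 items as test data; the other 20 are teacher data.
splits = []
for test_pair in combinations(data_ids, 2):
    teacher = [d for d in data_ids if d not in test_pair]
    splits.append((test_pair, teacher))

# Each split produces two discriminations (test data 1 and test data 2).
total_discriminations = 2 * len(splits)


def success_rate(successes, total):
    """Successful discriminations as a percentage of all discriminations."""
    return 100.0 * successes / total
```

Because every pair of the 22 items is used once, the 231 patterns include splits where both test items come from the same person as well as splits with one item from each person.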

As will be shown later, when the above test was performed while varying the number of training iterations of the neural network, a discrimination success rate of 79.2% to 83.8% was obtained with normalization. Without normalization, a discrimination success rate of 78.1% to 82.5% was still obtained. A high discrimination success rate can therefore be obtained with or without normalization, although normalization can raise the success rate further. The discrimination success rate is the ratio of the number of successful discriminations to the total number of discriminations, expressed as a percentage.

As noted above, the inventor also investigated in this test the relationship between the number of training iterations of the neural network and the discrimination success rate. That is, the number of training repetitions (number of training iterations) was varied and its relationship with the success rate of speaker discrimination was examined. The results are shown in Table 1. These are the results of varying the number of training iterations from 500 to 10000 in steps of 500. The test method is the same as above.
"δ_ave" in Table 1 is the value of the discrimination difference.

As can be seen from Table 1, the speaker can be discriminated with a high probability in every case.

50 Speaker discrimination device

Claims

A speaker discrimination method for discriminating a speaker from video and voice data, the method having:
a process of obtaining, from the acquired video, a plurality of first lip behavior feature amounts in time series based on the distance between the upper lip and the lower lip of a target person;
a process of obtaining a plurality of voice feature amounts in time series based on the acquired voice data; and
a process of discriminating the speaker,
wherein the process of discriminating the speaker includes:
a process of obtaining a plurality of second lip behavior feature amounts in time series from the plurality of voice feature amounts;
a process of obtaining a discrimination difference, which is the difference between the first lip behavior feature amounts and the second lip behavior feature amounts; and
a process of taking, as the speaker, the target person with the smallest discrimination difference among the target persons included in the video.

The speaker discrimination method according to claim 1, wherein the first lip behavior feature amount is obtained by the ratio of the distance between the upper lip and the lower lip and the distance between two points on the nose bridge of the subject.

The speaker discrimination method according to claim 1 or 2, wherein the process of discriminating the speaker includes:
a process of creating a plurality of sections that have different start times and a predetermined time range; and
a process of obtaining, for each of the plurality of sections, the section difference between the first lip behavior feature amounts and the second lip behavior feature amounts, and taking the average of the section differences of the sections as the discrimination difference.

The speaker discrimination method according to claim 3, wherein, in the plurality of sections, the start times are determined so that adjacent sections partly overlap in time.

The speaker discrimination method according to any one of claims 1 to 4, wherein the plurality of first lip behavior feature amounts and the plurality of voice feature amounts are normalized to the range of 0.0 to 1.0 inclusive.

A speaker discrimination program for discriminating a speaker from video and voice data, the program having:
a step of obtaining, from the acquired video, a plurality of first lip behavior feature amounts in time series based on the distance between the upper lip and the lower lip of a target person;
a step of obtaining a plurality of voice feature amounts in time series based on the acquired voice data; and
a step of discriminating the speaker,
wherein the step of discriminating the speaker includes:
a step of obtaining a plurality of second lip behavior feature amounts in time series from the plurality of voice feature amounts;
a step of obtaining a discrimination difference, which is the difference between the first lip behavior feature amounts and the second lip behavior feature amounts; and
a step of taking, as the speaker, the target person with the smallest discrimination difference among the target persons included in the video.

The speaker discrimination program according to claim 6, wherein the first lip behavior feature amount is obtained by the ratio of the distance between the upper lip and the lower lip and the distance between two points on the nose bridge of the subject.

The speaker discrimination program according to claim 6 or 7, wherein the step of discriminating the speaker includes:
a step of creating a plurality of sections that have different start times and a predetermined time range; and
a step of obtaining, for each of the plurality of sections, the section difference between the first lip behavior feature amounts and the second lip behavior feature amounts, and taking the average of the section differences of the sections as the discrimination difference.

The speaker discrimination program according to claim 8, wherein, in the plurality of sections, the start times are determined so that adjacent sections partly overlap in time.

The speaker discrimination program according to any one of claims 6 to 9, wherein the plurality of first lip behavior features and the plurality of voice features are normalized in the range of 0.0 or more and 1.0 or less.

A speaker discrimination device for discriminating a speaker from video and voice data, the device having:
a camera that acquires the video;
a microphone that acquires the voice data;
a storage means in which the speaker discrimination program according to any one of claims 6 to 10 is stored; and a calculation means that performs calculations based on the speaker discrimination program,
wherein the calculation means acquires the video acquired by the camera and the voice data acquired by the microphone, and discriminates the speaker by performing the calculations of the speaker discrimination program using the acquired video and voice data.