JP2022063080A - Computer and voice processing method

Info

Publication number: JP2022063080A
Application number: JP2020171420A
Authority: JP
Inventors: 光一郎伊藤; Koichiro Ito
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-10-09
Filing date: 2020-10-09
Publication date: 2022-04-21

Abstract

To extract voice of a speaker included in sound while suppressing the influence of noise by using the sound and an image. SOLUTION: A computer stores input sound acquired from a sound collection device and an input image acquired from an imaging device in a storage device, calculates a speaker characteristic amount showing characteristics of voice of a target person by using the input sound, extracts face area images including the face of the target person by using the input image, specifies an utterance period estimated to be a period in which the target person utters by using a plurality of face area images, and extracts estimated voice of the target person in the utterance period from the input sound by using the speaker characteristic amount and the utterance period. SELECTED DRAWING: Figure 3

Description

The present invention relates to a voice processing technique that uses input sound and images to extract a speaker's voice contained in the sound while suppressing the influence of noise.

In voice recognition, which transcribes voice data collected with a microphone or the like, it is desirable to remove as much of the noise component as possible from the input voice in order to improve accuracy. The technique described in Patent Document 1 is known to address this.

Patent Document 1 describes a control program that identifies the open or closed state of a speaker's mouth from images acquired with a camera, treats sound captured while the mouth is open (the mouth opening/closing period) as signal sound containing the speaker's voice, and treats sound captured while the mouth is closed as noise. Distinguishing the speaker's voice from noise by using images achieves highly accurate voice recognition.

Patent Document 1: Japanese Unexamined Patent Publication No. 2019-8134

However, the sound captured while the speaker's mouth is open may also contain noise from sources other than the speaker. In that case, it is difficult to extract the speaker's voice, free of noise, from the sound captured while the mouth is open.

The present invention has been made to solve the above problem and provides a voice processing system that accurately extracts a speaker's voice from the sound captured during the opening/closing period of the speaker's mouth.

A representative example of the invention disclosed in the present application is as follows. A computer that extracts the voice of a target person from input sound collected by a sound collection device comprises an arithmetic device, a storage device connected to the arithmetic device, and a connection interface connected to the arithmetic device. The computer is connected, via the connection interface, to the sound collection device and to an imaging device that acquires images of the target person. The arithmetic device stores the input sound and the images in the storage device; calculates, using the input sound, a speaker feature amount indicating characteristics of the target person's voice; extracts, using the images, face area images including the target person's face; specifies, using a plurality of the face area images, an utterance period in which the target person is estimated to have been speaking; extracts, using the speaker feature amount and the utterance period, the estimated voice of the target person in the utterance period from the input sound; and stores the extracted estimated voice of the target person in the storage device.

According to the present invention, the speaker's voice can be accurately extracted from the sound captured during the opening/closing period of the speaker's mouth. Problems, configurations, and effects other than those described above will become clear from the description of the following examples.

A diagram showing an example of the configuration of the voice processing system of Example 1.

A diagram showing an example of the functional configuration of the server of Example 1.

A diagram showing an example of the functional configuration of the server of Example 1.

A diagram showing an example of the detailed configuration of the speech enhancer of Example 1.

A diagram showing an image of the time-series data of composite feature amounts output by the feature combination unit of Example 1.

A flowchart explaining an example of the learning processing executed by the voice processing system of Example 1.

A flowchart explaining an example of the estimation processing executed by the voice processing system of Example 1.

Examples of the present invention are described below with reference to the drawings. However, the present invention is not to be construed as limited to the description of the examples below. Those skilled in the art will readily understand that the specific configuration can be changed without departing from the idea or spirit of the present invention.

In the configurations of the invention described below, identical or similar configurations or functions are given the same reference numerals, and duplicate description is omitted.

Notations such as "first", "second", and "third" in this specification are used to identify components and do not necessarily limit their number or order.

To facilitate understanding of the invention, the position, size, shape, range, and the like of each component shown in the drawings may not represent the actual position, size, shape, range, and the like. The present invention is therefore not limited to the position, size, shape, range, and the like disclosed in the drawings.

FIG. 1 is a diagram showing an example of the configuration of the voice processing system 100 of Example 1.

The voice processing system 100 includes a server 101, an imaging device 102, and a sound collection device 103.

The server 101, the imaging device 102, and the sound collection device 103 are connected to one another directly or via a network. The network is, for example, a WAN (Wide Area Network) or a LAN (Local Area Network).

The server 101 may incorporate the imaging device 102 and the sound collection device 103. The configuration of the voice processing system 100 shown in FIG. 1 is an example, and the system is not limited to it.

The voice processing system 100 acquires sound and images from the space in which the speaker 110 is present and extracts the voice of the speaker 110 from the acquired sound. In addition to the speaker 110, the space contains a noise source 111 that emits sound (noise) different from the voice of the speaker 110. The sound emitted from the noise source 111 is, for example, a car engine sound, environmental sound, or the voice of a person other than the speaker 110.

The imaging device 102 is a device that acquires images, for example a camera or a depth measuring instrument. The imaging device 102 of Example 1 acquires a face area image 112 that includes the mouth area of the speaker 110. The face area image 112 is, for example, an RGB image or a depth map. The voice processing system 100 may have a plurality of imaging devices 102 that acquire face area images 112 with different characteristics.

The sound collection device 103 is a device that collects sound in the space where it is installed, for example a monaural microphone or a microphone array. The sound collection device 103 of Example 1 collects the sound (mixed sound) emitted from the speaker 110 and the noise source 111.

The server 101 uses the face area images 112 and the mixed sound to extract only the voice of the speaker 110 from the collected sound. In this specification, the function of extracting the voice of the speaker 110 from the mixed sound is referred to as the speech enhancement function.

The server 101 has a processor 120, a storage device 121, and a connection interface 122. These hardware components are connected to one another via an internal bus 123.

The processor 120 is an arithmetic device such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit) and executes programs stored in the storage device 121. By executing processing according to a program, the processor 120 operates as a module that realizes the speech enhancement function. In the following description, when processing is described with a module as the subject, this means that the processor 120 is executing the program that realizes that module.

The storage device 121 stores the programs executed by the processor 120 and the information used by those programs. The storage device 121 is also used as a work area for the processor 120. The storage device 121 may be a non-transitory storage device or a transitory storage device. The storage device 121 is, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), or a flash memory.

The storage device 121 of Example 1 stores programs that realize an estimation unit 130 and a learning unit 131. The learning unit 131 uses machine learning and deep learning to train the speech enhancer 202 (see FIG. 2) incorporated in the estimation unit 130. The estimation unit 130 is the functional unit that realizes the speech enhancement function. Using the speech enhancer 202 trained by the learning unit 131, the estimation unit 130 extracts the voice of the target person (the speaker) from the input sound.

The connection interface 122 is an interface for connecting to external devices such as the imaging device 102 and the sound collection device 103. The connection interface 122 is, for example, a network interface or an IO interface.

FIGS. 2A and 2B are diagrams showing an example of the functional configuration of the server 101 of Example 1.

FIG. 2A shows the functional configuration of the estimation unit 130 in detail. The estimation unit 130 accepts continuous image data 211 and sound data 212 as inputs and outputs sound data 213, which contains only the estimated voice of the speaker 110, and an estimated speaker ID 214, which is identification information of the speaker 110. The continuous image data 211 is time-series data composed of a plurality of face area images 112 acquired by the imaging device 102 at regular intervals. The sound data 212 is data on the sound collected by the sound collection device 103.

The sampling rate of the face area images 112 included in the continuous image data 211 differs from the sampling rate of the sound data 212, but the two are assumed to be synchronized in time.

The continuous image data 211 and the sound data 212 need not be separate data. For example, they may be video data containing both images and sound.

The estimation unit 130 includes an image preprocessing unit 200, a sound preprocessing unit 201, and a speech enhancer 202.

The image preprocessing unit 200 extracts, from each face area image 112 included in the continuous image data 211, an image containing the mouth area or face area of the speaker 110. The image preprocessing unit 200 outputs the time-series data of the extracted images to the speech enhancer 202. The image preprocessing unit 200 may normalize the pixel values of the extracted images.

The sound preprocessing unit 201 calculates sound feature amounts such as a voice spectrum or a mel-cepstrum by executing arithmetic processing such as a short-time Fourier transform on the sound data 212. Here, a sound feature amount is calculated for each sampling interval. The sound preprocessing unit 201 outputs sound feature amount data containing the sound feature amount of each sampling interval to the speech enhancer 202. The sound preprocessing unit 201 may normalize the sound feature amounts.
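
As a rough illustration of the preprocessing described above, the following sketch computes a magnitude spectrogram with a short-time Fourier transform in plain Python. The function name `stft_magnitude`, the Hann window, and the frame and hop lengths are assumptions made for this example; the source does not specify them.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Return one magnitude spectrum per frame (hypothetical helper).

    Each frame is multiplied by a Hann window and transformed with a
    direct (slow) discrete Fourier transform; only the non-negative
    frequency bins 0 .. frame_len // 2 are kept.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

Each inner list is the sound feature amount for one frame. A practical system would more likely use an FFT routine from a numerical library; the direct transform is used here only to keep the sketch self-contained.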

The speech enhancer 202 outputs the sound data 213 and the estimated speaker ID 214 by using the time-series data of images output from the image preprocessing unit 200 and the sound feature amount data output from the sound preprocessing unit 201. The speech enhancer 202 may output only the sound data 213 as its final output. The speech enhancer 202 is a model defined by a plurality of learnable parameters, and model information (not shown) holding those parameters is stored in the storage device 121.

FIG. 2B shows the functional configuration of the learning unit 131 in detail. The learning unit 131 accepts a sample 220 included in the training data as input and learns the parameters that define the speech enhancer 202.

The sample 220 includes continuous image data 221, sound data 223, and a speaker ID 224.

The continuous image data 221 is time-series data composed of images 222 taken at regular intervals. Each image 222 contains the speaker's face area or mouth area. The images 222 may be acquired from the speaker in advance or acquired by the imaging device 102 while the voice processing system 100 is in operation.

The sound data 223 is data containing the speaker's voice. The sound data 223 may contain only the speaker's voice or may contain the speaker's voice together with other sounds. The sound data 223 may be data of the speaker's voice collected in a low-noise environment or data of single-channel voice collected with a monaural microphone.

The sampling rate of the images 222 included in the continuous image data 221 differs from the sampling rate of the sound data 223, but the two are assumed to be synchronized in time.

The continuous image data 221 and the sound data 223 need not be separate data. For example, they may be video data containing both images and sound.

The speaker ID 224 is identification information of the speaker. In Example 1, a number is assigned as the speaker ID 224, with a different number assigned to each speaker.

The training data contains a plurality of samples 220 with the data structure described above. For example, the training data contains samples 220 numbering in the thousands or tens of thousands.

The learning unit 131 includes the image preprocessing unit 200, the sound preprocessing unit 201, the speech enhancer 202, a mixed voice generation unit 203, a voice error calculation unit 204, a voice error reflection unit 205, a speaker error calculation unit 206, and a speaker error reflection unit 207.

The image preprocessing unit 200, the sound preprocessing unit 201, and the speech enhancer 202 are the same as the modules included in the estimation unit 130.

The mixed voice generation unit 203 generates sound data 227 of a mixed voice by using the sound data 223 of the speaker's voice and sound data 225 of an interfering voice. For example, the mixed voice generation unit 203 generates the mixed voice by adding, or weighting and adding, the speaker's voice and the interfering voice.

The mixed voice generation unit 203 randomly selects one sample 220 from among the samples 220 whose speaker ID 224 differs from the speaker ID 224 of the input sample 220, and uses the sound data 223 included in the selected sample 220 as the sound data 225 of the interfering voice. Two or more samples 220 may be selected, in which case the mixed voice contains the voices of several people. Sound data of an interfering sound may also be input as data separate from the training data, for example sound data containing environmental sound.
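
The mixing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: samples are represented as dictionaries with hypothetical keys `speaker_id` and `sound`, waveforms are plain lists of floats, and the function names and the `interference_weight` parameter are invented for the example.

```python
import random


def pick_interference(samples, target_speaker_id, rng=random):
    """Randomly pick one sample whose speaker ID differs from the target's."""
    candidates = [s for s in samples if s["speaker_id"] != target_speaker_id]
    if not candidates:
        raise ValueError("no sample from a different speaker is available")
    return rng.choice(candidates)


def mix_waveforms(target, interference, interference_weight=1.0):
    """Weighted sample-wise addition, truncated to the shorter waveform."""
    n = min(len(target), len(interference))
    return [target[i] + interference_weight * interference[i]
            for i in range(n)]
```

Setting `interference_weight` to 1.0 corresponds to plain addition; other values correspond to weighted addition. Calling `pick_interference` several times and summing the results would give a mixture containing several interfering voices.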

In speech enhancement, sound containing the voice of the speaker 110 and various kinds of noise is collected, and the voice of the speaker 110 must be extracted from that sound. To improve the accuracy of extracting the voice of the speaker 110, a mixed voice is therefore generated and used to train the speech enhancer 202. Generating a mixed voice when training a speech enhancer is a known technique.

The voice error calculation unit 204 calculates the error between the sound data 223 and the sound data 213. When the sound data 213 contains a voice waveform, the voice error calculation unit 204 calculates the error between the two voice waveforms based on a known error measure such as a squared error or a norm error. When the sound data 213 contains a voice spectrum, the voice error calculation unit 204 converts the voice contained in the sound data 223 into a voice spectrum and calculates the error between the two voice spectra based on a known error measure such as a squared error.

The voice error reflection unit 205 uses a known technique such as error backpropagation to update the parameters of the speech enhancer 202 so that the error calculated by the voice error calculation unit 204 becomes smaller.

The speaker error calculation unit 206 calculates the error between the speaker ID 224 and the estimated speaker ID 214 based on a known error measure such as a cross-entropy error.

The speaker error reflection unit 207 uses a known technique such as error backpropagation to update the parameters of the speech enhancer 202 so that the error calculated by the speaker error calculation unit 206 becomes smaller.
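
The two error measures named above can be sketched in plain Python. The function names are hypothetical, and two details are assumptions of this example rather than statements from the source: the squared error is averaged over samples, and the speaker estimate is represented as a list of unnormalized per-speaker scores indexed by speaker ID.

```python
import math


def mean_squared_error(reference, estimate):
    """Mean of squared sample-wise differences between two sequences."""
    if len(reference) != len(estimate):
        raise ValueError("sequences must have the same length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def cross_entropy(scores, true_index):
    """Cross-entropy of softmax(scores) against the true speaker index.

    The maximum score is subtracted before exponentiation so that the
    log-sum-exp stays numerically stable.
    """
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[true_index]
```

In a neural-network framework, both values would be computed on tensors so that error backpropagation can produce the parameter updates; the updates themselves are outside the scope of this sketch.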

FIG. 3 is a diagram showing an example of the detailed configuration of the speech enhancer 202 of Example 1.

The speech enhancer 202 includes an image feature extraction unit 300, a sound feature extraction unit 301, a feature combination unit 302, a synchronization estimation unit 303, a speaker voice estimation unit 304, a sound feature conversion unit 305, and a speaker identification unit 306.

The image feature extraction unit 300 extracts an image feature amount (a feature amount, feature representation, or the like) from each image included in the time-series data of images processed by the image preprocessing unit 200. The image feature extraction unit 300 outputs the temporally continuous image feature amounts to the feature combination unit 302. The image feature extraction unit 300 is configured using, for example, a CNN (Convolutional Neural Network). The CNN constituting the image feature extraction unit 300 includes parameters to be learned.

The sound feature extraction unit 301 extracts sound feature amounts (feature amounts, feature representations, or the like) from the sound feature amount data processed by the sound preprocessing unit 201. The sound feature extraction unit 301 outputs data containing the temporally continuous sound feature amounts (time-series data of sound feature amounts) to the feature combination unit 302. The sound feature extraction unit 301 is configured using a CNN, an RNN (Recurrent Neural Network), or the like. The CNN or RNN constituting the sound feature extraction unit 301 includes parameters to be learned.

The feature combination unit 302 generates composite feature amounts by combining the image feature amounts and the sound feature amounts in a time-synchronized form. Specifically, the feature combination unit 302 generates a composite feature amount by combining an image feature amount and a sound feature amount at each predetermined time interval (time step). The feature combination unit 302 outputs composite feature amount data containing the temporally continuous composite feature amounts to the synchronization estimation unit 303 and the sound feature conversion unit 305.

In general, images are sampled more sparsely than audio, so some care is needed to combine image feature amounts and sound feature amounts in a time-synchronized form. In this example, temporal synchronization is achieved by widening the convolution region in the time direction of the CNN that constitutes the sound feature extraction unit 301.
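
The idea of bringing the denser sound feature amounts down to the image rate before combining them can be illustrated as follows. In this sketch a simple average over `ratio` consecutive audio frames stands in for the learned convolution with a wide temporal window; that substitution, the function names, and the `ratio` parameter are assumptions made for illustration only.

```python
def pool_audio_to_video_rate(audio_features, ratio):
    """Average each group of `ratio` consecutive audio feature vectors.

    A stand-in for a learned convolution whose temporal window and
    stride cover `ratio` audio frames per image frame.
    """
    pooled = []
    for start in range(0, len(audio_features) - ratio + 1, ratio):
        group = audio_features[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / ratio
                       for d in range(dim)])
    return pooled


def combine_features(image_features, audio_features, ratio):
    """Concatenate image and pooled audio features at each time step."""
    pooled = pool_audio_to_video_rate(audio_features, ratio)
    steps = min(len(image_features), len(pooled))
    return [image_features[t] + pooled[t] for t in range(steps)]
```

Each element of the returned list is one composite feature vector: the image feature for that time step followed by the audio feature that covers the same time span.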

Based on the time-series data of sound feature amounts extracted by the sound feature extraction unit 301, the sound feature conversion unit 305 calculates an index indicating how much of the speaker's voice is contained in the sound input to the sound preprocessing unit 201, and identifies the utterance situation based on that index. The sound feature conversion unit 305 then converts the input time-series data of sound feature amounts into time-series data of sound feature amounts that reflects the utterance situation. The sound feature conversion unit 305 includes a time-step similarity calculation unit 310, an utterance situation identification unit 311, a weight calculation unit 312, and a weight reflection unit 313.

The time-step similarity calculation unit 310 calculates the similarity of the composite feature amounts between time steps. This similarity is used as the index for identifying the utterance situation. The time-step similarity calculation unit 310 is configured using, for example, a linear layer. The linear layer constituting the time-step similarity calculation unit 310 includes parameters to be learned. When the composite feature amounts are vectors, the time-step similarity calculation unit 310 calculates the inner product of the vectors as the similarity.
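
For vector-valued composite feature amounts, the inner-product similarity between every pair of time steps can be sketched as below. The function names are hypothetical, and the learnable linear layer mentioned above is omitted, so this shows only the similarity computation itself.

```python
def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def time_step_similarity(features):
    """Return the matrix S with S[i][j] = <features[i], features[j]>."""
    return [[inner_product(u, v) for v in features] for u in features]
```

The result is a symmetric matrix with one row and one column per time step; row `i` describes how similar time step `i` is to every other time step.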

The utterance situation identification unit 311 identifies the speaker's utterance situation at each time step based on the similarity between time steps calculated by the time-step similarity calculation unit 310. In Example 1, for each time step, the utterance situation identification unit 311 identifies which of the following cases applies: only the speaker's voice is included (first case), the speaker's voice and noise are both included (second case), or only noise is included (third case). When the second case applies, the utterance situation identification unit 311 may also identify the mixing ratio of the speaker's voice and the noise. The utterance situation identification unit 311 is configured using, for example, a linear layer. The linear layer constituting the utterance situation identification unit 311 includes parameters to be learned.

The weight calculation unit 312 calculates a weight for the sound feature amount of each time step based on the identification result of the utterance situation identification unit 311. The weight calculation unit 312 is configured using, for example, a linear layer. The linear layer constituting the weight calculation unit 312 includes parameters to be learned. In Example 1, the weight calculation unit 312 calculates a large weight for time steps corresponding to the first case, a medium weight for time steps corresponding to the second case, and a small weight for time steps corresponding to the third case.

The weight reflection unit 313 generates time-series data of weighted sound feature amounts by using the weights calculated by the weight calculation unit 312 and the time-series data of sound feature amounts output from the sound feature extraction unit 301. For example, the weight reflection unit 313 calculates the time-series data of weighted sound feature amounts by multiplying the sound feature amount at each time step by the weight of that time step. The weight reflection unit 313 outputs the time-series data of weighted sound feature amounts to the speaker identification unit 306. The composite feature amounts may be weighted instead.
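
The weighting described above can be sketched with a fixed lookup table in place of the learned linear layer. The case labels, the numeric weights 1.0, 0.5, and 0.1, and the function name are all assumptions chosen for illustration; the source states only that the weights are large, medium, and small for the first, second, and third cases.

```python
# Hypothetical weights: large, medium, and small for the three cases.
CASE_WEIGHTS = {
    "speech_only": 1.0,       # first case
    "speech_and_noise": 0.5,  # second case
    "noise_only": 0.1,        # third case
}


def apply_case_weights(sound_features, cases, case_weights=CASE_WEIGHTS):
    """Scale the feature vector of each time step by its case weight."""
    if len(sound_features) != len(cases):
        raise ValueError("one case label is required per time step")
    return [[case_weights[case] * x for x in vec]
            for vec, case in zip(sound_features, cases)]
```

Time steps dominated by noise are attenuated rather than removed, so later stages still receive a sequence of the original length.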

The speaker identification unit 306 identifies the speaker based on the time-series data of weighted sound feature amounts and outputs the estimated speaker ID 214 as the identification result. The speaker identification unit 306 includes a speaker feature extraction unit 320 and a speaker estimation unit 321.

話者特徴抽出部３２０は、重み付き音特徴量の時系列データから話者特徴量３３０を抽出し、話者特徴量３３０を話者推定部３２１及び話者音声推定部３０４に出力する。話者特徴抽出部３２０は、例えば、ＣＮＮ及びＲＮＮ等を用いて構成される。話者特徴抽出部３２０を構成するＣＮＮ及びＲＮＮは学習対象のパラメタを含む。 The speaker feature extraction unit 320 extracts the speaker feature amount 330 from the time-series data of the weighted sound feature amount, and outputs the speaker feature amount 330 to the speaker estimation unit 321 and the speaker voice estimation unit 304. The speaker feature extraction unit 320 is configured by using, for example, CNN, RNN, or the like. The CNN and RNN constituting the speaker feature extraction unit 320 include parameters to be learned.

The time-series data of the weighted sound feature amounts is time-series data in which the sound feature amounts of time steps with a high proportion of the speaker's voice in the sound are emphasized. Therefore, the speaker feature amount 330 extracted from the time-series data of the weighted sound feature amounts is expected to reflect the speaker's voice components better than a speaker feature amount extracted from the unweighted time-series data of the sound feature amounts.

話者推定部３２１は、話者特徴量３３０を用いて話者を推定し、推定結果として推定話者ＩＤ２１４を出力する。話者推定部３２１は、例えば、線形層等から構成される。話者推定部３２１を構成する線形層は学習対象のパラメタを含む。 The speaker estimation unit 321 estimates the speaker using the speaker feature amount 330, and outputs the estimated speaker ID 214 as the estimation result. The speaker estimation unit 321 is composed of, for example, a linear layer or the like. The linear layer constituting the speaker estimation unit 321 includes parameters to be learned.
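As a rough illustration of how a linear layer could map a speaker feature amount to an estimated speaker ID, the sketch below scores each candidate speaker and returns the index of the highest score. The function name `estimate_speaker`, the weight values, and the argmax decision rule are assumptions; the embodiment states only that a linear layer with learned parameters is used.

```python
def estimate_speaker(feature, weight_rows, biases):
    """Apply one linear layer and return the index of the highest-scoring speaker."""
    scores = [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weight_rows, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Two candidate speakers and a two-dimensional speaker feature (hypothetical values).
speaker_id = estimate_speaker([1.0, 0.0], [[0.2, 0.9], [0.8, 0.1]], [0.0, 0.0])
```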

The synchronous estimation unit 303 estimates, based on the time-series data of the image feature amounts included in the time-series data of the composite feature amounts, the period during which the mouth opens and closes as the speaker speaks (the estimated period), and calculates a feature amount relating to the mouth opening/closing motion in the estimated period. The synchronous estimation unit 303 extracts, from the time-series data of the sound feature amounts, the sound feature amounts of the period that matches or is synchronized with the estimated period, and outputs the extracted sound feature amounts to the speaker voice estimation unit 304. The synchronous estimation unit 303 is configured using, for example, an RNN. The RNN constituting the synchronous estimation unit 303 includes parameters to be learned.

The speaker voice estimation unit 304 extracts the estimated voice of the speaker using the sound feature amounts of the estimated period input from the synchronous estimation unit 303 and the speaker feature amount 330 input from the speaker identification unit 306, and outputs sound data 213 as the extraction result. The speaker voice estimation unit 304 is configured using, for example, an RNN and a linear layer. The RNN and the linear layer constituting the speaker voice estimation unit 304 include parameters to be learned.

A conventional speech enhancer includes only an image feature extraction unit, a sound feature extraction unit, a feature coupling unit, a synchronous estimation unit, and a speaker voice estimation unit, and outputs the sound of the period synchronized with the mouth opening/closing motion as the estimated voice of the speaker.

In contrast, the speech enhancer 202 of this embodiment includes a sound feature conversion unit 305 and a speaker identification unit 306 in addition to the image feature extraction unit 300, the sound feature extraction unit 301, the feature coupling unit 302, the synchronous estimation unit 303, and the speaker voice estimation unit 304. The sound feature conversion unit 305 assigns weights so as to emphasize the sound feature amounts of periods (time steps) in which the purity of the speaker's voice is high. This functions as a means of selecting the sections from which sound feature amounts are extracted. By using the time-series data of the weighted sound feature amounts, the speaker identification unit 306 can extract a highly accurate speaker feature amount 330. Because the speaker voice estimation unit 304 can also identify the qualities of the speaker's voice (pitch, speaking rate, timbre, and the like) based on the speaker feature amount 330, it can extract the speaker's voice with the interference sound removed from the sound of the estimated period. In this way, by using the speaker feature amount 330 as a filter, the speaker voice estimation unit 304 can extract the speaker's voice more accurately from the sound of the period synchronized with the mouth opening/closing motion.

したがって、実施例１の音声処理システム１００の推定精度は、従来のシステムの推定精度より向上することが期待できる。 Therefore, the estimation accuracy of the voice processing system 100 of the first embodiment can be expected to be higher than the estimation accuracy of the conventional system.

なお、音声強調器２０２が有する各モジュールについては、複数のモジュールを一つのモジュールにまとめてもよいし、一つのモジュールを機能毎に複数のモジュールに分けてもよい。 For each module included in the speech enhancer 202, a plurality of modules may be combined into one module, or one module may be divided into a plurality of modules for each function.

Next, the processing of the sound feature conversion unit 305 will be described in detail with reference to FIG. 4. FIG. 4 is a diagram showing an image of the time-series data of the composite feature amounts output by the feature coupling unit 302 of the first embodiment.

図４に示す複合特徴量の時系列データ４００の一行目はタイムステップを表し、二行目はタイムステップの音特徴量を表し、三行目はタイムステップの画像特徴量を表す。 The first line of the time-series data 400 of the composite feature amount shown in FIG. 4 represents the time step, the second line represents the sound feature amount of the time step, and the third line represents the image feature amount of the time step.

In FIG. 4, the time steps are given as numerical values from "1" to "8" that indicate the order within the time-series data 400 of the composite feature amounts.

The sound feature amount of each time step is actually given as a vector, but for the sake of explanation FIG. 4 shows the qualitative property that the vector represents. The sound feature amounts of time steps "1" to "3" contain only the speaker's voice. The sound feature amounts of time steps "4" to "6" contain both the speaker's voice and interference sound. The sound feature amounts of time steps "7" and "8" contain only interference sound. Note that a feature amount representing the ratio of the speaker's voice to the interference sound may be used instead.

The image feature amount of each time step is also actually given as a vector, but for simplicity of explanation FIG. 4 shows the qualitative property that the vector represents. The image feature amounts of time steps "1" to "6" indicate that the speaker's mouth is opening and closing with the utterance. The image feature amounts of time steps "7" and "8" indicate that the speaker's mouth is closed because the speaker is not speaking.

図４に示すような音特徴量及び画像特徴量を含む複合特徴量の時系列データ４００について以下のようなケースに分けることができる。 The time-series data 400 of the composite feature amount including the sound feature amount and the image feature amount as shown in FIG. 4 can be divided into the following cases.

タイムステップ「１」、「２」、「３」では、話者の音声のみが存在し、かつ、話者の口の開閉動作が行われている。したがって、タイムステップ「１」、「２」、「３」は、第１ケースに分類される。タイムステップ「４」、「５」、「６」では、話者の音声及び干渉音が存在し、かつ、話者の口の開閉動作が行われている。したがって、タイムステップ「４」、「５」、「６」は、第２ケースに分類される。タイムステップ「７」、「８」では、干渉音のみが存在し、かつ、話者の口は閉じられている。したがって、タイムステップ「７」、「８」は、第３ケースに分類される。 In the time steps "1", "2", and "3", only the voice of the speaker is present, and the speaker's mouth is opened and closed. Therefore, the time steps "1", "2", and "3" are classified into the first case. In the time steps "4", "5", and "6", the speaker's voice and the interference sound are present, and the speaker's mouth is opened and closed. Therefore, the time steps "4", "5", and "6" are classified into the second case. In the time steps "7" and "8", only the interference sound is present and the speaker's mouth is closed. Therefore, the time steps "7" and "8" are classified into the third case.
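The three cases of the FIG. 4 example can be written out as data. In the sketch below the boolean flags stand in for the qualitative properties shown in FIG. 4; the function name `classify` and the flag representation are assumptions made for illustration, since the actual feature amounts are vectors and the classification is performed by the sound feature conversion unit 305.

```python
def classify(has_voice, has_interference, mouth_moving):
    """Return 1, 2 or 3 for the three cases, or None if none of them applies."""
    if has_voice and not has_interference and mouth_moving:
        return 1
    if has_voice and has_interference and mouth_moving:
        return 2
    if not has_voice and has_interference and not mouth_moving:
        return 3
    return None


# Time steps 1 to 8 of FIG. 4 as (voice, interference, mouth moving).
fig4 = (
    [(True, False, True)] * 3
    + [(True, True, True)] * 3
    + [(False, True, False)] * 2
)
cases = [classify(*step) for step in fig4]
```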

The sound feature conversion unit 305 classifies each time step into one of the first case, the second case, and the third case using the time-series data of the composite feature amounts. Specifically, the sound feature conversion unit 305 classifies the case transitions between time steps based on the similarity of the composite feature amounts (sound feature amounts and image feature amounts) between time steps.

タイムステップ間類似度算出部３１０が算出するタイムステップ間の音特徴量及び画像特徴量の類似度は以下のようなものとする。 The similarity between the sound features and the image features calculated by the time step similarity calculation unit 310 is as follows.

第１ケース及び第３ケース間の遷移の場合、各タイムステップの音は異なるため、音特徴量の類似度は低い。第１ケース及び第２ケース間の遷移の場合、いずれのケースも話者の音声を含むが、第２ケースでは干渉音も含まれため、音特徴量の類似度は中程度となる。第２ケース及び第３ケースの遷移の場合、いずれのケースも干渉音を含むが、第２ケースでは話者の音声も含まれるため、音特徴量の類似度は中程度となる。同一のケースの遷移の場合、音特徴量の類似度は高いものとしている。 In the case of the transition between the first case and the third case, the sound of each time step is different, so that the similarity of the sound features is low. In the case of the transition between the first case and the second case, the voice of the speaker is included in both cases, but the interference sound is also included in the second case, so that the similarity of the sound features is medium. In the case of the transitions of the second case and the third case, the interference sound is included in both cases, but since the speaker's voice is also included in the second case, the similarity of the sound features is medium. In the case of transitions in the same case, the similarity of sound features is assumed to be high.

In a transition between the first case and the second case, the mouth opens and closes in both cases, so the similarity of the image feature amounts is high. In a transition between the first case and the third case, and in a transition between the second case and the third case, the mouth opens and closes in the first and second cases but is closed in the third case, so the similarity of the image feature amounts is low. In a transition within the same case, the similarity of the image feature amounts is assumed to be high.

以上をまとめると以下のような特性になる。第１ケース及び第３ケース間の遷移の場合、音特徴量の類似度及び画像特徴量の類似度はともに低い。第１ケース及び第２ケース間の遷移の場合、音特徴量の類似度は中程度であり、画像特徴量の類似度は高い。第２ケース及び第３ケース間の遷移の場合、音特徴量の類似度は中程度であり、画像特徴量の類似度は低い。 Summarizing the above, the characteristics are as follows. In the case of the transition between the first case and the third case, the similarity of the sound features and the similarity of the image features are both low. In the case of the transition between the first case and the second case, the similarity of the sound features is medium, and the similarity of the image features is high. In the case of the transition between the second case and the third case, the similarity of the sound features is medium, and the similarity of the image features is low.
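The summary above amounts to a small lookup table. The sketch below encodes it with the qualitative levels taken from the text; the names `TRANSITIONS` and `expected_similarity` are hypothetical, and the levels are labels rather than numeric similarity values.

```python
# (sound similarity, image similarity) for a transition between two different cases.
TRANSITIONS = {
    frozenset({1, 3}): ("low", "low"),
    frozenset({1, 2}): ("medium", "high"),
    frozenset({2, 3}): ("medium", "low"),
}


def expected_similarity(case_a, case_b):
    """Return the expected (sound, image) similarity levels for a transition."""
    if case_a == case_b:
        return ("high", "high")  # a transition within the same case
    return TRANSITIONS[frozenset({case_a, case_b})]
```

Using a `frozenset` as the key makes the lookup independent of the direction of the transition.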

発話状況識別部３１１は、前述のようなケース間の遷移における類似度の性質に基づいて、各タイムステップのケースの分類を行う。 The utterance status identification unit 311 classifies the cases of each time step based on the nature of the similarity in the transition between the cases as described above.

なお、以下のような分類方法を採用してもよい。発話状況識別部３１１は、画像特徴量に基づいて、口が閉じているタイムステップを特定し、当該タイムステップを第３ケースに分類する。次に、発話状況識別部３１１は、第３ケースに分類されたタイムステップの音特徴量を基準音特徴量に設定する。なお、第３ケースに分類されるタイムステップが複数存在する場合、各タイムステップの音特徴量の平均値等の統計値を基準音特徴量に設定することが考えられる。次に、発話状況識別部３１１は、未分類のタイムステップの音特徴量と、基準音特徴量との間の類似度を算出する。次に、発話状況識別部３１１は、類似度及び閾値の比較結果に基づいて、未分類のタイムステップを分類する。例えば、発話状況識別部３１１は、類似度が閾値より小さいタイムステップを第１ケースに分類し、類似度が閾値以上のタイムステップを第２ケースに分類する。 The following classification method may be adopted. The utterance status identification unit 311 identifies a time step in which the mouth is closed based on the image feature amount, and classifies the time step into the third case. Next, the utterance status identification unit 311 sets the sound feature amount of the time step classified in the third case as the reference sound feature amount. When there are a plurality of time steps classified into the third case, it is conceivable to set a statistical value such as an average value of the sound features of each time step as the reference sound feature. Next, the utterance status identification unit 311 calculates the degree of similarity between the sound feature amount of the unclassified time step and the reference sound feature amount. Next, the utterance status identification unit 311 classifies the unclassified time step based on the comparison result of the similarity and the threshold value. For example, the utterance status identification unit 311 classifies a time step having a similarity smaller than the threshold value into the first case, and a time step having a similarity equal to or higher than the threshold value into the second case.
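The alternative classification method above can be sketched as follows, under several assumptions that the text leaves open: cosine similarity is used as the similarity measure, the mean is used as the statistic for the reference sound feature amount, and the closed-mouth decision is supplied as a boolean flag per time step. The names `cosine` and `classify_steps` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify_steps(sound, mouth_closed, threshold):
    """Classify each time step into case 1, 2 or 3."""
    closed = [v for v, c in zip(sound, mouth_closed) if c]
    if not closed:
        raise ValueError("no closed-mouth time step to build the reference from")
    # Reference sound feature: mean of the closed-mouth (third case) steps.
    dim = len(closed[0])
    reference = [sum(v[i] for v in closed) / len(closed) for i in range(dim)]
    cases = []
    for v, c in zip(sound, mouth_closed):
        if c:
            cases.append(3)  # mouth closed
        elif cosine(v, reference) < threshold:
            cases.append(1)  # unlike the interference-only reference
        else:
            cases.append(2)  # resembles the reference, so interference is present
    return cases


# Toy features: axis 0 stands for the speaker's voice, axis 1 for interference.
sound = [[1, 0], [1, 0], [1, 1], [1, 1], [0, 1], [0, 1]]
mouth_closed = [False, False, False, False, True, True]
result = classify_steps(sound, mouth_closed, threshold=0.5)
```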

なお、発話状況識別部３１１は、学習可能な線形層を用いて構成してもよい。この場合、発話状況識別部３１１は、複合特徴量を、ケースの識別が容易な特徴量に変換し、当該特徴量に基づいて各タイムステップのケースの分類を行う。例えば、発話状況識別部３１１は、変換後の特徴量の特徴量空間における配置及びノルム距離を用いて、各タイムステップのケースの分類を行う。同一ケースのノルム距離は小さくなり、異なるケースのノルム距離は大きくなる。また、同一ケースの特徴量は、特徴量空間の特定の領域に密集する。 The utterance status identification unit 311 may be configured by using a learnable linear layer. In this case, the utterance status identification unit 311 converts the composite feature amount into a feature amount for which the case can be easily identified, and classifies the cases of each time step based on the feature amount. For example, the utterance status identification unit 311 classifies the cases of each time step by using the arrangement of the converted features in the feature space and the norm distance. The norm distance of the same case is small, and the norm distance of different cases is large. In addition, the features of the same case are concentrated in a specific area of the feature space.

The weight calculation unit 312 calculates a large weight for the sound feature amounts or composite feature amounts of time steps classified into the first case, a medium weight for those of time steps classified into the second case, and a small weight for those of time steps classified into the third case.

なお、図４に示した発話状況のケースの分類は一例であってこれに限定されない。 The classification of the utterance situation cases shown in FIG. 4 is an example and is not limited to this.

The sound feature conversion unit 305 converts the time-series data of the sound feature amounts so that the feature amounts of time steps that include the speaker's voice are emphasized and the feature amounts of time steps that do not include the speaker's voice are suppressed. As a result, the speaker identification unit 306 can extract a speaker feature amount 330 that reflects the speaker's voice well.

Next, the learning process and the estimation process executed by the voice processing system 100 will be described.

図５は、実施例１の音声処理システム１００が実行する学習処理の一例を説明するフローチャートである。 FIG. 5 is a flowchart illustrating an example of learning processing executed by the voice processing system 100 of the first embodiment.

学習部１３１は、実行指示を受信した場合、又は、学習データが入力された場合、以下で説明する学習処理を開始する。 When the learning unit 131 receives the execution instruction or the learning data is input, the learning unit 131 starts the learning process described below.

学習部１３１は、学習データの入力を受け付ける（ステップＳ５０１）。例えば、学習部１３１は、接続インタフェース１２２を介して接続されるユーザ端末から学習データの入力を受け付ける。学習部１３１は、受け付けた学習データを記憶装置１２１に保存する。 The learning unit 131 accepts the input of learning data (step S501). For example, the learning unit 131 accepts input of learning data from a user terminal connected via the connection interface 122. The learning unit 131 stores the received learning data in the storage device 121.

次に、学習部１３１は、学習データから一つのサンプル２２０を読み出す（ステップＳ５０２）。このとき、学習部１３１は、他のサンプル２２０に含まれる音データ２２３を干渉音声の音データ２２５として読み出す。なお、サンプル２２０はランダムに選択されてもよいし、あらかじめ設定されたポリシに基づいて選択されてもよい。 Next, the learning unit 131 reads one sample 220 from the learning data (step S502). At this time, the learning unit 131 reads out the sound data 223 included in the other sample 220 as the sound data 225 of the interference voice. The sample 220 may be randomly selected or may be selected based on a preset policy.

次に、学習部１３１は、連続画像データ２２１及び音データ２２３に対して前処理を実行する（ステップＳ５０３）。 Next, the learning unit 131 executes preprocessing on the continuous image data 221 and the sound data 223 (step S503).

具体的には、画像前処理部２００が連続画像データ２２１に含まれる画像２２２に対して前処理を実行する。また、混合音声生成部２０３は、音データ２２３、２２５を用いて混合音声の音データ２２７を生成し、音前処理部２０１が音データ２２７に対して前処理を実行する。 Specifically, the image preprocessing unit 200 executes preprocessing on the image 222 included in the continuous image data 221. Further, the mixed voice generation unit 203 generates the sound data 227 of the mixed voice using the sound data 223 and 225, and the sound preprocessing unit 201 executes preprocessing on the sound data 227.
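The text does not state how the mixed voice generation unit 203 combines the sound data 223 and 225. One simple possibility, shown here purely as an assumption, is sample-wise addition with a gain applied to the interference; the function name `mix` and the `gain` parameter are hypothetical.

```python
def mix(target, interference, gain=1.0):
    """Add the interference to the target sample by sample, truncating to the shorter one."""
    n = min(len(target), len(interference))
    return [target[i] + gain * interference[i] for i in range(n)]


# Hypothetical waveforms as lists of samples.
mixed = mix([0.5, -0.5, 0.25], [0.25, 0.25], gain=0.5)
```

Lowering `gain` raises the proportion of the target speaker in the mixture, which is one way to vary the difficulty of the training samples.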

次に、学習部１３１は、前処理が実行された連続画像データ２２１及び音データ２２３を用いて、音データ２１３及び推定話者ＩＤ２１４を出力する（ステップＳ５０４）。具体的には、音声強調器２０２によって以下のような処理が実行される。 Next, the learning unit 131 outputs the sound data 213 and the estimated speaker ID 214 using the continuous image data 221 and the sound data 223 on which the preprocessing has been executed (step S504). Specifically, the speech enhancer 202 executes the following processing.

The feature coupling unit 302 generates the time-series data of the composite feature amounts by combining, in a temporally synchronized form, the time-series data of the image feature amounts extracted by the image feature extraction unit 300 and the time-series data of the sound feature amounts extracted by the sound feature extraction unit 301.
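A minimal sketch of this combination, assuming that both series are already aligned to the same time steps and that the combination is a per-step concatenation of the two vectors. The concatenation and the function name `combine` are assumptions; the text specifies only that the result is time-synchronized.

```python
def combine(sound_feats, image_feats):
    """Concatenate the sound and image feature vectors of each time step."""
    if len(sound_feats) != len(image_feats):
        raise ValueError("both series must cover the same time steps")
    return [s + v for s, v in zip(sound_feats, image_feats)]


# Two time steps: two-dimensional sound features, one-dimensional image features.
composite = combine([[0.1, 0.2], [0.3, 0.4]], [[1.0], [0.0]])
```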

音特徴変換部３０５は、複合特徴量の時系列データを用いて、重み付き音特徴量の時系列データを出力する。 The sound feature conversion unit 305 outputs the time-series data of the weighted sound features using the time-series data of the composite features.

話者識別部３０６は、重み付き音特徴量の時系列データから、中間出力として話者特徴量３３０を抽出する。また、話者識別部３０６は、話者特徴量３３０に基づいて、話者を推定し、推定結果として推定話者ＩＤ２１４を出力する。 The speaker identification unit 306 extracts the speaker feature amount 330 as an intermediate output from the time-series data of the weighted sound feature amount. Further, the speaker identification unit 306 estimates the speaker based on the speaker feature amount 330, and outputs the estimated speaker ID 214 as the estimation result.

The synchronous estimation unit 303 extracts the sound feature amounts of the time steps corresponding to the period during which the mouth is opening and closing. Using the speaker feature amount 330, the speaker voice estimation unit 304 extracts the sound data 213 from the sound feature amounts of the arbitrary time steps extracted by the synchronous estimation unit 303, and outputs it.
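The selection performed by the synchronous estimation unit 303 can be pictured as a mask over time steps. In the sketch below the mouth-motion decision is given as a boolean flag per step, which is an assumption made for illustration; in the embodiment the period is estimated by an RNN from the image feature amounts. The function name `select_utterance_steps` is hypothetical.

```python
def select_utterance_steps(sound_feats, mouth_moving):
    """Return (time step, sound feature) pairs for steps where the mouth is moving."""
    return [
        (t, f)
        for t, (f, m) in enumerate(zip(sound_feats, mouth_moving), start=1)
        if m
    ]


# Four time steps; the mouth moves only in steps 1 and 2.
selected = select_utterance_steps(
    [[0.9], [0.8], [0.1], [0.2]], [True, True, False, False]
)
```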

以上が、音声強調器２０２が実行する処理の説明である。 The above is a description of the process executed by the speech enhancer 202.

Next, the learning unit 131 calculates the error between the sound data 213 and the sound data 223 and the error between the estimated speaker ID 214 and the speaker ID 224 (step S505).

Specifically, the voice error calculation unit 204 calculates the error between the sound data 213 and the sound data 223, and the speaker error calculation unit 206 calculates the error between the estimated speaker ID 214 and the speaker ID 224.
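The text does not name the error functions. As an illustration only, the sketch below uses mean squared error for the two sound signals and cross-entropy for the speaker estimate, both common choices; the function names and the representation of the speaker estimate as a probability list are assumptions.

```python
import math


def mse(estimated, reference):
    """Mean squared error between two equal-length sample sequences."""
    return sum((a - b) ** 2 for a, b in zip(estimated, reference)) / len(reference)


def cross_entropy(probabilities, true_id):
    """Negative log-probability assigned to the true speaker ID."""
    return -math.log(probabilities[true_id])


voice_error = mse([1.0, 2.0], [1.0, 4.0])
speaker_error = cross_entropy([0.25, 0.5, 0.25], true_id=1)
```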

次に、学習部１３１は、音声強調器２０２に対して各誤差を反映する（ステップＳ５０６）。 Next, the learning unit 131 reflects each error on the speech enhancer 202 (step S506).

Specifically, the voice error reflection unit 205 updates the parameters to be learned in each module of the speech enhancer 202 based on the error between the sound data 213 and the sound data 223, and the speaker error reflection unit 207 updates the parameters to be learned in each module of the speech enhancer 202 based on the error between the estimated speaker ID 214 and the speaker ID 224.

In the first embodiment, the parameters of the image feature extraction unit 300, the sound feature extraction unit 301, the synchronous estimation unit 303, the speaker voice estimation unit 304, the sound feature conversion unit 305 (the time step similarity calculation unit 310, the utterance status identification unit 311, and the weight calculation unit 312), and the speaker identification unit 306 (the speaker feature extraction unit 320 and the speaker estimation unit 321) are updated.

次に、学習部１３１は、学習を終了するか否かを判定する（ステップＳ５０７）。 Next, the learning unit 131 determines whether or not to end the learning (step S507).

For example, when the decrease in the error becomes smaller than a threshold value and the error cannot be reduced any further, the learning unit 131 ends the learning. Note that the user may decide when to end the learning.
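The end-of-learning check in this example can be sketched as a comparison of the two most recent error values. The function name `should_stop` and the choice to compare only consecutive iterations are assumptions; a practical check might average over several iterations.

```python
def should_stop(errors, min_improvement):
    """Return True when the latest decrease in error is below the threshold."""
    if len(errors) < 2:
        return False  # not enough history to measure a decrease
    return errors[-2] - errors[-1] < min_improvement
```

For instance, `should_stop([1.0, 0.5], 0.1)` is false because the error still fell by 0.5, while `should_stop([0.5, 0.45], 0.1)` is true.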

学習を終了しないと判定された場合、学習部１３１は、ステップＳ５０２に戻り、同様の処理を実行する。 If it is determined that the learning is not completed, the learning unit 131 returns to step S502 and executes the same process.

学習を終了すると判定された場合、学習部１３１は、学習結果を推定部１３０に出力し（ステップＳ５０８）、その後、学習処理を終了する。 When it is determined that the learning is finished, the learning unit 131 outputs the learning result to the estimation unit 130 (step S508), and then ends the learning process.

具体的には、学習部１３１は、音声強調器２０２の各モジュールのパラメタを推定部１３０に出力する。 Specifically, the learning unit 131 outputs the parameters of each module of the speech enhancer 202 to the estimation unit 130.

Through the learning process, the error between the sound data 213 output by the speech enhancer 202 and the sound data 223, and the error between the estimated speaker ID 214 and the speaker ID 224, become smaller. In this way, in the learning process, the parameters of the synchronous estimation unit 303, the speaker voice estimation unit 304, the sound feature conversion unit 305, and the speaker identification unit 306 are updated together.

図６は、実施例１の音声処理システム１００が実行する推定処理の一例を説明するフローチャートである。 FIG. 6 is a flowchart illustrating an example of estimation processing executed by the voice processing system 100 of the first embodiment.

推定部１３０は、実行指示を受信した場合、又は、データが入力された場合、以下で説明する推定処理を開始する。なお、音声処理システム１００は、図１に示すような環境において稼働しているものとする。 When the estimation unit 130 receives the execution instruction or the data is input, the estimation unit 130 starts the estimation process described below. It is assumed that the voice processing system 100 is operating in the environment as shown in FIG.

推定部１３０は、撮像装置１０２から連続画像データ２１１を取得し、また、集音装置１０３から音データ２１２を取得する（ステップＳ６０１）。推定部１３０は、連続画像データ２１１及び音データ２１２を記憶装置１２１に保存する。 The estimation unit 130 acquires continuous image data 211 from the image pickup device 102, and also acquires sound data 212 from the sound collector 103 (step S601). The estimation unit 130 stores the continuous image data 211 and the sound data 212 in the storage device 121.

次に、推定部１３０は、連続画像データ２１１及び音データ２１２に対して前処理を実行する（ステップＳ６０２）。ステップＳ６０２の処理はステップＳ５０３の処理と同様の処理である。 Next, the estimation unit 130 executes preprocessing on the continuous image data 211 and the sound data 212 (step S602). The process of step S602 is the same as the process of step S503.

次に、推定部１３０は、前処理が実行された連続画像データ２１１及び音データ２１２を用いて、音データ２１３及び推定話者ＩＤ２１４を出力し（ステップＳ６０３）、その後、推定処理を終了する。 Next, the estimation unit 130 outputs the sound data 213 and the estimated speaker ID 214 using the continuous image data 211 and the sound data 212 for which the preprocessing has been executed (step S603), and then ends the estimation process.

具体的には、学習結果が反映された音声強調器２０２が音データ２１３及び推定話者ＩＤ２１４を出力する。ステップＳ６０３の処理はステップＳ５０４の処理と同様の処理である。 Specifically, the speech enhancer 202 reflecting the learning result outputs the sound data 213 and the estimated speaker ID 214. The process of step S603 is the same process as the process of step S504.

音データ２１３に含まれる推定音声は、雑音音源１１１の音が抑制された、話者１１０の音声に非常に類似した音声となっている。なお、推定部１３０は、音データ２１３を、公知の音声認識器に入力することによって、文字起こしを行ってもよい。 The estimated voice included in the sound data 213 is a voice very similar to the voice of the speaker 110 in which the sound of the noise sound source 111 is suppressed. The estimation unit 130 may perform transcription by inputting the sound data 213 into a known voice recognizer.

なお、本発明は上記した実施例に限定されるものではなく、様々な変形例が含まれる。また、例えば、上記した実施例は本発明を分かりやすく説明するために構成を詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに限定されるものではない。また、各実施例の構成の一部について、他の構成に追加、削除、置換することが可能である。 The present invention is not limited to the above-described embodiment, and includes various modifications. Further, for example, the above-described embodiment describes the configuration in detail in order to explain the present invention in an easy-to-understand manner, and is not necessarily limited to the one including all the described configurations. Further, it is possible to add, delete, or replace a part of the configuration of each embodiment with other configurations.

また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現してもよい。また、本発明は、実施例の機能を実現するソフトウェアのプログラムコードによっても実現できる。この場合、プログラムコードを記録した記憶媒体をコンピュータに提供し、そのコンピュータが備えるプロセッサが記憶媒体に格納されたプログラムコードを読み出す。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施例の機能を実現することになり、そのプログラムコード自体、及びそれを記憶した記憶媒体は本発明を構成することになる。このようなプログラムコードを供給するための記憶媒体としては、例えば、フレキシブルディスク、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、ハードディスク、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）、光ディスク、光磁気ディスク、ＣＤ－Ｒ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどが用いられる。 Further, each of the above configurations, functions, processing units, processing means and the like may be realized by hardware by designing a part or all of them by, for example, an integrated circuit. The present invention can also be realized by a software program code that realizes the functions of the examples. In this case, a storage medium in which the program code is recorded is provided to the computer, and the processor included in the computer reads out the program code stored in the storage medium. In this case, the program code itself read from the storage medium realizes the function of the above-described embodiment, and the program code itself and the storage medium storing it constitute the present invention. Examples of the storage medium for supplying such a program code include a flexible disk, a CD-ROM, a DVD-ROM, a hard disk, an SSD (Solid State Drive), an optical disk, a magneto-optical disk, a CD-R, and a magnetic tape. Non-volatile memory cards, ROMs, etc. are used.

The program code that implements the functions described in this embodiment can be written in a wide range of programming or scripting languages, such as assembler, C/C++, perl, Shell, PHP, Python, and Java (registered trademark).

さらに、実施例の機能を実現するソフトウェアのプログラムコードを、ネットワークを介して配信することによって、それをコンピュータのハードディスクやメモリ等の記憶手段又はＣＤ－ＲＷ、ＣＤ－Ｒ等の記憶媒体に格納し、コンピュータが備えるプロセッサが当該記憶手段や当該記憶媒体に格納されたプログラムコードを読み出して実行するようにしてもよい。 Further, by distributing the program code of the software that realizes the functions of the embodiment via the network, the program code is stored in a storage means such as a hard disk or a memory of a computer or a storage medium such as a CD-RW or a CD-R. The processor included in the computer may read and execute the program code stored in the storage means or the storage medium.

上述の実施例において、制御線や情報線は、説明上必要と考えられるものを示しており、製品上必ずしも全ての制御線や情報線を示しているとは限らない。全ての構成が相互に接続されていてもよい。 In the above-described embodiment, the control lines and information lines show what is considered necessary for explanation, and do not necessarily indicate all the control lines and information lines in the product. All configurations may be interconnected.

100 Voice processing system
101 Server
102 Image pickup device
103 Sound collector
110 Speaker
111 Noise sound source
112 Face area image
120 Processor
121 Storage device
122 Connection interface
123 Internal bus
130 Estimation unit
131 Learning unit
200 Image preprocessing unit
201 Sound preprocessing unit
202 Speech enhancer
203 Mixed voice generation unit
204 Voice error calculation unit
205 Voice error reflection unit
206 Speaker error calculation unit
207 Speaker error reflection unit
211, 221 Continuous image data
212, 213, 223, 225, 227 Sound data
214 Estimated speaker ID
220 Sample
222 Image
224 Speaker ID
300 Image feature extraction unit
301 Sound feature extraction unit
302 Feature coupling unit
303 Synchronous estimation unit
304 Speaker voice estimation unit
305 Sound feature conversion unit
306 Speaker identification unit
310 Time step similarity calculation unit
311 Utterance status identification unit
312 Weight calculation unit
313 Weight reflection unit
320 Speaker feature extraction unit
321 Speaker estimation unit
330 Speaker feature amount

Claims

A computer that extracts the voice of a target person included in sound collected by a sound collection device, the computer comprising:
an arithmetic unit, a storage device connected to the arithmetic unit, and a connection interface connected to the arithmetic unit,
wherein the sound collection device and an imaging device that acquires an image of the target person are connected via the connection interface, and
wherein the arithmetic unit:
stores, in the storage device, an input sound acquired from the sound collection device and an input image acquired from the imaging device;
calculates, using the input sound, a speaker feature amount indicating characteristics of the voice of the target person;
extracts, using the input image, a face area image including the face of the target person;
identifies, using a plurality of the face area images, an utterance period estimated to be a period in which the target person speaks; and
extracts, using the speaker feature amount and the utterance period, an estimated voice of the target person in the utterance period from the input sound, and stores the extracted estimated voice of the target person in the storage device.

The computer according to claim 1,
wherein the arithmetic unit:
identifies, using the input sound and a plurality of the input images, the utterance status of the target person at each time step of the input sound;
converts the input sound, based on the utterance status of the target person at each time step, so as to emphasize the time steps that include the voice of the target person; and
calculates the speaker feature amount using the converted input sound.

The computer according to claim 2,
wherein the arithmetic unit:
generates, from the input sound, first time-series data consisting of a sound feature amount for each time step;
generates, from the plurality of face area images, second time-series data consisting of an image feature amount for each time step;
calculates, using the first time-series data and the second time-series data, an index indicating the degree to which the input sound contains the voice of the target person;
identifies the utterance status of the target person at each time step based on the index;
calculates a weight corresponding to the utterance status of the target person at each time step;
converts the first time-series data using the weights; and
calculates the speaker feature amount using the converted first time-series data.
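
The weighting described in the preceding claim can be sketched as follows. This is a minimal sketch, not the model of the embodiments: the function names, the sample data, and the use of a weighted mean over time as the speaker feature amount are all assumptions made for illustration.

```python
from typing import List, Sequence


def apply_weights(sound_features: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[List[float]]:
    """Scale each time step's sound feature vector by its weight."""
    if len(sound_features) != len(weights):
        raise ValueError("one weight per time step is required")
    return [[w * x for x in frame] for frame, w in zip(sound_features, weights)]


def pooled_speaker_feature(weighted: Sequence[Sequence[float]],
                           weights: Sequence[float]) -> List[float]:
    """Weighted mean over time (an assumed stand-in for the speaker model)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    dim = len(weighted[0])
    return [sum(frame[d] for frame in weighted) / total for d in range(dim)]


# Hypothetical data: 3 time steps, 2-dimensional sound feature amounts.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Hypothetical weights: the target person speaks at time steps 0 and 2 only.
weights = [1.0, 0.0, 1.0]

converted = apply_weights(features, weights)
speaker_feature = pooled_speaker_feature(converted, weights)
```

With these weights, the time step judged not to contain the target person's voice contributes nothing to the pooled feature.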

The computer according to claim 3,
wherein the arithmetic unit:
calculates, as the index, the similarity of the sound feature amounts between time steps and the similarity of the image feature amounts between time steps; and
identifies the utterance status of the target person at each time step based on the similarity of the image feature amounts between time steps and the similarity of the sound feature amounts between time steps.
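
One way to picture the inter-time-step similarity is a pairwise similarity matrix computed separately for the sound and image time series. The sketch below assumes cosine similarity and all-pairs comparison; the claim does not fix the similarity measure, and the function names and sample data are hypothetical. How the two matrices are turned into an utterance status is left out here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def similarity_matrix(series: Sequence[Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] is the similarity between time step i and time step j."""
    return [[cosine(a, b) for b in series] for a in series]


# Hypothetical feature amounts for 3 time steps.
sound_series = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
image_series = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]

sound_similarity = similarity_matrix(sound_series)
image_similarity = similarity_matrix(image_series)
```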

The computer according to claim 3,
wherein the arithmetic unit:
identifies, based on the image feature amounts, a reference time step in which the mouth of the target person is closed;
calculates a reference sound feature amount based on the sound feature amount corresponding to the reference time step;
calculates, as the index, the similarity between the reference sound feature amount and the sound feature amount of each time step; and
identifies the utterance status of the target person at each time step based on the similarity between the reference sound feature amount and the sound feature amount of each time step.
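
The reference-based index can be sketched as below. The sketch assumes that the closed-mouth time steps are already given as flags, that the reference sound feature amount is their mean, that similarity is cosine similarity, and that a time step counts as "speaking" when its similarity to the reference falls below a threshold. None of these choices is stated in the claim; they are illustrative assumptions, as are all names and values.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def reference_sound_feature(sound: Sequence[Sequence[float]],
                            mouth_closed: Sequence[bool]) -> List[float]:
    """Mean sound feature over the time steps where the mouth is closed."""
    frames = [f for f, closed in zip(sound, mouth_closed) if closed]
    if not frames:
        raise ValueError("no closed-mouth time step found")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def utterance_status(sound: Sequence[Sequence[float]],
                     mouth_closed: Sequence[bool],
                     threshold: float = 0.8) -> Tuple[List[float], List[bool]]:
    """Return per-step similarity to the reference and an assumed status flag."""
    ref = reference_sound_feature(sound, mouth_closed)
    sims = [cosine(ref, f) for f in sound]
    # Assumption: low similarity to the closed-mouth reference means speech.
    return sims, [s < threshold for s in sims]


# Hypothetical data: the mouth is closed at time steps 0 and 1.
sound_series = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
closed_flags = [True, True, False]
similarities, speaking = utterance_status(sound_series, closed_flags)
```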

The computer according to claim 3,
wherein the storage device stores information defining a first model that identifies, from an input sound, the speaker who is speaking, a second model that identifies the utterance status at each time step, a third model that identifies the utterance period, and a fourth model that extracts the voice of the target person, and
wherein the arithmetic unit:
calculates the converted input sound by inputting the first time-series data and the second time-series data into the second model;
extracts the speaker feature amount of the target person from the first model to which the converted input sound has been input;
calculates the utterance period by inputting the first time-series data and the second time-series data into the third model; and
extracts the voice of the target person from the input sound by inputting the speaker feature amount and the utterance period into the fourth model.
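
The data flow among the four models in the preceding claim can be sketched with plain callables. This is only a wiring sketch: the function `extract_target_voice`, the stub models, and the sample data are hypothetical, and the stubs are trivial placeholders rather than the learned models of the embodiments.

```python
from typing import Callable, List, Sequence

Series = Sequence[Sequence[float]]


def extract_target_voice(input_sound: Sequence[float],
                         sound_series: Series,
                         image_series: Series,
                         first_model: Callable,
                         second_model: Callable,
                         third_model: Callable,
                         fourth_model: Callable) -> List[float]:
    """Wire the four models together in the order the claim describes."""
    converted = second_model(sound_series, image_series)
    speaker_feature = first_model(converted)
    utterance_period = third_model(sound_series, image_series)
    return fourth_model(input_sound, speaker_feature, utterance_period)


# Stub models (assumptions for illustration only).
def second_stub(sound: Series, image: Series) -> List[List[float]]:
    return [list(frame) for frame in sound]          # no re-weighting


def first_stub(converted: Series) -> List[float]:
    n = len(converted)
    return [sum(f[d] for f in converted) / n for d in range(len(converted[0]))]


def third_stub(sound: Series, image: Series) -> List[bool]:
    return [frame[0] > 0.5 for frame in image]       # "mouth moving" flag


def fourth_stub(sound: Sequence[float], speaker_feature: Sequence[float],
                period: Sequence[bool]) -> List[float]:
    # The stub ignores speaker_feature and only masks by the utterance period.
    return [s if active else 0.0 for s, active in zip(sound, period)]


# Hypothetical input: one sound sample per time step.
input_sound = [0.1, 0.2, 0.3]
sound_series = [[0.1], [0.2], [0.3]]
image_series = [[0.0], [1.0], [1.0]]
estimated = extract_target_voice(input_sound, sound_series, image_series,
                                 first_stub, second_stub, third_stub, fourth_stub)
```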

The computer according to claim 6,
wherein the arithmetic unit:
receives learning data including a plurality of samples, each composed of a plurality of learning images, a learning voice, and speaker identification information, and stores the learning data in the storage device; and
trains, using the learning data, the first model, the second model, the third model, and the fourth model so as to reduce the error between the speaker identification result output from the first model and the speaker identification information included in the sample, and the error between the voice of the target person output from the fourth model and the learning voice included in the sample.
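
The two errors reduced during training can be pictured as a single combined objective. The sketch below assumes cross-entropy for the speaker identification error, mean squared error for the voice error, and a weighting factor `alpha` between them; the claim specifies none of these, and all names and values are hypothetical.

```python
import math
from typing import Sequence


def speaker_error(probabilities: Sequence[float], true_speaker_id: int) -> float:
    """Cross-entropy between predicted speaker probabilities and the true ID."""
    return -math.log(max(probabilities[true_speaker_id], 1e-12))


def voice_error(estimated: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared error between estimated voice and learning voice."""
    if len(estimated) != len(reference):
        raise ValueError("signals must have the same length")
    return sum((e - r) ** 2 for e, r in zip(estimated, reference)) / len(reference)


def total_error(probabilities: Sequence[float], true_speaker_id: int,
                estimated: Sequence[float], reference: Sequence[float],
                alpha: float = 1.0) -> float:
    """Combined objective; alpha is an assumed balancing factor."""
    return (speaker_error(probabilities, true_speaker_id)
            + alpha * voice_error(estimated, reference))


# Hypothetical sample: two candidate speakers, two-sample voice signals.
loss = total_error([0.5, 0.5], 0, [1.0, 2.0], [1.0, 4.0])
```

Training would then adjust the parameters of all four models to make this value small across the samples.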

A voice processing method executed by a computer to extract the voice of a target person included in sound,
wherein the computer
has an arithmetic unit, a storage device connected to the arithmetic unit, and a connection interface connected to the arithmetic unit,
wherein a sound collection device that collects sound in the space where the target person is present and an imaging device that acquires an image of the target person are connected to the computer via the connection interface, and
wherein the voice processing method includes:
a first step in which the arithmetic unit stores, in the storage device, an input sound acquired from the sound collection device and an input image acquired from the imaging device;
a second step in which the arithmetic unit calculates, using the input sound, a speaker feature amount indicating characteristics of the voice of the target person;
a third step in which the arithmetic unit extracts, using the input image, a face area image including the face of the target person;
a fourth step in which the arithmetic unit identifies, using a plurality of the face area images, an utterance period in which the target person is estimated to have spoken; and
a fifth step in which the arithmetic unit extracts, using the speaker feature amount and the utterance period, an estimated voice of the target person in the utterance period from the input sound, and stores the extracted estimated voice of the target person in the storage device.

The voice processing method according to claim 8,
wherein the second step includes:
a sixth step in which the arithmetic unit identifies, using the input sound and a plurality of the input images, the utterance status of the target person at each time step of the input sound;
a seventh step in which the arithmetic unit converts the input sound, based on the utterance status of the target person at each time step, so as to emphasize the time steps that include the voice of the target person; and
an eighth step in which the arithmetic unit calculates the speaker feature amount using the converted input sound.

The voice processing method according to claim 9,
wherein the first step includes:
a step in which the arithmetic unit generates, from the input sound, first time-series data consisting of a sound feature amount for each time step; and
a step in which the arithmetic unit generates, from the plurality of face area images, second time-series data consisting of an image feature amount for each time step,
wherein the sixth step includes:
a ninth step in which the arithmetic unit calculates, using the first time-series data and the second time-series data, an index indicating the degree to which the input sound contains the voice of the target person; and
a tenth step in which the arithmetic unit identifies the utterance status of the target person at each time step based on the index,
wherein the seventh step includes:
a step in which the arithmetic unit calculates a weight corresponding to the utterance status of the target person at each time step; and
a step in which the arithmetic unit converts the first time-series data using the weights, and
wherein the eighth step includes a step in which the arithmetic unit calculates the speaker feature amount using the converted first time-series data.

The voice processing method according to claim 10,
wherein the ninth step includes a step in which the arithmetic unit calculates, as the index, the similarity of the sound feature amounts between time steps and the similarity of the image feature amounts between time steps, and
wherein the tenth step includes a step in which the arithmetic unit identifies the utterance status of the target person at each time step based on the similarity of the image feature amounts between time steps and the similarity of the sound feature amounts between time steps.

The voice processing method according to claim 10,
wherein the ninth step includes:
a step in which the arithmetic unit identifies, based on the image feature amounts, a reference time step in which the mouth of the target person is closed;
a step in which the arithmetic unit calculates a reference sound feature amount based on the sound feature amount corresponding to the reference time step; and
a step in which the arithmetic unit calculates, as the index, the similarity between the reference sound feature amount and the sound feature amount of each time step, and
wherein the tenth step includes a step in which the arithmetic unit identifies the utterance status of the target person at each time step based on the similarity between the reference sound feature amount and the sound feature amount of each time step.

The voice processing method according to claim 10,
wherein the storage device stores information defining a first model that identifies, from an input sound, the speaker who is speaking, a second model that identifies the utterance status at each time step, a third model that identifies the utterance period, and a fourth model that extracts the voice of the target person,
wherein the second step includes:
a step in which the arithmetic unit calculates the converted input sound by inputting the first time-series data and the second time-series data into the second model; and
a step in which the arithmetic unit extracts the speaker feature amount of the target person from the first model to which the converted input sound has been input,
wherein the fourth step includes a step in which the arithmetic unit calculates the utterance period by inputting the first time-series data and the second time-series data into the third model, and
wherein the fifth step includes a step in which the arithmetic unit extracts the voice of the target person from the input sound by inputting the speaker feature amount and the utterance period into the fourth model.

The voice processing method according to claim 13, further including:
a step in which the arithmetic unit receives, via the connection interface, learning data including a plurality of samples, each composed of a plurality of learning images, a learning voice, and speaker identification information, and stores the learning data in the storage device; and
a step in which the arithmetic unit trains, using the learning data, the first model, the second model, the third model, and the fourth model so as to reduce the error between the speaker identification result output from the first model and the speaker identification information included in the sample, and the error between the voice of the target person output from the fourth model and the learning voice included in the sample.