JP2020140050A

JP2020140050A - Caption generation device, caption generation method and program

Info

Publication number: JP2020140050A
Application number: JP2019034979A
Authority: JP
Inventors: Kazuhiro Nakadai; Michio Iwatsuki; Katsutoshi Itoyama; Kenji Nishida
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2020-09-03
Anticipated expiration: 2039-02-27
Also published as: JP7267034B2

Abstract

PROBLEM TO BE SOLVED: To provide a caption generation device, a caption generation method, and a program that enable a caption to be generated for an acoustic signal using a neural network. SOLUTION: A caption generation device comprises a caption generation unit which, when generating a spectrogram for an acoustic signal, divides the spectrogram into one or more fixed-length blocks, inputs the blocks to a convolutional neural network to extract feature vectors, and inputs the extracted feature vectors to a recurrent neural network to generate a caption for the acoustic signal. SELECTED DRAWING: Figure 1

Description

The present invention relates to a caption generation device, a caption generation method, and a program.

Some television programs display on-screen text (telops) and subtitles. Such text is created, for example, by a person typing on a keyboard from a script. In live news programs, it is created by a person typing on a keyboard while listening to what the announcer says.

In addition, methods for generating captions for images and attaching them have been studied. For example, it has been proposed to obtain image features by inputting an image into a convolutional neural network and to generate a caption using a recurrent neural network (RNN) (see, for example, Patent Document 1 and Non-Patent Document 1).

Image caption generation models based on deep learning, such as the techniques described in Patent Document 1 and Non-Patent Document 1, convert an image into a fixed-length intermediate vector representation using a convolutional neural network (CNN) (see, for example, Non-Patent Document 2), and then convert that intermediate vector representation into a caption using an RNN (see, for example, Non-Patent Document 3).

Patent Document 1: JP-A-2018-124969

Non-Patent Document 1: Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, "Show and Tell: A Neural Image Caption Generator", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164
Non-Patent Document 2: Pierre Sermanet et al., "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks", arXiv preprint arXiv:1312.6229, 2013
Non-Patent Document 3: Dzmitry Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv preprint arXiv:1409.0473, 2014

There is also a demand for generating and attaching captions to acoustic signals. One problem is that the content of recorded data containing general acoustic signals cannot be known until someone actually listens to it. However, the conventional image caption generation methods cannot be applied directly to acoustic signals: the input is a one-dimensional time-series signal rather than a two-dimensional image, and because acoustic signals have variable length, they cannot be resized into a fixed-length representation as images can.

The present invention has been made in view of the above problems, and an object thereof is to provide a caption generation device, a caption generation method, and a program that make it possible to generate a caption for an acoustic signal using a neural network.

(1) In order to achieve the above object, a caption generation device <1> according to one aspect of the present invention includes a caption generation unit <12> that, when generating a spectrogram for an acoustic signal, divides the spectrogram into one or more fixed-length blocks, inputs the blocks to a convolutional neural network <CNN> to extract feature vectors, and inputs the extracted feature vectors to a recurrent neural network <RNN> to generate a caption for the acoustic signal.

(2) In the caption generation device according to one aspect of the present invention, the spectrogram may be a logarithmic mel frequency spectrogram.

(3) In the caption generation device according to one aspect of the present invention, the recurrent neural network may be composed of multiple long short-term memory <LSTM> layers and may include a first recurrent neural network <RNN unit 1222> and a second recurrent neural network <RNN unit 1231>. The first recurrent neural network inputs the extracted feature vectors to its multiple long short-term memory layers to generate an intermediate representation vector, and the second recurrent neural network inputs the intermediate representation vector to its multiple long short-term memory layers to generate the caption.

(4) In the caption generation device according to one aspect of the present invention, each block may be a grayscale image of the spectrogram.

(5) In order to achieve the above object, a caption generation method according to one aspect of the present invention includes: a step in which an acquisition unit acquires an acoustic signal; a step in which a caption generation unit, when generating a spectrogram for the acoustic signal, divides the spectrogram into one or more fixed-length blocks and inputs the blocks to a convolutional neural network to extract feature vectors; and a step in which the caption generation unit inputs the extracted feature vectors to a recurrent neural network to generate a caption for the acoustic signal.

(6) In order to achieve the above object, a program according to one aspect of the present invention causes a computer of a caption generation device to execute: a step of acquiring an acoustic signal; a step of, when generating a spectrogram for the acoustic signal, dividing the spectrogram into one or more fixed-length blocks and inputting the blocks to a convolutional neural network to extract feature vectors; and a step of inputting the extracted feature vectors to a recurrent neural network to generate a caption for the acoustic signal.

According to (1), (3), (5), and (6) above, a one-dimensional acoustic signal is converted into two-dimensional data, and a caption can be generated for the acoustic signal using a neural network.

According to (2) above, a caption can be generated using a representation matched to human hearing.
According to (4) above, the amount of computation can be reduced.

A diagram showing a configuration example of the caption generation device according to the embodiment.
A diagram showing an outline of the processing performed by the caption generation device according to the embodiment.
A diagram showing an outline of the processing performed by the caption generation device according to the embodiment.
A diagram showing the processing performed by the preprocessing unit and the encoder according to the embodiment.
A diagram showing a processing example of the decoder according to the embodiment.
A diagram showing an example of learning processing in the encoder according to the embodiment.
A diagram showing the configuration of an LSTM and a processing example.
A flowchart of an example processing procedure of the caption generation device according to the embodiment.
A diagram showing an example of the acoustic signals used for evaluation.
A diagram showing the architecture of the learning model.
A diagram showing a configuration example and an example processing procedure of each module.
A diagram showing examples of output captions that did not match the correct answer.
A diagram showing an example of evaluation results.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, integrating and recognizing the many kinds of information contained in an acoustic signal is referred to as acoustic scene understanding.

<Caption generation device>
FIG. 1 is a diagram showing a configuration example of the caption generation device 1 according to the present embodiment. As shown in FIG. 1, the caption generation device 1 includes an acoustic signal acquisition unit 11, a caption generation unit 12, and an output unit 13.
The caption generation unit 12 includes a preprocessing unit 121, an encoder 122, and a decoder 123.
The preprocessing unit 121 includes a cutout unit 1211, a normalization unit 1212, and a short-time Fourier transform unit 1213.
The encoder 122 includes a CNN unit 1221 and an RNN unit 1222 (first recurrent neural network).
The decoder 123 includes an RNN unit 1231 (second recurrent neural network).

The caption generation device 1 generates a caption for an acquired acoustic signal. That is, the caption generation device 1 understands the acoustic scene of the acoustic signal and generates a caption for it.

The acoustic signal acquisition unit 11 acquires an acoustic signal and outputs it to the caption generation unit 12. The acoustic signal may be one picked up by a microphone or one that has been recorded.

The caption generation unit 12 generates a caption for the acoustic signal output by the acoustic signal acquisition unit 11.

The preprocessing unit 121 preprocesses the acquired acoustic signal to convert it from one-dimensional information into two-dimensional information.

The cutout unit 1211 cuts out segments of the acoustic signal output by the acoustic signal acquisition unit 11 using a window of a predetermined time width. The method of cutting out the acoustic signal will be described later. The cutout unit 1211 outputs the cut-out acoustic signal to the normalization unit 1212.

The normalization unit 1212 normalizes the acoustic signal cut out by the cutout unit 1211 and outputs the normalized, cut-out acoustic signal to the short-time Fourier transform unit 1213.

The short-time Fourier transform unit 1213 performs a short-time Fourier transform (STFT) on the normalized, cut-out acoustic signal output by the normalization unit 1212. This process generates a two-dimensional spectrogram image for each cut-out acoustic signal, converting the one-dimensional acoustic signal into two-dimensional image data. The short-time Fourier transform unit 1213 outputs the resulting two-dimensional spectrogram images one after another to the encoder 122.

The encoder 122 uses the two-dimensional spectrogram images output by the preprocessing unit 121 to generate a vector that serves as an intermediate representation. The processing method of the encoder 122 will be described later.

The CNN unit 1221 is a convolutional neural network (CNN). The CNN unit 1221 inputs the two-dimensional spectrogram images output one after another by the short-time Fourier transform unit 1213 to the CNN to generate feature vectors, and outputs the generated feature vectors to the RNN unit 1222. The CNN unit 1221 extracts features from each two-dimensional spectrogram image by, for example, convolution processing using kernels and pooling processing, and combines the extracted features in a fully connected layer to generate a feature vector. The CNN unit 1221 generates one feature vector per image. The configuration and operation of the CNN will be described later.

The RNN unit 1222 is a recurrent neural network (RNN). The RNN unit 1222 uses the RNN to combine the feature vectors output by the CNN unit 1221 into a vector that serves as an intermediate representation. The generated intermediate representation vector is of pseudo-variable length. The RNN unit 1222 outputs the generated intermediate representation vector to the decoder 123. The RNN unit 1222 has a structure in which RNNs are stacked in multiple layers. An RNN is a neural network whose hidden-layer values are fed back into the hidden layer. The feature vectors generated by the CNN unit 1221 are input one at a time into the LSTM (long short-term memory) of the RNN, described later, and this triggers the sequential output of words. The configuration and operation of the RNN will be described later.

The decoder 123 generates a caption using the intermediate representation vector generated by the encoder 122. The processing method of the decoder 123 will be described later.

The RNN unit 1231 is a recurrent neural network (RNN). The RNN unit 1231 inputs the intermediate representation vector output by the encoder 122 to the RNN to generate a caption, and outputs the generated caption information (for example, text information) to the output unit 13. The RNN unit 1231 has a structure in which RNNs are stacked in multiple layers.

The output unit 13 converts the caption information output by the decoder 123 into image information and outputs the converted image information to an external device (not shown) such as an image display device (not shown). The external device may be a tablet terminal, a smartphone, or the like. The output unit 13 may also output the recorded acoustic signal to the external device.

<Outline of processing flow>
First, an outline of the processing flow will be described with reference to FIGS. 2 and 3.
FIGS. 2 and 3 are diagrams showing an outline of the processing performed by the caption generation device 1 according to the present embodiment.
As shown in FIGS. 2 and 3, the acoustic signal acquisition unit 11 acquires an acoustic signal (reference numeral g1 in FIG. 2, reference numeral g11 in FIG. 3). At reference numeral g1 in FIG. 2 and reference numeral g11 in FIG. 3, the horizontal axis represents time and the vertical axis represents amplitude.
As indicated by reference numeral g12 in FIG. 3, the preprocessing unit 121 performs cutout processing, normalization processing, and short-time Fourier transform processing. As a result, spectrograms in the form of a plurality of images are generated as indicated by reference numeral g13 in FIG. 3, and the generated spectrograms are input to the encoder (reference numeral g2 in FIG. 2, reference numeral g13 in FIG. 3).

Next, as indicated by reference numeral g14 in FIG. 3, the encoder 122 inputs the spectrograms to the CNN to generate vectors and inputs the generated vectors to the RNN. The RNN unit 1222 has multiple LSTM layers, as described later. Through this processing, the encoder 122 generates the intermediate representation vector (reference numeral g3 in FIG. 2, reference numeral g15 in FIG. 3).

Next, as indicated by reference numeral g4 in FIG. 2 and reference numeral g16 in FIG. 3, the decoder 123 inputs the intermediate representation vector to the RNN to generate a caption. The RNN unit 1231 also has multiple LSTM layers, as described later. The generated caption is "A bell is ringing," as indicated by reference numeral g5 in FIG. 2.

<Processing performed by the preprocessing unit 121 and the encoder 122>
The processing performed by the preprocessing unit 121 and the encoder 122 will be described in detail with reference to FIG. 4.
FIG. 4 is a diagram showing the processing performed by the preprocessing unit 121 and the encoder 122 according to the present embodiment.
Reference numeral g101 denotes the acquired acoustic signal. At reference numeral g101, the horizontal axis represents time (frames) and the vertical axis represents amplitude.

As indicated by reference numerals g102 and g104, the cutout unit 1211 cuts out the acoustic signal using a window of a predetermined size (for example, 204480; reference numeral g103). As indicated by reference numeral g103, the cutout unit 1211 cuts out the acoustic signal successively so that adjacent windows overlap by a predetermined time. This process generates a plurality of waveform data segments.
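The overlapping cutout can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the window size 204480 is taken from the example above and interpreted as a sample count, while the 50% hop, the zero-padding of the last window, and the name `cut_windows` are assumptions made for the sketch.

```python
import numpy as np

def cut_windows(signal, window_len=204480, hop_len=102240):
    """Cut a 1-D signal into overlapping fixed-length windows.

    window_len follows the example value in the text; hop_len (50% overlap)
    and zero-padding of the final window are assumptions of this sketch.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) <= window_len:
        n_windows = 1
    else:
        n_windows = 1 + int(np.ceil((len(signal) - window_len) / hop_len))
    # Pad with zeros so that every window has exactly window_len samples.
    padded = np.zeros(window_len + (n_windows - 1) * hop_len)
    padded[:len(signal)] = signal
    return np.stack([padded[i * hop_len : i * hop_len + window_len]
                     for i in range(n_windows)])
```

Because every row has the same length, a signal of any duration becomes a variable number of fixed-length segments, which is what allows each segment to be turned into a fixed-size image later.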

Next, as indicated by reference numerals g105 and g106, the normalization unit 1212 normalizes the amplitude of each cut-out acoustic signal to the range of -1 to 1. As indicated by reference numeral g107, the normalization unit 1212 performs this with, for example, an FFT (fast Fourier transform) window of 1024 and an overlap of 512.

Next, as indicated by reference numerals g108 and g109, the short-time Fourier transform unit 1213 successively performs a short-time Fourier transform on the normalized, cut-out acoustic signals. This process generates a plurality of spectrograms from the waveform data. Reference numeral g109 is a spectrogram image, in which the horizontal axis represents time and the vertical axis represents frequency (mel bin).
In the present embodiment, this preprocessing converts the acoustic signal into a logarithmic mel frequency spectrogram (log-scaled mel frequency spectrogram), yielding a one-channel grayscale image that can be input to the CNN unit 1221 of the encoder 122 in the subsequent stage. Using grayscale spectrogram images in this way reduces the amount of computation.
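As a rough illustration of the normalization and short-time Fourier transform steps, the following sketch scales a segment to the range -1 to 1 and computes a magnitude spectrogram with an FFT window of 1024 and an overlap of 512. Peak normalization, the Hann window, and the function names are assumptions; the text does not state which normalization or window function is used.

```python
import numpy as np

def normalize_peak(x):
    """Scale a waveform so its samples lie in [-1, 1] (peak normalization, assumed)."""
    x = np.asarray(x, dtype=np.float64)
    peak = np.max(np.abs(x))
    return x if peak == 0 else x / peak

def stft_magnitude(x, n_fft=1024, hop=512):
    """Magnitude spectrogram of shape (n_fft // 2 + 1, n_frames).

    x must contain at least n_fft samples. The Hann window is an assumption.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # One FFT per frame; transpose so that frequency is the vertical axis.
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

The returned array has a single value per time-frequency cell, so it can be treated as a one-channel (grayscale) image.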

Next, as indicated by reference numeral g110, the CNN unit 1221 generates vectors (reference numeral g111) by successively inputting the plurality of spectrograms to the CNN. Each image in the sequence is passed through the CNN, so the CNN turns the acoustic signal into a sequence of vectors. One feature vector corresponds to the spectrogram image of one cut-out waveform data segment. Here, n is the number of spectrograms, which equals the number of cut-out waveforms.

Next, as indicated by reference numeral g112, the RNN unit 1222 inputs the vectors output by the CNN unit 1221 to the RNN to generate a vector (reference numeral g113) that serves as an intermediate representation. The sequence of vectors is input to the encoder-side RNN one step at a time, and the final state is taken as the intermediate representation.
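The data flow of the encoder (one feature vector per block, then a recurrent pass whose final state is kept) can be sketched as below. Everything here is a stand-in chosen for brevity: the "CNN" is replaced by average pooling plus a random projection, the LSTM layers by a plain tanh recurrent cell, and the dimensions (40 mel bins, 8 features, 16 state units) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, FEATURE_DIM, STATE_DIM = 40, 8, 16   # hypothetical sizes

# Stand-in for the CNN: pool each block over time, then project.
W_feat = rng.standard_normal((FEATURE_DIM, N_MELS)) * 0.1
# Stand-in for the LSTM layers: a plain tanh recurrent cell.
W_in = rng.standard_normal((STATE_DIM, FEATURE_DIM)) * 0.1
W_rec = rng.standard_normal((STATE_DIM, STATE_DIM)) * 0.1

def block_to_feature(block):
    """Map one spectrogram block (n_mels, n_frames) to one feature vector."""
    return np.tanh(W_feat @ block.mean(axis=1))

def encode(blocks):
    """Feed the feature vectors step by step and return the final state."""
    h = np.zeros(STATE_DIM)
    for block in blocks:
        h = np.tanh(W_in @ block_to_feature(block) + W_rec @ h)
    return h

short_clip = rng.standard_normal((3, N_MELS, 100))   # 3 blocks
long_clip = rng.standard_normal((7, N_MELS, 100))    # 7 blocks
print(encode(short_clip).shape, encode(long_clip).shape)
```

Both clips yield a state of the same size even though they contain different numbers of blocks, which is how a variable-length acoustic signal ends up as a single intermediate representation.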

As described above, in the present embodiment, when a spectrogram of the acoustic signal is taken, a plurality of fixed-length spectrograms are taken while shifting a window over the waveform data. The spectrogram is thereby divided into fixed-length spectrograms (blocked), producing one or more blocks (spectrogram images). In other words, the cutout processing and normalization are performed as part of the short-time Fourier transform.
Further, in the present embodiment, by extracting feature vectors with the encoder 122, the types of sound sources contained in the acoustic signal are identified; that is, sound source identification processing is performed.

<Logarithmic mel frequency spectrogram>
Here, the logarithmic mel frequency spectrogram will be described. A logarithmic mel frequency spectrogram is obtained by converting the amplitude spectrogram, which results from applying the STFT to an acoustic signal, to match human hearing. Specifically, the frequency axis of the amplitude spectrogram is stretched and compressed and the amplitude values are converted to match human perception.

First, the short-time Fourier transform unit 1213 matches the frequency characteristics to hearing using the mel scale. The mel scale is a frequency axis that reflects human speech perception. The short-time Fourier transform unit 1213 converts a frequency f [Hz] to a mel-scale value m by the following equation (1).
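Equation (1) is not reproduced in this text. For orientation only, a widely used definition of the mel scale is shown below; it is given as an assumption and may differ from the constants used in the original equation.

```latex
m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)
```

With this definition, 1000 Hz maps to approximately 1000 mel, and equal steps in m correspond to progressively wider steps in f at higher frequencies.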

Next, to convert the frequency axis of the amplitude spectrum obtained by the STFT to the mel scale, the short-time Fourier transform unit 1213 applies to the amplitude spectrum a plurality of band-pass filters arranged at equal intervals on the mel scale. This set of band-pass filters is called a mel filter bank and consists of triangular-window filters.
Next, the short-time Fourier transform unit 1213 matches loudness to hearing. Humans perceive loudness on a log scale. The short-time Fourier transform unit 1213 therefore takes the log of the amplitude after the mel filter bank has been applied.

Through the above processing, that is, applying a mel filter bank along the frequency axis of an ordinary spectrogram and converting the amplitude to a log scale, a logarithmic mel frequency spectrogram is obtained.
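The two operations summarized above can be sketched in a few lines of NumPy. The mel conversion constants (2595 and 700), the filter count, the sample rate, and the small offset `eps` added before the log are assumptions of this sketch, not values stated in the text.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale convention; assumed, not taken from equation (1).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters equally spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, centre, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel(magnitude_spec, bank, eps=1e-10):
    """Apply the filter bank along the frequency axis, then take the log amplitude."""
    return np.log(bank @ magnitude_spec + eps)
```

For a magnitude spectrogram of shape (513, n_frames), `log_mel(spec, mel_filterbank(40, 1024, 16000))` returns an array of shape (40, n_frames), whose rows are the mel bins of the resulting image.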

<Structure of neural network>
Next, the structure of the neural network will be described.
Let v be the size of the word dictionary, I = (I_0, ..., I_{M-1}) be the sequence of logarithmic mel frequency spectrograms computed from the cut-out waveform data, and S = (S_0, ..., S_{N-1}) be the corresponding caption. Each word of the caption is represented by a v-dimensional one-hot vector S_t. A one-hot vector is a vector in which exactly one element is 1 and all other elements are 0. Here, S_0 is a special start word indicating the start of the caption, S_N is a special end word indicating the end of the caption, and S_t, t ∈ {1, ..., N-1}, correspond to the actual words of the caption. The neural networks of the encoder and the decoder can be expressed by the following equations (2) to (7).
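The one-hot representation of a caption, including the special start and end words, can be illustrated as follows. The toy dictionary, the token names `<start>` and `<end>`, and the helper name are hypothetical.

```python
import numpy as np

# Toy word dictionary of size v = 6 (hypothetical).
vocab = ["<start>", "<end>", "a", "bell", "is", "ringing"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def caption_to_one_hot(words):
    """Return one v-dimensional one-hot row per word, with start/end words added."""
    tokens = ["<start>"] + list(words) + ["<end>"]
    one_hot = np.zeros((len(tokens), len(vocab)))
    one_hot[np.arange(len(tokens)), [word_to_id[w] for w in tokens]] = 1.0
    return one_hot

S = caption_to_one_hot(["a", "bell", "is", "ringing"])
```

Here the first row plays the role of S_0 (start word), the last row the role of the end word, and the rows in between are the actual caption words.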

In equations (2) to (7), CNN denotes CNN processing, LSTM denotes LSTM processing, and Softmax denotes softmax processing. Further, e is an embedding vector, h_t is the state of the encoder-side LSTM, and s_t is the state of the decoder-side LSTM. W_e (∈ R^{e×v}) is the word embedding matrix on the encoder side, and W_p is the output of the decoder-side LSTM. LSTM_h and LSTM_y are functions that determine the output from the state. p_t is a v-dimensional vector whose elements represent the output probability of each word.

LSTM^{encoder} and LSTM^{decoder} are assumed to start from step 0. The input is a sequence of images, each of which is the input to LSTM^{encoder} at one step. The final state h_{M-1} of LSTM^{encoder} is the intermediate representation; it is used as the initial state s_{-1} of LSTM^{decoder} (equation (4)), and the corresponding caption is generated.
In the embodiment, an RNN-based model that outputs such a variable-length sequence is called a Seq-to-Seq (Sequence-to-Sequence) model.

Learning is performed using an error function. As in the model used for vision (image captioning), the following equation (8) is used as the error function, and minimizing it learns the caption S corresponding to the time-divided spectrogram sequence I of the acoustic signal.
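Equation (8) is not reproduced in this text. The image captioning model of Non-Patent Document 1 minimizes the sum of the negative log-probabilities of the correct word at each step; assuming the error function here has the same form, it could be computed as in the sketch below. The function name and argument layout are hypothetical.

```python
import numpy as np

def caption_nll(probs, target_ids):
    """Sum of negative log-probabilities of the correct word at each step.

    probs: array of shape (n_steps, v); each row is a distribution over the dictionary.
    target_ids: sequence of n_steps correct word indices.
    This form is assumed from the cited image captioning model.
    """
    probs = np.asarray(probs, dtype=np.float64)
    steps = np.arange(len(target_ids))
    return float(-np.sum(np.log(probs[steps, target_ids])))
```

The value is zero when every correct word is predicted with probability 1 and grows as probability mass moves to other words, so minimizing it pushes the decoder outputs p_t toward the reference caption.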

In the present embodiment, performance is improved by stacking RNNs in multiple layers in the Seq-to-Seq model. For a two-layer LSTM, the model is expressed by the following equations (9) to (12).

In equations (9) to (12), x_t is the input and y_t^2 is the output. In a two-layer LSTM, the output sequence y_t^1 of the first-layer LSTM^1 becomes the input sequence of the second-layer LSTM^2, and the output sequence y_t^2 of the second layer is the output of the whole multi-layer LSTM.
When the multi-layer LSTM is used in the Seq-to-Seq model, the encoder-side and decoder-side LSTMs are given the same number of intermediate layers so that the intermediate representation can be passed between them.

Further, the encoder-side RNN of the present embodiment is configured as a bidirectional RNN (Bidirectional RNN), which accepts the sequence not only from past to future (forward direction) but also from future to past (reverse direction). The forward and reverse encoder LSTMs are computed by equations (13) and (14), and their final states h_{M-1}^f and h_{M-1}^b are concatenated to form the intermediate representation h_{M-1} = [h_{M-1}^f; h_{M-1}^b].

In equation (13), LSTM^f is the forward LSTM with state h_t^f. In equation (14), LSTM^b is the reverse LSTM with state h_t^b.
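The concatenation of forward and reverse final states can be illustrated with a minimal sketch. The function `rnn_step` is a hypothetical toy update standing in for one LSTM step (it is not the LSTM of equations (13) and (14)), and all names and values are assumptions chosen for illustration.

```python
import math

def rnn_step(state, x, weight=0.5):
    """Toy recurrent update standing in for one LSTM step (hypothetical)."""
    return [math.tanh(weight * s + xi) for s, xi in zip(state, x)]

def run_encoder(features, reverse=False):
    """Fold a sequence of feature vectors into one final state."""
    state = [0.0] * len(features[0])
    sequence = reversed(features) if reverse else features
    for x in sequence:
        state = rnn_step(state, x)
    return state

def bidirectional_encode(features):
    """Concatenate the forward and reverse final states: [h^f; h^b]."""
    h_forward = run_encoder(features)
    h_backward = run_encoder(features, reverse=True)
    return h_forward + h_backward

features = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]  # M = 3 feature vectors
h = bidirectional_encode(features)
print(len(h))  # 4: twice the state size of a single direction
```

Because the intermediate representation doubles in size, a decoder that receives it as its initial state would need a matching state size; how the embodiment handles this is not stated here.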

<Attention mechanism>
In the Seq-to-Seq model, the input is folded step by step into a single vector, the RNN state, so when the input sequence is long it becomes difficult to convey the earliest inputs of the sequence to the decoder. Therefore, the present embodiment provides the attention mechanism (Attention Mechanism) expressed by equations (15) to (18) so that the input can be referred to directly at decoding time. In the attention mechanism, when the state s_i is obtained at each decoder step i, the caption generation unit 12 calculates the score α_ij between the current decoder state s_{i-1} and the state h_j at each past encoder step j. Using this score in the calculation of the state s_i at step i allows the input to be referred to.

In equations (15) to (18), e_ij is the score between the state s_{i-1} at decoder step i-1 and the state h_j at encoder step j. α_ij is e_ij normalized along the encoder step direction j (Σ_j α_ij = 1, 0 ≤ α_ij ≤ 1). The weighted average c_i is taken with α_ij as the weights, and c_i is reflected in the calculation of the next decoder state s_i. The inner product s_{i-1}·h_j is used as the score function a(s_{i-1}, h_j).
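A minimal sketch of the score, normalization and weighted-average steps follows. The text only states that α_ij sums to 1 over j and lies between 0 and 1; using a softmax for the normalization is an assumption made here, as are the function names and the example values.

```python
import math

def dot(a, b):
    """Inner product used as the score function a(s, h)."""
    return sum(x * y for x, y in zip(a, b))

def attention_context(s_prev, encoder_states):
    """Return (alpha, c) for one decoder step."""
    scores = [dot(s_prev, h) for h in encoder_states]  # e_ij
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(e - max_score) for e in scores]
    total = sum(exps)
    alpha = [v / total for v in exps]  # assumed softmax: weights sum to 1
    dim = len(encoder_states[0])
    context = [sum(a * h[k] for a, h in zip(alpha, encoder_states))
               for k in range(dim)]  # weighted average c_i
    return alpha, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_0, h_1, h_2
s_prev = [2.0, 0.0]                                    # s_{i-1}
alpha, c = attention_context(s_prev, encoder_states)
print(alpha, c)
```

In this example the first and third encoder states have the same inner product with the decoder state and therefore receive equal weight, while the second receives less.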

<Decoder processing>
Next, a processing example of the decoder 123 will be described.
FIG. 5 is a diagram showing a processing example of the decoder 123 according to the present embodiment. Note that h_1, ..., h_4 (reference numerals g307, g310, g313, g316) are the states of the respective layers.
First, the RNN unit 1231 sets the intermediate representation vector (reference numeral g201) output by the encoder 122 as the initial state of the RNN (reference numeral g301).
Then, the RNN unit 1231 inputs the special word START, which marks the beginning of the caption, as the input of the first RNN step (reference numeral g211) (reference numerals g302, g303). The word corresponding to the index at which the probability p_1 is maximal is taken (reference numerals g305, g306), so that the first word of the caption is output as the output of the first step.

Next, the RNN unit 1231 inputs the word corresponding to the output of the first step as the input of the second step (reference numeral g308). As a result, the second word is output at the second RNN step (reference numeral g212) (reference numeral g309).
Thereafter, the caption is generated by repeating this process, with the output of each step becoming the input of the next step (reference numerals g311 to g318, g321).
Sentence generation ends when a step outputs the special word END (reference numeral g318).
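The decoding loop described above can be sketched as follows. `toy_step` is a hypothetical stand-in for one decoder RNN step that returns a new state and a word-probability table. The `max_steps` limit is a safety assumption that does not appear in the text.

```python
START, END = "START", "END"

def greedy_decode(initial_state, step_fn, max_steps=20):
    """Generate words until END is produced or max_steps is reached."""
    state, word, caption = initial_state, START, []
    for _ in range(max_steps):
        state, probs = step_fn(state, word)
        word = max(probs, key=probs.get)  # word with maximum probability
        if word == END:
            break
        caption.append(word)  # this word is also the next step's input
    return caption

def toy_step(state, word):
    """Hypothetical decoder step: emits a fixed sequence, one word per call."""
    script = ["SnareDrum", "Gong", "Bark", END]
    probs = {w: 0.05 for w in script}
    probs[script[state]] = 0.85
    return state + 1, probs

print(greedy_decode(0, toy_step))  # ['SnareDrum', 'Gong', 'Bark']
```

Taking the single most probable word at each step is what the text describes; alternatives such as beam search are not mentioned and are not shown.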

<Learning process>
Next, an example of learning processing in the encoder 122 will be described.
FIG. 6 is a diagram showing an example of learning processing in the encoder 122 according to the present embodiment. Reference numeral g401 is a spectrogram generated by the short-time Fourier transform unit 1213; the horizontal axis is time and the vertical axis is mel frequency. This spectrogram is input to the CNN of the CNN unit 1221 (reference numeral g402).

Here, learning of the CNN is divided into two stages: a learning stage for the CNN part and a learning stage for the entire model.

In the learning stage for the CNN part, a classifier is attached after the CNN to build a network that classifies acoustic signals, this network is trained, and only the CNN part of the trained network is extracted. Here, a classifier (discriminator) is an algorithm that takes a feature quantity as input and classifies what it represents. Classification here means assigning an acoustic signal to one of the labels given to the acoustic signals. For example, when captions are generated for 33 types of environmental sounds, each environmental sound is a class and the total number of classes is 33. The teacher data in the learning stage for the CNN part are, for example, class labels of environmental sounds (knock, flute, cough, bark, chime, and so on).

In the learning stage for the entire model, the CNN trained in the learning stage for the CNN part is brought in and the entire model is trained. The teacher data at this stage are, for example, sentence data such as "SnareDrum is followed by Gong, and then Bark sounds".

<LSTM>
Next, a supplementary description of the LSTM is given.
FIG. 7 is a diagram showing the configuration of an LSTM and a processing example. In FIG. 7, x_t is the input at step t, y_t is the output at step t, c_t is the memory cell, i is the input gate, f is the forget gate, o is the output gate, and × is the element-wise product.
The configuration and processing of FIG. 7 make it possible to decide how much current and past information is used.

The values used in the LSTM are calculated by the RNN unit 1222 or the RNN unit 1231 using equations (19) to (24).

In equations (19) to (21), σ is the sigmoid function.
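Equations (19) to (24) are not reproduced in this text. For orientation only, the widely used LSTM formulation has the shape below, with three sigmoid gates followed by the candidate cell, the memory cell update and the output. The weight matrices W and U, the biases b and the candidate cell \tilde{c}_t are symbols introduced for this sketch, and the exact form and numbering in the original may differ.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i y_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f y_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o y_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c y_{t-1} + b_c) \\
c_t &= f_t \times c_{t-1} + i_t \times \tilde{c}_t \\
y_t &= o_t \times \tanh(c_t)
\end{aligned}
```

Here × is the element-wise product, as in FIG. 7. The forget gate f_t scales how much of the past memory cell is kept, and the input gate i_t scales how much of the current input enters it.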

<Processing procedure of the caption generation device>
Next, an example of the processing procedure of the caption generation device 1 will be described.
FIG. 8 is a flowchart of an example of the processing procedure of the caption generation device according to the present embodiment.

(Step S1) The acoustic signal acquisition unit 11 acquires an acoustic signal and outputs the acquired acoustic signal to the caption generation unit 12.

(Step S2) The cutout unit 1211 cuts out segments of the acoustic signal output by the acoustic signal acquisition unit 11, using a window of a predetermined time width.

(Step S3) The normalization unit 1212 normalizes the acoustic signal cut out by the cutout unit 1211.

(Step S4) The short-time Fourier transform unit 1213 performs a short-time Fourier transform on the normalized, cut-out acoustic signal output by the normalization unit 1212 and generates a two-dimensional spectrogram image for each cut-out acoustic signal.

(Step S5) The CNN unit 1221 inputs the two-dimensional spectrogram images sequentially output by the short-time Fourier transform unit 1213 to the CNN and generates feature vectors. The CNN unit 1221 generates one feature vector per image.

(Steps S6 and S7) The RNN unit 1222 aggregates the feature vectors output by the CNN unit 1221 with the RNN and generates a vector that is the intermediate representation.

(Steps S8 and S9) The RNN unit 1231 inputs the intermediate representation vector output by the encoder 122 to the RNN and generates a caption.

<Evaluation result>
Next, an example of the results of evaluating the caption generation device 1 of the present embodiment will be described with reference to FIGS. 9 to 13.
Learning and evaluation require a dataset of acoustic signals and captions. For the evaluation, therefore, mixed sounds were created by randomly combining sound sources that each contain only a single class, together with a dataset of the corresponding captions. FIG. 9 is a diagram showing an example of an acoustic signal used for the evaluation. In FIG. 9, the horizontal axis is time (seconds) and the vertical axis is frequency [Hz]. In the example shown in FIG. 9, SnareDrum (reference numeral g501) sounds from 0 to 2 seconds, Gong (reference numeral g502) from 2 to 4 seconds, and Bark (reference numeral g503) from 4 to 6 seconds. The correct caption for this example is "SnareDrum is followed by Gong, and then Bark sounds". Thus, caption generation was evaluated on whether the caption expresses which sounds occur and in what order.

Environmental sounds of 33 classes (33 types), such as knock, flute, cough, bark and chime, were used as sound sources, and each mixed sound was created by concatenating about three sound sources so that they do not overlap. A silent signal of random length, about 0 to 0.6 seconds, was inserted between the sound source signals. In the evaluation, a caption describing the types and order of the sounds was generated for a monaural input acoustic signal. Of the pairs of mixed sound and caption, 18,000 were used as the learning dataset and 2,000 as the evaluation dataset.
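The construction of one mixed sound can be sketched as follows, under stated assumptions: the sampling rate of 16000 Hz, the function name `make_mixture` and the short constant-valued clips are all hypothetical. The sketch returns the label order, from which a caption sentence would then be written.

```python
import random

def make_mixture(clips, sample_rate=16000, max_gap_s=0.6, rng=None):
    """Concatenate labelled clips with random silent gaps between them."""
    rng = rng or random.Random()
    signal, labels = [], []
    for index, (label, samples) in enumerate(clips):
        if index > 0:
            # Random silence of 0 to max_gap_s seconds; clips never overlap.
            gap = int(rng.uniform(0.0, max_gap_s) * sample_rate)
            signal.extend([0.0] * gap)
        signal.extend(samples)
        labels.append(label)
    return signal, labels

clips = [("knock", [0.1] * 100), ("flute", [0.2] * 100), ("bark", [0.3] * 100)]
signal, labels = make_mixture(clips, rng=random.Random(0))
print(labels)  # ['knock', 'flute', 'bark']
```

Passing a seeded `random.Random` makes the gap lengths reproducible, which is convenient when regenerating a dataset.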

For the CNN, the existing trained model shown in FIGS. 10 and 11 was used (see, for example, Reference 1). FIG. 10 is a diagram showing the architecture of the learning model. FIG. 11 is a diagram showing a configuration example and a processing procedure example of each module. In FIG. 11, k and n indicate the convolution filter size and the number of softmax layers, respectively. BN is batch normalization, Concat is feature concatenation, Relu is the rectified linear unit, Conv is linear convolution, MaxPool is max pooling, and GAP is global average pooling. Reference numeral g511 is the module configuration and processing procedure of Low-level k. Reference numeral g512 is the module configuration and processing procedure of DenseNet-k. Reference numeral g513 is the module configuration and processing procedure of the n-head classifier. The classification accuracy of this trained CNN part is about 90%.
Reference 1: Il-Young Jeong, et al., "Audio tagging system for DCASE 2018: Focusing on label noise, data augmentation and its efficient learning", DCASE2018 Challenge, 2018

The model was trained in two stages. First, a CNN for sound source identification was prepared; the existing trained model described above was used for this. Next, transfer learning of the entire model was performed from this CNN using the learning dataset described above. The waveform data was cut into segments of 1.28 seconds each with 50% overlap.
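The 1.28-second, 50%-overlap segmentation can be sketched as follows. The sampling rate of 16000 Hz is an assumption (it is not stated here), as is the choice to drop a trailing partial window; the function name `cut_windows` is hypothetical.

```python
def cut_windows(samples, sample_rate, window_s=1.28, overlap=0.5):
    """Return fixed-length windows; a trailing partial window is dropped."""
    size = int(round(window_s * sample_rate))
    hop = int(round(size * (1.0 - overlap)))  # 50% overlap: hop = size / 2
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, hop)]

sample_rate = 16000  # assumed; the sampling rate is not given in the text
samples = [0.0] * (6 * sample_rate)  # a 6-second signal
windows = cut_windows(samples, sample_rate)
print(len(windows), len(windows[0]))  # 8 20480
```

Each window has the same length regardless of the length of the whole signal, so the spectrogram computed from each window has a fixed size, and only the number of windows varies.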

FIG. 12 is a diagram showing examples of output captions that did not match the correct answer. In FIG. 12, the sentences indicated by reference numerals g601 and g611 are the correct captions, and the sentences indicated by reference numerals g602 and g612 are the output captions. Reference numerals g602 and g612 are examples in which the order matches but the existence (classes) does not match completely. In contrast, when the correct caption is "Fireworks is followed by Bark, and then Gunshot sounds" and the output caption is "Gunshot is followed by Bark, and then Fireworks sounds", the existence matches but the order does not.

FIG. 13 is a diagram showing an example of the evaluation results.
As shown in FIG. 13, captions that completely matched the correct answer accounted for 73.20%. Captions in which the types of sound sources matched, regardless of their order, accounted for 75.90%. Captions in which the order of the sound sources matched, regardless of their types, accounted for 75.75%. Thus, captions that match in existence but not in order account for only 0.15%, which indicates that the ordering itself is performed successfully.

The causes of errors in which the caption does not match the correct answer can be divided into sound source identification errors and errors in the number of sound sources. Therefore, only the label words contained in each caption were extracted to form a label sequence, and the numbers of insertion errors, deletion errors and substitution errors were calculated for these label sequences. Out of 5969 evaluation data in total, the number of insertion errors was 40 and the number of deletion errors was 20, both of which are low.
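Counting insertion, deletion and substitution errors between a correct label sequence and an output label sequence can be sketched with a standard minimum-edit-distance alignment. This is an assumption about how the counts were obtained, since the text does not name the algorithm. Here an insertion is an extra label in the output and a deletion is a missing one; the function name is hypothetical.

```python
def count_edit_errors(reference, hypothesis):
    """Return (insertions, deletions, substitutions) on a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    insertions = deletions = substitutions = 0
    i, j = n, m
    while i > 0 or j > 0:  # trace one optimal alignment back to the start
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                substitutions += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            deletions += 1
            i -= 1
        else:
            insertions += 1
            j -= 1
    return insertions, deletions, substitutions

print(count_edit_errors(["Fireworks", "Bark", "Gunshot"], ["Fireworks", "Gunshot"]))  # (0, 1, 0)
```

When several optimal alignments exist, the split between the three error types can depend on the tie-breaking order; this sketch prefers a match or substitution first, then a deletion.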

As described above, in the present embodiment, a one-dimensional acoustic signal is subjected to a short-time Fourier transform so that it can be handled as a two-dimensional grayscale image with one channel.
Further, whereas the input image in vision is fixed-length data, sound can be of variable length. The width of the spectrogram of an acoustic signal changes greatly with the length of the signal, and resizing it would distort the aspect ratio severely. Therefore, in the present embodiment, when the spectrogram of the acoustic signal is taken, a plurality of fixed-length spectrograms are taken while shifting a window over the waveform data, and they are aggregated by the encoder-side RNN.

As described above, in the present embodiment, in order to apply the caption generation model for images to acoustic signals, the model is extended by introducing a spectrogram representation of the acoustic signal and a fixed-length vector representation of a variable-length acoustic signal using a plurality of spectrograms. That is, in the present embodiment, the sound is converted into a spectrogram and treated as a two-dimensional image, and this image is input to a convolutional neural network (CNN) for learning. At this time, the spectrogram is divided into fixed-length spectrograms (blocks), and the blocks are input to the CNN and also fed to a recurrent neural network (RNN), so that a time-series signal can be handled in a pseudo manner.

Thereby, according to the present embodiment, it is possible to understand the acoustic scene of an acoustic signal and generate a caption for it.

A program for realizing all or part of the functions of the caption generation device 1 according to the present invention may be recorded on a computer-readable recording medium, and all or part of the processing performed by the caption generation device 1 may be carried out by causing a computer system to read and execute the program recorded on the recording medium. The term "computer system" as used herein includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" also includes a medium that holds the program for a certain period of time, such as a volatile memory (RAM) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system in which it is stored in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Although modes for carrying out the present invention have been described above using the embodiment, the present invention is in no way limited to this embodiment, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... Caption generation device, 11 ... Acoustic signal acquisition unit, 12 ... Caption generation unit, 13 ... Output unit, 121 ... Preprocessing unit, 122 ... Encoder, 123 ... Decoder, 1211 ... Cutout unit, 1212 ... Normalization unit, 1213 ... Short-time Fourier transform unit, 1221 ... CNN unit, 1222 ... RNN unit, 1231 ... RNN unit

Claims

A caption generation device comprising a caption generation unit that, when generating a spectrogram for an acoustic signal, divides the spectrogram into one or more fixed-length blocks, inputs the blocks to a convolutional neural network to extract a feature vector, and inputs the extracted feature vector to a recurrent neural network to generate a caption for the acoustic signal.

The caption generator according to claim 1, wherein the spectrogram is a logarithmic mel frequency spectrogram.

The caption generation device according to claim 1 or 2, wherein
the recurrent neural network is composed of multiple layers of long short-term memory (LSTM),
the recurrent neural network includes a first recurrent neural network and a second recurrent neural network,
the first recurrent neural network inputs the extracted feature vector into the multi-layer LSTM layers to generate an intermediate representation vector, and
the second recurrent neural network inputs the intermediate representation vector into the multi-layer LSTM layers to generate the caption.

The caption generator according to any one of claims 1 to 3, wherein the block is a grayscale image of the spectrogram.

A caption generation method including:
a procedure in which an acquisition unit acquires an acoustic signal;
a procedure in which a caption generation unit, when generating a spectrogram for the acoustic signal, divides the spectrogram into one or more fixed-length blocks and inputs the blocks to a convolutional neural network to extract a feature vector; and
a procedure in which the caption generation unit generates a caption for the acoustic signal by inputting the extracted feature vector into a recurrent neural network.

A program that causes a computer of a caption generation device to execute:
a procedure of acquiring an acoustic signal;
a procedure of, when generating a spectrogram for the acoustic signal, dividing the spectrogram into one or more fixed-length blocks and inputting the blocks to a convolutional neural network to extract a feature vector; and
a procedure of generating a caption for the acoustic signal by inputting the extracted feature vector into a recurrent neural network.