JP2019097016A

JP2019097016A - Corpus generation device, corpus generation method, and program

Info

Publication number: JP2019097016A
Application number: JP2017224448A
Authority: JP
Inventors: 光穂山本; Mitsuo Yamamoto
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2017-11-22
Filing date: 2017-11-22
Publication date: 2019-06-20

Abstract

To provide a device for generating a corpus for lip reading. SOLUTION: A corpus generation device 1 comprises: a moving picture acquisition unit 10 that acquires a moving picture with subtitles from a website; a lip moving picture generation unit 12 that recognizes a speaker included in the moving picture and generates a lip moving picture of the speaker; a voice recognition unit 13 that recognizes a voice included in the moving picture and obtains the morphemes composing the voice; a text alignment unit 14 that, on the basis of the morphemes obtained by the voice recognition unit and the text data of the subtitles in the moving picture, obtains the time at which the voice corresponding to the text is uttered, thereby generating text with time; a dividing unit 15 that divides the lip moving picture, the voice, and the text with time into data per predetermined unit time; and a corpus storage unit 16 that stores the lip moving picture, the voice, and the text with time. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for generating a corpus for lip reading.

Lip reading is a technique for reading the content of speech from lip movements. Until around the first half of 2016, the recognition accuracy of lip reading was low, but with the remarkable recent progress in machine learning, deep learning in particular, new lip reading techniques have been proposed and their accuracy has improved. For example, Non-Patent Document 1 discloses a technique that takes images as input and estimates the uttered phrases.

Yannis M. Assael et al., "LIPNET: END-TO-END SENTENCE-LEVEL LIPREADING" [online], Cornell University Library, [retrieved October 27, 2017], Internet (URL: https://arxiv.org/pdf/1611.01599.pdf)

The technique described in Non-Patent Document 1 is an estimation method that uses a neural network model, and the neural network must be trained in advance using training data. Training a neural network requires an enormous amount of training data, but creating such data is not easy.

In view of the above background, an object of the present invention is to provide a device that generates a corpus for lip reading.

The corpus generation device of the present invention comprises: a moving image acquisition unit that acquires a subtitled moving image from an external resource; a lip moving image generation unit that recognizes a speaker included in the moving image and generates a lip moving image of the speaker; a speech recognition unit that performs speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech; a text alignment unit that, based on the morphemes obtained by the speech recognition unit and the text data of the subtitles in the moving image, determines the time at which the sound corresponding to the text was uttered and generates time-added text; and a storage unit that stores the lip moving image and the time-added text.

With this configuration, the corpus is generated from moving images acquired from external resources, so the large volume of data available on, for example, websites on the Internet can be used for corpus generation. Whereas databases such as GRID are restricted in vocabulary and grammar, moving images acquired from external resources contain free speech by a wide variety of people, so a highly versatile corpus can be generated. The time-added text may carry times per sentence, per word, or per character.

The corpus generation device of the present invention may include a dividing unit that divides the lip moving image and the time-added text into data of a predetermined unit time. Dividing the data into the same unit time as that used when performing lip reading inference allows appropriate training with the corpus.

In the corpus generation device of the present invention, the text alignment unit may convert kanji contained in the subtitles into hiragana and generate time-added text corresponding to the hiragana. With this configuration, a corpus can be generated even from a moving image whose subtitles contain kanji.

The corpus generation device of the present invention may include a speech sound estimation unit that estimates uttered sounds from a lip moving image, and when a plurality of persons appear in the moving image, the lip moving image generation unit may recognize, as the speaker in the moving image, the person whose estimated speech sound matches the text data of the subtitles. With this configuration, a corpus can be generated even from a moving image in which a plurality of persons are speaking.

In the corpus generation device of the present invention, the storage unit may store the speech together with the lip moving image and the time-added text. With this configuration, a corpus useful for building a model that infers speech sounds from lip images and speech can be generated.

A corpus generation device according to another aspect of the present invention comprises: a moving image acquisition unit that acquires a moving image from an external resource; a lip moving image generation unit that recognizes a speaker included in the moving image and generates a lip moving image of the speaker; a speech recognition unit that performs speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech; a text alignment unit that generates text data representing the morphemes obtained by the speech recognition unit, determines the time at which the sound corresponding to each morpheme was uttered, and generates time-added text; and a storage unit that stores the lip moving image and the time-added text.

The corpus generation method of the present invention is a method by which a corpus generation device generates a corpus using a subtitled moving image acquired from an external resource, the method comprising the steps of: the corpus generation device acquiring a subtitled moving image from an external resource; the corpus generation device recognizing a speaker included in the moving image and generating a lip moving image of the speaker; the corpus generation device performing speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech; the corpus generation device determining, based on the morphemes obtained by the speech recognition and the text data of the subtitles in the moving image, the time at which the sound corresponding to the text was uttered, and generating time-added text; and the corpus generation device storing the lip moving image and the time-added text in a storage unit.

The program of the present invention is a program for generating a corpus using a subtitled moving image acquired from an external resource, the program causing a computer to execute the steps of: acquiring a subtitled moving image from an external resource; recognizing a speaker included in the moving image and generating a lip moving image of the speaker; performing speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech; determining, based on the morphemes obtained by the speech recognition and the text data of the subtitles in the moving image, the time at which the sound corresponding to the text was uttered, and generating time-added text; and storing the lip moving image and the time-added text in a storage unit.

The present invention can generate a corpus based on moving images acquired from external resources such as websites on the Internet.

FIG. 1 is a diagram showing the configuration of the corpus generation device of the first embodiment.
FIG. 2 is a diagram showing the operation of the corpus generation device of the embodiment.
FIG. 3 is a diagram showing an example of a neural network model for lip reading.
FIG. 4 is a diagram showing the configuration of the corpus generation device of the second embodiment.
FIG. 5 is a diagram showing the operation of recognizing a speaker in the second embodiment.

Hereinafter, corpus generation devices according to embodiments of the present invention are described with reference to the drawings. The corpus generation device of the embodiments uses, as the external resource, moving images acquired from websites on the Internet. The corpus generation device of the present embodiment generates a corpus from subtitled moving images acquired from websites. Here, a subtitled moving image is a moving image that carries, as subtitle text data, everything the speaker says, word for word.

Here, a corpus for lip reading is described. A lip reading corpus requires lip moving images and text data indicating what is being said in those lip moving images. Because the text data must indicate which sound is produced for which lip shape, it must be synchronized with the lip moving image. In a subtitled moving image, however, the subtitles show what is being said but carry no data on when each sound is produced. In the present embodiment, a corpus is generated by synchronizing the subtitle text data with the lip moving image.

(First Embodiment)
FIG. 1 is a diagram showing the configuration of the corpus generation device 1 of the embodiment. The corpus generation device 1 includes a moving image acquisition unit 10, which accesses an external resource and acquires subtitled moving image data. The subtitled moving image data consists of video data, audio data, and subtitle text data.

The corpus generation device 1 includes a speaker recognition unit 11, a lip moving image generation unit 12, a speech recognition unit 13, and a text alignment unit 14. The speaker recognition unit 11 performs image processing on the video data to recognize the faces of the persons appearing in the video and their facial feature points (eyes, nose, mouth, etc.). The speaker recognition unit 11 then recognizes, as the speaker, a person whose lips are moving, that is, opening and closing. A method of recognizing the speaker when the lips of a plurality of persons are moving at the same time is described in the second embodiment.
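The rule that the person whose lips open and close is the speaker can be sketched as follows. This is only an illustration under stated assumptions: the per-frame mouth opening values, the function names, and the `min_movement` threshold are hypothetical and do not come from the source.

```python
def lip_movement(openings):
    """Range of the mouth opening (e.g. in pixels) over a sequence of frames."""
    return max(openings) - min(openings)


def find_single_speaker(tracks, min_movement=2.0):
    """Return the id of the only person whose lips move, or None.

    tracks: dict mapping a person id to per-frame mouth openings
    (hypothetical input: distance between upper and lower lip landmarks).
    min_movement: assumed threshold, not a value from the source.
    """
    moving = [pid for pid, openings in tracks.items()
              if openings and lip_movement(openings) >= min_movement]
    # Zero or several moving mouths cannot be resolved by movement alone.
    return moving[0] if len(moving) == 1 else None


tracks = {"A": [3.0, 3.1, 3.0], "B": [2.0, 8.0, 3.0, 9.0]}
print(find_single_speaker(tracks))  # B
```

When more than one mouth is moving, the sketch returns `None`, which corresponds to the case the second embodiment handles.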

The lip moving image generation unit 12 generates a lip moving image by cropping the lip region of the speaker recognized by the speaker recognition unit 11. In the present embodiment, the cropped lip region is defined as follows: taking the distance from the center of the lips to each of the left and right ends of the lips as "1", the region extends a distance of "3" to the left and to the right of the center; likewise, in the vertical direction, taking the distance from the center of the lips to each of the upper and lower ends of the lips as "1", the region extends a distance of "3" above and below the center. This is only an example, and the region of the lip image can be set as appropriate. The lip moving image generation unit 12 inputs the generated lip moving image to the dividing unit 15.
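The crop region described above can be written out as a short calculation. The sketch below assumes that the lip center is the midpoint of the lip bounding box and uses image coordinates with the origin at the top left; the `Box` type and the function name are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def lip_crop_box(lip: Box, scale: float = 3.0) -> Box:
    """Expand the lip bounding box around its center.

    The distance from the lip center to each lip edge counts as 1;
    the crop extends `scale` times that distance in each direction.
    """
    cx = (lip.left + lip.right) / 2.0
    cy = (lip.top + lip.bottom) / 2.0
    half_w = (lip.right - lip.left) / 2.0
    half_h = (lip.bottom - lip.top) / 2.0
    return Box(cx - scale * half_w, cy - scale * half_h,
               cx + scale * half_w, cy + scale * half_h)


print(lip_crop_box(Box(40, 50, 60, 58)))
# Box(left=20.0, top=42.0, right=80.0, bottom=66.0)
```

A practical implementation would also clamp the box to the frame boundaries, which is omitted here.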

The speech recognition unit 13 performs speech recognition on the audio data to obtain the morphemes and sentences constituting the speech. The text alignment unit 14 synchronizes the subtitle text data with the timing at which the speech is produced. That is, the text alignment unit 14 attaches times to the text data by matching the morphemes and sentences obtained by speech recognition against the text data. Specifically, it attaches data indicating the time at which the utterance of each sentence in the text data started and the time at which it ended.

Even when part of the recognition result from the speech recognition unit 13 contains errors, the text alignment unit 14 can still assign a time to each sentence of the text data. For example, suppose the subtitle data is 「本日の天気は晴れでした。」 ("Today's weather was sunny.") and the speech recognition result is 「ほんじつのけんきははれでした。」, in which "tenki" (てんき) has been misrecognized as "kenki" (けんき). Because the two sentences still match closely, the time at which the sentence 「ほんじつのけんきははれでした」 was uttered can be attached to the subtitle data 「本日の天気は晴れでした。」.
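The error-tolerant matching in this example can be illustrated with a small sketch. The source does not specify how the degree of agreement is computed; the sketch below swaps in the similarity ratio of `difflib.SequenceMatcher` from the Python standard library. It assumes the subtitle has already been converted to a hiragana reading, and the `RecognizedSentence` type and the `min_ratio` threshold are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class RecognizedSentence:
    reading: str   # hiragana reading produced by speech recognition
    start: float   # utterance start time in seconds
    end: float     # utterance end time in seconds


def align_subtitle(subtitle_reading, recognized, min_ratio=0.8):
    """Return (start, end) of the recognized sentence closest to the subtitle.

    Returns None when no recognized sentence reaches min_ratio
    (an assumed threshold, not a value from the source).
    """
    best, best_ratio = None, 0.0
    for sent in recognized:
        ratio = SequenceMatcher(None, subtitle_reading, sent.reading).ratio()
        if ratio > best_ratio:
            best, best_ratio = sent, ratio
    if best is None or best_ratio < min_ratio:
        return None
    return (best.start, best.end)


recognized = [
    RecognizedSentence("おはようございます", 0.0, 1.2),
    RecognizedSentence("ほんじつのけんきははれでした", 1.5, 4.0),
]
print(align_subtitle("ほんじつのてんきははれでした", recognized))  # (1.5, 4.0)
```

The single misrecognized character lowers the ratio only slightly, so the subtitle still receives the times of the correct utterance.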

Attaching times to the text data with the text alignment unit 14 aligns the timing of the speech and the text data. Because the audio and the video are already synchronized, the text data can then be synchronized with the video. In other words, synchronizing the lip moving image with the text data makes it possible to generate a corpus for lip reading.

The dividing unit 15 divides the generated lip moving image, speech, and time-added text to generate data of a predetermined unit time. The unit time here is the same as the unit time used when performing lip reading inference; for example, 3 seconds. If a sentence is shorter than 3 seconds, blank data is padded so that the data fills the 3-second unit time. Conversely, if a sentence is 3 seconds or longer, the sentence is split at a grammatically natural break, and blank data is then padded so that each piece fills the 3-second unit time. The resulting sets of unit-time lip moving image, speech, and time-added text are stored in the corpus storage unit 16. Although splitting is given here as the example for sentences longer than the unit time, a corpus with a longer unit time (for example, 5 seconds) may be generated instead in that case.
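The division into unit-time data can be sketched as follows. The sketch assumes that the times of grammatically natural breaks are supplied by some other component (the `boundaries` argument is hypothetical), reports the amount of blank padding rather than generating blank frames, and falls back to a hard cut when no break fits, which is a choice made for this example only.

```python
def split_into_units(start, end, boundaries, unit=3.0):
    """Split the sentence interval [start, end] into pieces of at most `unit` seconds.

    Cuts are made only at the given boundary times when possible.
    Returns a list of (piece_start, piece_end, padding_seconds), where
    padding_seconds is the blank time needed to fill the unit.
    """
    cuts = sorted(b for b in boundaries if start < b < end) + [end]
    pieces = []
    piece_start = start
    while piece_start < end:
        # Furthest cut that keeps the piece within the unit time.
        fitting = [c for c in cuts if piece_start < c <= piece_start + unit]
        piece_end = max(fitting) if fitting else piece_start + unit
        pieces.append((piece_start, piece_end, unit - (piece_end - piece_start)))
        piece_start = piece_end
    return pieces


print(split_into_units(0.0, 2.0, []))     # [(0.0, 2.0, 1.0)]
print(split_into_units(0.0, 5.0, [2.5]))  # [(0.0, 2.5, 0.5), (2.5, 5.0, 0.5)]
```

The first call is a 2-second sentence padded with 1 second of blank data; the second is a 5-second sentence split at a break at 2.5 seconds, with each piece padded by 0.5 seconds.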

FIG. 2 is a diagram showing the flow of operation of the corpus generation device 1. In FIG. 2, thick frames indicate processes of the corpus generation device 1, and thin frames indicate data. The corpus generation device 1 acquires a subtitled moving image D10 from a website on the Internet. The subtitled moving image D10 consists of video D11, audio D12, and subtitle D13 data.

As processing on the video D11, the corpus generation device 1 recognizes the faces of the persons appearing in the video D11 and their facial feature points (S10). The corpus generation device 1 then identifies the person who is uttering the speech, that is, the speaker (S11). The dotted line drawn from the audio D12 data to the speaker recognition process (S11) indicates that the audio D12 data may be used in speaker recognition when a plurality of persons are speaking at the same time. This is described in the second embodiment. In the present embodiment, it is assumed either that only one person appears in the video D11 or that, even if a plurality of persons appear in the video D11, only one of them is speaking.

As processing on the text data of the subtitles D13, the corpus generation device 1 performs text alignment (S12). The audio D12 data is used for the text alignment. As described above, speech recognition is performed on the audio D12 to obtain the morphemes constituting the speech, and the speech recognition result is used to attach times to the text data. The text alignment process (S12) generates time-added text D15.

The corpus generation device 1 divides the generated lip moving image S14, audio D12, and time-added text D15 sentence by sentence and processes them to the predetermined unit time (by padding with blank data), generating a unit-time lip moving image D16, unit-time audio D17, and unit-time text D18. In FIG. 2, the lip moving image D16 and the audio D17 are also labeled "time-added"; this is noted only for confirmation, because moving images and audio inherently carry times that indicate their playback timing.

The configuration of the corpus generation device 1 of the present embodiment has been described above. An example of the hardware of the corpus generation device 1 is a computer equipped with a CPU, RAM, ROM, a hard disk, a display, a keyboard, a mouse, a communication interface, and the like. A program having modules that implement the functions of the speaker recognition unit 11, lip moving image generation unit 12, speech recognition unit 13, text alignment unit 14, and dividing unit 15 described above is stored in the RAM or ROM, and the functions of the corpus generation device 1 are realized by the CPU executing the program. Such a program is also included in the scope of the present invention. An example of the moving image acquisition unit 10 is the communication interface.

The configuration and operation of the corpus generation device 1 of the first embodiment have been described above. The corpus generation device 1 of the present embodiment can generate a corpus based on moving images acquired from external resources such as websites on the Internet. This removes the need to produce moving images specifically for corpus generation and allows a highly versatile corpus to be generated from the large volume of resources available on websites. The generated corpus can be used for training for lip reading inference and also for validating the trained model.

Here, training of a neural network that performs lip reading inference is described. FIG. 3 is a diagram showing a neural network model that takes a lip moving image and speech as input and outputs text indicating what words were uttered. The neural network model has (1) an audio processing unit and (2) an image processing unit. Each of (1) the audio processing unit and (2) the image processing unit has three layers, each consisting of a combination of an STCNN (Spatiotemporal Convolutional Neural Network) layer and a spatial pooling layer, followed by two GRU (Gated Recurrent Unit) layers.

Next, the networks of (1) the audio processing unit and (2) the image processing unit are merged and fed into (3) a language processing unit. (3) The language processing unit has two GRU layers, followed by a Linear layer and a CTC (Connectionist Temporal Classification) loss layer. The lip moving image of the corpus is input to the image processing unit, and the speech is input to the audio processing unit. Training is performed by feeding the error between the resulting text data and the time-added text of the corpus back into the neural network by backpropagation.

(Second Embodiment)
FIG. 4 is a diagram showing the configuration of the corpus generation device 2 of the second embodiment. The basic configuration of the corpus generation device 2 of the second embodiment is the same as that of the corpus generation device 1 of the first embodiment, but the corpus generation device 2 additionally has a configuration for recognizing, from among a plurality of persons, the speaker corresponding to the subtitle text when a plurality of persons appearing in the subtitled moving image are speaking at the same time.

The corpus generation device 2 has a speech sound estimation unit 17. The speech sound estimation unit 17 estimates, by lip reading, the speech sound of a person speaking in the moving image from that person's lip images. The speech sound estimation unit 17 reads a neural network model for lip reading inference from a model storage unit 18 and inputs the lip moving image to the model to estimate the speech sound. The neural network model stored in the model storage unit 18 is one trained by a learning unit 19 using the corpus stored in the corpus storage unit 16.

FIG. 5 is a diagram showing the process by which the corpus generation device 2 of the second embodiment identifies the speaker from among a plurality of persons. This operation is the flow corresponding to the speaker recognition (S11) shown in FIG. 2, and, as indicated by the dotted line in FIG. 2, the audio D12 data is also used to recognize the speaker.

The speaker recognition unit 11 identifies the lip region of each of the plurality of persons in the moving image (S20). The speaker recognition unit 11 inputs the data of the identified lip regions to the speech sound estimation unit 17. The speech sound estimation unit 17 estimates the speech sound of each person based on the movement of that person's lips (S21 to S22) and inputs the estimated speech sounds to the speaker recognition unit 11. The speaker recognition unit 11 compares the speech sounds of the plurality of persons with the subtitle text data and recognizes, as the speaker, the person making the utterance that corresponds to the subtitles (S23). As the above description shows, the lip reading performed by the speech sound estimation unit 17 only needs to be accurate enough to identify the speaker.
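The comparison in step S23 can be sketched as picking the person whose estimated speech sound is most similar to the subtitle. The similarity measure is not specified in the source; the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in and assumes that both the lip reading estimates and the subtitle are available as hiragana strings. The function name and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def pick_speaker(estimates, subtitle_reading):
    """Return the person id whose lip-read estimate best matches the subtitle.

    estimates: dict mapping a person id to the reading estimated from
    that person's lip moving image. Must not be empty.
    """
    def similarity(person_id):
        return SequenceMatcher(None, estimates[person_id], subtitle_reading).ratio()

    return max(estimates, key=similarity)


estimates = {
    "person_1": "きょうはさむいですね",
    "person_2": "ほんじつのてんきははれてした",  # imperfect lip reading result
}
print(pick_speaker(estimates, "ほんじつのてんきははれでした"))  # person_2
```

Even with one wrong character in the estimate, the correct person is selected, which matches the remark that the lip reading only needs to be accurate enough to tell the speakers apart.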

The above has described the configuration that the corpus generation device 2 of the second embodiment adds to the corpus generation device 1 of the first embodiment. Because the corpus generation device 2 of the second embodiment can recognize the speaker even in a moving image in which a plurality of persons are speaking, the range of moving images that can be used for corpus generation is widened.

The corpus generation device and corpus generation method of the present invention have been described in detail above with reference to embodiments, but the present invention is not limited to the embodiments described above.
In the second embodiment described above, a configuration was described for recognizing, as the speaker, the person corresponding to the subtitle text data when a plurality of persons appearing in the video are speaking. However, the method of recognizing the speaker from among a plurality of persons is not limited to the method described in the second embodiment. For example, for the subtitled image, the question "Who is the person making the utterance corresponding to the subtitles?" may be generated automatically together with answer choices consisting of the persons who are speaking and put to a large number of Internet users (so-called crowdsourcing), and the speaker may be identified based on the tallied results.
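Tallying the crowdsourced answers can be sketched as a simple majority vote. The source says only that the speaker is identified from the tallied results; the `min_share` requirement below is an assumption added for this example, as are the function name and the vote labels.

```python
from collections import Counter


def speaker_from_votes(votes, min_share=0.5):
    """Return the most-voted candidate, or None if the vote is not decisive.

    votes: list of candidate labels chosen by the users who answered.
    min_share: assumed minimum share of votes the winner must exceed.
    """
    if not votes:
        return None
    candidate, count = Counter(votes).most_common(1)[0]
    return candidate if count / len(votes) > min_share else None


print(speaker_from_votes(["left", "left", "right", "left"]))  # left
print(speaker_from_votes(["left", "right"]))                  # None
```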

In the embodiments described above, an example was given of generating a corpus consisting of data sets of lip moving image, speech, and time-added text, but a corpus consisting of sets of lip moving image and time-added text may be generated without storing the speech data.

In the embodiments described above, an example was given of using subtitled moving images as the material for corpus generation, but when the speech recognition by the speech recognition unit 13 is sufficiently accurate, a corpus can be generated even without subtitles. That is, by using the data of the morphemes constituting the speech recognized by the speech recognition unit 13 as the time-added text, the matching against subtitle text data can be omitted.
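Building time-added text directly from recognized morphemes can be sketched as follows. The `Morpheme` record with per-morpheme start and end times is a hypothetical representation of the speech recognition output, not a format defined in the source.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str   # recognized text of the morpheme
    start: float   # start time in seconds
    end: float     # end time in seconds


def timed_sentence(morphemes):
    """Join recognized morphemes into one sentence with start and end times."""
    if not morphemes:
        raise ValueError("at least one morpheme is required")
    text = "".join(m.surface for m in morphemes)
    return (text, morphemes[0].start, morphemes[-1].end)


morphemes = [
    Morpheme("本日", 0.0, 0.5),
    Morpheme("の", 0.5, 0.6),
    Morpheme("天気", 0.6, 1.1),
]
print(timed_sentence(morphemes))  # ('本日の天気', 0.0, 1.1)
```

Keeping the per-morpheme times instead of collapsing them would give word-level rather than sentence-level time-added text.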

In the embodiments described above, data indicating the start time and end time of each sentence was attached to the time-added text, but instead of per sentence, the start time and end time of the utterance may be attached per word, or an utterance time may be attached per character.

The present invention is useful as, for example, a device that generates a corpus for lip reading, in which speech sounds are estimated from lip moving images.

1, 2 Corpus generation device
10 Moving image acquisition unit
11 Speaker recognition unit
12 Lip moving image generation unit
13 Speech recognition unit
14 Text alignment unit
15 Dividing unit
16 Corpus storage unit
17 Speech sound estimation unit
18 Model storage unit
19 Learning unit

Claims

1. A corpus generation device comprising:
a moving image acquisition unit that acquires a subtitled moving image from an external resource;
a lip moving image generation unit that recognizes a speaker included in the moving image and generates a lip moving image of the speaker;
a speech recognition unit that performs speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech;
a text alignment unit that, based on the morphemes obtained by the speech recognition unit and the text data of the subtitles in the moving image, determines the time at which the sound corresponding to the text was uttered and generates time-added text; and
a storage unit that stores the lip moving image and the time-added text.

2. The corpus generation device according to claim 1, further comprising a dividing unit that divides the lip moving image and the time-added text into data of a predetermined unit time.

3. The corpus generation device according to claim 1, wherein the text alignment unit converts kanji contained in the subtitles into hiragana characters to generate time-added text corresponding to the hiragana characters.

4. The corpus generation device according to any one of claims 1 to 3, further comprising a speech sound estimation unit that estimates uttered sounds from a lip moving image,
wherein, when a plurality of persons appear in the moving image, the lip moving image generation unit recognizes, as the speaker in the moving image, the person whose estimated speech sound matches the text data of the subtitles.

5. The corpus generation device according to any one of claims 1 to 4, wherein the speech is stored in the storage unit together with the lip moving image and the time-added text.

6. A corpus generation device comprising:
a moving image acquisition unit that acquires a moving image from an external resource;
a lip moving image generation unit that recognizes a speaker included in the moving image and generates a lip moving image of the speaker;
a speech recognition unit that performs speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech;
a text alignment unit that generates text data representing the morphemes obtained by the speech recognition unit, determines the time at which the sound corresponding to each morpheme was uttered, and generates time-added text; and
a storage unit that stores the lip moving image and the time-added text.

7. A corpus generation method by which a corpus generation device generates a corpus using a subtitled moving image acquired from an external resource, the method comprising the steps of:
the corpus generation device acquiring a subtitled moving image from an external resource;
the corpus generation device recognizing a speaker included in the moving image and generating a lip moving image of the speaker;
the corpus generation device performing speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech;
the corpus generation device determining, based on the morphemes obtained by the speech recognition and the text data of the subtitles in the moving image, the time at which the sound corresponding to the text was uttered, and generating time-added text; and
the corpus generation device storing the lip moving image and the time-added text in a storage unit.

8. A program for generating a corpus using a subtitled moving image acquired from an external resource, the program causing a computer to execute the steps of:
acquiring a subtitled moving image from an external resource;
recognizing a speaker included in the moving image and generating a lip moving image of the speaker;
performing speech recognition on the speech included in the moving image to obtain the morphemes constituting the speech;
determining, based on the morphemes obtained by the speech recognition and the text data of the subtitles in the moving image, the time at which the sound corresponding to the text was uttered, and generating time-added text; and
storing the lip moving image and the time-added text in a storage unit.