JP2018206110A

JP2018206110A - Device for automatically generating interactive-sentence containing moving picture

Info

Publication number: JP2018206110A
Application number: JP2017111476A
Authority: JP
Inventors: 勉兼安; Tsutomu Kaneyasu; 整山田; Hitoshi Yamada; 生聖渡部; Seisho Watabe; 智哉高谷; Tomoya Takatani
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2017-06-06
Filing date: 2017-06-06
Publication date: 2018-12-27
Anticipated expiration: 2037-06-06
Also published as: JP6900792B2

Abstract

To provide a technique capable of generating a moving picture to which an interactive sentence is added, in a device for automatically generating a moving picture from a plurality of still images. SOLUTION: A device for automatically generating an interactive-sentence containing moving picture according to the present invention computes a degree of image similarity between two still images among a plurality of still images input by an image input device. In addition, for each of the plurality of still images, the device extracts keywords for identifying objects included in the still image and generates interactive sentences related to the extracted keywords. The device then determines a reproduction order of the plurality of still images based on the degree of image similarity and a degree of linkage of the interactive sentences, and outputs the plurality of still images and subtitles or voices of the interactive sentences corresponding to the individual still images according to the reproduction order, thereby automatically generating an interactive-sentence containing moving picture. SELECTED DRAWING: Figure 5

Description

The present invention relates to an apparatus that automatically generates a moving image to which dialogue sentences are added.

In recent years, apparatuses that automatically generate a moving image from a plurality of still images have been developed. For example, Patent Document 1 proposes a technique for such an apparatus in which a narration for each still image is generated from the meta information of that still image, and the generated narration is associated with the still image, thereby producing a narrated moving image.

JP 2006-287521 A; JP 2013-101450 A; JP 2008-299493 A

In the field of automatically generating moving images from a plurality of still images, demands regarding the quality and variety of the generated moving images are increasing. For example, a technique is desired for generating, from a plurality of still images, a moving image to which dialogue-style audio or subtitles are added.

The present invention has been made in view of the above circumstances, and its object is to provide a technique capable of generating a moving image to which dialogue sentences are added, in an apparatus that automatically generates a moving image from a plurality of still images.

The apparatus for automatically generating a dialogue sentence moving image according to the present invention is an apparatus that automatically generates a moving image from a plurality of still images, and that generates the moving image by automatically adding a dialogue sentence related to each still image.

More specifically, the apparatus for automatically generating a dialogue sentence moving image according to the present invention includes: image input means for inputting a plurality of still images; image similarity calculation means for calculating the degree of image similarity between two still images among the plurality of still images input by the image input means; keyword extraction means for extracting, for each of the plurality of still images input by the image input means, a keyword for identifying an object included in that still image; dialogue sentence generation means for generating, for each of the plurality of still images input by the image input means, a dialogue sentence related to the keyword extracted by the keyword extraction means; reproduction order determination means for determining the reproduction order of the plurality of still images based on the degree of image similarity calculated by the image similarity calculation means and the degree of connection of the dialogue sentences generated by the dialogue sentence generation means; and reproduction means for outputting, in accordance with the reproduction order determined by the reproduction order determination means, the plurality of still images and the subtitles or audio of the dialogue sentence corresponding to each still image.

According to such an apparatus, when a moving image is generated from a plurality of still images, a dialogue sentence related to the object included in each still image is generated automatically. The reproduction order of the still images is then determined based on the degree of image similarity and the degree of connection of the dialogue sentences, and each still image is output together with its corresponding dialogue sentence in that order. The dialogue sentence may be output as subtitles or as audio. A moving image generated in this way is higher in quality and richer in variety than a moving image generated by considering only the continuity of the images, or a moving image to which a mere description of each image is added.

Note that the "degree of connection of dialogue sentences" here is the degree of similarity between the dialogue sentences of two still images. This degree of similarity may be obtained, for example, by expressing the keywords of each still image as vectors and calculating the vector difference between the still images. Alternatively, it may be obtained by expressing words other than the keywords contained in the dialogue sentence of each still image as vectors and calculating the vector difference between the still images. In either method, the smaller the vector difference between two still images, the higher the degree of connection of their dialogue sentences may be considered to be.

The reproduction order determination means in the present invention may calculate a cost value by adding the value obtained by multiplying the degree of image similarity calculated by the image similarity calculation means by a first weighting factor and the value obtained by multiplying the degree of connection of the dialogue sentences generated by the dialogue sentence generation means by a second weighting factor, and may determine the reproduction order of the plurality of still images based on the calculated cost value. Such a configuration makes it possible to generate moving images that are richer in variety.

Here, the reproduction order determination means may determine the first weighting factor and the second weighting factor so that the weight of the degree of dialogue sentence connection relative to the degree of image similarity is larger when the generated moving image is long than when it is short. With such a configuration, when the generated moving image is short, it is generated with more emphasis on the degree of image similarity than on the degree of dialogue sentence connection, so a moving image whose image transitions feel natural can be generated. When the generated moving image is long, it is generated with more emphasis on the degree of dialogue sentence connection than on the degree of image similarity, so a moving image with a strong storyline can be generated. The quality of the moving image can thus be improved further.

The reproduction order determination means may also determine the first weighting factor and the second weighting factor so that the weight of the degree of dialogue sentence connection relative to the degree of image similarity changes partway through the generated moving image. For example, the two weighting factors may be determined so that this weight differs between the first half and the second half of the moving image, or so that it differs among the opening, middle, and closing parts of the moving image. The weight may be changed stepwise as described above, or it may be changed continuously from the start to the end of the moving image. Changing the weight of the degree of dialogue sentence connection relative to the degree of image similarity within a single moving image in this way makes it possible to generate moving images that are richer in variety.
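The stepwise and continuous weight changes described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the normalized position in [0, 1], the linear interpolation, and all numeric weights are hypothetical and do not come from the patent text.

```python
def stepwise_weights(position, segments):
    """Return (w1, w2) for a normalized position in [0, 1].

    `segments` is a list of (upper_bound, w1, w2) tuples sorted by
    upper_bound; the last upper_bound is expected to be 1.0.
    """
    for upper, w1, w2 in segments:
        if position <= upper:
            return w1, w2
    raise ValueError("position must lie in [0, 1]")


def continuous_weights(position, start=(0.8, 0.2), end=(0.2, 0.8)):
    """Linearly interpolate (w1, w2) from `start` to `end` over the video."""
    w1 = start[0] + (end[0] - start[0]) * position
    w2 = start[1] + (end[1] - start[1]) * position
    return w1, w2


# Hypothetical schedule: image similarity dominates the first half,
# dialogue sentence connection dominates the second half.
halves = [(0.5, 0.7, 0.3), (1.0, 0.3, 0.7)]
first_half = stepwise_weights(0.25, halves)
second_half = stepwise_weights(0.75, halves)
midpoint = continuous_weights(0.5)
```

A three-part schedule (opening, middle, closing) would simply use three entries in `segments`.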

The dialogue sentence generation means in the present invention may generate a plurality of dialogue sentence candidates for one still image. In that case, the reproduction order determination means may select, from among the plurality of candidates, one dialogue sentence suited to the still image based on the appearance positions of the words contained in each candidate. Such a configuration makes it possible to assign a higher-quality dialogue sentence to each still image.

The present invention can also be understood as a method for automatically generating a dialogue sentence moving image that includes at least part of the above processing. For example, the method may include: a step of inputting a plurality of still images; a step of calculating the degree of image similarity between two still images among the input still images; a step of extracting, for each of the input still images, a keyword for identifying an object included in that still image; a step of generating, for each of the input still images, a dialogue sentence related to the extracted keyword; a step of determining the reproduction order of the plurality of still images based on the degree of connection of the generated dialogue sentences and the degree of image similarity; and a step of outputting, in accordance with the determined reproduction order, the plurality of still images and the subtitles or audio of the dialogue sentence corresponding to each still image.

The present invention can also be understood as a program for implementing the above method for automatically generating a dialogue sentence moving image, or as a recording medium on which that program is recorded. The means and processes described above can be combined with one another wherever possible to constitute the present invention.

According to the present invention, a moving image to which dialogue sentences are added can be generated in an apparatus that automatically generates a moving image from a plurality of still images.

FIG. 1 is a block diagram showing a schematic configuration of the apparatus for automatically generating a dialogue sentence moving image according to the present invention.
FIG. 2 is a diagram showing an example of a still image input by the image input device.
FIG. 3 is a diagram showing an example of a method of identifying objects included in a still image.
FIG. 4 is a diagram showing an example of a table that relates feature words extracted from a dialogue sentence to the reference rank of each feature word's appearance position in a dialogue sentence.
FIG. 5 is a diagram showing the operation flow when the computer automatically creates a dialogue sentence moving image.

Specific embodiments of the present invention are described below with reference to the drawings. Unless otherwise stated, the dimensions, materials, shapes, relative arrangements, and the like of the components described in this embodiment are not intended to limit the technical scope of the invention to those alone.

FIG. 1 is a block diagram showing a schematic configuration of the apparatus for automatically generating a dialogue sentence moving image according to the present invention. As shown in FIG. 1, the automatic generation apparatus 1 includes an image input device 2, an output device 3, and a computer 4. The image input device 2 is a device that inputs still images, such as an image scanner or a digital camera. The image input device 2 may input still images via a network, or from a recording medium or the like. The output device 3 is a device that outputs a moving image with dialogue sentences, and consists of, for example, a display, or a combination of a display and a speaker. The computer 4 is a personal computer, a workstation, or the like that includes a CPU, ROM, RAM, a hard disk, a user interface (for example, a keyboard, mouse, or touch panel), and so on.

The computer 4 includes an image similarity calculation unit 40, a keyword extraction unit 41, a dialogue sentence generation unit 42, a reproduction order determination unit 43, and a reproduction unit 44. These functional units are implemented, for example, by the CPU executing a program stored in a storage device such as the hard disk.

(Image similarity calculation unit 40)
For each of the plurality of still images input by the image input device 2, the image similarity calculation unit 40 calculates the degree of image similarity between that still image and the other still images. This calculation is performed for every combination of two still images among the still images that serve as material for the moving image. For example, when three still images A, B, and C are input by the image input device 2 as material, the degree of image similarity is calculated for the combinations (A, B), (B, C), and (A, C).

The degree of image similarity may be evaluated using, for example, the difference in pixel values. In that case, the smaller the difference in pixel values between two still images, the higher their degree of image similarity is evaluated to be. The degree of image similarity may also be evaluated using image features other than pixel values. The degree of image similarity obtained in this way is stored in a storage device such as the RAM together with information identifying the combination of the two still images.
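A minimal sketch of the pairwise pixel-difference evaluation follows. It assumes same-sized grayscale images represented as lists of pixel rows and uses the mean absolute difference as the measure; the patent does not specify the exact metric, so both choices and all function names are illustrative assumptions.

```python
from itertools import combinations


def mean_abs_pixel_difference(img_a, img_b):
    """Mean absolute pixel difference between two same-sized grayscale images."""
    if len(img_a) != len(img_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(img_a, img_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count


def pairwise_differences(images):
    """Compute the difference for every unordered pair of named images.

    A smaller value means a higher degree of image similarity.
    """
    return {
        (name_a, name_b): mean_abs_pixel_difference(images[name_a], images[name_b])
        for name_a, name_b in combinations(sorted(images), 2)
    }


# Toy 2x2 images standing in for still images A, B, and C.
images = {
    "A": [[0, 0], [0, 0]],
    "B": [[10, 10], [10, 10]],
    "C": [[0, 0], [0, 8]],
}
differences = pairwise_differences(images)
```

Keying the result by the pair of image names mirrors the description's note that each value is stored together with information identifying the combination.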

(Keyword extraction unit 41)
For each of the plurality of still images input by the image input device 2, the keyword extraction unit 41 extracts a keyword for identifying the subject (object) included in that still image. For example, as shown in FIG. 2, when the object in a still image A is a red car, "red car" is extracted as the keyword of still image A. A single still image may also contain a plurality of objects. Therefore, as shown in FIG. 3, each still image may be divided into a plurality of regions (four regions in the example of FIG. 3) and a keyword may be extracted for each region. For example, a plurality of keywords such as "red car", "man", and "traffic light" are extracted as the keywords of the still image B shown in FIG. 3.
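The region-wise extraction can be sketched as below, assuming an image given as a list of pixel rows and a four-way split into quadrants. The `classify_region` callable is a hypothetical stand-in for the machine-learned classifier; the toy labelling rule used in the demonstration is purely illustrative.

```python
def split_into_quadrants(image):
    """Split an image (list of pixel rows) into four rectangular regions."""
    mid_row = len(image) // 2
    mid_col = len(image[0]) // 2
    top, bottom = image[:mid_row], image[mid_row:]
    return [
        [row[:mid_col] for row in top],
        [row[mid_col:] for row in top],
        [row[:mid_col] for row in bottom],
        [row[mid_col:] for row in bottom],
    ]


def extract_keywords(image, classify_region):
    """Classify each quadrant and keep the unique labels in region order."""
    keywords = []
    for region in split_into_quadrants(image):
        label = classify_region(region)
        if label is not None and label not in keywords:
            keywords.append(label)
    return keywords


# Stand-in classifier: maps the first pixel value of a region to a label.
labels = {1: "red car", 2: "man", 3: "traffic light"}
toy_image = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 3, 3],
    [0, 0, 3, 3],
]
keywords = extract_keywords(toy_image, lambda region: labels.get(region[0][0]))
```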

A classifier created by machine learning can be used for the keyword extraction described above. Such a classifier can be created by preparing a large amount of training data consisting of pairs of a still image and the objects it contains ("car", "man", "traffic light", and so on) and applying an existing machine learning algorithm.

(Dialogue sentence generation unit 42)
The dialogue sentence generation unit 42 generates a dialogue sentence corresponding to each still image based on the keywords extracted by the keyword extraction unit 41. A "dialogue sentence" here consists, in principle, of a combination of several utterances, although a single utterance may also constitute one dialogue sentence. In principle, a dialogue sentence also contains a keyword extracted by the keyword extraction unit 41 or a word similar to that keyword. For example, for the still image A illustrated in FIG. 2, a dialogue sentence consisting of several utterances such as "The red car looks cool, doesn't it?" and "Yes, it does" is generated as a dialogue sentence containing the keyword "red car" (赤色の自動車) or the similar expression 赤いクルマ. For the still image B illustrated in FIG. 3, a dialogue sentence consisting of several utterances such as "The man by the traffic light seems to like red cars" and "I like white cars" is created.

A classifier created by machine learning can also be used to create the dialogue sentences described above. Such a classifier can be created by preparing a large amount of training data consisting of pairs of a keyword and a dialogue sentence related to that keyword and applying an existing machine learning algorithm.

(Reproduction order determination unit 43)
The reproduction order determination unit 43 determines the order in which the plurality of still images input by the image input device 2 are reproduced continuously as a moving image, based on the degree of image similarity calculated by the image similarity calculation unit 40 and the degree of connection of the dialogue sentences generated by the dialogue sentence generation unit 42. For example, the reproduction order determination unit 43 first calculates a cost value for every combination of two still images according to the following equation (1).
Cost value = w1 × (degree of image similarity) + w2 × (degree of dialogue sentence connection) ... (1)
In equation (1), w1 is the first weighting factor and w2 is the second weighting factor. These two factors weight the transition between still images and the transition between dialogue sentences on the basis of subjective evaluation. They may be fixed values, or they may be variable values that change according to the length of the generated moving image (more specifically, the number of still images input by the image input device 2). When the generated moving image is long, w1 and w2 may be determined so that the degree of dialogue sentence connection is weighted more heavily than the degree of image similarity (w1 < w2). When the generated moving image is short, w1 and w2 may be determined so that the degree of image similarity is weighted more heavily than the degree of dialogue sentence connection (w1 > w2). The two factors may also be changed partway through a single moving image. For example, they may be set to different values in the first half and the second half of the moving image, or to different values in its opening, middle, and closing parts. They may also be changed continuously from the start to the end of the moving image.
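Equation (1) and a length-dependent choice of weights can be sketched as follows. Several things here are assumptions rather than the patent's method: both terms are expressed as differences (smaller means more similar), so a lower cost means a smoother transition; the ordering uses a simple greedy nearest-neighbor pass, since the text does not state how the order is derived from the cost values; and the threshold of 20 images and all numeric weights are hypothetical.

```python
def choose_weights(num_images, long_threshold=20):
    """Hypothetical rule: many images (long video) favor dialogue connection
    (w1 < w2); few images (short video) favor image similarity (w1 > w2)."""
    return (0.3, 0.7) if num_images >= long_threshold else (0.7, 0.3)


def cost(pair, image_diff, dialogue_diff, w1, w2):
    """Equation (1) applied to one unordered pair of still images."""
    key = tuple(sorted(pair))
    return w1 * image_diff[key] + w2 * dialogue_diff[key]


def greedy_order(names, image_diff, dialogue_diff, w1, w2):
    """Start from the first name and repeatedly append the unused image
    with the lowest cost relative to the last image chosen."""
    order = [names[0]]
    remaining = set(names[1:])
    while remaining:
        last = order[-1]
        next_name = min(
            sorted(remaining),
            key=lambda n: cost((last, n), image_diff, dialogue_diff, w1, w2),
        )
        order.append(next_name)
        remaining.remove(next_name)
    return order


# Toy difference values for three still images.
image_diff = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.5}
dialogue_diff = {("A", "B"): 0.1, ("A", "C"): 0.8, ("B", "C"): 0.4}

w1, w2 = choose_weights(num_images=3)
short_order = greedy_order(["A", "B", "C"], image_diff, dialogue_diff, w1, w2)
long_order = greedy_order(["A", "B", "C"], image_diff, dialogue_diff, 0.3, 0.7)
```

With the short-video weights, image C follows A because the two look alike; with the long-video weights, B follows A because their dialogue sentences connect better. A greedy pass does not guarantee the lowest total cost over the whole sequence.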

The "degree of connection of dialogue sentences" mentioned above is the degree of similarity between the dialogue sentences of two still images. It may be obtained, for example, by expressing the keywords of each still image as vectors and calculating the vector difference between the still images. Alternatively, it may be obtained by expressing words other than the keywords contained in the dialogue sentence of each still image as vectors and calculating the vector difference between the still images. In either method, the smaller the vector difference between two still images, the higher the degree of connection of their dialogue sentences is considered to be.
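One way to picture the vector difference is the sketch below. It assumes each keyword already has a vector, averages the keyword vectors of an image, and takes the Euclidean distance between the averages. The three-dimensional vectors, the averaging step, and the distance measure are all illustrative assumptions; the patent only says that keywords are expressed as vectors and that a vector difference is calculated.

```python
import math

# Hypothetical word vectors; in practice these would come from some
# word-embedding model, which the patent text does not specify.
WORD_VECTORS = {
    "red car": (1.0, 0.0, 0.0),
    "white car": (0.9, 0.1, 0.0),
    "traffic light": (0.0, 1.0, 0.0),
    "man": (0.0, 0.0, 1.0),
}


def mean_vector(keywords):
    """Average the vectors of an image's keywords into a single vector."""
    vectors = [WORD_VECTORS[keyword] for keyword in keywords]
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def vector_difference(keywords_a, keywords_b):
    """Euclidean distance between the mean keyword vectors of two images.

    A smaller value is read as a higher degree of dialogue connection.
    """
    return math.dist(mean_vector(keywords_a), mean_vector(keywords_b))


close = vector_difference(["red car"], ["white car"])
far = vector_difference(["red car"], ["traffic light", "man"])
```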

When the dialogue sentence generation unit 42 generates a plurality of dialogue sentence candidates for each still image, the reproduction order determination unit 43 selects, from among those candidates, a dialogue sentence suited to the still image before performing the reproduction order determination process based on equation (1). Specifically, the reproduction order determination unit 43 first extracts the feature words contained in each candidate. A "feature word" here is, in principle, a keyword extracted by the keyword extraction unit 41, but when a dialogue sentence contains a characteristic word other than a keyword, that word may also be extracted as a feature word. Next, the reproduction order determination unit 43 ranks the feature words by their appearance positions. For example, when "signal", "car", and "man" are extracted as the feature words of the candidate 「信号の近くに止まっている自動車は、あの男性のものかな」 ("I wonder if the car stopped near the signal belongs to that man"), the feature words appear in the Japanese sentence in the order "signal", "car", "man", so "signal" is given rank 1, "car" rank 2, and "man" rank 3. The reproduction order determination unit 43 also identifies the reference rank of each feature word. The "reference rank" here is the average rank of the appearance position of each feature word in typical dialogue sentences containing the extracted feature words. These reference ranks are obtained statistically from many examples of dialogue sentences containing the feature words and are stored in a storage device such as the ROM in a table format such as that shown in FIG. 4. The reproduction order determination unit 43 then calculates the difference between the rank of each feature word and its reference rank, and sums those differences to obtain the score of each candidate. For example, when the score of the above candidate is calculated from the reference ranks shown in FIG. 4, the difference between the rank and the reference rank is 0 for "signal", 1 for "car", and 1 for "man", so the sum of the differences, and therefore the score of the candidate, is 2. Once a score has been obtained in this way for each of the candidates assigned to a still image, the reproduction order determination unit 43 selects the candidate with the lowest score as the dialogue sentence suited to that still image. The reproduction order determination unit 43 then obtains the "degree of connection of dialogue sentences" described above based on the selected dialogue sentences.
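The candidate scoring above can be sketched as follows. The reference ranks are hypothetical stand-ins for the FIG. 4 table, chosen only so that the differences come out as 0, 1, and 1 as in the description. Taking the absolute value of each difference and locating words by simple substring search in English sentences are further assumptions made for illustration.

```python
# Hypothetical reference ranks standing in for the FIG. 4 table.
REFERENCE_RANK = {"signal": 1, "car": 3, "man": 2}


def appearance_ranks(sentence, feature_words):
    """Rank the feature words (1-based) by where they first appear."""
    present = [word for word in feature_words if word in sentence]
    ordered = sorted(present, key=sentence.index)
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def candidate_score(sentence, feature_words, reference=REFERENCE_RANK):
    """Sum of absolute differences between appearance and reference ranks."""
    ranks = appearance_ranks(sentence, feature_words)
    return sum(abs(rank - reference[word]) for word, rank in ranks.items())


def select_dialogue(candidates, feature_words):
    """Pick the candidate with the lowest score."""
    return min(candidates, key=lambda s: candidate_score(s, feature_words))


features = ["signal", "car", "man"]
candidates = [
    # Word order signal, car, man: mirrors the example in the description.
    "Near the signal there is a car; does it belong to that man?",
    # Word order signal, man, car: matches the hypothetical reference ranks.
    "At the signal, that man is waiting beside his car.",
]
scores = [candidate_score(c, features) for c in candidates]
selected = select_dialogue(candidates, features)
```

The first candidate reproduces the score of 2 from the description, while the second scores 0 against the hypothetical table and is therefore selected.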

When the reproduction order determination unit 43 has finished determining the reproduction order of the still images and selecting the dialogue sentence suited to each still image by the methods described above, it stores that information in a storage device such as the hard disk in association with the dialogue sentences.

(Reproduction unit 44)
The reproduction unit 44 causes the output device 3 to output the plurality of still images sequentially in the reproduction order determined by the reproduction order determination unit 43, and to output the dialogue sentence associated with each still image. When the dialogue sentence is subtitle data, the subtitle data may be output while the still image corresponding to that dialogue sentence is being displayed. When the dialogue sentence is audio data, the audio data may be output while the still image corresponding to that dialogue sentence is being displayed.
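The pairing of each still image with its dialogue sentence during output can be pictured as a simple timeline. The sketch below assumes a fixed display time per still image and subtitle-style output; the function name, the four-second duration, and the tuple layout are hypothetical, and actual rendering to a display or speaker is left out.

```python
def build_timeline(order, dialogues, seconds_per_image=4.0):
    """Return one cue per still image: (start, end, image name, dialogue text).

    Each dialogue sentence is scheduled for exactly the interval during
    which its still image is shown.
    """
    timeline = []
    for index, name in enumerate(order):
        start = index * seconds_per_image
        timeline.append((start, start + seconds_per_image, name, dialogues[name]))
    return timeline


dialogues = {
    "A": "The red car looks cool, doesn't it? / Yes, it does.",
    "B": "The man by the traffic light seems to like red cars. / I like white cars.",
}
timeline = build_timeline(["B", "A"], dialogues)
```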

Next, the procedure by which the computer 4 of the present embodiment automatically generates a dialogue sentence moving image will be described with reference to FIG. 5. FIG. 5 is a diagram showing the operation flow of the computer 4 when automatically generating a dialogue sentence moving image.

The computer 4 receives a plurality of still images from the image input device 2 (step S101). For example, when the image input device 2 is a digital camera, the plurality of still images captured by the digital camera are input. Alternatively, still images captured in advance with a digital camera may be stored in a storage device such as a hard disk, and the image input device 2 may pick up the plurality of still images from that storage device. In that case, the user may freely select the plurality of still images from the storage device.

When the plurality of still images have been input by the image input device 2, the computer 4 calculates, for each input still image, the degree of image similarity between that still image and each of the other still images (step S102). Here, as described above, the image similarity calculation unit 40 of the computer 4 calculates a difference, such as a difference in pixel values, between each still image and the other still images, and uses that difference as the degree of image similarity.
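
As a non-limiting illustration, a pixel-value difference of the kind used in step S102 might be sketched as follows. The choice of the mean absolute difference, the grayscale list-of-rows image representation, and the function names are assumptions made for this sketch. Because the measure is a difference, a smaller value indicates more similar images.

```python
def image_difference(image_a, image_b):
    """Mean absolute pixel difference between two grayscale images.

    Images are non-empty lists of rows of integer pixel values and must
    have the same dimensions. A result of 0.0 means identical images.
    """
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count


def pairwise_differences(images):
    """Difference for every unordered pair, keyed by (name_a, name_b)."""
    names = sorted(images)
    return {
        (a, b): image_difference(images[a], images[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


images = {
    "A": [[0, 0], [0, 0]],
    "B": [[10, 10], [10, 10]],
    "C": [[0, 0], [0, 4]],
}
differences = pairwise_differences(images)
```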

The computer 4 extracts keywords from each of the plurality of still images input by the image input device 2 (step S103). Here, the keyword extraction unit 41 of the computer 4 identifies the characteristic objects contained in each still image and extracts keywords that identify those objects. Specifically, as described above, the keyword extraction unit 41 extracts the keywords for identifying the objects contained in each still image by using a classifier created by machine learning.

When keyword extraction has been completed for all of the still images input by the image input device 2, the computer 4 generates a dialogue sentence corresponding to each still image based on the extracted keywords (step S104). Here, the dialogue sentence generation unit 42 of the computer 4 generates the dialogue sentence corresponding to each still image based on the keywords extracted by the keyword extraction unit 41. Specifically, as described above, the dialogue sentence generation unit 42 generates a dialogue sentence corresponding to the keywords by using a classifier created by machine learning.

When dialogue sentence generation has been completed for all of the still images input by the image input device 2, the computer 4 determines the first weighting coefficient w1 and the second weighting coefficient w2 (step S105). Here, the reproduction order determination unit 43 of the computer 4 determines the first weighting coefficient w1 and the second weighting coefficient w2 based on the time length of the moving image to be generated. For example, the reproduction order determination unit 43 determines the coefficients w1 and w2 so that the weight of the second weighting coefficient w2 relative to the first weighting coefficient w1 is larger when the moving image to be generated is long than when it is short.
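
As a non-limiting illustration, the determination of the coefficients in step S105 might be sketched as follows. The present description only requires that w2 be weighted more heavily relative to w1 for a longer moving image; the linear interpolation, the duration limits, and the numerical range of the coefficients are hypothetical choices made for this sketch.

```python
def choose_weights(duration_seconds, short_limit=30.0, long_limit=180.0):
    """Return (w1, w2) with w1 + w2 == 1.

    w1 weights the degree of image similarity and w2 weights the degree
    of connection of the dialogue sentences. The ratio w2 / w1 grows as
    the target duration grows from short_limit to long_limit.
    """
    if long_limit <= short_limit:
        raise ValueError("long_limit must exceed short_limit")
    fraction = (duration_seconds - short_limit) / (long_limit - short_limit)
    fraction = min(1.0, max(0.0, fraction))  # clamp to [0, 1]
    w2 = 0.25 + 0.5 * fraction               # hypothetical range 0.25 to 0.75
    w1 = 1.0 - w2
    return w1, w2


w1_short, w2_short = choose_weights(10.0)
w1_long, w2_long = choose_weights(300.0)
```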

The computer 4 executes the process of step S106 in parallel with the process of step S105. Step S106 is executed when a plurality of dialogue sentence candidates have been generated for one still image, and selects the dialogue sentence suitable for that still image from among the candidates. As described above, this process is performed by the reproduction order determination unit 43 of the computer 4. Specifically, the reproduction order determination unit 43 first extracts the feature words contained in each dialogue sentence candidate and ranks them by their appearance position. Next, it derives the reference rank of each feature word from a table such as the one shown in FIG. 4, and obtains the score of each candidate by calculating the sum of the differences between the rank of each feature word in the candidate and its reference rank. The reproduction order determination unit 43 then selects the candidate with the smallest score as the dialogue sentence suitable for the still image.

When the computer 4 has finished executing the processes of steps S105 and S106, it proceeds to step S107. In step S107, the computer 4 calculates the cost value between still images using the first weighting coefficient w1 and the second weighting coefficient w2 determined in step S105. As described above, this calculation is performed by the reproduction order determination unit 43 of the computer 4. Specifically, the reproduction order determination unit 43 first calculates the degree of connection of the dialogue sentences between still images. As described above, this is done by representing the keywords and the like contained in the dialogue sentence of each still image as a vector and calculating the vector difference between still images. Next, the reproduction order determination unit 43 calculates the cost value based on the above-described equation (1) for each combination of a still image with another still image.
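
As a non-limiting illustration, the calculation of step S107 might be sketched as follows. The cost is written as the weighted sum recited in the claims, that is, the degree of image similarity multiplied by w1 plus the degree of connection of the dialogue sentences multiplied by w2. The bag-of-words vector representation, the use of the Euclidean distance as the vector difference, and the function names are assumptions made for this sketch.

```python
import math


def keyword_vector(keywords, vocabulary):
    """Count vector of the keywords over a fixed, ordered vocabulary."""
    return [keywords.count(word) for word in vocabulary]


def connection_degree(keywords_a, keywords_b):
    """Vector difference between the keywords of two dialogue sentences.

    Assumed here to be the Euclidean distance between count vectors;
    a smaller value means the two sentences share more keywords.
    """
    vocabulary = sorted(set(keywords_a) | set(keywords_b))
    vector_a = keyword_vector(keywords_a, vocabulary)
    vector_b = keyword_vector(keywords_b, vocabulary)
    return math.dist(vector_a, vector_b)


def cost_value(image_difference, dialogue_difference, w1, w2):
    """Weighted sum of the two differences for one pair of still images."""
    return w1 * image_difference + w2 * dialogue_difference


degree = connection_degree(["car", "signal"], ["car", "male"])
cost = cost_value(10.0, 2.0, w1=0.5, w2=0.5)
```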

When the cost values have been calculated for all combinations of two still images among the plurality of still images input by the image input device 2, the computer 4 determines the reproduction order of the plurality of still images based on those cost values (step S108). As described above, this process is performed by the reproduction order determination unit 43 of the computer 4. For example, to determine the still image to be reproduced after a given still image A′, the combination with the smallest cost value is first identified among the combinations of the still image A′ with the other still images. The other still image in the identified combination is then chosen as the still image to be reproduced after the still image A′. The still image reproduced first in the generated moving image may be selected by the user, or may be selected by the computer 4 according to a predetermined algorithm. As a method for having the computer 4 select the first still image, the still image with the oldest or newest capture date and time among the plurality of still images may be selected, for example.
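
As a non-limiting illustration, the ordering of step S108 might be sketched as a greedy selection. Excluding still images that have already been placed in the order is an assumption of this sketch, as are the function names and the example cost values.

```python
def pair_cost(costs, a, b):
    """Look up the cost of a pair stored under either key order."""
    return costs[(a, b)] if (a, b) in costs else costs[(b, a)]


def greedy_order(images, costs, first):
    """Start from `first`, then repeatedly append the cheapest unplaced image."""
    order = [first]
    remaining = [image for image in images if image != first]
    while remaining:
        current = order[-1]
        next_image = min(
            remaining, key=lambda image: pair_cost(costs, current, image)
        )
        order.append(next_image)
        remaining.remove(next_image)
    return order


# Hypothetical cost values for four still images.
costs = {
    ("A", "B"): 5.0, ("A", "C"): 1.0, ("A", "D"): 4.0,
    ("B", "C"): 2.0, ("B", "D"): 1.0, ("C", "D"): 3.0,
}
order = greedy_order(["A", "B", "C", "D"], costs, first="A")
```

The first still image is passed in explicitly, reflecting that it may be chosen by the user or by a separate rule such as the capture date and time.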

When the computer 4 has finished executing the process of step S108, it proceeds to step S109. In step S109, the computer 4 plays back the dialogue sentence moving image by causing the output device 3 to output the plurality of still images and the dialogue sentence associated with each still image in the reproduction order determined in step S108.

According to the procedure shown in FIG. 5, when a moving image is generated from a plurality of still images, dialogue sentences related to the objects contained in each still image are generated automatically. The reproduction order of the still images is then determined based on the degree of image similarity and the degree of connection of the dialogue sentences, and the still images and their corresponding dialogue sentences are output in that order. A moving image generated in this way is of higher quality and richer in variety than a moving image generated by considering only the continuity of the images, or a moving image to which a mere description of each image is added.

Further, according to the procedure shown in FIG. 5, the weight given to the degree of connection of the dialogue sentences relative to the degree of image similarity is larger when the moving image to be generated is long than when it is short. Therefore, when the moving image to be generated is short, it is generated with more emphasis on the degree of image similarity than on the degree of connection of the dialogue sentences, so that a moving image whose image transitions feel natural can be generated. On the other hand, when the moving image to be generated is long, it is generated with more emphasis on the degree of connection of the dialogue sentences than on the degree of image similarity, so that a moving image with a strong story can be generated. The quality of the moving image can thus be improved further.

DESCRIPTION OF SYMBOLS
1 Automatic generation device
2 Image input device
3 Output device
4 Computer
40 Image similarity calculation unit
41 Keyword extraction unit
42 Dialogue sentence generation unit
43 Reproduction order determination unit
44 Playback unit

Claims

An apparatus for automatically generating a dialogue sentence moving image, comprising:
image input means for inputting a plurality of still images;
image similarity calculation means for calculating a degree of image similarity between two still images among the plurality of still images input by the image input means;
keyword extraction means for extracting, for each of the plurality of still images input by the image input means, a keyword for identifying an object included in the still image;
dialogue sentence generation means for generating, for each of the plurality of still images input by the image input means, a dialogue sentence related to the keyword extracted by the keyword extraction means;
reproduction order determination means for determining a reproduction order of the plurality of still images based on the degree of image similarity calculated by the image similarity calculation means and a degree of connection of the dialogue sentences generated by the dialogue sentence generation means; and
reproduction means for outputting the plurality of still images and subtitles or audio of the dialogue sentences corresponding to the respective still images in accordance with the reproduction order determined by the reproduction order determination means.

The apparatus for automatically generating a dialogue sentence moving image according to claim 1, wherein the reproduction order determination means obtains the degree of connection of the dialogue sentences between two still images by representing the keywords extracted by the keyword extraction means as vectors and calculating the difference between the keyword vectors of the two still images.

The apparatus for automatically generating a dialogue sentence moving image according to claim 1 or 2, wherein the reproduction order determination means calculates a cost value by adding a value obtained by multiplying the degree of image similarity calculated by the image similarity calculation means by a first weighting coefficient to a value obtained by multiplying the degree of connection of the dialogue sentences generated by the dialogue sentence generation means by a second weighting coefficient, and determines the reproduction order of the plurality of still images based on the calculated cost value.

The apparatus for automatically generating a dialogue sentence moving image according to claim 3, wherein the reproduction order determination means determines the first weighting coefficient and the second weighting coefficient so that the weight of the degree of connection of the dialogue sentences relative to the degree of image similarity is larger when the time length of the moving image to be generated is long than when it is short.

The apparatus for automatically generating a dialogue sentence moving image according to claim 3, wherein the reproduction order determination means determines the first weighting coefficient and the second weighting coefficient so that the weight of the degree of connection of the dialogue sentences relative to the degree of image similarity changes partway through the generated moving image.

The apparatus for automatically generating a dialogue sentence moving image according to any one of claims 1 to 5, wherein the dialogue sentence generation means generates, for one still image, a plurality of dialogue sentence candidates related to the keyword extracted by the keyword extraction means, and
the reproduction order determination means selects, from among the plurality of dialogue sentence candidates, one dialogue sentence suitable for the still image based on the appearance positions of words in each of the plurality of dialogue sentence candidates generated by the dialogue sentence generation means.

A method for automatically generating a dialogue sentence moving image, including:
inputting a plurality of still images;
calculating a degree of image similarity between two still images among the plurality of input still images;
extracting, for each of the plurality of input still images, a keyword for identifying an object included in the still image;
generating, for each of the plurality of input still images, a dialogue sentence related to the extracted keyword;
determining a reproduction order of the plurality of still images based on a degree of connection of the generated dialogue sentences and the degree of image similarity; and
outputting the plurality of still images and subtitles or audio of the dialogue sentences corresponding to the respective still images in accordance with the determined reproduction order.