JP2022040665A

JP2022040665A - Video processing device, video processing method, and model generation device

Info

Publication number: JP2022040665A
Application number: JP2020145474A
Authority: JP
Inventors: 崇日昔; Takashi Hiseki
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2022-03-11

Abstract

To provide a video processing device, a video processing method, and a model generation device capable of generating appropriate highlight scenes.

SOLUTION: A video processing device (10) includes: an extraction part (highlight extraction part 17) that extracts, from a sports video, a reference scene including characteristic movements and characteristic voices; and a generation part (highlight extraction part 17) that generates, as a highlight scene, a scene spanning from a first scene located a first predetermined time before the reference scene extracted by the extraction part to a second scene located a second predetermined time after it.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a video processing device, a video processing method, and a model generation device.

Inventions that generate highlight scenes are conventionally known (Patent Document 1). The invention described in Patent Document 1 generates a scene as a highlight scene when the volume of cheering exceeds a predetermined value.

Japanese Unexamined Patent Publication No. 2012-147296

However, some scenes do not show the players even though the cheering is loud, and such scenes are not necessarily appropriate as highlight scenes.

The present invention has been made in view of the above problem, and its object is to provide a video processing device, a video processing method, and a model generation device capable of generating appropriate highlight scenes.

A video processing device according to one aspect of the present invention includes: an extraction unit that extracts, from a sports video, a reference scene containing a characteristic motion or a characteristic voice; and a generation unit that generates, as a highlight scene, a scene spanning from a first scene located a first predetermined time before the reference scene extracted by the extraction unit to a second scene located a second predetermined time after the reference scene.

According to the present invention, appropriate highlight scenes can be generated.

FIG. 1 is a schematic configuration diagram of a video processing device 10 and a model generation device 50 according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a highlight scene extraction method.
FIG. 3 is a diagram illustrating another example of the highlight scene extraction method.
FIG. 4 is a flowchart illustrating an operation example of the video processing device 10 according to the embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. In the drawings, the same parts are given the same reference numerals, and their description is omitted.

(Configuration example of video processing device)
The video processing device 10 is a device that extracts highlight scenes and generates a highlight video from the extracted highlight scenes. When there are multiple highlight scenes, the highlight video is the result of joining them together. In the present embodiment, a highlight scene means a so-called highlight of a sport, typically a scoring scene. Some sports, however, have no scoring scenes (for example, sumo). In sumo, a highlight scene is, for example, a scene in which a wrestler's technique succeeds. In the present embodiment, the video processing device 10 is described as a device separate from the model generation device 50, but some or all of the functions of the model generation device 50 may be incorporated in the video processing device 10.

An example of the configuration of the video processing device 10 and the model generation device 50 will be described with reference to FIG. 1. The video processing device 10 is described first. As shown in FIG. 1, the video processing device 10 includes a storage device 11, a control unit 12, and an interface 13.

The storage device 11 is a device that records the content acquired by the video acquisition device 20, and is typically a recorder. Such a recorder is composed of, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive). Cloud storage based on cloud computing may also be used as the storage device 11. The video processing device 10 therefore does not necessarily have to include the storage device 11.

The video acquisition device 20 is, for example, a device that acquires television programs, and is typically a television. The video acquisition device 20 may also be a smartphone, a tablet terminal, a personal computer, or a game console that acquires video content over the Internet. Television programs and video content include sports broadcasts, dramas, movies, documentaries, animation, news, and so on; in the present embodiment, the target of highlight scene extraction is sports broadcasts. As for the type of sport, any sport provided via television or the Internet can be a target of highlight scene extraction. Typical sports provided via television or the Internet include baseball, basketball, American football, ice hockey, soccer, tennis, volleyball, table tennis, rugby, cricket, fencing, golf, and sumo. In the present embodiment, sports broadcasts and sports content are collectively referred to as sports video. Sports video therefore includes sports broadcasts acquired via television and sports content acquired over the Internet. To simplify the description, however, the target of highlight scene extraction below is a television program. The functions of the video acquisition device 20 may be incorporated in the video processing device 10. In that case, the video processing device 10 can function as a portable video camera. A portable video camera is sometimes called a handy camera. The sports video described above also includes sports video shot by an ordinary user with a video camera.

The interface 13 is implemented, for example, as hardware such as a network adapter, as communication software, or as a combination of these, and is configured to enable wired or wireless communication. The interface 13 also functions as an input unit and an output unit for transmitting and receiving data.

The interface 13 is used for communication with the user terminal 30 and for communication with the model generation device 50 via the network 40. The user terminal 30 is a device operated by the user, and is typically a remote controller for operating the recorder. The user terminal 30 is not limited to a remote controller, however; it may be a smartphone, a tablet terminal, a personal computer, or a game console. In the present embodiment, the user terminal 30 is described as a remote controller. The network 40 may be wireless, wired, or both, and may include the Internet. In the present embodiment, the video processing device 10 and the model generation device 50 connect to the network 40 wirelessly.

The user operates the remote controller (user terminal 30) to select the sports broadcast program from which highlight scenes are to be extracted. In other words, the present embodiment assumes that the sports broadcast program in question has already been recorded in the storage device 11. When the user makes this selection with the remote controller, the signal transmitted from the remote controller is output to the control unit 12 via the interface 13. Based on the received signal, the control unit 12 extracts the highlight scenes of the sports broadcast program selected by the user.

The control unit 12 is a general-purpose microcomputer including a CPU (Central Processing Unit), a memory, an input/output unit, and the like. The CPU, memory, input/output unit, and the like are electrically connected via a bus (not shown). A computer program that causes the microcomputer to function as the video processing device 10 is installed in the microcomputer. By executing the computer program, the microcomputer functions as the plurality of information processing circuits included in the video processing device 10. Although an example in which these information processing circuits are realized by software is shown here, the information processing circuits can of course also be configured with dedicated hardware for executing each of the information processing operations described below. The plurality of information processing circuits may also be configured as separate pieces of hardware. As the plurality of information processing circuits, the control unit 12 includes a video analysis unit 14 and a highlight extraction unit 17. The video analysis unit 14 is further divided into a competition specifying unit 15 and a feature detection unit 16.

When the competition specifying unit 15 receives the signal transmitted from the user terminal 30, it identifies the type (sport) of the sports broadcast program selected by the user. An example of the identification method is as follows. When a sports broadcast program is recorded in the storage device 11, tag information indicating the type of the recorded program may be stored at the same time. In that case, the competition specifying unit 15 can identify the type of the sports broadcast program by referring to the tag information. Alternatively, the competition specifying unit 15 may analyze the video of the sports broadcast program to identify its type. For example, when lines are drawn on the court or field, the sport can be identified as basketball if a three-point line is detected, or as soccer if a center circle and a penalty box are detected. The sport can also be identified from the shape of the uniforms, the shape of the ball used, the number of players, and so on. The competition specifying unit 15 outputs the identified type of the sports broadcast program to the feature detection unit 16.
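The identification order described above (tag information first, then court or field markings) can be sketched as follows. This is only an illustrative sketch: the function name `identify_sport` and the marking labels such as `"three_point_line"` are hypothetical, and the detection of the markings themselves is assumed to happen elsewhere.

```python
from typing import Optional, Set


def identify_sport(tag: Optional[str], detected_markings: Set[str]) -> Optional[str]:
    """Return the sport type, preferring recorded tag information.

    `tag` and the marking labels are hypothetical names for this sketch.
    """
    if tag:
        # Tag information stored with the recording identifies the sport directly.
        return tag
    if "three_point_line" in detected_markings:
        return "basketball"
    if {"center_circle", "penalty_box"} <= detected_markings:
        return "soccer"
    # Other cues (uniforms, ball shape, number of players) are not modeled here.
    return None


print(identify_sport(None, {"center_circle", "penalty_box"}))
```

Returning `None` when nothing matches leaves room for the other cues the description mentions, such as uniform shape or the number of players.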

The feature detection unit 16 analyzes the sports broadcast program selected by the user and detects characteristic facial expressions, characteristic motions, and characteristic voices. Well-known video recognition techniques are used to detect characteristic facial expressions and characteristic motions. One example of such a technique is template matching using a model (template). Template matching is a technique that performs matching while varying the position and size of the model and detects scenes with a high degree of similarity. SSD (Sum of Squared Differences), SAD (Sum of Absolute Differences), and the like are used as similarity measures. Since these are well-known techniques, a detailed description is omitted.
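As a rough illustration of template matching with SSD and SAD, the following sketch slides a small template over a grayscale frame represented as nested lists and returns the position with the lowest dissimilarity. It is a simplification under stated assumptions: only the position is varied (not the size), and the names `best_match`, `ssd`, and `sad` are chosen for this example rather than taken from the publication.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]


def ssd(a: Image, b: Image) -> int:
    """Sum of squared differences between two equally sized patches."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def sad(a: Image, b: Image) -> int:
    """Sum of absolute differences between two equally sized patches."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def best_match(frame: Image, template: Image,
               metric: Callable[[Image, Image], int] = ssd) -> Tuple[int, int, int]:
    """Return (score, top, left) of the patch most similar to the template."""
    th, tw = len(template), len(template[0])
    best = None
    for top in range(len(frame) - th + 1):
        for left in range(len(frame[0]) - tw + 1):
            patch = [row[left:left + tw] for row in frame[top:top + th]]
            score = metric(patch, template)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best


frame = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 9, 0],
         [0, 0, 0, 0]]
template = [[9, 8],
            [7, 9]]
print(best_match(frame, template))       # lower score means higher similarity
print(best_match(frame, template, sad))
```

For both measures a score of zero is an exact match; in practice a threshold on the score would decide whether a scene counts as "similar".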

A characteristic facial expression is, for example, a player's expression of joy. Characteristic motions include a player's fist pump (the so-called "guts pose"), a player raising a hand to acknowledge the crowd, high fives between players, and applause by players or spectators. In other words, a characteristic motion is a motion that expresses joy. Well-known speech recognition techniques are used to detect characteristic voices. Examples of characteristic voices include specific phrases uttered when a player scores, such as "I did it!" or "come on". A voice with a volume at or above a predetermined level is also treated as a characteristic voice.
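The volume criterion for a characteristic voice can be sketched as a simple threshold on the loudness of fixed-length windows. The use of RMS as the loudness measure, the window length, and the threshold value are all assumptions made for this illustration; the publication only says "a predetermined level".

```python
import math
from typing import List


def rms(samples: List[float]) -> float:
    """Root-mean-square level of a block of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loud_windows(samples: List[float], rate: int,
                 window_s: float, threshold: float) -> List[float]:
    """Return start times (seconds) of windows whose RMS level meets the threshold."""
    size = int(rate * window_s)
    hits = []
    for start in range(0, len(samples) - size + 1, size):
        if rms(samples[start:start + size]) >= threshold:
            hits.append(start / rate)
    return hits


# Toy signal: quiet, loud, quiet (4 samples per second for readability).
samples = [0.1] * 4 + [0.9] * 4 + [0.1] * 4
print(loud_windows(samples, rate=4, window_s=1.0, threshold=0.5))
```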

The models of characteristic facial expressions and characteristic motions are generated by the model generation device 50. As described in detail later, the model generation device 50 generates these models using, for example, deep learning.

The feature detection unit 16 analyzes the sports broadcast program by dividing it into parts of a predetermined length. For example, when the sports broadcast program is soccer, the first half lasts 45 minutes, so the feature detection unit 16 divides the video into 5-minute units and analyzes it as nine parts. The feature detection unit 16 outputs the characteristic facial expressions, characteristic motions, and characteristic voices detected by template matching and speech recognition to the highlight extraction unit 17. The 5-minute length is only an example and may be changed as appropriate depending on, for example, the performance of the computer.
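The division into parts is simple arithmetic, sketched below for the 45-minute, 5-minute-unit example. The function name `split_into_parts` is hypothetical, and times are expressed in seconds for convenience.

```python
from typing import List, Tuple


def split_into_parts(total_s: int, part_s: int) -> List[Tuple[int, int]]:
    """Split [0, total_s) into consecutive (start, end) parts of at most part_s seconds."""
    return [(start, min(start + part_s, total_s))
            for start in range(0, total_s, part_s)]


parts = split_into_parts(45 * 60, 5 * 60)
print(len(parts))   # number of parts for a 45-minute half in 5-minute units
print(parts[0], parts[-1])
```

The `min` call keeps the last part within the video when the total length is not a multiple of the part length.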

The highlight extraction unit 17 (extraction unit, generation unit) extracts, as a highlight scene, a scene containing the characteristic facial expression, characteristic motion, and characteristic voice detected by the feature detection unit 16 (hereinafter sometimes called the reference scene) together with the scenes before and after it. The scenes before and after are defined, relative to the reference scene, as the span from a first scene located a first predetermined time earlier to a second scene located a second predetermined time later. The highlight extraction unit 17 stores the extracted highlight scene in the storage device 11 as a highlight video. When multiple highlight scenes are extracted, the highlight extraction unit 17 joins them to generate a single highlight video.
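Joining multiple highlight scenes into one highlight video could be sketched as follows, with each scene represented as a (start, end) pair in seconds. Ordering the scenes chronologically and merging overlapping scenes are assumptions of this sketch; the description only states that the scenes are joined.

```python
from typing import List, Tuple

Window = Tuple[int, int]


def join_highlights(windows: List[Window]) -> List[Window]:
    """Order highlight windows by time and merge any that overlap (assumed policy)."""
    merged: List[Window] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of duplicating footage.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(join_highlights([(1795, 1810), (600, 615), (1805, 1820)]))
```

The resulting list is the cut list from which a single highlight video would be assembled.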

When the highlight video has been stored in the storage device 11, the video processing device 10 may notify the user via the interface 13 that generation of the highlight video is complete. The user can then enjoy the highlight video.

Next, the model generation device 50 will be described. As shown in FIG. 1, the model generation device 50 includes a server 51, a control unit 52, a storage device 55, and an interface 56. The installation location of the model generation device 50 is not particularly limited; for example, it is installed in a management center owned by the business operator that manages the model generation device 50.

The server 51 stores a large number of past sports broadcast programs and sports content. Sports broadcast programs and sports content broadcast in the future are also stored in the server 51. Anything that functions as a storage medium may be used in place of a server. Examples of such storage media include cassette tapes, CD-ROMs, DVD-ROMs, and Blu-ray Discs.

Like the control unit 12, the control unit 52 is a general-purpose microcomputer. As a plurality of information processing circuits, the control unit 52 includes a video acquisition unit 53 and a model generation unit 54.

The video acquisition unit 53 acquires previously stored sports broadcast programs and sports content from the server 51, and outputs them to the model generation unit 54.

The model generation unit 54 analyzes the sports broadcast programs and sports content acquired from the video acquisition unit 53 to generate models. As described above, the models here are a model of characteristic facial expressions and a model of characteristic motions. These models are used in the video recognition technique. Well-known AI techniques are used to generate the models; deep learning is one example. In deep learning, a neural network is used to extract feature quantities (characteristic facial expressions and characteristic motions) from a large amount of video data (data on sports broadcast programs and sports content). This makes it possible to extract, in addition to features common to many sports such as expressions of joy and fist pumps, the motion features peculiar to each sport. The same applies to voice: deep learning extracts feature quantities such as sound intensity, frequency, volume, and specific words from the input audio data to generate a voice model. Before sports video is input to the neural network, it is classified by sport, and the classified video is input to the neural network. The neural network learns from the classified video, and a model of characteristic facial expressions, a model of characteristic motions, and a model of characteristic voices are generated for each sport. The models generated in this way are stored in the storage device 55. In the following, the model of characteristic facial expressions generated by the model generation unit 54 may be called the facial expression model, the model of characteristic motions the motion model, and the model of characteristic voices the voice model.

The model generation device 50 transmits the models to the video processing device 10 via the network 40. The transmitted models are stored in the storage device 11 and are loaded from it when highlight scenes are extracted.

Next, an example of the highlight scene extraction method will be described with reference to FIGS. 2 and 3. In FIGS. 2 and 3, the sport is soccer.

The horizontal axis in FIG. 2 is the soccer match time. In FIG. 2, a goal is scored 30 minutes into the match, and at 30 minutes 5 seconds the player who scored does a fist pump and the cheering of the crowd grows louder. Suppose the user wants highlight scenes extracted from a soccer broadcast program containing such a scene. As described above, the user makes this request by operating the remote controller.

First, upon receiving the signal transmitted from the remote controller, the competition specifying unit 15 identifies the type (sport) of the sports broadcast program selected by the user. Assume here that soccer is identified. Next, the feature detection unit 16 refers to the storage device 11 and acquires the facial expression model, motion model, and voice model for soccer. The feature detection unit 16 divides the soccer broadcast program into parts of a predetermined length and starts the analysis. Here, the span from 25 minutes to 35 minutes is taken as one part. The feature detection unit 16 analyzes the video from 25 minutes to 35 minutes for scenes that match or resemble the facial expression model, the motion model, and the voice model. As a result of the analysis, the feature detection unit 16 detects, in the scene at 30 minutes 5 seconds, the player's expression of joy, the player's fist pump, and a voice (cheering) with a volume at or above the predetermined level.

Using the scene at 30 minutes 5 seconds as the reference, the highlight extraction unit 17 extracts the scene spanning from a first scene located a first predetermined time earlier to a second scene located a second predetermined time later. In the example shown in FIG. 2, the first predetermined time is 10 seconds, so the first scene is the scene at 29 minutes 55 seconds. The second predetermined time is 5 seconds, so the second scene is the scene at 30 minutes 10 seconds. In the example shown in FIG. 2, the highlight scene is therefore 15 seconds long.
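The arithmetic of the FIG. 2 example can be checked with a short sketch. The helper names `highlight_window` and `mmss` are hypothetical, and clamping the start at zero is an assumption added so that a reference scene near the beginning of the video does not produce a negative time.

```python
from typing import Tuple


def highlight_window(reference_s: int, first_s: int, second_s: int) -> Tuple[int, int]:
    """Return (start, end) in seconds around a reference scene.

    first_s is the time to go back; second_s is the time to go forward.
    """
    start = max(0, reference_s - first_s)  # clamp at the start of the video (assumption)
    end = reference_s + second_s
    return start, end


def mmss(seconds: int) -> str:
    """Format seconds as minutes:seconds."""
    return f"{seconds // 60}:{seconds % 60:02d}"


reference = 30 * 60 + 5                      # reference scene at 30 min 5 s
start, end = highlight_window(reference, first_s=10, second_s=5)
print(mmss(start), mmss(end), end - start)   # 29:55 30:10 15
```

Setting `second_s=0` gives a window that ends at the reference scene, which corresponds to extracting only the span leading up to it.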

When a player scores in soccer, the scenes that follow include a scene in which the scorer celebrates, a scene in which the scorer does a fist pump, and a scene in which loud cheering erupts. According to the present embodiment, the features of the scene that follows a scoring scene are detected, and a scene that includes the scenes before and after that scene is generated as a highlight scene. This makes it possible to generate a scene containing the scoring scene as a highlight scene without detecting the scoring scene itself. Furthermore, because the highlight scene is generated taking into account the players' facial expressions and motions in addition to voice, more appropriate highlight scenes can be generated than with the conventional technique.

The feature detection unit 16 extracts scenes in which matches or similarities with the facial expression model, the motion model, and the voice model occur with no time difference. A certain time difference may, however, be tolerated. The tolerated time difference is not particularly limited, but may be set in the range of, for example, 1 to 2 seconds.
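One way to apply such a tolerance is sketched below: given the times at which each of the three models produced a detection, a motion detection is kept as a reference scene only if an expression detection and a voice detection both fall within the tolerance. Anchoring on the motion detection and the function name `reference_times` are assumptions of this sketch; the publication does not specify how the three detections are aligned.

```python
from typing import List


def reference_times(expression: List[float], motion: List[float],
                    voice: List[float], tolerance_s: float = 1.0) -> List[float]:
    """Return motion-detection times that coincide with expression and voice detections."""
    refs = []
    for t in motion:
        near_expression = any(abs(t - e) <= tolerance_s for e in expression)
        near_voice = any(abs(t - v) <= tolerance_s for v in voice)
        if near_expression and near_voice:
            refs.append(t)
    return refs


# Detections in seconds; only the cluster around 30 min 5 s involves all three models.
print(reference_times(expression=[1805.0, 2000.0],
                      motion=[1805.5, 2400.0],
                      voice=[1804.8]))
```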

In the example shown in FIG. 2, the highlight scene includes the scenes both before and after the reference scene, but the embodiment is not limited to this. For example, as shown in FIG. 3, only the span from the first scene, located the first predetermined time (10 seconds) before the reference scene, up to the reference scene may be extracted as the highlight scene. Even in this case, a scene containing the scoring scene can be extracted as a highlight scene without detecting the scoring scene itself.

Next, an operation example of the video processing device 10 will be described with reference to the flowchart of FIG. 4.

In step S101, when the control unit 12 receives the signal transmitted from the user terminal 30 (YES in step S101), the process proceeds to step S103. When no signal has been transmitted from the user terminal 30 (NO in step S101), the process waits.

In step S103, the competition specifying unit 15 acquires the sports broadcast program selected by the user from the storage device 11 and identifies its type. Assume here that the type is identified as soccer. The process proceeds to step S105, where the feature detection unit 16 refers to the storage device 11 and acquires the facial expression model, motion model, and voice model for soccer.

The process proceeds to step S107, where the feature detection unit 16 divides the soccer broadcast program into parts of a predetermined length and starts the analysis. For each part, the feature detection unit 16 uses well-known video recognition and speech recognition techniques to analyze whether there is a scene that matches or resembles the facial expression model, the motion model, and the voice model.

The process proceeds to step S109. When a scene that matches or resembles the facial expression model, the motion model, and the voice model is detected, that scene is extracted as the reference scene. Using the reference scene as the reference, the highlight extraction unit 17 extracts the scene spanning from the first scene, located the first predetermined time earlier, to the second scene, located the second predetermined time later. The highlight scene is extracted in this way. The process proceeds to step S111, where the highlight extraction unit 17 stores the extracted highlight scene in the storage device 11 as a highlight video.
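The flow from step S103 to step S111 can be summarized in a small sketch. The recognition steps are passed in as placeholder callables because the publication relies on well-known techniques without specifying them; `run_pipeline` and all parameter names are hypothetical, and a plain list stands in for the storage device.

```python
from typing import Callable, Dict, List, Tuple


def run_pipeline(program: str,
                 identify: Callable[[str], str],
                 load_models: Callable[[str], Dict[str, str]],
                 detect: Callable[[str, Dict[str, str]], List[int]],
                 first_s: int, second_s: int,
                 storage: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sketch of steps S103 to S111 with placeholder recognition functions."""
    sport = identify(program)                 # S103: identify the type of program
    models = load_models(sport)               # S105: load expression, motion, voice models
    references = detect(program, models)      # S107: analyze parts, find reference scenes
    highlights = [(max(0, t - first_s), t + second_s) for t in references]  # S109
    storage.extend(highlights)                # S111: store the highlight scenes
    return highlights


stored: List[Tuple[int, int]] = []
result = run_pipeline(
    "soccer_broadcast",
    identify=lambda program: "soccer",
    load_models=lambda sport: {"expression": sport, "motion": sport, "voice": sport},
    detect=lambda program, models: [1805],    # stub: one reference scene at 30 min 5 s
    first_s=10, second_s=5, storage=stored)
print(result)
```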

(Effects)
As described above, the video processing device 10 according to the present embodiment provides the following effects.

The video processing device 10 extracts, from a sports video, a reference scene containing a characteristic motion and a characteristic voice, and generates, as a highlight scene, a scene spanning from a first scene located a first predetermined time before the extracted reference scene to a second scene located a second predetermined time after it. According to the present embodiment, the features of the scene that follows a scoring scene are detected, and a scene that includes the scenes before and after that scene is generated as a highlight scene. This makes it possible to generate a scene containing the scoring scene as a highlight scene without detecting the scoring scene itself. Furthermore, because the highlight scene is generated taking into account the players' facial expressions and motions in addition to voice, more appropriate highlight scenes can be generated than with the conventional technique. Although the embodiment described above uses both a characteristic motion and a characteristic voice, the invention is not necessarily limited to this. The video processing device 10 may extract, from a sports video, a reference scene containing only one of a characteristic motion and a characteristic voice, and generate a highlight scene from it.

Each of the functions described in the above embodiment may be implemented by one or more processing circuits. The processing circuits include programmed processing devices, such as processing devices that include electric circuits. The processing circuits also include devices such as application-specific integrated circuits (ASICs) and circuit components arranged to perform the described functions.

Although an embodiment of the present invention has been described above, the statements and drawings that form part of this disclosure should not be understood as limiting the invention. Various alternative embodiments, examples, and operational techniques will be apparent to those skilled in the art from this disclosure.

Other examples of scenes that follow a scoring scene include, in basketball, a scene in which the goal is broken by a dunk; in tennis, a scene in which a ball boy picks up the ball (in the case of a double fault, a point has been scored); and in volleyball, a scene in which the players form a huddle.

The highlight scene is not limited to scoring scenes. In soccer, an offside scene, a scene in which the referee shows a yellow card or a red card, or a scene in which a single player keeps dribbling for a predetermined time or longer can also be a highlight scene. Because players are unlikely to show joyful expressions in such scenes that do not involve scoring, only two models are generated: the motion model and the audio model.

The sports targeted for highlight scene extraction may be limited. For example, only sports in which the two teams are divided into their own side and the opponent's side and compete for points may be targeted. Such sports include soccer, basketball, American football, ice hockey, tennis, badminton, volleyball, table tennis, rugby, and futsal. In these sports as well, as in the soccer example described above, the present embodiment makes it possible to generate a scene including a scoring scene as a highlight scene without extracting the scoring scene itself.

Of course, even in these sports in which the two teams are divided into their own side and the opponent's side and compete for points, the highlight scene is not limited to scoring scenes. In ice hockey, a scene in which the acrylic panel in front of the spectator seats breaks can be a highlight scene. In tennis, badminton, and volleyball, a challenge scene can be a highlight scene. A highlight scene can be generated by extracting these scenes together with the scenes before and after them. Here, a challenge scene means a scene in which a so-called "challenge" is made, that is, a request to re-check a call using video footage or the like when the call is in doubt.

The final moments just before the end of a match can also be a highlight scene.

The model generation device 50 may generate the models of characteristic facial expressions, characteristic motions, and characteristic sounds using only video from periods when the game is interrupted in a sports broadcast program. In soccer, for example, video from when the game is interrupted includes video from when a goal is scored until the referee blows the whistle to restart the game, and video from when the referee blows the whistle for a foul or the like until the referee again blows the whistle to restart the game. In other words, the time when the game of a competition is interrupted may be defined as the period that starts at a signal that the game is interrupted and lasts until a signal that the game is restarted is subsequently given. A brawl scene is also a scene in which the game is interrupted, in every sport.
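The definition of an interruption as the period between a stop signal and the next restart signal can be sketched as follows. The event representation (a list of `(time, kind)` pairs with `kind` equal to `"stop"` or `"restart"`) and the function name are hypothetical; how the signals themselves are detected, for example from whistle audio, is outside the scope of this sketch.

```python
def interruption_intervals(signals):
    """Pair each stop signal with the next restart signal.

    `signals` is an iterable of (time_in_seconds, kind) tuples, where kind is
    "stop" or "restart" (an assumed representation). Returns a list of
    (stop_time, restart_time) intervals in chronological order.
    """
    intervals = []
    stop_time = None
    for time, kind in sorted(signals):
        if kind == "stop" and stop_time is None:
            stop_time = time  # the interruption begins here
        elif kind == "restart" and stop_time is not None:
            intervals.append((stop_time, time))
            stop_time = None  # play has resumed
        # A repeated stop during an interruption, or a restart while play is
        # already running, is ignored in this sketch.
    return intervals


signals = [(10.0, "stop"), (40.0, "restart"),
           (95.0, "stop"), (97.0, "stop"), (120.0, "restart")]
intervals = interruption_intervals(signals)
```

Only the video inside the returned intervals would then be passed to model generation. A stop signal with no later restart signal produces no interval here, which is a simplification.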

In addition to detecting the degree of agreement with the facial expression model, the motion model, and the audio model, the video analysis may also take into account whether the player shown at the center of the frame appears at or above a certain size, whether the players around that player are wearing the same uniform, whether the player is on the field, and so on.
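These additional checks might be combined as in the sketch below. The `Player` fields, the 5% area threshold, and the rule that all three conditions must hold are assumptions made for illustration; the disclosure only states that these factors may be considered.

```python
from dataclasses import dataclass


@dataclass
class Player:
    box_area_ratio: float  # bounding-box area divided by frame area (assumed measure of size)
    team: str              # uniform label, e.g. from a hypothetical jersey classifier
    in_field: bool         # True if the player is on the field


def passes_frame_checks(center, neighbors, min_area_ratio=0.05):
    """Return True if the player at the center of the frame is large enough,
    is on the field, and is surrounded only by players in the same uniform."""
    if center.box_area_ratio < min_area_ratio:
        return False
    if not center.in_field:
        return False
    return all(p.team == center.team for p in neighbors)


scorer = Player(box_area_ratio=0.12, team="home", in_field=True)
teammates = [Player(0.08, "home", True), Player(0.06, "home", True)]
ok = passes_frame_checks(scorer, teammates)
```

A filter of this kind would be applied alongside the model matching, for example to discard shots of the crowd or the bench even when the cheering is loud.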

In the above embodiment, highlight scenes are extracted from a sports broadcast program selected by the user, but the invention is not limited to this. If a sports broadcast program is stored in the storage device 11, the control unit 12 may automatically extract highlight scenes from the stored program regardless of any user instruction.

10 Video processing device
11, 55 Storage device
12, 52 Control unit
13, 56 Interface
14 Video analysis unit
15 Competition identification unit
16 Feature detection unit
17 Highlight extraction unit
20 Video acquisition device
30 User terminal
40 Network
50 Model generation device
51 Server
53 Video acquisition unit
54 Model generation unit

Claims

A video processing device comprising:
an extraction unit that extracts, from a sports video, a reference scene containing a characteristic motion or a characteristic sound; and
a generation unit that generates, as a highlight scene, a scene spanning from a first scene, a first predetermined time before the reference scene extracted by the extraction unit, to a second scene, a second predetermined time after the reference scene.

The video processing device according to claim 1, wherein the characteristic motion or the characteristic sound is generated in advance as a model for each competition related to the sports video and stored in a storage device, the generated model being a model of the time when the game of the competition is interrupted, and
the extraction unit extracts the reference scene using the model stored in the storage device.

The video processing device according to claim 2, wherein the characteristic motion is a motion of joy made while the game of the competition is interrupted or a motion specific to the competition, and the characteristic sound is a sound containing specific wording or a sound at or above a predetermined volume.

The video processing device according to claim 3, wherein the time when the game of the competition is interrupted is the period that starts at a signal that the game is interrupted and lasts until a signal that the game is restarted is subsequently given.

A video processing method comprising:
extracting, from a sports video, a reference scene containing a characteristic motion or a characteristic sound; and
generating, as a highlight scene, a scene spanning from a first scene, a first predetermined time before the extracted reference scene, to a second scene, a second predetermined time after the reference scene.

A model generation device that generates a model for extracting a highlight scene from sports video, the model generation device comprising:
a model generation unit that classifies the sports video by competition, inputs the classified video to a neural network for training, and generates a characteristic motion model or a characteristic audio model for each competition.