JP6824332B2

JP6824332B2 - Video service provision method and service server using this

Info

Publication number: JP6824332B2
Application number: JP2019102475A
Authority: JP
Inventors: キム，ジンジュン; ウ，ソンソプ
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2018-06-01
Filing date: 2019-05-31
Publication date: 2021-02-03
Anticipated expiration: 2039-05-31
Also published as: KR20190137359A; KR102080315B1; JP2019212308A

Description

The present application relates to a video service providing method and a service server using the same, and more particularly to a video service providing method that can divide a video into meaning-based unit sections and automatically generate keywords for each unit section, and a service server using the same.

With the recent development of Internet technology, video services that deliver videos over the Internet have come into wide use. A user who wants to watch a video over the Internet has to find the desired video among the very large number of videos available online, and various video search methods have been proposed to make that search effective.

Recently, however, users are increasingly interested in only part of a video rather than the whole video, and want to watch only that part. For example, a user who wants to watch a soccer broadcast may want to see only the scenes in which a particular player scores a goal, rather than the entire broadcast. Because common video search methods treat the entire soccer broadcast as the search target, it has been difficult for the user to search for a particular scene within a video.

Korean Registered Patent No. 10-0721409

The present application provides a video service providing method that can divide a video into meaning-based unit sections and automatically generate keywords for each unit section, and a service server using the same.

The present application provides a video service providing method that can divide a video into a plurality of unit sections based on changes in the characteristics of the audio in the video, and a service server using the same.

The present application provides a video service providing method that can automatically generate keywords reflecting the content of each unit section by applying speech recognition and subtitle recognition to each unit section into which the video has been divided, and a service server using the same.

The present application provides a video service providing method that can automatically generate keywords reflecting the content of each unit section of a video by applying natural language processing based on machine learning, and a service server using the same.

A video service providing method according to an embodiment of the present invention is a method in which a service server provides a video to a terminal device, and includes: a unit section separation step of dividing the video into a plurality of unit sections based on changes in the characteristics of the audio contained in the video; a script string generation step of recognizing the speech contained in a unit section and generating a script string corresponding to the speech; a subtitle string generation step of recognizing a subtitle image contained in the unit section and generating a subtitle string corresponding to the subtitle image; and a keyword generation step of applying natural language processing (NLP) to the script string and the subtitle string to generate keywords corresponding to the unit section.

A service server according to an embodiment of the present invention includes: a unit section separation unit that divides a video into a plurality of unit sections based on changes in the characteristics of the audio contained in the video; a script string generation unit that recognizes the speech contained in a unit section and generates a script string corresponding to the speech; a subtitle string generation unit that recognizes a subtitle image contained in the unit section and generates a subtitle string corresponding to the subtitle image; and a keyword generation unit that applies natural language processing to the script string and the subtitle string to generate keywords corresponding to the unit section.

A service server according to another embodiment of the present invention includes a processor and a memory coupled to the processor. The memory includes one or more modules configured to be executed by the processor, and the one or more modules include instructions that divide a video into a plurality of unit sections based on changes in the characteristics of the audio contained in the video, recognize the speech contained in a unit section and generate a script string corresponding to the speech, recognize a subtitle image contained in the unit section and generate a subtitle string corresponding to the subtitle image, and apply natural language processing to the script string and the subtitle string to generate keywords corresponding to the unit section.

The matters described under the means for solving the problem do not list all the features of the present invention. The various features of the present invention, and the advantages and effects that follow from them, can be understood in more detail with reference to the specific embodiments below.

According to the video service providing method and the service server using the same according to an embodiment of the present invention, the video is divided based on changes in the characteristics of the audio in the video, so the video can be divided without breaking its context or meaning.

According to the video service providing method and the service server using the same according to an embodiment of the present invention, speech recognition and subtitle recognition are applied to extract the content contained in each unit section, and that content is then used to set the keywords for the unit section, so keywords that reflect the content of each unit section can be set.

According to the video service providing method and the service server using the same according to an embodiment of the present invention, a user can search for a specific scene in a video based on its content, and a summary video can be generated based on a specific subject or content.

The effects achievable by the video service providing method and the service server using the same according to embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the following description by a person of ordinary skill in the technical field to which the present invention belongs.

FIG. 1 is a schematic diagram showing a video service providing system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a service server according to an embodiment of the present invention.
FIG. 3 is a block diagram showing a service server according to an embodiment of the present invention.
FIG. 4 is a schematic diagram showing the separation of a video into unit sections according to an embodiment of the present invention.
FIG. 5 is a schematic diagram showing the generation of a script string and a subtitle string according to an embodiment of the present invention.
FIG. 6 is a schematic diagram showing the detection of a subtitle image according to an embodiment of the present invention.
FIG. 7 is a flowchart showing a video service providing method according to another embodiment of the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numbers regardless of the drawing, and redundant descriptions of them are omitted. The suffixes "module" and "unit" attached to components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not by themselves carry distinct meanings or roles. That is, the term "unit" as used in the present invention means software or a hardware component such as an FPGA or ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium, or may be configured to run on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided by the components and "units" may be combined into a smaller number of components and "units", or further separated into additional components and "units".

In describing the embodiments disclosed in this specification, a detailed description of related known technology is omitted where it is judged that it could obscure the gist of the embodiments. The accompanying drawings are provided only to make the embodiments disclosed in this specification easier to understand. They do not limit the technical ideas disclosed in this specification, which should be understood to include all modifications, equivalents, and alternatives that fall within the spirit and technical scope of the present invention.

FIG. 1 is a schematic diagram showing a video service providing system according to an embodiment of the present invention.

Referring to FIG. 1, the video service providing system according to an embodiment of the present invention may include a terminal device 1 and a service server 100.

The video service providing system according to an embodiment of the present invention is described below with reference to FIG. 1.

The terminal device 1 can communicate with the service server 100 over a network and can receive the video service provided by the service server 100. The terminal device 1 may include a display unit, a speaker, and the like for presenting content such as video to the user visually or audibly, and may include an input unit that receives user input, a memory storing at least one program, and a processor.

The terminal device 1 may be a mobile terminal such as a smartphone or tablet PC, or a stationary device such as a desktop. Depending on the embodiment, the terminal device 1 may be a mobile phone, a smartphone, a laptop computer, a digital broadcasting terminal, a PDA (personal digital assistant), a PMP (portable multimedia player), a slate PC, a tablet PC, an ultrabook, or a wearable device (for example, a smartwatch, smart glasses, or a head-mounted display (HMD)).

The network connecting the terminal device 1 and the service server 100 may include wired and wireless networks, and specifically may include various networks such as a local area network (LAN), a metropolitan area network (MAN), and a wide area network (WAN). The network may also include the known World Wide Web (WWW). However, the network according to the present invention is not limited to the networks listed above, and may include a known wireless data network, a known telephone network, a known wired or wireless television network, and the like.

The service server 100 can provide a video service to the terminal device 1 over the network. The service server 100 stores a plurality of video contents that can be provided to the terminal device 1, and can provide a video to the terminal device 1 at its request. For example, the service server 100 can stream content such as video in real time, or make such content available for download.

When providing the video service, the service server 100 can also provide meta information about the video. That is, by setting meta information for the video itself, the service server 100 can give the user additional information such as the video's cast, story, and genre, and can use that information to offer services such as video search and recommendation.

In addition to setting meta information for the video itself, the service server 100 according to an embodiment of the present invention can set meta information for the content contained within the video. That is, the service server 100 divides the video into meaning-based unit sections and then sets keywords for each unit section, so that the user can navigate to only the desired sections of the whole video. It can also gather the unit sections that share the same keyword and provide the user with a summary video that condenses the whole video.
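The keyword-per-section idea above can be pictured as a small data structure. The following is only an illustrative sketch, not the implementation described in the specification: the names `UnitSection`, `search`, and `summary_ranges` and the use of seconds for section boundaries are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitSection:
    """One meaning-based unit section of a video (hypothetical structure)."""
    start: float          # start time in seconds (assumed unit)
    end: float            # end time in seconds (assumed unit)
    keywords: frozenset   # keywords set for this section


def search(sections, keyword):
    """Return the unit sections whose keyword set contains `keyword`."""
    return [s for s in sections if keyword in s.keywords]


def summary_ranges(sections, keyword):
    """Return (start, end) ranges, in playback order, that a summary
    video for `keyword` would concatenate."""
    hits = sorted(search(sections, keyword), key=lambda s: s.start)
    return [(s.start, s.end) for s in hits]


sections = [
    UnitSection(0.0, 42.5, frozenset({"weather"})),
    UnitSection(44.0, 90.0, frozenset({"election", "economy"})),
    UnitSection(92.0, 130.0, frozenset({"economy"})),
]
print(summary_ranges(sections, "economy"))
```

Storing keywords on each section rather than on the whole video is what allows both the scene-level search and the keyword-based summary to be answered from the same records.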

FIG. 2 is a block diagram showing a service server according to an embodiment of the present invention.

Referring to FIG. 2, the service server 100 according to an embodiment of the present invention may include a unit section separation unit 110, a script string generation unit 120, a subtitle string generation unit 130, a keyword generation unit 140, a search unit 150, and a summary video generation unit 160.

The service server 100 according to an embodiment of the present invention is described below with reference to FIG. 2.

The unit section separation unit 110 can divide a video into a plurality of unit sections. That is, the unit section separation unit 110 can load the target video and divide it into a plurality of unit sections based on changes in the characteristics of the audio contained in the loaded video. Here, the change in audio characteristics may be a change in volume or sound quality and, depending on the embodiment, may also include changes in pitch, timbre, and the like.

Specifically, the unit section separation unit 110 can track the volume in the video in order to identify changes in audio characteristics. For example, the volume may stay within a certain range for some stretch of the video and then abruptly leave that range, rising or falling sharply. The unit section separation unit 110 can track the volume in the video and detect the points in the video where such a change occurs. That is, the unit section separation unit 110 can use the amount of change in volume to detect points where the volume rises or falls sharply.

Here, the amount of change in volume can be calculated relative to the average volume over a certain stretch of the video, or relative to the maximum or minimum volume appearing in that stretch. That is, the unit section separation unit 110 can compare the measured volume against a reference such as the average to calculate how much it has changed, and can set a point where the volume has increased by at least a certain threshold as a rise point and a point where it has decreased by at least a certain threshold as a fall point. The thresholds for rise points and fall points may be set to different values, and the thresholds may be set differently for each video.
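The rise-point and fall-point rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the volume is assumed to be available as a list of per-frame loudness values, the reference is taken to be the mean of the preceding `window` samples (the text also allows a maximum or minimum), and the function name, parameter names, and default values are hypothetical.

```python
from statistics import mean


def find_change_points(volumes, window=5, rise_threshold=10.0, fall_threshold=10.0):
    """Return (rise_points, fall_points) as lists of sample indices.

    Each sample is compared with the mean of the `window` samples before it.
    Only the first sample of a run of consecutive detections is reported.
    """
    rises, falls = [], []
    previous = None  # classification of the previous sample
    for i in range(window, len(volumes)):
        baseline = mean(volumes[i - window:i])
        delta = volumes[i] - baseline
        if delta >= rise_threshold:
            current = "rise"
        elif delta <= -fall_threshold:
            current = "fall"
        else:
            current = None
        if current is not None and current != previous:
            (rises if current == "rise" else falls).append(i)
        previous = current
    return rises, falls


# Steady level, a sudden jump (for example, a crowd cheering), then a drop.
volumes = [50] * 6 + [70] * 6 + [40] * 6
print(find_change_points(volumes))
```

Keeping `rise_threshold` and `fall_threshold` as separate parameters mirrors the statement that the two thresholds may differ and may be tuned per video.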

The unit section separation unit 110 can divide the video into a plurality of unit sections at the rise points or fall points of the volume. This makes it possible, for example, to detect a home run scene from the crowd's cheering when a batter hits a home run in a baseball game, or to detect the pause a news anchor takes before moving on to the next news item.

Depending on the embodiment, the unit section separation unit 110 may also examine the sound quality in the video in order to identify changes in the characteristics of the audio. For example, the unit section separation unit 110 can detect a point where the audio in the video goes from good quality to suddenly noisy, and divide the video into unit sections at the detected point. That is, the unit section separation unit 110 can detect the change in sound quality that occurs when a news anchor hands over to a reporter in the field, and divide the video on that basis. Furthermore, when the video contains a plurality of speakers, the unit section separation unit 110 can distinguish the speakers by timbre and then divide the video into unit sections by speaker. The unit section separation unit 110 can also detect changes in audio characteristics by various other methods and divide the video into unit sections accordingly.

In a news video, the anchor reads the script at a steady pace, pauses briefly at the end of each paragraph, and then continues with the next paragraph. The paragraphs read by the speaker in the video can therefore be distinguished from one another by the change in the speaker's volume. A single paragraph generally contains content on a single subject, so dividing the video at paragraph boundaries divides it by meaning. Even for a video without a fixed script, dividing the video by the change in the speaker's volume is advantageous for preserving the context of what the speaker says. The unit section separation unit 110 can therefore divide the video into a plurality of unit sections based on the amount of change in the volume of the audio contained in the video.

For example, as shown in FIG. 4, the change in the anchor's volume in a news video V can be used to divide the whole video into sections A, in which the anchor is speaking, and stop sections B, in which the speech is interrupted. Because the video is divided based on the change in the anchor's volume, a single unit section can contain several scene changes.

Since each section A in which the anchor speaks corresponds to a unit section, the stop sections B in which the speech is interrupted can be set as cutting points to separate the unit sections. Here, a stop section B can be defined as a section in which the volume falls below a set value and remains below that value for at least a reference time. The length of the stop section B may be set differently for each video.
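The stop-section rule (volume below a set value for at least a reference time) can be sketched as follows. This is an illustrative sketch only: the volume is assumed to be sampled at a fixed rate so that the reference time can be expressed as a sample count, and the names `find_stop_sections`, `split_into_unit_sections`, `set_value`, and `min_length` are hypothetical.

```python
def find_stop_sections(volumes, set_value, min_length):
    """Return (start, end) index pairs (end exclusive) where the volume stays
    below `set_value` for at least `min_length` consecutive samples."""
    sections = []
    start = None
    for i, v in enumerate(volumes):
        if v < set_value:
            if start is None:
                start = i  # a quiet run begins here
        else:
            if start is not None and i - start >= min_length:
                sections.append((start, i))
            start = None
    # Handle a quiet run that lasts until the end of the audio.
    if start is not None and len(volumes) - start >= min_length:
        sections.append((start, len(volumes)))
    return sections


def split_into_unit_sections(n_samples, stop_sections):
    """Use the stop sections as cutting points and return the remaining
    (start, end) ranges as unit sections."""
    units = []
    cursor = 0
    for start, end in stop_sections:
        if start > cursor:
            units.append((cursor, start))
        cursor = end
    if cursor < n_samples:
        units.append((cursor, n_samples))
    return units


volumes = [60, 62, 5, 4, 3, 61, 2, 63, 64, 1, 1, 1, 1]
stops = find_stop_sections(volumes, set_value=10, min_length=3)
print(stops)
print(split_into_unit_sections(len(volumes), stops))
```

In this example the single quiet sample at index 6 is shorter than the reference length, so it is not treated as a stop section and does not cut the surrounding unit section.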

The unit section separation unit 110 can therefore use changes in audio characteristics to identify the stop sections B contained in the video, and use them to divide the video into a plurality of unit sections.

The script string generation unit 120 can recognize the speech contained in a unit section and generate a script string corresponding to the speech. Once the video has been divided into a plurality of unit sections, the content contained in each unit section needs to be recognized. To do so, the script string generation unit 120 can recognize the speech uttered by the speaker, convert it into characters, and combine the converted characters to generate a script string.

Depending on the embodiment, a separate speech recognition device may be provided in the service server 100, and the script string generation unit 120 can use the speech recognition device to convert speech into characters. For example, the speech contained in a unit section can be represented as a speech pattern, which is an electrical signal, and a speech model database or the like may store a standard speech pattern corresponding to each character. In this case, the speech recognition device can compare the input speech patterns with the standard speech patterns stored in the speech model database and extract the standard speech pattern corresponding to each input speech pattern. The extracted standard speech patterns can then be converted into their corresponding characters, and the converted characters can be combined to generate a script string. That is, as shown in FIG. 5, the script string generation unit 120 can recognize the speech uttered by the speaker in the video and generate a script string S1.

However, the way in which the script string generation unit 120 converts speech into characters is not limited to this, and the script string generation unit 120 can convert the speech contained in the video into characters by various other methods.

The subtitle string generation unit 130 can recognize a subtitle image contained in a unit section and generate a subtitle string corresponding to the subtitle image. A video may contain subtitle images that emphasize what the speaker is saying or what the video is trying to convey. For example, as shown in FIG. 5, a news video also contains a subtitle image C that summarizes the main content of the news.

Because a subtitle image displays a summary of the video's content in this way, the characters contained in the subtitle image need to be recognized in order to determine the content of each unit section. However, a subtitle image is perceived as shapes rather than as characters, so a character recognition algorithm or the like has to be applied to recognize the characters it contains.

Depending on the embodiment, a separate character recognition device may be provided in the service server 100, and the subtitle string generation unit 130 can use the character recognition device to convert a subtitle image into characters. For example, the subtitle image contained in a unit section can be scanned so that the distribution of its pixel values is represented as a shape pattern, which is an electrical signal, and a character model database or the like may store a standard shape pattern corresponding to each character. In this case, the character recognition device can compare the input shape patterns with the standard shape patterns stored in the character model database and extract the standard shape pattern corresponding to each input shape pattern. Each extracted standard shape pattern can then be converted into its corresponding character to generate a subtitle string. That is, as shown in FIG. 5, the shapes contained in the subtitle image C in a video frame f can be converted into characters and extracted as a subtitle string S2.
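The comparison of input shape patterns against stored standard shape patterns can be illustrated with a toy nearest-pattern match. Everything here is an assumption made for illustration: the 3x3 binary bitmaps, the three-character "database", and the use of Hamming distance as the comparison measure are not taken from the specification, and a practical character recognizer would be considerably more elaborate.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical character model database: one 3x3 binary bitmap per character,
# flattened row by row.
STANDARD_PATTERNS = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
    "T": (1, 1, 1,
          0, 1, 0,
          0, 1, 0),
}


def recognize(shape_patterns, standard=STANDARD_PATTERNS):
    """Map each input shape pattern to the character whose standard pattern
    is closest, and join the characters into a string."""
    chars = []
    for pattern in shape_patterns:
        best = min(standard, key=lambda ch: hamming(pattern, standard[ch]))
        chars.append(best)
    return "".join(chars)


noisy_l = (1, 0, 0,
           1, 0, 0,
           1, 1, 0)  # an "L" with one pixel missing
print(recognize([STANDARD_PATTERNS["T"], noisy_l, STANDARD_PATTERNS["I"]]))
```

The same compare-and-pick-the-closest structure applies to the speech case, with speech patterns and a speech model database in place of shape patterns and a character model database.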

一方、字幕文字列生成部１３０が字幕イメージから字幕文字列を抽出するためには、単位区間内での字幕イメージの存在有無と、字幕イメージの動画フレーム内の位置を判別する必要がある。すなわち、字幕イメージが含まれる動画フレームに限って文字認識を実行し、動画フレーム内に字幕イメージが位置する領域に限って文字認識を実行するようにして、より効率的な文字認識が実行されるようにすることができる。また、それにより、動画フレーム内に含まれる字幕イメージでない他の文字を変換するなどの問題を防止することができる。したがって、字幕文字列生成部１３０においては、字幕文字列を生成する前に、まず、単位区間内の字幕イメージを含む動画フレームを検出し、動画フレーム内に含まれる字幕イメージの位置を特定することができる。 On the other hand, in order for the subtitle character string generation unit 130 to extract the subtitle character string from the subtitle image, it is necessary to determine whether a subtitle image is present in the unit interval and where it is located in the moving image frame. That is, character recognition can be made more efficient by executing it only on moving image frames that include a subtitle image, and only in the area of the frame where the subtitle image is located. This also prevents problems such as converting other characters in the moving image frame that are not part of the subtitle image. Therefore, before generating the subtitle character string, the subtitle character string generation unit 130 can first detect the moving image frames in the unit interval that include a subtitle image and specify the position of the subtitle image within those frames.

具体的には、字幕文字列生成部１３０は、単位区間に含まれる各々の動画フレームに複数のランドマークを設定することができる。すなわち、図６に示すように、動画フレーム内にランドマークＬが均一に位置するように設定することができ、各々のランドマークＬにおいて色相または輝度などを測定することができる。具体的には、ランドマークＬの位置に対応するピクセルから各々のピクセルの色相、輝度などの入力を受けることができる。 Specifically, the subtitle character string generation unit 130 can set a plurality of landmarks for each moving image frame included in the unit interval. That is, as shown in FIG. 6, the landmarks L can be set to be uniformly located in the moving image frame, and the hue or the brightness can be measured at each landmark L. Specifically, it is possible to receive input such as hue and brightness of each pixel from the pixel corresponding to the position of the landmark L.

その後、ランドマークにおいて測定された色相または輝度などが字幕イメージに対応する基準色相または基準輝度に該当すれば、その動画フレーム内に字幕イメージが位置すると判別することができる。図６に示すように、字幕イメージＣは原本画像Ｄを覆う形態で表示されることがあり、字幕イメージＣは基準色相と基準輝度を有するように設定されることができる。ここで、字幕イメージＣの基準色相、基準輝度は、原本画像Ｄとは区別される特徴的な色相や輝度を有するように設定されるため、字幕文字列生成部１３０は、色相や輝度を用いて字幕イメージを区別することができる。 After that, if the hue or brightness measured at a landmark corresponds to the reference hue or reference brightness of the subtitle image, it can be determined that a subtitle image is located in that moving image frame. As shown in FIG. 6, the subtitle image C may be displayed so as to cover the original image D, and the subtitle image C can be set to have a reference hue and a reference brightness. Here, since the reference hue and reference brightness of the subtitle image C are set to characteristic values that are distinguishable from the original image D, the subtitle character string generation unit 130 can distinguish the subtitle image by using hue or brightness.

また、字幕文字列生成部１３０は、動画フレーム上に均一に分布する複数のランドマークのうち、字幕イメージに対応する基準色相または基準輝度が測定されたランドマークを抽出することができ、抽出されたランドマークを用いて字幕イメージの位置または大きさを特定することができる。すなわち、各々のランドマークの動画フレーム内での設定座標などが予め設定されていてもよく、字幕文字列生成部１３０は、字幕イメージを検出したランドマークの設定座標を用いて、該字幕イメージの位置と大きさを特定することができる。この場合、字幕文字列生成部１３０は、特定された字幕イメージ領域内でのみ文字認識を実行するように制御することができる。すなわち、全体動画フレームのうち文字認識を実行する領域を特定することができるため、より効率的な文字認識が可能である。 Further, from among the plurality of landmarks uniformly distributed over the moving image frame, the subtitle character string generation unit 130 can extract the landmarks at which the reference hue or reference brightness corresponding to the subtitle image was measured, and can use the extracted landmarks to specify the position or size of the subtitle image. That is, the coordinates of each landmark within the moving image frame may be set in advance, and the subtitle character string generation unit 130 can specify the position and size of the subtitle image by using the coordinates of the landmarks that detected it. In this case, the subtitle character string generation unit 130 can perform control so that character recognition is executed only within the specified subtitle image area. That is, since the area of the whole moving image frame in which character recognition is executed can be narrowed down, more efficient character recognition is possible.
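
By way of non-limiting illustration, the landmark-based detection described above could be sketched as follows. The frame representation (a grid of `(hue, brightness)` pairs), the grid spacing `step`, the tolerance values, and the choice to require both hue and brightness to match are all assumptions made for this sketch; the description leaves these details open.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[float, float]      # (hue in degrees, brightness in 0..1), assumed encoding
Frame = List[List[Pixel]]        # frame[y][x]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def landmark_grid(width: int, height: int, step: int) -> List[Tuple[int, int]]:
    """Evenly spaced landmark coordinates covering the whole frame."""
    return [(x, y)
            for y in range(step // 2, height, step)
            for x in range(step // 2, width, step)]


def matches_reference(pixel: Pixel, ref_hue: float, ref_brightness: float,
                      hue_tol: float = 10.0, bright_tol: float = 0.1) -> bool:
    """True if the pixel is close to the subtitle's reference hue and brightness."""
    hue, brightness = pixel
    diff = abs(hue - ref_hue) % 360
    diff = min(diff, 360 - diff)  # hue is circular
    return diff <= hue_tol and abs(brightness - ref_brightness) <= bright_tol


def locate_subtitle(frame: Frame, ref_hue: float, ref_brightness: float,
                    step: int = 4) -> Optional[Box]:
    """Return the bounding box of the landmarks that hit the subtitle image,
    or None if no landmark matches (no subtitle in this frame)."""
    height, width = len(frame), len(frame[0])
    hits = [(x, y) for x, y in landmark_grid(width, height, step)
            if matches_reference(frame[y][x], ref_hue, ref_brightness)]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))
```

The returned box is bounded by landmark coordinates, so it is accurate only to within the grid spacing; character recognition would then be restricted to this region, padded as needed.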

一方、字幕文字列生成部１３０は、動画製作者から各々の動画に用いた字幕イメージの基準色相や基準輝度、動画フレーム内での位置や大きさなどの特徴情報の提供を受け、字幕イメージの抽出時にそれを活用することができる。例えば、字幕イメージの位置や大きさなどに対する特徴情報を受け取る場合には、ランドマークを動画フレーム全体に均一に設定せず、字幕イメージが位置するものとして設定された領域内に限定して、ランドマークを設定することができる。 On the other hand, the subtitle character string generation unit 130 can receive, from the video creator, feature information on the subtitle image used in each video, such as its reference hue and reference brightness and its position and size within the moving image frame, and can use this information when extracting the subtitle image. For example, when feature information on the position and size of the subtitle image is received, the landmarks need not be set uniformly over the entire moving image frame and can instead be set only within the area in which the subtitle image is specified to be located.

キーワード生成部１４０は、スクリプト文字列および字幕文字列に自然言語処理（ＮａｔｕｒａｌＬａｎｇｕａｇｅＰｒｏｃｅｓｓｉｎｇ）を適用して、単位区間に対応するキーワードを生成することができる。すなわち、ユーザが単位区間の内容を確認した後、それに対応してキーワードや注釈などを設定するのではなく、各々の単位区間に対する意味を基にしたキーワードを自動で設定することができる。ここで、スクリプト文字列および字幕文字列に適用する自然言語処理には様々な方法などが適用されることができ、実施形態によっては、ｗｏｒｄ２ｖｅｃ、ＬＤＡ（ＬａｔｅｎｔＤｉｒｉｃｈｌｅｔＡｌｌｏｃａｔｉｏｎ）などの機械学習（ｍａｃｈｉｎｅｌｅａｒｎｉｎｇ）が適用されることができる。 The keyword generation unit 140 can apply natural language processing (Natural Language Processing) to the script character string and the subtitle character string to generate keywords corresponding to the unit interval. That is, instead of a user checking the content of each unit interval and then setting keywords or annotations for it, keywords based on the meaning of each unit interval can be set automatically. Here, various methods can be used for the natural language processing applied to the script character string and the subtitle character string, and depending on the embodiment, machine learning techniques such as word2vec and LDA (Latent Dirichlet Allocation) can be applied.

一実施形態によれば、キーワード生成部１４０は、ｗｏｒｄ２ｖｅｃを用いて単語埋め込み（ｗｏｒｄｅｍｂｅｄｄｉｎｇ）したｗｏｒｄ２ｖｅｃモデルを実現することができ、字幕文字列またはスクリプト文字列から抽出した単語をｗｏｒｄ２ｖｅｃモデルに対する入力単語に設定して、入力単語に対応する関連単語を抽出することができる。その後、抽出された関連単語を、その単位区間に対するキーワードとして設定することができる。 According to one embodiment, the keyword generation unit 140 can build a word2vec model by performing word embedding with word2vec, set words extracted from the subtitle character string or the script character string as input words to the word2vec model, and extract related words corresponding to the input words. The extracted related words can then be set as keywords for the unit interval.

例えば、サービスサーバ１００が提供する動画がニュース動画である場合には、最近５年間のニュース記事などを、ｗｏｒｄ２ｖｅｃを用いて単語埋め込みする方式で、ｗｏｒｄ２ｖｅｃモデルを実現することができる。Ｗｏｒｄ２ｖｅｃの場合、各々の単語をベクトル空間に埋め込んで単語をベクトルで表すものであり、互いに関連する単語は空間上で隣接して配置される特徴がある。すなわち、ｗｏｒｄ２ｖｅｃモデルが学習する複数のサンプルにおいて各々の単語が互いに隣接して現れる頻度が高いほど、ベクトル空間上で隣接して表示されることができる。例えば、サンプルに用いられた既存のニュース記事において、「ブレグジット」と関連して「英国」、「ユーロ圏」、「脱退」などがよく言及されると、「ブレグジット」と「英国」、「ユーロ圏」、「脱退」などに対応するベクトルは互いに隣接して埋め込まれることができ、これらは互いに関連があると判別することができる。 For example, when the videos provided by the service server 100 are news videos, the word2vec model can be built by word-embedding news articles and the like from the last five years using word2vec. Word2vec embeds each word in a vector space and represents it as a vector, with the characteristic that related words are placed close to each other in that space. That is, the more frequently words appear near each other in the samples on which the word2vec model is trained, the closer together they are placed in the vector space. For example, if "UK", "Eurozone", and "withdrawal" are frequently mentioned in connection with "Brexit" in the existing news articles used as samples, the vectors corresponding to "Brexit", "UK", "Eurozone", and "withdrawal" are embedded close to each other, and these words can be determined to be related.

但し、スクリプト文字列には複数の単語が含まれるため、スクリプト文字列に含まれる各々の単語に対応して抽出される関連単語を全てキーワードに設定するにはキーワードが過度に多くなりうる。それを防止するために、キーワード生成部１４０は、関連単語と入力単語を比較して類似度が高い関連単語だけをキーワードに設定することができる。 However, since the script character string contains a plurality of words, setting every related word extracted for each of those words as a keyword could result in an excessive number of keywords. To prevent this, the keyword generation unit 140 can compare the related words with the input word and set only the related words with a high degree of similarity as keywords.

具体的には、キーワード生成部１４０は、ｗｏｒｄ２ｖｅｃモデルに入力した入力単語に対応する入力単語ベクトルと、関連単語に対応する関連単語ベクトルとの間の類似度を計算して、類似度が高い関連単語だけを抽出してキーワードに設定することができる。 Specifically, the keyword generation unit 140 can calculate the similarity between the input word vector corresponding to the input word given to the word2vec model and the related word vector corresponding to each related word, and can extract only the related words with a high similarity and set them as keywords.

単語埋め込みを通じて各々の単語は空間上でベクトル化して分布されることができ、学習したサンプルにおいて互いに類似するかまたは関連していると設定された単語は、ベクトル空間上で隣接した位置に位置するようになる。したがって、入力単語ベクトルと関連単語ベクトルとの間の類似度を計算して、入力単語と関連単語の間の関係を把握することができる。ここで、ベクトル間の類似度はコサイン類似度（ｃｏｓｉｎｅｓｉｍｉｌａｒｉｔｙ）を用いて計算することができるが、これに限定されず、ベクトル間の類似度を計算できるものであれば、いかなるものを適用してもよい。 Through word embedding, each word is vectorized and distributed in a space, and words that are similar or related to each other in the training samples come to be located close together in the vector space. Therefore, the relationship between an input word and a related word can be assessed by calculating the similarity between the input word vector and the related word vector. Here, the similarity between vectors can be calculated using cosine similarity, but it is not limited to this, and any measure capable of calculating the similarity between vectors may be applied.

キーワード生成部１４０は、入力ベクトルとの類似度が所定値以上の関連単語ベクトルを抽出することができ、抽出された関連単語ベクトルに対応する関連単語をキーワードに設定することができる。すなわち、類似度が所定値以上の関連単語ベクトルに該当する関連単語だけをキーワードに設定することができる。また、実施形態によっては、入力ベクトルとの類似度が高い順に応じて既に設定された個数の関連単語ベクトルを抽出することができ、抽出された既に設定された個数の関連単語ベクトルに対応する関連単語をキーワードに設定することもできる。例えば、最も類似度が大きい関連単語ベクトルを１０個抽出し、抽出された１０個の関連単語をキーワードに設定することができる。 The keyword generation unit 140 can extract the related word vectors whose similarity to the input vector is equal to or greater than a predetermined value, and set the related words corresponding to the extracted vectors as keywords. That is, only related words whose vectors have a similarity equal to or greater than the predetermined value are set as keywords. Alternatively, depending on the embodiment, a preset number of related word vectors can be extracted in descending order of similarity to the input vector, and the related words corresponding to them can be set as keywords. For example, the 10 related word vectors with the highest similarity can be extracted, and the corresponding 10 related words can be set as keywords.
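
By way of non-limiting illustration, the similarity-based selection described above (a threshold on cosine similarity, or a preset number of top-ranked related words) could be sketched as follows. The two-dimensional vectors and the function names are hypothetical; in practice the vectors would come from a trained word2vec model.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_keywords(input_vec: Sequence[float],
                    related: Dict[str, Sequence[float]],
                    threshold: Optional[float] = None,
                    top_n: Optional[int] = None) -> List[str]:
    """Rank related words by similarity to the input word vector, then keep
    those at or above `threshold` and/or the first `top_n` of them."""
    scored = sorted(
        ((cosine_similarity(input_vec, vec), word) for word, vec in related.items()),
        key=lambda pair: (-pair[0], pair[1]),  # highest similarity first
    )
    if threshold is not None:
        scored = [(s, w) for s, w in scored if s >= threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [w for _, w in scored]


# Toy vectors standing in for word2vec embeddings (illustrative values only).
brexit = [1.0, 0.0]
related_words = {"UK": [1.0, 0.1], "withdrawal": [1.0, 1.0], "weather": [0.0, 1.0]}
by_threshold = select_keywords(brexit, related_words, threshold=0.7)
by_count = select_keywords(brexit, related_words, top_n=1)
```

The threshold variant yields a variable number of keywords per unit interval, whereas the top-N variant fixes the count; the two can also be combined.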

さらに、キーワード生成部１４０がリアルタイム検索語情報を用いてキーワードを設定する実施形態も可能である。リアルタイム検索語情報は、ポータルサイトなどが提供する検索サービスで用いられる検索語のうち、リアルタイムで検索量が急増した検索語に対する情報であってもよい。リアルタイム検索語情報に含まれる各々の検索語は現在イシューになっている主題に関するものであるため、キーワード生成部１４０はリアルタイム検索語と関連する単語を優先的にキーワードに設定することができる。リアルタイム検索語情報は、サービスサーバ１００が外部から受信してキーワード生成部１４０に提供されることができる。 Further, an embodiment in which the keyword generation unit 140 sets a keyword using real-time search term information is also possible. The real-time search term information may be information for a search term whose search amount has rapidly increased in real time among the search terms used in a search service provided by a portal site or the like. Since each search word included in the real-time search word information is related to the subject currently being an issue, the keyword generation unit 140 can preferentially set a word related to the real-time search word as a keyword. The real-time search term information can be received by the service server 100 from the outside and provided to the keyword generation unit 140.

具体的には、キーワード生成部１４０は、ｗｏｒｄ２ｖｅｃモデルから抽出した関連単語のうち、リアルタイム検索語情報に含まれる検索語に対応する関連単語を抽出し、抽出された関連単語に対しては類似度の計算時に加重値を付加することができる。すなわち、相対的に類似度が低い場合にも、リアルタイム検索語情報に対応する関連単語に対しては加重値によりキーワードに設定されることができる。このとき、検索語のリアルタイム検索順位に応じて、検索語に対応する関連単語に提供する加重値を互いに異なるように付与することもできる。例えば、リアルタイム検索語の１位に該当する検索語と５位に該当する検索語に対して加重値を互いに異なるように設定することができる。 Specifically, from among the related words extracted from the word2vec model, the keyword generation unit 140 can extract those that correspond to search terms included in the real-time search term information, and can add a weight to these related words when calculating similarity. That is, a related word corresponding to the real-time search term information can be set as a keyword by virtue of the weight even when its similarity is relatively low. The weights given to related words can also differ according to the real-time search ranking of the corresponding search term. For example, different weights can be set for the search term ranked first and the search term ranked fifth among the real-time search terms.
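
By way of non-limiting illustration, rank-dependent weighting with real-time search terms could be sketched as follows. The additive form of the weight, the linear decay by rank, and the `max_boost` parameter are assumptions made for this sketch; the description states only that a weight is added and that it may differ by search ranking.

```python
from typing import Dict, List


def boosted_scores(similarities: Dict[str, float],
                   trending: List[str],
                   max_boost: float = 0.3) -> Dict[str, float]:
    """Add a rank-dependent weight to related words that appear in the
    real-time search term list (`trending`, ordered from rank 1 downward)."""
    n = len(trending)
    boosted = {}
    for word, sim in similarities.items():
        if word in trending:
            rank = trending.index(word)        # 0 for the top-ranked term
            boost = max_boost * (n - rank) / n  # higher rank, larger weight
        else:
            boost = 0.0
        boosted[word] = sim + boost
    return boosted


# "tariff" has the lowest raw similarity but is currently the top search term.
raw = {"UK": 0.8, "tariff": 0.55, "weather": 0.6}
scores = boosted_scores(raw, trending=["tariff", "election"])
best = max(scores, key=scores.get)
```

Because the list of trending terms changes over time, rerunning this step can yield different keywords for the same unit interval on different occasions.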

キーワード設定時にリアルタイム検索語情報を活用する場合には、キーワード生成部１４０が、各々の単位区間に対して設定するキーワードを毎回互いに異なるように設定することができる。すなわち、ユーザの興味や需要を反映してキーワードを設定することができ、それにより、イシューとなった内容と関連した単位区間をユーザが容易に検索できるように提供することができる。 When the real-time search term information is used when setting keywords, the keyword generation unit 140 can set the keywords to be set for each unit interval so as to be different each time. That is, keywords can be set to reflect the interests and demands of the user, whereby the unit interval related to the content of the issue can be provided so that the user can easily search.

一方、実施形態によっては、キーワード生成部１４０は、ＬＤＡ（ＬａｔｅｎｔＤｉｒｉｃｈｌｅｔＡｌｌｏｃａｔｉｏｎ）を用いてキーワードを設定することもできる。すなわち、ＬＤＡで学習した機械学習モデルにスクリプト文字列および字幕文字列を適用して単位区間に対応する主題語を抽出することができ、その後、抽出された主題語を該単位区間のキーワードに設定することができる。 On the other hand, depending on the embodiment, the keyword generation unit 140 can also set keywords by using LDA (Latent Dirichlet Allocation). That is, a script character string and a subtitle character string can be applied to a machine learning model learned by LDA to extract a theme word corresponding to a unit interval, and then the extracted theme word is set as a keyword of the unit interval. can do.

ＬＤＡは、トピックモデル（ｔｏｐｉｃｍｏｄｅｌ）の１つであり、複数の文書集合を用いて各文書にどのような主題が存在するかを分類できる教師なし学習アルゴリズムに該当する。ＬＤＡを用いてモデリングをすれば、特定の主題に該当する単語と、特定の文書に含まれる主題を結果物として得ることができる。 LDA is one of the topic models (topic model), and corresponds to an unsupervised learning algorithm that can classify what kind of subject exists in each document using a plurality of document sets. By modeling using LDA, words corresponding to a specific subject and subjects contained in a specific document can be obtained as a result.

例えば、サービスサーバ１００が提供する動画がニュース動画である場合には、ＬＤＡを用いて最近５年間のニュース記事などを学習させて機械学習モデルを実現することができる。この場合、各々の記事に含まれる主題を示す主題語と、各々の主題語に対応する単語の集合を抽出することができる。例えば、ブレグジットに関する記事に対して、「英国」、「ユーロ圏」、「ハードブレックシート」、「ノディルブレックシート」の主題を含むものに分類することができ、「ノディルブレックシート」主題と関連して「ノディル」、「合意案」、「否決」、「脱退」などの単語が該主題に含まれるものに設定することができる。したがって、ニュース動画のいずれか１つの単位区間から抽出したスクリプト文字列と字幕文字列を機械学習モデルに入力すれば、入力したスクリプト文字列と字幕文字列に含まれる単語がどのような主題語に該当する単語であるかを確認することができ、それにより、該ニュース動画内にどのような主題語に対応する内容が含まれているかを把握することができる。その後、キーワード生成部１４０は、機械学習モデルを介して抽出された主題語を、該単位区間に対するキーワードに設定することができる。 For example, when the videos provided by the service server 100 are news videos, a machine learning model can be built by training LDA on news articles and the like from the last five years. In this case, subject words indicating the subjects contained in each article, and the set of words corresponding to each subject word, can be extracted. For example, an article about Brexit can be classified as containing the subjects "UK", "Eurozone", "hard Brexit", and "no-deal Brexit", and words such as "no deal", "draft agreement", "rejected", and "withdrawal" can be set as belonging to the "no-deal Brexit" subject. Therefore, if the script character string and subtitle character string extracted from any one unit interval of a news video are input to the machine learning model, it is possible to check which subject words the words in the input strings correspond to, and thereby to determine which subjects are covered in the news video. The keyword generation unit 140 can then set the subject words extracted via the machine learning model as keywords for the unit interval.
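
By way of non-limiting illustration, the step of mapping the words of a unit interval to subject words could be sketched as follows. This is not LDA itself: it is a deliberately simplified scoring step that assumes a topic model has already been trained and that its output is available as a table of word probabilities per labeled subject. The table `TOPIC_WORDS`, its values, and the `min_share` cutoff are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical output of a trained topic model: subject word -> P(word | subject).
TOPIC_WORDS: Dict[str, Dict[str, float]] = {
    "no-deal Brexit": {"no-deal": 0.4, "agreement": 0.3, "rejected": 0.2, "withdrawal": 0.1},
    "economy": {"inflation": 0.5, "rates": 0.3, "agreement": 0.2},
}


def topic_keywords(words: Iterable[str],
                   topic_words: Dict[str, Dict[str, float]],
                   min_share: float = 0.5) -> List[str]:
    """Score each subject by the summed probability of the interval's words
    and return the subjects holding at least `min_share` of the total score."""
    scores = defaultdict(float)
    for word in words:
        for topic, probs in topic_words.items():
            scores[topic] += probs.get(word, 0.0)
    total = sum(scores.values())
    if total == 0:
        return []  # none of the words belong to any known subject
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [topic for topic, score in ranked if score / total >= min_share]


# Words taken from the script and subtitle strings of one unit interval.
segment_words = ["parliament", "rejected", "no-deal", "agreement"]
segment_keywords = topic_keywords(segment_words, TOPIC_WORDS)
```

A full implementation would instead infer the topic distribution of the unit interval with the trained LDA model; the sketch shows only how subject words, once scored, become the keywords of the unit interval.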

また、実施形態によっては、キーワード生成部１４０が全体動画に対するキーワードを生成することもできる。具体的には、動画内に含まれる各々の単位区間に設定されたキーワードに自然言語処理を適用して、該動画に対応するキーワードを生成するようにすることができる。ここで、自然言語処理技法には、ｗｏｒｄ２ｖｅｃ、ＬＤＡなどの機械学習などが適用されることができる。すなわち、該動画全体の内容に対するキーワードを設定することがユーザの便宜上有利であるため、キーワード生成部１４０は該動画に対するキーワードも生成することができる。このとき、動画の内容を反映するために、各々の単位区間に対するキーワードを用いて、該動画のキーワードを生成することができる。 Further, depending on the embodiment, the keyword generation unit 140 may generate a keyword for the entire moving image. Specifically, natural language processing can be applied to the keywords set for each unit interval included in the moving image to generate the keyword corresponding to the moving image. Here, machine learning such as word2vec and LDA can be applied to the natural language processing technique. That is, since it is advantageous for the user to set a keyword for the content of the entire moving image, the keyword generation unit 140 can also generate a keyword for the moving image. At this time, in order to reflect the content of the moving image, the keyword of the moving image can be generated by using the keyword for each unit interval.

検索部１５０は、ユーザから入力されたキーワードに対応する単位区間を検索し、検索された単位区間をユーザに提供することができる。各々の単位区間にはキーワードが設定されているため、検索部１５０は特定の内容を含む単位区間を検索してユーザに提供することができる。また、検索部１５０は動画から分離された単位区間別に検索が可能であるため、ユーザが所望する単位区間だけを提供することができる。すなわち、検索部１５０によれば、動画サービスの提供時のユーザ利便性を大幅に向上させることができる。 The search unit 150 can search for a unit interval corresponding to a keyword input by the user and provide the searched unit interval to the user. Since a keyword is set for each unit interval, the search unit 150 can search for a unit interval including a specific content and provide it to the user. Further, since the search unit 150 can search for each unit interval separated from the moving image, it is possible to provide only the unit interval desired by the user. That is, according to the search unit 150, the user convenience at the time of providing the moving image service can be significantly improved.
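
By way of non-limiting illustration, keyword-based retrieval of unit intervals could be sketched as follows. The `UnitInterval` record, its fields, and the use of an in-memory inverted index are assumptions made for this sketch and are not taken from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class UnitInterval:
    video_id: str
    start: float              # seconds from the beginning of the video
    end: float
    keywords: FrozenSet[str]  # keywords generated for this unit interval


def build_index(intervals: Iterable[UnitInterval]) -> Dict[str, List[UnitInterval]]:
    """Inverted index: keyword -> unit intervals tagged with that keyword."""
    index: Dict[str, List[UnitInterval]] = defaultdict(list)
    for interval in intervals:
        for keyword in interval.keywords:
            index[keyword].append(interval)
    return index


def search(index: Dict[str, List[UnitInterval]], query: str) -> List[UnitInterval]:
    """Return only the unit intervals matching the query, not whole videos."""
    return index.get(query, [])


catalog = [
    UnitInterval("match-1", 95.0, 120.0, frozenset({"goal", "penalty"})),
    UnitInterval("match-1", 300.0, 330.0, frozenset({"foul"})),
    UnitInterval("match-2", 610.0, 640.0, frozenset({"goal"})),
]
index = build_index(catalog)
goal_scenes = search(index, "goal")
```

Because the index is keyed on unit intervals rather than whole videos, a query returns the matching scenes directly, across videos.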

要約動画生成部１６０は、同一の動画に対し、基準キーワードに対応する単位区間を抽出し、抽出された単位区間を結合して該動画に対する要約動画を生成することができる。ここで、基準キーワードは管理者により予め設定されるか、またはユーザから入力を受けてもよい。 The summary moving image generation unit 160 can extract unit intervals corresponding to the reference keywords for the same moving image and combine the extracted unit sections to generate a summary moving image for the moving image. Here, the reference keyword may be preset by the administrator or may be input by the user.

例えば、サッカー中継動画の場合、基準キーワードを「ゴール」、「得点」などに設定すれば、単位区間の中から得点シーンだけを抽出してゴールシーンをまとめた要約動画を生成することができ、基準キーワードを特定の選手の名前に設定すれば、その特定の選手がボールに触れる単位区間だけを抽出して、その特定の選手に対するハイライト要約動画を生成することができる。また、ニュース動画の場合には、基準キーワードを「経済」に設定して経済分野に対する要約動画を生成したり、「仮想通貨」などのような特定のイシューに対するニュースを集約して１つの要約動画に生成したりすることもできる。すなわち、動画に対する別の編集作業などを実行する必要がなく、容易に要約動画を生成してユーザに提供することができる。 For example, in the case of a soccer broadcast video, if the reference keywords are set to "goal", "score", and the like, only the scoring scenes can be extracted from the unit intervals to generate a summary video that compiles the goal scenes; if the reference keyword is set to the name of a specific player, only the unit intervals in which that player touches the ball can be extracted to generate a highlight summary video for that player. In the case of news videos, the reference keyword can be set to "economy" to generate a summary video for the economic field, or news on a specific issue such as "virtual currency" can be aggregated into a single summary video. That is, a summary video can be generated easily and provided to the user without separate editing work on the video.
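
By way of non-limiting illustration, selecting the unit intervals for a summary video could be sketched as follows. The tuple layout `(start, end, keywords)` and the function name are hypothetical, and the sketch produces only a cut list of time ranges; the actual joining of video segments is outside its scope.

```python
from typing import Iterable, List, Set, Tuple

Interval = Tuple[float, float, Set[str]]  # (start, end, keywords) within one video


def summary_cut_list(intervals: Iterable[Interval],
                     reference_keywords: Iterable[str]) -> List[Tuple[float, float]]:
    """Return the time ranges, in playback order, of the unit intervals whose
    keywords include at least one reference keyword."""
    wanted = set(reference_keywords)
    selected = [(start, end) for start, end, keywords in intervals
                if wanted & set(keywords)]
    return sorted(selected)


match = [
    (0.0, 30.0, {"kickoff"}),
    (610.0, 640.0, {"goal"}),
    (95.0, 120.0, {"goal", "penalty"}),
    (300.0, 330.0, {"foul"}),
]
cuts = summary_cut_list(match, ["goal", "score"])
summary_length = sum(end - start for start, end in cuts)
```

The reference keywords may be preset by an administrator or supplied by the user; the same function serves both cases.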

一方、本発明の一実施形態によるサービスサーバ１００は、図３に示すように、プロセッサ１０、メモリ４０などの物理的な構成を含むものであり、メモリ４０内には、プロセッサ１０により実行されるように構成される１つ以上のモジュールが含まれることができる。具体的には、１つ以上のモジュールには、単位区間分離モジュール、スクリプト文字列生成モジュール、字幕文字列生成モジュール、キーワード生成モジュール、検索モジュールおよび要約動画生成モジュールなどが含まれることができる。 On the other hand, as shown in FIG. 3, the service server 100 according to the embodiment of the present invention includes a physical configuration such as a processor 10 and a memory 40, and is executed by the processor 10 in the memory 40. One or more modules configured as such can be included. Specifically, one or more modules can include a unit interval separation module, a script character string generation module, a subtitle character string generation module, a keyword generation module, a search module, a summary moving image generation module, and the like.

プロセッサ１０は、様々なソフトウェアプログラムと、メモリ４０に格納されている命令語集合を実行して色々な機能を実行しデータを処理する機能を実行することができる。周辺インターフェース部３０は、コンピュータ装置の入出力周辺装置をプロセッサ１０、メモリ４０に連結することができ、メモリ制御部２０は、プロセッサ１０やコンピュータ装置の構成要素がメモリ４０にアクセスする場合に、メモリアクセスを制御する機能を実行することができる。実施形態によっては、プロセッサ１０、メモリ制御部２０および周辺インターフェース部３０を単一チップ上に実現するか、または別個のチップに実現してもよい。 The processor 10 can execute various software programs and the instruction sets stored in the memory 40 to perform various functions and process data. The peripheral interface unit 30 can connect the input/output peripheral devices of the computer device to the processor 10 and the memory 40, and the memory control unit 20 can control memory access when the processor 10 or another component of the computer device accesses the memory 40. Depending on the embodiment, the processor 10, the memory control unit 20, and the peripheral interface unit 30 may be implemented on a single chip or on separate chips.

メモリ４０は、高速ランダムアクセスメモリ、１つ以上の磁気ディスクストレージ、フラッシュメモリ装置のような不揮発性メモリなどを含むことができる。また、メモリ４０は、プロセッサ１０から離れて位置するストレージや、インターネットなどの通信ネットワークを介してアクセスされるネットワークアタッチトストレージなどをさらに含むことができる。 The memory 40 can include high-speed random access memory, one or more magnetic disk storages, non-volatile memory such as a flash memory device, and the like. Further, the memory 40 can further include storage located away from the processor 10, network attached storage accessed via a communication network such as the Internet, and the like.

一方、図３に示すように、本発明の一実施形態によるサービスサーバ１００は、メモリ４０にオペレーティングシステムをはじめとして、アプリケーションプログラムに該当する単位区間分離モジュール、スクリプト文字列生成モジュール、字幕文字列生成モジュール、キーワード生成モジュール、検索モジュールおよび要約動画生成モジュールなどを含むことができる。ここで、各々のモジュールは、上述した機能を実行するための命令語の集合として、メモリ４０に格納されることができる。 On the other hand, as shown in FIG. 3, the service server 100 according to the embodiment of the present invention can include, in the memory 40, an operating system as well as the unit interval separation module, script character string generation module, subtitle character string generation module, keyword generation module, search module, and summary video generation module, which correspond to application programs. Here, each module can be stored in the memory 40 as a set of instructions for executing the functions described above.

したがって、本発明の一実施形態によるサービスサーバ１００は、プロセッサ１０がメモリ４０にアクセスして各々のモジュールに対応する命令語を実行することができる。但し、単位区間分離モジュール、スクリプト文字列生成モジュール、字幕文字列生成モジュール、キーワード生成モジュール、検索モジュールおよび要約動画生成モジュールは、上述した単位区間分離部、スクリプト文字列生成部、字幕文字列生成部、キーワード生成部、検索部および要約動画生成部に各々対応するため、ここでは詳しい説明は省略する。 Therefore, in the service server 100 according to the embodiment of the present invention, the processor 10 can access the memory 40 and execute the instructions corresponding to each module. However, since the unit interval separation module, script character string generation module, subtitle character string generation module, keyword generation module, search module, and summary video generation module correspond respectively to the unit interval separation unit, script character string generation unit, subtitle character string generation unit, keyword generation unit, search unit, and summary video generation unit described above, a detailed description is omitted here.

図７は、本発明の一実施形態による動画サービス提供方法を示すフローチャートである。 FIG. 7 is a flowchart showing a method of providing a moving image service according to an embodiment of the present invention.

図７を参照すれば、本発明の一実施形態による動画サービス提供方法は、単位区間分離ステップ（Ｓ１０）、スクリプト文字列生成ステップ（Ｓ２０）、字幕文字列生成ステップ（Ｓ３０）、キーワード生成ステップ（Ｓ４０）、検索ステップ（Ｓ５０）および要約動画生成ステップ（Ｓ６０）を含むことができる。ここで、本発明の一実施形態による動画サービス提供方法は、サービスサーバにより実行されることができる。 Referring to FIG. 7, the video service providing method according to the embodiment of the present invention includes a unit interval separation step (S10), a script character string generation step (S20), a subtitle character string generation step (S30), and a keyword generation step (S30). S40), a search step (S50) and a summary video generation step (S60) can be included. Here, the video service providing method according to the embodiment of the present invention can be executed by the service server.

以下では、図７を参照して、本発明の一実施形態による動画サービス提供方法について説明する。 Hereinafter, a method of providing a moving image service according to an embodiment of the present invention will be described with reference to FIG. 7.

単位区間分離ステップ（Ｓ１０）では、動画内に含まれる音声の特性変化を基準に動画を複数の単位区間に分離することができる。ここで、音声の特性変化は音量または音質の変化を含むことができる。具体的には、音声の特性変化を用いて動画内の話者の発話が中断される停止区間を抽出することができ、停止区間を編集点に設定して動画を分離することができる。例えば、停止区間を、音量が設定値未満に減少し、設定値未満に減少した音量が基準時間以上維持される区間に設定することができる。すなわち、文脈などを考慮するとき、動画内の話者が話しを止めるまでを１つの区間に設定することができ、このために、単位区間の分離時に音量の変化量を用いることができる。 In the unit interval separation step (S10), the moving image can be separated into a plurality of unit sections based on the change in the characteristics of the sound included in the moving image. Here, changes in audio characteristics can include changes in volume or sound quality. Specifically, it is possible to extract a stop section in which the speaker's utterance in the moving image is interrupted by using the change in the characteristics of the voice, and to set the stop section as an editing point to separate the moving image. For example, the stop section can be set to a section in which the volume is reduced below the set value and the volume reduced below the set value is maintained for a reference time or longer. That is, when considering the context and the like, it is possible to set one section until the speaker in the moving image stops talking, and for this purpose, the amount of change in volume can be used when the unit intervals are separated.
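
By way of non-limiting illustration, the separation of a video into unit intervals at pauses in speech could be sketched as follows. The input is assumed to be a sequence of per-frame volume values; `threshold` stands for the set value and `min_frames` for the reference time expressed in frames. Excluding the pause itself from the resulting unit intervals is a choice made for this sketch.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open frame range [start, end)


def find_pauses(volumes: Sequence[float], threshold: float, min_frames: int) -> List[Span]:
    """Ranges where the volume stays below `threshold` for at least `min_frames` frames."""
    pauses: List[Span] = []
    start = None
    for i, volume in enumerate(volumes):
        if volume < threshold:
            if start is None:
                start = i  # a quiet run begins
        else:
            if start is not None and i - start >= min_frames:
                pauses.append((start, i))
            start = None
    if start is not None and len(volumes) - start >= min_frames:
        pauses.append((start, len(volumes)))  # quiet run reaching the end
    return pauses


def split_into_units(volumes: Sequence[float], threshold: float, min_frames: int) -> List[Span]:
    """Use each pause as an edit point and return the spans between pauses."""
    units: List[Span] = []
    cursor = 0
    for pause_start, pause_end in find_pauses(volumes, threshold, min_frames):
        if pause_start > cursor:
            units.append((cursor, pause_start))
        cursor = pause_end
    if cursor < len(volumes):
        units.append((cursor, len(volumes)))
    return units


# The single quiet frame at index 8 is too short to count as a pause.
track = [5, 6, 5, 0, 0, 0, 7, 8, 0, 6, 0, 0, 0, 0]
units = split_into_units(track, threshold=1, min_frames=3)
```

Requiring the low volume to persist for `min_frames` prevents brief gaps between words from splitting a single utterance into separate unit intervals.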

スクリプト文字列生成ステップ（Ｓ２０）では、単位区間に含まれる音声を認識して、音声に対応するスクリプト文字列を生成することができる。動画を複数の単位区間に分離した後には、各々の単位区間内に含まれる内容を認識する必要がある。このために、話者が発話した音声を認識し、それを文字に変換し、変換された文字を結合してスクリプト文字列に生成することができる。 In the script character string generation step (S20), the voice included in the unit interval can be recognized and the script character string corresponding to the voice can be generated. After separating the video into multiple unit intervals, it is necessary to recognize the content contained in each unit interval. For this purpose, it is possible to recognize the voice spoken by the speaker, convert it into characters, and combine the converted characters to generate a script character string.

実施形態によっては、音声認識装置が備えられていてもよく、音声認識装置を用いて音声を文字に変換することができる。例えば、単位区間に含まれる音声を電気的信号である音声パターンで表すことができ、音声モデルデータベースなどに各々の文字に対応する標準音声パターンが格納されていてもよい。この場合、音声認識装置は、入力される音声パターンを音声モデルデータベースに格納された標準音声パターンと比較することができ、各々の音声パターンに対応する標準音声パターンを抽出することができる。その後、抽出した標準音声パターンを対応する文字に変換することができ、変換された文字を結合してスクリプト文字列を生成することができる。 Depending on the embodiment, a voice recognition device may be provided and used to convert the voice into characters. For example, the voice included in a unit interval can be represented as a voice pattern, which is an electrical signal, and a standard voice pattern corresponding to each character may be stored in a voice model database or the like. In this case, the voice recognition device can compare each input voice pattern with the standard voice patterns stored in the voice model database and extract the standard voice pattern that corresponds to it. Each extracted standard voice pattern can then be converted into its corresponding character, and the converted characters can be combined to generate the script character string.

字幕文字列生成ステップ（Ｓ３０）では、単位区間に含まれる字幕イメージを認識して、字幕イメージに対応する字幕文字列を生成することができる。字幕イメージには動画の内容が要約されて表示されるため、字幕イメージに含まれる文字を認識する必要がある。但し、字幕イメージは文字でなく形状に認識されるため、字幕イメージに含まれる文字を認識するためには、文字認識アルゴリズムなどを適用する必要がある。ここで、字幕文字列生成ステップ（Ｓ３０）はスクリプト文字列生成ステップ（Ｓ２０）と同時に実行されることができるが、これに限定されるものではない。 In the subtitle character string generation step (S30), the subtitle image included in the unit interval can be recognized and the subtitle character string corresponding to the subtitle image can be generated. Since the content of the video is summarized and displayed in the subtitle image, it is necessary to recognize the characters contained in the subtitle image. However, since the subtitle image is recognized not by characters but by shape, it is necessary to apply a character recognition algorithm or the like in order to recognize the characters included in the subtitle image. Here, the subtitle character string generation step (S30) can be executed at the same time as the script character string generation step (S20), but the present invention is not limited to this.

実施形態によっては、別の文字認識装置が備えられていてもよく、文字認識装置を用いて字幕イメージを文字に変換することができる。例えば、単位区間に含まれる字幕イメージをスキャンして字幕イメージに対するピクセル値の分布を電気的信号である形状パターンで表すことができ、文字モデルデータベースなどに各々の文字に対応する標準形状パターンが格納されていてもよい。この場合、文字認識装置は、入力される形状パターンを文字モデルデータベースに格納された標準形状パターンと比較することができ、各々の形状パターンに対応する標準形状パターンを抽出することができる。その後、抽出した標準形状パターンに対応する文字に各々変換して字幕文字列を生成することができる。 Depending on the embodiment, a separate character recognition device may be provided and used to convert the subtitle image into characters. For example, a subtitle image included in a unit interval can be scanned so that the distribution of its pixel values is represented as a shape pattern, which is an electrical signal, and a standard shape pattern corresponding to each character may be stored in a character model database or the like. In this case, the character recognition device can compare each input shape pattern with the standard shape patterns stored in the character model database and extract the standard shape pattern that corresponds to it. Each extracted standard shape pattern can then be converted into its corresponding character to generate the subtitle character string.

一方、字幕イメージから字幕文字列を抽出するためには、単位区間内での字幕イメージの存在有無と、字幕イメージの動画フレーム内の位置を判別する必要がある。すなわち、字幕文字列を生成する前に、まず、単位区間内の字幕イメージを含む動画フレームを検出し、動画フレーム内に含まれる字幕イメージの位置を特定することができる。具体的には、字幕文字列生成ステップ（Ｓ３０）では、単位区間に含まれる動画フレーム内に複数のランドマークを設定し、ランドマークにおいて色相または輝度を測定する方式で字幕イメージを検出することができる。また、字幕イメージの位置は、ランドマークを動画フレーム上に均一に分布させた後、字幕イメージに対応する基準色相または基準輝度が測定されたランドマークを抽出して特定することができる。 On the other hand, in order to extract the subtitle character string from the subtitle image, it is necessary to determine whether a subtitle image is present in the unit interval and where it is located in the moving image frame. That is, before generating the subtitle character string, the moving image frames in the unit interval that include a subtitle image can first be detected, and the position of the subtitle image within those frames can be specified. Specifically, in the subtitle character string generation step (S30), a plurality of landmarks can be set in each moving image frame included in the unit interval, and the subtitle image can be detected by measuring the hue or brightness at the landmarks. Further, the position of the subtitle image can be specified by distributing the landmarks uniformly over the moving image frame and then extracting the landmarks at which the reference hue or reference brightness corresponding to the subtitle image was measured.

In the keyword generation step (S40), natural language processing can be applied to the script character string and the subtitle character string to generate keywords corresponding to the unit interval. That is, rather than a user checking the content of each unit interval and then setting keywords or annotations accordingly, keywords based on the meaning of each unit interval can be set automatically. Various methods can be used for the natural language processing applied to the script character string and the subtitle character string, and depending on the embodiment, machine learning techniques such as word2vec and LDA can be applied.

According to one embodiment, in the keyword generation step (S40), a word2vec model can be built by word embedding using word2vec, and words extracted from the subtitle character string or the script character string can be set as input words to the word2vec model in order to extract related words corresponding to the input words. The extracted related words can then be set as keywords for that unit interval.

Here, the keyword generation step (S40) can be restricted so that the related words are compared with the input word and only related words with a high similarity are set as keywords. Specifically, the similarity between the input word vector corresponding to the input word given to the word2vec model and the related word vector corresponding to each related word is calculated, and only the related words with a high similarity are extracted and set as keywords.

Through word embedding, each word is vectorized and distributed in a vector space, and words that were similar or related to each other in the training samples are placed at adjacent positions in that space. Therefore, the relationship between the input word and a related word can be determined by calculating the similarity between the input word vector and the related word vector. Here, the similarity between vectors can be calculated using cosine similarity.
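
As a non-limiting illustration, the cosine similarity between an input word vector and a related word vector can be computed as follows. The function name `cosine_similarity` and the convention of returning 0.0 for a zero-length vector are assumptions made for the example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two word vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)
```

The value ranges from -1 to 1, and vectors pointing in nearly the same direction, which correspond to words placed close together by the embedding, score close to 1.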

Specifically, related word vectors whose similarity to the input word vector is equal to or greater than a predetermined value can be extracted, and the related words corresponding to the extracted vectors can be set as keywords. That is, only related words whose vectors have a similarity equal to or greater than the predetermined value are set as keywords. Alternatively, depending on the embodiment, a preset number of related word vectors can be extracted in descending order of similarity to the input word vector, and the related words corresponding to those vectors can be set as keywords. For example, the 10 related word vectors with the highest similarity can be extracted, and the corresponding 10 related words can be set as keywords.
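
As a non-limiting illustration, the two selection rules described above (a similarity threshold and a preset number of top-ranked words) can be sketched as follows. The sketch assumes that the similarities have already been computed and are supplied as a mapping from related word to score; the function names are hypothetical.

```python
from typing import Dict, List


def keywords_by_threshold(similarities: Dict[str, float],
                          min_similarity: float) -> List[str]:
    """Keep every related word whose similarity is at least `min_similarity`."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked if score >= min_similarity]


def keywords_top_n(similarities: Dict[str, float], n: int) -> List[str]:
    """Keep the `n` related words with the highest similarity."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]
```

The threshold rule yields a variable number of keywords per unit interval, whereas the top-n rule always yields at most `n`, which may be preferable when a fixed-size keyword list is needed.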

Further, in the keyword generation step (S40), an embodiment in which keywords are set using real-time search term information is also possible. For example, among the related words extracted from the word2vec model, those that correspond to search terms included in the real-time search term information can be identified, and a weight can be added to these related words when the similarity is calculated. That is, even when its similarity is relatively low, a related word that corresponds to the real-time search term information can be set as a keyword by virtue of the weight. In this case, the weights given to the related words can differ from one another according to the real-time search ranking of the corresponding search terms.
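
As a non-limiting illustration, rank-dependent weighting can be sketched as follows. The embodiment does not specify how the weight is computed or combined with the similarity; the sketch assumes an additive weight that decreases linearly with the real-time search rank. The names `rank_weight` and `weighted_scores` and the parameters `max_boost` and `list_size` are hypothetical.

```python
from typing import Dict, List


def rank_weight(rank: int, max_boost: float = 0.3, list_size: int = 10) -> float:
    """Assumed weighting: rank 1 receives `max_boost`, decreasing linearly
    down to rank `list_size`; ranks outside the list receive no weight."""
    if not 1 <= rank <= list_size:
        return 0.0
    return max_boost * (list_size - rank + 1) / list_size


def weighted_scores(similarities: Dict[str, float],
                    realtime_terms: List[str]) -> Dict[str, float]:
    """Add a rank-dependent weight to related words that match a real-time
    search term. `realtime_terms` is ordered from rank 1 downward."""
    ranks = {term: i + 1 for i, term in enumerate(realtime_terms)}
    return {word: score + rank_weight(ranks[word]) if word in ranks else score
            for word, score in similarities.items()}
```

With this scheme a related word that matches a highly ranked search term can overtake a word with a higher raw similarity, which is the effect described above.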

Meanwhile, depending on the embodiment, keywords can also be set using LDA in the keyword generation step (S40). That is, the script character string and the subtitle character string can be applied to a machine learning model trained using LDA to extract topic words corresponding to the unit interval, and the extracted topic words can then be set as keywords for that unit interval. Since setting keywords with a machine learning model trained using LDA has already been described above, the details are omitted here.

Further, depending on the embodiment, keywords for the entire video can also be generated in the keyword generation step (S40). That is, natural language processing can be applied to the keywords set for each unit interval included in the video to generate keywords corresponding to the video as a whole. Here, machine learning techniques such as word2vec and LDA can be used for the natural language processing.

In the search step (S50), unit intervals corresponding to a keyword input by the user can be searched for, and the retrieved unit intervals can be provided to the user. Because keywords are set for each unit interval, a unit interval containing specific content can be found and provided to the user. In addition, because the search is performed on the individual unit intervals separated from the video, only the unit intervals the user wants need be provided. As a result, user convenience in the video service can be significantly improved.
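
As a non-limiting illustration, keyword search over unit intervals can be sketched with an inverted index from keyword to unit interval. The embodiment does not prescribe a data structure; the `UnitInterval` record, its field names, and the functions `build_index` and `search` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class UnitInterval:
    # Hypothetical record for one unit interval separated from a video.
    video_id: str
    start_sec: float
    end_sec: float
    keywords: FrozenSet[str]


def build_index(intervals: Iterable[UnitInterval]) -> Dict[str, List[UnitInterval]]:
    """Map each lower-cased keyword to the unit intervals tagged with it."""
    index: Dict[str, List[UnitInterval]] = defaultdict(list)
    for interval in intervals:
        for keyword in interval.keywords:
            index[keyword.lower()].append(interval)
    return index


def search(index: Dict[str, List[UnitInterval]], query: str) -> List[UnitInterval]:
    """Return the unit intervals whose keywords include the user's query."""
    return list(index.get(query.strip().lower(), []))
```

Because each result carries its own start and end time, the service can return only the matching portion of a video rather than the whole video.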

In the summary video generation step (S60), the unit intervals of a single video that correspond to a reference keyword can be extracted and combined to generate a summary video for that video. Here, the reference keyword may be preset by an administrator or input by the user. That is, a summary video can be easily generated and provided to the user without any separate editing work on the video.
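
As a non-limiting illustration, the selection and ordering of unit intervals for a summary video can be sketched as follows. The sketch assumes that each unit interval is given as a (start, end, keywords) tuple and produces only a time-ordered cut list; actually concatenating the video segments would require a media tool and is outside the sketch. The names `summary_cut_list` and `summary_duration` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: a unit interval is (start_sec, end_sec, keywords) within one video.
Interval = Tuple[float, float, Set[str]]


def summary_cut_list(intervals: Iterable[Interval],
                     reference_keyword: str) -> List[Tuple[float, float]]:
    """Select the unit intervals tagged with the reference keyword and
    return their (start, end) times in playback order."""
    selected = [(start, end) for start, end, keywords in intervals
                if reference_keyword in keywords]
    return sorted(selected)


def summary_duration(cut_list: List[Tuple[float, float]]) -> float:
    """Total running time of the summary video, in seconds."""
    return sum(end - start for start, end in cut_list)
```

Sorting by start time keeps the summary in the same order as the source video, which is one reasonable choice when the embodiment leaves the ordering open.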

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium may store a computer-executable program permanently, or store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and other devices configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software. Therefore, the above detailed description should not be construed as restrictive in any respect and should be regarded as illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present invention.

The present invention is not limited by the embodiments described above or by the accompanying drawings. It will be apparent to a person having ordinary knowledge in the technical field to which the present invention belongs that the components of the present invention can be replaced, modified, and changed without departing from the technical idea of the present invention.

1 ... Terminal device
10 ... Processor
20 ... Memory control unit
30 ... Peripheral interface unit
40 ... Memory
100 ... Service server
110 ... Unit interval separation unit
120 ... Script character string generation unit
130 ... Subtitle character string generation unit
140 ... Keyword generation unit
150 ... Search unit
160 ... Summary video generation unit

Claims

A video service providing method in which a service server provides a video to a terminal device, the method comprising:
a unit interval separation step of separating the video into a plurality of unit intervals based on a change in characteristics of the voice contained in the video;
a script character string generation step of recognizing the voice included in the unit interval and generating a script character string corresponding to the voice;
a subtitle character string generation step of recognizing a subtitle image included in the unit interval and generating a subtitle character string corresponding to the subtitle image, wherein a plurality of landmarks are set in a video frame included in the unit interval and the subtitle image is detected using the hue or luminance measured at the landmarks; and
a keyword generation step of applying natural language processing to the script character string and the subtitle character string to generate a keyword corresponding to the unit interval.

The video service providing method according to claim 1, wherein the unit interval separation step
uses the change in characteristics of the voice to extract a stop section in which the speech of a speaker in the video is interrupted, and separates the video by setting the stop section as an editing point.

The video service providing method according to claim 2, wherein the unit interval separation step
determines a section to be the stop section when the volume of the voice falls below a set value and remains below the set value for a reference time or longer.

The video service providing method according to claim 1, wherein the script character string generation step
uses a speech recognition device to convert voice patterns extracted from the voice into corresponding characters, and generates the script character string by combining the converted characters.

The video service providing method according to claim 1, wherein the subtitle character string generation step
uses a character recognition device to convert shape patterns extracted from the subtitle image into corresponding characters, and generates the subtitle character string by combining the converted characters.

The video service providing method according to claim 1, wherein the subtitle character string generation step
extracts, from the plurality of landmarks uniformly distributed over the video frame, the landmarks at which a reference hue or reference luminance corresponding to the subtitle image is measured, and identifies the position of the subtitle image using the extracted landmarks.

The video service providing method according to claim 1, wherein the keyword generation step
inputs an input word extracted from the subtitle character string or the script character string into a word2vec model obtained by word embedding using word2vec, extracts a related word corresponding to the input word, and sets the related word as the keyword.

The video service providing method according to claim 7, wherein the keyword generation step comprises:
a step of calculating the similarity between an input word vector corresponding to the input word input to the word2vec model and a related word vector corresponding to the related word; and
a step of extracting related word vectors whose similarity is equal to or greater than a predetermined value, or a preset number of related word vectors selected in descending order of similarity, and setting the related words corresponding to the extracted related word vectors as the keyword.

The video service providing method according to claim 8, wherein the step of calculating the similarity
extracts the related words corresponding to search terms included in real-time search term information, and adds a weight to the extracted related words when calculating the similarity.

The video service providing method according to claim 9, wherein the keyword generation step
assigns mutually different weights to the related words corresponding to the search terms, according to the real-time search ranking of the search terms.

The video service providing method according to claim 1, wherein the keyword generation step
applies the script character string and the subtitle character string to a machine learning model trained using LDA (Latent Dirichlet Allocation) to extract a topic word corresponding to the unit interval, and sets the topic word as the keyword.

The video service providing method according to claim 1, wherein the keyword generation step
applies natural language processing to the keywords corresponding to the unit intervals to generate a keyword corresponding to the video.

The video service providing method according to claim 1, further comprising a search step of searching for the unit interval corresponding to a keyword input by a user and providing the retrieved unit interval to the user.

The video service providing method according to claim 1, further comprising a summary video generation step of extracting, for the same video, the unit intervals corresponding to a reference keyword and combining the extracted unit intervals to generate a summary video.

A computer program that is combined with hardware to cause the hardware to execute the video service providing method according to any one of claims 1 to 14.

A service server comprising:
a unit interval separation unit that separates a video into a plurality of unit intervals based on a change in characteristics of the voice contained in the video;
a script character string generation unit that recognizes the voice included in the unit interval and generates a script character string corresponding to the voice;
a subtitle character string generation unit that recognizes a subtitle image included in the unit interval and generates a subtitle character string corresponding to the subtitle image, wherein the subtitle character string generation unit sets a plurality of landmarks in a video frame included in the unit interval and detects the subtitle image using the hue or luminance measured at the landmarks; and
a keyword generation unit that applies natural language processing to the script character string and the subtitle character string to generate a keyword corresponding to the unit interval.

A service server comprising a processor and a memory connected to the processor, wherein
the memory comprises one or more modules configured to be executed by the processor, and
the one or more modules comprise instructions for:
separating a video into a plurality of unit intervals based on a change in characteristics of the voice contained in the video;
recognizing the voice included in the unit interval and generating a script character string corresponding to the voice;
setting a plurality of landmarks in a video frame included in the unit interval, detecting a subtitle image using the hue or luminance measured at the landmarks, recognizing the subtitle image included in the unit interval, and generating a subtitle character string corresponding to the subtitle image; and
applying natural language processing to the script character string and the subtitle character string to generate a keyword corresponding to the unit interval.