JP2018180914A - Content search device, content search method, program, and data structure

Info

Publication number: JP2018180914A
Application number: JP2017079222A
Authority: JP
Inventors: Hiroko Muto; Akira Kitahara; Takahito Kawanishi; Osamu Yoshioka
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-04-12
Filing date: 2017-04-12
Publication date: 2018-11-15
Anticipated expiration: 2037-04-12
Also published as: JP6530002B2

Abstract

PROBLEM TO BE SOLVED: To enable accurate content search. SOLUTION: A content search device 1 comprises: a collation parameter setting unit 23 configured to set, based on a feature or an editing technique of candidate content that is a candidate for the content, a collation parameter used when collating the candidate content with collation source content; and a content DL/collation unit 24 configured to download the candidate content from an acquisition source of the candidate content and collate the downloaded content with the collation source content using the collation parameter set by the collation parameter setting unit. SELECTED DRAWING: Figure 1

Description

The present invention relates to a content search device, a content search method, a program, and a data structure.

In recent years, with the development of communication network technology and advances in communication terminal devices such as smartphones and personal computers (PCs), many content distribution sites have been established to and from which content such as videos can be uploaded and downloaded. Because users can easily upload content to this type of site, the number of items uploaded is increasing year by year. A problem on such sites is that content is uploaded by non-rights holders, that is, people who do not hold the rights to the content. A method for searching for illegal content uploaded by non-rights holders is therefore needed.

Various methods are known for searching for content on a content distribution site to which a large amount of content has been uploaded. For example, Non-Patent Document 1 describes a method that uses an inter-word relationship dictionary defining causal relationships, hypernym-hyponym relationships, attribute relationships, and the like between words to obtain a set of words related to a query entered by a user, and then executes a search using as queries not only the query entered by the user but also the words obtained as related to it.

Non-Patent Document 2 describes a method that uses an inter-word relationship dictionary defining causal relationships, hypernym-hyponym relationships, attribute relationships, and the like between words to evaluate the similarity between the summary text of content selected by a user and the summary text of other content, and presents highly similar content as content related to the content selected by the user.

Non-Patent Document 3 describes a method for collecting opinions on content. After a tweet containing a content name registered on Twitter (registered trademark) is posted, the method collects tweets containing the same content name that are posted within a predetermined time (an adjacent tweet group), extracts words with a high co-occurrence frequency within the adjacent tweet group as related words, and searches for tweets using the related words as queries.

Non-Patent Document 4 describes a method for finding harmful sites among the many sites accessible via a communication network. In this method, character strings contained in the HTML (Hyper Text Markup Language) of harmful sites are statistically learned with an SVM (Support Vector Machine), and a site whose HTML contains character strings extracted on the basis of that learning is judged to be a harmful site.

Non-Patent Document 1: Taro Miyazaki and 6 others, "TV program search using an inter-word relation dictionary," Proceedings of the 22nd Annual Meeting of the Association for Natural Language Processing, March 2016, pp. 917-920.

Non-Patent Document 2: Ichiro Yamada and 4 others, "Program similarity evaluation using random walk," Information Processing Society of Japan research report, Vol. 2012-ML-207, No. 12, July 27, 2012.

Non-Patent Document 3: Masami Nakazawa and 3 others, "Proposal of a program-related tweet collection method considering program viewers and topic changes within a program," Proceedings of the 2013 IEICE General Conference, Information and Systems 1, March 19, 2013.

Non-Patent Document 4: Kazufumi Ikeda and 5 others, "Harmful site detection method based on HTML elements," Journal of the Information Processing Society of Japan, Vol. 52, No. 8, pp. 2474-2483.

Searching with related queries as in the prior art described above extracts a large amount of content, which raises the likelihood that the content a user wants is among the results. However, when uploading illegal content, a poster may edit the content illegally obtained from the collation source in order to lower the accuracy of automatic collation and avoid discovery by the rights holder who holds the legitimate rights to the content. In addition, the enormous amount of content posted to content distribution sites spans a wide variety of genres. The appropriate settings for collation processing differ depending on the editing method and the genre, and if appropriate settings are not made, content may not be searched for accurately.

In view of the above, an object of the present invention is to provide an illegal content search device, an illegal content search method, a program, and a data structure that make it possible to search for illegal content accurately.

To solve the above problem, a content search device according to the present invention is a content search device that searches for content, and comprises: a collation parameter setting unit that sets, based on a feature or an editing technique of candidate content that is a candidate for the content, a collation parameter used when collating the candidate content with collation source content; and a content DL/collation unit that downloads the candidate content from an acquisition source of the candidate content and collates the downloaded content with the collation source content using the collation parameter set by the collation parameter setting unit.

A content search method according to the present invention is a content search method executed by a content search device that searches for content, and includes: a step of setting, based on a feature or an editing technique of candidate content that is a candidate for the content, a collation parameter used when collating the candidate content with collation source content; and a step of downloading the candidate content from an acquisition source of the candidate content and collating the downloaded content with the collation source content using the collation parameter that has been set.

To solve the above problem, a program according to the present invention causes a computer to function as the content search device described above.

To solve the above problem, a data structure of setting parameters according to the present invention is a data structure of setting parameters used in a content search device that searches for content belonging to one or more genres. The data structure includes a genre and a frame length of candidate content that is a candidate for the content, or an editing technique and a collation technique for the candidate content. It is used in processing in which the content search device identifies the frame length corresponding to the genre of the candidate content, or the collation technique corresponding to the editing technique, downloads the candidate content from an acquisition source of the candidate content, and collates the downloaded candidate content with collation source content either using the frame length as the unit of collation or on the basis of the collation technique.
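As a rough illustration of the setting-parameter data structure described above, the following Python sketch maps a genre to a frame length and an editing technique to a collation technique. The class name, field names, table entries, and default values are all hypothetical; the source does not give concrete values or technique names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CollationSettings:
    """Hypothetical container for the setting parameters."""
    # Genre -> frame length (seconds) used as the unit of collation.
    frame_length_by_genre: Dict[str, float] = field(default_factory=dict)
    # Editing technique -> collation technique to apply.
    method_by_editing: Dict[str, str] = field(default_factory=dict)
    # Fallbacks used when no entry matches (assumed, not from the source).
    default_frame_length: float = 10.0
    default_method: str = "audio_and_video"

    def resolve(self, genre: Optional[str] = None,
                editing: Optional[str] = None) -> dict:
        """Return the collation parameters for one candidate content item."""
        return {
            "frame_length_sec": self.frame_length_by_genre.get(
                genre, self.default_frame_length),
            "method": self.method_by_editing.get(editing, self.default_method),
        }


# Illustrative values only.
settings = CollationSettings(
    frame_length_by_genre={"drama": 30.0, "music": 5.0},
    method_by_editing={"image_edited": "audio_only",
                       "audio_edited": "video_only"},
)
params = settings.resolve(genre="music", editing="image_edited")
```

Keeping both tables in one object lets the lookup by genre and the lookup by editing technique be used independently, which mirrors the "or" in the description.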

According to the present invention, collation processing can be performed appropriately in accordance with the characteristics of content indicated by the profile of that content. A user can therefore accurately search for content that matches the collation source content.

FIG. 1 is a functional block diagram showing a configuration example of the illegal content search device according to the present embodiment.
FIG. 2 is a diagram showing an example of search query generation rules and search queries generated according to those rules.
FIG. 3 is a diagram showing another example of search query generation rules and search queries generated according to those rules.
FIG. 4 is a diagram showing an example of a content title and profile.
FIG. 5 is a flowchart showing an example of the content search method according to the present embodiment.

First, the functional configuration of the present embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a functional block diagram of an illegal content search device 1 according to the present embodiment. The illegal content search device 1 of the present embodiment searches for illegal content posted by non-rights holders, but it is not limited to illegal content and may be a content search device that searches for any content matching the collation source content.

As shown in FIG. 1, the illegal content search device 1 includes an illegal phrase model generation unit 11, an illegal phrase model storage unit 12, a search query generation rule storage unit 13, a search query generation unit 14, a collation candidate acquisition unit 15, a profile estimation model generation unit 16, a profile estimation model storage unit 17, a content profile acquisition/estimation unit 18, an exception content removal unit 19, a collation priority calculation unit 20, a collation parameter setting unit 23, a content DL (download)/collation unit 24, an illegal phrase model update unit 25, and a profile estimation model update unit 26.

The illegal content search device 1 searches for illegal content stored at content acquisition sources on the Internet, based on the regular title and meta information of the collation source content. Illegal content is content posted by a non-rights holder. Meta information is attribute information attached to the collation source (regular) content and includes, for example, the subtitle, the performers appearing in the content, character names, the broadcast episode number, the broadcast date and time, abbreviations, and the content genre.

A content acquisition source is a site on the Internet to which content is posted (for example, a content posting site or a site that compiles URLs (Uniform Resource Locators) of illegal content). The content acquisition source accepts content posted from a communication terminal at the request of a posting user and stores the posted content. At the request of a user, the content acquisition source also lets a communication terminal download the stored content. The content acquisition source is, for example, a server device that manages a content posting site or the like, a distributed system composed of multiple servers, or a cloud service. "Posting" means uploading content and having it stored. A "posting user" is a user who posts content, among the users of a content acquisition source.

The illegal phrase model generation unit 11 generates an illegal phrase model by machine learning that uses, as training data, content titles labeled as illegal or non-illegal. The illegal phrase model is a model that, for an arbitrary phrase, outputs illegal phrases expected to be used for illegal content.

The illegal phrase model storage unit 12 stores the illegal phrase model generated by the illegal phrase model generation unit 11.

The search query generation rule storage unit 13 stores search query generation rules, which are rules for generating, from the regular title of the collation source content (regular content), search queries used to search a content acquisition source for illegal content. As described above, in order to avoid discovery by the rights holder while still letting users recognize the connection with the regular content, the title of illegal content may be a version of the regular title in which all or part has been converted (paraphrased) into similar wording, for example by kana conversion, kanji conversion, romanization, or abbreviation. A search query generation rule is a rule that generates, from a phrase contained in a content title, the phrases obtained by such paraphrasing. Such paraphrases can be output using a language processing technique such as Word2Vec, for example. A search query generation rule may also be a rule that generates phrases containing orthographic variants of a phrase contained in the content title. The search query generation rule storage unit 13 may generate phrases based on search query generation rules whose tendencies differ by content genre (for video: drama, animation, movie, and so on).

A search query generation rule is, for example, a rule that converts an arbitrary phrase into kana, romaji, or kanji. A search query generation rule may also be a rule that translates an arbitrary phrase into a foreign language, or a rule that converts an arbitrary phrase into an orthographic variant of that phrase.

The search query generation unit 14 receives, based on operations by an operator of the illegal content search device 1, the collation source content together with the regular title and meta information of that content.

Using the illegal phrase model stored in the illegal phrase model storage unit 12 and the search query generation rules stored in the search query generation rule storage unit 13, the search query generation unit 14 generates search queries containing illegal phrases related to the regular title.

Specifically, based on the regular title, the search query generation unit 14 generates search queries containing illegal phrases for which the probability value calculated from the illegal phrase model is at or above a threshold. The search query generation unit 14 also generates search queries containing differently written phrases that have the same or similar meaning as phrases contained in the regular title. For example, it generates search queries by applying kana conversion, kanji conversion, or romaji conversion to phrases contained in the regular title in accordance with the search query generation rules described above.
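The threshold step can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary `phrase_probs` stands in for the output of the illegal phrase model, and the function name, the example phrases, their probabilities, and the threshold of 0.5 are all assumptions.

```python
from typing import Dict, List


def build_illegal_phrase_queries(regular_title: str,
                                 phrase_probs: Dict[str, float],
                                 threshold: float = 0.5) -> List[str]:
    """Combine the title with each phrase whose probability meets the threshold."""
    kept = [p for p, prob in phrase_probs.items() if prob >= threshold]
    # Highest-probability phrases first; ties broken alphabetically.
    kept.sort(key=lambda p: (-phrase_probs[p], p))
    return [f"{regular_title} {phrase}" for phrase in kept]


# Hypothetical model output for the title "Trio".
probs = {"full episode": 0.91, "free": 0.74, "review": 0.12}
queries = build_illegal_phrase_queries("Trio", probs, threshold=0.5)
```

With these illustrative inputs, "review" falls below the threshold and is not turned into a query.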

The search query generation unit 14 may also use the input regular title itself as a search query.

The search query generation unit 14 can also generate search queries that contain meta information. Using the search query generation rules, it generates search queries containing one or more of the regular title and the meta information, for example "title subtitle", "title date", "title broadcast episode number", "performer", and "abbreviation date". In the example shown in FIG. 3, when the regular title is "Tuesday Drama 'Trio'" and the search query generation rule is episode-number notation (1), the search query generation unit 14 generates the search query "Trio episode 1".
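One way to picture the combinations listed above is as a small set of templates over the title and meta information fields. The sketch below is only an illustration under assumed names: the template list, the field keys, and the sample date are hypothetical and do not come from the source.

```python
from typing import Dict, List, Tuple

# Each template lists the fields joined into one query (hypothetical names).
QUERY_TEMPLATES: List[Tuple[str, ...]] = [
    ("title", "subtitle"),
    ("title", "date"),
    ("title", "episode"),
    ("performer",),
    ("abbreviation", "date"),
]


def build_meta_queries(fields: Dict[str, str]) -> List[str]:
    """Build one query per template whose fields are all present and non-empty."""
    queries = []
    for template in QUERY_TEMPLATES:
        if all(fields.get(name) for name in template):
            queries.append(" ".join(fields[name] for name in template))
    return queries


# Sample input; the date is made up for illustration.
meta = {"title": "Trio", "episode": "episode 1", "date": "2017-04-11"}
meta_queries = build_meta_queries(meta)
```

Templates whose fields are missing (here the subtitle, performer, and abbreviation) are simply skipped, so the same rule set can be applied to content with incomplete meta information.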

Using the illegal phrase model or the search query generation rules, the search query generation unit 14 can also generate, as search queries, phrases that paraphrase the regular title in accordance with tendencies that depend on the content genre (for video: drama, animation, movie, and so on).

The search query generation unit 14 outputs the search queries generated as described above to the collation candidate acquisition unit 15.

Based on the search queries output by the search query generation unit 14, the collation candidate acquisition unit 15 searches content acquisition sources on the communication network, treats content matching a search query as candidate content that may be illegal content, and acquires identification information of the candidate content from the content acquisition source. The identification information is information for uniquely identifying the candidate content, for example the address of the content on the Internet, that is, its URL. The collation candidate acquisition unit 15 also acquires from the content acquisition source the title of the content matching the search query and the profile attached to the content (the accompanying profile). The profile acquired by the collation candidate acquisition unit 15 includes the content length, posting time, posting user name, and the like shown in FIG. 4.

Specifically, the collation candidate acquisition unit 15 uses the search query output from the search query generation unit 14 to have the content acquisition source extract, from its content group, one or more items of content whose titles contain the search query. The collation candidate acquisition unit 15 then takes a fixed number of the items extracted by the content acquisition source as candidate content and acquires the identification information, title, and accompanying profile of each. For example, among the content extracted as having a high degree of match with the search query, it acquires the identification information, title, and accompanying profile of each of a fixed number of top-ranked items.

Some content acquisition sources have a function that extracts not only the content found by a search using the search query but also one or more items of related content that are highly related to that content (for example, content viewed by the same user demographic), and presents the title, thumbnail, accompanying profile, and so on of the extracted related content. In such cases, the collation candidate acquisition unit 15 may treat the related content extracted by the content acquisition source as candidate content and acquire its identification information, title, and accompanying profile. At this time, the collation candidate acquisition unit 15 may also acquire a degree of relevance indicating how strongly the related content is related to the content extracted by the search query.

The collation candidate acquisition unit 15 may also have the content acquisition source extract a fixed number of top-ranked items among the content newly stored there (newly arrived content). In this case, the collation candidate acquisition unit 15 further treats the newly arrived content as candidate content and acquires its identification information, title, and accompanying profile. Newly arrived content is, for example, content posted to the content acquisition source between a predetermined time before a reference point and the reference point itself, where the reference point is the time at which the collation candidate acquisition unit 15 acquires content. This allows the collation candidate acquisition unit 15 to find illegal content without omission even when it is among content that cannot be obtained from search query results and related content alone.
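The time window for newly arrived content can be illustrated with a short filter. This is a sketch under assumptions: the function name, the 24-hour window, the limit of two items, and the example URLs and timestamps are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def newly_arrived(posts: List[Tuple[str, datetime]], reference: datetime,
                  window: timedelta, limit: int) -> List[str]:
    """Return up to `limit` URLs posted within `window` before `reference`, newest first."""
    start = reference - window
    recent = [(url, t) for url, t in posts if start <= t <= reference]
    recent.sort(key=lambda item: item[1], reverse=True)
    return [url for url, _ in recent[:limit]]


# Illustrative data: the reference point is the time of acquisition.
now = datetime(2017, 4, 12, 12, 0)
posts = [
    ("https://example.com/a", datetime(2017, 4, 12, 11, 30)),
    ("https://example.com/b", datetime(2017, 4, 10, 9, 0)),
    ("https://example.com/c", datetime(2017, 4, 12, 8, 0)),
]
fresh = newly_arrived(posts, now, timedelta(hours=24), limit=2)
```

Here the item posted two days before the reference point falls outside the window and is not treated as newly arrived.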

The number of candidate content items acquired by the collation candidate acquisition unit 15 is an appropriate number set in advance, in list form or the like, as a setting parameter that depends on the content acquisition source and the content. For example, when there is a content acquisition source to which illegal content of a particular genre (drama, animation, and so on) tends to be posted frequently, the collation candidate acquisition unit 15 acquires more candidate content of that genre than of other genres from that source. The collation candidate acquisition unit 15 may also acquire more content from posting users with a high rate of posting illegal content than from other posting users. This increases the likelihood that the candidate content includes illegal content.
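A per-source, per-genre candidate count of this kind could be held in a simple lookup table, as in the sketch below. The site names, genres, limits, default, and match scores are illustrative assumptions, and the patent does not prescribe this representation.

```python
from typing import Dict, List, Tuple

# (source, genre) -> number of candidates to take. Values are illustrative.
CANDIDATE_LIMITS: Dict[Tuple[str, str], int] = {
    ("site_a", "anime"): 3,
    ("site_a", "drama"): 1,
}
DEFAULT_LIMIT = 2  # assumed fallback when no entry exists


def select_candidates(source: str, genre: str,
                      results: List[Tuple[str, float]]) -> List[str]:
    """Keep the top-scoring results, up to the limit for this source and genre."""
    limit = CANDIDATE_LIMITS.get((source, genre), DEFAULT_LIMIT)
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [url for url, _ in ranked[:limit]]


# Hypothetical search results as (identifier, match score) pairs.
hits = [("u1", 0.4), ("u2", 0.9), ("u3", 0.7), ("u4", 0.1)]
anime_picks = select_candidates("site_a", "anime", hits)
drama_picks = select_candidates("site_a", "drama", hits)
```

In this example more candidates are taken for the genre assumed to attract more illegal postings on the same site.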

The collation candidate acquisition unit 15 outputs the identification information, title, and accompanying profile of the candidate content acquired from the content acquisition source to the content profile acquisition/estimation unit 18.

Based on the title of content and the accompanying profile attached to it, the profile estimation model generation unit 16 generates a profile estimation model that outputs a statistical profile, which is statistical information about the content. The profile estimation model is a model that represents the correspondence between titles and illegality and the correspondence between accompanying profiles and illegality.

Specifically, the profile estimation model generation unit 16 learns a title illegality likelihood, which indicates for each title the likelihood that posted content is illegal content, and generates as the profile estimation model a statistical model representing the correspondence between titles and title illegality likelihoods. It is desirable that the profile estimation model generation unit 16 generate the model by training a statistical model that can calculate the likelihood of classification into each class (SVM, naive Bayes, and so on). The classification may be either binary (illegal/non-illegal) or multi-class (multiple content titles versus non-illegal, and so on). Any known technique may be used to generate the profile estimation model. One example is described in "Introduction to Machine Learning for Language Processing" (supervised by Manabu Okumura, written by Hiroya Takamura, Corona Publishing, pp. 101-117). In that method, the training text is morphologically analyzed and the words it contains are extracted for learning, but the text can also be used for learning as a character string without morphological analysis.

The profile estimation model generation unit 16 also learns a posting user illegality likelihood, which indicates for each posting user the likelihood that posted content is illegal content, and generates as the profile estimation model a statistical model representing the correspondence between posting users and posting user illegality likelihoods. The posting user illegality likelihood is estimated from the illegality of content that each posting user has posted in the past. Specifically, the profile estimation model generation unit 16 creates the profile estimation model from illegality likelihood training data that associates feature values, such as the proportion of a posting user's past content whose titles have a high illegality likelihood and the deletion rate of the content itself, with the illegality likelihood of the posting user.
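The two feature values named above can be computed from a user's posting history as in the following sketch. The record layout, the key names, and the cutoff of 0.8 for a "high" title illegality likelihood are assumptions made for illustration; the source does not define them.

```python
from typing import Dict, List


def posting_user_features(posts: List[Dict],
                          title_threshold: float = 0.8) -> Dict[str, float]:
    """Compute per-user feature values from past posts.

    Each post is a dict with a 'title_likelihood' float and a 'deleted' bool
    (a hypothetical record layout).
    """
    total = len(posts)
    if total == 0:
        return {"high_likelihood_title_rate": 0.0, "deletion_rate": 0.0}
    high = sum(1 for p in posts if p["title_likelihood"] >= title_threshold)
    deleted = sum(1 for p in posts if p["deleted"])
    return {
        "high_likelihood_title_rate": high / total,
        "deletion_rate": deleted / total,
    }


# Illustrative posting history for one user.
history = [
    {"title_likelihood": 0.95, "deleted": True},
    {"title_likelihood": 0.85, "deleted": False},
    {"title_likelihood": 0.10, "deleted": False},
    {"title_likelihood": 0.90, "deleted": True},
]
features = posting_user_features(history)
```

Feature vectors like this, paired with known illegality likelihoods, would form the training data from which a statistical model is fitted; the fitting step itself is not shown here.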

The profile estimation model storage unit 17 stores the profile estimation model generated by the profile estimation model generation unit 16.

The content profile acquisition / estimation unit 18 acquires a statistical profile of the candidate content based on the accompanying profile output by the matching candidate acquisition unit 15. The statistical profile acquired by the content profile acquisition / estimation unit 18 consists of information about the content itself and information about the posting user. As shown in FIG. 4, the profile of the candidate content acquired by the content profile acquisition / estimation unit 18 includes, in addition to the content length, posting time, and posting user name described above, the title illegality probability, the posting-user illegality probability, the deletion rate of content posted by the posting user, the user type, the posting user's editing tendency type (image editing present), and the posting user's editing tendency type (audio editing present). Of these, the title illegality probability and the posting-user illegality probability, which are estimated using the profile estimation model as described in detail later, are referred to as the statistical profile.

The user type indicates whether the user who posted the content is a legitimate user who created (or holds the rights to) the matching-source content described above. The user type is determined based on, for example, an authorized user list created in advance. The editing tendency type indicates the presence or absence of editing applied to the content, for example cut editing, time stretching, or special processing such as PinP (picture-in-picture). The editing tendency type is determined based on a list of editing tendency types created in advance for at least some of the posting users.

具体的には、コンテンツプロフィール取得・推定部１８は、照合候補取得部１５によって出力された付随プロフィールに含まれるコンテンツ長及び投稿時刻を候補コンテンツのコンテンツ長及び投稿時刻として取得する。 Specifically, the content profile acquisition / estimation unit 18 acquires the content length and posting time included in the accompanying profile output by the matching candidate acquisition unit 15 as the content length and posting time of the candidate content.

The content profile acquisition / estimation unit 18 also estimates the user type of the posting user named in the accompanying profile output from the matching candidate acquisition unit 15, using an authorized user list stored in memory in advance. Specifically, when the posting user is included in the authorized user list, the content profile acquisition / estimation unit 18 estimates that the user type of the posting user is legitimate. When the posting user is not included in the authorized user list, it estimates that the user type of the posting user is non-legitimate.

The content profile acquisition / estimation unit 18 also estimates the editing tendency type of the posting user named in the accompanying profile output by the matching candidate acquisition unit 15, using an editing tendency type list. The editing tendency type list is stored in memory in advance and shows the correspondence between posting users and editing tendency types. An editing tendency type is the type of editing method most often applied to the content of that posting user. The types include, for example, the presence or absence of cut editing, time stretching, and special processing such as PinP. The content profile acquisition / estimation unit 18 acquires the estimated editing method as the editing tendency type.
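
Both list-based estimates described above amount to simple lookups. The sketch below assumes the lists are held as an in-memory set and dictionary; the user names, the `None` result for users without a list entry, and the function names are hypothetical.

```python
# Hypothetical lists stored in memory in advance.
AUTHORIZED_USERS = {"official_channel"}
EDITING_TENDENCIES = {
    "uploader42": {"image_editing": True, "audio_editing": False},
}


def estimate_user_type(posting_user):
    """Return 'legitimate' if the user is in the authorized user list."""
    return "legitimate" if posting_user in AUTHORIZED_USERS else "non-legitimate"


def estimate_editing_tendency(posting_user):
    """Return the user's editing tendency type, or None if the list has no entry."""
    return EDITING_TENDENCIES.get(posting_user)
```

Because the editing tendency list covers only some posting users, callers need to handle the case where no tendency is known.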

The content profile acquisition / estimation unit 18 also estimates the title illegality probability from the title of the candidate content, or the posting-user illegality probability from the posting user name, both output by the matching candidate acquisition unit 15, using the profile estimation model stored in the profile estimation model storage unit 17. In the following description, the accompanying profile and statistical profile acquired or estimated by the content profile acquisition / estimation unit 18, together with the profiles it estimates using the various lists, may be referred to simply as the "profile".

さらに、コンテンツプロフィール取得・推定部１８は、上述のように取得又は推定したプロフィールを識別情報及びタイトルとともに例外コンテンツ除去部１９に出力する。 Furthermore, the content profile acquisition / estimation unit 18 outputs the profile acquired or estimated as described above to the exceptional content removal unit 19 together with the identification information and the title.

The exceptional content removing unit 19 determines, based on the profile output from the content profile acquisition / estimation unit 18, exceptional content to be removed from the candidates for illegal content. Specifically, when the profile acquired by the content profile acquisition / estimation unit 18 satisfies a predetermined condition, the exceptional content removing unit 19 removes the candidate content corresponding to that profile as exceptional content. The predetermined condition may be, for example, that the user type in the profile is legitimate, that the posting time in the accompanying profile is earlier than the publication time of the matching-source content, or that the content length in the accompanying profile is shorter than a predetermined length (for example, a few seconds). The predetermined condition is not limited to these and may be any condition indicating that the candidate content is unlikely to be illegal content.
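
The three example conditions can be expressed as a single predicate, as in the sketch below. The profile keys, the function name `is_exception`, and the 5-second default for the "predetermined length" are assumptions chosen for illustration.

```python
from datetime import datetime


def is_exception(profile, source_published_at, min_length_sec=5.0):
    """Return True if the candidate should be removed as exceptional content.

    profile: dict with 'user_type' (str), 'posted_at' (datetime), 'length_sec' (float).
    source_published_at: publication time of the matching-source content.
    """
    return (
        profile["user_type"] == "legitimate"          # posted by a rights holder
        or profile["posted_at"] < source_published_at  # posted before the source existed
        or profile["length_sec"] < min_length_sec      # too short to be a meaningful copy
    )


def remove_exceptions(profiles, source_published_at):
    """Keep only candidates that do not satisfy any exception condition."""
    return [p for p in profiles if not is_exception(p, source_published_at)]
```

Any other condition indicating that a candidate is unlikely to be illegal could be added to the predicate in the same way.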

例外コンテンツ除去部１９は、除去されなかった候補コンテンツの識別情報、タイトル、及びプロフィールを照合優先度計算部２０に出力する。 The exceptional content removing unit 19 outputs the identification information, the title, and the profile of the candidate content that has not been removed to the matching priority calculation unit 20.

照合優先度計算部２０は、例外コンテンツ除去部１９によって除去されなかった候補コンテンツの識別情報、タイトル、及びプロフィールに基づいて、後述する照合処理における優先度を計算する。 The matching priority calculation unit 20 calculates the priority in the matching process to be described later, based on the identification information, the title, and the profile of the candidate content not removed by the exceptional content removing unit 19.

具体的には、照合優先度計算部２０は、候補コンテンツの違法確度に基づいて、優先度計算モデルを用いて優先度を計算する。 Specifically, the matching priority calculation unit 20 calculates the priority using the priority calculation model based on the illegality probability of the candidate content.

First, the matching priority calculation unit 20 calculates the edit distance between the character string of the candidate content's title and the character string of the matching-source content's title. The edit distance is a measure of how different two character strings are: it is the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other. A smaller edit distance therefore indicates a closer relationship between the title of the candidate content and the title of the matching-source content. Instead of the title of the matching-source content, the matching priority calculation unit 20 may calculate the edit distance between the candidate content's title and another character string contained in the meta information of the matching-source content, such as the name of a character appearing in the content, a performer's name, or a subtitle.
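
The edit distance defined above is the Levenshtein distance, which can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep if equal)
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion). The function works on any Python strings, including Japanese titles.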

また、照合優先度計算部２０は、編集距離が所定の値より小さいタイトルに係る候補コンテンツの識別情報、タイトル、及びプロフィールを抽出する。 Further, the matching priority calculation unit 20 extracts identification information, a title, and a profile of candidate content related to a title whose editing distance is smaller than a predetermined value.

また、照合優先度計算部２０は、候補コンテンツのタイトルについての編集距離と、人物の名前についての編集距離との両方に基づいて関連性を判定してもよい。例えば、照合優先度計算部２０は、候補コンテンツのタイトルについての編集距離と、人物の名前についての編集距離とにそれぞれ重み付けしたうえで足し合わせたスコアを計算してもよい。この場合、照合優先度計算部２０は、所定の値より小さいスコアに係る候補コンテンツを抽出する。 Further, the matching priority calculation unit 20 may determine the relevancy based on both the editing distance for the title of the candidate content and the editing distance for the person's name. For example, the matching priority calculation unit 20 may calculate a score obtained by weighting each of the editing distance for the title of the candidate content and the editing distance for the name of the person and then adding them. In this case, the matching priority calculation unit 20 extracts candidate content related to a score smaller than a predetermined value.
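
The weighted combination described above can be sketched as follows. The weights (0.7 and 0.3), the threshold, and the dictionary keys are illustrative assumptions; the patent does not give concrete values. The edit distances are assumed to have been computed beforehand.

```python
def relevance_score(title_distance, name_distance, w_title=0.7, w_name=0.3):
    """Weighted sum of two edit distances; a smaller score means higher relevance."""
    return w_title * title_distance + w_name * name_distance


def extract_relevant(candidates, threshold):
    """Keep candidates whose score is below the threshold.

    candidates: list of dicts with 'id', 'title_distance', and 'name_distance'.
    """
    return [
        c for c in candidates
        if relevance_score(c["title_distance"], c["name_distance"]) < threshold
    ]
```

Because both inputs are distances, candidates are kept when the combined score is below the threshold, not above it.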

As described above, the matching candidate acquisition unit 15 acquires identification information for a wide range of candidate content, including newly posted content in addition to the content extracted by the search query and content related to it. The candidate content identified by this identification information may therefore include a large amount of content with little relevance to the matching-source content. By having the matching priority calculation unit 20 extract as matching targets only the candidate content with a small edit distance, that is, content expected to be highly relevant, the processing load of the matching described later can be reduced.

After extracting candidate content based on the edit distance, the matching priority calculation unit 20 determines the priority based on the title illegality probability of the extracted candidate content. The title illegality probability used here is the one included in the profile estimated by the content profile acquisition / estimation unit 18. The matching priority calculation unit 20 may instead determine the priority using the posting-user illegality probability of the extracted candidate content, in which case the posting-user illegality probability included in the profile estimated by the content profile acquisition / estimation unit 18 is used. The matching priority calculation unit 20 may also determine the priority based on both the title illegality probability and the posting-user illegality probability. For example, it can use as the priority the sum of the title illegality probability and the posting-user illegality probability after weighting each of them. The matching priority calculation unit 20 may also determine the priority from a combination of the previously calculated edit distance and each illegality probability.
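
The weighted-sum variant of the priority can be sketched as below. The equal weights of 0.5 and the dictionary keys are assumptions for illustration only.

```python
def priority(title_prob, user_prob, w_title=0.5, w_user=0.5):
    """Weighted sum of the title and posting-user illegality probabilities."""
    return w_title * title_prob + w_user * user_prob


def rank_candidates(candidates):
    """Return candidate ids in descending order of priority.

    candidates: list of dicts with 'id', 'title_prob', and 'user_prob'.
    """
    ranked = sorted(
        candidates,
        key=lambda c: priority(c["title_prob"], c["user_prob"]),
        reverse=True,
    )
    return [c["id"] for c in ranked]
```

Setting one weight to zero recovers the title-only or user-only variants described in the text.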

さらに、照合優先度計算部２０は、照合優先度計算部２０が計算した優先度を、該優先度に係る候補コンテンツの識別情報、タイトル、及びプロフィールとともにコンテンツＤＬ・照合部２４に出力する。 Further, the matching priority calculation unit 20 outputs the priority calculated by the matching priority calculation unit 20 to the content DL / matching unit 24 together with the identification information of the candidate content related to the priority, the title, and the profile.

照合パラメータ設定部２３は、予め記憶された設定パラメータのリストを用いて、候補コンテンツの特徴に基づいて、照合の処理で用いられる照合用パラメータを設定する。照合とは、候補コンテンツと照合元のコンテンツとが合致するか否かを判定することである。設定パラメータは、例えば、フレーム長、照合手法である。フレーム長は、照合処理における照合の基本単位となるフレームの長さである。照合手法には、音声によって照合を行う手法、画像によって照合を行う手法等が含まれる。 The collation parameter setting unit 23 sets a collation parameter to be used in the collation processing based on the feature of the candidate content, using the list of setting parameters stored in advance. The matching is to determine whether the candidate content matches the content of the matching source. The setting parameters are, for example, a frame length and a matching method. The frame length is a length of a frame which is a basic unit of matching in the matching process. The collation method includes a method of collating by voice, a method of collating by image, and the like.

Any known method can be used as the matching method. One example is described in "Media Fingerprinting Technology for Identifying Music and Video and Its Applications" (Takahito Kawanishi et al., The Japan Society for Industrial and Applied Mathematics, Applied Mathematics 21(4), pp. 289-292, December 22, 2011).

設定パラメータリストは、候補コンテンツのプロフィール又はプロフィールの組合せに対応して、適切な設定が記載されているリストである。設定パラメータリストで用いられる候補コンテンツのプロフィールは、照合の精度が確保される程度に必要とされるフレーム長を推定するためのものであって、例えば、ジャンルである。候補コンテンツのジャンルがスポーツのマッシュアップコンテンツである場合、該候補コンテンツは、数秒程度の短い動画を編集して構成される。このため、設定パラメータリストにおいて、例えば、スポーツのマッシュアップコンテンツというジャンルに対応して、短いフレーム長（例えば２秒から３秒程度）という設定が記載されている。これにより、コンテンツＤＬ・照合部２４が、設定された短いフレーム長で照合処理を行い、照合元のコンテンツに合致している候補コンテンツを検出することができる。 The setting parameter list is a list in which appropriate settings are described corresponding to profiles of candidate content or combinations of profiles. The profile of candidate content used in the setting parameter list is for estimating a frame length required to such an extent that the accuracy of matching is ensured, and is, for example, a genre. When the genre of the candidate content is sports mashup content, the candidate content is configured by editing a short moving image of about several seconds. Therefore, in the setting parameter list, for example, the setting of a short frame length (for example, about 2 seconds to 3 seconds) is described corresponding to the genre of mashup content of sports. As a result, the content DL / collation unit 24 can perform collation processing with the set short frame length, and can detect candidate contents matching the content of the collation source.

一方、候補コンテンツのジャンルがドラマや映画である場合、コンテンツ長は数十分から数時間程度の長さである。このため、設定パラメータリストにおいて、例えば、ドラマ又は映画というジャンルに対応して、長いフレーム長（例えば５分程度）という設定が記載されている。これにより、コンテンツＤＬ・照合部２４は、設定された長いフレーム長で照合処理を行い、照合元のコンテンツに合致している候補コンテンツを正確に検出することができる。 On the other hand, when the genre of the candidate content is a drama or a movie, the content length is from several tens minutes to several hours. For this reason, in the setting parameter list, for example, setting of a long frame length (for example, about 5 minutes) is described corresponding to a genre such as drama or movie. As a result, the content DL / collation unit 24 can perform collation processing with the set long frame length, and can accurately detect candidate contents matching the content of the collation source.

The candidate content profile used in the setting parameter list may also be, for example, an editing method. The editing method is the method of editing applied to the content and includes, for example, PinP, in which one image is embedded within another, and time stretching. The collation parameter setting unit 23 may take the posting user's editing tendency type acquired by the content profile acquisition / estimation unit 18 to be the editing method of the candidate content.

For example, when the editing method of the candidate content is PinP and an image similar to that of the matching-source content is embedded within a different image, the candidate content as a whole image is recognized as different from the matching-source content. Image-based matching is therefore unlikely to judge the candidate content as closely matching the matching-source content. For this reason, the setting parameter list specifies, for example, audio-based matching for the PinP editing method. The content DL / collation unit 24 can then perform audio-based matching and accurately detect candidate content that matches the matching-source content.

Similarly, when the editing method of the candidate content is time stretching, the features extracted from the candidate content's audio differ greatly from those of the audio before editing. When the candidate content is a time-stretched version of legitimate content, audio-based matching is therefore unlikely to judge the candidate content as closely matching the matching-source content. For this reason, the setting parameter list specifies, for example, image-based matching for the time-stretching editing method. The content DL / collation unit 24 can then perform image-based matching and accurately detect candidate content that matches the matching-source content.
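
The setting parameter list described above behaves like a lookup table from profile values to matching parameters. The sketch below uses the example values given in the text (a frame length of roughly 2 to 3 seconds for sports mashups, about 5 minutes for dramas and movies, audio matching for PinP, image matching for time stretching). The key strings, the `MatchParams` type, and the default values used when a profile has no entry are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchParams:
    frame_length_sec: float  # length of the basic unit of matching
    method: str              # "audio" or "image"


# Example entries taken from the description; keys are hypothetical labels.
FRAME_LENGTH_BY_GENRE = {"sports_mashup": 2.5, "drama": 300.0, "movie": 300.0}
METHOD_BY_EDITING = {"pinp": "audio", "time_stretch": "image"}

# Assumed fallback when the list has no matching entry (not specified in the source).
DEFAULT = MatchParams(frame_length_sec=30.0, method="audio")


def set_match_params(genre=None, editing_method=None):
    """Look up the frame length by genre and the matching method by editing method."""
    return MatchParams(
        frame_length_sec=FRAME_LENGTH_BY_GENRE.get(genre, DEFAULT.frame_length_sec),
        method=METHOD_BY_EDITING.get(editing_method, DEFAULT.method),
    )
```

A real list could also be keyed on combinations of profile values, as the description allows, for example by using tuples as dictionary keys.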

The content DL / collation unit 24 downloads candidate content from the content acquisition source in descending order of the priority calculated by the matching priority calculation unit 20. It then matches each downloaded candidate content against the matching-source content according to the settings made by the collation parameter setting unit 23, thereby determining whether the candidate content matches the matching-source content. The content DL / collation unit 24 treats candidate content that matches the matching-source content as illegal content and outputs the identification information of that illegal content.

When content is long, the content DL / collation unit 24 can search for illegal content efficiently by downloading candidate content in descending order of priority and matching it in the order in which it is downloaded.

The content DL / collation unit 24 may also download long candidate content (for example, video content lasting tens of minutes to several hours) and, in parallel, begin matching from the portion already downloaded. In this case, when the candidate content is found to match the matching-source content, the content DL / collation unit 24 stops downloading the remainder of the candidate content. It treats the matched candidate content as illegal content and outputs the identification information of that illegal content. The content DL / collation unit 24 then downloads and matches the candidate content with the next-highest priority. This shortens the time needed to match a single candidate content, that is, it increases the number of candidate contents matched per unit time.
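
The early-stop behavior described above can be sketched with a lazy iterator standing in for the download. In this simplified version, downloading and matching are interleaved in a single thread instead of running truly in parallel; the function name, the `chunks` iterator, and the `matches_source` callback are all hypothetical.

```python
def download_and_match(chunks, matches_source):
    """Match a candidate while it is still being downloaded.

    chunks: iterator that yields downloaded segments one at a time
            (pulling the next item triggers the next piece of the download).
    matches_source: callable taking the list of segments received so far and
                    returning True once they match the matching-source content.
    Returns (matched, number_of_segments_downloaded).
    """
    received = []
    for chunk in chunks:
        received.append(chunk)
        if matches_source(received):
            # Stop here: the rest of the iterator is never consumed,
            # so the remainder of the content is not downloaded.
            return True, len(received)
    return False, len(received)
```

Because the iterator is consumed lazily, returning early leaves the remaining segments unrequested, which corresponds to cancelling the rest of the download.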

また、コンテンツＤＬ・照合部２４は、違法コンテンツのタイトルを、違法を示すラベルとともに違法語句モデル更新部２５に出力する。また、コンテンツＤＬ・照合部２４は、違法コンテンツのプロフィールを、違法を示すラベルとともにプロフィール推定モデル更新部２６に出力する。 Further, the content DL / collation unit 24 outputs the title of the illegal content to the illegal phrase model updating unit 25 together with the label indicating the illegality. Further, the content DL / collation unit 24 outputs the profile of the illegal content to the profile estimation model updating unit 26 together with the label indicating the illegality.

違法語句モデル更新部２５は、コンテンツＤＬ・照合部２４から出力された違法コンテンツのタイトルに基づいて違法語句モデルを更新する。具体的には、違法語句モデル更新部２５は、違法コンテンツのタイトルを新たな学習データとした機械学習により、違法語句モデル生成部１１に違法語句モデルを更新させる。これにより、違法語句モデルの精度が高まることが期待される。 The illegal phrase model updating unit 25 updates the illegal phrase model based on the title of the illegal content output from the content DL / matching unit 24. Specifically, the illegal phrase model updating unit 25 causes the illegal phrase model generation unit 11 to update the illegal phrase model by machine learning with the title of the illegal content as new learning data. This is expected to increase the accuracy of the illegal word model.

プロフィール推定モデル更新部２６は、コンテンツＤＬ・照合部２４から出力された違法コンテンツのプロフィールに基づいてプロフィール推定モデルを更新する。具体的には、プロフィール推定モデル更新部２６は、違法コンテンツのプロフィールを新たな学習データとした機械学習により、プロフィール推定モデル生成部１６にプロフィール推定モデルを更新させる。これにより、プロフィール推定モデルの精度が高まることが期待される。 The profile estimation model updating unit 26 updates the profile estimation model based on the profile of the illegal content output from the content DL / collation unit 24. Specifically, the profile estimation model updating unit 26 causes the profile estimation model generation unit 16 to update the profile estimation model by machine learning with the profile of the illegal content as new learning data. This is expected to increase the accuracy of the profile estimation model.

続いて、本実施形態における違法コンテンツ探索装置１が実行するコンテンツ探索方法について図５に示すフローチャートを参照して説明する。図５は、コンテンツ探索方法の一例を示すフローチャートである。 Subsequently, a content search method executed by the illegal content search device 1 according to the present embodiment will be described with reference to the flowchart shown in FIG. FIG. 5 is a flowchart showing an example of the content search method.

まず、検索クエリ生成部１４は、違法コンテンツ探索装置１のオペレータの操作に基づいて照合元のコンテンツ、タイトル、及びメタ情報を入力する（ステップＳ１）。 First, the search query generation unit 14 inputs the content of the collation source, the title, and the meta information based on the operation of the operator of the illegal content search device 1 (step S1).

ステップＳ１で照合元のコンテンツ、タイトル及びメタ情報が入力されると、検索クエリ生成部１４は、違法語句モデル、検索クエリ生成規則を用いて検索クエリを生成する（ステップＳ２）。 When the content to be collated, the title, and the meta information are input in step S1, the search query generation unit 14 generates a search query using an illegal phrase model and a search query generation rule (step S2).

When the search query is generated in step S2, the matching candidate acquisition unit 15 has the content acquisition source extract candidate content corresponding to the search query, and acquires the identification information, title, and accompanying profile of the extracted candidate content (step S3).

ステップＳ３で識別情報、タイトル、及び付随プロフィールが取得されると、コンテンツプロフィール取得・推定部１８は、取得された付随プロフィールに基づいて、候補コンテンツのプロフィールをさらに取得又は推定する（ステップＳ４）。 When the identification information, the title, and the incidental profile are acquired in step S3, the content profile acquisition / estimation unit 18 further acquires or estimates the profile of the candidate content based on the acquired incidental profile (step S4).

ステップＳ４でプロフィールが取得又は推定されると、例外コンテンツ除去部１９は、取得されたプロフィールに基づいて、該プロフィールが所定の条件を満たす候補コンテンツを違法コンテンツの候補から除去する（ステップＳ５）。 When the profile is acquired or estimated in step S4, the exceptional content removing unit 19 removes candidate content that satisfies the predetermined condition from the candidate for illegal content based on the acquired profile (step S5).

When the identification information of the exceptional content is removed in step S5, the matching priority calculation unit 20 calculates the priority of each candidate content based on the title and profile of the candidate content identified by each piece of identification information not removed by the exceptional content removing unit 19 (step S6).

When the priority of each candidate content is calculated in step S6, the content DL / collation unit 24 downloads the candidate content from the content acquisition source in descending order of priority and matches the downloaded candidate content against the matching-source content input in step S1 (step S7).
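
Steps S5 to S7 of the flow above can be summarized as a small pipeline. This is only a structural sketch: the three callbacks stand in for the exceptional content removing unit 19, the matching priority calculation unit 20, and the content DL / collation unit 24, and their names and signatures are assumptions.

```python
def search_pipeline(candidates, is_exception, priority, matches_source):
    """Run steps S5 to S7 on already-profiled candidates.

    candidates: list of dicts, each with at least an 'id' key.
    is_exception: callable(candidate) -> bool      (step S5)
    priority: callable(candidate) -> float         (step S6)
    matches_source: callable(candidate) -> bool    (step S7, download and match)
    Returns the ids of candidates judged to match, in the order they were checked.
    """
    remaining = [c for c in candidates if not is_exception(c)]   # S5: remove exceptions
    ordered = sorted(remaining, key=priority, reverse=True)      # S6: order by priority
    return [c["id"] for c in ordered if matches_source(c)]       # S7: match in that order
```

Steps S1 to S4 (query generation, candidate acquisition, and profile estimation) are assumed to have produced the `candidates` list.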

A computer can suitably be used to function as the illegal content search device 1 described above. Such a computer can be realized by storing, in a database of the computer, a program describing the processing that implements each function of the illegal content search device 1, and having the CPU of the computer read and execute the program.

The program may also be recorded on a computer-readable medium, from which it can be installed on a computer. The computer-readable medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited and may be, for example, a recording medium such as a CD-ROM or a DVD-ROM.

As described above, according to the present embodiment, the illegal content search device 1 uses a list of setting parameters stored in advance to set, based on the features or editing method of the candidate content, the matching parameters used when matching the candidate content against the matching-source content. The illegal content search device 1 then matches the downloaded content against the matching-source content using the matching parameters set by the collation parameter setting unit. As a result, even when the candidate content is an edited version of legitimate content, matching can be based on an element (for example, image or audio) that the editing does not change. In addition, accurate matching is possible by setting the unit of matching appropriately for the length of the candidate content. The illegal content search device 1 can therefore search for illegal content accurately.

As noted above, the illegal content search device 1 of the present embodiment is not limited to illegal content and may be a content search device that searches for content matching the matching-source content. In that case, the illegality probability described above is read as the probability of matching the matching-source content.

上述の実施形態は代表的な例として説明したが、本発明の趣旨及び範囲内で、多くの変更及び置換ができることは当業者に明らかである。したがって、本発明は、上述の実施形態によって制限するものと解するべきではなく、特許請求の範囲から逸脱することなく、種々の変形や変更が可能である。 Although the embodiments described above have been described as representative examples, it will be obvious to those skilled in the art that many modifications and substitutions can be made within the spirit and scope of the present invention. Therefore, the present invention should not be construed as being limited by the above-described embodiments, and various modifications and changes are possible without departing from the scope of the claims.

1 Illegal content search device
11 Illegal phrase model generation unit
12 Illegal phrase model storage unit
13 Search query generation rule storage unit
14 Search query generation unit
15 Matching candidate acquisition unit
16 Profile estimation model generation unit
17 Profile estimation model storage unit
18 Content profile acquisition / estimation unit
19 Exceptional content removing unit
20 Matching priority calculation unit
23 Collation parameter setting unit
24 Content DL / collation unit
25 Illegal phrase model updating unit
26 Profile estimation model updating unit

Claims

A content search device for searching for content, comprising:
a collation parameter setting unit configured to set, based on a feature or an editing technique of candidate content that is a candidate for the content, a collation parameter used when collating the candidate content with collation-source content; and
a content DL/collation unit configured to download the candidate content from an acquisition source of the candidate content and to collate the downloaded content with the collation-source content using the collation parameter set by the collation parameter setting unit.

The content search device according to claim 1,
wherein the content DL/collation unit collates the candidate content with the collation-source content in order of a priority that depends on the likelihood that the candidate content matches the collation-source content, or on the edit distance between the title of the candidate content and the title of the collation-source content.
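The title-based ordering in this claim can be illustrated with a standard Levenshtein edit distance. The sketch below assumes that a smaller distance means a higher priority, which the claim does not state explicitly; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def order_by_title_distance(source_title: str,
                            candidate_titles: list[str]) -> list[str]:
    """Return candidate titles sorted so the closest title comes first.

    Assumption: a smaller edit distance means a higher collation priority.
    """
    return sorted(candidate_titles,
                  key=lambda title: edit_distance(title, source_title))
```

Because `sorted` is stable, candidates with equal distance keep their original relative order, so a secondary criterion such as the match likelihood could be applied beforehand.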

The content search device according to claim 1 or 2,
wherein the content DL/collation unit collates the candidate content with the collation-source content sequentially, starting from the portion already downloaded and without waiting for the download of the candidate content to complete, and stops downloading the candidate content once it determines that the candidate content matches the collation-source content.
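The behaviour in this claim, collating while the download is still in progress and stopping early on a match, can be sketched with a lazy iterator standing in for the download. The decision rule (`required_hits` matching chunks) and the `chunk_matches` callback are assumptions; the claim does not specify how the match is determined.

```python
from typing import Callable, Iterable


def collate_while_downloading(
    chunks: Iterable[bytes],
    chunk_matches: Callable[[bytes], bool],
    required_hits: int = 3,
) -> bool:
    """Collate each chunk as it arrives and stop fetching once a match is decided.

    `chunks` is assumed to be lazy (for example a generator reading from a
    network stream), so pulling the next item models downloading it.
    """
    hits = 0
    for chunk in chunks:
        if chunk_matches(chunk):
            hits += 1
            if hits >= required_hits:
                # Returning here leaves the rest of the iterator unread,
                # which corresponds to stopping the download.
                return True
    return False
```

With a generator-backed download, no further chunks are requested after the function returns, so the remaining part of the candidate content is never transferred.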

A content search method executed by a content search device for searching for content, the method comprising:
setting, based on a feature or an editing technique of candidate content that is a candidate for the content, a collation parameter used when collating the candidate content with collation-source content; and
downloading the candidate content from an acquisition source of the candidate content, and collating the downloaded content with the collation-source content using the set collation parameter.

A program for causing a computer to function as the content search device according to any one of claims 1 to 3.

A data structure of setting parameters used in a content search device for searching for content belonging to one or more genres, the data structure comprising:
a genre of candidate content that is a candidate for the content, together with a frame length, or an editing technique of the candidate content, together with a collation method,
wherein the data structure is used in processing in which the content search device specifies the frame length corresponding to the genre of the candidate content or the collation method corresponding to the editing technique, downloads the candidate content from an acquisition source of the candidate content, and collates the downloaded candidate content with collation-source content using the frame length as the unit of collation or based on the collation method.