JP5462546B2

JP5462546B2 - Content detection support apparatus, content detection support method, and content detection support program

Info

Publication number: JP5462546B2
Application number: JP2009183305A
Authority: JP
Inventors: 昭典藤野; 昌明永田; 早苗藤田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-08-06
Filing date: 2009-08-06
Publication date: 2014-04-02
Anticipated expiration: 2029-08-06
Also published as: JP2011039575A

Description

The present invention relates to a content detection support apparatus, method, and program for use when content containing information that satisfies a specific criterion must be detected manually and without omission from a content group. Such groups include text content posted to Internet community sites such as bulletin boards, social networking services (SNS), and blogs; text content such as papers, patents, and other documents held in databases, online news data, e-mail, and Web pages; and non-text content such as images and links. The detection cost is reduced by mechanically extracting part of the data in the content group as the data that should be checked manually.

When content containing information that satisfies a specific criterion has to be detected from a large accumulated content group, a common way to reduce the human cost of detection is to use pattern matching to cut down the amount of content that must be checked. For example, consider the task of detecting, among text documents and Web pages posted to Internet community sites such as bulletin boards, SNS, and blogs, highly illegal documents (concerning crime, drugs, and the like) and highly harmful documents (containing obscene expressions, malicious solicitation, and the like). A list is prepared in which words that often appear in illegal or harmful documents are recorded as NG words, and documents containing those NG words are mechanically extracted from the content group, which reduces the number of documents that must be checked manually.

Conventionally, as an information filtering apparatus on a network that restricts access to information inappropriate for a user and extracts only appropriate information, the apparatus described in Patent Document 1, for example, has been proposed.

The techniques used in the present invention are disclosed in Patent Document 2 and Non-Patent Documents 1 to 9.

Patent Document 1: JP 2002-14991 A
Patent Document 2: JP 2006-338263 A

Non-Patent Document 1: R. Collobert, F. Sinz, J. Weston, and L. Bottou. "Large scale transductive SVMs". Journal of Machine Learning Research, 2006. Vol. 7, pp. 1687-1712
Non-Patent Document 2: Y. Grandvalet and Y. Bengio. "Semi-supervised learning by entropy minimization". In Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005. pp. 529-536
Non-Patent Document 3: J. Lafferty, A. McCallum, and F. Pereira. "Conditional random fields: Probabilistic models for segmenting and labeling sequence data". In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pp. 282-289
Non-Patent Document 4: K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. "Text classification from labeled and unlabeled documents using EM". Machine Learning, 2000. Vol. 39, pp. 103-134
Non-Patent Document 5: J. Suzuki and H. Isozaki. "Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data". In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL-2008), pp. 665-673
Non-Patent Document 6: H. Taira, S. Fujita, and M. Nagata. "A Japanese predicate argument structure analysis using decision lists". In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pp. 522-531
Non-Patent Document 7: Japanese syntactic analysis system KNP, Internet <URL: http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp.html>. [Retrieved July 28, 2009]
Non-Patent Document 8: Japanese dependency analyzer, Internet <URL: http://chasen.org/~taku/software/cabocha>. [Retrieved July 28, 2009]
Non-Patent Document 9: Open-source morphological analysis engine, Internet <URL: http://mecab.sourceforge.net/>. [Retrieved July 28, 2009]

The conventional pattern-matching method mechanically detects content that contains a preset pattern such as an NG word. For tasks in which the preset patterns alone are enough to decide whether content contains information satisfying the specific criterion, this method detects the relevant content effectively.

However, in a task such as detecting documents containing illegal or harmful information among documents posted to Internet sites, words are subject to word-sense ambiguity: the meaning of a word differs with context. A document containing a word on the NG word list therefore does not necessarily contain illegal or harmful information.

Consequently, the pattern-matching method risks falsely detecting large numbers of harmless documents, and the NG words must be narrowed down if only highly illegal or harmful documents are to be detected mechanically.

Meanwhile, senders of illegal or harmful information create new evasions every day to avoid being caught, such as masked spellings like 「覚○剤」 (for 覚醒剤, "stimulant drug") and slang like 「レンコン」 ("lotus root", meaning a handgun). To detect illegal or harmful documents containing such masked spellings and slang by pattern matching, these words must be added to the NG word list.

As a result, the task of detecting documents containing illegal or harmful information faces a dilemma: increasing the NG words causes many harmless documents to be falsely detected because of word-sense ambiguity, whereas reducing them makes it impossible to cover all of the masked spellings and slang created every day.

In the task of detecting illegal or harmful documents described above, detecting such documents without omission is emphasized in order to guarantee the soundness of the site. When the documents to be checked manually are narrowed down mechanically, it is therefore necessary to extract every document that may contain illegal or harmful information, for example by using an NG word list expanded to include masked spellings and slang.

However, this approach makes the amount of text to be checked manually enormous. In addition, because words that are not on the NG word list are not searched for, new masked spellings and slang cannot be handled. These problems are not limited to detecting illegal or harmful documents. They arise in every task of detecting, without omission, content that contains information satisfying a specific criterion when the content is expressed in a medium in which a one-to-one relationship between patterns (the constituent elements of the content) and meanings does not always hold.

The present invention solves the above problems. Its object is to provide a content detection support apparatus, method, and program that reduce the human cost of checking by reducing the amount of data to be checked manually, and that suppress missed detection of content containing novel patterns such as new masked spellings and slang.

The content detection support apparatus according to the present invention, made to achieve the above object, supports the manual judgment of whether content contains information satisfying a specific criterion. It mechanically extracts from the content those pattern regions whose degree of matching the criterion is relatively high and presents them to the operator performing detection, thereby reducing the amount of data the operator must check. Pattern regions are extracted using a determination rule that is created mechanically from the contents of a content set that was judged manually in the past as to whether it contains information satisfying the criterion. The determination rule is used to estimate the degree to which each pattern in the target content matches the criterion, and pattern regions containing many patterns with a high degree of match are mechanically extracted from the content.

The content detection support apparatus according to claim 1 of the present invention extracts, from a content group including text information, partial regions of content that include information satisfying a predetermined criterion. It comprises: feature extraction means that divides the text in each piece of content into predetermined units and extracts a feature for each pattern, a pattern being one of the divided units; determination rule generation means that generates, from content for which it is known whether it includes information satisfying the predetermined criterion, a determination rule for judging whether each pattern included in content satisfies the predetermined criterion; pattern determination means that applies the determination rule to each pattern, using the per-pattern features extracted by the feature extraction means, to determine whether each pattern satisfies the predetermined criterion; and pattern region extraction means that extracts from the content a partial region containing many patterns determined by the pattern determination means to satisfy the predetermined criterion. The feature extraction means extracts a feature vector defined by adding, to the feature of the pattern to be determined, the features of several patterns before and after that pattern.

(1) According to the inventions of claims 1 to 11, in the task of manually detecting, without omission, content containing information that satisfies a specific criterion, the number of pieces of content to be checked is not reduced by mechanically extracting the content itself. Instead, some of the pattern regions in each piece of content are mechanically extracted as the data to be checked. This reduces the amount of data to be checked manually without the risk of excluding from the detection targets content that ought to be detected.

In the content-extraction approach, mechanically misjudging content that should be detected and excluding it from the detection targets leads directly to missed detection. To prevent missed detection, everything other than content that can clearly be determined not to contain information satisfying the criterion must be checked manually. In the pattern region extraction performed by the present invention, by contrast, missed detection of content containing information satisfying the criterion is prevented as long as at least one of the pattern regions from which the presence of such information can be judged is extracted, from among the multiple pattern regions in each piece of content, as a pattern region to be checked manually. Failing to extract other pattern regions in the content does not lead directly to missed detection. Compared with the content-extraction approach, the present invention therefore reduces the amount of data the operator must check at a low risk of missed detection.
(2) In addition, because a feature vector that also includes the features of several patterns before and after the pattern to be determined (for example, several words before and after a word) is used to determine whether the target word contains the predetermined information, detection can cope with the new words, masked spellings, and slang created daily in Web content.
(3) According to the inventions of claims 4, 5, 9, and 10, when the same content has multiple detection target regions (for example, regions containing harmful information), only one of those regions can be selected and output. This reduces the amount of information to be checked manually and strengthens the burden-reducing effect.

Functional block diagram showing the configuration of the content detection support apparatus according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiments. FIG. 1 is an example of a functional block diagram showing the configuration of the content detection support apparatus 1 of this embodiment.

As shown in FIG. 1, the content detection support apparatus 1 of this embodiment comprises: an input unit 2, which serves as the interface for inputting content to be judged as to whether it contains information satisfying a specific criterion; a feature extraction unit 3, serving as feature extraction means, which extracts the features of each pattern in the input content; a pattern determination unit 4, serving as pattern determination means, which determines whether each pattern in the content matches the criterion; a pattern region extraction unit 5, serving as pattern region extraction means, which extracts from the content pattern regions containing patterns determined to match the criterion; and an output unit 6, which serves as the interface for displaying pattern regions on screen and saving the operator's judgment results.

The content detection support apparatus 1 further comprises a training data DB (database) 7, which stores a training data set generated by collecting examples of content in the same format as the content to be detected, and a determination rule generation unit 8, serving as determination rule generation means, which uses the training data in the training data DB 7 to generate the determination rule that the pattern determination unit 4 uses to judge each pattern.

The functions of the units of the content detection support apparatus 1 are realized by, for example, a computer.

Here, a pattern is a unit into which text is divided. When the content is text data such as a document, characters, symbols, words, idioms, phrases, and the like serve as patterns. A pattern region is a region that groups multiple patterns; for example, clauses, sentences, and paragraphs serve as pattern regions. Below, implementation examples of each element of the content detection support apparatus 1 are described using the task of detecting documents containing harmful information from a document group, with the content being a document, the pattern a word, and the pattern region a paragraph.

The content detection support apparatus 1 uses the feature extraction unit 3 to extract features for each word in a document input through the input unit 2. The feature extraction unit 3 extracts the features of each word using existing language analyzers. Examples of such features include named-entity tags and part-of-speech information assigned to each word by the named-entity extractor or part-of-speech analyzer described in Non-Patent Document 5, dependency relations between words estimated by a dependency analyzer (such as KNP (Non-Patent Document 7) or Cabocha (Non-Patent Document 8)), and structural information estimated by the argument structure analyzer described in Non-Patent Document 6.

In addition, expressions containing special characters or pictograms, such as the circle (○) and spaces used in masked spellings or the symbols used in ASCII art, are often used with a meaning different from the original one, so the types of characters that make up a word may also be added as features.
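The character-type feature described above could be sketched as follows. This is only an illustration under stated assumptions: the class names (`hiragana`, `katakana`, `kanji`, `digit`, `letter`, `space`, `symbol`) and the function names `char_type` and `char_type_features` are hypothetical and do not come from the patent, which does not specify how character types are encoded.

```python
import unicodedata


def char_type(ch):
    """Return a coarse character class for one character.

    The class names are hypothetical; the patent only says that the
    types of characters in a word may be used as features.
    """
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs block
        return "kanji"
    category = unicodedata.category(ch)
    if category.startswith("N"):   # numbers, including full-width digits
        return "digit"
    if category.startswith("L"):   # remaining letters (e.g. Latin)
        return "letter"
    if category.startswith("Z"):   # separators such as spaces
        return "space"
    return "symbol"                # everything else, e.g. the circle in a masked spelling


def char_type_features(word):
    """Return the sorted list of character classes that occur in a word."""
    return sorted({char_type(ch) for ch in word})


masked = char_type_features("覚○剤")    # masked spelling: kanji mixed with a symbol
plain = char_type_features("れんこん")  # ordinary hiragana word
```

A word whose feature list mixes `kanji` with `symbol`, as the masked spelling does here, carries a signal that an exact-match NG word list cannot see.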

Furthermore, to make determinations that reflect context, the feature vector of each word may be defined by adding the features of several preceding and following words, and each word may be judged using those feature vectors. For example, suppose that the features of the words in the sentence 「れんこんを１５万円で売ります。」 ("[I] will sell a lotus root for 150,000 yen.") are:
「れんこん」 (lotus root): a
「を」 (object particle): b
「１５万円」 (150,000 yen): c
「で」 (particle, "for"): d
「売り」 (sell): e
「ます」 (polite auxiliary): f
「。」 (full stop): g
When each word is judged taking the two preceding and two following words into account, the feature vectors used to judge the words are defined as:
「れんこん」: x₁ = (0, 0, a, b, c)
「を」: x₂ = (0, a, b, c, d)
「１５万円」: x₃ = (a, b, c, d, e)
「で」: x₄ = (b, c, d, e, f)
「売り」: x₅ = (c, d, e, f, g)
「ます」: x₆ = (d, e, f, g, 0)
「。」: x₇ = (e, f, g, 0, 0)
For Japanese documents, an existing morphological analyzer (such as MeCab (Non-Patent Document 9)) can be used to split each document into words.
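The windowed feature vectors in the example above can be reproduced with a short sketch. The function name `context_vectors` and its parameters are hypothetical; the per-word features a to g are kept as opaque placeholders, as in the text, and positions outside the sentence are padded with 0.

```python
def context_vectors(features, window=2, pad=0):
    """Build one context vector per word.

    Each vector holds the features of the `window` preceding words,
    the word itself, and the `window` following words, with `pad`
    filling positions that fall outside the sentence.
    """
    padded = [pad] * window + list(features) + [pad] * window
    width = 2 * window + 1
    return [tuple(padded[i:i + width]) for i in range(len(features))]


# Placeholder features a..g for the seven words of the example sentence.
features = ["a", "b", "c", "d", "e", "f", "g"]
vectors = context_vectors(features)
```

With `window=2`, `vectors[0]` corresponds to x₁ and `vectors[6]` to x₇ in the example. In practice each placeholder would itself be a vector of tags and character types, so the windowed vector would be their concatenation.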

The pattern determination unit 4 uses the words extracted by the feature extraction unit 3 and their feature vectors, and applies the determination rule generated by the determination rule generation unit 8 (described later) to determine whether each word appearing in the document represents harmful information.

Each word can be judged by, for example, a pattern-matching method, but a method based on machine learning may also be used. In the pattern-matching method, a list enumerating every word that may represent harmful information is prepared, and each word is mechanically checked against that list.
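A minimal sketch of the list-based check is shown below. The function name `flag_words` and the contents of `ng_words` are hypothetical examples, not data from the patent.

```python
def flag_words(words, ng_words):
    """Return 1 for each word found in the NG word list, else 0."""
    return [1 if word in ng_words else 0 for word in words]


# Hypothetical NG word list containing only the plain spellings.
ng_words = {"覚醒剤", "拳銃"}

slang_sentence = ["れんこん", "を", "１５万円", "で", "売り", "ます", "。"]
plain_sentence = ["拳銃", "を", "売り", "ます", "。"]

slang_flags = flag_words(slang_sentence, ng_words)
plain_flags = flag_words(plain_sentence, ng_words)
```

The slang sentence produces no flags at all, which is the weakness of exact matching that the context-based feature vectors are meant to address.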

In the machine-learning method, a determination rule is learned mechanically from documents in which it was judged manually in the past whether the words they contain represent harmful information, and the rule is applied to new documents to judge each word they contain.

Let X = {x₁, …, x_i, …, x_n} be the set of feature vectors of the word sequence in a document, and let y = (y₁, …, y_i, …, y_n), y_i ∈ {0, 1}, be the vector of determination results for the words, where y_i = 1 (y_i = 0) means that the i-th word represents (does not represent) harmful information. Then, using a score function f(X, y; w) with parameter w, the determination rule is given by argmax_y f(X, y; w). The value of the parameter w is estimated from documents in which it was judged manually in the past whether the words they contain represent harmful information.
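To make the rule argmax_y f(X, y; w) concrete, the sketch below enumerates every labeling y for a tiny input. The score function used here, f(X, y; w) = Σ_i y_i (w · x_i), is an assumption chosen for illustration; the patent leaves the form of f open, and the names `score` and `decide` are hypothetical.

```python
from itertools import product


def score(X, y, w):
    """Hypothetical score: f(X, y; w) = sum_i y_i * (w . x_i)."""
    return sum(
        y_i * sum(w_k * x_k for w_k, x_k in zip(w, x_i))
        for x_i, y_i in zip(X, y)
    )


def decide(X, w):
    """Return argmax_y f(X, y; w) by enumerating all y in {0, 1}^n."""
    candidates = product((0, 1), repeat=len(X))
    return max(candidates, key=lambda y: score(X, y, w))


# Toy numeric feature vectors for three words and a toy parameter vector.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
w = (2.0, -3.0)
y_hat = decide(X, w)
```

Exhaustive enumeration grows as 2^n and is only workable for toy inputs. With a score that decomposes per word, as assumed here, each y_i can be decided independently; sequence models such as CRFs typically use dynamic programming instead.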

The methods described in Patent Document 2 and Non-Patent Documents 1 to 5, for example, can be applied to choosing the form of the function used in the determination rule and to estimating its parameter values.

The pattern region extraction unit 5 extracts paragraphs that contain many words determined by the pattern determination unit 4 to represent harmful information. Paragraphs can be extracted by, for example, computing the proportion of all words in a paragraph that were determined to represent harmful information and selecting the paragraphs for which that proportion is high.
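The ratio-based paragraph selection can be sketched as follows. The names `harmful_ratio` and `rank_paragraphs` and the `top_k` parameter are hypothetical; the patent says only that paragraphs with a high proportion of flagged words are selected, without fixing how many.

```python
def harmful_ratio(flags):
    """Proportion of words in a paragraph that were flagged (1) as harmful."""
    return sum(flags) / len(flags) if flags else 0.0


def rank_paragraphs(paragraph_flags, top_k=1):
    """Return the indices of the top_k paragraphs by harmful-word ratio."""
    order = sorted(
        range(len(paragraph_flags)),
        key=lambda i: harmful_ratio(paragraph_flags[i]),
        reverse=True,
    )
    return order[:top_k]


# Per-word decisions (1 = harmful) for three paragraphs of one document.
paragraph_flags = [
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
selected = rank_paragraphs(paragraph_flags, top_k=1)
```

Setting `top_k=1` presents a single paragraph per document to the operator; a larger value or a ratio threshold would trade more reading for a lower chance of showing an uninformative region.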

When the pattern determination unit 4 adopts a determination rule based on machine learning, a score value may be defined for each paragraph on the basis of the score function of the determination rule, and paragraphs may be extracted by selecting those with high score values.

Where X′ is the set of feature vectors of the word sequence in the paragraph and n is the number of words, the score value of the paragraph can be defined using the score function f(X, y; w), for example as: [equation not reproduced in the source text]

The output unit 6 displays the paragraphs extracted by the pattern region extraction unit 5 on screen and presents them to the operator. For on-screen display, for example, only the paragraphs may be shown, or all the information in the document may be shown with the extracted paragraphs highlighted. The output unit 6 also saves, as needed and in a suitable location (for example, a memory not shown in the figure), the operator's judgment of whether the document should be detected, together with the words, sentences, and so on that the operator judged to represent harmful information.

The training data DB 7 accumulates documents that were judged manually in the past as to whether they contain harmful information; in the harmful documents among them, the words judged to represent harmful information are tagged. Documents saved by the output unit 6, together with the operator's judgment results, may also be added to the training data DB 7 as they arise.

The determination rule generation unit 8 uses the documents and tags accumulated in the training data DB 7 as training data to mechanically generate the determination rule that the pattern determination unit 4 uses to judge words. For example, when the pattern determination unit 4 judges words using a word list, the word list can be created by enumerating all tagged words in the harmful documents in the training data, or by extracting the words that were tagged many times.
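Both ways of building the word list (all tagged words, or only frequently tagged words) can be sketched with a single threshold. The data layout, a list of documents each given as `(word, tagged)` pairs, and the names `build_word_list` and `min_count` are assumptions made for illustration.

```python
from collections import Counter


def build_word_list(tagged_documents, min_count=1):
    """Collect words that were tagged as harmful at least min_count times.

    Each document is a list of (word, tagged) pairs; this layout is a
    hypothetical stand-in for the contents of the training data DB.
    """
    counts = Counter(
        word
        for document in tagged_documents
        for word, tagged in document
        if tagged
    )
    return {word for word, count in counts.items() if count >= min_count}


# Toy training data: two harmful documents with operator tags.
training = [
    [("レンコン", True), ("を", False), ("売り", False)],
    [("レンコン", True), ("覚○剤", True), ("あり", False)],
]
all_tagged = build_word_list(training)
frequent = build_word_list(training, min_count=2)
```

With `min_count=1` every tagged word is listed; raising the threshold keeps only words tagged repeatedly, which shortens the list at the cost of missing rare expressions.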

When the pattern determination unit 4 uses a determination rule obtained by machine learning, the rule for estimating whether each word matches the criterion can be generated by using the information on whether each word of the content in the training data DB 7 is tagged. This information is used to estimate the parameter values of a classifier, such as the support vector machine (SVM), logistic regression model, or naive Bayes model described in Non-Patent Documents 1, 2, and 4, or the parameter values of a labeler for structured data, such as the conditional random field (CRF) described in Non-Patent Document 3.
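As one illustration of estimating classifier parameters from tagged words, the sketch below fits a plain logistic regression model by batch gradient descent. It is a toy under stated assumptions, not the method of the cited documents: the two binary features (a price expression nearby, a selling verb nearby), the learning rate, the epoch count, and the function names are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x_i, y_i in zip(X, y):
            z = sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b
            error = sigmoid(z) - y_i          # derivative of the log loss w.r.t. z
            for k, x_k in enumerate(x_i):
                grad_w[k] += error * x_k
            grad_b += error
        w = [w_k - lr * g_k for w_k, g_k in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    """Return 1 if the model scores the word as harmful, else 0."""
    z = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
    return 1 if z >= 0 else 0


# Hypothetical features per word: (price expression nearby, selling verb nearby).
# Label 1 means the word was tagged as harmful in the training data.
X = [(1, 1), (1, 0), (0, 1), (0, 0)]
y = [1, 0, 0, 0]
w, b = train_logistic(X, y)
predictions = [predict(x, w, b) for x in X]
```

In this toy data a word is tagged only when both contextual cues are present, so the learned rule flags a word from its surroundings rather than from its surface form.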

Alternatively, among the words in the harmful documents in the training data, tagged words may be treated as words representing harmful information and untagged words as words for which it is unknown whether they represent harmful information, while all words in harmless documents are treated as words that do not represent harmful information. The determination rule may then be generated by estimating the parameter values of the classifier or structured-data labeler with the semi-supervised learning techniques described in Patent Document 2 and Non-Patent Document 5.

Furthermore, the parameter values of the classifier or structured-data labeler may be estimated with semi-supervised learning from the words in documents for which it is unknown whether they contain harmful information together with the words in documents for which that has been judged. In this case, every word in a document of unknown status is treated as a word for which it is unknown whether it represents harmful information.

Here, for example, content may be judged to contain harmful information if it contains harmful information in even one place, so there is no need to check multiple places in the same content manually.

Therefore, as in the above embodiment, having the pattern region extraction unit 5 compute the proportion of all words in a paragraph that were determined to represent harmful information and select the paragraphs with a high proportion markedly reduces the amount of information to be checked manually at the output unit 6 and greatly lightens the burden.

In the above embodiment, whether the word under determination represents the predetermined information (harmful information) is determined using a feature vector that also includes the feature amounts of several words before and after that word. This makes it possible to detect information even when it is expressed with the new words, masked spellings, code words, and similar expressions that appear daily in Web content.

For example, suppose a document contains the sentence "れんこんを１５万円で売ります" ("I will sell renkon (lotus root) for 150,000 yen"). Even if it is not known that the word "renkon" is itself a code word with a harmful meaning (handgun), the information surrounding "renkon", such as "150,000 yen" and "sell", allows the region "I will sell renkon for 150,000 yen" to be detected as a region likely to contain harmful information.
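
A context-window feature vector of this kind can be sketched as follows. The sketch uses only surface forms as features and a window of two words on each side. The window size, the feature naming scheme, the `<PAD>` marker, and the word segmentation of the example sentence are all assumptions made for illustration; the actual feature amounts may differ.

```python
from typing import Dict, List


def context_features(tokens: List[str], index: int, window: int = 2) -> Dict[str, int]:
    """Build a sparse feature vector for tokens[index].

    The vector contains the target word plus the `window` words before
    and after it; positions outside the sentence are marked "<PAD>".
    """
    features: Dict[str, int] = {}
    for offset in range(-window, window + 1):
        j = index + offset
        token = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        # Feature names encode the relative position, e.g. "w[+1]=を".
        features[f"w[{offset:+d}]={token}"] = 1
    return features


# Assumed segmentation of the example sentence.
tokens = ["れんこん", "を", "15万円", "で", "売り", "ます"]
renkon_features = context_features(tokens, 0)
```

With this representation, the vector for "れんこん" includes the feature `w[+2]=15万円`, so a learned rule can react to the price expression even when the target word itself has never been seen in training.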

The content detection support method of the present invention executes, for example, the processing performed by each unit of the content detection support apparatus 1 shown in FIG. 1.

First, the feature amount extraction unit 3 extracts the feature amount of each word contained in the content (document) input through the input unit 2 (feature amount extraction step).

Next, the determination rule generation unit 8 uses the documents and tags stored in the training data DB 7 as training data to generate a determination rule for judging whether each word (each pattern) contained in a document satisfies the predetermined criterion, that is, whether it represents harmful information (determination rule generation step).

Note that the execution order of the feature amount extraction step and the determination rule generation step is not limited to the order given above.

Next, the pattern determination unit 4 applies the determination rule generated by the determination rule generation unit 8 to the words and feature vectors extracted by the feature amount extraction unit 3, and determines whether each word appearing in the document represents harmful information (pattern determination step).

Next, the pattern region extraction unit 5 extracts a partial region (paragraph) that contains many words determined by the pattern determination unit 4 to represent harmful information (pattern region extraction step).
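
The sequence of steps above can be sketched as a small pipeline. In this sketch the feature extractor and the determination rule are passed in as plain functions, and the region with the largest count of flagged words is returned. The function names, the callable interfaces, and the toy rule in the usage lines are hypothetical and stand in for the learned components.

```python
from typing import Callable, Dict, List

Features = Dict[str, int]


def detect_region(
    paragraphs: List[List[str]],
    extract: Callable[[List[str], int], Features],
    rule: Callable[[Features], bool],
) -> int:
    """Return the index of the paragraph containing the most flagged words.

    paragraphs must be non-empty; ties go to the earliest paragraph.
    """
    counts: List[int] = []
    for words in paragraphs:
        # Feature amount extraction step, then pattern determination step.
        flags = [rule(extract(words, i)) for i in range(len(words))]
        counts.append(sum(flags))
    # Pattern region extraction step.
    return max(range(len(paragraphs)), key=counts.__getitem__)


# Toy stand-ins for the learned components (assumptions for illustration).
toy_extract = lambda words, i: {words[i]: 1}
toy_rule = lambda features: "sell" in features
region = detect_region([["hello", "world"], ["sell", "sell", "now"]], toy_extract, toy_rule)
```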

In the feature amount extraction step executed by the feature amount extraction unit 3, a feature vector is extracted that is defined by combining the feature amount of the pattern (word) under determination with the feature amounts of several patterns before and after that pattern.

In the determination rule generation step executed by the determination rule generation unit 8, for each pattern in content known to contain information satisfying the predetermined criterion (harmful information), a tag is attached to every pattern known to be information satisfying the criterion. Among the patterns in such content, tagged patterns are treated as patterns that satisfy the criterion, and untagged patterns are treated as patterns for which it is unknown whether they satisfy the criterion. All patterns in content known not to contain information satisfying the criterion are treated as patterns that do not contain information satisfying the criterion. Semi-supervised learning is then used to learn the parameters of a score function that indicates how likely each pattern in both kinds of content is to be a pattern satisfying the criterion, and the learned score function is used as the determination rule.

In the pattern region extraction step executed by the pattern region extraction unit 5, for each partial region in the content, the ratio of the number of patterns in that region determined to satisfy the predetermined criterion to the total number of patterns in that region is calculated. The partial region with the highest ratio is extracted as the partial region containing many patterns that the pattern determination unit 4 determined to satisfy the predetermined criterion.
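
The ratio-based selection can be sketched as follows, assuming the per-word determinations are already available as booleans for each partial region. The function name `best_region_by_ratio` and the handling of empty regions (ratio 0) are assumptions of this example.

```python
from typing import List


def best_region_by_ratio(flags_per_region: List[List[bool]]) -> int:
    """Return the index of the region with the highest share of flagged patterns.

    flags_per_region[r][i] is True if pattern i of region r was determined
    to satisfy the criterion. The outer list must be non-empty.
    """
    def ratio(flags: List[bool]) -> float:
        # An empty region is given ratio 0 to avoid division by zero.
        return sum(flags) / len(flags) if flags else 0.0

    ratios = [ratio(flags) for flags in flags_per_region]
    return max(range(len(ratios)), key=ratios.__getitem__)
```

For example, a region with one flagged word out of four (ratio 0.25) loses to a region with two flagged words out of three (ratio about 0.67), so normalizing by region length keeps long paragraphs from winning merely because they contain more words.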

Also, in the pattern region extraction step executed by the pattern region extraction unit 5, a score value may be calculated for each pattern in the content by applying the determination rule, which is the score function. For each partial region in the content, the sum of the score values of all patterns in that region is calculated, and the partial region with the highest sum is extracted as the partial region containing many patterns that the pattern determination unit 4 determined to satisfy the predetermined criterion.
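
The score-sum selection can be sketched as follows. The form of the score function is not fixed by the text above, so this example assumes a logistic function over a weighted sum of sparse features. The weights, the function names, and the feature names in the usage lines are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Features = Dict[str, float]


def score(features: Features, weights: Dict[str, float], bias: float = 0.0) -> float:
    """Assumed score function: logistic of a weighted feature sum, in (0, 1)."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def best_region_by_score_sum(
    regions: List[List[Features]], weights: Dict[str, float]
) -> Tuple[int, List[float]]:
    """Return the index of the region with the highest total score, plus all totals.

    regions[r] holds one feature vector per pattern; regions must be non-empty.
    """
    totals = [sum(score(f, weights) for f in region) for region in regions]
    return max(range(len(totals)), key=totals.__getitem__), totals


# Hypothetical learned weights and two one-word regions.
toy_weights = {"sell": 2.0}
best, totals = best_region_by_score_sum([[{"hello": 1.0}], [{"sell": 1.0}]], toy_weights)
```

Unlike the ratio-based selection, the sum uses the graded score of every pattern instead of a yes/no decision. With a score function that is always positive, as assumed here, the sum also grows with the number of patterns in a region.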

Some or all of the functions of the means in the content detection support apparatus of this embodiment can be implemented as a computer program, and the present invention can be realized by executing that program on a computer. Likewise, the procedure of the content detection support method of this embodiment can of course be implemented as a computer program and executed by a computer. A program for realizing these functions on a computer can be recorded on a computer-readable recording medium, such as an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and stored or distributed in that form. The program can also be provided over a network, for example via the Internet or electronic mail.

DESCRIPTION OF SYMBOLS
1: Content detection support apparatus
2: Input unit
3: Feature amount extraction unit
4: Pattern determination unit
5: Pattern region extraction unit
6: Output unit
7: Training data DB
8: Determination rule generation unit

Claims

A content detection support apparatus that extracts, from a group of content including text information, a partial region within content that includes information satisfying a predetermined criterion, the apparatus comprising:
feature amount extraction means for dividing the text in each piece of content into predetermined units and extracting a feature amount for each pattern, a pattern being one of the divided portions;
determination rule generation means for generating, from content for which it is known whether or not it includes information satisfying the predetermined criterion, a determination rule for determining whether each pattern included in content satisfies the predetermined criterion;
pattern determination means for applying the determination rule to each pattern, using the feature amount extracted for each pattern by the feature amount extraction means, to determine whether each pattern satisfies the predetermined criterion; and
pattern region extraction means for extracting, from the content, a partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion,
wherein the feature amount extraction means extracts a feature vector defined by combining the feature amount of the pattern under determination with the feature amounts of several patterns before and after that pattern.

A content detection support apparatus that extracts, from a group of content including text information, a partial region within content that includes information satisfying a predetermined criterion, the apparatus comprising:
feature amount extraction means for dividing the text in each piece of content into predetermined units and extracting a feature amount for each pattern, a pattern being one of the divided portions;
determination rule generation means for generating, from content for which it is known whether or not it includes information satisfying the predetermined criterion, a determination rule for determining whether each pattern included in content satisfies the predetermined criterion;
pattern determination means for applying the determination rule to each pattern, using the feature amount extracted for each pattern by the feature amount extraction means, to determine whether each pattern satisfies the predetermined criterion; and
pattern region extraction means for extracting, from the content, a partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion,
wherein the determination rule generation means:
attaches, for each pattern in content known to include information satisfying the predetermined criterion, a tag to each pattern known to be information satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is attached as patterns satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is not attached as patterns for which it is unknown whether they satisfy the predetermined criterion;
treats all patterns in content known not to include information satisfying the predetermined criterion as patterns that do not include information satisfying the predetermined criterion; and
learns, by semi-supervised learning, the parameters of a score function indicating the likelihood that each pattern, included in the content known to include information satisfying the predetermined criterion and in the content known not to include such information, is a pattern satisfying the predetermined criterion, and uses the learned score function as the determination rule.

A content detection support apparatus that extracts, from a group of content including text information, a partial region within content that includes information satisfying a predetermined criterion, the apparatus comprising:
feature amount extraction means for dividing the text in each piece of content into predetermined units and extracting a feature amount for each pattern, a pattern being one of the divided portions;
determination rule generation means for generating, from content for which it is known whether or not it includes information satisfying the predetermined criterion, a determination rule for determining whether each pattern included in content satisfies the predetermined criterion;
pattern determination means for applying the determination rule to each pattern, using the feature amount extracted for each pattern by the feature amount extraction means, to determine whether each pattern satisfies the predetermined criterion; and
pattern region extraction means for extracting, from the content, a partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion,
wherein the feature amount extraction means extracts a feature vector defined by combining the feature amount of the pattern under determination with the feature amounts of several patterns before and after that pattern, and
wherein the determination rule generation means:
attaches, for each pattern in content known to include information satisfying the predetermined criterion, a tag to each pattern known to be information satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is attached as patterns satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is not attached as patterns for which it is unknown whether they satisfy the predetermined criterion;
treats all patterns in content known not to include information satisfying the predetermined criterion as patterns that do not include information satisfying the predetermined criterion; and
learns, by semi-supervised learning, the parameters of a score function indicating the likelihood that each pattern, included in the content known to include information satisfying the predetermined criterion and in the content known not to include such information, is a pattern satisfying the predetermined criterion, and uses the learned score function as the determination rule.

The content detection support apparatus according to claim 1, wherein the pattern region extraction means calculates, for each partial region in the content, the ratio of the number of patterns in that partial region determined to satisfy the predetermined criterion to the total number of patterns in that partial region, and extracts the partial region with the highest ratio as the partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion.

The content detection support apparatus according to claim 2 or 3, wherein the pattern region extraction means calculates a score value for each pattern in the content by applying the determination rule, which is the score function, calculates, for each partial region in the content, the sum of the score values of all patterns in that partial region, and extracts the partial region with the highest sum of score values as the partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion.

A content detection support method for extracting, from a group of content including text information, a partial region within content that includes information satisfying a predetermined criterion, the method comprising:
a feature amount extraction step in which feature amount extraction means divides the text in each piece of content into predetermined units and extracts a feature amount for each pattern, a pattern being one of the divided portions;
a determination rule generation step in which determination rule generation means generates, from content for which it is known whether or not it includes information satisfying the predetermined criterion, a determination rule for determining whether each pattern included in content satisfies the predetermined criterion;
a pattern determination step in which pattern determination means applies the determination rule to each pattern, using the feature amount extracted for each pattern by the feature amount extraction means, to determine whether each pattern satisfies the predetermined criterion; and
a pattern region extraction step of extracting, from the content, a partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion,
wherein the feature amount extraction step extracts a feature vector defined by combining the feature amount of the pattern under determination with the feature amounts of several patterns before and after that pattern.

A content detection support method for extracting, from a group of content including text information, a partial region within content that includes information satisfying a predetermined criterion, the method comprising:
a feature amount extraction step in which feature amount extraction means divides the text in each piece of content into predetermined units and extracts a feature amount for each pattern, a pattern being one of the divided portions;
a determination rule generation step in which determination rule generation means generates, from content for which it is known whether or not it includes information satisfying the predetermined criterion, a determination rule for determining whether each pattern included in content satisfies the predetermined criterion;
a pattern determination step in which pattern determination means applies the determination rule to each pattern, using the feature amount extracted for each pattern by the feature amount extraction means, to determine whether each pattern satisfies the predetermined criterion; and
a pattern region extraction step of extracting, from the content, a partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion,
wherein the determination rule generation step:
attaches, for each pattern in content known to include information satisfying the predetermined criterion, a tag to each pattern known to be information satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is attached as patterns satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is not attached as patterns for which it is unknown whether they satisfy the predetermined criterion;
treats all patterns in content known not to include information satisfying the predetermined criterion as patterns that do not include information satisfying the predetermined criterion; and
learns, by semi-supervised learning, the parameters of a score function indicating the likelihood that each pattern, included in the content known to include information satisfying the predetermined criterion and in the content known not to include such information, is a pattern satisfying the predetermined criterion, and uses the learned score function as the determination rule.

A content detection support method for extracting, from a group of content including text information, a partial region within content that includes information satisfying a predetermined criterion, the method comprising:
a feature amount extraction step in which feature amount extraction means divides the text in each piece of content into predetermined units and extracts a feature amount for each pattern, a pattern being one of the divided portions;
a determination rule generation step in which determination rule generation means generates, from content for which it is known whether or not it includes information satisfying the predetermined criterion, a determination rule for determining whether each pattern included in content satisfies the predetermined criterion;
a pattern determination step in which pattern determination means applies the determination rule to each pattern, using the feature amount extracted for each pattern by the feature amount extraction means, to determine whether each pattern satisfies the predetermined criterion; and
a pattern region extraction step of extracting, from the content, a partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion,
wherein the feature amount extraction step extracts a feature vector defined by combining the feature amount of the pattern under determination with the feature amounts of several patterns before and after that pattern, and
wherein the determination rule generation step:
attaches, for each pattern in content known to include information satisfying the predetermined criterion, a tag to each pattern known to be information satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is attached as patterns satisfying the predetermined criterion;
treats, among the patterns included in that content, the patterns to which the tag is not attached as patterns for which it is unknown whether they satisfy the predetermined criterion;
treats all patterns in content known not to include information satisfying the predetermined criterion as patterns that do not include information satisfying the predetermined criterion; and
learns, by semi-supervised learning, the parameters of a score function indicating the likelihood that each pattern, included in the content known to include information satisfying the predetermined criterion and in the content known not to include such information, is a pattern satisfying the predetermined criterion, and uses the learned score function as the determination rule.

The content detection support method according to any one of claims 6 to 8, wherein the pattern region extraction step calculates, for each partial region in the content, the ratio of the number of patterns in that partial region determined to satisfy the predetermined criterion to the total number of patterns in that partial region, and extracts the partial region with the highest ratio as the partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion.

The content detection support method according to claim 7 or 8, wherein the pattern region extraction step calculates a score value for each pattern in the content by applying the determination rule, which is the score function, calculates, for each partial region in the content, the sum of the score values of all patterns in that partial region, and extracts the partial region with the highest sum of score values as the partial region that includes many patterns determined by the pattern determination means to satisfy the predetermined criterion.

A content detection support program for causing a computer to function as each of the means according to any one of claims 1 to 5.