JP5775508B2

JP5775508B2 - Spam account extraction apparatus and spam account extraction method

Info

Publication number: JP5775508B2
Application number: JP2012262369A
Authority: JP
Inventors: 勇二森; 大祐鳥居
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2012-11-30
Filing date: 2012-11-30
Publication date: 2015-09-09
Anticipated expiration: 2032-11-30
Also published as: JP2014109794A

Description

The present invention relates to an extraction apparatus and an extraction method for extracting spam accounts.

Conventionally, communication tools such as miniblogs (microblogs), a kind of blog, have been used in services for posting documents to a server on a network such as the Internet. Users post documents describing all manner of things, such as daily events and impressions. The posted documents form stream data, which is regarded as reflecting public opinion and is analyzed and used for a variety of purposes in many fields, including business and entertainment.

One example of using stream data is a service that analyzes how often words appear in the stream data, treats words whose frequency of appearance is rising as topic words, and extracts those topic words.
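The idea of treating words with rising frequency as topic words can be illustrated with a minimal Python sketch. It is not taken from the patent: the two-window comparison, the whitespace tokenization, and the `min_count` and `min_ratio` parameters are all assumptions made for illustration. Japanese text in particular would need a morphological analyzer rather than `split()`.

```python
from collections import Counter

def rising_words(previous_docs, recent_docs, min_count=3, min_ratio=2.0):
    """Return words whose frequency rose from the previous window to the recent one."""
    prev = Counter(w for d in previous_docs for w in d.split())
    curr = Counter(w for d in recent_docs for w in d.split())
    rising = []
    for word, count in curr.items():
        # Add one to the old count so words unseen before do not divide by zero.
        ratio = count / (prev[word] + 1)
        if count >= min_count and ratio >= min_ratio:
            rising.append(word)
    return sorted(rising)

# Hypothetical sample data for two consecutive time windows.
previous = ["rain today", "coffee time", "rain again"]
recent = ["eclipse now", "watching eclipse", "eclipse photos", "rain today"]
topic_words = rising_words(previous, recent)
```

Here only `eclipse` appears often enough in the recent window, and sufficiently more often than before, to be returned.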

By their nature, topic words are often displayed in prominent positions on a miniblog site or its partner sites, so they tend to attract many views. Exploiting this tendency, spam posts are rampant: posts that carry one-sided advertising aimed at an unspecified large audience, such as links that pose as discussion of a topic word but lead to the poster's own site.

Known prior art for dealing with such spam posts includes a method of eliminating spam and junk e-mail (see, for example, Patent Document 1) and a method of identifying spammers (accounts that send spam) using machine learning (see, for example, Non-Patent Document 1).

In the method of eliminating spam and junk e-mail, a server manages keywords that serve as criteria for rejecting mail sent from a sender to a recipient. If the body of a mail sent from the sender to the recipient contains a registered keyword, the mail is discarded, so that spam and junk e-mail are not delivered to the recipient.

The method of identifying spammers using machine learning focuses on the fact that spammers, the senders of spam, form illegitimate links. First, fictitious user accounts are created; next, spammers are identified among the accounts that send link requests to those fictitious accounts. Machine learning is then used to classify spammers from the behavior observed in the accounts judged to be spammers.

Patent Document 1: JP 2004-133508 A

Non-Patent Document 1: K. Lee, J. Caverlee, and S. Webb, "Uncovering social spammers: Social honeypots + machine learning", SIGIR, 2010.

However, the method described in Patent Document 1 decides whether a mail is junk based on keywords contained in its body, so if the body contains no registered keyword, it is difficult to flag the mail even when it targets a topic word. Consequently, if a posted document contains no registered keyword, it may not be possible to judge that document to be a spam document.

In addition, the method described in Non-Patent Document 1 uses detailed information about each account's behavior to judge spam accounts, which makes it difficult to cover all accounts. Furthermore, attempting to raise the precision of spam determination with this method risks false negatives, in which an account that is actually a spammer is wrongly judged not to be one.

The present invention has been made in view of the above problems, and its object is to narrow down the documents to be checked for spam and to extract spam accounts efficiently from simple features.

A spam account extraction apparatus according to the present invention, for use with a service in which documents are posted to a server on a network, comprises: a topic word collection unit that collects, as topic words on a specific topic, words whose frequency of appearance is rising among the words appearing in stream data composed of documents; a search unit that searches the posted documents for documents containing at least one topic word; a spam document determination unit that determines a document retrieved by the search unit to be a spam document if it contains at least a certain number of topic words; and a spam account determination unit that determines an account posting spam documents to be a spam account based on the frequency with which spam documents are posted.

According to the spam account extraction apparatus of the present invention, the topic word collection unit collects topic words on a specific topic, and the search unit retrieves, from the posted documents, those containing at least one of the collected words. The spam document determination unit then determines whether each posted document is a spam document, and if it is, the spam account determination unit determines the account that posted it to be a spam account. In this way, the documents to be examined are first narrowed down based on the topic words, and only the narrowed-down documents are checked for spam. An account that posts documents judged to be spam a certain number of times or more is then determined to be a spam account. Spam accounts can therefore be extracted efficiently and comprehensively. Moreover, a document that contains at least a certain number of different topic words can be determined to be a spam document, which makes the extraction of spam accounts more efficient still.

The above spam account extraction apparatus may further comprise an inappropriate URL determination unit that determines a URL contained in a document to be an inappropriate URL if the URL links to an inappropriate site, and the spam document determination unit may determine the document to be a spam document if the URL is determined to be an inappropriate URL.

With this configuration, a document that contains a link to an inappropriate site can be determined to be a spam document, so spam accounts can be extracted more efficiently.

In the above spam account extraction apparatus, the inappropriate URL determination unit may determine a URL to be an inappropriate URL if the URL, or the URL to which it redirects, contains a predefined character string.

With this configuration, the inappropriate URL determination unit can tell in advance from the URL itself that the site is inappropriate, for example an affiliate or adult site. A document containing an inappropriate URL can thus be determined to be a spam document, so spam accounts can be extracted more efficiently.
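As a rough illustration of the string-based check, the following Python sketch flags a URL when it, or its redirect target, contains one of a set of predefined substrings. The substrings in `INAPPROPRIATE_SUBSTRINGS` and the function name are hypothetical; the description only states that the character strings are defined in advance.

```python
# Hypothetical patterns; the actual predefined strings are not given in the text.
INAPPROPRIATE_SUBSTRINGS = ("affiliate", "adult")

def is_inappropriate_url(url, redirect_target=None, patterns=INAPPROPRIATE_SUBSTRINGS):
    """Return True if the URL or its redirect target contains a predefined string."""
    candidates = [url]
    if redirect_target:
        candidates.append(redirect_target)
    # Compare case-insensitively against every candidate URL.
    return any(p in c.lower() for c in candidates for p in patterns)

shortened_hit = is_inappropriate_url(
    "http://sho.rt/abc", "http://example.com/Affiliate?id=1"
)
plain_miss = is_inappropriate_url("http://example.com/news")
```

Checking the redirect target as well as the original URL matters because a shortened URL rarely contains any telltale string itself.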

In the above spam account extraction apparatus, the inappropriate URL determination unit may determine a URL to be an inappropriate URL if the document reached via the URL, or via the URL to which it redirects, contains a predefined word.

With this configuration, if the document reached via a URL contained in a document, or via the URL to which it redirects, contains an inappropriate word, the inappropriate URL determination unit determines the originating URL to be an inappropriate URL. A document containing an inappropriate URL can thus be determined to be a spam document, so spam accounts can be extracted more efficiently.

In the above spam account extraction apparatus, the inappropriate URL determination unit may determine a URL to be an inappropriate URL if the document reached via the URL, or via the URL to which it redirects, is linked from documents containing a plurality of unrelated topic words.

With this configuration, if the document reached via the URL, or via the URL to which it redirects, is linked from documents containing a plurality of unrelated topic words, the inappropriate URL determination unit determines the originating URL to be an inappropriate URL. A document containing an inappropriate URL can thus be determined to be a spam document, so spam accounts can be extracted more efficiently.

In the above spam account extraction apparatus, the inappropriate URL determination unit may determine a URL to be an inappropriate URL if the document reached via the URL, or via the URL to which it redirects, does not contain the topic word used in the posted document.

With this configuration, if the document reached via the URL, or via the URL to which it redirects, does not contain the topic word used in the originating document, the inappropriate URL determination unit determines the originating URL to be an inappropriate URL. A document containing an inappropriate URL can thus be determined to be a spam document, so spam accounts can be extracted more efficiently.

In the above spam account extraction apparatus, the inappropriate URL determination unit may determine a URL to be an inappropriate URL if the document reached via the URL, or via the URL to which it redirects, contains a plurality of unrelated topic words.

With this configuration, if the document reached via the URL, or via the URL to which it redirects, contains a plurality of unrelated topic words, the inappropriate URL determination unit determines the URL to be an inappropriate URL. A document containing an inappropriate URL can thus be determined to be a spam document, so spam accounts can be extracted more efficiently.

The above spam account extraction apparatus may further comprise: a friend list acquisition unit that acquires a list of the accounts able to view documents posted by a first account; and an inter-account similarity calculation unit that calculates the similarity between the friend list of the first account acquired by the friend list acquisition unit and the list of accounts able to view documents posted by an account that the spam account determination unit has determined to be a spam account. The spam account determination unit may then determine spam accounts based on the similarity.

With this configuration, the friend list acquisition unit acquires a list of the accounts able to view documents posted by the first account. The inter-account similarity calculation unit then calculates the similarity between the list of accounts able to view documents posted by an account determined to be a spam account by the spam account determination unit and the list of accounts able to view documents posted by the first account. Thus, once a spam account has been extracted, whether the first account is a spam account can be determined efficiently. Spam accounts can therefore be extracted efficiently and comprehensively.

The friend list acquisition unit may include a spam account extraction unit that, based on the inter-account similarity calculated by the inter-account similarity calculation unit, extracts from the friend list of the first account those accounts whose similarity exceeds a certain threshold.

With this configuration, the spam account extraction unit can extract, from the list of accounts acquired by the friend list acquisition unit, accounts that are highly similar to a spam account, based on document posting frequency. Spam accounts can therefore be extracted more accurately, efficiently, and comprehensively.
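The description does not specify how the similarity between two friend lists is computed. The Python sketch below assumes the Jaccard coefficient (shared accounts divided by all accounts in either list) purely for illustration; the function names, the sample data, and the 0.5 threshold are likewise hypothetical.

```python
def jaccard(list_a, list_b):
    """Jaccard similarity between two friend lists (an assumed measure)."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similar_accounts(friend_lists, spam_account, threshold=0.5):
    """Return accounts whose friend list similarity to a known spam account exceeds the threshold."""
    spam_friends = friend_lists[spam_account]
    return sorted(
        account
        for account, friends in friend_lists.items()
        if account != spam_account and jaccard(friends, spam_friends) > threshold
    )

# Hypothetical friend lists keyed by account ID.
friend_lists = {
    "spam1": ["u1", "u2", "u3", "u4"],
    "acct_a": ["u1", "u2", "u3"],
    "acct_b": ["u9"],
}
candidates = similar_accounts(friend_lists, "spam1")
```

In this sample, `acct_a` shares three of four accounts with the known spam account and is returned, while `acct_b` shares none.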

The above spam account extraction apparatus may further comprise a document index generation unit that generates an index of documents based on the spam account determination results, and browsing means that makes only posts from accounts that are not spam accounts viewable.

With this configuration, the document index generation unit generates an index of documents based on the determination results of the spam account determination unit, and the browsing means makes only posts from non-spam accounts viewable. Based on the document index, the browsing means can thus keep posts from spam accounts out of document search results.

The invention described above as a spam account extraction apparatus can also be understood as a spam account extraction method, which provides the same operations and effects. The spam account extraction method can be described as follows.

A spam account extraction method according to the present invention comprises: a topic word collection step of collecting, as topic words on a specific topic, words whose frequency of appearance is rising among the words appearing in stream data composed of documents; a search step of searching the posted documents for documents containing at least one topic word; a spam document determination step of determining a document retrieved in the search step to be a spam document if it contains at least a certain number of topic words; and a spam account determination step of determining an account posting spam documents to be a spam account based on the frequency with which spam documents are posted.

According to the present invention, the documents to be checked for spam can be narrowed down, and spam accounts can be extracted efficiently from simple features.

FIG. 1 is a functional block diagram of the spam account extraction apparatus according to the first embodiment.
FIG. 2 is a flowchart of spam account extraction in the spam account extraction apparatus according to the first embodiment.
FIG. 3 is a flowchart of spam document determination in the spam account extraction apparatus according to the first embodiment.
FIG. 4 is a flowchart of spam account determination in the spam account extraction apparatus according to the first embodiment.
FIG. 5 is a functional block diagram of the spam account extraction apparatus according to the second embodiment.
FIG. 6 is a functional block diagram of the inappropriate URL determination unit of the spam account extraction apparatus according to the second embodiment.
FIG. 7 is a flowchart of inappropriate URL determination in the spam account extraction apparatus according to the second embodiment.
FIG. 8 is a functional block diagram of the spam account extraction apparatus according to the third embodiment.
FIG. 9 is a flowchart of spam account extraction in the spam account extraction apparatus according to the third embodiment.
FIG. 10 is a functional block diagram of the spam account extraction apparatus according to the fourth embodiment.
FIG. 11 is a flowchart of document browsing control in the spam account extraction apparatus according to the fourth embodiment.
FIG. 12 is a schematic diagram showing the structure of the document index storage unit.
FIG. 13 is a schematic diagram showing the structure of the document index storage unit.
FIG. 14 is a schematic diagram showing the structure of the inappropriate URL storage unit.
FIG. 15 is a schematic diagram showing the structure of the spam document storage unit.
FIG. 16 is a schematic diagram showing the structure of the spam user storage unit.
FIG. 17 is a schematic diagram showing the structure of the inappropriate word storage unit.
FIG. 18 is a schematic diagram showing the structure of the referring-source topic word storage unit.
FIG. 19 is a schematic diagram showing the structure of the determination result storage unit.
FIG. 20 is a diagram showing an example of the hardware configuration of the spam account extraction apparatus.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Identical or equivalent elements are given the same reference numerals, and duplicate descriptions are omitted.

(First embodiment)
FIG. 1 is a functional block diagram of the spam account extraction apparatus according to the first embodiment. The spam account extraction apparatus 1 is suitable for application to an apparatus that posts documents to a server on a network such as the Internet.

As shown in FIG. 1, the spam account extraction apparatus 1 of this embodiment mainly comprises a determination unit 10A and a storage unit 20. The determination unit 10A comprises a topic word collection unit 11, a search unit 12, a URL expansion unit 13, a spam document determination unit 14, and a spam account determination unit 15. The storage unit 20 comprises a document index storage unit 21, an inappropriate URL storage unit 22, a spam document storage unit 23, and a spam user storage unit 24. The determination unit 10A may further comprise a document index generation unit 16. The function of each unit is outlined below.

The topic word collection unit 11 collects a list of the topic words that are currently trending. The search unit 12 searches for documents containing at least one topic word W and collects the set of documents containing a topic word W. The URL expansion unit 13 accesses each URL in the body text once and, for as long as it is being redirected, accesses the redirect target again. The spam document determination unit 14 determines whether each collected document is a spam post. The spam account determination unit 15 determines an account that posts spam documents to be a spam account based on the frequency with which spam documents are posted. The document index generation unit 16 registers the spam account extraction results in the index so that spam posts do not appear in search results.
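The behavior of the URL expansion unit 13, following redirects until none remains, can be sketched in Python as below. To keep the sketch self-contained and testable without network access, the redirect lookup is passed in as a function; `get_redirect`, `max_hops`, and the sample URLs are assumptions for illustration. In practice the lookup would issue an HTTP request and read the redirect target from the response.

```python
def expand_url(url, get_redirect, max_hops=10):
    """Follow redirects from url and return the final URL.

    get_redirect(url) is assumed to return the redirect target,
    or None when the URL does not redirect.
    """
    seen = {url}
    for _ in range(max_hops):
        target = get_redirect(url)
        # Stop when there is no further redirect or a loop is detected.
        if target is None or target in seen:
            break
        seen.add(target)
        url = target
    return url

# Hypothetical redirect table standing in for real HTTP responses.
redirects = {
    "http://sho.rt/a": "http://sho.rt/b",
    "http://sho.rt/b": "http://example.com/page",
}
final_url = expand_url("http://sho.rt/a", redirects.get)
```

The hop limit and the loop check are safeguards added in this sketch; the description itself only says that the redirect target is accessed again for as long as redirection continues.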

This embodiment mainly assumes two types of spam-type post. The first is a post by an account that periodically and automatically posts lists of topic words, rapidly rising search terms, and the like; this is hereinafter called a trend-type post. Because a trend-type post presents topic words as a ranking, it characteristically contains multiple topic words in a single document. Strictly speaking, a trend-type post is not a spam document, but it is unnecessary from the standpoint of the present invention's aim of extracting discussion about topic words, so this embodiment classifies it as spam.
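Because a trend-type post is characterized by containing several topic words at once, one simple way to detect it is to count the distinct topic words in the text. The Python sketch below assumes substring matching and a threshold of three, neither of which is specified in the description; the function names and sample posts are hypothetical.

```python
def count_topic_words(text, topic_words):
    """Count how many distinct topic words occur in the text."""
    return sum(1 for word in set(topic_words) if word in text)

def is_trend_type(text, topic_words, min_topic_words=3):
    """Treat a post as trend-type if it contains at least min_topic_words topic words."""
    return count_topic_words(text, topic_words) >= min_topic_words

# Hypothetical topic words and posts.
topics = ["eclipse", "election", "typhoon", "marathon"]
ranking_post = "Trending now: 1. eclipse 2. election 3. typhoon"
ordinary_post = "Watching the eclipse tonight"
ranking_flag = is_trend_type(ranking_post, topics)
ordinary_flag = is_trend_type(ordinary_post, topics)
```

The ranking-style post contains three topic words and is flagged, whereas the ordinary post mentions only one and is not.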

The other type of spam-type post contains, in addition to a topic word, a link to a specific predefined URL; this is hereinafter called a URL-type post. URL-type posts are basically intended to steer readers to the poster's own site or to an affiliate program. Two broad kinds have been observed. One is a post, for example from a retail site, promoting a product whose keywords include the topic word. The other poses as a post containing a data source about the topic word, but following the link actually leads to a completely unrelated destination. URL-type posts are also seen in places such as the comment sections of blog sites and resemble long-established forms of spam. However, they are hard to identify from the text of the document alone, and because the URLs are shortened and the volume of documents is enormous, examining every document is impractical.

As shown in FIGS. 12 and 13, the document index storage unit 21 consists of a document list 21a and an inverted index 21b. As shown in FIG. 12, the document list 21a stores the document ID, posting date and time, user ID, and body text in association with one another. As shown in FIG. 13, the inverted index 21b stores words and document IDs in association with one another. This makes it easy to look up which documents, identified by document ID, contain each word.
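The relationship between the document list and the inverted index can be illustrated with a small Python sketch. The field names (`doc_id`, `posted_at`, `user_id`, `text`) and the whitespace tokenization are assumptions chosen to mirror the columns described above, not a structure taken from the figures.

```python
from collections import defaultdict

# Hypothetical document list: ID, posting time, user ID, and body text.
documents = [
    {"doc_id": 1, "posted_at": "2012-11-30T10:00", "user_id": "u1", "text": "eclipse tonight"},
    {"doc_id": 2, "posted_at": "2012-11-30T10:05", "user_id": "u2", "text": "coffee time"},
    {"doc_id": 3, "posted_at": "2012-11-30T10:10", "user_id": "u1", "text": "eclipse photos"},
]

def build_inverted_index(docs):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc in docs:
        for word in doc["text"].split():
            index[word].add(doc["doc_id"])
    return dict(index)

inverted_index = build_inverted_index(documents)
eclipse_docs = inverted_index.get("eclipse", set())
```

Looking up a word then returns the matching document IDs directly, without scanning every document body.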

As shown in FIG. 14, the inappropriate URL storage unit 22 records a list of the URLs determined to be inappropriate.

The spam document storage unit 23 stores, for each document determined to be spam, the document ID and the ID of the account that posted it. As shown in FIG. 15, the spam document storage unit 23 records the document ID, posting date and time, user ID, topic word, URL, URL-type flag, and trend-type flag in one-to-one correspondence.

The spam user storage unit 24 stores the results of determining accounts with at least a certain number of trend-type or URL-type posts to be spam accounts. As shown in FIG. 16, the spam user storage unit 24 records the user IDs determined to be spam accounts.
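The rule that an account with at least a certain number of trend-type or URL-type posts is a spam account can be sketched in Python as follows. The record layout loosely mirrors the flags of the spam document storage unit 23, but the field names, the sample records, and the threshold of three posts are assumptions for illustration.

```python
from collections import Counter

def find_spam_accounts(spam_documents, min_posts=3):
    """Return user IDs with at least min_posts URL-type or trend-type posts."""
    counts = Counter(
        doc["user_id"]
        for doc in spam_documents
        if doc["url_type"] or doc["trend_type"]
    )
    return sorted(user for user, n in counts.items() if n >= min_posts)

# Hypothetical contents of the spam document store.
spam_documents = [
    {"doc_id": 1, "user_id": "bot", "url_type": True, "trend_type": False},
    {"doc_id": 2, "user_id": "bot", "url_type": False, "trend_type": True},
    {"doc_id": 3, "user_id": "bot", "url_type": True, "trend_type": False},
    {"doc_id": 4, "user_id": "alice", "url_type": True, "trend_type": False},
]
spam_users = find_spam_accounts(spam_documents)
```

A single flagged post, as for `alice` here, is not enough to mark the account; only `bot`, with three flagged posts, is returned.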

FIG. 20 shows an example of the hardware configuration of the spam account extraction apparatus 1. As hardware, the spam account extraction apparatus 1 comprises a CPU 1A; a RAM 1B; a ROM 1C; an input unit 1D consisting of a keyboard, a speech recognition device for voice input, and the like; a reading unit 1E that reads data, programs, and the like stored on a storage medium M inserted at a predetermined position; a communication unit 1F that communicates with the outside; an auxiliary storage unit 1G; and a display unit 1H that displays images including spam determination results, document search results, and the like. The functions of the functional blocks of the spam account extraction apparatus 1 described above are realized by loading programs, data, and the like into the RAM 1B or similar and executing the programs under the control of the CPU 1A. The hardware configuration of the spam account extraction apparatus in the second to fourth embodiments described later is the same as that shown in FIG. 20.

Next, the flow for extracting spam accounts in the first embodiment is described with reference to FIG. 2. FIG. 2 is a flowchart of spam account extraction in the spam account extraction apparatus 1.

As shown in FIG. 2, first, in step S11, the topic word collection unit 11 collects the topic words W that are currently trending. The topic words W may be generated from the time-series changes in the information stored in the document index storage unit 21, or may be obtained from an external miniblog analysis service, a ranking of rapidly rising search keywords, or the like.

Next, in step S12, the search unit 12 uses the topic words W collected by the topic word collection unit 11 to search for documents containing at least one topic word W and collects the set of documents containing a topic word W. A document qualifies if it contains any one of the collected topic words W; the number it contains does not matter. Because the relevance of the documents to the topic words W need not be considered, the search unit 12 may search in order of newest first. The aim is to collect exhaustively the posts containing each topic word W made since that topic word was published, but it is difficult to know when each word was published. The flow of FIG. 2 may therefore be executed periodically, and only the documents posted since a given earlier execution may be collected.
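Step S12 can be sketched in Python as a filter over the document list that keeps documents containing any topic word, optionally restricted to those posted after an earlier run, and returns them newest first. The `since` parameter, the field names, and the use of ISO 8601 strings (which sort chronologically as plain text) are assumptions for illustration.

```python
def search_documents(documents, topic_words, since=None):
    """Return documents containing at least one topic word, newest first.

    If since is given, only documents posted after that time are kept.
    """
    hits = [
        doc
        for doc in documents
        if any(word in doc["text"] for word in topic_words)
        and (since is None or doc["posted_at"] > since)
    ]
    return sorted(hits, key=lambda doc: doc["posted_at"], reverse=True)

# Hypothetical posted documents.
documents = [
    {"doc_id": 1, "posted_at": "2012-11-30T09:00", "text": "eclipse tonight"},
    {"doc_id": 2, "posted_at": "2012-11-30T10:00", "text": "coffee time"},
    {"doc_id": 3, "posted_at": "2012-11-30T11:00", "text": "typhoon and eclipse news"},
]
all_hits = search_documents(documents, ["eclipse", "typhoon"])
new_hits = search_documents(documents, ["eclipse", "typhoon"], since="2012-11-30T10:00")
```

Running the search periodically with `since` set to the time of the previous run collects each matching post once, which sidesteps the problem of not knowing when a topic word first became public.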

Then, in step S13, the spam document determination unit 14 determines whether each collected document is a spam post. The spam document determination unit 14 determines a document retrieved by the search unit 12 to be a spam document if the topic words W it contains are unrelated to a specific topic. The flow for determining whether a document is spam is described later with reference to FIG. 3.

Then, in step S14, the spam account determination unit 15 determines, based on the frequency with which spam documents are posted, that an account posting spam documents is a spam account. That is, the spam account determination unit 15 determines an account that posts documents determined to be spam documents by the spam document determination unit 14 to be a spam account. The flow for determining a spam account that posts spam documents is described later with reference to FIG. 4.

Further, in the present embodiment, a document index generation unit 16 may be provided, as shown in FIG. 1. In this configuration, the document index generation unit 16 registers the spam account extraction results in the index so that spam posts are not displayed in search results.

Then, as shown in FIG. 2, in step S15, the document index generation unit 16 may register the spam account extraction results in the index (indexer) so that spam posts are not displayed in search results.

The flow for determining whether a document is spam in the present embodiment is now described with reference to FIG. 3. FIG. 3 is a flowchart showing the flow of spam document determination.

The spam determination of documents in the present embodiment (S13 in FIG. 2) is performed by the spam document determination unit 14, based on the information stored in the document index storage unit 21 and the inappropriate URL storage unit 22, according to the flow described below.

First, as shown in FIG. 3, in step S21, the spam document determination unit 14 determines whether the document contains a URL. If the post contains a URL, the URL expansion unit 13 expands the URL in step S22.

In general, when a post containing a URL is made on a miniblog service, a URL shortening service is often used as well, because of the limit on the number of characters in a post and to simplify the display. A URL shortening service replaces an original URL with a different, shorter URL, and such services are provided by a number of operators. Shortened URLs are implemented by HTTP redirection. For this reason, in the present embodiment, the URL in the body text may be accessed once, the redirect destination may be accessed again for as long as a redirect is returned, and the URL may be updated each time.
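As a non-limiting illustration, the redirect-following loop described above might be sketched as follows. The helper `next_location`, the hop limit `max_hops`, and the loop detection are assumptions added for this sketch and are not specified in the embodiment; a real implementation would issue HTTP requests and read the redirect target from the response.

```python
from typing import Callable, Optional


def expand_url(url: str,
               next_location: Callable[[str], Optional[str]],
               max_hops: int = 10) -> str:
    """Follow redirects starting from `url` and return the final URL.

    `next_location(u)` is a hypothetical helper that returns the
    redirect target of `u`, or None when `u` does not redirect.
    """
    seen = {url}
    for _ in range(max_hops):
        target = next_location(url)
        if target is None or target in seen:  # no redirect, or a redirect loop
            break
        seen.add(target)
        url = target  # update the URL at every hop
    return url


# Offline stand-in for HTTP: a table of redirects (example data only).
redirects = {
    "http://short.example/a": "http://short.example/b",
    "http://short.example/b": "http://landing.example/page",
}
final_url = expand_url("http://short.example/a", redirects.get)
```

Injecting the lookup as a function keeps the loop itself independent of any particular HTTP client.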

Next, in step S23, the spam document determination unit 14 determines whether a single document contains a certain number or more of topic words (for example, five or more). If it does, in step S24 the spam document determination unit 14 determines the document under determination to be a trend-type spam document.
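As a non-limiting illustration, the check of steps S23 and S24 might be sketched as follows. The function name `is_trend_spam`, the substring match, and the sample words are assumptions for this sketch; the default threshold of five follows the example given above.

```python
def is_trend_spam(text, topic_words, min_topic_words=5):
    """Return True if `text` contains at least `min_topic_words`
    distinct topic words."""
    found = {w for w in topic_words if w in text}
    return len(found) >= min_topic_words


# Example data only.
topics = ["typhoon", "election", "world cup", "new phone", "eclipse", "marathon"]
stuffed = "typhoon election world cup new phone eclipse click here"
normal = "Stuck at the station because of the typhoon"
```

Here `stuffed` contains five distinct topic words and is flagged, while `normal` contains only one and is not.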

Then, in step S25, the spam document determination unit 14 determines whether a URL contained in the document, or its redirect destination URL, at least partially matches a URL in a predefined blacklist. If it does, in step S26 the spam document determination unit 14 determines the document under determination to be a URL-type spam document.

The spam document determination unit 14 may also determine the document under determination to be a spam document when the URL, or the redirect destination URL of the URL, contains a predefined character string. The spam document determination unit 14 then records the URL determined to be inappropriate in the inappropriate URL storage unit 22. The URLs recorded by the spam document determination unit 14 may be recorded in the inappropriate URL storage unit 22 as a list. Alternatively, destination URLs such as affiliate links to adult sites or merchandise sites may be registered in the inappropriate URL storage unit 22 in advance.
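As a non-limiting illustration, the URL-type determination (steps S25 and S26, together with the predefined-string check) might be sketched as follows. The name `is_url_spam`, the reading of "at least partial match" as a substring test, and the sample URLs are assumptions for this sketch.

```python
def is_url_spam(urls, blacklist, banned_substrings=()):
    """Return True if any URL partially matches a blacklist entry or
    contains a predefined character string.

    `urls` should hold both the URLs written in the post and their
    redirect destinations.
    """
    for url in urls:
        if any(entry in url for entry in blacklist):
            return True
        if any(s in url for s in banned_substrings):
            return True
    return False


# Example data only.
blacklist = ["ads.example/"]
post_urls = ["http://short.example/a", "http://ads.example/campaign?id=1"]
```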

For each document determined to be spam, the document ID and the ID of the account that made the post are stored in the spam document storage unit 23. A single document may be classified as both trend-type and URL-type.

The flow for determining a spam account in the present embodiment is now described with reference to FIG. 4. FIG. 4 is a flowchart showing the flow of spam account determination.

The spam account determination in the present embodiment (S14 in FIG. 2) is performed by the spam account determination unit 15, based on the information stored in the spam document storage unit 23, according to the flow described below.

First, as shown in FIG. 4, in step S31, the spam account determination unit 15 extracts the most recent d days of the information stored in the spam document storage unit 23, and in step S32, counts the number of unique documents for each account ID. The period d may be, for example, seven days.

In the first embodiment, the spam account determination unit 15 may include document counting means (not shown) for counting the number of unique documents for each account ID. In that case, the document counting means may count the number of unique documents in step S32.

Next, in step S33, the spam account determination unit 15 makes a determination for each account based on the counted number of posts. In step S34, it determines an account with a certain number or more of trend-type or URL-type posts to be a spam account, and stores the determination result in the spam user storage unit 24.

Then, in step S35, the spam account determination unit 15 performs the spam account determination.

The series of processes described above may be executed periodically so that the information on spam accounts (spam users) is always kept up to date. An account that has once been determined to be a spam account may be determined not to be a spam account if it makes no spam posts for d days. Alternatively, such an account may be removed from the list stored in the spam user storage unit 24 if it makes no spam posts for d days.
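As a non-limiting illustration, the windowed counting of steps S31 to S34 might be sketched as follows. The record layout (including a posting time), the function name `find_spam_accounts`, and the default threshold `min_posts=3` are assumptions for this sketch; the embodiment only states "a certain number or more". Trend-type and URL-type documents are counted together here.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_spam_accounts(spam_records, now, days=7, min_posts=3):
    """Return the account IDs with at least `min_posts` unique spam
    documents within the last `days` days.

    `spam_records` is an iterable of (doc_id, account_id, posted_at)
    tuples, a hypothetical layout for the spam document store.
    """
    cutoff = now - timedelta(days=days)
    docs_by_account = defaultdict(set)
    for doc_id, account_id, posted_at in spam_records:
        if posted_at >= cutoff:
            docs_by_account[account_id].add(doc_id)  # a set keeps documents unique
    return {a for a, docs in docs_by_account.items() if len(docs) >= min_posts}
```

Because the result is recomputed from the most recent window on every run, an account that stops posting spam for `days` days drops out of the result without any separate release step.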

With this configuration, the topic word collection unit 11 collects the topic words W that have become a specific topic, and the search unit 12 searches the posted documents for documents containing at least one of the words collected by the topic word collection unit 11. The spam document determination unit 14 can then determine whether a posted document is a spam document. Further, when a posted document is a spam document, the spam account determination unit 15 can determine the account that posted the document to be a spam account.

Accordingly, the documents to be searched can be narrowed down based on the topic words W, and whether each of the narrowed-down documents is a spam document can be determined. A document that contains a certain number or more of different topic words can be determined to be a spam document. Further, an account that posts documents determined to be spam documents a certain number of times or more can be determined to be a spam account.

Spam accounts can therefore be extracted efficiently and exhaustively.

In the first embodiment described above, the document index generation unit 16 is not essential for extracting spam accounts efficiently and exhaustively. Providing the document index generation unit 16 in the first embodiment makes it possible to process posts from accounts determined to be spam accounts so that they are not displayed in search results. Further, in the first embodiment described above, the spam account determination is not limited to the method described above.

(Second Embodiment)
FIG. 5 is a functional block diagram of the spam account extraction device according to the second embodiment. The spam account extraction device 2 according to the second embodiment includes an inappropriate URL determination unit 17 in addition to the configuration of the spam account extraction device 1 according to the first embodiment.

FIG. 6 shows a functional block diagram of the inappropriate URL determination unit 17. The inappropriate URL determination unit 17 includes a determination unit 100 and a storage unit 200. The determination unit 100 includes a URL expansion unit 101, a Web page acquisition unit 102, and a Web page analysis unit 103. The storage unit 200 includes an inappropriate word storage unit 201, a reference source topic word storage unit 202, and a determination result storage unit 203.

The URL expansion unit 101 expands shortened URLs. The Web page acquisition unit 102 acquires the body text of the destination Web page. The Web page analysis unit 103 checks whether the reference source topic words stored in the reference source topic word storage unit 202 and the inappropriate words stored in the inappropriate word storage unit 201 appear in the acquired Web page body text.

As shown in FIG. 17, the inappropriate word storage unit 201 stores inappropriate words registered in advance. The inappropriate word list 201a may be updated successively, or words may be added as needed through learning. As shown in FIG. 18, the reference source topic word storage unit 202 stores the topic words W contained in the source document in association with the posting date and time and the URL. As shown in FIG. 19, the determination result storage unit 203 stores the URLs determined to be inappropriate URLs.

The inappropriate URL determination unit 17 determines a URL that satisfies any one of the following conditions to be an inappropriate URL and stores it in the determination result storage unit 203. Condition 1: the destination contains an inappropriate word. Condition 2: the destination does not contain the reference source topic word. Condition 3: the URL is referenced from a certain number or more of topic words W. Condition 4: the destination contains a plurality of unrelated topic words. Here, "a certain number or more of topic words W" may be, for example, five or more.

FIG. 7 shows the flow for determining an inappropriate URL. First, in step S41, the URL expansion unit 101 expands the shortened URL. Next, in step S42, the determination unit 100 checks whether the expanded URL has already been registered in the determination result storage unit 203.

If the URL has not been registered, in step S43 the determination unit 100 stores the pair of the URL and the reference source topic word in the reference source topic word storage unit 202, and the Web page acquisition unit 102 then acquires the body text of the destination Web page.

In step S44, the determination unit 100 determines whether the reference source topic words stored in the reference source topic word storage unit 202 and the inappropriate words stored in the inappropriate word storage unit 201 appear in the body text of the Web page acquired and analyzed by the Web page analysis unit 103.

Next, in step S47, the inappropriate URL determination unit 17 determines a URL that satisfies any one of the following four conditions to be an inappropriate URL and stores it in the determination result storage unit 203. Condition 1 is that the destination contains an inappropriate word (step S44). Condition 2 is that the destination does not contain the reference source topic word (step S45). Condition 3 is that the URL is referenced from a certain number or more (for example, five or more) of topic words (step S46). Condition 4 is that the destination contains a plurality of unrelated topic words (step S48).

In the present embodiment, the determination in step S45 treats a destination that does not contain the reference source topic word as having little relation to that topic word, and determines the URL to be inappropriate even if the destination contains no inappropriate word. This makes it possible to detect a spam user even when the posted document contains a URL that has not been predefined.

Also, in the present embodiment, the determination in step S46 determines a URL referenced from a certain number or more of reference sources to be inappropriate, even if the destination contains no inappropriate word and does contain the reference source topic word. This makes it possible to detect a spam user even when the posted document contains a URL that has not been predefined.

Further, in the present embodiment, the determination in step S48 determines a URL to be inappropriate when the destination contains a plurality of unrelated topic words, even if the destination contains no inappropriate word, does contain the reference source topic word, and is not referenced from a certain number or more of reference sources. This makes it possible to detect a spam user even when the posted document contains a URL that has not been predefined.
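As a non-limiting illustration, conditions 1 to 4 might be applied in sequence as sketched below. The function name, the substring matching, and in particular the `topic_group` mapping used to decide whether topic words are "unrelated" are assumptions for this sketch; the embodiment does not specify how relatedness between topic words is measured. Condition 2 is read here as "the destination contains none of the reference source topic words".

```python
def is_inappropriate_url(page_text, source_topic_words, inappropriate_words,
                         topic_group, max_sources=5):
    """Apply conditions 1 to 4 to one expanded URL.

    page_text           -- body text of the destination page
    source_topic_words  -- topic words of the posts that linked to the URL
    inappropriate_words -- pre-registered inappropriate words
    topic_group         -- hypothetical map: topic word -> topic label
    """
    # Condition 1 (S44): the destination contains an inappropriate word.
    if any(w in page_text for w in inappropriate_words):
        return True
    # Condition 2 (S45): the destination contains none of the source topic words.
    if not any(w in page_text for w in source_topic_words):
        return True
    # Condition 3 (S46): the URL is referenced from too many distinct topic words.
    if len(set(source_topic_words)) >= max_sources:
        return True
    # Condition 4 (S48): the destination mixes topic words from unrelated topics.
    groups = {g for w, g in topic_group.items() if w in page_text}
    return len(groups) >= 2
```

The checks run in the order of the conditions, so a page that passes conditions 1 to 3 can still be flagged by condition 4.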

Further, the document index generation unit 16 refers to the determination result storage unit 203 in addition to the inappropriate URL storage unit 22, and decides whether each document is to be displayed or hidden when generating the document index.

With this configuration, a document that contains a link to an inappropriate site can be determined to be a spam document. The inappropriate URL determination unit 17 can determine in advance, from the URL, that a site is inappropriate, for example an affiliate site or an adult site. When the document reached via a URL contained in the post, or via the redirect destination of that URL, contains an inappropriate word, the inappropriate URL determination unit 17 can determine the source URL to be an inappropriate URL. When the document reached via the URL or its redirect destination is linked from documents containing a plurality of unrelated topic words, the inappropriate URL determination unit 17 can determine the source URL to be an inappropriate URL. Further, the inappropriate URL determination unit 17 can also determine the URL to be an inappropriate URL when the document reached via the URL or its redirect destination itself contains a plurality of unrelated topic words.

Because a document containing an inappropriate URL can thus be determined to be a spam document, spam accounts can be extracted more efficiently.

(Third Embodiment)
FIG. 8 is a functional block diagram of the spam account extraction device according to the third embodiment. The spam account extraction device 3 according to the third embodiment includes a friend list acquisition unit 18, a spam account extraction unit 181, and an inter-account similarity calculation unit 19 in addition to the configuration of the spam account extraction device 2 according to the second embodiment.

In general, friend relationships in a miniblog service are of two kinds: a bidirectional relationship, established when one party approves a request from the other, and a unidirectional relationship, in which one user unilaterally makes another user's posts viewable (hereinafter referred to as "following"). For ease of explanation, the present embodiment treats users in a follow relationship as users in a friend relationship.

The friend list acquisition unit 18 acquires the list of friends of each user account stored in the spam user storage unit 24. The spam account extraction unit 181 extracts, from the list of other accounts related to a first account, accounts with a high similarity as spam accounts. The inter-account similarity calculation unit 19 calculates the similarity of each spam account using the friend relationships. The storage unit 20 stores accounts whose similarity exceeds a certain threshold in the spam user storage unit 24 as spam users.

FIG. 9 shows the flow for extracting spam accounts in the spam account extraction device 3; that is, it is a flowchart showing the flow of spam account determination based on friend relationships.

As shown in FIG. 9, in step S51, the friend list acquisition unit 18 acquires the friend lists of the accounts determined to be spam accounts, which are stored in the spam user storage unit 24. Then, in step S52, the spam account extraction unit 181 extracts, from the lists of user accounts acquired by the friend list acquisition unit 18, the accounts that appear frequently, that is, the accounts that have many spam accounts as friends.
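As a non-limiting illustration, steps S51 and S52 might be sketched as follows. The `friends_of` mapping, the threshold `min_count`, and the exclusion of accounts already known to be spam are assumptions for this sketch.

```python
from collections import Counter


def candidate_accounts(friends_of, spam_accounts, min_count=2):
    """Return the accounts that appear in the friend lists of at least
    `min_count` known spam accounts.

    `friends_of` maps an account ID to the list of its friends.
    Accounts already known to be spam are left out of the result.
    """
    counts = Counter()
    for spam in spam_accounts:
        counts.update(set(friends_of.get(spam, ())))  # count each friend once per list
    return {acct for acct, n in counts.items()
            if n >= min_count and acct not in spam_accounts}
```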

Next, in step S53, the friend list acquisition unit 18 acquires the friend lists of the user accounts extracted by the spam account extraction unit 181 in step S52.

Further, in step S54, the inter-account similarity calculation unit 19 calculates the similarity between each account extracted by the spam account extraction unit 181 in step S52 and the accounts of that account's friends. The storage unit 20 stores accounts whose similarity exceeds a certain threshold in the spam user storage unit 24 as spam accounts. For example, the Jaccard coefficient may be used as the similarity index. With the Jaccard coefficient, the similarity between two accounts X and Y is evaluated by the following equation (1).

$$\mathrm{sim}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \tag{1}$$

Here, |X∩Y| is the number of accounts that X and Y have in common as friends, and |X∪Y| is the total number of accounts that are friends of at least one of X and Y.

Alternatively, another index of similarity, such as the cosine distance or the Simpson coefficient, may be used as the similarity index.
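As a non-limiting illustration, two of the similarity indices named above might be computed from friend lists as follows. The function names and the convention of returning 0.0 for empty friend lists are assumptions for this sketch; the Simpson coefficient is taken in its usual form, the size of the intersection divided by the size of the smaller set.

```python
def jaccard(x_friends, y_friends):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of two friend lists."""
    x, y = set(x_friends), set(y_friends)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def simpson(x_friends, y_friends):
    """Simpson (overlap) coefficient |X ∩ Y| / min(|X|, |Y|)."""
    x, y = set(x_friends), set(y_friends)
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0
```

For friend lists {a, b, c} and {b, c, d}, the intersection has two members and the union has four, so the Jaccard coefficient is 2/4 = 0.5.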

With this configuration, when a first account has been determined to be a spam account, other spam accounts can be extracted efficiently. That is, spam accounts can be identified from their friend relationships, on the miniblog service, with user accounts already determined to be spam. Spam accounts can therefore be extracted efficiently and exhaustively.

In the embodiment described above, the friend relationship was described as the unidirectional relationship of following another user. However, the friend relationship in the present embodiment is not limited to following and may be bidirectional. Also, in the embodiments described above, a single user may own and manage one or more accounts.

(Fourth Embodiment)
FIG. 10 is a functional block diagram of the spam account extraction device 4 according to the fourth embodiment. The spam account extraction device 4 according to the fourth embodiment includes a document browsing control unit 30 in addition to the configuration of the spam account extraction device 3 according to the third embodiment.

Based on the document index generated by the document index generation unit 16, the document browsing control unit 30 excludes from browsing the documents determined to have been posted from spam accounts.

FIG. 11 shows the flow by which the document browsing control unit 30 determines whether a document can be browsed.

First, as shown in FIG. 11, in step S61, the document browsing control unit 30 acquires the document index.

Next, in step S62, the document browsing control unit 30 acquires the search results for spam accounts from the spam user storage unit 24. In step S63, the document browsing control unit 30 determines whether each document in the search results was posted from a spam account. In step S65, if the document is determined to have been posted from a spam account, the document browsing control unit 30 determines that it cannot be browsed. In step S66, if the document is determined not to have been posted from a spam account, the document browsing control unit 30 determines that it can be browsed.
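As a non-limiting illustration, the browsing decision of steps S63 to S66 might be sketched as a filter over the search results. The function name and the (document ID, account ID) pair layout are assumptions for this sketch.

```python
def visible_results(search_results, spam_accounts):
    """Keep only the search results not posted from a spam account.

    `search_results` is a list of (doc_id, account_id) pairs and
    `spam_accounts` is the set of account IDs judged to be spam.
    """
    return [(doc_id, account_id)
            for doc_id, account_id in search_results
            if account_id not in spam_accounts]


# Example data only.
results = [("d1", "alice"), ("d2", "spammer"), ("d3", "bob")]
shown = visible_results(results, {"spammer"})
```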

With this configuration, based on the document index that the document index generation unit 16 creates from the spam account determination results, the document browsing control unit 30 can make only the posts from accounts that are not spam accounts available for browsing. Browsing means (not shown) can therefore, based on the document index, prevent posts from spam accounts from appearing in the document search results displayed by the display unit 1H.

Each of the embodiments described above is an example of the spam account extraction device and the spam account extraction method according to the present invention. The invention is not limited to the devices and methods of these embodiments, which may be modified or applied to other uses.

1, 2, 3, 4 ... spam account extraction device, 10A, 10B, 10C, 10D ... determination unit, 11 ... topic word collection unit, 12 ... search unit, 13 ... URL expansion unit, 14 ... spam document determination unit, 15 ... spam account determination unit, 16 ... document index generation unit, 17 ... inappropriate URL determination unit, 18 ... friend list acquisition unit, 19 ... inter-account similarity calculation unit, 20 ... storage unit, 21 ... document index storage unit, 21a ... document list, 21b ... inverted index, 22 ... inappropriate URL storage unit, 23 ... spam document storage unit, 24 ... spam user storage unit, 30 ... document browsing control unit, 100 ... determination unit, 101 ... URL expansion unit, 102 ... Web page acquisition unit, 103 ... Web page analysis unit, 181 ... spam account extraction unit, 200 ... storage unit, 201 ... inappropriate word storage unit, 202 ... reference source topic word storage unit, 203 ... determination result storage unit, W ... topic word.

Claims

A spam account extraction device comprising:
a topic word collection unit that, in a service for posting documents to a server on a network, collects, as topic words that have become a specific topic, words whose appearance frequency is rising among the words appearing in stream data composed of the documents;
a search unit that searches the posted documents for documents containing at least one of the topic words;
a spam document determination unit that determines a document retrieved by the search unit to be a spam document if the document contains a certain number or more of the topic words; and
a spam account determination unit that determines, based on the frequency with which the spam documents are posted, an account that posts the spam documents to be a spam account.

The spam account extraction device according to claim 1, further comprising an inappropriate URL determination unit that determines a URL contained in the document to be an inappropriate URL when the URL is a link to an inappropriate site,
wherein the spam document determination unit determines the document to be the spam document when the URL is determined to be the inappropriate URL.

The spam account extraction device according to claim 2, wherein the inappropriate URL determination unit determines the URL to be the inappropriate URL when the URL, or a redirect destination URL of the URL, contains a predefined character string.

The spam account extraction device according to claim 2 or 3, wherein the inappropriate URL determination unit determines the URL to be the inappropriate URL when the document reached via the URL, or via a redirect destination URL of the URL, contains a predefined word.

The spam account extraction device according to any one of claims 2 to 4, wherein the inappropriate URL determination unit determines the URL to be the inappropriate URL when the document reached via the URL, or via a redirect destination URL of the URL, is linked from documents containing a plurality of unrelated topic words.

The spam account extraction device according to any one of claims 2 to 5, wherein the inappropriate URL determination unit determines the URL to be the inappropriate URL when the document reached via the URL, or via a redirect destination URL of the URL, does not contain the topic word used in the document.

The spam account extraction device according to any one of claims 2 to 6, wherein the inappropriate URL determination unit determines the URL to be the inappropriate URL when the document reached via the URL, or via a redirect destination URL of the URL, contains a plurality of unrelated topic words.

A friend list acquisition unit that acquires a list of accounts that can view documents posted by the first account;
An inter-account similarity calculation unit that calculates the degree of similarity between the friend list of the first account acquired by the friend list acquisition unit and the list of accounts that can view documents posted by the account determined to be a spam account by the spam account determination unit;
Further comprising
The spam account determination unit determines the spam account based on the similarity;
The spam account extraction device according to any one of claims 1 to 7.

Further comprising a spam account extraction unit that extracts, from the friend list of the first account acquired by the friend list acquisition unit, accounts whose similarity exceeds a certain threshold based on the similarity between accounts calculated by the inter-account similarity calculation unit,
The spam account extraction device according to claim 8.
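The friend-list comparison and threshold extraction can be sketched as follows. The claims do not state which similarity measure is used, so the Jaccard coefficient here is an assumption, as are the function names, the `threshold` value, and the dictionary input format.

```python
def friend_list_similarity(friends_a, friends_b):
    """Jaccard similarity of two friend lists (the measure is an assumption of this sketch)."""
    a, b = set(friends_a), set(friends_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def extract_similar_accounts(friend_lists, spam_account, threshold=0.5):
    """Return accounts whose friend list similarity to a known spam account exceeds threshold.

    friend_lists maps account -> set of accounts that can view its posts (hypothetical format).
    """
    spam_friends = friend_lists[spam_account]
    return {
        account
        for account, friends in friend_lists.items()
        if account != spam_account
        and friend_list_similarity(friends, spam_friends) > threshold
    }

friend_lists = {
    "spam1": {"u1", "u2", "u3", "u4"},
    "x": {"u1", "u2", "u3"},
    "y": {"u9"},
}
similar = extract_similar_accounts(friend_lists, "spam1")
```

Here account `x` shares three of four distinct viewers with `spam1` (similarity 0.75) and is extracted, while `y` shares none and is not.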

A document index generation unit that generates an index of the document based on a determination result determined by the spam account determination unit;
A document browsing control unit that controls browsing of documents posted from the spam account;
The spam account extraction device according to any one of claims 1 to 8, further comprising:

A spam account extraction method executed by a spam account extraction device,
A topic word collection step of collecting, in a service for posting a document to a server on a network, as a topic word that is a specific topic, a word having an increased appearance frequency among words appearing in stream data composed of the documents;
A search step of searching the posted documents for a document including at least one of the topic words;
A spam document determination step of determining, when the document searched by the search step includes a certain number or more of the topic words, the document as a spam document;
A spam account determination step of determining an account that posts the spam document as a spam account based on the frequency with which the spam document is posted;
A spam account extraction method comprising the above steps.
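The topic word collection step, which picks out words whose appearance frequency has increased, can be sketched as below. This is an illustration under stated assumptions: the comparison of two time windows, the `min_ratio` and `min_count` parameters, the add-one smoothing, and whitespace tokenization are all choices made for the example. Japanese text would need a morphological analyzer instead of `split()`.

```python
from collections import Counter

def collect_topic_words(previous_docs, current_docs, min_ratio=2.0, min_count=3):
    """Return words whose frequency in current_docs rose sharply versus previous_docs.

    A word qualifies if it appears at least min_count times in the current window
    and at least min_ratio times its smoothed count in the previous window.
    """
    prev = Counter(w for doc in previous_docs for w in doc.split())
    curr = Counter(w for doc in current_docs for w in doc.split())
    return {
        word
        for word, n in curr.items()
        if n >= min_count and n >= min_ratio * (prev[word] + 1)
    }

previous_docs = ["rain today", "rain again"]
current_docs = ["typhoon coming", "typhoon warning", "typhoon rain", "typhoon news"]
topics = collect_topic_words(previous_docs, current_docs)
```

In this example only `typhoon` rises from zero to four occurrences, so it is the single topic word collected. The resulting set would then feed the search step and the spam document determination step.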