JP2010066891A - Document classification method and system

Info

Publication number: JP2010066891A
Application number: JP2008231131A
Authority: JP
Inventors: Toshio Ikeda; 利夫池田
Original assignee: Kansai Electric Power Co Inc
Current assignee: Kansai Electric Power Co Inc
Priority date: 2008-09-09
Filing date: 2008-09-09
Publication date: 2010-03-25

Abstract

PROBLEM TO BE SOLVED: To accurately classify a plurality of documents extracted from a blog site or the like according to which document each was posted in response to. SOLUTION: A processing device 10 includes: a document group extraction unit 11 that extracts classification target documents from a blog server 22; a writer specifying unit 12 that specifies the writer of each classification target document by referring to a member server 23; a document analysis unit 13 that analyzes each classification target document and refers to a nickname database 24 to determine whether a nickname is present in it; and a determination unit 14 that performs classification determination processing on each classification target document according to predetermined types, based on whether the writer is the diary contributor or someone else and on whether a nickname is present in the document. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a method and a system for appropriately classifying documents posted by a large number of writers to a site on a communication network.

In recent years, the number of users of Internet blog sites, social networking services (SNS), and the like has grown explosively. Many users of blogs and similar services can be said to want either to win the empathy of others by publishing diary documents, or to browse diary documents and meet others with whom they can empathize. Through the posting of diary documents containing experiences, impressions, and opinions on everyday problems and current affairs, and of comment documents on those diary documents, blogs can now be said to play a socially beneficial role by giving people empathy, reassurance, and clues for solving problems.

To find others with whom one can empathize on a blog site or the like, some kind of search is necessary. Conventional search methods include random search, category search, and attribute/keyword search. Random search means searching only to the extent of specifying a posting date or a poster and then browsing Web pages manually. Category search uses theme classifications such as "child care" and "volunteering". Attribute/keyword search uses the poster's age or gender, or keywords. Patent Literature 1 is an example of a keyword search method for blog sites.
Patent Literature 1: JP 2007-11651 A

Because of the surge in bloggers, blog sites and SNSs tend to grow very large, and some sites already have millions of members. On such huge sites, it is difficult for a user relying on conventional search methods to find people he or she empathizes with efficiently and accurately. Random search and category search are too coarse to readily turn up such people. Keyword search, on the other hand, does not return accurate results unless appropriate keywords are chosen and complex search settings are made, and even with complex settings the number of hits can be enormous.

When users struggle to find people they can empathize with, comment posting on the diary documents of the blog site may stagnate and the activity of the site may decline. This is undesirable for the site operator.

One conceivable way to identify people who empathize with each other is as follows. Documents written by people who already have a record of exchanging documents on the blog site, that is, documents actually written by people who do empathize with each other, are extracted and analyzed to derive a determination formula or the like that scores the degree of empathy numerically. Documents written by people who have not yet met are then applied to this formula to determine automatically the degree of empathy between them.

With such an approach, it is essential first to classify the extracted documents accurately. Suppose, for example, that a document group extracted as a record of exchanges includes a diary document posted to the blog site by a specific person and a plurality of comment documents created in relation to that diary document and posted to the blog site in time series. Accurately classifying whether each document in the group is a comment document posted by another writer in direct response to the diary document, a comment document posted by the specific person in response to such a comment document, or a comment document posted by yet another writer in response to such a comment document is important for ultimately improving the accuracy of the determination formula.

The present invention has been made in view of the above problems, and its object is to provide a method and a system that can accurately classify a plurality of documents extracted from a blog site or the like according to which document each was posted in response to.

A document classification method according to one aspect of the present invention classifies documents posted by a large number of writers to a site on a communication network. It applies where there is a document group including a first document created by a first writer and posted to the site, and a plurality of documents created in relation to the first document and posted to the site in time series. The method classifies at least: a second document, created in response to the first document by a second writer different from the first writer and posted to the site; a third document, created in response to the second document by the first writer and posted to the site; and a fourth document, created in response to the second document by a third writer different from the first and second writers and posted to the site. To this end, the method includes, for each classification target document in the document group: a first step of identifying which writer posted the classification target document; a second step of, when the writer is someone other than the first writer, determining whether a personal identification name defined in advance in association with each writer is present in the classification target document, determining the document to be the second document if none is present, and determining it to be the fourth document if one is present and the document group contains a document posted earlier in the time series by the writer to whom that personal identification name belongs; and a third step of, when the writer is the first writer, determining whether such a personal identification name is present in the classification target document, and determining the document to be the third document either if none is present or if one is present and the document group contains a document posted earlier in the time series by the writer to whom that personal identification name belongs (claim 1).

With this configuration, each classification target document in a document group created in relation to the first document and posted sequentially to the site can be accurately classified as one of the second to fourth documents, based on whether the document contains a personal identification name and on the chronological relationship between documents. In other words, it is possible to grasp accurately which writer posted the document and in response to which document on the site.
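The second and third steps of claim 1 can be illustrated with a minimal Python sketch. The `Comment` type, its field names, and the label strings are hypothetical and not part of the patent. The sketch also assumes that only earlier comments (not the first document itself) count as "documents posted earlier", and it returns `None` for combinations the claim does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Comment:
    author: str                      # writer identified in the first step
    mentioned: Optional[str] = None  # writer whose personal identification name appears, if any


def classify_thread(first_writer: str, comments: List[Comment]) -> List[Optional[str]]:
    """Label comments (given in posting order) as 'second', 'third', 'fourth', or None."""
    labels: List[Optional[str]] = []
    earlier_authors = set()  # writers who already posted a comment in this thread
    for c in comments:
        by_first = c.author == first_writer
        if c.mentioned is None:
            # No personal identification name: third document if written by
            # the first writer, otherwise second document.
            label = "third" if by_first else "second"
        elif c.mentioned in earlier_authors:
            # Name present and the named writer posted earlier in the thread.
            label = "third" if by_first else "fourth"
        else:
            # Case not covered by the claim text; left undetermined here.
            label = None
        labels.append(label)
        earlier_authors.add(c.author)
    return labels
```

For a thread in which B comments, A replies, C addresses B by name, and A addresses C by name, this sketch yields second, third, fourth, third in that order.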

In the third step, when no personal identification name is present, the classification target document can be determined to be a third document responding to the second document immediately preceding it in the time series (claim 2).

With this configuration, the third document can be further subdivided, according to whether a personal identification name is present, into one that responds to the immediately preceding second document and one that responds to a second document posted earlier than that.
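One way to picture this subdivision is a lookup of the comment that a third document responds to. The following Python sketch is illustrative only: the function name and the `(author, mentioned)` tuple layout are hypothetical, and it assumes that any earlier comment by someone other than the first writer is a candidate target and that the nearest matching comment is the intended one.

```python
from typing import List, Optional, Tuple

# Each comment is (author, mentioned), where mentioned is the writer whose
# personal identification name appears in the text, or None.
CommentTuple = Tuple[str, Optional[str]]


def reply_target(first_writer: str, comments: List[CommentTuple], i: int) -> Optional[int]:
    """Return the index of the comment that the first writer's comment i responds to."""
    author, mentioned = comments[i]
    if author != first_writer:
        raise ValueError("comment i was not posted by the first writer")
    # Walk backwards through earlier comments, nearest first.
    for j in range(i - 1, -1, -1):
        prev_author, _ = comments[j]
        if prev_author == first_writer:
            continue  # the first writer's own comments are not targets
        if mentioned is None or prev_author == mentioned:
            # No name: the immediately preceding candidate.
            # Name present: the nearest earlier comment by the named writer.
            return j
    return None
```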

The personal identification name is preferably a nickname assigned to each writer (claim 3). Because nicknames are commonly used to identify the other party in documents on the site, this configuration has the advantage of making documents easy to distinguish.

In the above configuration, the site on the communication network is preferably a specific website on the Internet (claim 4). In particular, it is preferable that the website is a blog site, the first document is a diary document created by the first writer, the second document is a first comment document on the diary document created by a writer other than the first writer, the third document is a second comment document on the first comment document created by the first writer, and the fourth document is a third comment document on the first comment document created by a writer other than the first writer (claim 5).

With this configuration, the first comment document on a diary document posted on the blog site, and the second and third comment documents on that first comment document, are the classification target documents. This avoids errors such as treating a second comment document written by the diary contributor as a first or third comment document, or treating a third comment document as a first comment document.

A document classification system according to another aspect of the present invention classifies documents posted by a large number of writers to a site on a communication network. The system includes a document database that stores documents posted to the site, a writer database that stores the names of writers expected to post to the site, a name database that stores personal identification names defined in advance in association with each of the writers, and classification processing means that classifies the documents posted to the site. The classification processing means includes: a document group extraction unit that extracts, from the document database, a document group including a first document created by a first writer and posted to the site and a plurality of documents created in relation to the first document and posted to the site in time series; a writer specifying unit that, for each classification target document in the document group, refers to the writer database and specifies the writer who posted the document; a document analysis unit that analyzes the classification target document and refers to the name database to determine whether a personal identification name is present in the document; and a determination unit that performs classification determination processing on the classification target document according to predetermined types, based on whether the writer is the first writer or someone else and on whether a personal identification name is present in the document (claim 6).

With this configuration, for each classification target document in a document group created in relation to the first document and posted sequentially to the site, the writer database and the name database are consulted to identify the writer of the document and whether a personal identification name is present, and the determination unit classifies the document according to the types based on the result.

In this case, the determination unit preferably classifies at least: a second document, created in response to the first document by a second writer different from the first writer and posted to the site; a third document, created in response to the second document by the first writer and posted to the site; and a fourth document, created in response to the second document by a third writer different from the first and second writers and posted to the site. When the writer of the classification target document is someone other than the first writer, the determination unit determines the document to be the second document if no personal identification name is present in it, and to be the fourth document if a personal identification name is present and the document group contains a document posted earlier in the time series by the writer to whom that name belongs. When the writer of the classification target document is the first writer, the determination unit determines the document to be the third document either if no personal identification name is present in it or if one is present and the document group contains a document posted earlier in the time series by the writer to whom that name belongs (claim 7).

With this configuration, the determination unit can accurately classify a classification target document as one of the second to fourth documents based on the presence or absence of a personal identification name in the document and on the chronological relationship between documents. In other words, the determination unit can determine accurately which writer posted the document and in response to which document on the site.

According to the present invention, for each classification target document in a document group created in relation to the first document and posted sequentially to the site, it is possible to grasp accurately which writer posted it and in response to which document on the site. Consequently, when extracting, for example, the documents of people who empathize with each other on the site, the extraction can be performed accurately. This in turn improves the accuracy of a determination formula that scores the degree of empathy when such a formula is derived by analyzing the extracted documents.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a configuration diagram schematically showing the hardware configuration of a network system S (document classification system) to which the document classification method according to the present invention is applied. In the network system S, a processing device 10, a blog system 20 for running a blog site 21 on the Internet, and terminal devices 30, which include member terminal devices 31, 32, 33, 34, 35, ... owned by members A, B, C, D, E, ... of the blog site 21, are connected via the Internet IN so that they can exchange data.

The blog site 21 (the site on the communication network) is a specific website on the Internet on which diary documents, comment documents on diary documents, and the like are posted.

Members A to E have disclosed their attribute information to the operator of the blog site 21 and are registered as members of the blog site 21. Through their respective member terminal devices 31 to 35, members A to E can post a diary document to the blog site 21, post a comment document (first comment document) on that diary document, post a further comment document (second comment document) on that comment document, and browse these documents. The terminal devices 30 are typically Internet-connected personal computers, mobile phones, personal digital assistants, or the like.

The blog system 20 includes a blog server 22 (document database), a member server 23 (writer database), and a nickname database 24 (name database).

The blog server 22 is the server that runs the blog site 21. It stores the document data of each document posted to the blog site 21 in association with data such as the posting date and time and the poster. The member server 23 stores attribute information (name, member number, address, IP address of the terminal device, age, gender, interests, and so on) of the members registered with the blog site 21, that is, the writers expected to post to the blog site 21.

The nickname database 24 stores the nickname (personal identification name) defined in association with each member. A nickname is a name that a user registers for himself or herself when joining the blog site 21, and it is intended to be used to identify that user in documents posted to the blog site 21.

The processing device 10 (classification processing means) classifies diary documents posted to the blog site 21, comment documents posted in response to those diary documents, and comment documents posted in response to those comment documents, according to the members who posted them. Assume, for example, a document group that includes a diary document (first document) created by member A (first writer) and posted to the blog site 21, and a plurality of comment documents created in relation to this diary document by members B, C, D, E, ... or by member A and posted sequentially to the blog site 21.

In this case, the processing device 10 treats the plurality of comment documents as classification target documents and classifies them into at least the following Type 1, Type 2, and Type 3.
[Type 1] A first comment document (second document): created in response to the diary document by a member other than member A, the diary writer, for example member B (second writer), and posted to the blog site 21;
[Type 2] A second comment document (third document): created in response to a first comment document by member A, the diary writer, and posted to the blog site 21;
[Type 3] A third comment document (fourth document): created in response to a first comment document by a member other than members A and B, for example member C (third writer), and posted to the blog site 21.

FIG. 2 is a functional block diagram showing the functional configuration of the processing device 10. The processing device 10 is, for example, a large computer equipped with a CPU (central processing unit) that performs the classification processing described above, and it is connected to the blog server 22, the member server 23, and the nickname database 24 so that it can exchange data with them. By executing software programmed to perform the classification processing, the CPU operates so as to provide the functional units shown in FIG. 2. Functionally, the processing device 10 includes a document group extraction unit 11, a writer specifying unit 12, a document analysis unit 13, a determination unit 14, and an analysis processing unit 15.

The document group extraction unit 11 extracts, from the document data stored in the blog server 22, a diary document and the group of comment documents related to it (the thread for the diary document). Data on the posting date and time of each document is attached during extraction so that the chronological relationship between documents (which was posted earlier and which later) can be determined. For example, to find members B, C, D, E, ... who have a high degree of empathy with member A on the blog site 21, the document group extraction unit 11 extracts from the blog server 22 the document group of the thread that starts from a diary document posted by member A. The extracted document group constitutes the classification target documents.
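The extraction described here amounts to selecting the documents of one thread and ordering them by posting time. A minimal Python sketch follows. The `Document` record and its field names are hypothetical, since the storage schema of the blog server 22 is not described.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Document:
    doc_id: str
    thread_id: str       # id of the diary document that starts the thread (assumed field)
    author: str
    posted_at: datetime  # posting date and time attached at extraction


def extract_thread(documents: List[Document], diary_id: str) -> List[Document]:
    """Return the diary document and its comments, earliest posting first."""
    thread = [d for d in documents if d.thread_id == diary_id]
    # Sorting by timestamp makes the earlier/later relationship explicit.
    return sorted(thread, key=lambda d: d.posted_at)
```

Keeping the timestamp on each record lets later steps ask whether a given writer posted before a given comment.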

For each classification target document, the writer specifying unit 12 refers to the member server 23 and specifies the member who posted it. For example, the writer specifying unit 12 performs this identification based on the address and password used when the document was posted to the blog site 21.

The document analysis unit 13 analyzes each classification target document and determines whether a nickname is present in it. Specifically, the document analysis unit 13 performs normalization processing, document structure analysis processing, and the like on each classification target document. Normalization processing removes characters, symbols, and the like that are unnecessary for analysis from the classification target document and unifies full-width and half-width characters, so that document structure analysis can be performed properly. Document structure analysis processing is applied to each normalized classification target document and includes, for example, morphological analysis, which splits the document into words, and syntactic analysis, which identifies dependencies between words (such as the relationship between a noun and a verb). For this processing, the document analysis unit 13 uses a thesaurus (synonym dictionary) that absorbs synonyms and spelling variations.
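The normalization step can be sketched in Python with the standard library. Using Unicode NFKC normalization to unify full-width and half-width characters is an assumption here, as is treating control characters and redundant whitespace as the "unnecessary" characters; the patent does not say which characters are removed. Morphological and syntactic analysis need a dedicated analyzer and are not shown.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unify character widths and strip characters assumed to be unneeded."""
    # NFKC folds full-width alphanumerics to half-width and half-width
    # katakana to full-width, giving one consistent form.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (including newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    # Drop remaining control and format characters (Unicode categories C*).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    return text.strip()
```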

To determine whether a nickname is present, the document analysis unit 13 refers to the nickname database 24. FIG. 3 is a table schematically showing the data structure of the nickname database 24. In the example shown, the nicknames "a", "b", "c", "d", "e", ... are stored in association with the member names (member numbers) A, B, C, D, E, ..., respectively. These nicknames are, for example, the handle names that members A, B, C, D, E, ... declared, when registering with the blog site 21, for use in place of their real names when posting documents to the site.

The document analysis unit 13 compares the words obtained by analyzing the classification target document with the nicknames stored in the nickname database 24 and determines whether a nickname is present in the document. A name registered in the nickname database 24 may, however, be used in a document in a sense other than as a nickname. For example, if the nickname "a" of member A is "Spring" ("春" in the original), it may be impossible to tell whether an occurrence of the word is the ordinary noun for the season or the nickname identifying member A.

For this reason, a name used as a nickname is presumed always to be accompanied by an honorific suffix such as "-chan" (ちゃん), "-san" (さん or サン), or "-sama" (さま or 様), and this is used to tell the two cases apart. In the above example, after confirming that the word "Spring" exists in the nickname database 24, the document analysis unit 13 determines whether an honorific word such as "-chan" or "-san" follows it. If one does, the document analysis unit 13 judges "Spring" to be a nickname and decides that the classification target document contains the nickname of member A. If no honorific word follows, "Spring" is treated as an ordinary common noun or the like, and the classification target document is judged not to contain a nickname.
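The honorific check can be approximated in Python with a regular expression. This is a simplified sketch: the function name and the dictionary standing in for the nickname database 24 are hypothetical, and plain substring matching replaces the word-level comparison that the document analysis unit 13 performs after morphological analysis, so a longer word that merely ends with a nickname would also match.

```python
import re
from typing import Dict, List

# Honorific suffixes listed in the description.
HONORIFICS = ("ちゃん", "さん", "さま", "サン", "様")
_SUFFIX = "(?:" + "|".join(re.escape(h) for h in HONORIFICS) + ")"


def find_nicknames(text: str, nickname_db: Dict[str, str]) -> List[str]:
    """Return the members whose nickname occurs directly followed by an honorific."""
    found = []
    for member, nickname in nickname_db.items():
        # A registered name counts as a nickname only when an honorific follows it.
        if re.search(re.escape(nickname) + _SUFFIX, text):
            found.append(member)
    return found
```

With member A registered under the nickname "春", a comment beginning "春さん" is attributed to A, while "春が来た" ("spring has come") is not.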

The determination unit 14 performs a classification determination process that determines which of the above-described Type 1, Type 2, and Type 3 the classification target document falls under, based on whether the writer of the classification target document is the diary contributor or someone else, and on whether a nickname is present in the classification target document. The processing in the determination unit 14 is described in detail below with reference to FIGS. 4 to 7.

FIG. 4 is a schematic diagram showing, in time series, a diary 4A and a comment group 4B posted in relation to it. In other words, it schematically shows one thread that actually arose starting from the posting of the diary document 41 by member A. The arrows on the left side of the figure indicate the response relations between diary and comment and between comment and comment. The comment group 4B contains eight comment documents 42 to 49. The uppermost comment document 42 in the figure was posted at the time closest to the posting time of the diary document 41, and the comment documents 43 to 49 are arranged below it in order of posting time, earliest first.

The comment group 4B contains comment documents not only from members B, C, D, and E but also from member A, the diary contributor. In addition to comment documents that respond directly to the diary document 41, there are comment documents that respond to other comment documents. Specifically, the comment documents that respond directly to the diary document 41 are the comment document 42 of member B, the comment document 44 of member C, and the comment document 45 of member D. Member A has posted the comment documents 43, 46, 47, and 48.

That is, member A posted the comment document 43 immediately in response to the comment document 42 of member B, and later, after some time, posted the comment document 47. The comment document 47 contains the expression "b-san", in which the honorific word "-san" is attached to the nickname "b" identifying member B. Member A also posted the comment document 48 in response to the comment document 44 of member C. The comment document 48 contains the expression "c-sama", in which the honorific word "-sama" is attached to the nickname "c" identifying member C. Furthermore, member A posted the comment document 46 immediately in response to the comment document 45 of member D.

Member E also responded to the comment document 45 of member D by posting the comment document 49. The comment document 49 contains the expression "d-chan", in which the honorific word "-chan" is attached to the nickname "d" identifying member D. In this thread, member E has not exchanged any documents with member A, the diary contributor. That is, member E posted the comment document 49 out of empathy with member D, not with member A.

Given the facts above, two classification methods for the comment documents 42 to 49 can be cited as comparative examples for the present invention: the one shown in FIG. 5 (Comparative Example 1) and the one shown in FIG. 6 (Comparative Example 2). Comparative Example 1 treats every comment document in the thread as a comment document responding to the diary document. Comparative Example 2 treats the comment documents, excluding those posted by the diary contributor (member A in this example), as comment documents responding to the diary document.

As shown in FIG. 5, the classification method of Comparative Example 1 treats all of the comment documents 42 to 49 in the comment group 4B as comment documents responding to the diary document 41. This method is mechanically fast, but it may extract "diary-comment" relations that should not be extracted and, conversely, fail to extract "comment-comment" relations that should be extracted.

Specifically, it extracts a plainly contradictory "diary-comment" relation by treating the comment documents 43, 46, 47, and 48, which member A, the diary contributor, addressed to other members, as comment documents on the diary document 41. It also makes the error of treating the comment document 49, which is not a comment on the diary document 41, as a comment document on the diary document 41. At the same time, it fails to extract the "comment-comment" relations that should be extracted: the comment document 43 responding to the comment document 42, the comment document 48 responding to the comment document 44, the comment document 47 responding to the comment document 45, and the comment document 49 responding to the comment document 45.

Next, as shown in FIG. 6, the classification method of Comparative Example 2 treats the comment documents 42, 44, 45, and 49, that is, the comment group 4B minus the comment documents 43, 46, 47, and 48 posted by member A, the diary contributor, as comment documents responding to the diary document 41.

The classification method of Comparative Example 2 removes the contradiction of treating the comment documents 43, 46, 47, and 48, which member A addressed to other members, as comment documents on the diary document 41. However, it cannot remove the error of treating the comment document 49, which is not a comment on the diary document 41, as a comment document on the diary document 41. As in Comparative Example 1, the failure to extract the "comment-comment" relations that should be extracted also remains.

FIG. 7 is a table evaluating the accuracy of the classification methods of Comparative Example 1 and Comparative Example 2. Items marked "×" in the figure are those in which the comparative method is deficient. As described above, the classification method of Comparative Example 1 both wrongly extracts the diary contributor's own comment documents as comment documents in a "diary-comment" relation and misidentifies "comment-comment" relations as "diary-comment" relations. It does not, however, go so far as to extract wrong "comment-comment" relations, such as treating the comment document 48 as a comment responding to the comment document 42. Also, although its result contains errors, it extracts the comment documents 42, 44, and 45, which are in a correct "diary-comment" relation, without omission. It nevertheless cannot extract correct "comment-comment" relations such as the comment document 49 responding to the comment document 45.

The classification method of Comparative Example 2 gives the same result as Comparative Example 1, except that it no longer wrongly extracts the diary contributor's own comment documents as comment documents in a "diary-comment" relation. With such classification methods, when, for example, documents exchanged between member A and other members are extracted and analyzed to score the degree of empathy numerically in order to detect other members who empathize strongly with member A, the extraction of inter-member documents on which an accurate empathy determination depends is not performed correctly. The accuracy of the empathy determination therefore inevitably falls.
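The behavior of the two comparative examples on the thread of FIG. 4 can be reproduced with a short Python sketch. The thread contents and the true response relations are taken from the description of FIG. 4; the function and variable names are hypothetical and chosen only for illustration.

```python
DIARY_ID = 41
DIARY_AUTHOR = "A"

# (comment document, writer) in posting order, as described for FIG. 4.
THREAD = [(42, "B"), (43, "A"), (44, "C"), (45, "D"),
          (46, "A"), (47, "A"), (48, "A"), (49, "E")]

# Document each comment actually responds to, per the description of FIG. 4.
TRUE_PARENT = {42: 41, 43: 42, 44: 41, 45: 41, 46: 45, 47: 42, 48: 44, 49: 45}


def comparative_example_1(thread):
    """Treat every comment in the thread as a response to the diary."""
    return {doc_id for doc_id, _writer in thread}


def comparative_example_2(thread, diary_author):
    """Same as Comparative Example 1, minus the diary contributor's own comments."""
    return {doc_id for doc_id, writer in thread if writer != diary_author}


true_diary_replies = {d for d, parent in TRUE_PARENT.items() if parent == DIARY_ID}

wrong_1 = sorted(comparative_example_1(THREAD) - true_diary_replies)
wrong_2 = sorted(comparative_example_2(THREAD, DIARY_AUTHOR) - true_diary_replies)

print("Comparative Example 1, wrongly treated as diary replies:", wrong_1)
print("Comparative Example 2, wrongly treated as diary replies:", wrong_2)
```

Under these inputs, Comparative Example 1 misclassifies the comment documents 43, 46, 47, 48, and 49, while Comparative Example 2 still misclassifies the comment document 49. Neither produces any comment-to-comment relation at all.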

So that the items marked "×" in FIG. 7 can also be extracted correctly, the determination unit 14 classifies each classification target document according to the following logic (a) to (c), referring to the writer of the classification target document identified by the writer specifying unit 12, the presence or absence of a nickname in the classification target document as determined by the document analysis unit 13, and the posting time.

(a) If the writer of the classification target document is someone other than member A, the diary contributor, and no nickname is present in the classification target document, the document is treated as another member's comment document posted in response to the diary document ([Type 1] above). In the example of FIG. 4, the comment documents 42, 44, and 45 correspond to this case. This rests on the presumption that, in a comment document responding directly to a diary document, the addressee is obvious and normally need not be identified in the text, so no nickname will be written in the comment document.

(b) If the writer of the classification target document is someone other than member A, the diary contributor, a nickname is present in the classification target document, and the comment group 4B contains a comment document by the member to whom the nickname belongs that precedes the classification target document in time series, the document is treated as a comment document exchanged between other members ([Type 3] above). In the example of FIG. 4, the comment document 49 corresponds to this case. This rests on the presumption that, when a comment document is posted in response to another member's comment document in the thread rather than to the diary document, a nickname identifying the addressee will normally be written in the comment document. The methods of Comparative Examples 1 and 2 could not classify the comment document 49 correctly, but the method of the present embodiment can, because the comment document 49 contains the expression "d-chan".

(c) If the writer of the classification target document is member A, the diary contributor, and either no nickname is present in the classification target document, or a nickname is present and the comment group 4B contains a comment document by the member to whom the nickname belongs that precedes the classification target document in time series, the document is treated as a comment document posted by the diary contributor in response to a comment document posted by another member ([Type 2] above). When no nickname is present in the classification target document, the document is judged to have been posted in response to the immediately preceding comment document in time series, which settles which comment document a nickname-free comment document responds to. In the example of FIG. 4, the comment documents 43, 46, 47, and 48 correspond to this case.

This rests on the presumption that, in a comment document responding to the comment document posted immediately before it, the addressee is obvious and normally need not be identified in the text, so no nickname will be written in the comment document. It also rests on the presumption that, when a comment document is posted in response to a comment document that was posted earlier but not immediately before, a nickname identifying the addressee will be written in the text. With this method, the relations of the comment document 43 responding to the comment document 42 and of the comment document 46 responding to the comment document 45 can be extracted correctly. In addition, based on the expression "b-san" in the comment document 47 and the expression "c-sama" in the comment document 48, the relations of the comment document 47 responding to the comment document 42 and of the comment document 48 responding to the comment document 44 can be extracted correctly.
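Rules (a) to (c) reduce to a small decision function over three yes/no facts about a classification target document. The following Python sketch shows one way to write it. The function name, the parameter names, and the use of `None` for a document that matches none of the three rules are assumptions made for illustration.

```python
from typing import Optional


def classify_type(is_diary_contributor: bool,
                  has_nickname: bool,
                  nickname_owner_posted_earlier: bool) -> Optional[str]:
    """Apply rules (a)-(c) and return "Type 1", "Type 2", "Type 3", or None.

    `nickname_owner_posted_earlier` is only meaningful when `has_nickname`
    is True: it says whether the member named by the nickname has an earlier
    comment document in the same thread.
    """
    if not is_diary_contributor:
        if not has_nickname:
            return "Type 1"  # rule (a): response to the diary document
        if nickname_owner_posted_earlier:
            return "Type 3"  # rule (b): comment exchanged between other members
        return None          # no rule applies (assumption: left unclassified)
    if not has_nickname or nickname_owner_posted_earlier:
        return "Type 2"      # rule (c): diary contributor answering a comment
    return None              # no rule applies (assumption: left unclassified)


# Comment documents from FIG. 4:
print(classify_type(False, False, False))  # 42, 44, 45
print(classify_type(False, True, True))    # 49 ("d-chan")
print(classify_type(True, False, False))   # 43, 46
print(classify_type(True, True, True))     # 47 ("b-san"), 48 ("c-sama")
```

Keeping the function free of any thread data makes the three rules easy to check against the table of FIG. 7 case by case.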

The analysis processing unit 15 performs document analysis on the posted documents (diary document 41, comment documents 42 to 48) of members who have a record of exchanging documents on the blog site 21 (for example, members B, C, and D with respect to member A), that is, on documents exchanged between people who already empathize with each other, and applies multivariate analysis to the result to derive a determination formula for evaluating the degree of empathy. Multivariate analysis techniques that can be used include, for example, multiple regression analysis, discriminant analysis, quantification theory type I, and quantification theory type II. The parameters obtained by the document analysis are set as the "objective variable" and the "explanatory variables" as appropriate.

The analysis processing unit 15 further uses the determination formula to evaluate the degree of empathy between members who have not yet exchanged documents. In this case, documents created by the members subject to the empathy determination are extracted from the blog server 22 and analyzed to derive the parameters serving as the "objective variable" and "explanatory variables", and these are applied to the determination formula to evaluate the degree of empathy between the members numerically. When a combination of members whose degree of empathy exceeds a predetermined threshold is found, a display means (not shown) is made to indicate this. Members of the blog site 21 who are likely to empathize with each other can thus be identified accurately, and providing this information to those members helps to invigorate the blog site 21.

The document classification operation of the processing device 10 described above will now be explained with reference to the flowchart shown in FIG. 8. First, the document group extraction unit 11 extracts, from the document data accumulated in the blog server 22, a diary document and the comment documents C_1 to C_n related to that diary document (step S1). Then, among the extracted comment documents, the first document is set as C_1 = C_m, and the classification target document is set as M = C_m (step S2).

Next, the writer specifying unit 12 identifies the member who posted the classification target document M (step S3). This determines whether the contributor of the first document C_1 is the diary contributor X or a comment contributor Y other than the diary contributor.

If the contributor identified in step S3 is the diary contributor X, the document analysis unit 13 analyzes the classification target document M and determines whether a nickname is present in it (step S4). If no nickname is present (NO in step S4), the determination unit 14 determines that the document M is a comment document that the member who posted the diary document (member A in the example of FIG. 4) posted in response to the immediately preceding comment document in time series (step S5). In the example of FIG. 4, the comment documents 43 and 46 correspond to this case.

If a nickname is present in the classification target document M (YES in step S4), the determination unit 14 further checks whether a comment document by the member to whom the nickname belongs precedes the classification target document M in time series (step S6). If such a comment document exists (YES in step S6), the document M is determined to be a comment document that the diary contributor posted in response to the comment document of the member identified by that nickname (step S7). In the example of FIG. 4, the comment documents 47 and 48 correspond to this case.

On the other hand, if no earlier comment document posted by the member corresponding to the nickname exists (NO in step S6), that is, if the nickname is registered in the nickname database 24 but the comment group 4B contains no comment document by the corresponding member, it cannot be determined which document the document M responds to, so the document is excluded from classification (step S12). An example would be a comment document containing the nickname of a member F, who does not appear in FIG. 4.

If, on the other hand, the contributor identified in step S3 is a comment contributor Y other than the diary contributor, the document analysis unit 13 likewise analyzes the classification target document M and determines whether a nickname is present in it (step S8). If no nickname is present (NO in step S8), the determination unit 14 determines that the document M is another member's comment document posted in response to the diary document (step S9). In the example of FIG. 4, the comment documents 42, 44, and 45 correspond to this case.

If a nickname is present in the classification target document M (YES in step S8), the determination unit 14 further checks whether a comment document by the member to whom the nickname belongs precedes the classification target document M in time series (step S10). If such a comment document exists (YES in step S10), the document M is determined to be a comment document that the comment contributor posted in response to the comment document of the member identified by that nickname (step S11). In the example of FIG. 4, the comment document 49 corresponds to this case.

On the other hand, if no earlier comment document posted by the member corresponding to the nickname exists (NO in step S10), that is, if the nickname is registered in the nickname database 24 but the comment group 4B contains no comment document by the corresponding member, it cannot be determined which document the document M responds to, so the document is excluded from classification (step S12).

After the above determination process is finished, it is checked whether any classification target document M remains, that is, whether M = C_n (step S13). If a classification target document remains (NO in step S13), C_m is incremented by one (step S14), and the classification target document is set as M = C_m (step S15). The process then returns to step S3, and the determination process is performed for the next classification target document. If no classification target document remains (YES in step S13), the process ends.
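The flow of steps S1 to S15 can be sketched in Python as a single pass over the comment documents in posting order, recording for each one the document it is judged to respond to. This is a minimal sketch under several assumptions not stated in the description: nickname detection has already been done and is passed in as the mentioned member (or `None`), the most recent earlier comment by the mentioned member is taken as the target when there are several, a nickname-free comment by the diary contributor with no preceding comment is excluded, and `None` stands for exclusion in step S12. The function name and data layout are hypothetical.

```python
def classify_thread(diary_id, diary_author, comments):
    """Return {comment id: id of the document it responds to, or None if excluded}.

    `comments` is a list of (doc_id, writer, mentioned_member_or_None)
    in posting order (C_1 ... C_n).
    """
    results = {}
    earlier = []  # (doc_id, writer) of comments already processed
    for doc_id, writer, mentioned in comments:        # S2, S13 to S15: loop over M
        is_diary_contributor = writer == diary_author  # S3
        if mentioned is None:                          # S4 / S8: no nickname
            if is_diary_contributor:
                # S5: response to the immediately preceding comment
                target = earlier[-1][0] if earlier else None
            else:
                target = diary_id                      # S9: response to the diary
        else:
            # S6 / S10: earlier comment by the member the nickname belongs to
            candidates = [d for d, w in earlier if w == mentioned]
            # S7 / S11 when found; S12 (excluded) when not
            target = candidates[-1] if candidates else None
        results[doc_id] = target
        earlier.append((doc_id, writer))
    return results


# Thread of FIG. 4, plus a hypothetical comment 50 naming member F, who never posted.
comments = [(42, "B", None), (43, "A", None), (44, "C", None), (45, "D", None),
            (46, "A", None), (47, "A", "B"), (48, "A", "C"), (49, "E", "D"),
            (50, "B", "F")]

for doc_id, target in classify_thread(41, "A", comments).items():
    print(doc_id, "->", target)
```

For the eight comment documents of FIG. 4, this reproduces the response relations given in the description, and the added comment 50 is excluded because member F has no earlier comment in the thread.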

According to the network system S described above, comment documents created in relation to a diary document and posted to the site one after another in time series can be classified accurately into diary-comment relations and comment-comment relations, based on whether each document contains a nickname and on the positions of the documents relative to one another in time series. That is, it is possible to grasp accurately which writer posted a document in response to which document on the site. Consequently, when, for example, documents of people who empathize with each other on the site are to be extracted, the extraction can be performed accurately. This in turn improves the accuracy of a determination formula for numerically scoring the degree of empathy when that formula is derived by analyzing the extracted documents.

FIG. 1 is a configuration diagram schematically showing the hardware configuration of a network system to which the document classification method according to the present invention is applied.
FIG. 2 is a functional block diagram showing the functional configuration of the processing device.
FIG. 3 is a table schematically showing the data structure of the nickname database.
FIG. 4 is a schematic diagram showing, in time series, a diary document and a group of comments posted in relation to it.
FIG. 5 is a schematic diagram showing Comparative Example 1 for the present invention.
FIG. 6 is a schematic diagram showing Comparative Example 2 for the present invention.
FIG. 7 is a table evaluating the accuracy of the classification methods of Comparative Example 1 and Comparative Example 2.
FIG. 8 is a flowchart showing the operation of the document classification process performed by the processing device.

Explanation of symbols

S Network system (document classification system)
10 Processing device (classification processing means)
11 Document group extraction unit
12 Writer specifying unit
13 Document analysis unit
14 Determination unit
15 Analysis processing unit
20 Blog system
21 Blog site (site on communication network)
22 Blog server (document database)
23 Member server (writer database)
24 Nickname database (name database)
30 Terminal device

Claims

A method for classifying documents posted by a large number of writers to a site on a communication network, wherein, when there exists a document group including a first document created by a first writer and posted to the site and a plurality of documents created in relation to the first document and posted to the site in time series,
in order to classify each classification target document of the document group into at least:
a second document, which is a document created in response to the first document by a second writer different from the first writer and posted to the site;
a third document, which is a document created in response to the second document by the first writer and posted to the site; and
a fourth document, which is a document created in response to the second document by a third writer different from the first writer and the second writer and posted to the site,
the method comprises:
a first step of identifying the writer who posted the classification target document;
a second step of, when the writer is other than the first writer, determining whether a personal identification name defined in advance in association with each writer exists in the classification target document, determining the classification target document to be the second document if no such name exists, and determining it to be the fourth document if such a name exists and a document posted by the writer associated with that personal identification name exists in the document group earlier in time series than the classification target document; and
a third step of, when the writer is the first writer, determining whether a personal identification name defined in advance in association with each writer exists in the classification target document, and determining the classification target document to be the third document if such a name exists and a document posted by the writer associated with that personal identification name exists in the document group earlier in time series than the classification target document.

The document classification method according to claim 1, wherein, in the third step, if the personal identification name does not exist, the classification target document is determined to be the third document responding to the second document that immediately precedes it in time series.

The document classification method according to claim 1, wherein the personal identification name is a nickname assigned to each writer.

The document classification method according to claim 1, wherein the site on the communication network is a specific website on the Internet.

The document classification method according to claim 4, wherein:
the website is a blog site;
the first document is a diary document created by the first writer;
the second document is a first comment document on the diary document, created by a writer other than the first writer;
the third document is a second comment document on the first comment document, created by the first writer; and
the fourth document is a third comment document on the first comment document, created by a writer other than the first writer.

A system for classifying documents posted by a large number of writers to a site on a communication network, comprising:
a document database that stores documents posted to the site; a writer database that stores the names of writers expected to post to the site; a name database that stores personal identification names defined in advance in association with each of the writers; and a classification processing means that performs classification processing of the documents posted to the site,
wherein the classification processing means includes:
a document group extraction unit that extracts, from the document database, a document group including a first document created by a first writer and posted to the site and a plurality of documents created in relation to the first document and posted to the site in time series;
a writer specifying unit that, for each classification target document of the document group, refers to the writer database and identifies the writer who posted the classification target document;
a document analysis unit that analyzes the classification target document and refers to the name database to determine whether a personal identification name exists in the classification target document; and
a determination unit that performs a classification determination process on the classification target document according to predetermined types, based on whether the writer is the first writer or someone other than the first writer and on whether a personal identification name exists in the classification target document.

The document classification system according to claim 6, wherein the determination unit classifies the classification target document into at least:
a second document, which is a document created in response to the first document by a second writer different from the first writer and posted to the site;
a third document, which is a document created in response to the second document by the first writer and posted to the site; and
a fourth document, which is a document created in response to the second document by a third writer different from the first writer and the second writer and posted to the site,
and wherein:
when the writer of the classification target document is other than the first writer, the determination unit determines the classification target document to be the second document if the personal identification name does not exist in it, and to be the fourth document if the personal identification name exists and a document posted by the writer associated with that personal identification name exists in the document group earlier in time series than the classification target document; and
when the writer of the classification target document is the first writer, the determination unit determines the classification target document to be the third document if the personal identification name does not exist in it, or if the personal identification name exists and a document posted by the writer associated with that personal identification name exists in the document group earlier in time series than the classification target document.