JP2010271870A - Analysis device for consecutive pictographs or the like

Info

Publication number: JP2010271870A
Application number: JP2009122389A
Authority: JP
Inventors: Kei Kimura; 圭木村
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2009-05-20
Filing date: 2009-05-20
Publication date: 2010-12-02
Anticipated expiration: 2029-05-20
Also published as: JP5049314B2

Abstract

PROBLEM TO BE SOLVED: To create a dictionary of consecutive pictographs or the like based on their actual usage. SOLUTION: An analysis device for consecutive pictographs or the like includes: a consecutive-pictograph/sentence/article extraction means that refers to a single-pictograph dictionary to extract, from obtained content data, consecutive pictographs or the like, a sentence including them, and an article including that sentence; an article feature word extraction means that refers to a word dictionary to extract an article feature word from the extracted article; a pictograph-excluded sentence extraction means that extracts, from the obtained content data, another sentence matching the portion of the extracted sentence that remains after the consecutive pictographs or the like are excluded; a sentence feature word extraction means that refers to the word dictionary to extract a sentence feature word from that other sentence; a feature word similarity determination means that refers to a synonym dictionary to determine whether the extracted article feature word is similar to the sentence feature word; and a data registration means that, if they are determined to be similar, associates the extracted consecutive pictographs or the like with the sentence feature word and registers them in a consecutive-pictograph dictionary. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a technique for analyzing pictographs used in blogs (diary-style sites), bulletin boards, and the like on the Internet.

Pictographs drawn as illustrated characters, and emoticons that depict a human face by arranging ordinary characters, are widely used in e-mail, blogs, bulletin boards, and the like.

Many of the character conversion functions built into mobile phones and similar devices support single pictographs and emoticons, so these can be entered easily while typing text.

Patent Document 1 discloses a technique that analyzes the type of a pictograph (emoticon) and plays back an animated image or sound corresponding to the pictograph in synchronization with the display of text data.

Patent Document 1: JP 2008-54340 A

As described above, conversion dictionaries for individual pictographs and emoticons are steadily being developed. Recently, however, it has also become common to express a specific meaning (a word) by stringing several pictographs or emoticons together. Here, such a string of several pictographs or emoticons is called "consecutive pictographs or the like", and a single pictograph or emoticon is called a "pictograph or the like".

No conversion dictionary has been prepared for such consecutive pictographs or the like, so users must enter them one character at a time, which makes input cumbersome. Moreover, when content is analyzed on the server side, the meaning of consecutive pictographs or the like cannot be interpreted, so the content cannot be analyzed effectively.

The present invention has been proposed in view of the above conventional problems, and its object is to provide an analysis device for consecutive pictographs or the like that can generate a dictionary of consecutive pictographs or the like based on how they are actually used.

To solve the above problems, the gist of the present invention, as described in claim 1, is an analysis device for consecutive pictographs or the like comprising: content acquisition means for acquiring content data from a content database to be analyzed; consecutive-pictograph/sentence/article extraction means for referring to a single pictograph dictionary and extracting, from the acquired content data, consecutive pictographs or the like, a sentence containing them, and an article containing that sentence; article feature word extraction means for extracting an article feature word from the extracted article by referring to a word dictionary; pictograph-excluded sentence extraction means for extracting, from the acquired content data, other sentences that match the portion of the extracted sentence remaining after the consecutive pictographs or the like are excluded; sentence feature word extraction means for extracting a sentence feature word from the extracted other sentences by referring to the word dictionary; feature word similarity determination means for determining, by referring to a synonym dictionary, whether the extracted article feature word and sentence feature word are similar; and data registration means for associating the extracted consecutive pictographs or the like with the sentence feature word and registering them in a consecutive pictograph dictionary when they are determined to be similar.

As described in claim 2, the analysis device according to claim 1 may further comprise data deletion means that refers to the consecutive pictograph dictionary and, based on the recorded appearance count of each entry, deletes consecutive pictographs or the like whose appearance count has not been updated for a certain period.

As described in claim 3, the invention can also be configured as an analysis method for consecutive pictographs or the like comprising: a content acquisition step of acquiring content data from a content database to be analyzed; a consecutive-pictograph/sentence/article extraction step of referring to a single pictograph dictionary and extracting, from the acquired content data, consecutive pictographs or the like, a sentence containing them, and an article containing that sentence; an article feature word extraction step of extracting an article feature word from the extracted article by referring to a word dictionary; a pictograph-excluded sentence extraction step of extracting, from the acquired content data, other sentences that match the portion of the extracted sentence remaining after the consecutive pictographs or the like are excluded; a sentence feature word extraction step of extracting a sentence feature word from the extracted other sentences by referring to the word dictionary; a feature word similarity determination step of determining, by referring to a synonym dictionary, whether the extracted article feature word and sentence feature word are similar; and a data registration step of associating the extracted consecutive pictographs or the like with the sentence feature word and registering them in a consecutive pictograph dictionary when they are determined to be similar.

The analysis device of the present invention can generate a dictionary of consecutive pictographs or the like based on how they are actually used, and the dictionary can be used for entering and analyzing consecutive pictographs or the like.

FIG. 1 is a diagram showing a configuration example of a system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the data structure of the content database.
FIG. 3 is a diagram showing an example of the data structure of the single pictograph dictionary.
FIG. 4 is a diagram showing an example of the data structure of the word dictionary.
FIG. 5 is a diagram showing an example of the data structure of the synonym dictionary.
FIG. 6 is a diagram showing an example of the data structure of the extraction work database.
FIG. 7 is a diagram showing an example of the data structure of the consecutive pictograph dictionary.
FIG. 8 is a diagram showing an example of the hardware configuration of the pictograph analysis server.
FIG. 9 is a sequence diagram (part 1) showing a processing example of the embodiment.
FIG. 10 is a sequence diagram (part 2) showing a processing example of the embodiment.
FIG. 11 is a diagram showing an example of extracting a sentence and article containing consecutive pictographs and of extracting an article feature word.
FIG. 12 is a diagram showing an example of extracting pictograph-excluded sentences and of extracting sentence feature words.

Preferred embodiments of the present invention are described below.

<Configuration>
FIG. 1 is a diagram showing a configuration example of a system according to an embodiment of the present invention.

In FIG. 1, a plurality of user terminals 2 operated by users, such as PCs (Personal Computers), mobile phones, and PDAs (Personal Digital Assistants), are connected to a network 1 such as the Internet. Each user terminal 2 includes a general-purpose browser (Web browser) 21. The browser 21 requests, acquires, and displays page data written in a language such as HTML (Hyper Text Markup Language), and transmits form data and the like, in accordance with HTTP (Hyper Text Transfer Protocol), a standard Internet protocol, or a similar protocol.

Also connected to the network 1 is a blog/bulletin board server 3 that manages the browsing and posting of blog and bulletin board articles in response to access from the browser 21 of a user terminal 2. The blog/bulletin board server 3 is provided with a content database 301. This database systematically holds predetermined data on a storage medium, such as an HDD (Hard Disk Drive), in the computer that hosts the database.

The network 1 is also connected to a pictograph analysis server 4 that analyzes the content of articles on the blog/bulletin board server 3.

The pictograph analysis server 4 includes, as functional units, a blog/bulletin board content acquisition unit 41, a consecutive pictograph/sentence/article extraction unit 42, an article feature word extraction/region determination unit 43, a pictograph-excluded sentence extraction unit 44, a sentence feature word extraction unit 45, a feature word similarity determination unit 46, a consecutive pictograph data registration unit 47, and a consecutive pictograph data deletion unit 48. These functional units are realized by computer programs executed on the hardware resources, such as the CPU (Central Processing Unit), ROM (Read Only Memory), and RAM (Random Access Memory), of the computer constituting the pictograph analysis server 4. They need not reside on a single computer and may be distributed as needed.

The dictionaries and databases that the pictograph analysis server 4 refers to and updates are a single pictograph dictionary 401, a word dictionary 402, a synonym dictionary 403, an extraction work database 404, and a consecutive pictograph dictionary 405. Each of these systematically holds predetermined data on a storage medium, such as an HDD, in the computer that hosts it.

FIG. 2 is a diagram showing an example of the data structure of the content database 301 of the blog/bulletin board server 3. The content database 301 has items such as "registration date and time", "registrant user ID", "registrant IP address", and "article content". "Registration date and time" is when the article was registered (posted). "Registrant user ID" is the user ID of the user who registered the article. "Registrant IP address" is the source IP address used by that user at the time of registration. "Article content" is the data of the article itself and includes text, images, audio, and the like.

FIG. 3 is a diagram showing an example of the data structure of the single pictograph dictionary 401 of the pictograph analysis server 4. The single pictograph dictionary 401 has items such as "pictograph or the like", "code", and "word". "Pictograph or the like" is a one-character pictograph or an emoticon that expresses a single meaning. "Code" is information that identifies the pictograph or the like. "Word" is the word corresponding to the pictograph or the like.

FIG. 4 is a diagram showing an example of the data structure of the word dictionary 402, which has items such as "word" and "part of speech". "Word" is the headword. "Part of speech" is the part of speech of that word.

FIG. 5 is a diagram showing an example of the data structure of the synonym dictionary 403, which has items such as "word" and "synonym". "Word" is the headword. "Synonym" is one or more other words similar to that word.

FIG. 6 is a diagram showing an example of the data structure of the extraction work database 404, which has items such as "consecutive pictographs or the like", "word", "appearance count (update date and time)", and "region". "Consecutive pictographs or the like" is a sequence of several consecutive pictographs or emoticons. "Word" is the word corresponding to that sequence. "Appearance count (update date and time)" is the number of times the sequence has appeared in the analysis and the date and time of the last update. "Region" is the region where the user who wrote the article containing the sequence is located.

FIG. 7 is a diagram showing an example of the data structure of the consecutive pictograph dictionary 405, which has items such as "consecutive pictographs or the like", "word", "appearance count (update date and time)", and "region". The items are the same as those of the extraction work database 404 in FIG. 6.
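The shared record layout of the extraction work database 404 and the consecutive pictograph dictionary 405 can be pictured as a small data class. This is only an illustrative sketch: the class name, field names, and sample values are assumptions, and the source does not say how the records are actually stored.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SequenceEntry:
    """One record with the items listed for FIG. 6 and FIG. 7 (names are hypothetical)."""
    sequence: str         # the consecutive pictographs or the like
    word: str             # the word the sequence is taken to mean
    count: int            # appearance count observed by the analysis
    updated_at: datetime  # date and time of the last count update
    region: str           # region of the user who posted the article


# Hypothetical sample record; the pictographs are placeholders, not taken from the figures.
entry = SequenceEntry("🌸🎒", "入学式", 1, datetime(2009, 5, 20), "Tokyo")
```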

Returning to FIG. 1, the blog/bulletin board content acquisition unit 41 of the pictograph analysis server 4 acquires content data from the content database 301 of the blog/bulletin board server 3 when the pictograph analysis process runs.

The consecutive pictograph/sentence/article extraction unit 42 refers to the single pictograph dictionary 401 and extracts, from the content data acquired from the content database 301, consecutive pictographs or the like, the sentence containing them, and the article containing that sentence.

The article feature word extraction/region determination unit 43 refers to the word dictionary 402 and extracts a feature word from the article extracted by the consecutive pictograph/sentence/article extraction unit 42. It also determines the region where the registrant is located, based on the text in the article, the registrant IP address, the registrant user ID, and the like.

The pictograph-excluded sentence extraction unit 44 takes the sentence extracted by the consecutive pictograph/sentence/article extraction unit 42, removes the consecutive pictographs or the like from it, and uses the remaining portion to extract matching sentences from the content data acquired from the content database 301.

The sentence feature word extraction unit 45 refers to the word dictionary 402 and extracts a feature word from each sentence extracted by the pictograph-excluded sentence extraction unit 44.

The feature word similarity determination unit 46 refers to the synonym dictionary 403 and determines whether the feature word extracted by the article feature word extraction/region determination unit 43 (the article feature word) is similar to the feature word extracted by the sentence feature word extraction unit 45 (the sentence feature word). When it determines that they are similar, it provisionally registers the consecutive pictographs or the like in the extraction work database 404, and when the appearance count exceeds a predetermined value it instructs the consecutive pictograph data registration unit 47 to perform the main registration.

The consecutive pictograph data registration unit 47 registers the consecutive pictographs or the like in the consecutive pictograph dictionary 405 when it receives a main registration instruction from the feature word similarity determination unit 46.
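The hand-off from provisional to main registration could look roughly like the following. It is a minimal sketch under assumptions: both stores are modeled as plain dictionaries keyed by (sequence, word, region), and the threshold of 3 is a hypothetical value, since the source only speaks of "a predetermined value".

```python
def promote_if_frequent(work_db, dictionary, key, threshold=3):
    """Copy an entry into the main dictionary once its count exceeds the threshold.

    work_db and dictionary map (sequence, word, region) -> (count, updated_at).
    The threshold value is hypothetical.
    """
    count, updated_at = work_db[key]
    if count > threshold:  # "exceeds", so strictly greater than
        dictionary[key] = (count, updated_at)
        return True
    return False


# Hypothetical data: the sequence has been seen four times so far.
key = ("🌸🎒", "入学式", "Tokyo")
work_db = {key: (4, "2009-05-20")}
dictionary = {}
promoted = promote_if_frequent(work_db, dictionary, key)
```

Keeping the provisional store separate means a sequence seen only once or twice never reaches the dictionary used for input conversion.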

The consecutive pictograph data deletion unit 48 refers to the extraction work database 404 and the consecutive pictograph dictionary 405 at appropriate times and deletes consecutive pictographs or the like that are judged to have fallen out of use.
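One way to read "fallen out of use" is the criterion given for claim 2: the appearance count has not been updated for a certain period. The sketch below assumes that reading; the 90-day period, the function name, and the dictionary-based store are all hypothetical.

```python
from datetime import datetime, timedelta


def delete_stale(entries, now, max_age=timedelta(days=90)):
    """Remove entries whose count has not been updated within max_age.

    entries maps (sequence, word, region) -> (count, updated_at).
    Returns the keys that were removed. The 90-day period is hypothetical.
    """
    stale = [key for key, (_, updated_at) in entries.items()
             if now - updated_at > max_age]
    for key in stale:
        del entries[key]
    return stale


entries = {
    ("🌸🎒", "入学式", "Tokyo"): (12, datetime(2009, 5, 1)),
    ("🎂🎉", "誕生日", "Osaka"): (5, datetime(2008, 1, 1)),
}
removed = delete_stale(entries, now=datetime(2009, 5, 20))
```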

FIG. 8 is a diagram showing an example of the hardware configuration of the pictograph analysis server 4.

In FIG. 8, the pictograph analysis server 4 includes a CPU 4002, a ROM 4003, a RAM 4004, an NVRAM (Non-Volatile Random Access Memory) 4005, and an I/F (Interface) 4006, all connected to a system bus 4001, as well as I/O (Input/Output) devices 4007 such as a keyboard, mouse, monitor, and CD/DVD (Compact Disk/Digital Versatile Disk) drive, an HDD 4008, a NIC (Network Interface Card) 4009, and the like, connected to the I/F 4006. M denotes a medium (recording medium), such as a CD or DVD, on which programs or data are stored.

<Operation>
FIG. 9 and FIG. 10 are sequence diagrams showing a processing example of the above embodiment.

In FIG. 9, when the user of a user terminal 2 posts to a blog or bulletin board, the browser 21 of the user terminal 2 accesses the blog/bulletin board server 3 and requests the posting page (step S101). The page request is made by sending, from the browser 21 to the blog/bulletin board server 3, a message that follows HTTP, a standard Internet protocol, and contains a GET method, a request URI (Uniform Resource Identifier), and the like.

In response, the blog/bulletin board server 3 returns the page data of the posting page, either held internally or generated dynamically, to the browser 21 of the user terminal 2 (step S102). The page data is written in HTML or the like and is sent from the blog/bulletin board server 3 to the browser 21 as an HTTP response or the like.

The browser 21 of the user terminal 2 displays the posting page (step S103), and the user enters the content of the post (step S104).

When input on the posting page is complete, the content of the post is sent from the browser 21 of the user terminal 2 to the blog/bulletin board server 3 (step S105). The input is sent as data such as HTML attached to an HTTP POST or PUT method or the like, or as parameters attached to a GET method or the like.

On receiving the post, the blog/bulletin board server 3 registers the article content (the content of the post) in the content database 301 (FIG. 2) in association with the registration date and time, the registrant user ID, the registrant IP address, and the like (step S106).

Later, at an appropriate time, the pictograph analysis server 4 starts operating, and the blog/bulletin board content acquisition unit 41 acquires content data from the content database 301 of the blog/bulletin board server 3 (step S111). Because old content is of little use as a reference, the content data acquired can be limited, based on the registration date and time, to a period that keeps it sufficiently fresh. The acquired content data is held temporarily inside the server and made available to the consecutive pictograph/sentence/article extraction unit 42 and the pictograph-excluded sentence extraction unit 44 (steps S112 and S113).
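The freshness restriction in step S111 amounts to a filter on the registration date and time. A minimal sketch follows, assuming records are dictionaries with a hypothetical `registered_at` field and a hypothetical 30-day window; the source does not specify either.

```python
from datetime import datetime, timedelta


def fresh_records(records, now, window=timedelta(days=30)):
    """Keep only records registered within the window before `now`.

    The record layout and the 30-day window are assumptions for illustration.
    """
    return [r for r in records if now - r["registered_at"] <= window]


records = [
    {"registered_at": datetime(2009, 5, 18), "content": "recent post"},
    {"registered_at": datetime(2008, 12, 1), "content": "old post"},
]
recent = fresh_records(records, now=datetime(2009, 5, 20))
```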

Next, the consecutive pictograph/sentence/article extraction unit 42 refers to the single pictograph dictionary 401 and extracts, from the content data acquired from the content database 301, consecutive pictographs or the like, the sentence containing them, and the article containing that sentence (step S114). That is, it either searches the whole of the content data using the pictographs or the like registered in the single pictograph dictionary 401 as keys, or scans the content data and checks whether any registered pictograph or the like appears, and then extracts each sequence of consecutive pictographs or the like that appears, the sentence containing it, and the article containing that sentence. Consecutive pictographs or the like already extracted in the previous analysis can be skipped, based on the registration date and time of the content data, so that they are not extracted twice. FIG. 11 shows a state in which a sentence S1 containing consecutive pictographs L1, and an article A1 containing the sentence S1, have been extracted.
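The scanning variant of step S114 can be sketched as a single pass that collects runs of two or more adjacent dictionary entries. The function name, the `min_len` parameter, and the sample pictographs are assumptions; the source does not describe the matching procedure in this detail.

```python
def find_sequences(text, single_pictographs, min_len=2):
    """Return runs of at least min_len consecutive single-pictograph entries.

    single_pictographs is an iterable of strings (pictographs or emoticons).
    """
    # Try longer entries first so a multi-character emoticon wins over a prefix.
    entries = sorted((e for e in single_pictographs if e), key=len, reverse=True)
    runs, current, i = [], [], 0
    while i < len(text):
        match = next((e for e in entries if text.startswith(e, i)), None)
        if match is not None:
            current.append(match)
            i += len(match)
            continue
        # An ordinary character ends the current run.
        if len(current) >= min_len:
            runs.append("".join(current))
        current = []
        i += 1
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


# Hypothetical dictionary entries and text.
singles = {"🌸", "🎒", "(^_^)"}
found = find_sequences("🌸🎒おめでとう。よかった🌸", singles)
```

In the sample, the lone pictograph at the end is ignored because a run of one is a single pictograph, not a consecutive sequence.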

Returning to FIG. 9, the consecutive pictograph/sentence/article extraction unit 42 passes the extracted consecutive pictographs or the like (as codes), the sentence, and the article to the article feature word extraction/region determination unit 43 (step S115).

Next, the article feature word extraction/region determination unit 43 refers to the word dictionary 402 and extracts a feature word (the article feature word) from the article extracted by the consecutive pictograph/sentence/article extraction unit 42 (step S116). To extract the feature word, for example, the character strings in the article are compared against the words registered in the word dictionary 402 by longest-match or a similar method, and a word that matches with high frequency is chosen as the feature word. For the article A1 shown in FIG. 11, for example, "1年生" ("first grader") is extracted as the feature word C1.
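Longest-match scanning followed by a frequency count, as described for step S116, might be sketched as follows. The function name, the sample text, and the rule of returning only the single most frequent word are assumptions made for illustration.

```python
from collections import Counter


def feature_word(text, word_dictionary):
    """Return the dictionary word matched most often by longest-match scanning."""
    # Longest candidates first, so "1年生" is preferred over "年生" at the same position.
    words = sorted((w for w in word_dictionary if w), key=len, reverse=True)
    counts, i = Counter(), 0
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        if match is None:
            i += 1
        else:
            counts[match] += 1
            i += len(match)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical article text and word dictionary.
word = feature_word("1年生になった。1年生は元気。", {"1年生", "年生", "元気"})
```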

Returning to FIG. 9, the article feature word extraction/region determination unit 43 determines the region where the registrant is located, based on the text in the article, the registrant IP address, the registrant user ID, and the like (step S117). Specifically, if the text of the article contains a word denoting a place name, that place is taken as the region. If the registrant IP address falls within an IP address range in a table (not shown), prepared in advance, that associates IP address ranges with regions, the corresponding region is obtained. If an address can be obtained from the user's profile by looking up the registrant user ID in a user DB (not shown), that address is taken as the region.
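The three sources of region information in step S117 can be tried one after another. The order used below (place name, then IP range, then profile) is an assumption, since the source lists the sources without ranking them; the table contents and names are hypothetical and use documentation-only IP ranges.

```python
import ipaddress


def determine_region(text, ip, user_id, place_names, ip_table, profiles):
    """Return a region from the article text, the IP range table, or the profile."""
    # 1. A place name that appears in the article text.
    for name in place_names:
        if name in text:
            return name
    # 2. A table associating IP address ranges with regions.
    address = ipaddress.ip_address(ip)
    for network, region in ip_table:
        if address in ipaddress.ip_network(network):
            return region
    # 3. The address in the user's profile, if any.
    return profiles.get(user_id)


# Hypothetical lookup data.
places = ["Osaka"]
table = [("192.0.2.0/24", "Tokyo")]
profiles = {"u1": "Nagoya"}
region = determine_region("Hello", "192.0.2.7", "u1", places, table, profiles)
```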

Next, the article feature word extraction/region determination unit 43 passes the consecutive pictographs or the like and the sentence received from the consecutive pictograph/sentence/article extraction unit 42, together with the article feature word it extracted and the region it determined, to the pictograph-excluded sentence extraction unit 44 (step S118).

Next, the pictograph-excluded sentence extraction unit 44 takes the sentence extracted by the consecutive pictograph/sentence/article extraction unit 42, removes the consecutive pictographs or the like from it, and uses the remaining portion to extract matching sentences from the content data acquired from the content database 301 (step S119). That is, it generates a character string by removing the consecutive pictographs or the like from the extracted sentence, and then either searches the whole of the content data using that string as a key, or scans the content data and checks whether the string appears, and extracts the sentences that contain it. Removing the consecutive pictographs L1 from the sentence S1 in FIG. 11 yields the string "おめでとう。" ("Congratulations."). FIG. 12 shows a state in which sentences S2 and S3 have been extracted from the content data based on this string.
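Step S119 can be sketched as "strip the sequence, then search for the remainder". The sentence splitting on the Japanese full stop, the exclusion of the original sentence itself, and the sample articles are assumptions; the source only says that other sentences matching the remaining string are extracted.

```python
import re


def split_sentences(text):
    """Split on the Japanese full stop, keeping it attached to each sentence."""
    return [s for s in re.findall(r"[^。]*。|[^。]+$", text) if s.strip()]


def matching_sentences(sentence, sequence, articles):
    """Return other sentences that contain `sentence` with `sequence` removed."""
    key = sentence.replace(sequence, "")
    if not key:
        return []  # the sentence consisted only of pictographs
    return [s for article in articles
            for s in split_sentences(article)
            if key in s and s != sentence]


# Hypothetical content data; the pictographs stand in for L1 in FIG. 11.
articles = ["🌸🎒おめでとう。", "入学式おめでとう。楽しみ。", "誕生日おめでとう。"]
others = matching_sentences("🌸🎒おめでとう。", "🌸🎒", articles)
```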

Returning to FIG. 9, the pictograph-excluded sentence extraction unit 44 passes the consecutive pictographs or the like, the article feature word, and the region received from the article feature word extraction/region determination unit 43, together with the sentences it extracted, to the sentence feature word extraction unit 45 (step S120).

Next, in FIG. 10, the sentence feature word extraction unit 45 refers to the word dictionary 402 and extracts a feature word (the sentence feature word) from each sentence extracted by the pictograph-excluded sentence extraction unit 44 (step S121). To extract the feature word, for example, the character strings in the sentence are compared against the words registered in the word dictionary 402 by longest-match or a similar method, and a word that matches with high frequency is chosen as the feature word. For the sentence S2 shown in FIG. 12, the string "入学式" ("entrance ceremony") contained in S2 is extracted as the feature word C2. For the sentence S3, "誕生日" ("birthday") contained in S3 is extracted as the feature word C3.

Returning to FIG. 10, the sentence feature word extraction unit 45 passes the continuous pictograms, etc., the article feature word, and the region received from the continuous pictogram etc. excluded sentence extraction unit 44, together with the sentence feature words it extracted itself, to the feature word similarity determination unit 46 (step S122).

Next, the feature word similarity determination unit 46 refers to the synonym dictionary 403 and determines whether the article feature word extracted by the article feature word extraction / region determination unit 43 and the sentence feature word extracted by the sentence feature word extraction unit 45 are similar (step S123). Specifically, it looks up the synonym dictionary 403 using one feature word as a key and judges similarity by whether the other feature word is registered as a synonym. Among feature word C1 "１年生" ("first grader") extracted in FIG. 11 and feature words C2 "入学式" ("entrance ceremony") and C3 "誕生日" ("birthday") extracted in FIG. 12, "１年生" and "入学式" are treated as similar.
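The lookup in step S123 can be sketched as below. The dictionary layout (a mapping from a word to the set of its registered synonyms), the entry contents, and the choice to check both directions are assumptions made for illustration; the text only states that one feature word is used as the key.

```python
def is_similar(word_a, word_b, synonyms):
    """Judge two feature words similar if either lists the other as a synonym."""
    return (
        word_b in synonyms.get(word_a, set())
        or word_a in synonyms.get(word_b, set())
    )


# Hypothetical stand-in for the synonym dictionary 403.
synonym_dictionary = {"１年生": {"入学式", "新入生"}}

c1 = "１年生"
similar_words = [w for w in ("入学式", "誕生日") if is_similar(c1, w, synonym_dictionary)]
```

Checking both directions keeps the result independent of which feature word happens to be used as the key when the dictionary registers the relation only once.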

Returning to FIG. 10, when the feature word similarity determination unit 46 determines that the feature words are similar, it provisionally registers the continuous pictograms, etc. in the extraction work database 404 for each combination determined to be similar (step S124). Specifically, the sentence feature word is used as the word, and if the extraction work database 404 already contains a record with the same continuous pictograms, etc., word, and region, the appearance count of that record is incremented by 1. If no such record exists, a record for the continuous pictograms, etc., the word, and the region is created and its appearance count is set to 1.
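The provisional registration in step S124 amounts to an insert-or-increment on a record keyed by the continuous pictograms, the word, and the region. The sketch below assumes an in-memory dict in place of the extraction work database 404 and assumes that each record also carries an update timestamp that is refreshed on every increment; the field names and region values are hypothetical.

```python
from datetime import datetime


def register_provisionally(work_db, pictograms, word, region, now=None):
    """Create the record with count 1, or add 1 to an existing record's count."""
    key = (pictograms, word, region)
    now = now or datetime.now()
    record = work_db.get(key)
    if record is None:
        # No record yet for this pictogram/word/region combination.
        work_db[key] = {"count": 1, "updated": now}
    else:
        record["count"] += 1
        record["updated"] = now
    return work_db[key]["count"]


work_db = {}
register_provisionally(work_db, "🌸🎒", "入学式", "関東")
register_provisionally(work_db, "🌸🎒", "入学式", "関東")
```

Including the region in the key means the same pictogram sequence is counted separately for each region.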

Next, the feature word similarity determination unit 46 determines whether the appearance count exceeds a predetermined value (step S125). If it does, the unit passes the continuous pictograms, etc., the word, the appearance count, and the region to the continuous pictogram etc. data registration unit 47 and instructs main registration (step S126).

In response, the continuous pictogram etc. data registration unit 47 registers the continuous pictograms, etc., the word, the appearance count, and the region in the continuous pictogram etc. dictionary 405 (step S127). If a record whose continuous pictograms, etc., word, and region are the same (that is, everything other than the appearance count) already exists, only the appearance count is updated.
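Steps S125 to S127 can be sketched as a threshold check followed by an insert-or-update. The dict-based stand-ins for the extraction work database 404 and the continuous pictogram etc. dictionary 405, the function name, and the threshold value are assumptions; "exceeds" is read here as strictly greater than the predetermined value.

```python
def promote_if_frequent(work_db, dictionary, key, threshold):
    """Register the record in the main dictionary once its count exceeds the threshold.

    `key` is a (pictograms, word, region) tuple. Returns True if main
    registration took place.
    """
    count = work_db[key]["count"]
    if count <= threshold:
        return False
    # Creates the entry, or updates only the appearance count if it already exists.
    dictionary[key] = count
    return True


work_db = {("🌸🎒", "入学式", "関東"): {"count": 4}}
pictogram_dictionary = {}
promote_if_frequent(work_db, pictogram_dictionary, ("🌸🎒", "入学式", "関東"), threshold=3)
```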

Meanwhile, at a suitable later time, the continuous pictogram etc. data deletion unit 48 refers to the extraction work database 404 and the continuous pictogram etc. dictionary 405 (step S131) and identifies continuous pictograms, etc. that are judged to have fallen out of use and are therefore to be deleted (step S132). Specifically, a record is targeted for deletion when its update date and time is older than a predetermined period, meaning that its appearance count has not increased over that period, and another record for the same continuous pictograms, etc. exists. Here, a record for "the same continuous pictograms, etc." means one that is the same including the region, because it is normal for continuous pictograms, etc. used with different meanings to coexist in different regions.

Then, the continuous pictogram etc. data registration unit 47 deletes the corresponding records from the extraction work database 404 and the continuous pictogram etc. dictionary 405 (step S133).
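The deletion criterion of steps S131 to S133 can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes each record stores an update timestamp, and it interprets "another record for the same continuous pictograms, etc." as another record with the same pictograms and the same region but a different word. The record layout, the region names, and the 180-day period are hypothetical.

```python
from datetime import datetime, timedelta


def find_obsolete(records, now, max_age):
    """Return keys of stale records that have a sibling with the same pictograms and region."""
    # Count how many records share each (pictograms, region) pair.
    siblings = {}
    for pictograms, _word, region in records:
        siblings[(pictograms, region)] = siblings.get((pictograms, region), 0) + 1

    obsolete = []
    for key, record in records.items():
        pictograms, _word, region = key
        stale = now - record["updated"] >= max_age
        has_sibling = siblings[(pictograms, region)] > 1
        if stale and has_sibling:
            obsolete.append(key)
    return obsolete


def delete_records(keys, *stores):
    """Remove the given keys from every store that contains them."""
    for store in stores:
        for key in keys:
            store.pop(key, None)


now = datetime(2010, 1, 1)
work_db = {
    ("🌸🎒", "入学式", "関東"): {"count": 9, "updated": datetime(2009, 12, 20)},
    ("🌸🎒", "花見", "関東"): {"count": 4, "updated": datetime(2009, 1, 1)},
    ("🌸🎒", "花見", "関西"): {"count": 4, "updated": datetime(2009, 1, 1)},
}
targets = find_obsolete(work_db, now, timedelta(days=180))
delete_records(targets, work_db)
```

In this sketch every stale record that has a sibling is flagged, so if all records for a pictogram sequence in one region are stale, all of them are removed; the embodiment does not say how that case should be handled.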

<Summary>
As described above, according to this embodiment, the continuous pictogram etc. dictionary 405 can be kept up to date based on how continuous pictograms, etc. are actually used on the blog / bulletin board server 3, and can be used effectively for inputting and analyzing continuous pictograms, etc.

The present invention has been described above with reference to preferred embodiments. Although specific examples have been shown, it is clear that various modifications and changes can be made to them without departing from the broad spirit and scope of the invention as defined in the claims. That is, the invention should not be construed as limited by the details of the specific examples or the accompanying drawings.

DESCRIPTION OF SYMBOLS
1 Network
2 User terminal
21 Browser
3 Blog / bulletin board server
301 Content database
4 Pictogram etc. analysis server
41 Blog / bulletin board content acquisition unit
42 Continuous pictogram etc. / sentence / article extraction unit
43 Article feature word extraction / region determination unit
44 Continuous pictogram etc. excluded sentence extraction unit
45 Sentence feature word extraction unit
46 Feature word similarity determination unit
47 Continuous pictogram etc. data registration unit
48 Continuous pictogram etc. data deletion unit
401 Single pictogram etc. dictionary
402 Word dictionary
403 Synonym dictionary
404 Extraction work database
405 Continuous pictogram etc. dictionary

Claims

An analysis device for continuous pictograms, etc., comprising:
content acquisition means for acquiring content data from a content database to be analyzed;
continuous pictogram etc. / sentence / article extraction means for extracting, from the acquired content data and with reference to a single pictogram etc. dictionary, continuous pictograms, etc., a sentence containing the continuous pictograms, etc., and an article containing the sentence;
article feature word extraction means for extracting an article feature word from the extracted article with reference to a word dictionary;
continuous pictogram etc. excluded sentence extraction means for extracting, from the acquired content data, another sentence that matches the portion of the extracted sentence excluding the continuous pictograms, etc.;
sentence feature word extraction means for extracting a sentence feature word from the other extracted sentence with reference to the word dictionary;
feature word similarity determination means for determining, with reference to a synonym dictionary, whether the extracted article feature word and the sentence feature word are similar; and
continuous pictogram etc. data registration means for registering, when they are determined to be similar, the extracted continuous pictograms, etc. in association with the sentence feature word in a continuous pictogram etc. dictionary.

The analysis device for continuous pictograms, etc. according to claim 1, further comprising
continuous pictogram etc. data deletion means for referring to the continuous pictogram etc. dictionary and, based on the recorded appearance counts of continuous pictograms, etc., deleting continuous pictograms, etc. whose appearance count has not been updated for a certain period.

A continuous pictogram etc. data registration method, comprising:
a content acquisition step of acquiring content data from a content database to be analyzed;
a continuous pictogram etc. / sentence / article extraction step of extracting, from the acquired content data and with reference to a single pictogram etc. dictionary, continuous pictograms, etc., a sentence containing the continuous pictograms, etc., and an article containing the sentence;
an article feature word extraction step of extracting an article feature word from the extracted article with reference to a word dictionary;
a continuous pictogram etc. excluded sentence extraction step of extracting, from the acquired content data, another sentence that matches the portion of the extracted sentence excluding the continuous pictograms, etc.;
a sentence feature word extraction step of extracting a sentence feature word from the other extracted sentence with reference to the word dictionary;
a feature word similarity determination step of determining, with reference to a synonym dictionary, whether the extracted article feature word and the sentence feature word are similar; and
a continuous pictogram etc. data registration step of registering, when they are determined to be similar, the extracted continuous pictograms, etc. in association with the sentence feature word in a continuous pictogram etc. dictionary.