JP5718213B2 - Web page topic determination device, Web page topic determination method, and Web page topic determination program

Info

Publication number: JP5718213B2
Application number: JP2011256179A
Authority: JP
Inventors: 藤村 滋; 杉崎 正之; 江崎 健司; 内山 匡; 高屋 典子; 市川 裕介; 長野 翔一
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-11-24
Filing date: 2011-11-24
Publication date: 2015-05-13
Anticipated expiration: 2031-11-24
Also published as: JP2013109709A

Description

The present invention relates to a technique for determining the topic of a Web page written in a hypertext markup language such as HTML (HyperText Markup Language).

The topic extraction method of Patent Document 1 is a known technique for extracting the topic of electronic documents in general, not only Web pages. It extracts nouns as feature words from the text of an electronic document, runs a Web search with the extracted feature words as search terms, and takes phrases that appear in common across the search results as the topic.

However, a search engine crawler program (spider, robot) that collects a set of Web pages covering only a specific topic gathers pages by automatically and repeatedly following the hyperlinks found in Web pages. If Patent Document 1 were applied to topic determination, the text of a linked Web page could not be obtained before that page is accessed. Collection efficiency therefore suffers when only Web pages on a specific topic are wanted.

To address this, Non-Patent Document 1 proposes a technique that determines the topic from the URL, which is available before the linked Web page is accessed. The URL is split at symbols and the like into character strings that serve as processing units (hereinafter "tokens"), and substrings of the tokens are extracted as features. A classifier trained by machine learning on training examples then decides from these features whether the page belongs to the topic in question.

Patent Document 1: JP 2009-15796 A

Non-Patent Document 1: Eda Baykan, Monika Henzinger, Ludmila Marian, Ingmar Weber, "Purely URL-based Topic Classification", Proceedings of the 18th International Conference on World Wide Web (WWW '09), pp. 1109-1110
Non-Patent Document 2: "Web Convenient Tools / URL Encode and Decode Form - TAG index Web site", [online], [retrieved November 10, 2011], Internet <URL: http://www.tagindex.com/tool/url.html>

When the topic of a Web page is determined by machine learning that uses a previously collected set of correct answers as training data, the determination accuracy comes down to the quality of that set and to what is used as the features of the Web page. In particular, when the URL of the Web page is the only information source available, the choice of features built from the URL becomes crucial.

As noted above, Non-Patent Document 1 uses substrings of the tokens obtained from a URL as features in order to maximize the number of Web pages that can be classified. However, when a token is a word in the language of the page's main viewers, written in the characters that URL conventions allow, substrings cut at boundaries that make no sense in that language become features and may degrade the accuracy of topic determination.

For example, consider the Web page at the URL "http://example.co.jp/suitouchou/". One of its tokens, "suitouchou", corresponds to the Japanese word 出納帳 (a cash book). Non-Patent Document 1, however, takes substrings of the token as alphabetic text without considering the language in use, so a substring such as "suit" may also be used as a feature. Because this has the same spelling as the English word for a men's suit, the page may be judged to have a fashion-related topic that differs from its actual topic.
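The problem can be reproduced in a few lines. The sketch below is only an illustration: the helper name `substrings` and the length range 4 to 8 are assumptions made for this example and are not taken from Non-Patent Document 1.

```python
def substrings(token, min_len, max_len):
    """Return every contiguous substring whose length is in [min_len, max_len]."""
    return {
        token[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(token) - n + 1)
    }

# Language-agnostic substrings of the romanized Japanese token.
features = substrings("suitouchou", 4, 8)
print("suit" in features)  # the English word appears among the features
```

Any classifier that has learned to associate "suit" with fashion pages would receive that signal from this URL, even though the token has nothing to do with clothing.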

The present invention has been made to solve the problems of the prior art described above. Its object is to build, from the URL of a Web page, features that take into account the language used by the page's main viewers, and to perform appropriate topic determination specialized for that language.

To identify the language used by the main viewers of a Web page from the character composition of its URL, the present invention identifies the country in which the host is used from the host name in the URL and determines the main language of that country. The main language can be identified, for example, with an official language dictionary created in advance.

Feature amounts according to the main language are then extracted from each character string obtained by decomposing the URL into arbitrary units. For example, if a character string can be converted into a character string that reflects the linguistic features of the main language, the converted character string is extracted as a feature candidate. Substrings are obtained from each feature candidate, and the appearance frequency of each substring can be extracted as a feature amount.

Using the feature amounts extracted in this way makes it possible to determine the topic with the language of the Web page's main viewers taken into account. That is, extraction of feature amounts that are not meaningful as words in the language used on the Web page is suppressed, and erroneous topic determination can be prevented. As the topic determination method, the topic of the Web page may be determined from the feature amounts using a classifier that has learned whether or not a page belongs to a specific topic.

According to the present invention, features that take into account the language used by the main viewers can be built from the URL of a Web page, and appropriate topic determination specialized for that language becomes possible.

FIG. 1 is a configuration diagram of a Web page topic determination device according to an embodiment of the present invention.
FIG. 2 is a processing flow diagram of the language determination unit of the device.
FIG. 3 is a processing flow diagram of the feature amount extraction unit of the device.
FIG. 4 shows a processing example of the processing flow of FIG. 3.

A Web page topic determination device according to an embodiment of the present invention is described below. The device identifies the language used by the main viewers from the character composition of the URL, extracts feature amounts according to the identified language, and determines the topic of the Web page from the extracted feature amounts.

≪Configuration example≫
A configuration example of the topic determination device is described with reference to FIG. 1. Here, the topic determination device 1 is used by a search engine crawler program (spider, robot, and so on) that collects a set of Web pages covering a specific topic.

Specifically, the topic determination device 1 is built on a search engine's server group and has ordinary computer hardware resources, for example a CPU, memory (RAM), and a storage device such as a hard disk drive. Through the cooperation of these hardware resources with software resources (an OS, applications, and so on), the topic determination device 1 implements an input unit 10, a language determination unit 11, a feature amount extraction unit 12, a topic determination unit 13, and an output unit 14.

The input unit 10 receives the Web pages subject to topic determination, that is, the URL of each Web page collected by the crawler program. The URL is passed to the language determination unit 11, and topic determination of the Web page begins using the URL as the only material.

The language determination unit 11 takes the output of the input unit 10 as its input, identifies from the host name in the URL the country in which that host name is used, and determines the main language of that country. This main language is presumed to be the language used on the Web page, that is, the language of the page's main viewers. The main language and the URL are output to the feature amount extraction unit 12.

The feature amount extraction unit 12 takes the output of the language determination unit 11 as its input and extracts feature amounts from the URL in light of the characteristics of the main language. The URL is decomposed into character strings that serve as processing units, and feature amounts according to the main language are extracted from each character string. If a character string can be converted into one that reflects the linguistic features of the main language, the appearance frequency of the converted character strings is extracted as a feature amount. For example, if Japanese is identified as the main language, feature amounts can be extracted after romaji-to-kana conversion, kanji-to-kana conversion, and the like. The extracted feature amounts are output to the topic determination unit 13.

The topic determination unit 13 takes the output of the feature amount extraction unit 12 as its input and determines the topic of the Web page from the feature amounts. It uses a classifier that has learned in advance whether or not a page belongs to a specific topic. The topic of the Web page is determined according to whether the feature amounts fed to the classifier correspond to the topic it has learned. The determination result is output through the output unit 14 to the search engine or the like. The processing of units 11 to 13 is described in detail below.

≪Processing by the language determination unit 11≫
The processing of the language determination unit 11 is described in detail with reference to FIG. 2. The language determination unit 11 first obtains the host name (site name) from the URL of the input Web page and then starts the processing of FIG. 2. This processing is performed for each URL.

S01: Determine whether the host name contains a country code top-level domain. If it does not, proceed to S02. If it does, identify the country in which the host name is used from the country code, set that country as the target country of the Web page, and proceed to S03. For example, if the host name contains ".jp", Japan is set as the target country of the Web page.

S02: Check whether the country in which the host name is used can be identified by querying the whois system, a protocol for looking up the owner of a domain name on the Internet, for the host name (more precisely, for the domain name within the host name). If the country can be identified, set it as the target country of the Web page and proceed to S03. If it cannot be identified, end the processing.

S03: Determine the main language of the target country set in S01 or S02 using an official language dictionary created in advance, and end the processing. The official language dictionary only needs to list the main languages of each country. If several languages are listed for a country, each of them may be determined to be a main language of the target country.
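Steps S01 to S03 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the two lookup tables are small hypothetical excerpts, and `lookup_country_by_whois` is a placeholder that always fails, standing in for a real whois query.

```python
from urllib.parse import urlsplit

# Hypothetical excerpts; a real system would use complete tables.
CCTLD_TO_COUNTRY = {"jp": "JP", "fr": "FR", "ch": "CH"}
OFFICIAL_LANGUAGES = {"JP": ["ja"], "FR": ["fr"], "CH": ["de", "fr", "it", "rm"]}

def lookup_country_by_whois(hostname):
    """Placeholder for S02; a real implementation would query a whois service."""
    return None

def determine_languages(url):
    """Return the main languages presumed for the page, or [] if unknown."""
    hostname = urlsplit(url).hostname or ""
    tld = hostname.rsplit(".", 1)[-1].lower()
    country = CCTLD_TO_COUNTRY.get(tld)               # S01: country code TLD
    if country is None:
        country = lookup_country_by_whois(hostname)   # S02: whois fallback
    if country is None:
        return []                                     # country unknown: stop
    return OFFICIAL_LANGUAGES.get(country, [])        # S03: official language dictionary

print(determine_languages("http://example.co.jp/suitouchou/"))
```

Returning a list keeps the case where a country has several main languages, which S03 explicitly allows.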

≪Processing by the feature amount extraction unit 12≫
The processing of the feature amount extraction unit 12 is described in detail with reference to FIG. 3. The description covers the case where the language determination unit 11 has identified Japanese as the main language. This processing is also performed for each URL.

S11: Decompose the input URL into a plurality of tokens, that is, the character strings to be processed, at delimiters such as the symbols ".", "-", and "/". The processing from S12 onward is executed for each token.
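A minimal sketch of S11 follows. The description names only ".", "-", and "/" as delimiters; treating "?", "=", and "&" as delimiters and dropping the scheme are assumptions made here so that the output matches the token list given in the worked example later in this description.

```python
import re

def tokenize(url):
    """Split a URL into tokens at delimiter symbols (S11)."""
    without_scheme = url.split("://", 1)[-1]          # assumption: scheme is not a token
    tokens = re.split(r"[./\-?=&]+", without_scheme)  # "%" is kept so escapes stay intact
    return [t for t in tokens if t]

url = ("http://www.example.co.jp/ichirei.html"
       "?category=%e3%82%b5%e3%83%b3%e3%83%97%e3%83%ab")
print(tokenize(url))
```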

S12: Determine whether each token obtained in S11 is percent-encoded, that is, encoded by the scheme that expresses a character code in hexadecimal and converts it to the form "%xx" (where xx is a hexadecimal number).

Under RFC 3986, which defines URL syntax, non-ASCII characters and ASCII reserved characters in a URL are converted to the form "%xx". For example, the character "あ" written in Shift_JIS is converted to "%82%a0", and the character "い" to "%82%a2". Whether a token is percent-encoded is determined from this notation.
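The check in S12 and the Shift_JIS example above can be illustrated with the Python standard library. The function name `is_percent_encoded` is an assumption for this sketch; note that `urllib.parse.quote` emits uppercase hexadecimal digits, which are equivalent to the lowercase form used in the text.

```python
import re
from urllib.parse import quote

# A percent escape is "%" followed by exactly two hexadecimal digits.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def is_percent_encoded(token):
    """Return True if the token contains at least one %xx escape (S12)."""
    return PERCENT_ESCAPE.search(token) is not None

# Reproduce the Shift_JIS example from the description.
print(quote("あ", encoding="shift_jis").lower())
print(quote("い", encoding="shift_jis").lower())
print(is_percent_encoded("%82%a0"), is_percent_encoded("ichirei"))
```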

S13, S14: Tokens found in S12 to be percent-encoded are decoded. If the decoded character string contains katakana or kanji, it is converted to hiragana and the converted character string is obtained (S13). The obtained character string is registered as a feature candidate in a list (not shown) (S14).

A general-purpose tool such as that of Non-Patent Document 2 can be used for the decoding, and the kanji-to-hiragana conversion can use a kanji dictionary prepared in advance.
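As a rough sketch of S13, the decoding and the katakana part of the hiragana conversion can be written with the standard library. Two assumptions apply: the character encoding of the escapes is known (UTF-8 by default here), and kanji are left untouched because converting them requires a reading dictionary that this sketch does not include. The function names are illustrative.

```python
from urllib.parse import unquote

def katakana_to_hiragana(text):
    """Shift katakana (U+30A1..U+30F6) down by 0x60 to the matching hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )

def decode_token(token, encoding="utf-8"):
    """Percent-decode a token, then convert any katakana to hiragana (S13)."""
    return katakana_to_hiragana(unquote(token, encoding=encoding))

print(decode_token("%e3%82%b5%e3%83%b3%e3%83%97%e3%83%ab"))
```

The fixed offset works because the Unicode katakana block mirrors the hiragana block; characters outside that range, such as the prolonged sound mark, pass through unchanged.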

S15, S16: For tokens found in S12 not to be percent-encoded (non-percent-encoded tokens), romaji-to-kana conversion is applied to the token's character string. Here it is checked whether the token can be converted into a hiragana character string, that is, whether it can be expressed entirely in hiragana (S15).

If the token can be expressed entirely in hiragana, the converted hiragana character string is registered in the list as a feature candidate (S16). For example, if the token is "suitouchou", the hiragana character string "すいとうちょう" is registered as a feature candidate. A token that romaji-to-kana conversion cannot render entirely in hiragana is instead registered in the list as a feature candidate in its original alphabetic form (S16).
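The check in S15 can be sketched with a greedy longest-match conversion that gives up as soon as a position matches no syllable. The table below is a simplified romanization table built for this illustration and is an assumption; it omits the doubled-consonant (small "っ") rule and other refinements that a production converter would need.

```python
VOWELS = "aiueo"
ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ",
}
TABLE = {c + v: kana[i] for c, kana in ROWS.items() for i, v in enumerate(VOWELS)}
TABLE.update({"ya": "や", "yu": "ゆ", "yo": "よ", "wa": "わ", "wo": "を", "n": "ん",
              "shi": "し", "chi": "ち", "tsu": "つ", "fu": "ふ", "ji": "じ"})
# Contracted sounds such as "kya" or "cho": i-column kana plus small ya/yu/yo.
YOUON = {"ky": "き", "sh": "し", "ch": "ち", "ny": "に", "hy": "ひ", "my": "み",
         "ry": "り", "gy": "ぎ", "j": "じ", "by": "び", "py": "ぴ"}
for prefix, kana in YOUON.items():
    for vowel, small in zip("auo", "ゃゅょ"):
        TABLE[prefix + vowel] = kana + small

def romaji_to_hiragana(token):
    """Return the hiragana form, or None if the token is not fully convertible."""
    token = token.lower()
    out, i = [], 0
    while i < len(token):
        for n in (3, 2, 1):              # try the longest syllable first
            piece = token[i:i + n]
            if piece in TABLE:
                out.append(TABLE[piece])
                i += len(piece)
                break
        else:                            # no syllable matched at position i
            return None
    return "".join(out)

for tok in ("suitouchou", "ichirei", "example", "category"):
    print(tok, romaji_to_hiragana(tok))
```

A `None` result corresponds to the branch in S16 where the token is registered in its original alphabetic form.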

S17: Obtain the set of substrings extracted from all feature candidates registered in the list in S14 or S16. Count the appearance frequency (number of occurrences) of each substring within the set, and extract the resulting counts as feature amounts. The feature amounts are then output to the topic determination unit 13 and the processing ends. After this output the list is initialized and processing of the next URL begins.
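S17 amounts to counting character n-grams over the registered feature candidates. The following is a minimal sketch; the function name and the optional length bounds are assumptions, with the bounds 3 to 8 in the usage line borrowed from the worked example in this description.

```python
from collections import Counter

def substring_counts(candidates, min_len=1, max_len=None):
    """Count every contiguous substring of every feature candidate (S17)."""
    counts = Counter()
    for cand in candidates:
        upper = len(cand) if max_len is None else min(max_len, len(cand))
        for n in range(min_len, upper + 1):
            for i in range(len(cand) - n + 1):
                counts[cand[i:i + n]] += 1
    return counts

features = substring_counts(["example", "いちれい", "category", "さんぷる"], 3, 8)
print(features["exa"], features["xam"], features["amp"])
```

A substring that occurs in several candidates, or several times in one candidate, receives a count above 1, which is what makes the count usable as a feature amount.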

S15 and S16 as described determine whether a non-percent-encoded token can be converted into a hiragana character string by romaji-to-kana conversion and then use either the hiragana character string or the alphabetic character string, but not both. Alternatively, when the token can be converted to hiragana, both the original alphabetic character string and the hiragana character string may be used as feature candidates.

Likewise, S13 as described converts the decoded character string of a percent-encoded token to hiragana when it contains kanji or katakana. Alternatively, the decoded character string may be registered as a feature candidate as it is.

In addition, a constraint on character string length may be imposed on feature candidates and on their substrings, and character strings that appear far too frequently may be excluded in advance as stop strings.
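One way to apply these two filters to feature candidates is sketched below. The stop strings and the length range 3 to 8 are the values used in the worked example of this description; applying the length range to whole candidates, rather than only to their substrings, is one literal reading and should be treated as an assumption.

```python
STOP_STRINGS = {"www", "html"}   # overly frequent strings, excluded in advance
MIN_LEN, MAX_LEN = 3, 8          # length constraint on feature candidates

def filter_candidates(candidates):
    """Drop stop strings and candidates whose length is outside the range."""
    return [
        c for c in candidates
        if c not in STOP_STRINGS and MIN_LEN <= len(c) <= MAX_LEN
    ]

print(filter_candidates(
    ["www", "example", "co", "jp", "いちれい", "html", "category", "さんぷる"]
))
```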

≪Processing example of the feature amount extraction unit 12≫
A processing example of the feature amount extraction unit 12 is described below with reference to FIG. 4. Assume that the URL "http://www.example.co.jp/ichirei.html?category=%e3%82%b5%e3%83%b3%e3%83%97%e3%83%ab" has been input to the input unit 10 and that the language determination unit 11 has determined Japanese to be the main language. Assume also that a length constraint of 3 to 8 characters is set for feature candidates and their substrings, and that "www" and "html" are set in advance as stop strings.

First, when the URL and the main language are input to the feature amount extraction unit 12, the URL is tokenized in S11. As a result, the URL is decomposed into the tokens "www", "example", "co", "jp", "ichirei", "html", "category", and "%e3%82%b5%e3%83%b3%e3%83%97%e3%83%ab".

Next, the tokens "example", "category", and "ichirei" are found in S12 not to be percent-encoded, and processing proceeds to S15. Since "example" and "category" cannot be converted to hiragana by romaji-to-kana conversion, they are registered as feature candidates in their alphabetic form in S16. The token "ichirei", on the other hand, can be expressed as "いちれい" by romaji-to-kana conversion, so the hiragana character string "いちれい" is registered as a feature candidate in S16.

The token "%e3%82%b5%e3%83%b3%e3%83%97%e3%83%ab" is found in S12 to be percent-encoded. It is then decoded to "サンプル" in S13 and further converted to the hiragana "さんぷる". After this conversion, the hiragana character string "さんぷる" is registered as a feature candidate in S14. The tokens "www" and "html" are stop strings and are therefore excluded from the processing of S12 to S16. The tokens "co" and "jp" are likewise excluded by the length constraint.

As a result of S11 to S16, the character strings "example", "いちれい", "category", and "さんぷる" are registered as feature candidates. In S17, substrings of length 3 to 8 are obtained from each feature candidate, the appearance frequency of each substring is counted, and the feature amounts of the URL are extracted, for example "exa:1", "xam:1", and "amp:1". The "1" in each feature amount is the number of occurrences within the substring set.

≪Processing by the topic determination unit 13≫
The processing of the topic determination unit 13 is described in detail below. The topic determination unit 13 takes the feature amounts output by the feature amount extraction unit 12 as its input and outputs the result of determining the topic of the Web page. A classifier must be trained beforehand by machine learning for the topic to be determined. As an example, the case where "politics" is the topic to be determined is described.

This training requires preparing in advance a set of Web pages related to "politics" and a set of Web pages not related to "politics". The feature amounts that the feature amount extraction unit 12 obtains from the URLs of the "politics" pages are used as positive examples for training the binary classifier. Similarly, the feature amounts obtained from the URLs of the pages not related to "politics" are used as negative examples.

The feature amounts that the feature amount extraction unit 12 obtains for the Web page being processed are given as input to the classifier trained on these positive and negative examples, which determines whether the page has a topic related to "politics".
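The description does not name a learning algorithm, so the following sketch swaps in a multinomial naive Bayes classifier with add-one smoothing purely as an illustration. The training examples are made-up substring counts, not data from the embodiment, and the function names are hypothetical. Both classes must contain at least one example.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (Counter of substring counts, is_positive) pairs."""
    totals = {True: Counter(), False: Counter()}
    docs = Counter()
    for features, label in examples:
        totals[label].update(features)   # accumulate per-class substring counts
        docs[label] += 1
    vocab = set(totals[True]) | set(totals[False])
    return totals, docs, vocab

def log_odds(model, features):
    """Log-odds that the features belong to the positive class."""
    totals, docs, vocab = model
    score = math.log(docs[True] / docs[False])       # class prior
    for label, sign in ((True, 1), (False, -1)):
        denom = sum(totals[label].values()) + len(vocab)
        for feat, count in features.items():
            if feat in vocab:                        # ignore unseen substrings
                score += sign * count * math.log((totals[label][feat] + 1) / denom)
    return score

def is_on_topic(model, features):
    return log_odds(model, features) > 0

# Hypothetical training data: two positive ("politics") and two negative examples.
model = train([
    (Counter({"せいじ": 1, "せんき": 1}), True),
    (Counter({"せいじ": 1, "ぎかい": 1}), True),
    (Counter({"sho": 1, "hop": 1}), False),
    (Counter({"やきゅ": 1, "きゅう": 1}), False),
])
print(is_on_topic(model, Counter({"せいじ": 1})))
```

The log-odds value also serves as a confidence score, which is useful when several topic classifiers are compared.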

The determination result is output to the search engine through the output unit 14 and used, for example, by a classification algorithm for building a full-text index. Because the topic determination device 1 extracts feature amounts that reflect the characteristics of the main language through S11 to S17, it avoids extracting feature amounts from substrings that are undesirable in the language of the Web page's main viewers, and can perform appropriate topic determination specialized for that language.

Take the URL "http://example.co.jp/suitouchou/" as an example. Because the URL contains the country code top-level domain ".jp", Japanese is determined to be the main language. The token "suitouchou" obtained by decomposing this URL can be expressed as "すいとうちょう" by the romaji-to-kana conversion of S15, so the character string "すいとうちょう" is registered as a feature candidate in S16.

Consequently, the substring "suit" is not extracted as a feature amount as it would be in Non-Patent Document 1. Extraction of feature amounts that are not meaningful as words in the language of the Web page (Japanese in this case) is suppressed, and erroneous topic determination can be prevented.

The present invention is not limited to the above embodiment and can be modified as appropriate within the scope of the claims. For example, it is applicable not only when the language determination unit 11 identifies Japanese but also when it identifies another language. In that case, S13 and S15 are replaced with conversions suited to the identified language.

The processing of the topic determination unit 13 was described as a binary determination of whether or not a page is related to "politics". Alternatively, sets of Web pages for several topics to be determined (for example, sports and fashion) may be prepared in advance together with a binary classifier for each topic, and the topic whose classifier gives the highest classification confidence may be determined to be the topic of the Web page. A further alternative is not to fix a single topic but to attach every topic whose confidence is at or above a certain value to the Web page as metadata.
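Both variants reduce to simple operations on the per-topic confidence values. In the sketch below the confidence values and the threshold are made up for illustration, and the function names are hypothetical.

```python
def best_topic(confidences):
    """Pick the single topic whose classifier is most confident."""
    return max(confidences, key=confidences.get)

def topics_above(confidences, threshold):
    """Return every topic at or above the threshold, for use as metadata."""
    return sorted(t for t, c in confidences.items() if c >= threshold)

# Hypothetical confidences from three per-topic binary classifiers.
scores = {"politics": 0.2, "sports": 0.9, "fashion": 0.6}
print(best_topic(scores))
print(topics_above(scores, 0.5))
```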

≪Programs and the like≫
The present invention can also be configured as a Web page topic determination program that causes a computer to function as some or all of the units 10 to 14 of the topic determination device 1. This program makes it possible to have a computer execute some or all of S01 to S03 and S11 to S17.

The program can be provided over a network, for example through a Web site or by e-mail. The program can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, BD-ROM, BD-R, or BD-RE for storage and distribution. The recording medium is read using a recording medium drive, and the program code itself realizes the processing of the embodiment, so the recording medium also constitutes the present invention.

DESCRIPTION OF SYMBOLS
1 ... Web page topic determination device
10 ... Input unit
11 ... Language determination unit (language determination means)
12 ... Feature amount extraction unit (feature amount extraction means)
13 ... Topic determination unit (topic determination means)
14 ... Output unit

Claims

A Web page topic determination device for determining, based on a URL, the topic referred to by a Web page, comprising:
language determination means for identifying the user country of the host from the host name in the URL and determining the main language of the user country;
feature quantity extraction means for extracting feature quantities according to the main language specified by the language determination means from each character string obtained by decomposing the URL into arbitrary units; and
topic determination means for determining the topic of the Web page from the feature quantities extracted by the feature quantity extraction means, using a determiner that has learned whether or not a page belongs to a specific topic,
wherein the language determination means determines the main language of the user country based on an official language dictionary created in advance, and
the feature quantity extraction means extracts each converted character string as a feature candidate if the character string can be converted into a character string corresponding to a language feature of the main language,
obtains partial character strings from each feature candidate, and extracts the appearance frequency of each partial character string as a feature quantity.
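As a non-limiting illustration of the language determination and feature quantity extraction recited above, the following sketch looks up the main language from the country-code top-level domain of the host, decomposes the URL on common delimiters, and counts fixed-length partial character strings. The dictionary contents, the delimiter set, the substring length of 2, and all function names are assumptions made for this sketch only.

```python
import re
from collections import Counter
from typing import List, Optional
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for the official language dictionary,
# keyed by the country-code top-level domain of the host.
OFFICIAL_LANGUAGE = {"jp": "ja", "fr": "fr", "de": "de"}


def main_language(url: str) -> Optional[str]:
    """Identify the user country from the host name and look up its main language."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return OFFICIAL_LANGUAGE.get(tld)


def split_url(url: str) -> List[str]:
    """Decompose the host, path and query of a URL into character strings."""
    parts = urlsplit(url)
    text = f"{parts.hostname or ''}/{parts.path}/{parts.query}"
    # '%' is deliberately not a delimiter so percent-encoded strings stay intact.
    return [token for token in re.split(r"[/.?&=_\-+:]", text) if token]


def substring_counts(candidates: List[str], n: int = 2) -> Counter:
    """Count the appearance frequency of every length-n partial character string."""
    counts: Counter = Counter()
    for candidate in candidates:
        for i in range(len(candidate) - n + 1):
            counts[candidate[i:i + n]] += 1
    return counts
```

The resulting counter can serve as the feature vector passed to a trained determiner; which character strings become feature candidates depends on the language-specific conversion.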

The Web page topic determination device according to claim 1, wherein, if the language determined by the language determination means is Japanese, the feature quantity extraction means determines whether or not each character string is percent-encoded,
decodes each percent-encoded character string and, if the decoded character string contains katakana or kanji, uses the character string after hiragana conversion as a feature candidate, and
applies romaji-to-kana conversion to each character string that is not percent-encoded and, if the character string can be expressed entirely in hiragana, uses the converted character string as a feature candidate.
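The Japanese-specific handling recited above might be sketched as follows. This is an assumption-laden illustration, not the claimed implementation: the romaji table is deliberately tiny, kanji readings are supplied through a hypothetical `readings` dictionary rather than a real reading dictionary, and the function names are invented for the sketch.

```python
import re
from typing import Dict, Optional
from urllib.parse import unquote

PERCENT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")
KATAKANA_OR_KANJI = re.compile(r"[\u30a1-\u30f6\u4e00-\u9fff]")

# Assumption: a toy romaji table; a practical table covers every syllable.
ROMAJI = {"a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
          "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
          "sa": "さ", "shi": "し", "se": "せ", "ji": "じ", "n": "ん",
          "kyo": "きょ"}


def katakana_to_hiragana(text: str) -> str:
    """Shift katakana (U+30A1..U+30F6) down by 0x60 to the hiragana block."""
    return "".join(chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
                   for ch in text)


def romaji_to_hiragana(token: str) -> Optional[str]:
    """Greedy longest-match conversion; None if any part cannot become hiragana."""
    token, out, i = token.lower(), [], 0
    while i < len(token):
        for size in (3, 2, 1):
            chunk = token[i:i + size]
            if chunk in ROMAJI:
                out.append(ROMAJI[chunk])
                i += len(chunk)
                break
        else:
            return None  # not expressible entirely in hiragana
    return "".join(out)


def feature_candidate(token: str,
                      readings: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the hiragana feature candidate for one URL character string, or None."""
    if PERCENT_ENCODED.search(token):
        decoded = unquote(token)
        if not KATAKANA_OR_KANJI.search(decoded):
            return None
        for kanji, reading in (readings or {}).items():
            decoded = decoded.replace(kanji, reading)  # hypothetical reading lookup
        return katakana_to_hiragana(decoded)
    return romaji_to_hiragana(token)
```

In this sketch a string such as `html` yields no feature candidate because it cannot be written entirely in hiragana, while a romanized word and its percent-encoded katakana form map to the same hiragana string.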

The Web page topic determination device according to claim 1, wherein a character string to be excluded from the feature candidates or from the partial character strings can be set.

A method for determining a topic of a Web page executed by an apparatus for determining a topic referred to by a Web page based on a URL,
A language determination step of identifying a host user country from a host name in a URL and determining a main language in the user country;
A feature amount extraction step of extracting a feature amount according to the main language specified by the language determination means from each character string obtained by decomposing the URL into arbitrary units;
A topic determination step of determining the topic of the Web page from the feature amount extracted by the feature amount extraction means, using a determiner that has learned whether or not a page belongs to a specific topic,
The language determination step includes a step of determining a main language of the user country based on an official language dictionary created in advance,
The feature amount extraction step includes extracting each converted character string as a feature candidate if each character string can be converted into a character string corresponding to a language feature of a main language;
Obtaining a partial character string from each feature candidate and extracting the appearance frequency of each partial character string as a feature amount; and
A method for determining the topic of a Web page, comprising:

The method for determining the topic of a Web page according to claim 4, wherein, if the language determined by the language determination means is Japanese, the feature amount extraction step comprises:
a step of determining whether or not each character string is percent-encoded;
a step of decoding each percent-encoded character string and, if the decoded character string contains katakana or kanji, using the character string after hiragana conversion as a feature candidate; and
a step of applying romaji-to-kana conversion to each character string that is not percent-encoded and, if the character string can be expressed entirely in hiragana, using the converted character string as a feature candidate.

The method for determining the topic of a Web page according to claim 4, wherein the feature amount extraction step excludes a character string that matches a preset condition from the feature candidates or the partial character strings.

A Web page topic determination program for causing a computer to function as the Web page topic determination device according to any one of claims 1 to 3.