JP2006106911A

JP2006106911A - Knowledge information collecting system, knowledge information collecting method, and program

Info

Publication number: JP2006106911A
Application number: JP2004289447A
Authority: JP
Inventors: Naomi Nakaya; 尚美中矢
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2004-09-30
Filing date: 2004-09-30
Publication date: 2006-04-20
Anticipated expiration: 2024-09-30
Also published as: JP4047850B2

Abstract

PROBLEM TO BE SOLVED: To collect only information that is useful to users.

SOLUTION: A starting point URL, the number of link stages from which information is to be collected, and keywords representing unnecessary words are set in a setting file 13. A link label extraction module 113 extracts link character strings from page information collected from a network (the Internet/intranet) 20. A link determination module 114 determines, from the extracted link character strings and the set keywords representing unnecessary words, whether the page information of a link destination is useless. A collection control module 111 controls the collection of information from the network 20 by following links from the starting point URL. The collection control module 111 does not collect the page information of a link destination that the link determination module 114 has determined to be useless, even if that information is within the set number of stages.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a knowledge information collection system, a knowledge information collection method, and a program suitable for collecting, from a network, information to be registered in a knowledge database used in a knowledge management system.

In recent years, knowledge management systems that support the sharing of knowledge information have been developed. Such a system accumulates and manages knowledge information, such as personal know-how, in a knowledge database. Combined with a search function such as natural language search, it enables efficient use of the accumulated knowledge information.

In such a knowledge management system, how efficiently knowledge information can be collected is a key point. Recently, therefore, knowledge information collection systems have been developed that efficiently collect, as knowledge information, various types of document files in different file formats from a network such as the Internet (see, for example, Patent Document 1). In the knowledge information collection system described in Patent Document 1, document information is collected from the Internet as follows, according to the conditions for knowledge information collection (knowledge information collection conditions) set in a setting file.

First, the knowledge information collection system has a setting file and a Web collection module. In the setting file, at least one of the number of link stages subject to information collection and the number of files to be collected per link is set, together with a starting point URL (Uniform Resource Locator), by the operation of a user (for example, a management user who is an administrator). When the starting point URL and the number of link stages are set in the setting file, the Web collection module follows all links from the starting point URL, within a range not exceeding the set upper limit of link stages, and collects document information (page information) from the Internet. When the number of files to be collected per link is set in addition to the starting point URL and the number of link stages, the Web collection module follows all links from the starting point URL and collects document information from the Internet within a range that exceeds neither the set upper limit of link stages nor the set number of files to be collected per link.
JP 2003-303197 A (paragraphs 0008, 0010, 0086 to 0089)

As described above, according to the knowledge information collection technique described in Patent Document 1 (hereinafter referred to as the prior art), the setting file is used to specify, as desired, items such as the number of link stages subject to information collection from the network. Document information is then collected by following all links from the starting point URL within a range not exceeding the specified upper limit of link stages.

However, because the prior art collects all information from the specified URL within the specified number of link stages regardless of the content of the document information (page information) being collected, the collected information may include a large amount that is useless (unnecessary) to the user. The prior art also has the problem that information useful (important) to the user is not collected if it lies at a link beyond the specified number of stages.

The present invention has been made in view of the above circumstances, and its object is to provide a knowledge information collection system, a knowledge information collection method, and a program capable of efficiently collecting only information that is useful to the user.

According to one aspect of the present invention, there is provided a knowledge information collection system that collects, from a network, information to be registered in a knowledge database. The system comprises: setting means for setting starting point location information indicating the location of the page information that serves as the starting point of information collection from the network and the number of link stages subject to information collection from the network, and for setting, as keywords representing unnecessary words, words and phrases related to links that should be excluded from information collection from the network; link character string extraction means for extracting link character strings from page information collected from the network; link determination means for determining, from the extracted link character strings and the set keywords representing unnecessary words, whether the page information of a link destination is useless; and information collection control means for controlling information collection from the network by following links from the set starting point location information, the information collection control means excluding from collection the page information of a link destination determined to be useless by the link determination means, even if it is within the set number of link stages.

In the above configuration, whether a linked page is useless is determined from the link character strings in a page and the set keywords representing unnecessary words, and a page determined to be useless is excluded from collection even if it is within the preset number of link stages. This prevents useless information that should not be registered in the knowledge database from being collected merely because it lies within the preset number of link stages, so that only useful information that should be registered in the knowledge database is collected more efficiently.
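The exclusion rule described above can be illustrated with a minimal Python sketch. The function name `should_collect`, its parameters, and the simple substring test are assumptions made for illustration; the specification does not prescribe a particular matching method.

```python
def should_collect(link_label, link_depth, max_depth, unnecessary_words):
    """Decide whether a link destination is collected (illustrative only).

    link_label        -- link character string extracted from the page
    link_depth        -- number of link stages from the starting point page
    max_depth         -- number of link stages set in the setting file
    unnecessary_words -- keywords representing unnecessary words
    """
    # Destinations beyond the set number of stages are not collected.
    if link_depth > max_depth:
        return False
    # Within range, a destination is still skipped when its link label
    # contains any unnecessary word (substring match is an assumption).
    return not any(word in link_label for word in unnecessary_words)
```

For example, with `unnecessary_words = ["Site map"]`, a link labeled "Site map" at stage 1 is skipped even though stage 1 is within a limit of 2.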

Here, the setting means may be given a function of setting, as keywords representing important words, words and phrases related to links that should be subject to information collection from the network, instead of setting words and phrases related to links that should be excluded as keywords representing unnecessary words. Likewise, the link determination means may be given a function of determining, from the extracted link character strings and the set keywords representing important words, whether the page information of a link destination is useful, instead of determining whether it is useless. In this case, the information collection control means is preferably given a function of making the page information of a link destination determined to be useful subject to collection even if it lies beyond the set number of stages.

In such a configuration, whether a linked page is useful is determined from the link character strings in a page and the set keywords representing important words, and a page determined to be useful is made subject to collection even if it lies beyond the preset number of link stages. This prevents useful information that should be registered in the knowledge database from going uncollected merely because it lies outside the preset number of link stages, so that only useful information that should be registered in the knowledge database is collected more efficiently.

The system may further be configured with the following added: search means for searching the information collected in the knowledge database according to a given search expression and presenting the search results to a user; means for having the user evaluate the usefulness or uselessness of information referenced in accordance with the search results; search log storage means for storing, as a search log, the search expressions used by the search means, the number of times each piece of information collected in the knowledge database has been referenced, and the user's evaluation result for each piece of information; log statistics generation means for analyzing the words and phrases appearing in the search expressions stored in the search log storage means, the reference count for each piece of information, and the evaluation result for each piece of information, and generating, for each word or phrase appearing in the search expressions, an evaluation value representing the degree to which that word or phrase is important or unnecessary to the user; and keyword generation means for generating, on the basis of the generated evaluation values, keywords representing important words or unnecessary words that can be set by the setting means.

In such a configuration, an evaluation value representing the degree to which a word or phrase is important or unnecessary to the user is obtained, for each word or phrase appearing in the search expressions, from the words and phrases appearing in search expressions used in past searches, the reference count for each piece of information, and the evaluation result for each piece of information. From these per-word evaluation values, that is, from an evaluation of how the search system has been used (the search log), words and phrases to be used as keywords representing important words or unnecessary words are automatically extracted. If keywords representing important words are automatically extracted, the user is spared the effort of setting important keywords. If keywords representing unnecessary words are automatically extracted, the user is spared the effort of setting keywords related to links that should be excluded from information collection. Moreover, unlike keywords related to links that should be subject to collection (important keywords), keywords related to links that should be excluded (unnecessary keywords) are hard to identify until the user has actually looked at information that was once collected. Automatic extraction of keywords representing unnecessary words is therefore extremely useful to the user.

Here, instead of having the setting means unconditionally set the automatically extracted keywords representing important words or unnecessary words, a list of those automatically extracted keywords may be presented to the user, and the user may select from the list the keywords to be set by the setting means.

In such a configuration, the user can judge personally whether an automatically extracted keyword representing an important word or an unnecessary word is really important or unnecessary.

According to the present invention, whether a linked page is useless is determined from the link character strings in a page and the set keywords representing unnecessary words, and a page determined to be useless is excluded from collection even if it is within the preset number of link stages. This prevents useless information that should not be registered in the knowledge database from being collected merely because it lies within the preset number of link stages.

Further, according to the present invention, whether a linked page is useful is determined from the link character strings in a page and the set keywords representing important words, and a page determined to be useful is made subject to collection even if it lies beyond the preset number of link stages. This prevents useful information that should be registered in the knowledge database from going uncollected merely because it lies outside the preset number of link stages.

Therefore, according to the present invention, only information that is more useful to the user can be collected efficiently.

An embodiment of the present invention will now be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a knowledge management system that realizes a knowledge information collection system according to an embodiment of the present invention. This knowledge management system provides services such as the collection, analysis, and search of knowledge information. The knowledge management system includes a Web information collection system 11, a knowledge search system 12, a setting file 13, a keyword generation module 14, and a synonym dictionary 15. Together, the Web information collection system 11, knowledge search system 12, setting file 13, keyword generation module 14, and synonym dictionary 15 constitute a knowledge information collection system for collecting knowledge information, which is one of the services provided by the knowledge management system.

The Web information collection system 11 collects Web information (page information) useful to the user from a Web server 21 on a network, for example the Internet/intranet 20, extracts the text portion, and stores that text portion in a knowledge database (knowledge DB) 121 described later. The Web information collection system 11 consists of a collection control module 111, a text extraction module 112, a link label extraction module 113, and a link determination module 114.

The collection control module 111 controls the collection of useful page information from the Internet/intranet 20 according to the contents of the setting file 13. The text extraction module 112 extracts text from the page information collected under the control of the collection control module 111 and stores it in the knowledge DB 121. The link label extraction module 113 extracts, from the collected page information, link labels, which are the link character strings in a page. The link determination module 114 determines, from the extracted link labels and the contents of the setting file 13, whether to collect the linked page. The determination result is passed to the collection control module 111, which uses it to control page information collection.

The knowledge search system 12 consists of the knowledge DB 121, a search engine 122, a search log 123, and a log statistics generation module 124. The knowledge DB 121 is used to store the document information (text) that the text extraction module 112 extracts from the page information collected by the collection control module 111 in the Web information collection system 11. The search engine 122 searches the knowledge DB 121 for document information (text) that matches the search expression (search sentence, search condition) indicated by a search request entered from a Web browser 16 in response to an operation by a user 101, and presents the search results to the user 101 via the Web browser 16. The search engine 122 also presents to the user 101 the document information selected from the presented search results, so that the user 101 can refer to the desired document information. The search log 123 is used to store, as statistical information on searches, the history (log) of information search and reference by the search engine 122, for example the search expressions used for searches, the number of times search results were referenced, and the user's evaluation result for each piece of referenced document information. The log statistics generation module 124 statistically analyzes, according to the search log 123, the words and phrases appearing in the search expressions, the reference counts of the search results, and the user's evaluation results for each piece of referenced information. It thereby generates, for each word or phrase appearing in the search expressions, the information (log statistical information) needed to determine whether that word or phrase is important to the user, unnecessary, or neither.

The setting file 13 sets and holds the conditions for Web information collection (knowledge information collection conditions), such as the starting point URL, the number of link stages (initial number of stages) n subject to information collection from the Internet/intranet 20, the upper limit on the number of collected pages, important words, and unnecessary words. In the present embodiment, the representative words registered in the synonym dictionary 15 are used as the important words and unnecessary words set in the setting file 13. The setting file 13 is generated according to a setting request entered from a Web browser 17 in response to an operation by a user, for example a management user (administrator) 102. FIG. 1 shows one setting file 13. However, a plurality of setting files 13 may be generated, with the user designating any one of them so that Web information is collected according to the conditions set in the designated setting file 13.
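As a rough illustration of the collection conditions listed above, the following Python sketch models one setting file as a data class. The class name, field names, and the example URL are hypothetical; the specification does not define a file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CollectionSettings:
    """Hypothetical in-memory form of one setting file 13."""
    start_url: str                      # starting point URL
    max_pages: int                      # upper limit on collected pages
    initial_depth: int = 2              # initial number of link stages n
    important_words: List[str] = field(default_factory=list)
    unnecessary_words: List[str] = field(default_factory=list)


# Example values are placeholders, not taken from the specification,
# except the representative word "価格" (price) used in the embodiment.
settings = CollectionSettings(
    start_url="http://www.example.com/",
    max_pages=100,
    important_words=["価格"],
)
```

Keeping each setting file as a separate object of this kind matches the variant in which several setting files exist and the user designates one of them.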

The starting point URL is location information indicating the location of the page information that serves as the starting point when the collection control module 111 controls information collection from the Internet/intranet 20. Important words and unnecessary words are keywords used to determine, from the link labels (link character strings) in page information, whether the page information of a link destination is useful or useless. The keyword generation module 14 generates important words and unnecessary words as keywords on the basis of the statistical information on searches (log statistical information) generated by the log statistics generation module 124. The synonym dictionary 15 is used for keyword generation by the keyword generation module 14. In the synonym dictionary 15, sets of words and phrases with similar meanings are registered in advance as synonyms. Each set of synonyms is associated with a representative word. For example, "価格" (price), "値段" (cost), "定価" (list price), and "料金" (charge) are registered in the synonym dictionary 15 as members of a synonym group whose representative word is "価格" (price).

Each module in the Web information collection system 11, the search engine 122 and the log statistics generation module 124 in the knowledge search system 12, and the keyword generation module 14 are realized when a computer (the CPU in the computer) reads and executes a special software program (knowledge management program) installed on that computer. This program can be stored in advance on a computer-readable storage medium (a magnetic disk such as a floppy (registered trademark) disk, an optical disc such as a CD-ROM or DVD, a semiconductor memory such as a flash memory, and so on) and distributed. The program may also be downloaded (distributed) via a network.

Next, the Web information collection processing in the knowledge management system of FIG. 1 will be described with reference to the flowcharts of FIGS. 2 and 3.
First, assume that the management user 102 wants to use the knowledge management system of FIG. 1 (specifically, the Web information collection system 11 within it) to collect, from a Web server on the Internet/intranet 20, Web information (page information) to be registered in the knowledge DB 121. In this case, the management user 102 operates input means (not shown) such as a keyboard to call the Web information collection system 11. The collection control module 111 in the Web information collection system 11 then requests the management user 102 to perform an input operation to designate a setting file 13. When a setting file 13 is designated by the input operation of the management user 102, the collection control module 111 uses the contents of the designated setting file 13 as the Web information collection conditions and collects the corresponding Web information from the Internet/intranet 20 as follows.
First, the collection control module 111 in the Web information collection system 11 sets the number of link stages Z subject to information collection, counted from the starting point page, to the initial value (initial number of stages) n (step S1). In the present embodiment, the initial value n is assumed to be 2. The URL of the starting point page (starting point URL) and the initial value n of the number of stages Z are set in the setting file 13 designated by the management user 102. Next, the collection control module 111 collects the starting point page (its page information) using, for example, HTTP (Hyper Text Transfer Protocol) (step S2).

Next, the collection control module 111 checks whether any collected page remains unprocessed (step S3). If so, it selects one unprocessed collected page (step S4). The text extraction module 112 extracts text from the page (its page information) selected by the collection control module 111 (step S5).

The link label extraction module 113 checks whether any unprocessed link exists in the text extracted by the text extraction module 112 (step S6). If so, the link label extraction module 113 selects one unprocessed link (step S7).

The link label extraction module 113 determines the type of the selected link (step S8). If the link is a character link, the link label extraction module 113 extracts the character string (link character string) enclosed by the character link (the tags <A> and </A>) as the link label (step S9). If the link is an image link, the link label extraction module 113 extracts the character string (link character string) of the alt property within the tags <A> and </A> as the link label (step S10).
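Steps S8 to S10 can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the class name `LinkLabelParser`, the helper `extract_link_labels`, and the rule that anchor text takes precedence over an image's alt text when both are present are not taken from the specification.

```python
from html.parser import HTMLParser


class LinkLabelParser(HTMLParser):
    """Collects (href, label) pairs from anchor elements (illustrative)."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, link label)
        self._href = None    # href of the anchor being read, if any
        self._text = []      # text found inside the current anchor
        self._alt = []       # alt strings of images inside the anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text, self._alt = [], []
        elif tag == "img" and self._href is not None and attrs.get("alt"):
            self._alt.append(attrs["alt"])   # image link: use alt text

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)          # character link: use text

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            label = text if text else " ".join(self._alt)
            self.links.append((self._href, label))
            self._href = None


def extract_link_labels(html):
    parser = LinkLabelParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because `HTMLParser` lowercases tag and attribute names, upper-case markup such as `<A HREF=...>` is handled in the same way as lower-case markup.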

The link label extraction module 113 repeats the operation of extracting the link character string (link label) (steps S7, S8, and S9, or steps S7, S8, and S10) for every link contained in the page selected in step S4 (step S6). When the extraction has been performed for all links in the selected page, control passes from the link label extraction module 113 to the link determination module 114.

The link determination module 114 then determines whether, among all the link character strings extracted from the page selected in step S4, there is one that contains an important word set in the setting file 13 (step S11). In the determination of step S11, the synonym dictionary 15 is referred to, and synonyms of the important words set in the setting file 13 are also treated as important words. For example, if the important word set in the setting file 13 is "価格" (price) as above, its synonyms "値段" (cost), "定価" (list price), and "料金" (charge) are treated as if they were also set in the setting file 13 as important words. Alternatively, "値段", "定価", and "料金" may be set as important words together with "価格" when the setting file 13 is generated. In that case, the synonym dictionary 15 need not be referred to in the determination of step S11.
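The synonym-aware check of step S11 might look like the following Python sketch. The dictionary layout (representative word mapped to its synonyms), the function names, and the use of substring matching are assumptions for illustration; only the example words come from the text.

```python
# Hypothetical layout for the synonym dictionary 15:
# representative word -> other members of its synonym group.
SYNONYM_DICT = {
    "価格": ["値段", "定価", "料金"],
}


def expand_keywords(keywords, synonym_dict):
    """Return the keywords together with all of their registered synonyms."""
    expanded = set()
    for keyword in keywords:
        expanded.add(keyword)
        expanded.update(synonym_dict.get(keyword, ()))
    return expanded


def contains_keyword(link_label, keywords, synonym_dict):
    """True if the link label contains a keyword or one of its synonyms."""
    terms = expand_keywords(keywords, synonym_dict)
    return any(term in link_label for term in terms)
```

With "価格" as the only important word, a link labeled "料金のご案内" is treated as containing an important word, because "料金" belongs to the same synonym group.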

If a link character string containing an important word exists (step S11), the link determination module 114 determines whether the number of link collection stages Z currently set for the corresponding link sequence exceeds n, the initial number of stages (step S12). If Z does not exceed n, that is, if Z equals n, the link determination module 114 updates Z to Z + m, that is, to n + m stages, which is m more than the initial number n (step S13). If Z exceeds n, the link determination module 114 determines whether the number of link stages from the starting page to the link destination is within the currently set number of stages Z (step S14). If the number of link stages to the link destination exceeds Z, Z is updated to Z + 1, one more than the current number of stages (step S15).
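The update of Z in steps S12 to S15 reduces to a small function. The sketch below is illustrative only; the name update_stage_limit and the parameter depth_to_destination (the number of link stages from the starting page to the link destination) are assumptions, and the function is meant to be called only for a page already found in step S11 to contain an important-word link.

```python
def update_stage_limit(z, n, m, depth_to_destination):
    """Return the new stage limit Z for a page that has an important-word link."""
    if z <= n:
        # S12 -> S13: Z has not exceeded the initial value n, so extend by m.
        return z + m
    if depth_to_destination > z:
        # S14 -> S15: the destination lies beyond Z, so extend by one stage.
        return z + 1
    # S14: the destination is within Z, so Z is left unchanged.
    return z
```

With n = 2 and m = 2, the first important-word link raises Z from 2 to 4, and a later important-word link on a fourth-stage page raises it from 4 to 5.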

After executing step S13 or S15, the link determination module 114 proceeds to step S16. If the number of link stages to the link destination is within Z (step S14), the link determination module 114 proceeds to step S16 without updating Z. In step S16, the link determination module 114 checks whether any unprocessed link character string remains among the link character strings extracted from the page selected in step S4. If, as in this example, an unprocessed link character string exists, the link determination module 114 selects one of them (step S17).

The link determination module 114 determines whether the character string (link label) of the link selected in step S17 contains an important word set in the setting file 13 (step S18). In this determination as well, the synonym dictionary 15 is consulted, and synonyms of the important words set in the setting file 13 are treated as important words.

If the link character string contains an important word (step S18), the link determination module 114 decides that the link destination page is to be collected, regardless of the current number of stages Z for the link sequence containing the corresponding page (the page selected in step S4). The collection control module 111 then collects the link destination page (step S21). The page (page information) collected by the collection control module 111 is stored in the knowledge DB 121 paired with the label of the link source (the link character string).

If, on the other hand, the link character string contains no important word (step S18), the link determination module 114 determines whether the character string contains an unnecessary word set in the setting file 13 (step S19). In this determination, as in steps S11 and S18, the synonym dictionary 15 is consulted, and synonyms of the unnecessary words set in the setting file 13 are also treated as unnecessary words. If the link character string contains an unnecessary word (step S19), the link determination module 114 decides that the link destination page is not to be collected, regardless of the current number of stages Z for the link sequence containing the corresponding page. In this case, the collection control module 111 refrains from collecting the link destination page (step S23). In this embodiment, when a single link contains both an important word and an unnecessary word, the important word takes precedence, as is clear from the flowchart of FIG. 3 (step S18), and the link is treated as a “link containing an important word.”

If the link character string contains no unnecessary word (step S19), that is, if it contains neither an important word nor an unnecessary word (steps S18, S19), the link determination module 114 determines whether the number of link stages from the starting page to the link destination is within the currently set number of stages Z (step S20). Only when it is within Z does the link determination module 114 request the collection control module 111 to collect the link destination page, and the collection control module 111 collects the page in response to that request (step S21). If instead the number of link stages from the starting page to the link destination exceeds Z (step S20), the link determination module 114 requests the collection control module 111 not to collect the link destination page, and the collection control module 111 refrains from collecting it (step S23). In this case, control passes from the collection control module 111 to the link determination module 114, and the determination of step S16 is performed.
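The three-way decision of steps S18 to S20 can be summarized in a few lines. The following is a sketch under assumptions: the function name should_collect, the substring matching rule, and the convention that the keyword sets already include synonyms are all hypothetical.

```python
def should_collect(label, depth_to_destination, z, important, unnecessary):
    """Decide whether the page behind one link is collected."""
    if any(word in label for word in important):
        return True                      # S18: collect regardless of Z
    if any(word in label for word in unnecessary):
        return False                     # S19: skip regardless of Z
    return depth_to_destination <= z     # S20: collect only within Z
```

Because the important-word test comes first, a label containing both kinds of keyword is treated as a link containing an important word, as the embodiment specifies.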

When the collection control module 111 has collected the link destination page (page information) in step S21, it determines whether the total number of collected pages for the corresponding link has exceeded the set upper limit on the number of collected pages (step S22). If the total has not exceeded the upper limit, control passes from the collection control module 111 to the link determination module 114, and the determination of step S16 is performed.

In this way, for every link character string extracted from the page selected in step S4, the link determination (whether the string contains an important word, an unnecessary word, or neither) and the collection control (collecting or not collecting the link destination page according to the result) are executed. Then, unless the total number of collected pages has exceeded the set upper limit, the collection control module 111 performs the determination of step S3 again. If an unprocessed collected page remains, the processing starting from step S4 is performed on that page. That is, the processing starting from step S4 is repeated until the total number of collected pages exceeds the upper limit (step S22) or until all links on the collected pages have been processed (step S3).

Thus, in this embodiment, when a link character string (link label) in a page contains an unnecessary word (step S19), the link destination page is likely to be of no use to the user. Collection of that page is therefore suppressed (step S23) regardless of the set number of stages Z at that time and the number of link stages to the link destination, that is, even if the link destination is within the range of Z.

Further, when the current number of link collection stages Z is the initial number n and a page in the corresponding link sequence (that is, a page within n stages) contains a link with an important word (steps S11, S12), the number of stages Z for the link sequence leading to that page is increased from Z = n to Z = Z + m = n + m (step S13). Similarly, when Z already exceeds n and the Z-th stage page in the corresponding link sequence contains a link with an important word, that is, when the number of stages to the link destination exceeds Z (steps S11, S12, S14), the number of stages Z for the link sequence leading to that Z-th stage page is increased by 1 (step S15). As a result, when a link character string (link label) in a page contains an important word (step S18), the link destination page is likely to be useful to the user and is therefore collected even if the link destination lies beyond the initial number of stages n (step S21).

Also, in this embodiment, when a link label extracted from a page contains neither an important word nor an unnecessary word (steps S18, S19), the link destination page is collected (step S21) only if the number of link stages to the link destination is within the number of stages Z set for the corresponding link collection at that time (step S20). In other words, when the link label contains neither an important word nor an unnecessary word (steps S18, S19) and the number of link stages to the link destination is not within Z at that time (step S20), the link destination page is not collected (step S23).

Next, a specific example of the Web information collection processing described above is explained with reference to FIG. 4, which shows an example of link determination and page collection. FIG. 4 shows the starting page Ps specified by the starting point URL set in the setting file 13. Page Ps contains links to the first-stage pages P11, P12, and P13. The links (labels) to pages P11, P12, and P13 contain the words denoted by the symbols ○, △, and ×, respectively. In the setting file 13, the words denoted by ○ and × are set as an important word and an unnecessary word, respectively. For simplicity, the following description treats ○ as the important word, × as the unnecessary word, and △ as a word that is neither. In this case, among the first-stage pages P11, P12, and P13, page P13, reached through the link containing ×, is excluded from collection even though it is within the set number of stages Z = 2 (n = 2) at that time. Page P12, reached through the link containing △, is collected because it is within Z = 2 (n = 2). Page P11, reached through the link containing ○, is collected regardless of Z.

Page P11 links to the second-stage pages P210 and P211 through links corresponding to ○ and △, respectively. In this case, the number of stages Z to be collected for the link sequence that starts at page Ps and contains page P11 is increased from 2 (n) to 2 + 2 (n + m). As a result, although the second-stage page P210 and the third-stage page P310 linked from it each contain only a link containing △, the third-stage page P310 and the fourth-stage page P410 at the ends of those links are collected. Page P410 contains only a link containing △. The fifth-stage page P510 at the end of that link exceeds the number of stages Z = 4 at that time and is therefore excluded from collection. In this embodiment, the number of stages to be collected for the link sequence containing page P11 is increased only once.

Meanwhile, the third-stage page P311 linked from page P211 contains a link containing △ and a link containing ○, and the fourth-stage pages P411 and P412 at the ends of those links are both collected. Page P411 contains only a link containing △. The fifth-stage page P511 at the end of that link exceeds the number of stages Z = 4 at that time and is therefore excluded from collection.

Page P412, by contrast, contains a link containing △ and a link containing ○'. ○' is a synonym of ○. That is, page P412 contains a link containing an important word. Page P412 is at the fourth stage from the starting page Ps, so the number of stages to its link destinations exceeds the number of stages Z = 4. However, because page P412 contains a link containing an important word, Z is increased by 1 to Z = 5, and the fifth-stage pages P513 and P514 linked from page P412 are collected. Page P513 contains only a link containing △. The sixth-stage page P613 at the end of that link exceeds the number of stages Z = 5 at that time and is therefore excluded from collection. Page P514, on the other hand, contains a link containing ○, so the sixth-stage page P614 at the end of that link is collected.
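The example of FIG. 4 can be traced with a small simulation. The sketch below rests on several assumptions that the description does not state: Z is carried along each link sequence and inherited by the pages reached through it, the label of the link from P211 to P311 (not given in the text) is taken to be △, the links on P412 are assigned as △ to P513 and ○' to P514, labels are compared as whole symbols, and the upper limit on collected pages (step S22) is omitted. The names LINKS and crawl are hypothetical.

```python
from collections import deque

IMPORTANT = {"○", "○'"}   # ○' stands for a synonym of ○
UNNECESSARY = {"×"}

# page -> list of (link label, destination page), following FIG. 4
LINKS = {
    "Ps": [("○", "P11"), ("△", "P12"), ("×", "P13")],
    "P11": [("○", "P210"), ("△", "P211")],
    "P210": [("△", "P310")],
    "P310": [("△", "P410")],
    "P410": [("△", "P510")],
    "P211": [("△", "P311")],   # label assumed; not given in the text
    "P311": [("△", "P411"), ("○", "P412")],
    "P411": [("△", "P511")],
    "P412": [("△", "P513"), ("○'", "P514")],
    "P513": [("△", "P613")],
    "P514": [("○", "P614")],
}


def crawl(start, n, m):
    """Breadth-first collection with a per-sequence stage limit Z."""
    collected = [start]
    queue = deque([(start, 0, n)])          # (page, depth, Z for this sequence)
    while queue:
        page, depth, z = queue.popleft()
        links = LINKS.get(page, [])
        dest_depth = depth + 1
        if any(label in IMPORTANT for label, _ in links):   # S11
            if z <= n:                                      # S12 -> S13
                z += m
            elif dest_depth > z:                            # S14 -> S15
                z += 1
        for label, dest in links:
            if label in IMPORTANT:                          # S18
                collect = True
            elif label in UNNECESSARY:                      # S19
                collect = False
            else:                                           # S20
                collect = dest_depth <= z
            if collect and dest not in collected:
                collected.append(dest)
                queue.append((dest, dest_depth, z))
    return collected
```

Under these assumptions, crawl("Ps", n=2, m=2) collects the pages the text describes as collected and leaves out P13, P510, P511, and P613. In this sketch Z is already raised to 4 while Ps is processed, because Ps itself contains the ○ link; the set of collected pages is the same as in the description.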

In the embodiment above, when a link containing an important word exists in a page within n (= 2) stages of the starting page and the number of stages Z set for the link sequence containing that link does not exceed n (that is, Z equals the initial number n), Z is increased by m (= 2). However, a configuration is also possible in which Z is increased by 1 whenever a link containing an important word exists in a page at any stage and the number of stages Z set for the link sequence containing that link is n or more.

Next, the operation of the knowledge search system 12 is described, focusing on the log statistics generation module 124.

When the search engine 122 in the knowledge search system 12 receives, via the Web browser 16, a search request made by an operation of the user 101, it searches the knowledge DB 121 for document information (page information) that matches the search expression indicated by the request. At this time, the search engine 122 saves the search expression in the search log 123.

When the search engine 122 has retrieved the document information matching the search expression, it generates image information for a list of the search results and presents it to the user 101 via the Web browser 16. The list of search results includes, for example, the ID (information ID) of each item of document information that matches the search expression. The list may also include a summary of each retrieved item of document information. The user 101 selects, from the list of search results, the information ID of the document information he or she wishes to view. When the search engine 122 detects that an information ID has been selected from the list, it reads the document information indicated by the selected ID from the knowledge DB 121 and presents it to the user 101 via the Web browser 16.

The search log 123 includes a log table 123a having the data structure shown in FIG. 5. For each item of document information (page information) stored in the knowledge DB 121, an entry of the log table 123a has fields for the ID of the document information (information ID), the label of the link source (link character string) stored in the knowledge DB 121 paired with that document information, the number of times the document information has been referenced by users (reference count), and the users' evaluation results for the document information. The evaluation result field consists of an important evaluation count, incremented by 1 when the document information was judged important (useful), and an unnecessary evaluation count, incremented by 1 when it was judged unnecessary (useless).

In this embodiment, when document information (page information) is stored in the knowledge DB 121, entry information containing the ID of the document information and the label of the link source (link character string) stored in the knowledge DB 121 paired with that document information is generated and stored in the log table 123a. At this time, the reference count, the important evaluation count, and the unnecessary evaluation count in the entry information are all initialized to 0.

The reference count in an entry of the log table 123a is incremented by 1 by the search engine 122 when the user 101 selects the ID of the corresponding document information from the list of search results and views the document information indicated by that ID. When document information is viewed, the search engine 122 also asks the user 101 to enter an evaluation of whether the document information was important (useful) or unnecessary (useless). If “important” is entered (selected), the important evaluation count in the corresponding entry of the log table 123a is incremented by 1. If “unnecessary” is entered (selected), the unnecessary evaluation count in the corresponding entry is incremented by 1.
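One entry of the log table 123a and its update on each reference can be sketched as follows. The class LogEntry, its field names, the function record_reference, and the evaluation strings are hypothetical names chosen for illustration; the description defines only the fields themselves.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    """One row of the log table; all counters start at 0 when the page is stored."""
    info_id: str
    link_label: str
    reference_count: int = 0
    important_count: int = 0
    unnecessary_count: int = 0


def record_reference(entry, evaluation):
    """Count one reference and the user's evaluation of the referenced document."""
    if evaluation not in ("important", "unnecessary"):
        raise ValueError("evaluation must be 'important' or 'unnecessary'")
    entry.reference_count += 1
    if evaluation == "important":
        entry.important_count += 1
    else:
        entry.unnecessary_count += 1
```

This sketch assumes that every reference is followed by an evaluation, as the description suggests when it says the search engine 122 requests one each time a document is viewed.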

The search log 123 also includes a log statistics table 123b having the data structure shown in FIG. 6. For each word (phrase) that appears in the search expressions saved in the search log 123 and used to search for document information, an entry of the log statistics table 123b has the following fields: search expression appearance count, important evaluation ratio, unnecessary evaluation ratio, appearance count rank, important evaluation rank, unnecessary evaluation rank, importance, and determination result. The search expression appearance count is the number of times the word appears in search expressions. The important evaluation ratio and the unnecessary evaluation ratio are the ratios of the important evaluation count and the unnecessary evaluation count, respectively, to the reference count for the word. The appearance count rank is the rank of the search expression appearance count. The important evaluation rank and the unnecessary evaluation rank are the ranks of the “important” evaluation ratio and the “unnecessary” evaluation ratio, respectively. The importance indicates how important the word is. The determination result, derived from the importance, indicates whether the word is an “important word,” an “unnecessary word,” or neither.

In this embodiment, the log statistics table 123b is generated periodically by the log statistics generation module 124. The log statistics table generation processing performed by the log statistics generation module 124 is described below with reference to the flowcharts of FIGS. 7 and 8.

First, the log statistics generation module 124 takes one unprocessed search expression from the search expressions saved in the search log 123 (step S31). Next, it extracts an unprocessed word (phrase) appearing in that search expression (step S32). If the extracted word is not yet stored in the log statistics table 123b (step S33), the log statistics generation module 124 generates entry information for the log statistics table 123b containing that word (step S34). At this time, the search expression appearance count in the entry information is initialized to 1, and the other fields are left blank. If entry information containing the extracted word is already stored in the log statistics table 123b (step S33), the log statistics generation module 124 increments the search expression appearance count in that entry information by 1 (step S35). The search expression appearance count in each entry of the log statistics table 123b is frequency information (statistical information) obtained by analyzing the words that appear in the search expressions used for searches.

Next, the log statistics generation module 124 determines whether an unprocessed word remains in the search expression taken in step S31 (step S36). If so, it returns to step S32. If not, it determines whether an unprocessed search expression remains in the search log 123 (step S37). If so, it returns to step S31.
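The double loop of steps S31 to S37 amounts to counting word occurrences across all logged search expressions. The sketch below assumes, purely for illustration, that a search expression is a whitespace-separated string of words; the description does not specify the syntax of search expressions, and the function name count_search_words is hypothetical.

```python
from collections import Counter


def count_search_words(search_expressions):
    """Return the search expression appearance count for each word."""
    counts = Counter()
    for expression in search_expressions:   # S31, S37: each logged expression
        for word in expression.split():     # S32, S36: each word in it
            counts[word] += 1               # S34 (new entry, count 1) or S35 (+1)
    return counts
```

A Counter creates a missing entry on first access, which merges the two branches of step S33 into a single increment.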

When the log statistics generation module 124 has processed all the search expressions saved in the search log 123 (step S37), it determines the important evaluation ratio, unnecessary evaluation ratio, appearance count rank, important evaluation rank, unnecessary evaluation rank, importance, and determination result in each entry of the log statistics table 123b as follows.

First, the log statistics generation module 124 selects one unprocessed entry from the log statistics table 123b (step S38). Next, it reads the word in the selected entry (step S39). It then searches the log table 123a for entries whose link character string contains the word read in step S39 and refers to their reference counts, important evaluation counts, and unnecessary evaluation counts (steps S40, S41). The log statistics generation module 124 then calculates the ratios (%) of the important evaluation count and of the unnecessary evaluation count to the reference count and sets them in the corresponding entry of the log statistics table 123b (step S42). For a word that is contained in several different link character strings (“price” in the example of FIG. 4), all the reference counts and all the evaluation counts (important evaluation counts and unnecessary evaluation counts) are first summed, and the ratios (%) of the important evaluation count and of the unnecessary evaluation count to the reference count are calculated from the totals. The log statistics generation module 124 repeats the processing of steps S38 to S43 described above for all entries in the log statistics table 123b (step S44).
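The ratio calculation of steps S40 to S42, including the summation over several link character strings that share a word, can be sketched as below. The tuple layout of the log entries, the substring matching rule, the function name evaluation_ratios, and the choice to return None when a word has no references are assumptions for this sketch.

```python
def evaluation_ratios(word, log_entries):
    """Return (important %, unnecessary %) for a word, or None if never referenced.

    log_entries: iterable of (link_label, reference_count,
                              important_count, unnecessary_count)
    """
    refs = imp = unn = 0
    for label, reference_count, important_count, unnecessary_count in log_entries:
        if word in label:                  # S40: entries whose label contains the word
            refs += reference_count        # totals across all matching labels
            imp += important_count
            unn += unnecessary_count
    if refs == 0:
        return None                        # assumed handling; not covered in the text
    return (100.0 * imp / refs, 100.0 * unn / refs)   # S42
```

Summing before dividing weights each link character string by how often its page was actually referenced, which matches the description of totalling the counts first.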

Next, the log statistics generation module 124 ranks all entries in the log statistics table 123b in descending order of search expression appearance count and sets the appearance count rank in each entry (step S45). Similarly, it ranks all entries in descending order of important evaluation ratio and sets the important evaluation rank in each entry (step S46). Here, a higher rank indicates a higher important evaluation ratio. Likewise, it ranks all entries in ascending order of unnecessary evaluation ratio and sets the unnecessary evaluation rank in each entry (step S47). Here, a lower rank indicates a higher unnecessary evaluation ratio.

After executing steps S45 to S47, the log statistics generation module 124 calculates, for each entry in the log statistics table 123b, the sum of the appearance count rank, the important evaluation rank, and the unnecessary evaluation rank, and sets that sum as the “importance” in the entry (step S48). That is, the log statistics generation module 124 determines the “importance” of a word by evaluating the three ranks together. The more often a word appears in search expressions, the higher its important evaluation rank, and the lower its unnecessary evaluation rank, the higher its “importance” (and the lower its “unnecessity”). Conversely, the less often a word appears in search expressions, the lower its important evaluation rank, and the higher its unnecessary evaluation rank, the lower its “importance” (and the higher its “unnecessity”). Seen from the other side, therefore, the “importance” is equivalent to a measure of “unnecessity.”
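Steps S45 to S48 can be sketched as follows. The sketch relies on an interpretation that the description leaves open: rank 1 is taken as the top rank, so a smaller rank total corresponds to a more important word. Ties are broken by input order, which is also an assumption, as are the function names assign_ranks and importance_scores.

```python
def assign_ranks(values, descending):
    """Map each word to its 1-based position when sorted by value."""
    order = sorted(values, key=values.get, reverse=descending)
    return {word: position for position, word in enumerate(order, start=1)}


def importance_scores(counts, important_ratio, unnecessary_ratio):
    """Sum of the three ranks per word (smaller total = more important, by assumption)."""
    count_rank = assign_ranks(counts, descending=True)                     # S45
    important_rank = assign_ranks(important_ratio, descending=True)        # S46
    unnecessary_rank = assign_ranks(unnecessary_ratio, descending=False)   # S47
    return {
        word: count_rank[word] + important_rank[word] + unnecessary_rank[word]
        for word in counts
    }                                                                      # S48
```

All three dictionaries are expected to share the same set of words; a word missing from one of them would raise a KeyError in this sketch.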

Next, the log statistics generation module 124 examines the "importance" in the entry for each word in the log statistics table 123b, and treats words belonging to the top X% (for example, 20%) as candidates for "important words" and words belonging to the bottom Y% (for example, 20%) as candidates for "unnecessary words" (step S49). Note that the process of determining the "important word" and "unnecessary word" candidates (step S49), and also the process of calculating the "importance" of each word (step S48), may instead be executed by the keyword generation module 14.
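The candidate selection of step S49 can be sketched as follows. The function name, the dictionary input, and the rounding rule (round down) are assumptions made for illustration; the sketch also assumes that the "importance" is a rank sum in which a smaller value means a more important word, and it does not guard against X + Y exceeding 100.

```python
def select_candidates(rank_sums, top_percent=20, bottom_percent=20):
    """Split words into "important word" and "unnecessary word" candidates.

    rank_sums maps each word to its rank-sum "importance".
    Assumption: a smaller sum means a more important word.
    """
    # Order words from most important to least important.
    ordered = sorted(rank_sums, key=lambda word: rank_sums[word])
    n = len(ordered)
    top_n = n * top_percent // 100        # size of the top X% (rounded down)
    bottom_n = n * bottom_percent // 100  # size of the bottom Y% (rounded down)
    important = ordered[:top_n]
    unnecessary = ordered[n - bottom_n:]
    return important, unnecessary
```

For ten words and the example values X = Y = 20, this yields two candidates in each list and leaves the middle six words unclassified.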

As described above, in the present embodiment, candidates for "important words" and candidates for "unnecessary words" are determined automatically by statistically analyzing (1) the words and phrases appearing in search expressions, (2) the number of times search results are referenced, and (3) the users' evaluations of each referenced item of information. Words and phrases that appear frequently in search expressions and are contained in information that is referenced often or rated highly become candidates for "important words", that is, keywords for collecting the information users need. Words and phrases that appear infrequently in search expressions and are contained in information that is referenced rarely or rated poorly become candidates for "unnecessary words", that is, keywords for preventing the collection of information users do not need. When the log statistics table 123b is generated, it is advisable to use the synonym dictionary 15 to replace synonyms with, for example, a representative word and merge them into a single entry.
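The synonym merging mentioned above can be sketched as follows. The dictionary layout (synonym mapped to representative word) and the function name are assumptions; the sketch merges only appearance counts, and any ratios derived from them would need to be recomputed after merging.

```python
from collections import Counter


def merge_synonyms(word_counts, synonym_to_representative):
    """Fold the counts of synonyms into their representative word.

    Words with no dictionary entry are kept unchanged.
    """
    merged = Counter()
    for word, count in word_counts.items():
        representative = synonym_to_representative.get(word, word)
        merged[representative] += count
    return dict(merged)
```

Merging before ranking prevents a concept that users express with several different words from being split across entries and ranked lower than it should be.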

The keyword generation module 14 presents the list of "important word" candidates and the list of "unnecessary word" candidates determined by the log statistics generation module 124 to the management user 102 via the Web browser 17, either periodically or when the management user 102 requests generation of the setting file 13. The management user 102 selects "important words" and "unnecessary words" from these lists, and the "important words" and "unnecessary words" are determined by that selection. The keyword generation module 14 then generates the setting file 13 in which the selected "important words" and "unnecessary words" are set. Of course, the "important word" candidates and the "unnecessary word" candidates may instead be adopted automatically as the "important words" and "unnecessary words", respectively.
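Generation of a setting file from the selected words can be sketched as follows. The source does not specify the file format, so the JSON layout, every key name, and the function name here are hypothetical; only the kinds of items stored (starting-point URL, number of link stages, important words, unnecessary words) come from the description.

```python
import json


def build_setting_file(start_url, link_stages, important_words, unnecessary_words):
    """Serialize crawl settings to a JSON string (format is an assumption)."""
    settings = {
        "start_url": start_url,      # starting point of information collection
        "link_stages": link_stages,  # number of link stages to follow
        # Duplicates are removed and words sorted for a stable file.
        "important_words": sorted(set(important_words)),
        "unnecessary_words": sorted(set(unnecessary_words)),
    }
    # ensure_ascii=False keeps Japanese keywords readable in the file.
    return json.dumps(settings, ensure_ascii=False, indent=2)
```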

FIG. 1 shows only one knowledge DB 121. In many cases, however, the knowledge search system 12 has a plurality of knowledge DBs 121. In that case, the setting file 13 should include information specifying the knowledge DB 121 into which the collected information is to be stored. The setting file 13 may also specify the "character string pattern of URLs to be collected" and the "character string pattern of URLs not to be collected" described in Patent Document 1 above, so that pages to be collected and pages not to be collected are designated on a URL basis.
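URL-based filtering of this kind can be sketched as follows. The pattern syntax of Patent Document 1 is not given here, so this sketch assumes shell-style wildcards as handled by Python's `fnmatch.fnmatchcase`, and it assumes that an exclusion pattern takes precedence over an inclusion pattern; the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def should_collect(url, include_patterns, exclude_patterns):
    """Decide from URL patterns alone whether a page is to be collected."""
    # Assumption: a matching "do not collect" pattern always wins.
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return False
    # With no "collect" patterns, every non-excluded URL is allowed.
    if not include_patterns:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include_patterns)
```

For example, with the inclusion pattern `http://example.com/docs/*` and the exclusion pattern `*/private/*`, a page under `/docs/` is collected unless its path also contains `/private/`.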

The present invention is not limited to the above embodiment as it stands; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from the full set of constituent elements shown in the embodiment.

A block diagram showing the configuration of a knowledge management system that implements a knowledge information collection system according to an embodiment of the present invention.
A diagram showing part of a flowchart for explaining the Web information collection process in the embodiment.
A diagram showing the remainder of the flowchart for explaining the Web information collection process in the embodiment.
A diagram showing an example of link determination and page collection in the Web information collection process.
A diagram showing an example of the data structure of the log table 123a.
A diagram showing an example of the data structure of the log statistics table 123b.
A diagram showing part of a flowchart for explaining the log statistics table generation process in the embodiment.
A diagram showing the remainder of the flowchart for explaining the log statistics table generation process in the embodiment.

Explanation of symbols

11 ... Web information collection system, 12 ... knowledge search system, 13 ... setting file (setting means), 14 ... keyword generation module, 15 ... synonym dictionary, 20 ... Internet/intranet (network), 111 ... collection control module (information collection control means), 112 ... text extraction module, 113 ... link label extraction module (link character string extraction means), 114 ... link determination module, 122 ... search engine (search means, means for allowing evaluation), 123 ... search log (search log storage means), 123a ... log table, 123b ... log statistics table, 124 ... log statistics generation module.

Claims

A knowledge information collection system that collects, from a network, information to be registered in a knowledge database, comprising:
setting means for setting starting-point location information indicating the location of page information serving as the starting point of information collection from the network, and the number of link stages subject to information collection from the network, and for setting, as a keyword representing an unnecessary word, a word related to links that should be excluded from information collection from the network;
link character string extraction means for extracting a link character string from page information collected from the network;
link determination means for determining, from the extracted link character string and the set keyword representing an unnecessary word, whether the page information of the link destination is useless; and
information collection control means for controlling information collection from the network by following links from the set starting-point location information, the information collection control means excluding from collection the page information of a link destination determined to be useless by the link determination means, even if it is within the range of the set number of stages.

A knowledge information collection system that collects, from a network, information to be registered in a knowledge database, comprising:
setting means for setting starting-point location information indicating the location of page information serving as the starting point of information collection from the network, and the number of link stages subject to information collection from the network, and for setting, as a keyword representing an important word, a word related to links that should be subject to information collection from the network;
link character string extraction means for extracting a link character string from page information collected from the network;
link determination means for determining, from the extracted link character string and the set keyword representing an important word, whether the page information of the link destination is useful; and
information collection control means for controlling information collection from the network by following links from the set starting-point location information, the information collection control means treating as a collection target the page information of a link destination determined to be useful by the link determination means, even if it is beyond the range of the set number of stages.

A knowledge information collection system that collects, from a network, information to be registered in a knowledge database, comprising:
setting means for setting starting-point location information indicating the location of page information serving as the starting point of information collection from the network, and the number of link stages subject to information collection from the network, and for setting a word related to links that should be subject to information collection from the network as a keyword representing an important word, and a word related to links that should be excluded from information collection from the network as a keyword representing an unnecessary word;
link character string extraction means for extracting a link character string from page information collected from the network;
link determination means for determining, from the extracted link character string and the set keyword representing an important word, whether the page information of the link destination is useful, and for determining, from the extracted link character string and the set keyword representing an unnecessary word, whether the page information of the link destination is useless; and
information collection control means for controlling information collection from the network by following links from the set starting-point location information, the information collection control means treating as a collection target the page information of a link destination determined to be useful by the link determination means even if it is beyond the range of the set number of stages, and excluding from collection the page information of a link destination determined to be useless by the link determination means even if it is within the range of the set number of stages.

The knowledge information collection system according to claim 1 or 3, further comprising:
search means for searching the information collected in the knowledge database in accordance with a given search expression, and presenting the search results to a user;
means for allowing the user to evaluate the usefulness or uselessness of information collected in the knowledge database that the user has referenced in accordance with the search results of the search means;
search log storage means for storing, as a search log, the search expression used for the search by the search means, the number of times each item of information collected in the knowledge database has been referenced, and the user's evaluation result for each item of information collected in the knowledge database;
log statistics generation means for analyzing the words and phrases appearing in the search expressions stored in the search log storage means, the reference count for each item of information stored in the search log storage means, and the evaluation results for each item of information stored in the search log storage means, and for generating, for each word or phrase appearing in the search expressions, an evaluation value representing the degree to which that word or phrase is important or unnecessary to the user; and
keyword generation means for generating, based on the evaluation value generated for each word or phrase by the log statistics generation means, a keyword representing an unnecessary word that can be set by the setting means.

The knowledge information collection system according to claim 4, further comprising means for presenting to the user a list of the keywords representing unnecessary words generated by the keyword generation means, so that the user can select from the list the keywords representing unnecessary words to be set by the setting means.

The knowledge information collection system according to claim 2 or 3, further comprising:
search means for searching the information collected in the knowledge database in accordance with a given search expression, and presenting the search results to a user;
means for allowing the user to evaluate the usefulness or uselessness of information collected in the knowledge database that the user has referenced in accordance with the search results of the search means;
search log storage means for storing, as a search log, the search expression used for the search by the search means, the number of times each item of information collected in the knowledge database has been referenced, and the user's evaluation result for each item of information collected in the knowledge database;
log statistics generation means for analyzing the words and phrases appearing in the search expressions stored in the search log storage means, the reference count for each item of information stored in the search log storage means, and the evaluation results for each item of information stored in the search log storage means, and for generating, for each word or phrase appearing in the search expressions, an evaluation value representing the degree to which that word or phrase is important or unnecessary to the user; and
keyword generation means for generating, based on the evaluation value generated for each word or phrase by the log statistics generation means, a keyword representing an important word that can be set by the setting means.

The knowledge information collection system according to claim 6, further comprising means for presenting to the user a list of the keywords representing important words generated by the keyword generation means, so that the user can select from the list the keywords representing important words to be set by the setting means.

A knowledge information collection method applied to a knowledge information collection system that collects, from a network, information to be registered in a knowledge database, the method comprising the steps of:
generating a setting file in which are set starting-point location information indicating the location of page information serving as the starting point of information collection from the network, the number of link stages subject to information collection from the network, and a keyword representing an unnecessary word related to links that should be excluded from information collection from the network;
collecting information from the network by following links from the set starting-point location information;
extracting a link character string from page information collected from the network; and
determining, from the extracted link character string and the keyword representing an unnecessary word set in the setting file, whether the page information of the link destination is useless,
wherein the step of collecting information includes a step of excluding from collection the page information of a link destination determined to be useless, even if it is within the range of the set number of stages.

A knowledge information collection method applied to a knowledge information collection system that collects, from a network, information to be registered in a knowledge database, the method comprising the steps of:
generating a setting file in which are set starting-point location information indicating the location of page information serving as the starting point of information collection from the network, the number of link stages subject to information collection from the network, and a keyword representing an important word related to links that should be subject to information collection from the network;
collecting information from the network by following links from the set starting-point location information;
extracting a link character string from page information collected from the network; and
determining, from the extracted link character string and the keyword representing an important word set in the setting file, whether the page information of the link destination is useful,
wherein the step of collecting information includes a step of treating as a collection target the page information of a link destination determined to be useful, even if it is beyond the range of the set number of stages.

A program executed by a knowledge information collection system that collects, from a network, information to be registered in a knowledge database, the program causing a computer to execute the steps of:
generating a setting file in which are set starting-point location information indicating the location of page information serving as the starting point of information collection from the network, the number of link stages subject to information collection from the network, and a keyword representing an unnecessary word related to links that should be excluded from information collection from the network;
extracting a link character string from page information collected from the network by following links from the set starting-point location information;
determining, from the extracted link character string and the keyword representing an unnecessary word set in the setting file, whether the page information of the link destination is useless; and
collecting information from the network by following links from the set starting-point location information, wherein the page information of a link destination determined to be useless is excluded from collection even if it is within the range of the set number of stages.

A program executed by a knowledge information collection system that collects, from a network, information to be registered in a knowledge database, the program causing a computer to execute the steps of:
generating a setting file in which are set starting-point location information indicating the location of page information serving as the starting point of information collection from the network, the number of link stages subject to information collection from the network, and a keyword representing an important word related to links that should be subject to information collection from the network;
extracting a link character string from page information collected from the network by following links from the set starting-point location information;
determining, from the extracted link character string and the keyword representing an important word set in the setting file, whether the page information of the link destination is useful; and
collecting information from the network by following links from the set starting-point location information, wherein the page information of a link destination determined to be useful is treated as a collection target even if it is beyond the range of the set number of stages.