JP2005275714A

JP2005275714A - Information retrieval apparatus

Info

Publication number: JP2005275714A
Application number: JP2004086934A
Authority: JP
Inventors: Taisuke Sugimoto; 泰輔杉本
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2004-03-24
Filing date: 2004-03-24
Publication date: 2005-10-06

Abstract

PROBLEM TO BE SOLVED: To provide an information retrieval apparatus that returns retrieval results reflecting changes in the meaning of words over time.

SOLUTION: The control section 101 of the information retrieval apparatus 100 causes a keyword extraction section 103 to extract keywords relating to an occupation from text that a Web information collection section 102 has collected from designated sites on the Internet, to calculate a weight representing each keyword's degree of association with the occupation, and to write the weight into a keyword history DB (database) 104. The control section 101 causes an index generation section 106 to total the weights for each occupation name based on the keyword history DB 104 and to store the totals in an Internet index DB 107. On receiving a keyword specified by a user, the control section 101 causes a retrieval processing section 108 to read occupation names from an occupation DB 110 and to return a list of occupation names that takes the totaled weights in the Internet index DB 107 into account.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an information retrieval apparatus for keyword search via the Internet.

Some search sites on the Internet let general users specify a keyword and obtain, as search results, words from a specific field such as computer or economic terminology. The administrators of such sites tune the search system to improve search accuracy. Conventionally, however, this tuning has been done by hand, based on how often keywords are used on the site, feedback from users, or advice from experts in the field being searched. In fields where the meaning of words changes over time, the tuning therefore cannot keep up, and a word may not appear in the search results unless the user enters an outdated keyword.

The device described in Patent Document 1 is known as a conventional search device. The technique of Patent Document 1 adds similar keywords to the words to be searched; it does not solve the problem addressed by the present invention, namely that a search site cannot follow changes in the meaning of the words themselves over time.
Patent Document 1: Japanese Patent Laid-Open No. 11-272706

The present invention has been made in view of the above circumstances, and its object is to provide an information retrieval apparatus that can return search results reflecting changes in the meaning of the words themselves over time.

The present invention has been made to solve the problem described above. The invention of claim 1 is an information retrieval apparatus that, upon receiving a keyword from a client computer, reads a word corresponding to the received keyword from a word database in a storage area, which stores words and keywords related to those words, and transmits the word to the client computer, the apparatus comprising: a keyword history database that stores a word, a keyword related to the word, and a weight representing the degree of association between the word and the keyword; an information collection unit that accesses a plurality of pre-designated sites on the Internet and obtains information on text posted on the sites; a keyword extraction unit that extracts, from the text information obtained by the information collection unit, keywords related to the words in the keyword history database, calculates the weight of each keyword, and writes the word, the keyword, and the weight to the keyword history database; an index generation unit that reads data from the keyword history database, obtains an aggregated weight by totaling the weights using the combination of word and keyword as the key, and writes the word, the keyword, and the aggregated weight to an Internet index database in a storage area; and a search processing unit that, upon receiving a keyword from the client computer, reads a word corresponding to the received keyword from the word database, determines the degree of association between the received keyword and that word based on the aggregated weight in the Internet index database, and transmits the word and the degree of association to the client computer.

The invention of claim 2 is the information retrieval apparatus according to claim 1, wherein the keyword extraction unit and the index generation unit perform their processing at predetermined intervals.

The invention of claim 3 is the information retrieval apparatus according to claim 1 or claim 2, further comprising a keyword history database maintenance unit that deletes, from the keyword history database, data for which a predetermined time has elapsed.

The invention of claim 4 is the information retrieval apparatus according to any one of claims 1 to 3, wherein the keyword history database maintenance unit performs its processing at predetermined intervals.

The invention of claim 5 is a program for an information retrieval apparatus that, upon receiving a keyword from a client computer, reads a word corresponding to the received keyword from a word database in a storage area, which stores words and keywords related to those words, and transmits the word to the client computer. The program causes the computer of the apparatus to execute: a process of accessing pre-designated sites on the Internet and obtaining information on text posted on the sites; a process of extracting, from the text information, keywords related to words stored in a keyword history database in a storage area, the keyword history database storing a word, a keyword related to the word, a weight representing the degree of association between the word and the keyword, the address of the site on which the keyword was posted, and the date and time at which the keyword was extracted from the site; a process of calculating a weight, which is the degree of association between each extracted keyword and the word, and writing the word, the extracted keyword, the weight, the site address, and the extraction date and time to the keyword history database; a process of reading data from the keyword history database, obtaining an aggregated weight by totaling the weights using the combination of word and keyword as the key, and writing the word, the keyword, and the aggregated weight to an Internet index database in a storage area; and a process of, upon receiving a keyword from the client computer, reading a word corresponding to the received keyword from the word database, determining the degree of association between the received keyword and that word based on the aggregated weight in the Internet index database, and transmitting the word and the degree of association to the client computer.

According to the invention of claim 1 or claim 5, the information retrieval apparatus rewrites the Internet index database with the latest content drawn from Internet sites, so its users can obtain search results that reflect changes in the meaning of words over time.

According to the invention of claim 2, the Internet index database does not go without updates for long periods, which prevents the search results from failing to reflect newly appearing words and changes in word meaning.

According to the invention of claim 3, old data is deleted from the keyword history database and therefore also disappears from the Internet index database, which prevents outdated word meanings from lingering in the search results.

According to the invention of claim 4, old data is deleted from the keyword history database periodically, which prevents outdated word meanings from lingering in the search results even more reliably.

First, the basic idea of this embodiment is described with reference to FIG. 2. FIG. 2 outlines an information retrieval apparatus for the occupation information of a job-introduction site on the Internet. In this figure, the information retrieval apparatus periodically accesses, over the network, a plurality of Web sites that provide occupation information and collects text data containing that information. It then analyzes the collected text and extracts the occupation names it contains together with the keywords used when searching for those occupations. For each extracted keyword, the apparatus calculates the degree of association with an occupation name from the keyword's frequency of appearance on each Web site and registers it in the Internet index DB (database). An Internet user can use the apparatus of FIG. 2 to search for job types of interest from a keyword such as “care support” and learn of specific job types closely associated with that keyword, such as “care manager”.

FIG. 1 shows the configuration of the information retrieval apparatus of this embodiment. The information retrieval apparatus 100 in FIG. 1 is a computer connected to the Internet; in this embodiment it provides an occupation information search service to general Internet users. The control unit 101 controls the information retrieval apparatus 100 and is described in detail later. The Web information collection unit 102 accesses other Web sites via the Internet, obtains occupation information, and outputs it to the keyword extraction unit 103.
The keyword extraction unit 103 receives the occupation information from the Web information collection unit 102, extracts occupation names and occupation-related keywords, and writes them to the keyword history DB 104.

The keyword history DB 104 is a database in the format shown in FIG. 10. It contains the occupation name, a keyword related to the occupation, the URL (Uniform Resource Locator) of the web page on which the keyword appeared, the date and time the keyword was registered, and a weight representing the keyword's degree of association with the occupation.

The Web site information storage unit 109 stores information about the Internet sites that serve as the apparatus's sources of occupation information. It stores data in the format shown in FIG. 4: the URL is that of a site carrying occupation information; the standard/non-standard flag indicates whether the site presents its information in a fixed format; and the format information describes that format when the site uses one. The keyword history DB maintenance unit 105, on instruction from the control unit 101, periodically deletes old data from the keyword history DB 104. The index generation unit 106, on instruction from the control unit 101, reads the keyword history DB 104, aggregates the data read, and writes the result to the Internet index DB 107.

The Internet index DB 107 is a database in the format shown in FIG. 11. It contains the occupation name, a keyword related to the occupation, and the aggregated weight, which is the total of the corresponding weights in the keyword history DB 104.
The search processing unit 108 receives the keyword specified by the user from the control unit 101, accesses the occupation DB 110 and the Internet index DB 107, and provides information on occupations related to that keyword to the user via the control unit 101.
The occupation DB 110 is the database shown in FIG. 12. The information retrieval apparatus 100 holds it in advance in order to provide occupation information to general Internet users; it contains the occupation name, keywords related to the occupation, and a commentary describing the occupation.
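The three databases described above can be pictured as simple record types. The following is only an illustrative sketch: the class and field names are assumptions chosen for readability, and the actual layouts are those of FIGS. 10 to 12, which are not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class KeywordHistoryRecord:
    """One row of the keyword history DB 104 (field names are hypothetical)."""
    occupation: str          # occupation name
    keyword: str             # keyword related to the occupation
    url: str                 # page on which the keyword appeared
    registered_at: datetime  # registration date and time
    weight: int              # degree of association with the occupation


@dataclass(frozen=True)
class IndexRecord:
    """One row of the Internet index DB 107 (field names are hypothetical)."""
    occupation: str
    keyword: str
    total_weight: int        # sum of the matching keyword history weights


@dataclass(frozen=True)
class OccupationRecord:
    """One row of the occupation DB 110 (field names are hypothetical)."""
    occupation: str
    keywords: tuple[str, ...]
    commentary: str


# Example row; the URL is a placeholder, not a site named in the document.
example = KeywordHistoryRecord(
    occupation="care worker",
    keyword="home helper",
    url="http://example.com/jobs",
    registered_at=datetime(2004, 3, 24),
    weight=100,
)
```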

Next, the flow of processing in this embodiment is described with reference to the drawings.
In this embodiment, the administrator of the information retrieval apparatus 100 first prepares by selecting sites that provide occupation information and designating them to the apparatus. The information retrieval apparatus 100 then performs keyword extraction processing: it periodically accesses the designated sites, obtains occupation information, extracts keywords from it, and stores the keywords in the keyword history DB 104. The apparatus also performs maintenance processing on the keyword history DB 104 at predetermined intervals. Next, it performs index generation processing: from the keywords obtained by keyword extraction, it calculates how frequently each keyword is used for each occupation, weights the keywords accordingly, and registers each keyword and its weight in the Internet index DB 107. Finally, when the apparatus receives an occupation search request from a user via the Internet, it performs search processing and provides the search results to the user. These processes are described below.

<Web site selection>
First, the administrator of the information retrieval apparatus 100 selects other sites that provide occupation information. Based on the information obtained from these sites, the apparatus stores occupation-related keywords in its internal keyword history DB 104. FIG. 3 shows the site selection procedure. To select the sites, the administrator uses a personal computer and services such as Internet search engines to check sites on which occupation names appear. On finding a site that provides occupation information, the administrator inputs its URL to the information retrieval apparatus 100.

The control unit 101 of the information retrieval apparatus 100 accepts the data input by the administrator and writes the input URL to the Web site information storage unit 109 as shown in FIG. 4.
The administrator also checks whether the site displays its information in a fixed format or has no particular format. If the site has a fixed format, the administrator creates format information describing the site's display format and inputs it to the information retrieval apparatus 100. The control unit 101 accepts the input and writes the format information together with the value “standard” to the Web site information storage unit 109.

If the site does not have a fixed format, the administrator inputs that fact to the information retrieval apparatus 100. The control unit 101 accepts the input and writes the value “non-standard” to the Web site information storage unit 109.
Thereafter, as shown in FIG. 3, the information retrieval apparatus 100 accesses the Web site indicated by the URL and obtains data containing occupation information in HTML (HyperText Markup Language) or XML (Extensible Markup Language) format. This step is described under keyword extraction processing.
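The registration step above can be sketched as follows. This is a minimal illustration, not the apparatus's actual storage code: `SiteInfo`, `register_site`, and the in-memory dictionary are hypothetical names, and the format information is treated as an opaque string because the document does not specify its contents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteInfo:
    """One entry of the Web site information storage unit 109 (hypothetical layout)."""
    url: str
    is_standard: bool                  # True for "standard", False for "non-standard"
    format_info: Optional[str] = None  # only meaningful for standard-format sites


def register_site(store: dict, url: str, format_info: Optional[str] = None) -> SiteInfo:
    """Record a site; it counts as standard only when format information is supplied."""
    site = SiteInfo(url=url, is_standard=format_info is not None, format_info=format_info)
    store[url] = site
    return site


# Usage with placeholder URLs and a placeholder format description.
sites = {}
register_site(sites, "http://example.com/standard", format_info="placeholder format description")
register_site(sites, "http://example.com/freeform")
```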

<Keyword extraction processing>
The information retrieval apparatus 100 performs keyword extraction processing according to the flow shown in FIGS. 5 to 8. First, in order to periodically access the Web sites designated by the administrator and collect HTML or XML data containing occupation information, the control unit 101 outputs a Web information collection request to the Web information collection unit 102. On receiving the request, the Web information collection unit 102 reads the URL of the Web site to be accessed from the Web site information storage unit 109, accesses that site via the Internet, and receives page information, that is, HTML or XML data containing occupation information (step S01 in FIG. 5). The Web information collection unit 102 then outputs the received page information and the URL read from the Web site information storage unit 109 to the keyword extraction unit 103.

The keyword extraction unit 103 receives the data from the Web information collection unit 102 and refers to the Web site information storage unit 109 to check whether the site at the received URL has a fixed format.
If the Web site from which the page information was obtained has a fixed format (S02 in FIG. 5, U01 in FIG. 6, step U01 in FIG. 7), it is known in advance where in the page information the keywords describing an occupation are located and where the text describing the occupation is located. The keyword extraction unit 103 therefore extracts the occupation-related keywords directly from the page information. For free text in the page information, such as a description of an occupation, the keyword extraction unit 103 performs morphological analysis on that item and extracts the occupation-related keywords.

Next, the keyword extraction unit 103 writes the extracted keyword, the URL received from the Web information collection unit 102, and the current date and time to the keyword history DB 104, storing the current date and time as the registration date and time. Even if the same keyword or URL is already registered in the keyword history DB 104, the keyword extraction unit 103 treats the new entry as separate data and writes it unless the occupation name, keyword, URL, and registration date and time all match an existing entry.

Next, the keyword extraction unit 103 calculates the weight of each extracted keyword. If the keyword is listed as a similar occupation name on the Web site's page, the keyword extraction unit 103 sets the keyword's weight in the keyword history DB 104 to 100 points. If the keyword appears in the commentary on the page, it sets the weight to 1 point.
However, if data whose occupation name, keyword, URL, and registration date and time all match is already registered in the keyword history DB 104, the keyword extraction unit 103 adds the points to the existing weight and stores the cumulative value.
With this weighting, in FIG. 7 for example, “care worker”, “home helper”, and “service provision manager” each receive 100 points, while “shift creation”, “user”, “care manager”, “service”, “coordination”, “visit”, “care”, “plan”, “documents”, “maintenance”, “helper”, “interview”, “recruitment”, “education”, “car”, and “rounds” each receive 5 points.
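The weighting and accumulation rule can be sketched as below. This is an illustrative sketch under stated assumptions: the keyword history DB is modeled as a dictionary keyed by (occupation, keyword, URL, registration time), the function and constant names are hypothetical, and the granularity of the registration timestamp is not specified in the document.

```python
from datetime import datetime

SIMILAR_NAME_POINTS = 100  # keyword listed as a similar occupation name
COMMENTARY_POINTS = 1      # keyword found in the commentary text


def record_keyword(history, occupation, keyword, url, registered_at, as_similar_name):
    """Add one keyword occurrence to the history and return its stored weight.

    Entries are distinct unless occupation, keyword, URL and registration
    time all match; a full match accumulates points on the existing entry.
    """
    points = SIMILAR_NAME_POINTS if as_similar_name else COMMENTARY_POINTS
    key = (occupation, keyword, url, registered_at)
    history[key] = history.get(key, 0) + points
    return history[key]


# Usage with a placeholder URL.
history = {}
now = datetime(2004, 3, 24, 12, 0)
record_keyword(history, "care worker", "home helper", "http://example.com/a", now, True)
record_keyword(history, "care worker", "visit", "http://example.com/a", now, False)
record_keyword(history, "care worker", "visit", "http://example.com/a", now, False)
```

Under this model, a keyword that occurs several times in one page's commentary during a single collection run ends up with a weight greater than 1, because each occurrence adds to the same entry.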

If the Web site from which the page information was obtained has a non-standard format, the keyword extraction unit 103 removes HTML tags and the like from the page information to produce plain-text data (T01 in FIG. 6, step T01 in FIG. 8). It then performs morphological analysis on the plain-text data and splits it into words (T02 in FIG. 6, step T02 in FIG. 8).

The keyword extraction unit 103 then removes from the plain-text data symbols such as “。” and “、”, particles such as “ga”, “no”, and “wo”, auxiliary verbs such as “da” and “desu”, conjunctions such as “mata” (also), adverbs such as “hotondo” (almost), adnominals such as “iwayuru” (so-called) and “aru” (a certain), and pronouns such as “ano” (that) and “kono” (this), leaving only the keywords (T03 in FIG. 6, step T03 in FIG. 8). It writes each remaining keyword to the keyword history DB 104 in the same way as for a Web site with a fixed format (T04 in FIG. 6, step T04 in FIG. 8).
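Steps T01 to T03 can be illustrated with the following sketch. It is a deliberate simplification: a regular-expression tokenizer over English text stands in for the morphological analysis of Japanese, and a small stop-word set stands in for filtering by part of speech. The function names and the stop-word list are assumptions, not part of the described apparatus.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of a page and discards the tags (step T01)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(page: str) -> str:
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.chunks)


# Stand-in for part-of-speech filtering (step T03); the list is illustrative only.
STOP_WORDS = {"the", "a", "an", "of", "and", "this", "that", "is", "for"}


def tokenize(text: str) -> list:
    """Stand-in for morphological analysis (step T02): split into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(page: str) -> list:
    """Run T01 to T03 and return the remaining keywords in page order."""
    return [word for word in tokenize(strip_tags(page)) if word not in STOP_WORDS]


keywords = extract_keywords("<p>The care manager visits <b>this</b> home.</p>")
```

For Japanese pages, a real morphological analyzer would replace `tokenize`, and the filter would test each token's part of speech rather than membership in a fixed list.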

<Keyword history DB maintenance processing>
Because the information retrieval apparatus 100 repeats keyword extraction processing periodically, new occupation data is continually accumulated in the keyword history DB 104. If the apparatus only accumulated data, however, old data would remain in the keyword history DB 104 indefinitely. To remove old data, the control unit 101 therefore outputs a maintenance request to the keyword history DB maintenance unit 105 at predetermined intervals.

On receiving the maintenance request, the keyword history DB maintenance unit 105 deletes from the keyword history DB 104 any data whose registration date and time is older than a predetermined cutoff (S03 in FIG. 5). In the example of FIG. 10, data dated 2002/12/31 or earlier is deleted, so occupation names and keywords that have become obsolete and are no longer used are removed from the keyword history DB 104. Conversely, because keyword extraction processing treats a keyword already stored in the keyword history DB 104 as separate data whenever the registration date and time differs and registers it again, occupation names and keywords that stay in general use over many years remain in the keyword history DB 104.
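The maintenance step amounts to filtering by registration time. The sketch below assumes, purely for illustration, that each history record is a tuple of (occupation, keyword, URL, registration time, weight) and that the cutoff is the first instant to be kept; `purge_old_records` is a hypothetical name.

```python
from datetime import datetime


def purge_old_records(records, cutoff):
    """Return only the records registered at or after the cutoff."""
    return [record for record in records if record[3] >= cutoff]


# Usage: with a cutoff of 2003-01-01, data dated 2002/12/31 or earlier is dropped.
records = [
    ("care worker", "visit", "http://example.com/a", datetime(2002, 12, 31), 1),
    ("care worker", "visit", "http://example.com/a", datetime(2003, 6, 1), 1),
]
kept = purge_old_records(records, datetime(2003, 1, 1))
```

A keyword that keeps appearing on the source sites is re-registered with newer timestamps on each collection run, so it survives the purge even as its older entries are dropped.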

<Index generation processing>
At predetermined intervals, the information retrieval apparatus 100 aggregates the keyword history DB 104 and stores the result in the Internet index DB 107. To do so, the control unit 101 outputs an index generation request to the index generation unit 106 at each interval.
On receiving the index generation request from the control unit 101, the index generation unit 106 first deletes all data in the Internet index DB 107. It then reads the keyword history DB 104 and, as shown in FIG. 11, sums the weights using the occupation name and the keyword together as the key to obtain the aggregated weight, and writes the occupation name, the keyword, and the aggregated weight to the Internet index DB 107 (step S04 in FIG. 5).
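The aggregation in step S04 is a group-by-and-sum over the pair (occupation name, keyword). A minimal sketch follows, assuming history records are tuples of (occupation, keyword, URL, registration time, weight); `build_index` is a hypothetical name. Returning a fresh dictionary on every call mirrors the way the index is cleared and rebuilt from scratch.

```python
from collections import defaultdict
from datetime import datetime


def build_index(history_records):
    """Sum the weights per (occupation, keyword) pair, ignoring URL and date."""
    totals = defaultdict(int)
    for occupation, keyword, _url, _registered_at, weight in history_records:
        totals[(occupation, keyword)] += weight
    return dict(totals)


# Usage with placeholder URLs.
history_records = [
    ("care worker", "home helper", "http://example.com/a", datetime(2004, 1, 1), 100),
    ("care worker", "home helper", "http://example.com/b", datetime(2004, 2, 1), 100),
    ("care worker", "visit", "http://example.com/a", datetime(2004, 1, 1), 1),
]
index = build_index(history_records)
```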

<Search processing>
Suppose that the information search apparatus 100 has received, via the Internet, the keyword “elderly” specified by a user and has accepted a request to search for occupation names strongly related to that keyword (step V01 in FIG. 9). In response, the control unit 101 of the information search apparatus 100 outputs the received keyword to the search processing unit 108 and requests a search for occupation names strongly related to the keyword.

The search processing unit 108 receives the data from the control unit 101 and creates the occupation list shown in FIG. 13 in the storage area. The search processing unit 108 then accesses the occupation DB 110 shown in FIG. 12 and checks, for each occupation name, whether “elderly” appears as a keyword and whether “elderly” appears in the explanatory text. While performing this check, the search processing unit 108 assigns points to each occupation name and judges that an occupation name with more points is more strongly related to the “elderly” keyword specified by the user. For example, suppose that “elderly” is not associated as a keyword with “care worker” in the occupation DB 110 but is contained in its explanatory text. In this case, the search processing unit 108 assigns 50 points to “care worker” and stores “care worker” and 50 points in the occupation list. If “care worker” already has points in the occupation list, the search processing unit 108 adds 50 points and writes the result to the occupation list.

If, on the other hand, “elderly” is associated as a keyword with the “care worker” data in the occupation DB 110, the search processing unit 108 assigns 500 points to “care worker” and stores “care worker” and 500 points in the occupation list. If “care worker” already has points in the occupation list, the search processing unit 108 adds 500 points to “care worker” and writes the result to the occupation list. The search processing unit 108 performs the same processing on the other data in the occupation DB 110.
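The point assignment against the occupation DB can be sketched as follows in Python. The 500-point and 50-point values come from the description; the dictionary layout of the occupation DB, the function name `score_occupations`, and the sample entries are assumptions. The sketch also assumes that an occupation matching both as a keyword and in the explanatory text receives both amounts, which is one reading of the "add to existing points" rule.

```python
KEYWORD_POINTS = 500  # query is registered as a keyword of the occupation
TEXT_POINTS = 50      # query appears in the occupation's explanatory text


def score_occupations(occupation_db, query):
    """Build an occupation list mapping occupation name to points."""
    occupation_list = {}
    for name, entry in occupation_db.items():
        if query in entry["keywords"]:
            occupation_list[name] = occupation_list.get(name, 0) + KEYWORD_POINTS
        if query in entry["description"]:
            occupation_list[name] = occupation_list.get(name, 0) + TEXT_POINTS
    return occupation_list


# Hypothetical occupation DB contents.
occupation_db = {
    "care worker": {
        "keywords": ["welfare"],
        "description": "Supports the daily life of the elderly.",
    },
    "home helper": {
        "keywords": ["elderly"],
        "description": "Visits homes to assist with housework.",
    },
    "programmer": {
        "keywords": ["software"],
        "description": "Writes computer programs.",
    },
}

points = score_occupations(occupation_db, "elderly")
print(points)
```

Occupations with no match are simply absent from the resulting list, so only candidates with at least one hit are carried into the later steps.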

Further, the search processing unit 108 accesses the Internet index DB 107 and checks whether “elderly” exists there as a keyword. If it does, the search processing unit 108 reads the occupation names corresponding to “elderly” and their aggregated weights from the Internet index DB 107 and compares them against the occupation list to check whether the same occupation names are included.
If the search processing unit 108 confirms that an occupation name read from the Internet index DB 107 is not included in the occupation list, it treats the aggregated weight as points and adds the occupation name and the points to the occupation list.
If the search processing unit 108 detects that an occupation name read from the Internet index DB 107 is included in the occupation list, it adds the aggregated weight read from the Internet index DB 107 to the points of the corresponding entry in the occupation list and writes the result (step V02 in FIG. 9).
Next, the search processing unit 108 reads the occupation list from the storage area, sorts it in descending order of points, and outputs it to the control unit 101 (step V03 in FIG. 9). The control unit 101 receives the data from the search processing unit 108 and transmits the occupation list to the requester via the Internet (step V04 in FIG. 9).
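Steps V02 and V03 can be read as a merge followed by a sort. The Python sketch below assumes the occupation list is a mapping from occupation name to points and the Internet index is a mapping from (occupation name, keyword) to aggregated weight; these layouts, the function name `merge_index_weights`, and all sample values are illustrative assumptions.

```python
def merge_index_weights(occupation_list, internet_index, query):
    """Add aggregated weights for the query keyword, then rank by points."""
    merged = dict(occupation_list)
    for (occupation, keyword), weight in internet_index.items():
        if keyword == query:
            # A new occupation starts from 0, so its aggregated weight
            # becomes its points; an existing one has the weight added.
            merged[occupation] = merged.get(occupation, 0) + weight
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs.
occupation_list = {"care worker": 50, "home helper": 500}
internet_index = {
    ("care worker", "elderly"): 5.0,
    ("nurse", "elderly"): 1.0,
    ("nurse", "hospital"): 4.0,
}

ranked = merge_index_weights(occupation_list, internet_index, "elderly")
for occupation, total in ranked:
    print(occupation, total)
```

Here "nurse" enters the result only through the Internet index, which shows how an occupation that the occupation DB does not yet associate with the keyword can still be returned to the user.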

The embodiments of the present invention have been described in detail above with reference to the drawings, but the specific configuration is not limited to these embodiments and also includes design changes and the like within a scope that does not depart from the gist of the invention. For example, the information search apparatus in the present embodiment is intended to provide information on occupations, but the invention may also be applied to apparatuses intended to provide information on computer terms, economic terms, terms related to current affairs, and the like.

FIG. 1 is a block diagram showing the configuration of the information search apparatus in the embodiment of the present invention.
FIG. 2 is a diagram showing the basic concept of the information search apparatus in the embodiment of the present invention.
FIG. 3 is a diagram showing the Web site selection procedure performed by the administrator of the information search apparatus in the embodiment of the present invention.
FIG. 4 is a diagram showing the contents of the Web site information storage unit 109 of the information search apparatus in the embodiment of the present invention.
FIG. 5 is a flowchart of the information search apparatus in the embodiment of the present invention.
FIG. 6 is a flowchart of the information search apparatus in the embodiment of the present invention.
FIG. 7 is a flowchart of the information search apparatus in the embodiment of the present invention.
FIG. 8 is a flowchart of the information search apparatus in the embodiment of the present invention.
FIG. 9 is a flowchart of the information search apparatus in the embodiment of the present invention.
FIG. 10 is a diagram showing the keyword history DB of the information search apparatus in the embodiment of the present invention.
FIG. 11 is a diagram showing the method of creating the Internet index DB of the information search apparatus in the embodiment of the present invention.
FIG. 12 is a diagram showing the occupation DB of the information search apparatus in the embodiment of the present invention.
FIG. 13 is a diagram showing the occupation list of the information search apparatus in the embodiment of the present invention.

Explanation of symbols

100 ... Information retrieval apparatus
101 ... Control unit
102 ... Web information collection unit
103 ... Keyword extraction unit
104 ... Keyword history DB
105 ... Keyword history DB maintenance unit
106 ... Index generation unit
107 ... Internet index DB
108 ... Search processing unit
109 ... Web site information storage unit
110 ... Occupation DB

Claims

An information retrieval apparatus that, upon receiving a keyword from a client computer, reads a word corresponding to the received keyword from a word database in a storage area, the word database storing words and keywords related to the words, and transmits the word to the client computer, the information retrieval apparatus comprising:
a keyword history database that stores a word, a keyword related to the word, and a weight representing the degree of association between the word and the keyword;
an information collection unit that accesses a plurality of predesignated sites on the Internet and obtains information on sentences posted on the sites;
a keyword extraction unit that extracts, from the sentence information obtained by the information collection unit, a keyword related to a word in the keyword history database, calculates the weight of the keyword, and writes the word, the keyword, and the weight to the keyword history database;
an index generation unit that reads data from the keyword history database, sums the weights using the word and the keyword together as the key to obtain an aggregated weight, and writes the word, the keyword, and the aggregated weight to an Internet index database in the storage area; and
a search processing unit that, upon receiving a keyword from the client computer, reads a word corresponding to the received keyword from the word database, determines the degree of association between the received keyword and that word using the aggregated weight stored in the Internet index database, and transmits the word corresponding to the received keyword and the degree of association to the client computer.

The information retrieval apparatus according to claim 1, wherein the keyword extraction unit and the index generation unit perform their processing at predetermined intervals.

The information retrieval apparatus according to claim 1, wherein
the keyword history database further stores the address of the site on which the keyword was posted and the date and time at which the keyword was extracted from the site,
the keyword extraction unit, when writing the word, the keyword, and the weight to the keyword history database, further writes the address of the site on which the keyword was posted and the date and time at which the keyword was extracted from the site, and
the information retrieval apparatus further comprises a keyword history database maintenance unit that deletes, from the keyword history database, data for which a predetermined time has elapsed.

The information retrieval apparatus according to claim 1, wherein the keyword history database maintenance unit performs its processing at predetermined intervals.

A program for causing a computer of an information retrieval apparatus that, upon receiving a keyword from a client computer, reads a word corresponding to the received keyword from a word database in a storage area, the word database storing words and keywords related to the words, and transmits the word to the client computer, to execute:
a process of accessing a plurality of predesignated sites on the Internet and obtaining information on sentences posted on the sites;
a process of extracting, from the sentence information, a keyword related to a word stored in a keyword history database in a storage area, the keyword history database storing a word, a keyword related to the word, and a weight representing the degree of association between the word and the keyword;
a process of calculating a weight that is the degree of association of the extracted keyword with the word, and writing the word, the extracted keyword, and the weight to the keyword history database;
a process of reading data from the keyword history database, summing the weights using the word and the keyword together as the key to obtain an aggregated weight, and writing the word, the keyword, and the aggregated weight to an Internet index database in the storage area; and
a process of, upon receiving a keyword from the client computer, reading a word corresponding to the received keyword from the word database, determining the degree of association between the received keyword and that word using the aggregated weight stored in the Internet index database, and transmitting the word corresponding to the received keyword and the degree of association to the client computer.