JP2007164378A - Related term extraction device and related term extraction method

Info

Publication number: JP2007164378A
Application number: JP2005358328A
Authority: JP
Inventors: Masanori Osugi; 眞規大杉
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2005-12-12
Filing date: 2005-12-12
Publication date: 2007-06-28
Anticipated expiration: 2025-12-12
Also published as: JP4791169B2

Abstract

PROBLEM TO BE SOLVED: To provide a device that extracts and stores related terms (related knowledge) from advertising-related information stored in storage devices connected via a communication line, so that a provider of Web documents can analyze the vocabulary frequently used in a specific business category or industry.

SOLUTION: A related term extraction device 500, which associates mutually related vocabulary data from among a plurality of pieces of vocabulary data, receives Web documents stored in Web servers 100a-c connected via a communication line network 30, stores the received Web documents, receives input of first advertising vocabulary data related to the vocabulary data to be extracted, extracts the Web documents that include the inputted first advertising vocabulary data, extracts second advertising vocabulary data included in common in those Web documents, generates a domain 410 in which the extracted second advertising vocabulary data is associated with the first advertising vocabulary data, and stores the generated domain.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a related word extraction device that extracts related words, and to a method for doing so. In particular, it relates to a device and method that extract, as related words, vocabulary that appears frequently across many documents.

Conventionally, related word search has been known as a method for extracting related words, in which linguistic analysis such as morphological analysis is applied to a given text. Patent Document 1 describes an extraction device that can extract words related to a given theme while taking into account a reliability score for each user, which reflects how knowledgeable and how interested the user is in the theme associated with a given keyword.
Patent Document 1: Japanese Patent Laid-Open No. 11-053375

However, as explained below, Patent Document 1 does not consider a device or method for extracting related words across information provided by a plurality of servers that are actually connected via a communication network.

In general, when information is provided as Web documents over the Internet, Web documents related to the vocabulary that a user has entered (for a search or the like), or links to such Web documents, are displayed on the user's terminal.

That is, effective promotion is achieved by displaying advertising information related to the user's search keyword alongside the search results for that keyword. In such an arrangement, improving the advertising effect on the user requires deciding which advertising content to display with the search results for each kind of search keyword the user enters.

For example, when a life insurance company displays an advertisement, showing that advertisement is clearly appropriate if the user entered "insurance". If the user entered "guarantor", however, it is difficult to judge whether showing the life insurance company's advertisement is appropriate. An index for judging whether "insurance" and "guarantor" are related words for that life insurance company would therefore be desirable.

Meanwhile, companies such as this life insurance company set up Web servers for their own advertising and promotion and provide Web documents to customers. For a Web document provider that has set up such a Web server to organize and analyze industry knowledge, such as the vocabulary, technical terms, and product information frequently used in a specific business category or industry, it is desirable to collect information from Web documents provided by competitors and by organizations and associations related to that industry.

The present inventors recognized that it would be desirable to give Web document providers a device and method that can extract related vocabulary across pieces of information stored in storage devices connected via a communication network.

An object of the present invention is to provide a device and a method that extract and store related words (industry knowledge) from advertising and promotional information stored in storage devices connected via a communication line, so that a Web document provider can analyze the vocabulary frequently used in a specific business category or industry and thereby collect and organize industry knowledge.

(1) A related word extraction device (for example, the related word extraction device 500 described later) that associates mutually related advertising vocabulary data from among a plurality of pieces of advertising vocabulary data so that a Web document provider can analyze the vocabulary frequently used in a specific business category or industry, the device comprising:
a receiving unit (for example, the communication unit 510 described later) that receives Web documents stored in a storage device connected via a communication line;
a Web document storage unit (for example, the Web document storage unit 530 described later) that stores the Web documents received by the receiving unit;
an input unit (for example, the input unit 550 described later) that accepts input of first advertising vocabulary data related to the advertising vocabulary data to be extracted;
a Web document extraction unit (for example, the extraction unit 525 described later) that extracts, from the Web document storage unit, the Web documents containing the first advertising vocabulary data input via the input unit;
an extraction unit (for example, the extraction unit 525 described later) that extracts second advertising vocabulary data included in common in those Web documents;
a domain generation unit (for example, the domain generation unit 527) that generates a domain in which the second advertising vocabulary data extracted by the extraction unit is associated with the first advertising vocabulary data; and
a domain storage unit (for example, the domain storage unit 540) that stores the domain generated by the domain generation unit.

According to the invention described in (1), the related word extraction device receives Web documents stored in a storage device connected via a communication line and stores them. It accepts input of first advertising vocabulary data related to the advertising vocabulary data to be extracted, extracts the Web documents that contain this first advertising vocabulary data, extracts second advertising vocabulary data included in common in the extracted Web documents, generates a domain in which the extracted second advertising vocabulary data is associated with the first advertising vocabulary data, and stores the generated domain.
It is thus possible to extract second advertising vocabulary data from the Web documents that contain the first advertising vocabulary data entered by the user, and to generate a domain, that is, data in which the first and second advertising vocabulary data are associated.
Consequently, given any first advertising vocabulary data as input, the related word extraction device can extract, as second advertising vocabulary data, the related words found in the advertising and promotional information (Web documents) stored in storage devices connected via the communication line, and can generate data that associates the first advertising vocabulary data with the second.

(2) The related word extraction device according to (1), wherein the domain generation unit associates a domain generated from other first advertising vocabulary data, different from the first advertising vocabulary data, and from the second advertising vocabulary data extracted on the basis of that other first advertising vocabulary data, with a domain already stored in the domain storage unit.

According to the invention described in (2), the related word extraction device associates a domain generated from other first advertising vocabulary data, different from the first advertising vocabulary data, and from the second advertising vocabulary data extracted on the basis of that other data, with a domain that is already stored.
By associating a previously generated domain with a newly generated one in this way, it is possible to generate data that indicates the relationships between domains.

(3) The related word extraction device according to (1) or (2), wherein, when extracting the second advertising vocabulary data included in common in the Web documents, the extraction unit preferentially extracts second advertising vocabulary data with a high frequency of occurrence.

According to the invention described in (3), when extracting the second advertising vocabulary data included in common in the Web documents, the related word extraction device preferentially extracts second advertising vocabulary data with a high frequency of occurrence.
Because the second advertising vocabulary data is selected from the vocabulary in the Web documents on the basis of frequency of occurrence, it is possible to extract second advertising vocabulary data that is appropriately related to the first.
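The frequency-based prioritization in (3) can be illustrated with a minimal Python sketch. The text does not say how frequency is measured, so this sketch assumes document frequency (the number of matching documents that contain a term) and assumes the documents are already split into terms. The function name `rank_related_terms` and the sample data are hypothetical.

```python
from collections import Counter


def rank_related_terms(keyword, token_lists, top_n=None):
    """Rank candidate related terms by how many matching documents contain them."""
    # Only documents that contain the keyword are considered.
    hits = [set(tokens) for tokens in token_lists if keyword in tokens]
    # Count each term once per document (document frequency, an assumption).
    counts = Counter(term for doc in hits for term in doc if term != keyword)
    # Highest count first; ties are broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]


# Hypothetical, pre-tokenized documents.
docs = [
    ["loan", "same-day", "interest"],
    ["loan", "same-day", "collateral"],
    ["loan", "same-day", "interest", "guarantor"],
]
top_terms = rank_related_terms("loan", docs, top_n=2)
print(top_terms)
```

Raw term counts or a co-occurrence ratio could be substituted for document frequency without changing the overall shape of the function.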

(4) A related word extraction method for associating mutually related advertising vocabulary data from among a plurality of pieces of advertising vocabulary data so that a Web document provider can analyze the vocabulary frequently used in a specific business category or industry, the method comprising:
a step of receiving Web documents stored in a storage device connected via a communication line;
a step of storing the Web documents received in the receiving step;
an input step of accepting input of first advertising vocabulary data related to the advertising vocabulary data to be extracted;
an extraction step of extracting the Web documents containing the first advertising vocabulary data input in the input step;
a second advertising vocabulary data extraction step of extracting second advertising vocabulary data included in common in those Web documents;
a domain generation step of generating a domain in which the second advertising vocabulary data extracted in the second advertising vocabulary data extraction step is associated with the first advertising vocabulary data; and
a domain storage step of storing the domain generated in the domain generation step.

(5) The related word extraction method according to (4), wherein the domain generation step associates a domain generated from other first advertising vocabulary data, different from the first advertising vocabulary data, and from the second advertising vocabulary data extracted on the basis of that other first advertising vocabulary data, with a domain already stored in the domain storage step.

According to the present invention, vocabulary frequently used in a specific business category or industry, such as that found in advertisements and promotions stored in storage devices connected via a communication line, is extracted and stored, so that a Web document provider can use the stored information to analyze industry knowledge.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a configuration diagram of a related word extraction system 1, which is a preferred embodiment of the present invention. The related word extraction system 1 comprises a related word extraction device 500 and Web servers 100a to 100c, which are communicably connected via a communication line network 30.

The related word extraction device 500 is an information processing device that extracts related words, and may be a computer equipped with a CPU (Central Processing Unit) and memory. It comprises a communication unit 510 that connects to the communication line network 30 to enable communication, a control unit 520 that controls information and data, a Web document storage unit 530 in which Web documents are stored, a domain storage unit 540 in which generated domains are stored, an input unit 550 that accepts input from the user, and an output unit 560 that outputs information and data.

The control unit 520 controls the information and data processed by the related word extraction device 500 and may be, for example, a CPU. It includes an extraction unit 525, which extracts Web documents and second advertising vocabulary data, and a domain generation unit 527, which generates the domains described later. The control unit 520 executes the main process (flowchart of FIG. 2) and the domain association process (flowchart of FIG. 8), both described later.

The Web document storage unit 530 is a device in which the control unit 520 stores the Web documents received from the Web servers 100a to 100c via the communication unit 510. A Web document here is electronic data containing text and images that can be viewed over a communication line, and may be a Web page. A Web document may be electronic data that includes advertising vocabulary data, and may contain information on corporate promotion and advertising, that is, the vocabulary, technical terms, product information, and industry knowledge frequently used in a specific business category or industry. Advertising vocabulary data is data on vocabulary related to corporate promotion and advertising.

The Web document storage unit 530 need not be part of the related word extraction device 500. It may instead be provided in a device such as a server or computer connected to the communication line network 30, with the related word extraction device 500 reading Web documents and data about them from the Web document storage unit 530 via the communication unit 510 as needed.

The domain storage unit 540 is a device in which the control unit 520 stores the domains generated by the domain generation unit 527. A domain here is data in which a plurality of pieces of advertising vocabulary data are associated, as described later with reference to FIGS. 6 and 7.

The input unit 550 is a device that accepts input from the user, and may be, for example, a keyboard or a mouse. The output unit 560 is a device that outputs information and data, and may be, for example, an output device such as a monitor, a liquid crystal display, or a printer.

The Web servers 100a to 100c are computers that provide content to be viewed in a Web browser. Each stores Web documents and, in response to a viewing request, transmits the requested Web document to the requesting computer. In other words, the Web servers 100a to 100c transmit Web documents to the related word extraction device 500.

The user terminal 200 is a computer for operating the related word extraction device 500 remotely. It includes an input unit 250 through which information and data are entered, a communication unit 210 that connects to the communication line network 30 to enable communication, a control unit 220 that controls information and data, and an output unit 260 to which information and data are output. Instead of entering information and data through the input unit 550 of the related word extraction device 500, the user can enter them through the input unit 250 of the user terminal 200.

Next, the main process executed by the control unit 520 of the related word extraction device 500 is described with reference to FIG. 2. The control unit 520 accepts input of first advertising vocabulary data (the extraction keyword) via the input unit 550 or via the input unit 250 of the user terminal 200 (step S01). The first advertising vocabulary data (extraction keyword) is the advertising vocabulary data that serves as the basis for the vocabulary to be extracted as related words.

Next, the extraction unit 525 extracts, from the Web document storage unit 530, the Web documents that contain the first advertising vocabulary data (step S02).

Step S02 has been described above as the extraction unit 525 extracting Web documents that contain the first advertising vocabulary data from the Web document storage unit 530. Alternatively, the Web document storage unit 530 may store partial data of each Web document that contains keywords characterizing it, such as summary data, lead text, or index information, and the extraction unit 525 may extract, from this stored partial data, the items that contain the first advertising vocabulary data.

In either case, the Web documents, or the partial data of the Web documents, are stored in the Web document storage unit 530 before the main process is executed.
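Step S02 and its partial-data variant can be sketched as follows. This is a minimal sketch, not the patented implementation: the `StoredDocument` class, the `use_summary` flag, plain substring matching, and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoredDocument:
    """Hypothetical record kept in the Web document storage unit."""
    url: str
    text: str = ""     # full text, if it was stored
    summary: str = ""  # summary, lead text, or index terms, if stored


def select_documents(keyword, store, use_summary=False):
    """Return the stored documents whose text (or summary) contains the keyword."""
    selected = []
    for doc in store:
        haystack = doc.summary if use_summary else doc.text
        if keyword in haystack:  # simple substring match (an assumption)
            selected.append(doc)
    return selected


# The store is filled before the main process runs, as the text requires.
store = [
    StoredDocument("https://example.com/a", text="loan with same-day approval",
                   summary="consumer loan"),
    StoredDocument("https://example.com/b", text="digital camera sale",
                   summary="camera"),
    StoredDocument("https://example.com/c", summary="business loan"),
]
full_matches = select_documents("loan", store)
summary_matches = select_documents("loan", store, use_summary=True)
print([d.url for d in full_matches], [d.url for d in summary_matches])
```

Matching against summaries finds the third document even though its full text was never stored, which is the practical benefit of the partial-data variant.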

FIG. 3 shows a Web document correspondence table that associates the extraction keyword entered by the user, the Web documents related to it, and their link data. When the user enters "keyword 01" as the extraction keyword, the corresponding Web documents are extracted by matching against the text portion of each Web document. The result is Web documents A to D, each of which is associated with its link data (for example, the URL of the link destination). The Web document correspondence table is stored in the Web document storage unit 530 and read out by the control unit 520 as needed.
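One possible in-memory shape for the correspondence table of FIG. 3 is a list of rows, one per matching document. The row keys, the function name `build_correspondence_table`, and the URLs below are hypothetical; the text specifies only that keyword, document, and link data are associated.

```python
def build_correspondence_table(keyword, store):
    """Build rows linking a keyword to each matching document and its URL.

    `store` maps a document name to a (text, url) pair.
    """
    return [
        {"keyword": keyword, "document": name, "link": url}
        for name, (text, url) in store.items()
        if keyword in text  # match against the text portion of the document
    ]


# Hypothetical stored documents.
store = {
    "Web document A": ("keyword01 appears here", "https://example.com/a"),
    "Web document B": ("also about keyword01", "https://example.com/b"),
    "Web document E": ("unrelated text", "https://example.com/e"),
}
table = build_correspondence_table("keyword01", store)
for row in table:
    print(row)
```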

FIG. 4 shows a screen image in which input of the first advertising vocabulary data is accepted and the result of extracting Web documents is output to the output units 260 and 560. First, input of the first advertising vocabulary data (extraction keyword) is accepted through the input window 305. Suppose the user enters "loan" and the Web documents, or the partial data of the Web documents, are extracted. For this extraction keyword, the extraction unit 525 extracts the Web documents of ○× Shoji 310 and "Loans from ○○ Shoji" 320 (corresponding, for example, to Web documents A and B). The control unit 520 may then display the link data 311 and 321 corresponding to these Web documents on the output units 260 and 560, as shown in FIG. 4.

Next, the second advertising vocabulary data is extracted from the Web documents extracted by the extraction unit 525 (step S03). The second advertising vocabulary data is advertising vocabulary data included in common in the one or more Web documents extracted using the first advertising vocabulary data. In this extraction, the vocabulary to be extracted as second advertising vocabulary data may be selected on the basis of co-occurrence frequency. That is, the extraction unit 525 may preferentially extract, as second advertising vocabulary data, vocabulary that appears frequently in the Web documents.

FIG. 5 shows examples of Web documents 600 and 605 extracted using the extraction keyword of FIG. 4. The extraction unit 525 extracts the vocabulary included in common in these Web documents 600 and 605 as second advertising vocabulary data. Here, the extraction unit 525 extracts "same-day" 610a, 610b and "effective annual rate" 615a, 615b, which are common to Web documents 600 and 605.
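The common-vocabulary extraction of step S03 amounts to a set intersection over the terms of the extracted documents. The sketch below assumes the documents have already been split into terms (the text does not specify a tokenizer); the function name `common_vocabulary` and the sample term lists are hypothetical stand-ins for Web documents 600 and 605.

```python
def common_vocabulary(token_lists, exclude=()):
    """Return the terms that appear in every document, minus excluded terms."""
    sets = [set(tokens) for tokens in token_lists]
    if not sets:
        return set()
    return set.intersection(*sets) - set(exclude)


# Hypothetical term lists standing in for Web documents 600 and 605.
doc_600 = ["loan", "same-day", "effective annual rate", "○× Shoji"]
doc_605 = ["loan", "same-day", "effective annual rate", "○○ Shoji"]

# The extraction keyword itself is excluded from the result.
second_vocab = common_vocabulary([doc_600, doc_605], exclude=["loan"])
print(sorted(second_vocab))
```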

Next, the domain generation unit 527 generates a domain by associating the extraction keyword (first advertising vocabulary data) with the extracted second advertising vocabulary data (step S04). A domain here is data in which a plurality of pieces of advertising vocabulary data are associated. FIG. 6 shows an example, domain 410. As in the examples of FIGS. 4 and 5, the user has entered "loan" as the first advertising vocabulary data, and the extraction unit 525 has extracted "same-day" and "effective annual rate" as the second advertising vocabulary data related to it. The domain generation unit 527 therefore generates a domain that associates these terms and stores it in the domain storage unit 540 (step S05).
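Steps S04 and S05 can be pictured with a small data structure. The `Domain` and `DomainStore` classes below are hypothetical; the text states only that a domain associates the first advertising vocabulary data with the second and that it is stored in the domain storage unit.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A first vocabulary item together with its associated second items."""
    first: str
    second: set = field(default_factory=set)


class DomainStore:
    """Minimal in-memory stand-in for the domain storage unit."""

    def __init__(self):
        self._domains = {}

    def save(self, domain):
        # Keyed by the first vocabulary item (a design assumption).
        self._domains[domain.first] = domain

    def get(self, first):
        return self._domains.get(first)


# Step S04: generate the domain; step S05: store it.
domain_410 = Domain("loan", {"same-day", "effective annual rate"})
domain_store = DomainStore()
domain_store.save(domain_410)
print(domain_store.get("loan"))
```

Keying the store by the first vocabulary item makes the later lookup "is there a stored domain whose first item equals this term?" a single dictionary access.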

The extraction unit 525 may extract a plurality of pieces of second advertising vocabulary data to generate a domain, or it may extract only the single term with the highest frequency of appearance (co-occurrence rate) as the second advertising vocabulary data. In the description above, only two Web documents, 600 and 605, were extracted. When three or more Web documents are extracted, as with Web documents A to D, the control unit 520 may calculate the frequency of appearance and similar measures across them and extract the second advertising vocabulary data, after which the domain generation unit 527 generates the domain.

FIG. 7 shows a domain 420 for the case where a plurality of pieces of second advertising vocabulary data are extracted. The relationships among the second advertising vocabulary data may be defined in any way. For example, vocabulary included in common in all of Web documents A to D, such as "same-day" and "interest rate", may be positioned as vocabulary strongly related to "loan". Vocabulary that appears in Web documents A and B but not in C and D ("repayment", "collateral", "screening") may be associated as one group, and conversely, vocabulary that appears in Web documents C and D but not in A and B ("guarantor", "interest-free", "effective annual rate") may be associated as another group.
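The grouping described for FIG. 7 can be sketched by keying each term on the set of documents that contain it. This is one possible reading, since the text leaves the relationships among second vocabulary items open; the function name and the term lists for documents A to D are hypothetical.

```python
from collections import defaultdict


def group_by_document_set(docs, keyword):
    """Group terms by the exact set of documents in which they appear.

    `docs` maps a document name to its list of terms.
    """
    containing = defaultdict(set)
    for name, tokens in docs.items():
        for term in set(tokens):
            if term != keyword:
                containing[term].add(name)
    groups = defaultdict(set)
    for term, names in containing.items():
        groups[frozenset(names)].add(term)
    return dict(groups)


# Hypothetical term lists for Web documents A to D.
shared = ["loan", "same-day", "interest rate"]
docs = {
    "A": shared + ["repayment", "collateral", "screening"],
    "B": shared + ["repayment", "collateral", "screening"],
    "C": shared + ["guarantor", "interest-free", "effective annual rate"],
    "D": shared + ["guarantor", "interest-free", "effective annual rate"],
}
groups = group_by_document_set(docs, "loan")
for names, terms in groups.items():
    print(sorted(names), sorted(terms))
```

Terms found in every document form the group most strongly tied to the keyword, while the A/B and C/D groups correspond to the two sub-groups in the figure.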

Next, the process by which domains are associated with one another after the domain generation unit 527 has generated a plurality of them is described with reference to FIG. 8. First, the domain generation unit 527 determines whether a new domain has been stored in the domain storage unit 540 (step S10). If so, it determines whether the domain storage unit 540 holds a domain whose first advertising vocabulary data is identical to any second advertising vocabulary data in the new domain (step S11). If such a domain is stored, the domain generation unit 527 associates that domain with the new domain (step S12) and generates a new, combined domain (step S13).
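The flow of steps S10 to S13 can be sketched as a function that runs whenever a new domain is stored. Several points are assumptions: domains are held as a dictionary from first term to second terms, the combined domain of step S13 is represented simply as a set of links, and the match is checked in both directions because the worked example of FIG. 9 matches the new domain's first term against a stored domain's second terms. All names and sample terms are hypothetical.

```python
def associate_new_domain(store, links, new_first, new_second):
    """Store a new domain and link it to related stored domains.

    `store` maps a first term to its set of second terms.
    `links` is a set of frozenset pairs of first terms.
    """
    new_second = set(new_second)
    store[new_first] = new_second  # S10: a new domain has been stored
    for first, second in store.items():
        if first == new_first:
            continue
        # S11: a stored first term appears among the new second terms,
        # or (as in the FIG. 9 example) the new first term appears among
        # the stored second terms.
        if first in new_second or new_first in second:
            links.add(frozenset((first, new_first)))  # S12 and S13
    return links


store = {"loan": {"guarantor", "same-day", "interest rate"}}
links = set()
associate_new_domain(store, links, "guarantor", {"joint liability"})
associate_new_domain(store, links, "digital camera", {"pixels"})
print(links)
```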

FIG. 9 is a conceptual diagram of domains that illustrates the domain association process. Suppose domain A is already stored in the domain storage unit 540, with "loan" as its first advertising vocabulary data. The domain generation unit 527 then generates domain B, whose first advertising vocabulary data is "guarantor". Because the first advertising vocabulary data "guarantor" of domain B is identical to "guarantor" in the second advertising vocabulary data of domain A, the "guarantor" of domain B is associated with the "guarantor" of domain A. Domains C and D are associated with domain A in the same way, and the new domain in which these domains are associated is stored in the domain storage unit 540.

Domain E, on the other hand, shares no advertising vocabulary data with any of domains A to D. That is, no domain includes the first advertising vocabulary data “digital camera” of domain E in its second advertising vocabulary data, and no domain has second advertising vocabulary data of domain E as its first advertising vocabulary data. Domain E is therefore only weakly related to domains A to D. In this way, the new domain generated from the associated domains provides data from which the correlations between domains can be understood.
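One way to read the relationship among domains A to E is as a grouping problem: domains that share a term end up in the same group, and a domain such as E stays alone. The sketch below illustrates this under stated assumptions. Only the terms "loan", "guarantor", and "digital camera" come from the text; every other term, and the names `domains`, `related`, and `groups`, are hypothetical. The breadth-first grouping is an illustrative technique, not something the source prescribes.

```python
from __future__ import annotations

from collections import deque

# First term -> second terms. Only "loan", "guarantor" and "digital camera"
# come from the text; all other terms are made-up placeholders.
domains = {
    "loan": {"guarantor", "interest rate", "repayment"},   # domain A
    "guarantor": {"contract"},                             # domain B
    "interest rate": {"annual rate"},                      # domain C (assumed)
    "repayment": {"installment"},                          # domain D (assumed)
    "digital camera": {"pixels", "zoom"},                  # domain E
}


def related(a: str, b: str) -> bool:
    """True if one domain's first term is among the other's second terms."""
    return a in domains[b] or b in domains[a]


def groups() -> list[set[str]]:
    """Collect domains into groups connected by `related` (breadth-first)."""
    seen: set[str] = set()
    result = []
    for start in domains:
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for other in domains:
                if other not in seen and related(current, other):
                    seen.add(other)
                    group.add(other)
                    queue.append(other)
        result.append(group)
    return result


clusters = groups()
```

With this sample data, `clusters` holds two groups: one containing the four finance-related domains and one containing only "digital camera", which mirrors the weak relationship between domain E and domains A to D.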

The related word extraction device 500 described above may be applied to a search service provided via a communication line such as the Internet (a service that receives input of specific advertising vocabulary data and provides link data for Web documents related to that data), where it may be used to present users with vocabulary related to the words they search for.

Furthermore, the information that the related word extraction device 500 extracts and stores as domains in the domain storage unit 540 is not limited to corporate advertising and promotional information; it may be any technical terminology. That is, by inputting first specialized vocabulary data for a given specialized field in place of the first advertising vocabulary data and extracting second vocabulary data, a domain for that field is generated and stored in the domain storage unit 540. For example, inputting cooking-related data as the first specialized vocabulary data generates cooking-related domains, so that the domain storage unit 540 can hold related words about cooking.

With the related word extraction device 500 described above, vocabulary related to Web documents that promote or advertise a given product or service can be extracted, and data indicating how the extracted related words relate to one another can be generated. In other words, vocabulary that appears in common across the Web documents provided by multiple vendors of the same product or service can be extracted. For example, suppose that a user enters “insurance” as a search keyword and insurance company A wants a link to its own Web document to appear in the search results. Insurance company A then needs some indication of which other terms besides “insurance” (for example, “guarantor”) should also return a link to its Web document. In such a case, the domains generated by the related word extraction device 500 serve as an indicator of the vocabulary related to the search keyword. The related word extraction device 500 may therefore be applied as a service that provides vocabulary related to search keywords.
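As a rough illustration of how stored domains might serve as such an indicator, the sketch below looks up the domain for a search keyword and returns its second terms, most frequent first. Everything here is an assumption made for illustration: the `DOMAINS` table, the `suggest_terms` function, the document counts, and every term other than "insurance", "loan", and "guarantor" are invented and do not come from the source.

```python
from __future__ import annotations

# Hypothetical stored domains: first term -> {second term: number of Web
# documents in which it appeared}. Counts and most terms are invented.
DOMAINS: dict[str, dict[str, int]] = {
    "insurance": {"guarantor": 12, "premium": 30, "contract": 18},
    "loan": {"guarantor": 25, "interest rate": 40},
}


def suggest_terms(keyword: str, limit: int = 3) -> list[str]:
    """Return up to `limit` related terms for `keyword`, most frequent first."""
    second_terms = DOMAINS.get(keyword, {})
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(second_terms.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _count in ranked[:limit]]


suggestions = suggest_terms("insurance")
```

An advertiser such as insurance company A could use a list like `suggestions` to decide which additional search terms should also lead to its Web document. A keyword with no stored domain simply yields an empty list.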

As one embodiment, the present invention can be realized by computer programs running on the computers that operate in the related word extraction system 1. The storage medium storing the program can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device), or a propagation medium. Examples of computer-readable media include semiconductor memory, magnetic tape, removable computer diskettes, random access memory (RAM), read-only memory (ROM), rigid magnetic disks, and optical disks. Current examples of optical disks include compact disk read-only memory (CD-ROM), compact disk read/write (CD-R/W), and DVD.

The embodiments of the present invention have been described above, but they are merely specific examples and do not limit the present invention. Likewise, the effects described in the embodiments merely list the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

FIG. 1 is a configuration diagram of the related word extraction system 1.
FIG. 2 is a flowchart of the main process.
FIG. 3 is a diagram of the Web document correspondence table.
FIG. 4 is a screen image showing the output of the Web document extraction results.
FIG. 5 is a diagram showing a Web document.
FIG. 6 is a diagram showing a generated domain.
FIG. 7 is a diagram showing a generated domain.
FIG. 8 is a flowchart of the domain association process.
FIG. 9 is a conceptual diagram showing how multiple domains are associated.

Explanation of symbols

1 Related word extraction system
30 Communication line network
100 Control unit
100a, 100b, 100c Web servers
200 User terminal
210 Communication unit
220 Control unit
250 Input unit
260 Output unit
305 Input window
311 Link data
321 Link data
410 Domain
420 Domain
500 Related word extraction device
510 Communication unit
520 Control unit
525 Extraction unit
527 Domain generation unit
530 Document storage unit
540 Domain storage unit
550 Input unit
560 Output unit
600 Document
605 Document
620 Control unit

Claims

A related word extraction device that associates mutually related advertising vocabulary data from among a plurality of advertising vocabulary data so that a Web document provider can analyze vocabulary frequently used in a specific business category or industry, the device comprising:
a receiving unit that receives Web documents stored in a storage device connected via a communication line;
a Web document storage unit that stores the Web documents received by the receiving unit;
an input unit that receives input of first advertising vocabulary data related to the advertising vocabulary data to be extracted;
a Web document extraction unit that extracts, from the Web document storage unit, Web documents including the first advertising vocabulary data input via the input unit;
an extraction unit that extracts second advertising vocabulary data commonly included in the extracted Web documents;
a domain generation unit that generates a domain in which the second advertising vocabulary data extracted by the extraction unit is associated with the first advertising vocabulary data; and
a domain storage unit that stores the domain generated by the domain generation unit.

The related word extraction device according to claim 1, wherein the domain generation unit associates a domain, generated from other first advertising vocabulary data different from the first advertising vocabulary data and from second advertising vocabulary data extracted for that other first advertising vocabulary data, with a domain already stored in the domain storage unit.

The related word extraction device described, wherein, when extracting the second advertising vocabulary data commonly included in the Web documents, the extraction unit preferentially extracts second advertising vocabulary data that appears with high frequency.

A related word extraction method for associating mutually related advertising vocabulary data from among a plurality of advertising vocabulary data so that a Web document provider can analyze vocabulary frequently used in a specific business category or industry, the method comprising:
a receiving step of receiving Web documents stored in a storage device connected via a communication line;
a storing step of storing the Web documents received in the receiving step;
an input step of receiving input of first advertising vocabulary data related to the advertising vocabulary data to be extracted;
an extraction step of extracting Web documents including the first advertising vocabulary data input in the input step;
a second advertising vocabulary data extraction step of extracting second advertising vocabulary data commonly included in the extracted Web documents;
a domain generation step of generating a domain in which the second advertising vocabulary data extracted in the second advertising vocabulary data extraction step is associated with the first advertising vocabulary data; and
a domain storage step of storing the domain generated in the domain generation step.

The related word extraction method according to claim 4, wherein, in the domain generation step, a domain generated from other first advertising vocabulary data different from the first advertising vocabulary data and from second advertising vocabulary data extracted for that other first advertising vocabulary data is associated with a domain already stored in the domain storage step.