JP6653169B2

JP6653169B2 - Keyword extraction device, content generation system, keyword extraction method, and program

Info

Publication number: JP6653169B2
Application number: JP2015249124A
Authority: JP
Inventors: 啓一副島
Original assignee: Faber Co
Current assignee: Faber Co
Priority date: 2015-12-21
Filing date: 2015-12-21
Publication date: 2020-02-26
Anticipated expiration: 2035-12-21
Also published as: JP2017117021A

Description

The present invention relates to a keyword extraction device, a content generation system, a keyword extraction method, and a program.

With the rapid spread of the Internet in recent years, a growing number of users rely on services provided by Web sites to look up information or to purchase products sold on those sites.
In such cases, the user enters a keyword related to the information being sought (hereinafter referred to as a search keyword) into a search engine and searches for websites that provide the desired service. The user then accesses the websites shown on the display unit as search results one after another, starting for example with those displayed at the top, and browses them.

A search engine, for example, compares the search keyword with the description in a website's source code and selects the websites to display as search results according to the degree of match between the two.

For this reason, website operators have in recent years wanted to know what kind of description a website should contain in order to appear near the top of the search results.
In response to this demand, keyword extraction systems that extract keywords for building website content have been proposed. Such a system determines a search keyword in advance for each web page making up the website, acquires the search history of a search engine for each search keyword, and selects search keywords in descending order of the number of acquired search-history entries. In this way, the keyword extraction system selects keywords that are searched for frequently in the search engine as the keywords for building content (see, for example, Patent Document 1).

Patent Document 1: JP 2006-146446 A

However, to appear near the top of the search results it is not enough merely to write the search keyword in a meta tag or in the content; the content must also contain descriptions that match the search keyword well.
With the technology described in Patent Document 1, the content merely contains frequently searched keywords and does not necessarily contain the information the user wants. Consequently, content may fail to appear near the top of the search results even though it contains frequently searched keywords.

The present invention has been made in view of the above, and its object is to provide a keyword extraction device, a content generation system, a keyword extraction method, and a program capable of extracting keywords that correspond to the information a user wants to know.

(1) To achieve the above object, a keyword extraction device according to one aspect of the present invention includes: a search unit that searches for a plurality of contents, each including main content, based on a search keyword; a first noise removal unit that removes, from the plurality of contents found by the search unit, the content of a predetermined domain that is not meaningful for keyword extraction; a main content extraction unit that sequentially selects one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal unit, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content; and a keyword extraction unit that extracts a plurality of keywords from the text of the main content extracted by the main content extraction unit.
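The comparison with the link destination in aspect (1) can be illustrated with a minimal sketch. It assumes that each page has already been split into text blocks and that "similar" means a `difflib.SequenceMatcher` ratio at or above a threshold; the block segmentation, the similarity measure, the threshold value, and the function names are all illustrative assumptions and are not specified in this part of the description.

```python
from difflib import SequenceMatcher


def is_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Return True when two text blocks are near-duplicates (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def extract_main_content(page_blocks, linked_blocks, threshold=0.9):
    """Keep only blocks of the selected page that have no similar
    counterpart on the link-destination page."""
    return [
        block for block in page_blocks
        if not any(is_similar(block, other, threshold) for other in linked_blocks)
    ]


# Hypothetical blocks: the header and footer are shared by both pages.
page = [
    "Example Shop | Home | About",
    "How to choose running shoes for beginners",
    "Copyright Example Shop",
]
linked = [
    "Example Shop | Home | About",
    "Our return policy explained",
    "Copyright Example Shop",
]
main_content = extract_main_content(page, linked)
```

The idea behind the sketch is that elements such as headers, sidebars, and footers tend to repeat across pages of the same site, so blocks that also appear on a linked page are unlikely to be the main content.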

(2) The keyword extraction device according to one aspect of the present invention may further include a second noise removal unit that removes unnecessary descriptions that are not meaningful for keyword extraction by removing, from the information of the main content extracted by the main content extraction unit, the information described by predetermined tags, and the keyword extraction unit may extract keywords from the text of the main content after the second noise removal unit has removed the information described by the predetermined tags.

(3) To achieve the above object, a keyword extraction device according to one aspect of the present invention includes: a search unit that searches for a plurality of contents, each including main content, based on a search keyword; a second noise removal unit that sequentially selects one content from the plurality of contents found by the search unit and removes unnecessary descriptions that are not meaningful for keyword extraction by removing, from the selected content, the information described by predetermined tags; and a keyword extraction unit that extracts a plurality of keywords from the text of the content from which the second noise removal unit has removed the information described by the predetermined tags.
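As a rough illustration of the tag-based noise removal in aspect (3), the sketch below drops all text enclosed in a configurable set of tags using Python's standard `html.parser` module. The tag set (`script`, `style`, `nav`) and the names `TagNoiseRemover` and `remove_tag_noise` are hypothetical; the tags actually treated as noise are not listed in this part of the description.

```python
from html.parser import HTMLParser


class TagNoiseRemover(HTMLParser):
    """Collect page text while skipping everything inside the noise tags."""

    def __init__(self, noise_tags):
        super().__init__()
        self.noise_tags = set(noise_tags)
        self.depth = 0    # greater than 0 while inside a noise tag
        self.parts = []   # text fragments that survive the filter

    def handle_starttag(self, tag, attrs):
        if tag in self.noise_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.noise_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())


def remove_tag_noise(html, noise_tags=("script", "style", "nav")):
    """Return the text of `html` without content described by `noise_tags`."""
    parser = TagNoiseRemover(noise_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Running shoes guide</p><script>var x = 1;</script></body></html>"
)
cleaned = remove_tag_noise(sample)
```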

(4) The keyword extraction device according to one aspect of the present invention may further include a first noise removal unit between the search unit and the keyword extraction unit, and the first noise removal unit may remove, from the plurality of contents found by the search unit, the content of a predetermined domain that is not meaningful for keyword extraction.

(5) The keyword extraction device according to one aspect of the present invention may further include a main content extraction unit that sequentially selects one content from the contents from which the second noise removal unit has removed the information described by the predetermined tags, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content.
(6) The keyword extraction device according to one aspect of the present invention may further include a main content extraction unit that sequentially selects one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal unit, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content.

(7) In the keyword extraction device according to one aspect of the present invention, the search unit may search for content based on the search keyword while limiting the domains in which content is searched, and the keyword extraction unit may extract a plurality of keywords from the text of the content of the limited domains and generate a keyword list based on the extraction result.

(8) In the keyword extraction device according to one aspect of the present invention, the search unit may search for content in at least two different predetermined domains based on the search keyword, and the keyword extraction unit may extract a plurality of keywords from the text of the content of each of the different domains, compare the keywords extracted for each domain, and generate a keyword list based on the comparison result.
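One simple way to picture the comparison in aspect (8) is with set operations over the keywords extracted for two domains. This is only a sketch under the assumption that the comparison separates shared keywords from domain-specific ones; this part of the description does not state how the comparison is performed, and the function name and result keys are hypothetical.

```python
def compare_keywords(keywords_a, keywords_b):
    """Split two keyword collections into shared and domain-specific groups."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    return {
        "common": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Hypothetical keywords extracted from content in two different domains.
result = compare_keywords(
    ["shoes", "running", "cushioning"],
    ["shoes", "running", "price"],
)
```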

(9) In the keyword extraction device according to one aspect of the present invention, the search unit may search for content based on the search keyword and may also perform a search based on the plurality of keywords that the keyword extraction unit extracted from the text of the content in order to obtain the rank of an evaluation target site in the search results, and the keyword extraction unit may extract a plurality of keywords from the text of the content and generate a keyword list based on the result of determining whether the extracted keywords are used in the content of the evaluation target site and on the search rank of the evaluation target site obtained by the search unit.

(10) The keyword extraction device according to one aspect of the present invention may further include: a sentence extraction unit that extracts at least one sentence from the main content extracted by the main content extraction unit; a search rank acquisition unit that acquires the rank obtained when the search unit performs a search based on the sentence; and an evaluation result generation unit that evaluates, based on the rank acquired by the search rank acquisition unit, the evaluation target web page from which the sentence was extracted.

(11) The keyword extraction device according to one aspect of the present invention may further include: a suggestion acquisition unit that acquires predicted words based on the search keyword from the results of the search performed by the search unit; a search rank acquisition unit that selects one of the plurality of predicted words acquired by the suggestion acquisition unit and acquires the search rank for the selected predicted word, using the main content that the main content extraction unit extracted from the results of the search unit's search for the selected predicted word; and an evaluation result generation unit that evaluates an evaluation target web page based on the rank acquired by the search rank acquisition unit.

(12) To achieve the above object, a content generation system according to one aspect of the present invention includes the keyword extraction device according to any one of (1) to (11), and a content generation device that generates predetermined content using the plurality of keywords extracted by the keyword extraction device.

(13) To achieve the above object, a keyword extraction method according to one aspect of the present invention includes: a search procedure in which a search unit searches for a plurality of contents, each including main content, based on a search keyword; a first noise removal procedure in which a first noise removal unit removes, from the plurality of contents found by the search procedure, the content of a predetermined domain that is not meaningful for keyword extraction; a main content extraction procedure in which a main content extraction unit sequentially selects one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal procedure, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content; and a keyword extraction procedure in which a keyword extraction unit extracts a plurality of keywords from the text of the main content extracted by the main content extraction procedure.

(14) To achieve the above object, a keyword extraction method according to one aspect of the present invention includes: a search procedure in which a search unit searches for a plurality of contents, each including main content, based on a search keyword; a second noise removal procedure in which a second noise removal unit sequentially selects one content from the plurality of contents found by the search procedure and removes unnecessary descriptions that are not meaningful for keyword extraction by removing, from the selected content, the information described by predetermined tags; and a keyword extraction procedure in which a keyword extraction unit extracts a plurality of keywords from the text of the content from which the information described by the predetermined tags has been removed by the second noise removal procedure.

(15) To achieve the above object, a program according to one aspect of the present invention causes a computer to execute: a search procedure of searching for a plurality of contents, each including main content, based on a search keyword; a first noise removal procedure of removing, from the plurality of contents found by the search procedure, the content of a predetermined domain that is not meaningful for keyword extraction; a main content extraction procedure of sequentially selecting one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal procedure, extracting information indicating a link destination from the selected content, comparing the information of the extracted link destination with the information of the selected content, and extracting the main content by removing the similar information from the information of the selected content; and a keyword extraction procedure of extracting a plurality of keywords from the text of the main content extracted by the main content extraction procedure.

(16) To achieve the above object, a program according to one aspect of the present invention causes a computer to execute: a search procedure of searching for a plurality of contents, each including main content, based on a search keyword; a second noise removal procedure of sequentially selecting one content from the plurality of contents found by the search procedure and removing unnecessary descriptions that are not meaningful for keyword extraction by removing, from the selected content, the information described by predetermined tags; and a keyword extraction procedure of extracting a plurality of keywords from the text of the content from which the information described by the predetermined tags has been removed by the second noise removal procedure.

According to the present invention, keywords corresponding to the information a user wants to know can be extracted.

A diagram showing the operation screen of the keyword extraction device according to the first embodiment.
A schematic configuration diagram of the keyword extraction device according to the first embodiment.
A diagram showing an example of information stored in the domain DB according to the first embodiment.
A diagram showing an example of the configuration of a web page.
A flowchart of the processing of the keyword extraction device according to the first embodiment.
A diagram showing an example of the source code of a web page.
A diagram showing a configuration example of the own web page and a link-destination web page according to the first embodiment.
A flowchart of the main content extraction procedure according to the first embodiment.
A block diagram showing the configuration of the keyword extraction unit according to the first embodiment.
A diagram showing an example of the keyword list output by the keyword list output unit according to the first embodiment.
A flowchart of the keyword extraction processing according to the first embodiment.
A schematic configuration diagram of the keyword extraction device according to a modification of the first embodiment.
A diagram showing an example of information stored in the tag DB according to the modification of the first embodiment.
A flowchart of the processing of the keyword extraction device according to the modification of the first embodiment.
A flowchart of the meaningless-word removal processing according to the modification of the first embodiment.
A schematic configuration diagram of the keyword extraction device according to the second embodiment.
A diagram showing an example of information stored in the tag DB according to the second embodiment.
A flowchart of the processing of the keyword extraction device according to the second embodiment.
A schematic configuration diagram of the keyword extraction device according to a first modification of the second embodiment.
A flowchart of the processing of the keyword extraction device according to the first modification of the second embodiment.
A schematic configuration diagram of the keyword extraction device according to a second modification of the second embodiment.
A diagram showing an example of the domains stored in the domain DB according to the second modification of the second embodiment.
A diagram showing an example of the operation screen of the keyword extraction device according to the second modification of the second embodiment.
A diagram showing a comparison example of keyword search results obtained by the important keyword extraction device using the seventh extraction method according to the second modification of the second embodiment.
A diagram showing an example of the operation screen of the keyword extraction device when the eighth extraction method according to the second modification of the second embodiment is selected.
A flowchart of the processing in the eighth extraction method according to the second modification of the second embodiment.
A diagram showing an example of the evaluation results produced by the keyword extraction device when the ninth extraction method according to the second modification of the second embodiment is selected.
A schematic configuration diagram of the keyword extraction device according to the third embodiment.
A flowchart of the evaluation processing performed by the keyword extraction device according to the third embodiment.
A diagram showing an example of the evaluation results according to the third embodiment.
A schematic configuration diagram of the keyword extraction device according to the fourth embodiment.
A flowchart of the evaluation processing performed by the keyword extraction device according to the fourth embodiment.
A diagram showing an example of the evaluation results according to the fourth embodiment.
A configuration diagram showing the content generation system according to the fifth embodiment.

[Summary of the present invention]
First, an outline of the present invention is given.
In the present invention, a search keyword relating to a website is searched for with a search engine. The search keyword is a keyword that a viewer of a web page is expected to enter into a search engine. A predetermined number of web pages are then selected from the top of the search results, and noise is removed from each of the selected web pages (also referred to as contents). The text contained in the noise-free content is then analyzed to extract keywords. Here, a keyword is a keyword contained in a web page that appears near the top of the results when a search engine is queried with the search keyword. Each process is described later.
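The overall flow outlined above (take the top results, remove noise from each, analyze the remaining text) can be sketched as follows. This is a minimal illustration under several assumptions that do not come from the source: the ranked pages are already available as plain-text strings, noise removal is passed in as a callable, and "analysis" is reduced to counting lowercase ASCII words. Japanese text would need a proper tokenizer such as a morphological analyzer. The default of 20 pages follows the example value given later in the description.

```python
import re
from collections import Counter


def extract_keywords(ranked_pages, top_n=20, remove_noise=lambda text: text,
                     max_keywords=10):
    """Count words over the top-ranked pages after noise removal."""
    counts = Counter()
    for text in ranked_pages[:top_n]:          # top N search results only
        cleaned = remove_noise(text)           # pluggable noise removal
        counts.update(re.findall(r"[a-z]+", cleaned.lower()))
    return [word for word, _ in counts.most_common(max_keywords)]


# Hypothetical page texts, already ordered by search rank.
pages = ["running shoes guide", "running shoes sale", "running tips"]
top_keywords = extract_keywords(pages, top_n=3, max_keywords=2)
```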

Embodiments of the present invention are described in detail below with reference to the drawings. The present invention is not limited to these embodiments, and various modifications are possible within the scope of its technical idea.

[First Embodiment]
FIG. 1 is a diagram showing the operation screen g101 of the keyword extraction device 1 according to the present embodiment.
In FIG. 1, box g111 is the input field for the search keyword, box g112 is the image of a button that starts keyword extraction based on the search keyword, and box g113 is the field that displays the list of keywords extracted by the keyword extraction device 1 based on the search keyword.

<Configuration of Keyword Extraction Device 1>
FIG. 2 is a schematic configuration diagram of the keyword extraction device 1 according to the present embodiment.
As shown in FIG. 2, the keyword extraction device 1 includes a keyword input unit 11, a search unit 12, a domain DB 13, a first noise removal unit 14, a main content extraction unit 15, a keyword extraction unit 18, and a keyword list output unit 19. The keyword extraction device 1 is connected to a network 2. The network 2 is, for example, the Internet.

The keyword input unit 11 is, for example, a keyboard, a mouse, or a tablet. The keyword input unit 11 outputs the search keyword entered by the user to the search unit 12.

The search unit 12 acquires the search keyword output by the keyword input unit 11 and uses a search engine to search for web pages that match it. From the web pages obtained by the search, the search unit 12 selects a predetermined number of web pages from the top. The predetermined number is, for example, 20. The search unit 12 outputs information indicating the selected web pages to the first noise removal unit 14. The search results include the URL (Uniform Resource Locator) of each web page.

The domain DB 13 stores the domains of websites that are unnecessary for keyword extraction. Here, an unnecessary domain is the domain of a website that is not meaningful as content, such as a site that merely collects parts of other web pages.

Using the information indicating the predetermined number of web pages output by the search unit 12, the first noise removal unit 14 removes the web pages belonging to domains stored in the domain DB 13 and outputs information indicating the remaining web pages to the main content extraction unit 15. The information indicating a web page includes the source code of the web page. The information of a web page also includes a header, a sidebar, main content, a footer, and the like. When none of the predetermined number of web pages belongs to a domain stored in the domain DB 13, the first noise removal unit 14 outputs the information indicating the predetermined number of web pages to the main content extraction unit 15 unchanged.
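The domain-based filtering performed by the first noise removal unit 14 might look like the sketch below, which uses `urllib.parse.urlparse` from the Python standard library to read the host name of each result URL. The function name, the example URLs, and the decision to treat subdomains of a stored domain as matches are assumptions made for illustration; the description only says that pages of the stored domains are removed.

```python
from urllib.parse import urlparse


def remove_noise_domains(urls, blocked_domains):
    """Drop result URLs whose host is a blocked domain or one of its subdomains."""
    blocked = {domain.lower() for domain in blocked_domains}

    def is_blocked(url):
        host = (urlparse(url).hostname or "").lower()
        # Assumption: a stored domain also covers its subdomains.
        return any(host == d or host.endswith("." + d) for d in blocked)

    return [url for url in urls if not is_blocked(url)]


# Hypothetical search results and a hypothetical domain DB entry.
results = [
    "https://example.com/a",
    "https://matome.example.net/b",
    "https://www.example.org/c",
]
remaining = remove_noise_domains(results, {"example.net"})
```

When no URL belongs to a stored domain, the list is returned with all entries intact, which mirrors the pass-through case described above.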

Using the information indicating the web pages output by the first noise removal unit 14, the main content extraction unit 15 sequentially selects the information of one web page from among them and extracts the main content from the information of the selected web page. The method of extracting the main content is described later. The main content extraction unit 15 outputs the extracted main content to the keyword extraction unit 18 for each web page.

The keyword extraction unit 18 extracts a plurality of keywords from the main content output by the main content extraction unit 15. The keyword extraction unit 18 sorts the extracted keywords as described later and outputs the sorted keyword list to the keyword list output unit 19. The keyword extraction method and the sort processing are described later.

キーワードリスト出力部１９は、例えばＷｅｂ上での情報提供部、表示装置、プリンタ装置、通信装置のうち少なくとも１つである。キーワードリスト出力部１９は、キーワード抽出部１８が出力したキーワードリストを、例えばＷｅｂ上で提供する。 The keyword list output unit 19 is, for example, at least one of an information providing unit on the Web, a display device, a printer device, and a communication device. The keyword list output unit 19 provides the keyword list output by the keyword extraction unit 18 on, for example, the Web.

次に、ドメインＤＢ１３に格納されている情報の一例を説明する。
図３は、本実施形態に係るドメインＤＢ１３に格納されている情報の一例を示す図である。図３に示すように、ドメインＤＢ１３には、少なくとも１つのドメインが格納されている。なお、ドメインＤＢ１３に格納されるドメインは、ネットワーク２を介して更新されるようにしてもよい。なお、利用者がキーワードを入力するときに、キーワードを抽出する上で不要なウェブサイトのドメインを入力するようにしてもよい。そして、キーワード抽出装置１は、入力されたドメインを、一時的にドメインＤＢに格納して、キーワードを抽出するようにしてもよい。 Next, an example of information stored in the domain DB 13 will be described.
FIG. 3 is a diagram illustrating an example of information stored in the domain DB 13 according to the present embodiment. As shown in FIG. 3, the domain DB 13 stores at least one domain. The domain stored in the domain DB 13 may be updated via the network 2. When a user inputs a keyword, a domain of a website that is unnecessary for extracting the keyword may be input. Then, the keyword extracting device 1 may temporarily store the input domain in the domain DB and extract the keyword.
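
The domain-based removal performed by the first noise removing unit 14 can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `remove_blocked_domains`, the dictionary layout of a page, and the choice to treat subdomains of a stored domain as matching are all hypothetical and are not specified in this description.

```python
from urllib.parse import urlparse


def remove_blocked_domains(pages, blocked_domains):
    """Return the pages whose host is not a stored domain (hypothetical helper).

    pages: list of dicts with at least a "url" key (assumed layout).
    blocked_domains: domains as they might be stored in the domain DB 13.
    """
    blocked = {d.lower() for d in blocked_domains}
    kept = []
    for page in pages:
        host = (urlparse(page["url"]).hostname or "").lower()
        # Assumption: a subdomain of a stored domain is also removed.
        if any(host == d or host.endswith("." + d) for d in blocked):
            continue
        kept.append(page)
    return kept
```

When no page belongs to a stored domain, the input list is returned unchanged in content, which corresponds to the case in which the first noise removing unit 14 passes on all of the predetermined number of web pages.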

次に、ウェブページｇ２０１の構成の一例を説明する。
図４は、ウェブページｇ２０１の構成の一例を示す図である。図４に示す例は、２カラムで、右にメニューがある例である。図４に示す例のウェブページｇ２０１は、ヘッダーｇ２２１、サイドバーｇ２２２、およびメインコンテンツｇ２２３を含んで構成されている。 Next, an example of the configuration of the web page g201 will be described.
FIG. 4 is a diagram illustrating an example of the configuration of the web page g201. The example shown in FIG. 4 is an example in which there are two columns and a menu is on the right. The web page g201 in the example shown in FIG. 4 includes a header g221, a sidebar g222, and a main content g223.

＜キーワード抽出装置１の処理手順＞
次に、キーワード抽出装置１の処理手順について説明する。図５は、本実施形態に係るキーワード抽出装置１の処理のフローチャートである。
（ステップＳ１）キーワード入力部１１は、利用者によって入力された検索キーワードを取得する。
（ステップＳ２）検索部１２は、キーワード入力部１１が出力した検索キーワードに適したウェブページを、検索エンジンを用いて検索して、検索によって得られたウェブページのうち、上位から所定の個数のウェブページを選択する。
（ステップＳ３）検索部１２は、選択した所定の個数のウェブページそれぞれのソースコードを取得する。 <Processing procedure of keyword extraction device 1>
Next, a processing procedure of the keyword extracting device 1 will be described. FIG. 5 is a flowchart of the process of the keyword extracting device 1 according to the present embodiment.
(Step S1) The keyword input unit 11 acquires a search keyword input by a user.
(Step S2) The search unit 12 uses a search engine to search for web pages matching the search keyword output by the keyword input unit 11, and selects a predetermined number of web pages, starting from the top, from among the web pages obtained by the search.
(Step S3) The search unit 12 acquires the source code of each of the selected predetermined number of web pages.

（ステップＳ４）第１ノイズ除去部１４は、検索部１２が出力した所定の個数のウェブページの情報を用いて、所定の個数のウェブページからドメインＤＢ１３に格納されているドメインのウェブページを除去する。
（ステップＳ５）メインコンテンツ抽出部１５は、第１ノイズ除去部１４が出力したウェブページの情報の中から１つのウェブページの情報を逐次選択し、選択したウェブページの情報からメインコンテンツを抽出する。 (Step S4) The first noise removing unit 14 uses the information on the predetermined number of web pages output by the search unit 12 to remove the web pages of the domains stored in the domain DB 13 from the predetermined number of web pages.
(Step S5) The main content extraction unit 15 sequentially selects one web page information from the web page information output by the first noise removal unit 14, and extracts the main content from the selected web page information.

（ステップＳ７）キーワード抽出部１８は、メインコンテンツ抽出部１５が出力したメインコンテンツから複数のキーワードを抽出する。
以上で、キーワード抽出装置１の処理を終了する。 (Step S7) The keyword extracting unit 18 extracts a plurality of keywords from the main content output by the main content extracting unit 15.
Thus, the processing of the keyword extracting device 1 ends.

＜メインコンテンツの抽出方法＞
次に、メインコンテンツの抽出方法について説明する。
図６は、ウェブページのソースコードの例を示す図である。なお、図６に示したソースコードは、ウェブページのソースコードのうちの一部である。また、図６に示したソースコードは、ウェブページの構成とソースコードとの関係を説明するための例であって、実際のウェブページのソースコードとは一致しない場合がある。
なお、本実施形態におけるウェブページのメインコンテンツとは、キーワードを抽出する上で必要な部分であり、例えば、タイトル、記事、質問内容、図や写真の説明、質問に対する返答等である。一方、本実施形態における不要部分とは、例えば、広告、メニュー等である。 <Method of extracting main content>
Next, a method of extracting the main content will be described.
FIG. 6 is a diagram illustrating an example of a source code of a web page. The source code shown in FIG. 6 is a part of the source code of the web page. The source code shown in FIG. 6 is an example for explaining the relationship between the configuration of the web page and the source code, and may not match the actual source code of the web page.
Note that the main content of the web page in the present embodiment is a part necessary for extracting a keyword, for example, a title, an article, a question content, a description of a diagram or a photo, a response to a question, and the like. The unnecessary part in the embodiment is, for example, an advertisement, a menu, or the like.

図６の符号ｇ２５１に示すように、ソースコードは、複数のタグを用いて記述されている。そして、ソースコードは、ウェブサイトのタイトル等が記述されているヘッダ情報ｇ２６１、ウェブサイトやウェブページのタイトルや説明が記述されているヘッダーｇ２６２、メインコンテンツｇ２６３、ウェブサイト内のリンク先や他のウェブサイトへのリンク先などが記述されているメニューｇ２６４を含んでいる。 As shown by reference numeral g251 in FIG. 6, the source code is described using a plurality of tags. The source code includes header information g261 in which the title of the website and the like are described, a header g262 in which the title and description of the website or web page are described, main content g263, and a menu g264 in which link destinations within the website, link destinations to other websites, and the like are described.

図７は、本実施形態に係る自ウェブページとリンク先のウェブページの構成例を示す図である。なお、自ウェブページとは、図５のステップＳ２の検索結果のうちの１つのウェブページである。
符号ｇ３０１が示すウェブページの構成は、自ウェブページの構成例であり、２カラムの構成であって、ウェブページの上にヘッダーｇ３１１が配置され、左にメインコンテンツｇ３１３が配置され、右にメニューｇ３１２が配置されている。 FIG. 7 is a diagram illustrating a configuration example of the own web page and the linked web page according to the present embodiment. The own web page is one of the search results in step S2 in FIG.
The configuration of the web page indicated by reference numeral g301 is an example of the configuration of the own web page. It has a two-column configuration in which a header g311 is arranged at the top of the web page, main content g313 is arranged on the left, and a menu g312 is arranged on the right.

符号ｇ３２１が示すウェブページの構成は、自ウェブページに記述されている第１のリンク先のウェブページの構成例であり、２カラムの構成であって、ウェブページの上にヘッダーｇ３３１が配置され、左にメインコンテンツｇ３３３が配置され、右にメニューｇ３３２が配置されている。
符号ｇ３４１が示すウェブページの構成は、自ウェブページに記述されている第２のリンク先のウェブページの構成例であり、３カラムの構成であって、ウェブページの上にヘッダーｇ３５１が配置され、左に第１のメニューｇ３５２が配置され、真ん中にメインコンテンツｇ３５３が配置され、右に第２のメニューｇ３５４が配置されている。 The configuration of the web page indicated by reference numeral g321 is an example of the configuration of the web page of the first link destination described in the own web page. It has a two-column configuration in which a header g331 is arranged at the top of the web page, main content g333 is arranged on the left, and a menu g332 is arranged on the right.
The configuration of the web page indicated by reference numeral g341 is an example of the configuration of the web page of the second link destination described in the own web page. It has a three-column configuration in which a header g351 is arranged at the top of the web page, a first menu g352 is arranged on the left, main content g353 is arranged in the middle, and a second menu g354 is arranged on the right.

図７において、符号ｇ３２１が示すウェブページは、自ウェブページと同じウェブサイト内のウェブページの１つである。また、符号ｇ３４１が示すウェブページは、自ウェブページと異なるウェブサイト内のウェブページの１つである。
自ウェブページと同じウェブサイト内のウェブページのＵＲＬアドレスは、ドメイン、ホームページに割り振られたアドレス等が等しい場合が多い。一方、自ウェブページと異なるウェブサイト内のウェブページのＵＲＬアドレスは、ドメイン、ホームページに割り振られたアドレス等が異なる場合が多い。 In FIG. 7, the web page indicated by reference numeral g321 is one of the web pages in the same website as the own web page. The web page indicated by reference numeral g341 is one of the web pages in a website different from that of the own web page.
In many cases, the URL address of a web page in the same website as the own web page has the same domain, the address assigned to the home page, and the like. On the other hand, the URL address of a web page in a website different from the own web page often has a different domain, an address assigned to a home page, or the like.

ここで、自ウェブページのＵＲＬアドレスと、自ウェブページと同じウェブサイト内のウェブページのＵＲＬアドレスとの距離を、第１のレーベンシュタイン距離とする。また、自ウェブページのＵＲＬアドレスと、自ウェブページと異なるウェブサイト内のウェブページのＵＲＬアドレスとの距離を、第２のレーベンシュタイン距離とする。この場合、第１のレーベンシュタイン距離は、第２のレーベンシュタイン距離より小さな値が得られる、すなわちレーベンシュタイン距離が近い。一方、第２のレーベンシュタイン距離は、第１のレーベンシュタイン距離より大きな値であり、すなわちレーベンシュタイン距離が遠い。 Here, the distance between the URL address of the own web page and the URL address of a web page in the same website as the own web page is defined as a first Levenshtein distance. The distance between the URL address of the own web page and the URL address of a web page in a website different from the own web page is defined as a second Levenshtein distance. In this case, the first Levenshtein distance is smaller than the second Levenshtein distance, that is, the Levenshtein distance is short. On the other hand, the second Levenshtein distance is larger than the first Levenshtein distance, that is, the Levenshtein distance is far.
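
The relationship between the first and second Levenshtein distances can be illustrated with a short sketch. The function below is a standard dynamic-programming edit distance; the three URLs are hypothetical examples chosen for illustration and do not come from this description.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical URLs: one own page, one page in the same website,
# and one page in a different website.
own = "https://news.example.com/articles/1001"
same_site = "https://news.example.com/articles/1002"
other_site = "https://shop.example.net/items/42"

first_distance = levenshtein(own, same_site)    # same website
second_distance = levenshtein(own, other_site)  # different website
```

For these example URLs the first distance is smaller than the second, which matches the tendency described above; real URLs will not always behave this way, so the distance is best read as a heuristic.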

レーベンシュタイン距離が近い２つのウェブサイトそれぞれのソースコードを比較した場合、ヘッダーｇ３１１とヘッダーｇ３３１との記述が一致または類似し、メニューｇ３１２とメニューｇ３３２との記述が一致または類似していることが多い。すなわち、ソースコードが一致または類似している領域は、ヘッダーおよびメニュー（サイドバー）であると見なすことができる。そして、自ウェブページのソースコードから、ヘッダーｇ３１１とメニューｇ３１２それぞれの記述を除去したものは、メインコンテンツｇ３１３の記述である。このように、メインコンテンツ抽出部１５は、自ウェブページのソースコードから、ヘッダーｇ３１１とメニューｇ３１２それぞれの記述を除去することでメインコンテンツｇ３１３の記述を抽出する。なお、メインコンテンツ抽出部１５は、周知の文書間の類似度を推定する類似度推定法を用いて、ソースコードが一致しているか否か、または類似しているか否かを判定する。 When the source codes of two websites having a short Levenshtein distance are compared, the descriptions of the header g311 and the header g331 often match or are similar, and the descriptions of the menu g312 and the menu g332 often match or are similar. That is, the regions in which the source codes match or are similar can be regarded as the header and the menu (sidebar). What remains after removing the descriptions of the header g311 and the menu g312 from the source code of the own web page is the description of the main content g313. In this way, the main content extraction unit 15 extracts the description of the main content g313 by removing the descriptions of the header g311 and the menu g312 from the source code of the own web page. Note that the main content extraction unit 15 determines whether the source codes match or are similar by using a known similarity estimation method for estimating the similarity between documents.

また、レーベンシュタイン距離が遠い２つのウェブサイトそれぞれのソースコードを比較した場合、ヘッダーｇ３１１とヘッダーｇ３５１との記述がヘッダーｇ３１１とヘッダーｇ３３１との記述より類似していない場合が多い。また、メニューｇ３１２と第１のメニューｇ３５２との記述が、メニューｇ３１２とメニューｇ３３２との記述より類似していず、メニューｇ３１２と第２のメニューｇ３５４との記述が、メニューｇ３１２とメニューｇ３３２との記述より類似していない場合が多い。この結果、ソースコードが類似している領域がないため、レーベンシュタイン距離が遠い２つのウェブサイトそれぞれのソースコードを比較しても、自ウェブページのヘッダーｇ３１１やメニューｇ３１２（サイドバー）の記述を特定できない。このように、レーベンシュタイン距離が遠い２つのウェブサイトそれぞれのソースコードを比較しても、メインコンテンツｇ３１３の記述を抽出できない。 Further, when the source codes of two websites having a long Levenshtein distance are compared, the descriptions of the header g311 and the header g351 are often less similar than the descriptions of the header g311 and the header g331. Likewise, the descriptions of the menu g312 and the first menu g352, and of the menu g312 and the second menu g354, are often less similar than the descriptions of the menu g312 and the menu g332. As a result, there is no region in which the source codes are similar, so comparing the source codes of two websites having a long Levenshtein distance cannot identify the descriptions of the header g311 or the menu g312 (sidebar) of the own web page. Consequently, the description of the main content g313 cannot be extracted by comparing the source codes of two websites having a long Levenshtein distance.

また、特定のウェブページにのみ出現する部分は、メインコンテンツである傾向が高い。一方、不要部分は、複数のウェブページにわたって出現する傾向がある。例えば、ニュースサイトの記事の場合、当該ウェブページに他のニュースのリンク先が記載されている場合があり、他のニュースのウェブページと、当該ウェブページとの構成（図４参照）が似ている場合が多い。他のニュースのウェブページと当該ウェブページとには、例えば、図４のサイドバーｇ２２２にリンク先の情報、広告等が記載されている。このように、本実施形態では、複数のウェブページを比較し、比較した結果、共通している部分を不要部分と見なし、他のウェブページに出現しない部分をメインコンテンツであると見なす。そして、本実施形態では、検索されたウェブページの中から１つを選択し、選択したウェブページに記載されているリンク先を比較に用いるウェブページとする。また、本実施形態では、ウェブページ同士の比較に、例えばレーベンシュタイン距離を用いている。 Also, a portion that appears only on a specific web page tends to be main content, whereas unnecessary portions tend to appear across a plurality of web pages. For example, an article on a news site may contain links to other news articles, and the configuration (see FIG. 4) of the web pages of those other articles is often similar to that of the web page in question. In both the other news web pages and the web page in question, link destination information, advertisements, and the like are described in, for example, the sidebar g222 of FIG. 4. Accordingly, in the present embodiment, a plurality of web pages are compared; the portions they have in common are regarded as unnecessary portions, and the portions that do not appear on the other web pages are regarded as main content. In the present embodiment, one of the searched web pages is selected, and the link destinations described in the selected web page are used as the web pages for comparison. In the present embodiment, for example, the Levenshtein distance is used for the comparison between web pages.

このため、本実施形態では、メインコンテンツ抽出部１５が、自ウェブページのＵＲＬアドレスとレーベンシュタイン距離が近い自ウェブページ内に記述されているリンク先のウェブページのＵＲＬアドレスを少なくとも１つ抽出する。そして、メインコンテンツ抽出部１５が、自ウェブページのソースコードと、レーベンシュタイン距離が近いリンク先のウェブページのソースコードを取得し、取得したソースコードの類似性に基づいて、不要なエリアの記述を除去することでメインコンテンツを抽出する。 For this reason, in the present embodiment, the main content extraction unit 15 extracts, from the link destinations described in the own web page, at least one URL address of a link destination web page whose Levenshtein distance from the URL address of the own web page is short. The main content extraction unit 15 then acquires the source code of the own web page and the source code of the link destination web page having the short Levenshtein distance, and extracts the main content by removing the descriptions of unnecessary areas based on the similarity between the acquired source codes.

次に、メインコンテンツ抽出部１５が、図５のステップＳ５で行うメインコンテンツの抽出処理の手順の一例を説明する。
図８は、本実施形態に係るメインコンテンツの抽出処理の手順のフローチャートである。 Next, an example of the procedure of the main content extraction processing performed by the main content extraction unit 15 in step S5 of FIG. 5 will be described.
FIG. 8 is a flowchart of a procedure of main content extraction processing according to the present embodiment.

（ステップＳ１０１）メインコンテンツ抽出部１５は、第１ノイズ除去部１４が出力した所定の個数のウェブページの情報の中から、１つの未処理のウェブページの情報を逐次選択して、ステップＳ１０２〜Ｓ１０６の処理を行う。
（ステップＳ１０２）メインコンテンツ抽出部１５は、選択したウェブページのソースコードを取得する。続けて、メインコンテンツ抽出部１５は、選択したウェブページの内に含まれているリンクを示す情報を抽出する。なお、リンクを示す情報とは、ウェブページのソースコードに含まれる＜ａ ｈｒｅｆ＝”…”＞、＜ｂａｓｅ ｈｒｅｆ＝”…”＞、＜ｌｉｎｋ ｒｅｌ＝”…” ｈｒｅｆ＝”…”＞、＜ｌｉｎｋ ｈｒｅｆ＝”…”＞等のタグで記述されている情報である。なお、本実施形態では、リンクを示す情報がタグで記述されている例を説明したが、記述はこれに限られずリンクを示すものであればよい。 (Step S101) The main content extraction unit 15 sequentially selects the information of one unprocessed web page from the information of the predetermined number of web pages output by the first noise removing unit 14, and performs the processing of steps S102 to S106.
(Step S102) The main content extraction unit 15 acquires the source code of the selected web page. Subsequently, the main content extraction unit 15 extracts information indicating the links included in the selected web page. The information indicating a link is information described in the source code of the web page by tags such as <a href="...">, <base href="...">, <link rel="..." href="...">, and <link href="...">. Although the present embodiment describes an example in which the information indicating a link is described by tags, the description is not limited to this and may take any form that indicates a link.
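
The link extraction of step S102 might be sketched with the Python standard library as shown below. The class name `LinkCollector` and the function name `extract_links` are hypothetical, and resolving relative links against the URL of the page itself (rather than against a `<base>` tag, if present) is a simplifying assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a>, <base>, and <link> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag in ("a", "base", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(source_code, page_url):
    """Return the absolute URLs of the links found in source_code."""
    parser = LinkCollector()
    parser.feed(source_code)
    # Assumption: relative links are resolved against the page URL.
    return [urljoin(page_url, href) for href in parser.hrefs]
```

The returned URL addresses are the candidates for which the Levenshtein distance to the URL address of the selected web page would be calculated in step S103.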

（ステップＳ１０３）メインコンテンツ抽出部１５は、ステップＳ１０２で抽出された複数のリンク先のＵＲＬアドレスの中から１つを逐次選択する。メインコンテンツ抽出部１５は、ステップＳ１０１で選択したウェブページのＵＲＬアドレスと、リンクを示すタグに記述されているＵＲＬアドレスとのレーベンシュタイン距離を逐次計算する。
（ステップＳ１０４）メインコンテンツ抽出部１５は、計算した結果、レーベンシュタイン距離が近い少なくとも１つのリンク先のウェブサイトのソースコードを取得する。なお、メインコンテンツ抽出部１５は、レーベンシュタイン距離が近い順に複数のリンク先を選択するようにしてもよい。 (Step S103) The main content extraction unit 15 sequentially selects one from the plurality of link destination URL addresses extracted in step S102. The main content extraction unit 15 sequentially calculates the Levenshtein distance between the URL address of the web page selected in step S101 and the URL address described in the tag indicating the link.
(Step S104) As a result of the calculation, the main content extraction unit 15 acquires the source code of at least one linked website whose Levenshtein distance is short. Note that the main content extraction unit 15 may select a plurality of link destinations in ascending order of Levenshtein distance.

（ステップＳ１０５）メインコンテンツ抽出部１５は、ステップＳ１０１で選択したウェブページと、ステップＳ１０４で取得したリンク先のウェブページそれぞれのソースコードを比較する。
（ステップＳ１０６）メインコンテンツ抽出部１５は、ステップＳ１０５で比較した結果、ソースコードが近い記述を除去することでメインコンテンツを抽出する（例えば、参考文献１参照）。 (Step S105) The main content extraction unit 15 compares the source code of the web page selected in step S101 with the source code of the link destination web page acquired in step S104.
(Step S106) Based on the comparison in step S105, the main content extraction unit 15 extracts the main content by removing the descriptions whose source code is similar between the compared web pages (for example, see Reference Document 1).

（ステップＳ１０７）メインコンテンツ抽出部１５は、第１ノイズ除去部１４が出力したウェブページの情報について、全てのウェブページについてステップＳ１０２〜Ｓ１０６の処理が終了した場合、抽出したメインコンテンツの記述を、ウェブページ毎にキーワード抽出部１８に出力する。
以上で、メインコンテンツの抽出処理を終了する。 (Step S107) When the processing of steps S102 to S106 has been completed for all the web pages whose information was output by the first noise removing unit 14, the main content extraction unit 15 outputs the description of the extracted main content to the keyword extraction unit 18 for each web page.
This is the end of the main content extraction processing.
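
Steps S105 and S106 can be sketched as a line-by-line comparison between the source code of the selected web page and that of a nearby link destination. This is a rough illustration only: the description leaves the similarity estimation method open ("a known similarity estimation method"), so the use of `difflib.SequenceMatcher`, the line granularity, and the threshold of 0.9 are all assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.9):
    """Assumed similarity test: ratio of matching characters in a and b."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def extract_main_content(own_source, linked_source, threshold=0.9):
    """Keep the lines of own_source that have no matching or similar line
    in linked_source; shared lines are treated as header or menu."""
    linked_lines = [l.strip() for l in linked_source.splitlines() if l.strip()]
    kept = []
    for line in own_source.splitlines():
        text = line.strip()
        if not text:
            continue
        if any(is_similar(text, other, threshold) for other in linked_lines):
            continue  # common to both pages: regarded as an unnecessary part
        kept.append(text)
    return "\n".join(kept)
```

A line-based comparison is simple but sensitive to how the source code is formatted; comparing DOM subtrees or text blocks would be an alternative design with the same intent.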

参考文献１；吉田光男、山本幹雄、教師情報を必要としないニュースページ群からのコンテンツ自動抽出、日本データベース学会論文誌 8(1) 29-34 2009. Reference Document 1: Mitsuo Yoshida and Mikio Yamamoto, "Automatic content extraction from groups of news pages without requiring training data," Transactions of the Database Society of Japan, 8(1), 29-34, 2009.

＜キーワードの抽出＞
次に、キーワードの抽出について説明する。
図９は、本実施形態に係るキーワード抽出部１８の構成を示すブロック図である。図９に示すように、キーワード抽出部１８は、形態素解析部１８１、用語抽出部１８２、およびキーワードリスト生成部１８３を備える。 <Keyword extraction>
Next, keyword extraction will be described.
FIG. 9 is a block diagram illustrating a configuration of the keyword extracting unit 18 according to the present embodiment. As shown in FIG. 9, the keyword extraction unit 18 includes a morphological analysis unit 181, a term extraction unit 182, and a keyword list generation unit 183.

形態素解析部１８１は、メインコンテンツ抽出部１５が出力したメインコンテンツの中のテキスト情報をウェブページ毎に取得する。形態素解析部１８１は、テキスト情報に対して周知の手法を用いて形態素解析を行う。テキストが日本語の場合、形態素解析部１８１は、例えば「ＣｈａＳｅｎ（茶筌）」、「茶まめ」、「ＭｅＣａｂ（和布蕪）」等のソフトウェアを用いて形態素解析を行う。解析した解析結果には、文字列、文字列の品詞の種類、品詞の活用の種類、文字列の原形、読み等が含まれている。形態素解析部１８１は、解析した解析結果を用語抽出部１８２に出力する。 The morphological analysis unit 181 acquires, for each web page, the text information in the main content output by the main content extraction unit 15. The morphological analysis unit 181 performs morphological analysis on the text information using a known method. When the text is in Japanese, the morphological analysis unit 181 performs the morphological analysis using software such as "ChaSen", "ChaMame", or "MeCab". The analysis result includes each character string, its part of speech, its conjugation type, its base form, its reading, and the like. The morphological analysis unit 181 outputs the analysis result to the term extraction unit 182.

用語抽出部１８２は、形態素解析部１８１が出力した解析結果を用いて、語の並びと品詞情報に基づいて複合語を組み立てる。用語抽出部１８２は、例えば名詞が連続して出現している場合、連続している名詞を統合して複合語にする。用語抽出部１８２は、名詞または、複数の名詞を含む複合語を抽出する。用語抽出部１８２は、複合語を構成する最小単位の名詞（以下、単名詞ともいう）または名詞それぞれが、検索部１２によって選択された所定の個数のウェブページに横断的に出現した回数に基づいて、例えばＩＤＦ（ＩｎｖｅｒｓｅＤｏｃｕｍｅｎｔＦｒｅｑｕｅｎｃｙ）法を用いて重要度ｉｄｆを算出する。
また、用語抽出部１８２は、検索部１２によって選択された所定の個数全てのウェブページそれぞれのテキスト情報中の名詞の出現回数または複合語の出現回数を算出する。
用語抽出部１８２は、抽出した名詞または複合語それぞれに、算出した重要度と出現回数とを対応付けてキーワードリスト生成部１８３に出力する。 The term extraction unit 182 uses the analysis result output from the morphological analysis unit 181 to assemble compound words based on the word order and part-of-speech information. For example, when nouns appear consecutively, the term extraction unit 182 merges the consecutive nouns into a compound word. The term extraction unit 182 extracts nouns and compound words containing a plurality of nouns. The term extraction unit 182 then calculates an importance idf, for example by the IDF (Inverse Document Frequency) method, based on the number of times each noun, or each minimum-unit noun constituting a compound word (hereinafter also referred to as a single noun), appears across the predetermined number of web pages selected by the search unit 12.
In addition, the term extraction unit 182 calculates the number of appearances of each noun or compound word in the text information of all of the predetermined number of web pages selected by the search unit 12.
The term extraction unit 182 outputs the extracted noun or compound word to the keyword list generation unit 183 in association with the calculated importance and the number of appearances.
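
The assembly of compound words from consecutive nouns can be sketched as follows. The token format, a list of `(surface, part_of_speech)` pairs with the label `"noun"`, is a hypothetical simplification of the output of a morphological analyzer, and the function name `assemble_terms` is likewise an assumption made for illustration.

```python
def assemble_terms(tokens):
    """Merge runs of consecutive nouns into compound words.

    tokens: list of (surface, part_of_speech) pairs (assumed format).
    Returns single nouns and compound nouns in order of appearance.
    """
    terms = []
    run = []  # surfaces of the nouns in the current consecutive run
    for surface, pos in tokens:
        if pos == "noun":
            run.append(surface)
        elif run:
            # A non-noun ends the run; Japanese needs no separator.
            terms.append("".join(run))
            run = []
    if run:
        terms.append("".join(run))
    return terms
```

For example, the tokens for 「検索エンジンの結果」 would yield the compound word 「検索エンジン」 and the single noun 「結果」.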

キーワードリスト生成部１８３は、用語抽出部１８２が出力した、重要度と出現回数とが対応付けている名詞または複合語を取得する。キーワードリスト生成部１８３は、名詞または複合語毎に重要度と出現回数を乗算して、乗算した値が大きい順に名詞または複合語を並べて、キーワードリストを生成する。キーワードリスト生成部１８３は、生成したキーワードリストをキーワードリスト出力部１９に出力する。
なお、用語抽出部１８２、およびキーワードリスト生成部１８３は、プログラミング言語Ｐｅｒｌのモジュールである、例えば「ＴｅｒｍＥｘｔｒａｃｔ」を含んで構成されていてもよい。 The keyword list generation unit 183 acquires the noun or compound word output by the term extraction unit 182 and associated with the importance and the number of appearances. The keyword list generation unit 183 generates a keyword list by multiplying the importance and the number of appearances for each noun or compound word, and arranging the nouns or compound words in descending order of the multiplied value. The keyword list generation unit 183 outputs the generated keyword list to the keyword list output unit 19.
The term extraction unit 182 and the keyword list generation unit 183 may be configured to include a module of the programming language Perl, for example, “TermExtract”.
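
The scoring and sorting performed by the keyword list generation unit 183 can be sketched as below. The description names the IDF method only as an example and gives no formula, so the form `log(N / df) + 1` is an assumption, as is computing the importance on each extracted term directly rather than on the single nouns that make up a compound word. The function name `build_keyword_list` is hypothetical.

```python
import math
from collections import Counter


def build_keyword_list(pages_terms):
    """pages_terms: one list of extracted terms per web page.

    Returns (term, appearances, appearances * importance) tuples,
    sorted in descending order of the product.
    """
    n_pages = len(pages_terms)
    appearances = Counter()  # total appearances over all pages
    doc_freq = Counter()     # number of pages containing the term
    for terms in pages_terms:
        appearances.update(terms)
        doc_freq.update(set(terms))

    rows = []
    for term, count in appearances.items():
        # Assumed IDF form; the +1 keeps terms found on every page above zero.
        idf = math.log(n_pages / doc_freq[term]) + 1.0
        rows.append((term, count, count * idf))
    # Ties are broken alphabetically so the order is deterministic.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows
```

Because the product rather than the raw count determines the order, a term that appears less often can still rank above a more frequent one when its importance is high.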

上述したように、本実施形態では、検索キーワードを、検索エンジンを用いて検索した上位から所定の個数のウェブページの中から１つのウェブページを１つ逐次選択する。そして、本実施形態では、選択された自ウェブページのソースコードに記述されているリンク先のソースコードと、自ウェブページのソースコードとを比較して、比較した結果に基づいて不要なエリアの記述を除去することでメインコンテンツを抽出する。このように、本実施形態では、ノイズ成分が除去されたメインコンテンツのテキストを用いて、例えば名詞と複合語とを抽出することで、キーワードを精度よく抽出することができる。なお、抽出する言葉は、名詞と複合語に限られず、例えば形容詞や副詞を含んでいてもよい。 As described above, in the present embodiment, web pages are sequentially selected one at a time from the predetermined number of top-ranked web pages obtained by searching for the search keyword with a search engine. The source code of a link destination described in the source code of the selected own web page is then compared with the source code of the own web page, and the main content is extracted by removing the descriptions of unnecessary areas based on the result of the comparison. In this way, the present embodiment can extract keywords accurately by extracting, for example, nouns and compound words from the text of the main content from which the noise components have been removed. Note that the words to be extracted are not limited to nouns and compound words, and may include, for example, adjectives and adverbs.

＜キーワードリストの例＞
ここで、キーワードリスト出力部１９が出力するキーワードリストの例を説明する。
図１０は、本実施形態に係るキーワードリスト出力部１９が出力するキーワードリストの例を示す図である。図１０に示すように、キーワードリストには、入力したキーワード（符号ｇ３６１に示す領域）、出現回数（符号ｇ３６２に示す領域）、出現回数に重要度を乗算した値（符号ｇ３６３に示す領域）、キーワード（符号ｇ３６４に示す領域）が対応付けられている。
キーワードリスト出力部１９は、例えば図１０に示すように、キーワード抽出部１８が出力した出現回数に重要度をキーワード毎に乗算し、乗算した値が大きい順にキーワードのソートを行う。
この結果、図１０に示すように、キーワードの表示順番は、検索されたウェブページ内の出現回数に重要度を乗算した値が大きい順番である。このため、“足先しびれ冷たい”を入力したときに抽出されるキーワードは、出現回数が例えば１５位であっても重要度が大きいため、リストの３番目に表示される。 <Example of keyword list>
Here, an example of the keyword list output by the keyword list output unit 19 will be described.
FIG. 10 is a diagram illustrating an example of a keyword list output by the keyword list output unit 19 according to the present embodiment. As shown in FIG. 10, the keyword list associates the input keyword (the area indicated by reference sign g361), the number of appearances (the area indicated by reference sign g362), the value obtained by multiplying the number of appearances by the importance (the area indicated by reference sign g363), and the extracted keyword (the area indicated by reference sign g364) with one another.
For example, as shown in FIG. 10, the keyword list output unit 19 multiplies the number of appearances output by the keyword extraction unit 18 by importance for each keyword, and sorts the keywords in descending order of the multiplied value.
As a result, as shown in FIG. 10, the keywords are displayed in descending order of the value obtained by multiplying the number of appearances in the searched web pages by the importance. For this reason, a keyword extracted when “足先しびれ冷たい” (toes, numbness, cold) is input can be displayed third in the list even if it ranks, for example, 15th by number of appearances, because its importance is high.

なお、キーワードリスト出力部１９が出力するキーワードリストは、少なくともキーワードが含まれていればよく、出現回数、重要度は含まれていなくてもよい。
また、表示順番は、図１０に示した例に限られず、出現回数が多い順番、重要度の値が大きい順番、他の統計的な手法に基づく順番等であってもよい。 The keyword list output by the keyword list output unit 19 need only include at least the keywords, and need not include the number of appearances or the importance.
Further, the display order is not limited to the example shown in FIG. 10, and may be descending order of the number of appearances, descending order of the importance value, an order based on another statistical method, or the like.

次に、キーワード抽出部１８が、図５のステップＳ７で行うキーワードの抽出処理について説明する。
図１１は、本実施形態に係るキーワードの抽出処理のフローチャートである。
（ステップＳ２０１）形態素解析部１８１は、メインコンテンツ抽出部１５が出力したテキスト情報に対して周知の手法を用いて形態素解析を行う。
（ステップＳ２０２）用語抽出部１８２は、形態素解析部１８１が出力した解析結果を用いて、語の並びと品詞情報に基づいて複合語を組み立てる。 Next, the keyword extraction process performed by the keyword extraction unit 18 in step S7 of FIG. 5 will be described.
FIG. 11 is a flowchart of keyword extraction processing according to the present embodiment.
(Step S201) The morphological analysis unit 181 performs a morphological analysis on the text information output by the main content extraction unit 15 using a known method.
(Step S202) The term extraction unit 182 assembles a compound word based on the word arrangement and part-of-speech information using the analysis result output from the morphological analysis unit 181.

（ステップＳ２０３）用語抽出部１８２は、名詞と、複数の名詞を含む複合語とを抽出する。
（ステップＳ２０４）用語抽出部１８２は、例えばＩＤＦ法を用いて、名詞および複合語それぞれの重要度を算出する。 (Step S203) The term extraction unit 182 extracts a noun and a compound word including a plurality of nouns.
(Step S204) The term extraction unit 182 calculates the importance of each of the noun and the compound word using, for example, the IDF method.

（ステップＳ２０５）用語抽出部１８２は、入力された全てのウェブページそれぞれのテキスト情報中の、名詞および複合語の出現回数を算出する。続けて、用語抽出部１８２は、ステップＳ２０３で抽出された名詞または複合語それぞれに、算出された出現回数とステップＳ２０４で算出された重要度とを対応付けて、キーワードリスト生成部１８３に出力する。 (Step S205) The term extraction unit 182 calculates the number of appearances of nouns and compound words in the text information of each of the input web pages. Subsequently, the term extraction unit 182 associates each of the nouns or compound words extracted in step S203 with the calculated number of appearances and the importance calculated in step S204, and outputs them to the keyword list generation unit 183.

（ステップＳ２０６）キーワードリスト生成部１８３は、用語抽出部１８２が出力した名詞または複合語毎に重要度と出現回数を乗算して、名詞または複合語に対して乗算した値が大きい順にソートを行い、キーワードリストを生成する。キーワードリスト生成部１８３は、生成したキーワードリストをキーワードリスト出力部１９に出力する。
以上で、キーワードの抽出処理を終了する。 (Step S206) The keyword list generation unit 183 multiplies the importance by the number of appearances for each noun or compound word output by the term extraction unit 182, sorts the nouns and compound words in descending order of the product, and generates a keyword list. The keyword list generation unit 183 outputs the generated keyword list to the keyword list output unit 19.
Thus, the keyword extraction processing ends.

＜キーワードリストの利用例＞
このように抽出されたキーワードリストの利用例を説明する。
例えば、Ａ社が、商品Ｂのウェブページを開設する場合、通常、Ｂ商品に対する説明をウェブページに記載する。しかしながら、このような記載では、実際にＢ商品について興味がある利用者が知りたい情報を網羅しているとは限らない。このようなウェブページ、すなわちコンテンツを作成した場合、検索エンジンによってウェブページの記載内容が評価された結果、検索結果の上位に表示されない場合も少なくない。
このため、ウェブページ制作者が、例えばマインドマップ等を用いて、Ｂ商品に関する利用者が検索に用いると想定される検索キーワードを抽出する。そして、抽出された検索キーワードをキーワード抽出装置１に入力して、キーワードリストを得る。
ウェブページ制作者は、キーワードリストに載っているキーワードを用いてＢ商品のウェブページを制作する。これにより、ウェブページ制作者は、Ｂ商品について、利用者が知りたい情報を多く含んだウェブページを制作することができる。このような利用者にとって知りたい情報を多く含んでいるウェブページは、検索エンジンによってウェブページの記載内容が評価された結果、検索結果の上位に表示され、かつ利用者の知りたい多くの情報が含まれているため、利用者の滞在時間が長くなり、商品の購買につながる効果が得られる。 <Example of using keyword list>
A usage example of the keyword list thus extracted will be described.
For example, when company A sets up a web page for product B, it usually writes a description of product B on the web page. However, such a description does not necessarily cover the information that users who are actually interested in product B want to know. When such a web page, that is, such content, is created, it is not uncommon for the search engine's evaluation of the content of the web page to result in the page not being displayed near the top of the search results.
For this reason, the web page creator uses, for example, a mind map or the like to identify search keywords that users are expected to use when searching for product B. The identified search keywords are then input to the keyword extracting device 1 to obtain a keyword list.
The web page creator creates the web page for product B using the keywords in the keyword list. As a result, the web page creator can create a web page that contains much of the information users want to know about product B. A web page containing much of the information users want to know tends to be displayed near the top of the search results as a result of the search engine's evaluation of its content, and, because it contains that information, users stay on the page longer, which can lead to purchases of the product.

なお、上述した例では、商品に関するウェブページを説明したが、これに限られない。パンフレット、カタログ、取扱説明書等を、キーワードリストを用いて制作することで、利用者が知りたい情報を多く含んだ内容にすることができる。 In the example described above, a web page about a product has been described, but the present invention is not limited to this. By producing pamphlets, catalogs, instruction manuals, and the like using the keyword list, their content can be made to include much of the information that users want to know.

次に、例えばトレンドの調査者が、コンビニエンスストアで販売されているスイーツのトレンドを知りたい場合を例に説明する。
検索エンジンに“コンビニ”、“スイーツ”の検索キーワードを入力して検索した場合、２０１５年５月２８日現在、約１８０万件の検索結果が得られる。調査者がこれらを全て読むことは困難であり、いくつかの検索された結果のウェブページ全体を読んだだけでは、トレンドが掴みにくい。
一方、キーワード抽出装置１に“コンビニ”、“スイーツ”の検索キーワードを入力することで、キーワード抽出装置１が、インターネットの利用者によって話題にされているコンビニエンスストアのスイーツに関するキーワードリストを生成することができる。これにより、本実施形態では、生成されたキーワードリストを、トレンドの調査者が見ることでトレンドを知ることもできる。また、キーワードリストを定期的（例えば月に１回）にキーワード抽出装置１によって生成させることで、トレンドの調査者は、キーワードの変化、すなわちトレンドの変化を知ることもできる。 Next, a case will be described as an example where a trend researcher wants to know the trend of sweets sold at a convenience store.
When a search is performed by inputting the search keywords “convenience store” and “sweets” into a search engine, about 1.8 million search results are obtained as of May 28, 2015. It is difficult for a researcher to read all of them, and trends are hard to grasp from reading only a few of the resulting web pages in full.
On the other hand, when the search keywords “convenience store” and “sweets” are input to the keyword extraction device 1, the keyword extraction device 1 can generate a keyword list relating to the convenience store sweets that Internet users are talking about. Thus, in the present embodiment, a trend researcher can also learn the trend by viewing the generated keyword list. In addition, by having the keyword extracting device 1 generate the keyword list periodically (for example, once a month), the trend researcher can also learn how the keywords change, that is, how the trend changes.

In the present embodiment, an example has been described in which the search unit 12 selects a predetermined number of web pages from the top of the search results, but the present invention is not limited to this. For example, if the number of web pages remaining after the first noise removal unit 14 removes unnecessary domains from the predetermined number of selected web pages is less than a desired number (for example, 10 or more), the search unit 12 may select as many web pages as are needed for the number remaining after removal by the first noise removal unit 14 to reach the desired number or more.
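One way to picture this variation is the sketch below. It assumes that the search results are available as a ranked list of URLs and that unnecessary domains are given as a set of host names; `select_pages` and its parameters are hypothetical names, and the defaults of 20 and 10 merely echo the example figures in the text.

```python
from urllib.parse import urlparse


def select_pages(ranked_urls, blocked_domains, initial_count=20, desired_count=10):
    """Take the top `initial_count` results, drop blocked domains, and keep
    reading further down the ranking until `desired_count` pages remain."""
    kept = []
    for position, url in enumerate(ranked_urls):
        # Past the initial window, stop as soon as enough pages survive.
        if position >= initial_count and len(kept) >= desired_count:
            break
        host = urlparse(url).hostname or ""
        if host in blocked_domains:
            continue  # page from an unnecessary domain: skip it
        kept.append(url)
    return kept
```

If enough pages survive within the initial window, nothing beyond it is read; otherwise the selection is extended one result at a time.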

In the present embodiment, an example has been described in which the main content extraction unit 15 calculates the Levenshtein distance as the measure of closeness between the web page itself and a link destination, but the present invention is not limited to this. The main content extraction unit 15 may instead calculate the closeness between the web page and the link destination using, for example, a 3-gram (n-gram) method.
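The two measures named here can be sketched as follows. The specification does not state how the 3-gram comparison is scored, so the Jaccard overlap of character 3-gram sets used below is an assumption, as are the function names.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def trigram_similarity(a, b):
    """Jaccard overlap of the character 3-gram sets of a and b (0.0 to 1.0).
    Assumed scoring; the specification only names the 3-gram method."""
    grams_a = {a[i:i + 3] for i in range(len(a) - 2)}
    grams_b = {b[i:i + 3] for i in range(len(b) - 2)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A small Levenshtein distance, or a 3-gram similarity near 1.0, would both indicate that a block of the page closely resembles the link destination.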

As described above, the keyword extraction device 1 of the present embodiment includes: a search unit 12 that searches for a plurality of contents (for example, 20) including main content on the basis of a search keyword; a first noise removal unit 14 that removes, from the plurality of contents retrieved by the search unit, the content of predetermined domains that is not meaningful for keyword extraction; a main content extraction unit 15 that sequentially selects one content from the plurality of contents remaining after the first noise removal unit has removed the content of the predetermined domains, extracts information indicating link destinations from the selected content, compares the information of the extracted link destinations with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content; and a keyword extraction unit 18 that extracts a plurality of keywords from the text of the main content extracted by the main content extraction unit.
Note that a predetermined domain is a domain whose content is not meaningful for keyword extraction.

With this configuration, the present embodiment first removes the web pages of the predetermined domains, so that unnecessary web pages that are not meaningful as content for keyword extraction are removed from the search results. After the unnecessary web pages have been removed, the descriptions of unnecessary areas (for example, headers, footers, and menus) are removed from each of the web pages retrieved with the search keyword, on the basis of their similarity to the source code of the link destinations within the page itself. As a result, in the present embodiment, the main content can be extracted accurately, and only from the web pages that remain after the unnecessary web pages have been deleted. Consequently, according to the present embodiment, keywords that correspond to the search keyword, that is, to the information the user wants to know, can be extracted.

[Modification of First Embodiment]
Next, a modification of the first embodiment will be described.
An example in which the keyword extraction device 1 further includes a tag DB 16 and a second noise removal unit 17 will be described.

<Configuration of Keyword Extraction Device 1A>
FIG. 12 is a schematic configuration diagram of a keyword extraction device 1A according to the modification of the present embodiment. Functional units having the same functions as those of the keyword extraction device 1 are denoted by the same reference numerals, and their description is omitted.
As shown in FIG. 12, the keyword extraction device 1A includes a keyword input unit 11, a search unit 12, a domain DB 13, a first noise removal unit 14, a main content extraction unit 15, a tag DB 16, a second noise removal unit 17, a keyword extraction unit 18, and a keyword list output unit 19.

The main content extraction unit 15 sequentially selects the information of one web page from the web page information output by the first noise removal unit 14, and extracts the main content from the information of the selected web page. The main content extraction unit 15 outputs the description of the extracted main content to the second noise removal unit 17 for each web page.

The tag DB 16 stores tags used to delete, from the main content of a web page, items that contain unnecessary terms (hereinafter referred to as meaningless words). The tags stored in the tag DB 16 may be updated via the network 2.

The second noise removal unit 17 refers to the tag DB 16 and removes meaningless words from the main content output by the main content extraction unit 15. In this way, the second noise removal unit 17 removes the noise of meaningless words from the main content of each web page. The second noise removal unit 17 outputs the main content from which the meaningless words have been removed to the keyword extraction unit 18. The method of removing meaningless words will be described later. In the present embodiment, an example in which meaningless words and the like are removed using tags is described, but the present invention is not limited to this, and meaningless words and the like may be removed using other methods. In that case, the tag DB 16 may store the information used for removing meaningless words and the like.

The keyword extraction unit 18 extracts a plurality of keywords from the main content output by the second noise removal unit 17. The keyword extraction unit 18 sorts the extracted keywords as described later, and outputs the sorted keyword list to the keyword list output unit 19. The keyword extraction method and the sort processing will be described later.

Next, an example of the information stored in the tag DB 16 will be described.
FIG. 13 is a diagram illustrating an example of the information stored in the tag DB 16 according to the modification of the present embodiment. As shown in FIG. 13, the tag DB 16 stores at least one tag. For example, “<h1> to </h1>” and the like used within <class> are tags representing headings. “<div class="usrInfo"> to </div>” is a tag representing identifier information for identifying a user on a web page such as a bulletin board. Tags representing such unnecessary items are set in advance by the designer of the keyword extraction device 1A.

<Processing Procedure of Keyword Extraction Device 1A>
Next, the processing procedure of the keyword extraction device 1A will be described. FIG. 14 is a flowchart of the processing of the keyword extraction device 1A according to the modification of the present embodiment. Processes that are the same as those of the keyword extraction device 1 are denoted by the same reference numerals, and their description is omitted.

(Steps S1 to S5) The keyword extraction device 1A performs the processing of steps S1 to S5 and proceeds to step S6.
(Step S6) The second noise removal unit 17 refers to the tag DB 16 and removes meaningless words from the main content output by the main content extraction unit 15.
(Step S7) The keyword extraction unit 18 extracts a plurality of keywords from the main content output by the second noise removal unit 17.

<Meaningless Word Removal Process>
Next, the meaningless word removal process performed by the second noise removal unit 17 in step S6 of FIG. 14 will be described.
FIG. 15 is a flowchart of the meaningless word removal process according to the modification of the present embodiment.

(Step S301) The second noise removal unit 17 sequentially selects the main content of one unprocessed web page from the main content of each web page output by the main content extraction unit 15, and performs the processing of steps S302 to S303.
(Step S302) The second noise removal unit 17 refers to the tag DB 16 and removes the meaningless words by removing, from the source code of the main content of the web page selected in step S301, the tag descriptions that correspond to meaningless words. Here, a tag description is the description enclosed from a start tag to its end tag.

(Step S303) The second noise removal unit 17 removes the remaining tag information from the description of the main content that has undergone the removal in step S302. The remaining tag information is, for example, tags that specify character size, specify character color, or indicate a line break.

(Step S304) When the processing of steps S302 to S303 has been completed for the main content of every web page output by the main content extraction unit 15, the second noise removal unit 17 outputs the text information of the main content from which the meaningless words have been removed to the keyword extraction unit 18 for each web page.
This completes the main content extraction process.
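Steps S302 and S303 can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the tag DB is modeled as a list of (tag name, required attributes) pairs, the two entries mirror the examples given for FIG. 13, and the names `TAG_DB`, `NoiseStripper`, and `remove_meaningless` are hypothetical. The sketch also assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical tag DB entries: (tag name, attributes that must match).
TAG_DB = [("h1", {}), ("div", {"class": "usrInfo"})]

# Elements without an end tag; they never open a removed region.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class NoiseStripper(HTMLParser):
    def __init__(self, tag_db):
        super().__init__(convert_charrefs=True)
        self.tag_db = tag_db
        self.depth = 0   # nesting depth inside an element being removed
        self.text = []   # text kept so far

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        return any(tag == name and
                   all(attrs.get(k) == v for k, v in required.items())
                   for name, required in self.tag_db)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # S302: entering (or nesting inside) an element listed in the tag DB.
        if self.depth or self._matches(tag, attrs):
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        # S303: all remaining tags are dropped; only text outside
        # removed elements is kept.
        if self.depth == 0:
            self.text.append(data)


def remove_meaningless(html, tag_db=TAG_DB):
    parser = NoiseStripper(tag_db)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.text).split())
```

Calling `remove_meaningless` once per web page corresponds to the loop of steps S301 and S304; the returned plain text is what would be handed to keyword extraction.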

<Examples of Items Removed as Meaningless Words>
Here, examples of items that are removed as meaningless words will be described.
First, a question-and-answer web page, on which answers can be posted to questions, will be described as an example. In addition to the questions and answers, such a web page contains, for example, the number of views, the number of answers, thanks for an answer, identification information of the answerer, the date and time of the answer, and information indicating who gave the best answer.
The content that is useful for extracting keywords is, for example, the text of the question and the text of the answers. Therefore, on a question-and-answer web page, the number of views, the number of answers, thanks for an answer, identification information of the answerer, the date and time of the answer, information indicating who gave the best answer, and the like are meaningless words. These items are described by tags with predetermined class names, by item tags (for example, <h2> to </h2>) within tags of a predetermined class, and so on.

As another example, a dictionary page on the web contains an outline, body text, a table of contents, explanations of the type of term, notes indicating that the content is insufficient, references, items related to the reference list, and the like.
The content that is useful for extracting keywords is, for example, the text of the outline and the body text. Therefore, on a dictionary page, the table of contents, explanations of the type of term, notes indicating that the content is insufficient, references, items related to the reference list, and the like are meaningless words.

The tags in which the items corresponding to these meaningless words are described are extracted in advance, for example by the manufacturer of the keyword extraction device 1A analyzing source code, and the extracted tags are stored in the tag DB 16.

As described above, the keyword extraction device 1A of the present embodiment further includes a second noise removal unit 17 that removes unnecessary descriptions that are not meaningful for keyword extraction by removing information described by predetermined tags from the information of the main content extracted by the main content extraction unit 15, and the keyword extraction unit 18 extracts keywords from the text of the main content after the second noise removal unit has removed the information described by the predetermined tags.
Note that the information described by the predetermined tags consists of unnecessary terms that are not meaningful for keyword extraction.

With this configuration, in the present embodiment, meaningless words can be removed by removing the information described by the predetermined tags from the main content. As a result, keywords can be extracted accurately from the main content after the meaningless words have been removed.
By contrast, if words assumed to be unnecessary are registered in a database and then removed from the main content, useful words may also be removed from the content. Removing the information described by tags, as in the present embodiment, allows meaningless words to be removed accurately.

[Second Embodiment]
In the first embodiment, an example was described in which the keyword extraction device 1 (or 1A) includes the first noise removal unit 14 and the main content extraction unit 15. In the second embodiment, an example will be described in which the device does not include the first noise removal unit 14 or the main content extraction unit 15 but does include a second noise removal unit. In the present embodiment, the noise removed by the second noise removal unit consists of meaningless words and unnecessary descriptions identified on the basis of tags.

<Configuration of Keyword Extraction Device 1B>
FIG. 16 is a schematic configuration diagram of a keyword extraction device 1B according to the present embodiment.
As shown in FIG. 16, the keyword extraction device 1B includes a keyword input unit 11, a search unit 12B, a tag DB 16B, a second noise removal unit 17B, a keyword extraction unit 18, and a keyword list output unit 19. The keyword extraction device 1B is connected to the network 2. Functional units having the same functions as those of the keyword extraction device 1 or 1A are denoted by the same reference numerals, and their description is omitted.

The keyword input unit 11 is, for example, a keyboard, a mouse, or a tablet. The keyword input unit 11 outputs the search keyword entered by the user to the search unit 12B.

The search unit 12B uses a search engine to search for web pages that match the search keyword output by the keyword input unit 11, and selects, for example, a predetermined number of web pages from the top of the web pages obtained by the search. The search unit 12B outputs information indicating the selected predetermined number of web pages to the second noise removal unit 17B.

The tag DB 16B stores tags used to delete items of unnecessary terms (meaningless words) from the main content of a web page, and tags used to delete unnecessary descriptions. FIG. 17 is a diagram illustrating an example of the information stored in the tag DB 16B according to the present embodiment. As shown in FIG. 17, the tag DB 16B stores at least one tag. For example, “<h1> to </h1>” and the like used within <class> are tags representing headings. “<div class="usrInfo"> to </div>” is a tag representing identifier information for identifying a user on a web page such as a bulletin board.

The “<meta>” tag represents meta information about a document (web page), and the “<div id="header"> to </div>” tag represents header information. The “<script type> to </script>” tag contains a script description, and the “<link href="">” tag and the “<a href="">” tag indicate link destinations. These are unnecessary descriptions that are not meaningful as content on a web page. Other tags used to delete unnecessary descriptions include tags representing advertisements, tags representing buttons, tags representing footers, and tags representing notices. Tags representing such unnecessary items and descriptions are set in advance by the designer of the keyword extraction device 1B. The tags stored in the tag DB 16B may be updated via the network 2.

The second noise removal unit 17B refers to the tag DB 16B and removes meaningless words and unnecessary descriptions, web page by web page, from the information (source code) of the predetermined number of web pages output by the search unit 12B. The second noise removal unit 17B outputs the content from which the meaningless words have been removed to the keyword extraction unit 18.

The keyword extraction unit 18 extracts a plurality of keywords from the main content output by the second noise removal unit 17B. The keyword extraction unit 18 sorts the extracted keywords as described later, and outputs the sorted keyword list to the keyword list output unit 19.

<Processing Procedure of Keyword Extraction Device 1B>
Next, the processing procedure of the keyword extraction device 1B will be described. FIG. 18 is a flowchart of the processing of the keyword extraction device 1B according to the present embodiment. Processes that are the same as those of the keyword extraction device 1 (or 1A) are denoted by the same reference numerals, and their description is omitted.

(Steps S1 to S3) The keyword extraction device 1B performs the processing of steps S1 to S3 and proceeds to step S15.

(Step S15) The second noise removal unit 17B refers to the tag DB 16B and removes meaningless words and unnecessary descriptions, web page by web page, from the information of the predetermined number of web pages output by the search unit 12B. The second noise removal unit 17B then advances the processing to step S7.

In the example described above, the second noise removal unit 17B removes both meaningless words and unnecessary descriptions from the source code of the web page, but the present invention is not limited to this. The second noise removal unit 17B may remove at least one of the tags of items corresponding to meaningless words and the tags corresponding to unnecessary descriptions.

As described above, in the present embodiment, one web page is sequentially selected from the predetermined number of top-ranked web pages retrieved by searching for the search keyword with a search engine. The content needed for keyword extraction is then extracted by removing, from the source code of the selected web page, the tags describing items that correspond to meaningless words and the tags of unnecessary descriptions. In this way, content from which the noise of meaningless words and unnecessary descriptions has been removed can be obtained from the web pages retrieved with the search keyword. Because nouns and compound words are extracted from text from which the noise components have been removed, keywords can be extracted accurately.

As described above, the keyword extraction device 1B of the present embodiment includes: a search unit 12B that searches for a plurality of contents (a predetermined number, for example, 20; for example, web pages) including main content on the basis of a search keyword; a second noise removal unit 17B that sequentially selects one content from the plurality of contents retrieved by the search unit and removes unnecessary descriptions that are not meaningful for keyword extraction by removing information described by predetermined tags from the selected content; and a keyword extraction unit 18 that extracts keywords from the text of the content from which the second noise removal unit has removed the information described by the predetermined tags.

With this configuration, in the present embodiment, at least one of meaningless words and unnecessary descriptions can be removed by removing the information described by the predetermined tags from the main content. As a result, keywords can be extracted accurately from the content after the meaningless words or unnecessary descriptions have been removed.
By contrast, if words assumed to be unnecessary are registered in a database and then removed from the content, useful words may also be removed from the content. Removing the information described by tags, as in the present embodiment, allows meaningless words to be removed accurately.

[First Modification of Second Embodiment]
Next, an example in which the keyword extraction device 1B further includes a domain DB 13 and a first noise removal unit 14 will be described.

<Configuration of Keyword Extraction Device 1C>
FIG. 19 is a schematic configuration diagram of a keyword extraction device 1C according to the first modification of the present embodiment. Functional units having the same functions as those of the keyword extraction device 1, 1A, or 1B are denoted by the same reference numerals, and their description is omitted.
As shown in FIG. 19, the keyword extraction device 1C includes a keyword input unit 11, a search unit 12, a domain DB 13, a first noise removal unit 14, a tag DB 16B, a second noise removal unit 17B, a keyword extraction unit 18, and a keyword list output unit 19. The keyword extraction device 1C may also include a main content extraction unit 15 (see FIG. 2), for example between the second noise removal unit 17B and the keyword extraction unit 18.

The first noise removal unit 14 acquires the information indicating the predetermined number of web pages output by the search unit 12. Using this information, the first noise removal unit 14 removes the web pages of the domains stored in the domain DB 13 from the predetermined number of web pages, and outputs information indicating the remaining web pages to the second noise removal unit 17B. If none of the predetermined number of web pages belongs to a domain stored in the domain DB 13, the first noise removal unit 14 outputs the information indicating the predetermined number of web pages to the second noise removal unit 17B as is.
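The domain-based removal described here can be sketched as a simple filter. The sketch assumes that the web pages are identified by URL and that the domain DB is a set of domain names; treating subdomains of a stored domain as matches is an additional assumption, and `remove_blocked_domains` is a hypothetical name.

```python
from urllib.parse import urlparse


def remove_blocked_domains(urls, domain_db):
    """Return the URLs whose host is neither a stored domain
    nor a subdomain of one (subdomain matching is an assumption)."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        blocked = any(host == d or host.endswith("." + d) for d in domain_db)
        if not blocked:
            kept.append(url)
    return kept
```

When no URL matches a stored domain, the input list is returned unchanged, which corresponds to passing the predetermined number of web pages on as is.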

The second noise removal unit 17B refers to the tag DB 16B and deletes meaningless words and unnecessary descriptions from the information (source code) of the web pages output by the first noise removal unit 14. The second noise removal unit 17B outputs the content from which the meaningless words and unnecessary descriptions have been removed to the keyword extraction unit 18.

<Processing Procedure of Keyword Extraction Device 1C>
Next, the processing procedure of the keyword extraction device 1C will be described. FIG. 20 is a flowchart of the processing of the keyword extraction device 1C according to the first modification of the present embodiment. Processes that are the same as those of the keyword extraction device 1, 1A, or 1B are denoted by the same reference numerals, and their description is omitted.

(Steps S1 to S3) The keyword extraction device 1C performs the processing of steps S1 to S3 and proceeds to step S4.
(Step S4) Using the information on the predetermined number of web pages output by the search unit 12, the first noise removal unit 14 removes the web pages of the domains stored in the domain DB 13 from the predetermined number of web pages.
(Step S15) The second noise removal unit 17B refers to the tag DB 16B and deletes meaningless words and unnecessary descriptions from the information (source code) of the web pages output by the first noise removal unit 14.

As described above, the keyword extraction device 1C of the present embodiment further includes a first noise removal unit 14 between the search unit 12 and the keyword extraction unit 18, and the first noise removal unit removes, from the plurality of contents (for example, web pages) retrieved by the search unit, the content of predetermined domains that is not meaningful for keyword extraction.

この構成によって、本実施形態では、所定のドメインのウェブページを除去することで、キーワードを抽出する上で、コンテンツとして意味をなしていない不要なウェブページを検索結果から除去することができる。この結果、本実施形態では、不要なウェブページを削除したウェブページからのみメインコンテンツを精度良く抽出できる。 With this configuration, in the present embodiment, by removing a web page of a predetermined domain, unnecessary web pages that do not make sense as content when extracting a keyword can be removed from a search result. As a result, in the present embodiment, the main content can be accurately extracted only from the web pages from which unnecessary web pages have been deleted.
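The domain-based removal performed by the first noise removing unit 14 can be sketched as follows. This is only an illustrative sketch: the set `EXCLUDED_DOMAINS` (standing in for the domain DB 13), the function name `remove_excluded_domains`, the example URLs, and the rule that subdomains of a stored domain are also removed are all assumptions that do not appear in the description.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the domain DB 13: domains treated as noise.
EXCLUDED_DOMAINS = {"example-dictionary.test", "example-shop.test"}


def remove_excluded_domains(urls, excluded=EXCLUDED_DOMAINS):
    """Drop search results whose host is an excluded domain or one of its subdomains."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in excluded):
            continue  # page belongs to a stored domain, so it is removed
        kept.append(url)
    return kept


results = [
    "https://www.example-dictionary.test/word/seo",
    "https://blog.example.test/seo-basics",
    "https://example-shop.test/item/1",
]
filtered = remove_excluded_domains(results)
```

Only the pages that survive this filter would be passed on to the second noise removing unit 17B.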

［第２の実施形態の第２変形例］
次に、キーワード抽出装置１Ｂの変形例を説明する。本変形例では、検索部が、キーワード入力部１１が出力した検索キーワードを予め定められているドメインのウェブページを、検索エンジンを用いて検索する。 [Second Modification of Second Embodiment]
Next, a modified example of the keyword extracting device 1B will be described. In the present modification, the search unit uses a search engine to search the web pages of predetermined domains for the search keyword output by the keyword input unit 11.

＜キーワード抽出装置１Ｄの構成＞
図２１は、本実施形態の第２変形例に係るキーワード抽出装置１Ｄの概略構成図である。
図２１に示すように、キーワード抽出装置１Ｄは、キーワード入力部１１Ｄ、検索部１２Ｄ、ドメインＤＢ１３Ｄ、タグＤＢ１６Ｂ、第２ノイズ除去部１７Ｂ、キーワード抽出部１８Ｄ、およびキーワードリスト出力部１９を備える。なお、キーワード抽出装置１、１Ａ、１Ｂ、または１Ｃと同じ機能を有する機能部については、同じ符号を用いて、説明を省略する。なお、キーワード抽出装置１Ｄは、例えば検索部１２Ｄと第２ノイズ除去部１７Ｂとの間に第１ノイズ除去部１４を備えていてもよく、例えば第１ノイズ除去部１４とキーワード抽出部１８Ｄの間にメインコンテンツ抽出部１５を備えていてもよい。 <Configuration of Keyword Extraction Device 1D>
FIG. 21 is a schematic configuration diagram of a keyword extraction device 1D according to a second modification of the present embodiment.
As shown in FIG. 21, the keyword extraction device 1D includes a keyword input unit 11D, a search unit 12D, a domain DB 13D, a tag DB 16B, a second noise removal unit 17B, a keyword extraction unit 18D, and a keyword list output unit 19. Note that functional units having the same functions as those of the keyword extracting device 1, 1A, 1B, or 1C are denoted by the same reference numerals, and description thereof is omitted. Note that the keyword extracting device 1D may include, for example, the first noise removing unit 14 between the search unit 12D and the second noise removing unit 17B, and may include, for example, the main content extraction unit 15 between the first noise removing unit 14 and the keyword extracting unit 18D.

キーワード入力部１１Ｄは、利用者が入力した検索キーワード、探索対象のドメインを取得し、取得した検索キーワード、探索対象のドメインを示す情報を検索部１２Ｄに出力する。なお、探索対象のドメインについては、第８の抽出方法で説明する。 The keyword input unit 11D acquires a search keyword and a search target domain input by the user, and outputs the acquired search keyword and information indicating the search target domain to the search unit 12D. The search target domain will be described in an eighth extraction method.

ドメインＤＢ１３Ｄには、キーワードを抽出するウェッブサイトのドメインが格納されている。なお、キーワードを抽出するウェッブサイトのドメインとは、例えば利用者が質問を公開し、回答を募って疑問を解消する仕組みを提供するウェブサイトのドメインである。以下の説明では、このようなサイトをＱ＆Ａサイトという。Ｑ＆Ａサイトは、一例として、Ｙａｈｏｏ！（登録商標）知恵袋（登録商標）、教えて！ｇｏｏ（登録商標）、発言小町（登録商標）、ＯＫＷａｖｅ（登録商標）等である。 The domain DB 13D stores the domains of the websites from which keywords are extracted. Such a domain is, for example, the domain of a website that provides a mechanism by which a user posts a question, solicits answers, and resolves the question. In the following description, such a site is referred to as a Q & A site. Examples of Q & A sites include Yahoo! (registered trademark) Chiebukuro (registered trademark), Oshiete! goo (registered trademark), Hatsugen Komachi (registered trademark), and OKWave (registered trademark).

検索部１２Ｄは、キーワード入力部１１Ｄが出力した検索キーワードを、ドメインＤＢ１３Ｄに格納されているドメインに対して検索エンジンを用いて検索する。検索部１２Ｄは、検索によって得られたウェブページのうち、例えば上位から所定の個数のウェブページを選択し、選択した所定の個数のウェブページを示す情報を第２ノイズ除去部１７Ｂに出力する。 The search unit 12D searches for a search keyword output by the keyword input unit 11D for a domain stored in the domain DB 13D using a search engine. The search unit 12D selects, for example, a predetermined number of web pages from the top among the web pages obtained by the search, and outputs information indicating the selected predetermined number of web pages to the second noise removal unit 17B.

キーワード抽出部１８Ｄは、複数の検索結果を比較する重要キーワードの抽出方法が選択された場合、抽出された重要キーワードを比較し、比較した結果に基づいてキーワードリストを生成する。キーワード抽出部１８Ｄは、生成したキーワードリストをキーワードリスト出力部１９に出力する。 When an important keyword extraction method for comparing a plurality of search results is selected, the keyword extraction unit 18D compares the extracted important keywords and generates a keyword list based on the comparison result. The keyword extracting unit 18D outputs the generated keyword list to the keyword list output unit 19.

次に、ドメインＤＢ１３Ｄに格納されているドメインの一例を説明する。
図２２は、本実施形態の第２変形例に係るドメインＤＢ１３Ｄに格納されているドメインの一例を示す図である。図２２に示すように、ドメインＤＢ１３Ｄには、Ｑ＆Ａサイト名と、Ｑ＆Ａサイトのドメインの情報とが対応付けられて格納されている。 Next, an example of a domain stored in the domain DB 13D will be described.
FIG. 22 is a diagram illustrating an example of domains stored in the domain DB 13D according to the second modification of the present embodiment. As shown in FIG. 22, the domain DB 13D stores a Q & A site name and information on the domain of the Q & A site in association with each other.

＜キーワード抽出装置１Ｄによる操作手順の例、操作画面の例＞
次に、キーワード抽出装置１Ｄによる操作手順の例、操作画面の例を説明する。
図２３は、本実施形態の第２変形例に係るキーワード抽出装置１Ｄによる操作画面の例を示す図である。
図２３に示す例において、符号ｇ４００が示す領域の画像は、重要キーワードの抽出方法を選択する領域の画像である。キーワードの抽出方法を選択する領域の画像ｇ４００には、抽出方法を選択する「抽出ツール」ボタンの画像ｇ４０１、第１の抽出方法を選択する「共起語の抽出」ボタンの画像ｇ４０２、第２の抽出方法を選択する「共起語の抽出Ｑ＆Ａサイト１」ボタンの画像ｇ４０３が含まれている。さらに、画像ｇ４００には、第３の抽出方法を選択する「共起語の抽出Ｑ＆Ａサイト２」ボタンの画像ｇ４０４、第４の抽出方法を選択する「共起語の抽出Ｑ＆Ａサイト３」ボタンの画像ｇ４０５、第５の抽出方法を選択する「共起語の抽出Ｑ＆Ａサイト４」ボタンの画像ｇ４０６が含まれている。さらに、画像ｇ４００には、第６の抽出方法を選択する「共起語の抽出（総合）」ボタンの画像ｇ４０７、第７の抽出方法を選択する「共起語の抽出（比較）」ボタンの画像ｇ４０８、第８の抽出方法を選択する「共起語の抽出（サイト内探索）」ボタンの画像ｇ４０９、第９の抽出方法を選択する「ページ内の過不足キーワード」ボタンの画像ｇ４１０が含まれている。なお、共起語とは、ある単語が文章中で使用される場合に、その文章中で高い頻度で使用されるある単語とは別の単語であり、本発明における抽出されるキーワードである。なお、第１の抽出方法〜第９の抽出方法の処理については、後述する。 <Example of operation procedure and example of operation screen by keyword extraction device 1D>
Next, an example of an operation procedure and an example of an operation screen by the keyword extraction device 1D will be described.
FIG. 23 is a diagram illustrating an example of an operation screen by the keyword extraction device 1D according to the second modification of the present embodiment.
In the example shown in FIG. 23, the image of the area indicated by reference sign g400 is an image of an area for selecting an important keyword extraction method. The image g400 of the area for selecting a keyword extraction method includes an image g401 of an "extraction tool" button for selecting an extraction method, an image g402 of a "co-occurrence word extraction" button for selecting the first extraction method, and an image g403 of a "co-occurrence word extraction Q & A site 1" button for selecting the second extraction method. The image g400 further includes an image g404 of a "co-occurrence word extraction Q & A site 2" button for selecting the third extraction method, an image g405 of a "co-occurrence word extraction Q & A site 3" button for selecting the fourth extraction method, and an image g406 of a "co-occurrence word extraction Q & A site 4" button for selecting the fifth extraction method. The image g400 further includes an image g407 of a "co-occurrence word extraction (overall)" button for selecting the sixth extraction method, an image g408 of a "co-occurrence word extraction (comparison)" button for selecting the seventh extraction method, an image g409 of a "co-occurrence word extraction (in-site search)" button for selecting the eighth extraction method, and an image g410 of an "excess or insufficient keywords in page" button for selecting the ninth extraction method. Note that a co-occurrence word is a word, other than a given word, that is used with high frequency in a text in which the given word is used, and is the keyword extracted in the present invention. The processing of the first to ninth extraction methods will be described later.

図２３において、符号ｇ４２０が示す領域の画像は、検索キーワードの入力領域の画像である。なお、図２３は、第２の抽出方法が選択された場合の例を示している。検索キーワードの入力領域の画像ｇ４２０には、検索キーワードの入力スペースの画像ｇ４２１、検索ボタンの画像ｇ４２２、ウェブページの所定の個数の選択する画像ｇ４２３が含まれている。
また、図２３において、符号ｇ４３０が示す領域の画像は、抽出された結果を示す画像である。なお、抽出された結果を示す画像ｇ４３０は、抽出された結果の一部の画像であり、スクロールボタン（画像ｇ４３１）を用いて、利用者が検索結果をスクロールすることで残りの検索結果が表示される。 In FIG. 23, the image of the area indicated by reference sign g420 is the image of the input area of the search keyword. FIG. 23 shows an example in which the second extraction method is selected. The image g420 of the search keyword input area includes an image g421 of the search keyword input space, an image g422 of the search button, and an image g423 for selecting the predetermined number of web pages.
Further, in FIG. 23, the image of the area indicated by reference sign g430 is an image showing the extracted result. The image g430 shows only part of the extracted result; the remaining search results are displayed when the user scrolls the search results using the scroll button (image g431).

ここで、第１の抽出方法〜第９の抽出方法の概略について説明する。
第１の抽出方法では、検索キーワードを検索エンジンに入力して検索を行い、検索結果の上位から所定の個数の検索結果のサイトを選択する。そして、第１の抽出方法では、選択されたサイトから重要キーワード（共起語）を抽出する。
第２の抽出方法では、Ｑ＆Ａサイト１に対して検索キーワードの検索行って重要キーワードを抽出する。
第３の抽出方法では、Ｑ＆Ａサイト２に対して検索キーワードの検索行って重要キーワードを抽出する。
第４の抽出方法では、Ｑ＆Ａサイト３に対して検索キーワードの検索行って重要キーワードを抽出する。
第５の抽出方法では、Ｑ＆Ａサイト４に対して検索キーワードの検索行って重要キーワードを抽出する。 Here, an outline of the first to ninth extraction methods will be described.
In the first extraction method, a search keyword is input to a search engine to perform a search, and a predetermined number of search result sites are selected from the top of the search results. Then, in the first extraction method, important keywords (co-occurring words) are extracted from the selected site.
In the second extraction method, the Q & A site 1 is searched for the search keyword, and important keywords are extracted.
In the third extraction method, the Q & A site 2 is searched for the search keyword, and important keywords are extracted.
In the fourth extraction method, the Q & A site 3 is searched for the search keyword, and important keywords are extracted.
In the fifth extraction method, the Q & A site 4 is searched for the search keyword, and important keywords are extracted.

第６の抽出方法では、Ｑ＆Ａサイト１〜Ｑ＆Ａサイト４全てに対して検索キーワードの検索行って重要キーワードを抽出する。
第７の抽出方法では、Ｑ＆Ａサイト１〜Ｑ＆Ａサイト４全てに対して検索キーワードの検索行って重要キーワードを抽出し、さらに第１の抽出方法で重要キーワードを抽出する。そして、第７の抽出方法では、Ｑ＆Ａサイト１〜Ｑ＆Ａサイト４全てを検索して抽出した重要キーワードと、第１の抽出方法で抽出した重要キーワードとを比較する。
第８の抽出方法では、第１の抽出方法で重要キーワードを抽出し、抽出した重要キーワードが評価するサイト（以下、評価サイトという）に含まれているか否か、含まれている場合は重要キーワードの使用頻度に基づいて評価を行う。
第９の抽出方法では、評価サイトからキーワードを抽出し、さらに第１の抽出方法で重要キーワードを抽出する。そして、第９の抽出方法では、評価するサイトに不足している重要キーワード、過剰なキーワードを抽出して評価する。 In the sixth extraction method, search keywords are searched for all the Q & A sites 1 to 4 to extract important keywords.
In the seventh extraction method, search keywords are searched for all of the Q & A sites 1 to 4 to extract important keywords, and the first extraction method further extracts important keywords. Then, in the seventh extraction method, the important keywords extracted by searching all the Q & A sites 1 to 4 are compared with the important keywords extracted by the first extraction method.
In the eighth extraction method, important keywords are extracted by the first extraction method, and the site to be evaluated (hereinafter referred to as the evaluation site) is evaluated based on whether or not the extracted important keywords are included in it and, if they are included, on how frequently they are used.
In the ninth extraction method, keywords are extracted from the evaluation site, and important keywords are further extracted by the first extraction method. Then, in the ninth extraction method, the important keywords that the evaluation site lacks and the keywords that it uses in excess are extracted and evaluated.

まず、第１の抽出方法の処理について説明する。
ドメインＤＢ１３Ｄには、少なくとも、Ｑ＆Ａサイト１〜Ｑ＆Ａサイト４に対応付けられたドメイン１１〜ドメイン１４、検索エンジンのアドレス（ドメイン）が格納されているとする。
第１の抽出方法が選択された場合、検索部１２Ｄは、ドメインＤＢ１３Ｄに格納されている検索エンジンのアドレスの検索エンジンを用いて、入力された検索キーワードを検索する。続けて、検索部１２Ｄは、検索した結果から不用なドメインを除去した後、例えば上位２０個のウェブページを選択する。続けて、第２ノイズ除去部１７Ｂは、無意味言葉等を示すタグを除去する。続けて、キーワード抽出部１８Ｄは、無意味言葉等が除去された上位２０個のウェブページの情報から、キーワードを抽出し、抽出したキーワードの出現回数、重要度等を算出する。続けて、キーワードリスト出力部１９は、抽出したキーワードを例えば図１０のようなリスト形式で出力する。 First, the processing of the first extraction method will be described.
It is assumed that the domain DB 13D stores at least the domains 11 to 14 associated with the Q & A sites 1 to 4 and the address (domain) of the search engine.
When the first extraction method is selected, the search unit 12D searches for the input search keyword using the search engine whose address is stored in the domain DB 13D. Subsequently, after removing unnecessary domains from the search result, the search unit 12D selects, for example, the top 20 web pages. Subsequently, the second noise removing unit 17B removes tags indicating meaningless words and the like. Subsequently, the keyword extracting unit 18D extracts keywords from the information of the top 20 web pages from which the meaningless words and the like have been removed, and calculates the number of appearances, the importance, and the like of the extracted keywords. Subsequently, the keyword list output unit 19 outputs the extracted keywords in a list format such as that shown in FIG. 10.
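The counting and scoring step of the first extraction method can be sketched as follows. The sketch assumes that each web page has already been reduced to a list of word tokens (for Japanese text this would require a morphological analyzer, which is outside this sketch), and that the importance of each keyword is available as a numeric weight. The function name `score_keywords`, the default weight of 1.0, and the sample data are hypothetical.

```python
from collections import Counter


def score_keywords(pages, importance, search_keyword):
    """Return (keyword, appearances, appearances * importance) rows, highest score first.

    pages: one list of tokens per web page, after noise removal.
    importance: assumed per-keyword weight; keywords without an entry get 1.0.
    """
    counts = Counter()
    for tokens in pages:
        # The search keyword itself is not a co-occurrence word, so it is skipped.
        counts.update(t for t in tokens if t != search_keyword)
    rows = [(kw, n, n * importance.get(kw, 1.0)) for kw, n in counts.items()]
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


pages = [
    ["seo", "keyword", "content", "keyword"],
    ["seo", "content", "ranking"],
]
importance = {"keyword": 2.0, "content": 1.5}
rows = score_keywords(pages, importance, "seo")
```

Each row corresponds to one line of a keyword list holding the number of appearances, the weighted value, and the keyword.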

次に、第２の抽出方法〜第５の抽出方法の処理について説明する。
第２の抽出方法が選択された場合、検索部１２Ｄは、ドメインＤＢ１３Ｄに格納されているＱ＆Ａサイト１に対応付けられているドメイン１１を選択し、選択したドメイン１１を用いて入力された検索キーワードを検索する。
同様に、第ｎ（ｎは３〜５）の抽出方法が選択された場合、検索部１２Ｄは、ドメインＤＢ１３Ｄに格納されているＱ＆Ａサイトｎに対応付けられているドメイン１（ｎ）を選択し、選択したドメイン１（ｎ）を用いて入力された検索キーワードを検索する。
続けて、検索部１２Ｄは、検索した結果から、例えば上位２０個のウェブページを選択する。続けて、第２ノイズ除去部１７Ｂは、無意味言葉等を示すタグを除去する。なお、第２ノイズ除去部１７Ｂが除去した後のウェブページの情報には、少なくともＱ＆Ａサイトの質問部分のテキストが含まれ、回等部分のテキストが含まれていてもよい。続けて、キーワード抽出部１８Ｄは、無意味言葉等が除去された上位２０個のウェブページの情報から、キーワードを抽出し、抽出したキーワードの出現回数、重要度等を算出する。続けて、キーワードリスト出力部１９は、抽出したキーワードを例えば図１０のようなリスト形式で出力する。すなわち、第２の抽出方法〜第５の抽出方法と第１の抽出方法との差異は、第１の抽出方法の検索対象のウェブページが限られていないが、第２の抽出方法〜第５の抽出方法の検索対象のウェブページがＱ＆Ａサイトに限られている点である。 Next, the processing of the second to fifth extraction methods will be described.
When the second extraction method is selected, the search unit 12D selects the domain 11 associated with the Q & A site 1 stored in the domain DB 13D, and searches for the input search keyword using the selected domain 11.
Similarly, when the n-th extraction method (n is 3 to 5) is selected, the search unit 12D selects the domain associated with the corresponding Q & A site stored in the domain DB 13D (the domains 12 to 14 for the Q & A sites 2 to 4, respectively), and searches for the input search keyword using the selected domain.
Subsequently, the search unit 12D selects, for example, the top 20 web pages from the search result. Subsequently, the second noise removing unit 17B removes tags indicating meaningless words and the like. The information of the web pages after the removal by the second noise removing unit 17B includes at least the text of the question part of the Q & A site, and may also include the text of the answer part. Subsequently, the keyword extracting unit 18D extracts keywords from the information of the top 20 web pages from which the meaningless words and the like have been removed, and calculates the number of appearances, the importance, and the like of the extracted keywords. Subsequently, the keyword list output unit 19 outputs the extracted keywords in a list format such as that shown in FIG. 10. That is, the second to fifth extraction methods differ from the first extraction method in that the web pages searched by the first extraction method are not restricted, whereas the web pages searched by the second to fifth extraction methods are restricted to a Q & A site.

次に、第６の抽出方法の処理について説明する。
第６の抽出方法が選択された場合、検索部１２Ｄは、ドメインＤＢ１３Ｄに格納されているＱ＆Ａサイト１〜Ｑ＆Ａサイト４に対応付けられているドメイン１１〜ドメイン１４全てを選択し、選択したドメイン１１〜ドメイン１４全てを用いて入力された検索キーワードを検索する。続けて、検索部１２Ｄは、ドメイン１１〜ドメイン１４を検索した結果それぞれから、上位から所定の個数のウェブページを選択する。続けて、第２ノイズ除去部１７Ｂは、無意味言葉等を示すタグを除去する。続けて、キーワード抽出部１８Ｄは、無意味言葉等が除去された上位のウェブページの情報から、キーワードを抽出し、抽出したキーワードの出現回数、重要度等を算出する。続けて、キーワードリスト出力部１９は、抽出したキーワードを例えば図１０のようなリスト形式で出力する。 Next, the processing of the sixth extraction method will be described.
When the sixth extraction method is selected, the search unit 12D selects all of the domains 11 to 14 associated with the Q & A sites 1 to 4 stored in the domain DB 13D, and searches for the input search keyword using all of the selected domains 11 to 14. Subsequently, the search unit 12D selects a predetermined number of web pages from the top of each of the search results for the domains 11 to 14. Subsequently, the second noise removing unit 17B removes tags indicating meaningless words and the like. Subsequently, the keyword extracting unit 18D extracts keywords from the information of the top-ranked web pages from which the meaningless words and the like have been removed, and calculates the number of appearances, the importance, and the like of the extracted keywords. Subsequently, the keyword list output unit 19 outputs the extracted keywords in a list format such as that shown in FIG. 10.

以上のように、本実施形態の第２変形例に係るキーワード抽出装置１Ｄにおいて、検索部１２Ｄは、検索キーワードに基づいて、コンテンツを検索するドメインを限定（例えば、Ｑ＆Ａサイトに限定）してコンテンツを検索し、キーワード抽出部は、限定したドメインのコンテンツのテキストから複数のキーワードを抽出し、抽出した結果に基づいてキーワードリストを生成する。 As described above, in the keyword extraction device 1D according to the second modification of the present embodiment, the search unit 12D limits the domain in which the content is searched based on the search keyword (for example, limits the search domain to the Q & A site). The keyword extracting unit extracts a plurality of keywords from the text of the content of the limited domain, and generates a keyword list based on the extracted result.

この構成によって、本実施形態によれば、Ｑ＆Ａサイトで用いられている検索キーワードに対応するキーワード（共起語）を抽出することができる。 With this configuration, according to the present embodiment, a keyword (co-occurrence word) corresponding to the search keyword used in the Q & A site can be extracted.
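The way the search unit 12D might choose its search domains for the second to sixth extraction methods can be sketched as follows. The dictionary `DOMAIN_DB` is a hypothetical stand-in for the domain DB 13D of FIG. 22, the domain names are placeholders, and the function names are assumptions introduced only for illustration.

```python
# Hypothetical stand-in for the domain DB 13D: Q&A site name -> domain.
DOMAIN_DB = {
    "Q&A site 1": "qa1.example.test",  # corresponds to domain 11
    "Q&A site 2": "qa2.example.test",  # corresponds to domain 12
    "Q&A site 3": "qa3.example.test",  # corresponds to domain 13
    "Q&A site 4": "qa4.example.test",  # corresponds to domain 14
}


def domains_for_method(method, domain_db=DOMAIN_DB):
    """Return the domains to search for extraction methods 2 to 6."""
    domains = list(domain_db.values())
    if 2 <= method <= 5:
        # Methods 2 to 5 each search a single Q&A site (sites 1 to 4).
        return [domains[method - 2]]
    if method == 6:
        # Method 6 searches all of the Q&A sites.
        return domains
    raise ValueError(f"method {method} does not restrict the search to Q&A domains")


def top_pages_per_domain(results_by_domain, count):
    """Keep the top `count` pages from each domain's search result."""
    return {domain: pages[:count] for domain, pages in results_by_domain.items()}
```

For example, `domains_for_method(3)` returns only the domain of the Q & A site 2, while `domains_for_method(6)` returns all four domains.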

次に、第７の抽出方法について説明する。
第７の抽出方法が選択された場合、検索部１２Ｄは、ドメインＤＢ１３Ｄに格納されているＱ＆Ａサイト１〜Ｑ＆Ａサイト４に対応付けられているドメイン１１〜ドメイン１４全てを選択し、選択したドメイン１１〜ドメイン１４全てを用いて入力された検索キーワードを検索する。さらに、検索部１２Ｄは、検索エンジンを用いて入力された検索キーワードを検索し、ドメイン１１〜ドメイン１４全てを検索した結果と、検索エンジンを用いて検索した結果とを第２ノイズ除去部１７Ｂに出力する。続けて、第２ノイズ除去部１７Ｂは、無意味言葉等を示すタグを除去する。続けて、キーワード抽出部１８Ｄは、無意味言葉等が除去された上位のウェブページの情報それぞれから、キーワードを抽出し、抽出したキーワードの出現回数、重要度等を算出する。
この場合、キーワード抽出部１８Ｄは、図２４に示すように、ドメイン１１〜ドメイン１４全てを検索した結果（画像ｇ３８０）と、検索エンジンを用いて検索した結果（画像ｇ３７０）とを比較し、比較した結果に基づいてキーワードリストを生成する。 Next, a seventh extraction method will be described.
When the seventh extraction method is selected, the search unit 12D selects all of the domains 11 to 14 associated with the Q & A sites 1 to 4 stored in the domain DB 13D, and searches for the input search keyword using all of the selected domains 11 to 14. Further, the search unit 12D searches for the input search keyword using the search engine, and outputs the result of searching all the domains 11 to 14 and the result of the search using the search engine to the second noise removing unit 17B. Subsequently, the second noise removing unit 17B removes tags indicating meaningless words and the like. Subsequently, the keyword extracting unit 18D extracts keywords from each set of information of the top-ranked web pages from which the meaningless words and the like have been removed, and calculates the number of appearances, the importance, and the like of the extracted keywords.
In this case, as shown in FIG. 24, the keyword extracting unit 18D compares the result of searching all the domains 11 to 14 (image g380) with the result of searching using the search engine (image g370), and generates a keyword list based on the comparison result.

図２４は、本実施形態の第２変形例に係る第７の抽出方法によるキーワード抽出装置１Ｄによるキーワードの検索結果の比較例を示す図である。
図２４に示すように、各検索結果には、出現回数の画像（ｇ３７２、ｇ３８２）、出現回数に重要度を乗算した値の画像（ｇ３７３、ｇ３８３）、抽出された重要キーワードの画像（ｇ３７４、ｇ３８４）が含まれている。
また、図２４に示すように、ドメイン１１〜ドメイン１４全てを検索した結果（画像ｇ３８０）と、検索エンジンを用いて検索した結果（画像ｇ３７０）の重要キーワードが異なる場合、異なっている重要キーワードの表示方法を変えるようにしてもよい。キーワード抽出部１８Ｄは、例えば、文字の色、文字の太さ、フォントの種類、文字に色つきマーカーを合成する等を行うようにしてもよい。図２４に示す例において、符号ｇ３８５に示すキーワードは、Ｑ＆Ａサイトの出現頻度が高いが検索エンジンで検索した上位サイトであまり用いられていないキーワードのうち、検索順位が例えば６１位以下であることを示している。また、符号ｇ３８６に示すキーワードは、Ｑ＆Ａサイトの出現頻度が高いが検索エンジンで検索した上位サイトであまり用いられていないキーワードのうち、検索順位が例えば３１位から６０位であることを示している。 FIG. 24 is a diagram illustrating a comparative example of a keyword search result obtained by the keyword extraction device 1D using the seventh extraction method according to the second modification of the present embodiment.
As shown in FIG. 24, each search result includes an image of the number of appearances (g372, g382), an image of the value obtained by multiplying the number of appearances by the importance (g373, g383), and an image of the extracted important keywords (g374, g384).
Further, as shown in FIG. 24, when the important keywords differ between the result of searching all the domains 11 to 14 (image g380) and the result of searching using the search engine (image g370), the way the differing important keywords are displayed may be changed. For example, the keyword extracting unit 18D may change the character color, the character weight, or the font type, or may overlay a colored marker on the characters. In the example shown in FIG. 24, the keyword indicated by reference sign g385 is a keyword that appears frequently on the Q & A sites but is not used much on the top sites found by the search engine, and whose search rank is, for example, 61st or lower. The keyword indicated by reference sign g386 is such a keyword whose search rank is, for example, 31st to 60th.

なお、本実施形態では、第７の抽出方法において、ドメイン１１〜ドメイン１４全てを検索した結果と、検索エンジンを用いて検索した結果を比較する例を説明したが、これに限られない。キーワード抽出部１８Ｄは、例えばドメイン１１とドメイン１２で検索された結果を比較してキーワードリストを生成するようにしてもよい。 In the present embodiment, an example in which the result of searching all the domains 11 to 14 and the result of searching using a search engine are compared in the seventh extraction method has been described, but the present invention is not limited to this. For example, the keyword extracting unit 18D may generate a keyword list by comparing results searched in the domain 11 and the domain 12.

以上のように、第２の抽出方法〜第７の抽出方法によれば、Ｑ＆Ａサイトからキーワードを抽出するようにした。この意味合いは、検索エンジンを用いる人が、知りたい情報が何であるかが、Ｑ＆Ａサイトの、特に質問欄に含まれている可能性が高い。このため、Ｑ＆Ａサイトの質問欄に含まれているキーワードは、情報を知りたい人が、検索エンジンを使って検索するときに、検索キーワードとして入力する可能性が高い。従って、Ｑ＆Ａサイトに含まれる重要なキーワードである共起語を、例えば自社のサイトに用いていれば、情報を知りたい人が、検索エンジンを使って検索したときに上位の検索結果として表示される可能性を高めることができる。 As described above, in the second to seventh extraction methods, keywords are extracted from Q & A sites. The rationale is that what a person using a search engine wants to know is highly likely to appear on a Q & A site, particularly in the question field. Therefore, a keyword included in the question field of a Q & A site is highly likely to be entered as a search keyword when a person who wants information searches using a search engine. Accordingly, if co-occurrence words, which are important keywords included in Q & A sites, are used in, for example, a company's own site, the likelihood that the site is displayed as a top search result when a person who wants information searches using a search engine can be increased.

以上のように、本実施形態の第２変形例に係るキーワード抽出装置１Ｄにおいて、検索部１２Ｄは、検索キーワードに基づいて予め定められている少なくとも２つのドメイン（例えば、Ｑ＆Ａサイト１〜４、検索エンジンのうちの少なくとも２つ）の異なるコンテンツを検索し、キーワード抽出部１８Ｄは、異なるドメインのコンテンツのテキストそれぞれから複数のキーワードをそれぞれ抽出し、異なるドメインのコンテンツのテキストそれぞれから抽出したキーワードを比較し、比較した結果に基づいてキーワードリストを生成する。 As described above, in the keyword extraction device 1D according to the second modification of the present embodiment, the search unit 12D searches, based on the search keyword, for different contents in at least two predetermined domains (for example, at least two of the Q & A sites 1 to 4 and the search engine), and the keyword extracting unit 18D extracts a plurality of keywords from each of the texts of the contents of the different domains, compares the keywords extracted from the respective texts, and generates a keyword list based on the comparison result.

この構成によって、本実施形態によれば、Ｑ＆Ａサイトで用いられている検索キーワードに対応する抽出されたキーワード（共起語）と、検索エンジンによって抽出されたキーワード（共起語）とを比較することができる。
例えば、Ｑ＆Ａサイトの質問から抽出されたキーワード（共起語）は、利用者が最も知りたいキーワードが含まれている可能性が高い。一方、検索エンジンで検索されたウェブページから抽出されたキーワード（共起語）には、コンテンツに使用される頻度が高くても、利用者が最も知りたい情報のキーワードではない場合もあり得る。このため、これらのキーワード（共起語）を比較し、利用者が最も知りたいと思われるキーワード（共起語）を含むコンテンツを作成することで、利用者が知りたい情報を提供することが可能になる。 With this configuration, according to the present embodiment, the extracted keywords (co-occurrence words) corresponding to the search keyword used on the Q & A sites can be compared with the keywords (co-occurrence words) extracted via the search engine.
For example, the keywords (co-occurrence words) extracted from questions on a Q & A site are likely to include the keywords that users most want to know about. On the other hand, the keywords (co-occurrence words) extracted from web pages found by a search engine may not be keywords for the information that users most want to know, even if they are used frequently in content. Therefore, by comparing these keywords (co-occurrence words) and creating content that includes the keywords (co-occurrence words) that users are thought to most want to know about, it becomes possible to provide the information that users want.
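The comparison made in the seventh extraction method can be sketched as follows. The sketch assumes that both keyword lists are already sorted in descending order of score, and that the "search rank" of a keyword means its position in the search-engine-side list; the description does not define the term precisely, so this reading is an assumption. Treating a keyword that is absent from the search-engine-side list as "61st or lower" is likewise an assumption, and the function name `classify_gaps` and the labels are hypothetical.

```python
def classify_gaps(qa_keywords, engine_keywords):
    """Label each Q&A-side keyword by its rank in the search-engine-side list.

    Both arguments are keyword lists sorted by descending score.
    """
    engine_rank = {kw: i + 1 for i, kw in enumerate(engine_keywords)}
    marks = {}
    for kw in qa_keywords:
        rank = engine_rank.get(kw)
        if rank is None or rank >= 61:
            marks[kw] = "61+"      # candidate for the g385-style highlight
        elif rank >= 31:
            marks[kw] = "31-60"    # candidate for the g386-style highlight
        else:
            marks[kw] = "top-30"   # no highlight
    return marks


engine_keywords = [f"kw{i}" for i in range(1, 71)]
marks = classify_gaps(["kw5", "kw40", "kw65", "other"], engine_keywords)
```

A display layer could then map the two highlighted labels to different character colors or markers.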

次に、第８の抽出方法について説明する。
第８の抽出方法では、入力した検索キーワードで検索した場合、検索結果の上位のサイトに含まれているキーワードが、評価対象のサイト（例えば自社のサイト）に過不足無く書かれているか判定する。 Next, an eighth extraction method will be described.
In the eighth extraction method, when a search is performed with the input search keyword, it is determined whether the keywords included in the top sites of the search results are written in the evaluation target site (for example, the company's own site) without excess or deficiency.

第８の抽出方法における操作画面について説明する。
図２５は、本実施形態の第２変形例に係る第８の抽出方法が選択された場合のキーワード抽出装置１Ｄによる操作画面の例を示す図である。なお、図２５は、検索キーワードの入力領域の画像ｇ４２０Ａと、探索結果の画像ｇ４４０を抜き出して示した図である。
図２５に示すように、検索キーワードの入力領域の画像ｇ４２０Ａには、検索キーワード入力欄の画像ｇ４２１、検索ボタンの画像ｇ４２２、所定の個数を選択する画像ｇ４２３、探索対象のドメイン入力欄の画像ｇ４２４が含まれている。 The operation screen in the eighth extraction method will be described.
FIG. 25 is a diagram illustrating an example of an operation screen of the keyword extraction device 1D when the eighth extraction method according to the second modification of the present embodiment is selected. FIG. 25 is a diagram showing an image g420A of the input area of the search keyword and an image g440 of the search result extracted.
As shown in FIG. 25, the image g420A of the search keyword input area includes an image g421 of a search keyword input field, an image g422 of a search button, an image g423 for selecting a predetermined number, and an image g424 of a search target domain input field.

また、探索結果の画像ｇ４４０には、検索キーワードを示す画像ｇ４４１、探索ドメインを示す画像ｇ４４２、出現回数の画像ｇ４４３、出現回数に重要度を乗算した値の画像ｇ４４４、抽出された重要キーワード（共起語）の画像ｇ４４５が含まれている。さらに、探索結果の画像ｇ４４０には、重要キーワードが使用されている評価サイト内のウェブページにおける重要キーワードの順位を示す画像ｇ４４６、重要キーワードが使用されている評価サイト内のウェブページのアドレスを示す画像ｇ４４７が含まれている。
なお、ドメイン内のウェブページ内に重要キーワードが含まれていない（使用されていない）場合は、順位を例えば５０位以上とし画像ｇ４３６に“５０＋”と表示し、画像ｇ４４７に空欄を表示させるようにしてもよい。 The search result image g440 includes an image g441 indicating the search keyword, an image g442 indicating the search domain, an image g443 of the number of appearances, an image g444 of the value obtained by multiplying the number of appearances by the importance, and an image g445 of the extracted important keywords (co-occurrence words). Further, the search result image g440 includes an image g446 indicating the rank of the important keyword in the web page in the evaluation site where the important keyword is used, and an image g447 indicating the address of the web page in the evaluation site where the important keyword is used.
If an important keyword is not included (not used) in the web pages in the domain, the rank may be treated as, for example, 50th or lower, with "50+" displayed in the image g446 and a blank displayed in the image g447.

次に、第８の抽出方法の処理手順について説明する。
図２６は、本実施形態の第２変形例に係る第８の抽出方法における処理のフローチャートである。 Next, the processing procedure of the eighth extraction method will be described.
FIG. 26 is a flowchart of a process in the eighth extraction method according to the second modification of the present embodiment.

（ステップＳ４０１）利用者は、第８の抽出方法を選択し、検索キーワードを検索キーワード入力欄（画像ｇ４２１）に入力し、さらに評価対象のドメインを、ドメイン入力欄（画像ｇ４２４）に入力する。続けて、キーワード入力部１１Ｄは、入力された検索キーワードと、評価対象のドメインとを取得する。なお、評価対象のドメインとは、抽出された重要キーワード（共起語）が含まれているウェブページを探索するためのドメインであり、例えば評価したい自社のサイトのドメインである。 (Step S401) The user selects the eighth extraction method, inputs a search keyword in the search keyword input field (image g421), and further inputs a domain to be evaluated in the domain input field (image g424). Subsequently, the keyword input unit 11D acquires the input search keyword and the domain to be evaluated. The domain to be evaluated is a domain for searching for a web page including the extracted important keyword (co-occurrence word), and is, for example, a domain of a site of a company to be evaluated.

（ステップＳ２）検索部１２Ｄは、検索エンジンを用いて検索キーワードを検索する。なお、検索方法は、第２の抽出方法〜第５の抽出方法で説明したＱ＆Ａサイトであってもよい。 (Step S2) The search unit 12D searches for the search keyword using a search engine. Note that the search may instead target the Q & A sites described for the second to fifth extraction methods.

続けて、検索部１２Ｄは、ステップＳ３を行う。続けて、第２ノイズ除去部１７Ｂは、ステップＳ１５の処理を行う。続けて、キーワード抽出部１８Ｄは、ステップＳ７の処理を行う。
（ステップＳ４０２）キーワード抽出部１８Ｄは、抽出された重要キーワード（共起語）を逐次選択する。
（ステップＳ４０３）キーワード抽出部１８Ｄは、選択した重要キーワードが入力されたドメインのウェブページに含まれているか否かを判別する。キーワード抽出部１８Ｄは、選択したキーワードが入力されたドメインのウェブページに含まれていると判別した場合（ステップＳ４０３；ＹＥＳ）、ステップＳ４０４の処理に進む。キーワード抽出部１８Ｄは、選択したキーワードが入力されたドメインのウェブページに含まれていないと判別した場合（ステップＳ４０３；ＮＯ）、ステップＳ４０２の処理に戻る。 Subsequently, the search unit 12D performs Step S3. Subsequently, the second noise removing unit 17B performs the process of step S15. Subsequently, the keyword extracting unit 18D performs the process of step S7.
(Step S402) The keyword extracting unit 18D sequentially selects the extracted important keywords (co-occurrence words).
(Step S403) The keyword extracting unit 18D determines whether the selected important keyword is included in the web page of the input domain. When the keyword extracting unit 18D determines that the selected keyword is included in the web page of the input domain (step S403; YES), the process proceeds to step S404. When the keyword extracting unit 18D determines that the selected keyword is not included in the web page of the input domain (step S403; NO), the process returns to step S402.

（ステップＳ４０５）キーワード抽出部１８Ｄは、ステップＳ４０２で選択したキーワードが含まれているウェブページのアドレスを取得する。
（ステップＳ４０５）キーワード抽出部１８Ｄは、ステップＳ６で抽出された全ての重要キーワードの選択が終了したか否かを判別する。キーワード抽出部１８Ｄは、全ての重要キーワードの選択が終了したと判別した場合（ステップＳ４０５；ＹＥＳ）、ステップＳ４０６の処理に進み、全ての重要キーワードの選択が終了していないと判別した場合（ステップＳ４０５；ＮＯ）、ステップＳ４０２の処理に戻る。 (Step S404) The keyword extracting unit 18D acquires the address of the web page that includes the keyword selected in step S402.
(Step S405) The keyword extracting unit 18D determines whether or not all the important keywords extracted in step S6 have been selected. When it determines that all the important keywords have been selected (step S405; YES), the keyword extracting unit 18D proceeds to the process of step S406; when it determines that not all the important keywords have been selected (step S405; NO), it returns to the process of step S402.

（ステップＳ４０６）キーワード抽出部１８Ｄは、重要キーワードにアドレスを対応付けてキーワードリストを生成する。
以上で、第８の抽出方法の処理を終了する。 (Step S406) The keyword extracting unit 18D generates a keyword list by associating an important keyword with an address.
This is the end of the processing of the eighth extraction method.
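The loop of steps S402 to S406 can be sketched as follows. The sketch assumes that the web pages of the evaluation target domain have already been fetched into a dictionary that maps each address to its page text, and that a keyword counts as included when it appears as a substring of that text. Both assumptions, together with the function name `build_keyword_list` and the sample addresses, are hypothetical.

```python
def build_keyword_list(important_keywords, site_pages):
    """Associate each important keyword with the addresses of the pages that use it.

    site_pages: address -> page text for the evaluation target domain (assumed input).
    """
    keyword_list = []
    for keyword in important_keywords:  # S402, repeated until S405 is YES
        # S403: is the keyword included in any page of the input domain?
        addresses = [url for url, text in site_pages.items() if keyword in text]
        if addresses:
            # S404: keep the addresses of the pages that contain the keyword.
            keyword_list.append((keyword, addresses))
        # S403; NO: move on to the next keyword without recording an address.
    return keyword_list  # S406: keywords associated with addresses


site_pages = {
    "https://example.test/a": "seo keyword research",
    "https://example.test/b": "content marketing",
}
keyword_list = build_keyword_list(["keyword", "ranking", "content"], site_pages)
```

In this sample, "ranking" is not used anywhere in the evaluation target domain, so it does not appear in the resulting list.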

以上のように、第８の抽出方法によれば、情報を知りたい人が検索時に入力すると想定される検索キーワードを用いて検索エンジンで検索した場合、検索結果の上位のサイトに含まれているキーワードが、評価対象のサイト（例えば自社のサイト）にも使用されているか否かを判定することができる。 As described above, according to the eighth extraction method, it is possible to determine whether the keywords included in the top sites of the search results, obtained when a search engine is searched using a search keyword that a person who wants information is expected to enter, are also used in the evaluation target site (for example, the company's own site).

次に、第９の抽出方法について説明する。
第９の抽出方法では、まず、第１の抽出方法によって重要キーワードを抽出する。さらに、評価したいサイトのドメインまたはウェブページのアドレスの情報（以下、評価サイトの情報という）において出現頻度（使用頻度）が高いキーワードを抽出する。そして、第９の抽出方法では、抽出された重要キーワードと、抽出したい評価サイトにおけるキーワードとを比較し、評価サイトに不足している重要キーワードを抽出する。 Next, a ninth extraction method will be described.
In the ninth extraction method, first, important keywords are extracted by the first extraction method. Further, keywords having a high appearance frequency (use frequency) are extracted from the site to be evaluated, which is specified by its domain or web page address (hereinafter referred to as information on the evaluation site). Then, in the ninth extraction method, the extracted important keywords are compared with the keywords extracted from the evaluation site, and the important keywords that are lacking on the evaluation site are extracted.

図２７は、本実施形態の第２変形例に係る第９の抽出方法が選択された場合のキーワード抽出装置による評価結果の例を示す図である。
図２７に示すように、評価結果の画像ｇ４５０には、検索キーワードを示す画像ｇ４５１、評価サイトの情報を示す画像ｇ４５２、評価サイトから抽出されたキーワードを示す画像ｇ４５３、検索エンジンによって検索された上位サイトから抽出された重要キーワードを示す画像ｇ４５４、過不足キーワードを示す画像ｇ４５５が含まれている。
評価サイトから抽出されたキーワードを示す画像ｇ４５３において、キーワード（画像ｇ４５３３）は出現回数（画像ｇ４５３１）に重要度を乗算した値（画像ｇ４５３２）が大きい順に表示される。また、上位サイトから抽出された重要キーワードを示す画像ｇ４５４には、出現回数の画像ｇ４５４１、出現回数に重要度を乗算した値の画像ｇ４５４２、重要キーワード（共起語）の画像ｇ４５４３が含まれている。 FIG. 27 is a diagram illustrating an example of an evaluation result by the keyword extraction device when the ninth extraction method according to the second modification of the present embodiment is selected.
As shown in FIG. 27, the image g450 of the evaluation result includes an image g451 indicating the search keyword, an image g452 indicating the information of the evaluation site, an image g453 indicating the keywords extracted from the evaluation site, an image g454 indicating the important keywords extracted from the top sites found by the search engine, and an image g455 indicating the excess / shortage keywords.
In the image g453 indicating the keywords extracted from the evaluation site, the keywords (image g4533) are displayed in descending order of the value (image g4532) obtained by multiplying the number of appearances (image g4531) by the importance. In addition, the image g454 indicating the important keywords extracted from the top sites includes an image g4541 of the number of appearances, an image g4542 of the value obtained by multiplying the number of appearances by the importance, and an image g4543 of the important keywords (co-occurrence words).

過不足キーワードを示す画像ｇ４５５には、例えば、不足１（評価サイトに追加した方がよい重要キーワード；評価サイトでの頻度が低いキーワード）のリストの画像ｇ４５５１、不足２（評価サイトに追加した方がよい重要キーワード；評価サイトで使用されていないキーワード）のリストの画像ｇ４５５２が含まれている。さらに、過不足キーワードを示す画像ｇ４５５には、過剰１（評価サイトでは頻度が高いが、上位サイトでは頻度が低いキーワード）のリストの画像ｇ４５５３、過剰２（上位サイトでは使用されていないが評価サイトで頻度の高いキーワード）のリストの画像ｇ４５５４が含まれている。なお、不足しているキーワードは、検索エンジンによって上位のサイトから抽出された共起語であるため、商品やサービスの情報について購買者や商品の利用者が知りたい情報である。一方、過剰なキーワードは、購買者や商品の利用者にとっては、過剰な情報である可能性がある。 The image g455 indicating the excess / shortage keywords includes, for example, an image g4551 of a list of shortage 1 (important keywords that should be added to the evaluation site; keywords with a low frequency on the evaluation site) and an image g4552 of a list of shortage 2 (important keywords that should be added to the evaluation site; keywords not used on the evaluation site). The image g455 further includes an image g4553 of a list of excess 1 (keywords with a high frequency on the evaluation site but a low frequency on the top sites) and an image g4554 of a list of excess 2 (keywords not used on the top sites but with a high frequency on the evaluation site). Since the missing keywords are co-occurrence words extracted from the top sites by the search engine, they represent information that purchasers and users of a product want to know about the product or service. On the other hand, the excess keywords may be excessive information for purchasers and users of the product.

ここで、不足キーワードの検出方法、過剰キーワードの検出方法の一例を説明する。
キーワード抽出部１８Ｄは、上位サイトから抽出された重要キーワードのうち１つを順次選択する。そして、キーワード抽出部１８Ｄは、選択した重要キーワードと、評価サイトから抽出されたキーワードとを順次比較することで、不足しているキーワードを検出する。
また、キーワード抽出部１８Ｄは、評価サイトから抽出されたキーワードのうち１つを順次選択する。そして、キーワード抽出部１８Ｄは、選択したキーワードと、上位サイトから抽出された重要キーワードとを順次比較することで、過剰なキーワードを検出する。 Here, an example of a method for detecting a missing keyword and a method for detecting an excessive keyword will be described.
The keyword extracting unit 18D sequentially selects one of the important keywords extracted from the top sites. Then, the keyword extracting unit 18D detects a missing keyword by sequentially comparing the selected important keyword with the keyword extracted from the evaluation site.
The keyword extracting unit 18D sequentially selects one of the keywords extracted from the evaluation site. Then, the keyword extracting unit 18D detects an excessive keyword by sequentially comparing the selected keyword with an important keyword extracted from the top site.
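The two comparison passes described above, together with the four lists of FIG. 27 (shortage 1, shortage 2, excess 1, excess 2), might be sketched as follows. The description does not state what counts as a "low" or "high" frequency, so the thresholds `low` and `high`, the function name, and the use of raw appearance counts (rather than counts multiplied by importance) are all assumptions made for illustration.

```python
from typing import Dict, List


def classify_keywords(top: Dict[str, int], site: Dict[str, int],
                      low: int = 2, high: int = 5) -> Dict[str, List[str]]:
    """Compare top-site keywords with evaluation-site keywords.

    `top` and `site` map a keyword to its number of appearances.
    `low` and `high` are hypothetical thresholds, not values from the text.
    """
    result: Dict[str, List[str]] = {
        "shortage1": [], "shortage2": [], "excess1": [], "excess2": [],
    }
    # Pass 1: take each important keyword from the top sites in turn.
    for keyword in top:
        count = site.get(keyword, 0)
        if count == 0:
            result["shortage2"].append(keyword)   # not used on the evaluation site
        elif count <= low:
            result["shortage1"].append(keyword)   # used, but with low frequency
    # Pass 2: take each keyword from the evaluation site in turn.
    for keyword, count in site.items():
        if count < high:
            continue                              # only frequent keywords can be excess
        top_count = top.get(keyword, 0)
        if top_count == 0:
            result["excess2"].append(keyword)     # not used on the top sites
        elif top_count <= low:
            result["excess1"].append(keyword)     # rare on the top sites
    return result
```

Dictionary lookups replace the one-by-one comparison in the description; the outcome is the same, but each pass runs in time proportional to the number of keywords.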

次に、第９の抽出方法の使用例について説明する。
サイトの運営者は、例えば評価サイトとして自社のサイトのアドレスを入力する。そして、評価結果を用いて、自社のサイトに不足しているキーワードを知ることで、自社のサイトを改善することができる。
なお、図２７に示した評価結果は一例であり、これに限られず、評価サイトと上位サイトを比較した結果に基づく情報であればよい。 Next, a usage example of the ninth extraction method will be described.
The site operator inputs the address of the company's own site as an evaluation site, for example. Then, by using the evaluation results to know the keywords that are lacking in the company site, the company site can be improved.
The evaluation result shown in FIG. 27 is an example, and the present invention is not limited to this. Any information may be used as long as the information is based on a result of comparison between an evaluation site and a top site.

上述したように、本実施形態の第２変形例においては、利用者はキーワード（共起語）を検索する場合に、検索したいウェブページのドメインを選択することができる。これにより、利用者は、例えば、各抽出方法によってキーワードを抽出させ、得られた抽出結果を比較することができる。また、上述したように、本実施形態の第２変形例では、複数のＱ＆Ａサイトを検索してキーワードを抽出するため、各Ｑ＆Ａサイトで話題となった文章から、バランスよくキーワード（共起語）を抽出することができる。 As described above, in the second modification of the present embodiment, when searching for keywords (co-occurrence words), the user can select the domain of the web pages to be searched. Thus, the user can, for example, have keywords extracted by each extraction method and compare the obtained extraction results. Further, as described above, in the second modification of the present embodiment, a plurality of Q&A sites are searched to extract keywords, so keywords (co-occurrence words) can be extracted in a well-balanced manner from the sentences that have been discussed on each Q&A site.

なお、検索部１２Ｄは、Ｑ＆Ａサイトの質問から検索キーワードに対応するキーワード（共起語）を抽出するようにしてもよい。この場合、キーワード抽出装置１Ｄは、メインコンテンツ抽出部１５を備え、メインコンテンツ抽出部１５が抽出したメインコンテンツから質問のテキストを抽出する。 The search unit 12D may extract a keyword (co-occurrence word) corresponding to the search keyword from the question on the Q & A site. In this case, the keyword extracting device 1D includes a main content extracting unit 15, and extracts a question text from the main content extracted by the main content extracting unit 15.

以上のように、本実施形態の第２変形例に係るキーワード抽出装置１Ｄにおいて、検索部１２Ｄは、検索キーワードに基づいて、検索エンジンを用いてコンテンツを検索し、キーワード抽出部１８Ｄがコンテンツのテキストから抽出した複数のキーワードに基づいて検索して評価対象のサイトの検索結果の順位を検索し、キーワード抽出部は、コンテンツのテキストから複数のキーワードを抽出し、抽出した複数のキーワードが前記評価対象のサイトのコンテンツで使用されているか否かを判別した結果と、検索部が検索した評価対象のサイトの検索順位に基づいてキーワードリストを生成する。 As described above, in the keyword extraction device 1D according to the second modification of the present embodiment, the search unit 12D searches for content using the search engine based on the search keyword, and also searches based on the plurality of keywords that the keyword extracting unit 18D extracted from the text of the content, thereby obtaining the rank of the site to be evaluated in the search results. The keyword extracting unit 18D extracts a plurality of keywords from the text of the content, and generates a keyword list based on the result of determining whether the extracted keywords are used in the content of the site to be evaluated and on the search rank of that site obtained by the search unit 12D.

この構成によって、本実施形態によれば、検索エンジンによって抽出されたキーワード（共起語）が、評価サイトに含まれているか否か、検索エンジンで検索した場合の順位を提供することができる。 With this configuration, according to the present embodiment, it is possible to provide whether or not the keyword (co-occurrence word) extracted by the search engine is included in the evaluation site, and the ranking when the search engine performs the search.

［第３実施形態］
本実施形態では、ウェブページの品質を評価することができるキーワード抽出装置１Ｅについて説明する。
本実施形態のキーワード抽出装置１Ｅは、入力されたウェブページのメインコンテンツを抽出し、抽出したメインコンテンツから予め定められている個数の文章を抽出する。そして、キーワード抽出装置１Ｅは、抽出した文章を検索エンジンで検索し、検索した結果に基づいて、ウェブページを評価する。 [Third embodiment]
In the present embodiment, a keyword extraction device 1E capable of evaluating the quality of a web page will be described.
The keyword extracting device 1E of the present embodiment extracts the main content of the input web page, and extracts a predetermined number of sentences from the extracted main content. Then, the keyword extracting device 1E searches the extracted text by a search engine, and evaluates the web page based on the search result.

＜キーワード抽出装置１Ｅの構成＞
図２８は、本実施形態に係るキーワード抽出装置１Ｅの概略構成図である。なお、キーワード抽出装置１、１Ａ、１Ｂ、１Ｃ、または１Ｄと同じ機能を有する機能部については、同じ符号を用いて、説明を省略する。
図２８に示すように、キーワード抽出装置１Ｅは、キーワード入力部１１Ｅ、検索部１２Ｅ、メインコンテンツ抽出部１５、タグＤＢ１６、第２ノイズ除去部１７Ｅ、文章抽出部２０、検索順位取得部２１、評価結果生成部２２、および評価結果出力部２３を備える。 <Configuration of Keyword Extraction Device 1E>
FIG. 28 is a schematic configuration diagram of a keyword extraction device 1E according to the present embodiment. Note that functional units having the same functions as those of the keyword extracting devices 1, 1A, 1B, 1C, and 1D are denoted by the same reference numerals and description thereof is omitted.
As shown in FIG. 28, the keyword extraction device 1E includes a keyword input unit 11E, a search unit 12E, a main content extraction unit 15, a tag DB 16, a second noise removal unit 17E, a text extraction unit 20, a search order acquisition unit 21, an evaluation result generation unit 22, and an evaluation result output unit 23.

キーワード入力部１１Ｅは、利用者によって入力されたウェブページのアドレスを示す情報を取得し、取得したウェブページのアドレスを示す情報を検索部１２Ｅと検索順位取得部２１に出力する。 The keyword input unit 11E acquires information indicating the address of the web page input by the user, and outputs the information indicating the acquired address of the web page to the search unit 12E and the search order acquisition unit 21.

検索部１２Ｅは、キーワード入力部１１Ｅが出力したウェブページのアドレスを検索エンジンに入力し、ウェブページを検索する。検索部１２Ｅは、検索したウェブページのソースコードを取得し、取得したソースコードをメインコンテンツ抽出部１５に出力する。
また、検索部１２Ｅは、文章抽出部２０が出力した文章を取得し、取得した文書のうち１つを順次選択する。検索部１２Ｅは、選択した文章を、順次、検索エンジンに入力して検索する。なお、検索結果には、ソースコードが含まれている。そして、検索部１２Ｅは、選択した文章と検索結果をメインコンテンツ抽出部１５に順次出力する。 The search unit 12E inputs the address of the web page output by the keyword input unit 11E to a search engine, and searches for the web page. The search unit 12E acquires the source code of the searched web page, and outputs the acquired source code to the main content extraction unit 15.
Further, the search unit 12E acquires the sentences output by the text extraction unit 20, and sequentially selects one of the acquired sentences. The search unit 12E sequentially inputs each selected sentence to a search engine to search. The search result includes the source code. Then, the search unit 12E sequentially outputs the selected sentence and the search result to the main content extraction unit 15.

第２ノイズ除去部１７Ｅは、メインコンテンツ抽出部１５が出力したメインコンテンツの中から、タグＤＢ１６を参照して無意味言葉等を除去する。第２ノイズ除去部１７Ｅは、無意味言葉等を除去したメインコンテンツを、文章抽出部２０に出力する。なお、無意味言葉には、検索に用いた文章に関連した広告が含まれる。第２ノイズ除去部１７Ｅは、無意味言葉等を除去したメインコンテンツと、選択された文章とを、検索順位取得部２１に順次出力する。 The second noise removing unit 17E removes meaningless words and the like from the main content output by the main content extracting unit 15 with reference to the tag DB 16. The second noise removing unit 17E outputs the main content from which the meaningless words and the like have been removed to the text extracting unit 20. The meaningless words include advertisements related to the text used for the search. The second noise removing unit 17E sequentially outputs the main content from which the meaningless words and the like have been removed and the selected text to the search order obtaining unit 21.

文章抽出部２０は、第２ノイズ除去部１７Ｅが出力したメインコンテンツから予め定められた個数の文章（テキスト）を抽出する。なお、予め定められた個数は、１つ以上であればよく、固定された値であってもよく、メインコンテンツの総文字数に応じて設定される個数であってもよい。文章抽出部２０は、抽出した文章を検索部１２Ｅに出力する。なお、文章抽出部２０は、抽出した文章が、所定の文字数以上の場合、文書の頭から所定の文字数を抜き出して、１つの文章として扱うようにしてもよい。 The text extraction unit 20 extracts a predetermined number of texts (texts) from the main content output by the second noise removal unit 17E. The predetermined number may be one or more, may be a fixed value, or may be a number set according to the total number of characters of the main content. The text extraction unit 20 outputs the extracted text to the search unit 12E. When the extracted text has a predetermined number of characters or more, the text extraction unit 20 may extract a predetermined number of characters from the head of the document and treat the text as one text.
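As a rough illustration of the text extraction unit 20, the sketch below takes a predetermined number of sentences from the main content and cuts each one to a maximum number of characters from its head. The description does not say how sentence boundaries are found or which sentences are chosen, so splitting on sentence-ending punctuation, taking the first `count` sentences, and the default values of `count` and `max_chars` are all assumptions.

```python
import re
from typing import List


def extract_sentences(main_content: str, count: int = 5,
                      max_chars: int = 50) -> List[str]:
    """Return up to `count` sentences, each cut to `max_chars` characters."""
    # Split after Japanese or Western sentence-ending punctuation.
    # This is a simplification: it also splits inside numbers such as "3.14".
    parts = re.split(r"(?<=[。．.!?！？])\s*", main_content)
    sentences = [part.strip() for part in parts if part.strip()]
    # Taking the first `count` sentences is an assumption; keep only the
    # head of any sentence longer than `max_chars`.
    return [sentence[:max_chars] for sentence in sentences[:count]]
```

A count that depends on the total number of characters of the main content, which the description also allows, could be computed by the caller and passed in as `count`.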

検索順位取得部２１は、第２ノイズ除去部１７Ｅが出力した選択された文章と、無意味言葉等が除去された検索結果におけるメインコンテンツを順次取得する。また、検索順位取得部２１は、キーワード入力部１１Ｅが出力したウェブページのアドレスを示す情報を取得する。検索順位取得部２１は、取得したメインコンテンツとウェブページのアドレスを用いて、検索結果におけるウェブページの順位を取得し、取得した順位を選択された文章と対応付けて順次、評価結果生成部２２に出力する。 The search order obtaining unit 21 sequentially obtains the selected sentence output by the second noise removing unit 17E and the main content of the search result from which the meaningless words and the like have been removed. In addition, the search order obtaining unit 21 obtains the information indicating the address of the web page output by the keyword input unit 11E. Using the obtained main content and the address of the web page, the search order obtaining unit 21 obtains the rank of the web page in the search result, associates the obtained rank with the selected sentence, and sequentially outputs them to the evaluation result generation unit 22.

評価結果生成部２２は、検索順位取得部２１が出力した順位に配点し、各文章に対する評価を行う。評価結果生成部２２は、各文章の配点を合計し、合計点に応じて評価結果を生成し、生成した評価結果を評価結果出力部２３に出力する。なお、順位に対する配点、評価結果については、後述する。 The evaluation result generation unit 22 assigns scores to the rankings output by the search ranking acquisition unit 21 and evaluates each sentence. The evaluation result generation unit 22 sums up the points of each sentence, generates an evaluation result according to the total score, and outputs the generated evaluation result to the evaluation result output unit 23. Note that the points for the rankings and the evaluation results will be described later.

評価結果出力部２３は、例えばＷｅｂ上での情報提供部、表示装置、プリンタ装置、通信装置のうち少なくとも１つである。評価結果出力部２３は、評価結果生成部２２が出力した評価結果を、例えばＷｅｂ上で提供する。 The evaluation result output unit 23 is, for example, at least one of an information providing unit on the Web, a display device, a printer device, and a communication device. The evaluation result output unit 23 provides the evaluation result output by the evaluation result generation unit 22 on, for example, the Web.

＜評価処理の手順＞
次に、キーワード抽出装置１Ｅが行う評価処理の手順について説明する。
図２９は、本実施形態に係るキーワード抽出装置１Ｅが行う評価処理のフローチャートである。
（ステップＳ５０１）キーワード入力部１１Ｅは、利用者によって入力されたウェブページのアドレスを示す情報を取得する。
（ステップＳ５０２）検索部１２Ｅは、ウェブページのソースコードを取得する。
続けて、メインコンテンツ抽出部１５は、ステップＳ５の処理を行い、処理終了後、ステップＳ５０３に処理を進める。 <Evaluation procedure>
Next, a procedure of an evaluation process performed by the keyword extracting device 1E will be described.
FIG. 29 is a flowchart of an evaluation process performed by the keyword extraction device 1E according to the present embodiment.
(Step S501) The keyword input unit 11E acquires information indicating an address of a web page input by a user.
(Step S502) The search unit 12E acquires the source code of the web page.
Subsequently, the main content extraction unit 15 performs the process of step S5, and after the process ends, proceeds to step S503.

（ステップＳ５０３）文章抽出部２０は、第２ノイズ除去部１７Ｅが出力したメインコンテンツから予め定められた個数の文章（テキスト）を抽出する。
（ステップＳ５０４）検索部１２Ｅは、文章抽出部２０が出力した文章を取得し、取得した文書のうち１つを順次選択する。続けて、検索部１２Ｅは、選択した文章を、順次、検索エンジンに入力して検索する。続けて、メインコンテンツ抽出部１５は、検索部１２Ｅが出力した検索結果からメインコンテンツを抽出する。続けて、第２ノイズ除去部１７Ｅは、メインコンテンツ抽出部１５が出力したメインコンテンツから広告を含む無意味言葉を除去する。 (Step S503) The text extracting unit 20 extracts a predetermined number of texts (texts) from the main content output by the second noise removing unit 17E.
(Step S504) The search unit 12E acquires the sentences output by the text extraction unit 20, and sequentially selects one of the acquired sentences. Subsequently, the search unit 12E sequentially inputs each selected sentence to a search engine to search. Subsequently, the main content extraction unit 15 extracts main content from the search result output by the search unit 12E. Subsequently, the second noise removing unit 17E removes meaningless words, including advertisements, from the main content output by the main content extracting unit 15.

（ステップＳ５０５）検索順位取得部２１は、第２ノイズ除去部１７Ｅが出力した選択された文章と、無意味言葉が除去された検索結果におけるメインコンテンツを順次取得する。続けて、検索順位取得部２１は、検索結果におけるウェブページの順位を取得する。
（ステップＳ５０６）評価結果生成部２２は、検索順位取得部２１が出力した順位に対して配点し、各文章に対する評価を行う。続けて、評価結果生成部２２は、各文章の配点を合計し、合計点に応じて評価結果を生成する。続けて、評価結果出力部２３は、評価結果生成部２２が出力した評価結果を出力する。
以上で、評価処理を終了する。 (Step S505) The search order obtaining unit 21 sequentially obtains the selected text output by the second noise removing unit 17E and the main content in the search result from which the meaningless words have been removed. Subsequently, the search order obtaining unit 21 obtains the order of the web page in the search result.
(Step S506) The evaluation result generation unit 22 assigns a score to the ranking output by the search ranking acquisition unit 21, and evaluates each sentence. Subsequently, the evaluation result generation unit 22 sums up the points of each sentence and generates an evaluation result according to the total points. Subsequently, the evaluation result output unit 23 outputs the evaluation result output by the evaluation result generation unit 22.
Thus, the evaluation processing ends.
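The rank acquisition in step S505 amounts to locating the evaluated web page within the ordered search results. A minimal sketch is shown below, assuming the search results are already available as an ordered list of addresses; the function name `find_rank`, the list input, and the trailing-slash normalization are assumptions, and no real search engine API is used.

```python
from typing import List, Optional


def find_rank(result_urls: List[str], target_url: str) -> Optional[int]:
    """Return the 1-based rank of `target_url` in the results, or None if absent."""
    target = target_url.rstrip("/")
    for position, url in enumerate(result_urls, start=1):
        # Ignoring a trailing slash is a simplifying assumption.
        if url.rstrip("/") == target:
            return position
    return None
```

A result of `None` can be treated by the caller in the same way as a rank of 20th or lower.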

＜配点、評価結果の例＞
次に、配点、評価結果の例について説明する。
図３０は、本実施形態に係る評価結果の例を示す図である。図３０に示す例は、２つのウェブページに対する評価結果の例である。図３０に示すように、出力される判定結果には、ウェブページのアドレス、総合点、取得文章＋順位、アドバイス、評価日が含まれている。 <Example of score and evaluation result>
Next, an example of a score and an evaluation result will be described.
FIG. 30 is a diagram illustrating an example of an evaluation result according to the present embodiment. The example shown in FIG. 30 is an example of an evaluation result for two web pages. As shown in FIG. 30, the output determination result includes the address of the web page, the total score, the obtained sentence + rank, advice, and the evaluation date.

まず、配点について説明する。
一般的に、検索エンジンの利用者は、検索結果の１位から検索内容を閲覧していく。例えば、検索結果が１位の検索結果を閲覧し、そこで知りたい情報が得られた場合、他の検索結果を閲覧しない場合が少なくない。そして、検索エンジンの利用者は、検索結果が２０位以下の検索結果を閲覧しない場合が少なくない。したがって、検索結果が上位であるほど、検索に用いられた文章は、他のウェブページに対して優位であると言える。また、順位が低い場合、検索に用いられた文章は、他のウェブページにも使用されていることを意味しているため、他のウェブページに対する優位性が低いと言える。 First, the allocation will be described.
Generally, a user of a search engine browses the search results starting from the first-ranked result. For example, when the user views the first-ranked result and obtains the desired information there, the user often does not view the other results. In addition, users of a search engine often do not view results ranked 20th or lower. Therefore, it can be said that the higher the rank in the search results, the more the sentence used in the search is superior to other web pages. Conversely, a low rank means that the sentence used for the search is also used on other web pages, and thus it can be said that its superiority to other web pages is low.

評価結果生成部２２は、５個の文章を選択した場合、文章毎に２０点（＝１００／５）を割り当てる。そして、評価結果生成部２２は、上述した利用により、例えば、１位に２０点、２位に１６点、３位に１２点、・・・、２０位以下に０点を割り当てる。
評価結果生成部２２は、５つの文章の配点の総合点が１００点の場合、判定結果として「◎」または「ＶｅｒｙＧｏｏｄ」であると判別し、総合点が１００点未満である場合、判定結果として「×」または「ＮｏＧｏｏｄ」であると判定する。
なお、上述した配点、判定は一例であり、これに限られない。 When five sentences are selected, the evaluation result generation unit 22 assigns 20 points (= 100/5) to each sentence. Then, for the reason described above, the evaluation result generation unit 22 assigns, for example, 20 points to first place, 16 points to second place, 12 points to third place, ..., and 0 points to 20th place or lower.
When the total score of the five sentences is 100 points, the evaluation result generation unit 22 determines that the judgment result is “◎” or “VeryGood”, and when the total score is less than 100 points, it determines that the judgment result is “×” or “NoGood”.
Note that the above-described scoring and determination are merely examples, and the present invention is not limited to this.
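The scoring for the five-sentence case can be sketched as follows. The description gives points only for first, second, and third place (20, 16, and 12) and for 20th place or lower (0); the points for the ranks in between are elided, so this sketch does not guess them and instead raises an error unless the caller supplies them. The names `EXAMPLE_POINTS` and `score_page` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Points per rank for the five-sentence example (20 points per sentence).
# Only ranks 1 to 3 are stated in the description; ranks 4 to 19 are elided
# there, so they are intentionally not filled in here.
EXAMPLE_POINTS: Dict[int, int] = {1: 20, 2: 16, 3: 12}


def score_page(ranks: List[Optional[int]],
               points: Dict[int, int] = EXAMPLE_POINTS) -> Tuple[int, str]:
    """Sum the points of each sentence's rank and return (total, judgment)."""
    total = 0
    for rank in ranks:
        if rank is None or rank >= 20:
            continue                      # 20th or lower (or not found): 0 points
        if rank not in points:
            raise ValueError(f"no points defined for rank {rank}")
        total += points[rank]
    # A full score of 100 yields "◎"; anything less yields "×".
    judgment = "◎" if total == 100 else "×"
    return total, judgment
```

For the ranks 3rd, 1st, 1st, 20th or lower, and 1st, this gives 12 + 20 + 20 + 0 + 20 = 72 points and the judgment "×", which agrees with the example of FIG. 30.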

図３０の符号ｇ５０１で囲んだ評価結果は、ウェブページ「http://www.abcdef.html」に対する評価結果である。抽出された５つの文章が、文章１〜文章５である。それぞれの文章を検索エンジンに入力して検索した結果、それぞれの順位が３位、１位、１位、２０位以上、１位である。そして、総合点が７２点であり、判定「×」である。また、アドバイスは、「コピーされているか、書き直しを強くオススメます。」である。 The evaluation result surrounded by reference numeral g501 in FIG. 30 is the evaluation result for the web page “http://www.abcdef.html”. The five extracted sentences are sentences 1 to 5. As a result of inputting each sentence to the search engine and searching, the respective ranks are 3rd, 1st, 1st, 20th or lower, and 1st. The total score is therefore 72 points (12 + 20 + 20 + 0 + 20), and the judgment is “×”. The advice is “The text may have been copied; rewriting is strongly recommended.”

図３０の符号ｇ５０２で囲んだ評価結果は、ウェブページ「http://www.abcdfg.html」に対する評価結果である。符号ｇ５０２で囲んだ評価結果は、メインコンテンツから１０個の文章を抽出して評価した結果の例である。このウェブページから抜き出した例では、抽出した１０個の文章のうち１位が２個、２０位以内が８個であり、総合点が２０点である。そして、アドバイスは、「コピーされているか、書き直しを強くオススメます。」である。
なお、アドバイスの文面は、総合点に対応付けて、評価結果生成部２２に予め記憶させておくようにしてもよい。 The evaluation result surrounded by reference numeral g502 in FIG. 30 is the evaluation result for the web page “http://www.abcdfg.html”. It is an example of the result of extracting and evaluating ten sentences from the main content. In this example, of the ten extracted sentences, two are ranked first and eight are ranked within 20th place, and the total score is 20 points. The advice is “The text may have been copied; rewriting is strongly recommended.”
The text of the advice may be stored in advance in the evaluation result generation unit 22 in association with the total score.

なお、図２８に示した例において、キーワード抽出装置１Ｅは、タグＤＢ１６、第２ノイズ除去部１７Ｅを備えていなくてもよい。この場合、メインコンテンツ抽出部１５は、抽出したメインコンテンツを文章抽出部２０に出力し、検索結果のメインコンテンツを検索順位取得部２１に出力するようにしてもよい。 In the example illustrated in FIG. 28, the keyword extracting device 1E need not include the tag DB 16 and the second noise removing unit 17E. In this case, the main content extracting unit 15 may output the extracted main content to the text extracting unit 20 and output the main content of the search result to the search order obtaining unit 21.

以上のように、本実施形態に係るキーワード抽出装置１Ｅにおいて、メインコンテンツ抽出部１５が抽出したメインコンテンツから少なくとも１つの文章を抽出する文章抽出部２０と、検索部１２Ｅによって文章に基づいて検索された順位を取得する検索順位取得部２１と、検索順位取得部が取得した順位に基づいて、文章が抽出された評価を行う対象のウェブページに対して評価を行う評価結果生成部２２と、をさらに備える。 As described above, the keyword extraction device 1E according to the present embodiment further includes: a text extraction unit 20 that extracts at least one sentence from the main content extracted by the main content extraction unit 15; a search order acquisition unit 21 that acquires the rank obtained when the search unit 12E searches based on the sentence; and an evaluation result generation unit 22 that evaluates the web page to be evaluated, from which the sentence was extracted, based on the rank acquired by the search order acquisition unit 21.

この構成によって、本実施形態によれば、評価を行いたいウェブページからメインコンテンツを抽出し、抽出されたメインコンテンツから少なくとも１つの文章を抽出する。そして、本実施形態では、抽出された文章を、検索エンジンを用いて検索を行い、検索に用いられた文書が含まれているウェブサイトの順位に基づいて、ウェブページの評価を行う。これにより、本実施形態によれば、ウェブページの運用者は、ウェブページのアドレスをキーワード抽出装置１Ｅに入力するだけで、自社のウェブページの品質の評価結果を得ることができる。 With this configuration, according to the present embodiment, main content is extracted from a web page to be evaluated, and at least one sentence is extracted from the extracted main content. In the present embodiment, the extracted text is searched using a search engine, and the web page is evaluated based on the ranking of the website including the document used for the search. Thus, according to the present embodiment, the web page operator can obtain the evaluation result of the quality of the web page of the company only by inputting the web page address into the keyword extracting device 1E.

［第４実施形態］
本実施形態では、検索エンジンが有するサジェスト機能を用いてキーワードの抽出、評価を行う例を説明する。
まず、サジェスト機能について説明する。
サジェスト機能とは、検索エンジンを用いて単語を検索するときに、検索エンジンの利用者が検索する可能性が高い言葉を検索エンジンが提案する機能である。例えば、検索エンジンに「格安ＳＩＭ」と入力すると、「格安ｓｉｍ」、「格安ｓｉｍ比較」、「格安ｓｉｍテザリング」等の候補が提案される。このように、提案される言葉は、検索エンジンの利用者によって検索された回数が多い、すなわち利用者が知りたい情報である場合が多い。
本実施形態では、検索ワードに対して提案される単語を収集し、収集した単語が評価サイト（例えば自社のサイト）に含まれている頻度に応じて、評価サイトを評価する。 [Fourth embodiment]
In this embodiment, an example in which a keyword is extracted and evaluated using a suggestion function of a search engine will be described.
First, the suggestion function will be described.
The suggestion function is a function in which when a search engine is used to search for a word, the search engine proposes a word that is highly likely to be searched by a user of the search engine. For example, when "cheap SIM" is input to the search engine, candidates such as "cheap sim", "cheap sim comparison", and "cheap sim tethering" are proposed. As described above, the suggested words are frequently searched by users of the search engine, that is, information that the users want to know in many cases.
In the present embodiment, words suggested for a search word are collected, and the evaluation sites are evaluated according to the frequency at which the collected words are included in an evaluation site (for example, a company's own site).

＜キーワード抽出装置１Ｆの構成＞
図３１は、本実施形態に係るキーワード抽出装置１Ｆの概略構成図である。
図３１に示すように、キーワード抽出装置１Ｆは、キーワード入力部１１Ｆ、検索部１２Ｆ、メインコンテンツ抽出部１５、サジェスト取得部２４、検索順位取得部２１Ｆ、評価結果生成部２２Ｆ、および評価結果出力部２３を備える。また、キーワード抽出装置１Ｆは、ネットワーク２に接続されている。なお、キーワード抽出装置１、１Ａ、１Ｂ、１Ｃ、１Ｄ、１Ｅ、または１Ｆと同じ機能を有する機能部については、同じ符号を用いて、説明を省略する。また、キーワード抽出装置１Ｆは、メインコンテンツ抽出部１５と検索順位取得部２１Ｆとの間に、第２ノイズ除去部１７（または１７Ｂ、１７Ｅ）、タグＤＢ１６を備えていてもよい。 <Configuration of Keyword Extraction Device 1F>
FIG. 31 is a schematic configuration diagram of a keyword extraction device 1F according to the present embodiment.
As shown in FIG. 31, the keyword extraction device 1F includes a keyword input unit 11F, a search unit 12F, a main content extraction unit 15, a suggestion acquisition unit 24, a search order acquisition unit 21F, an evaluation result generation unit 22F, and an evaluation result output unit 23. The keyword extracting device 1F is connected to the network 2. Note that functional units having the same functions as those of the keyword extracting devices 1, 1A, 1B, 1C, 1D, 1E, or 1F are denoted by the same reference numerals, and description thereof is omitted. The keyword extracting device 1F may include the second noise removing unit 17 (or 17B, 17E) and the tag DB 16 between the main content extracting unit 15 and the search order obtaining unit 21F.

キーワード入力部１１Ｆは、利用者によって入力された検索キーワードを取得し、取得した検索キーワードを検索部１２Ｆに出力する。また、キーワード入力部１１Ｆは、利用者によって入力された評価サイトの情報を取得し、取得した評価サイトの情報を検索順位取得部２１Ｆに出力する。 The keyword input unit 11F acquires a search keyword input by the user, and outputs the acquired search keyword to the search unit 12F. The keyword input unit 11F acquires information on the evaluation site input by the user, and outputs the acquired information on the evaluation site to the search order acquiring unit 21F.

検索部１２Ｆは、キーワード入力部１１Ｆが出力した検索キーワードを検索エンジンに入力し、検索キーワードを入力したときに提案される言葉（以下、予測言葉）をサジェスト取得部２４に出力する。なお、予測言葉には、少なくとも検索キーワードが含まれ、例えば検索キーワードと他の単語との組み合わせ、検索キーワードを含む複合語等である。
また、検索部１２Ｆは、サジェスト取得部２４が出力した予測言葉のうちから１つを選択し、選択した予測言葉を検索エンジンに入力して検索する。そして、検索部１２Ｆは、検索結果を順次、メインコンテンツ抽出部１５に出力する。 The search unit 12F inputs the search keyword output by the keyword input unit 11F to a search engine, and outputs a word proposed when the search keyword is input (hereinafter, a predicted word) to the suggestion acquisition unit 24. The predicted words include at least a search keyword, such as a combination of the search keyword and another word, a compound word including the search keyword, and the like.
The search unit 12F selects one of the predicted words output from the suggestion acquisition unit 24, and inputs the selected predicted word to a search engine to search. Then, the search unit 12F sequentially outputs the search results to the main content extraction unit 15.

サジェスト取得部２４は、検索部１２Ｆが出力した予測言葉を取得し、取得した予測言葉を検索部１２Ｆと評価結果生成部２２Ｆに出力する。
メインコンテンツ抽出部１５は、検索部１２Ｆが予測言葉を用いて検索した結果のソースコードからメインコンテンツを抽出し、抽出したメインコンテンツを検索順位取得部２１Ｆに出力する。 The suggestion acquisition unit 24 acquires the predicted words output by the search unit 12F, and outputs the obtained predicted words to the search unit 12F and the evaluation result generation unit 22F.
The main content extraction unit 15 extracts the main content from the source code as a result of the search performed by the search unit 12F using the predicted words, and outputs the extracted main content to the search order obtaining unit 21F.

検索順位取得部２１Ｆは、メインコンテンツ抽出部１５が出力したメインコンテンツと、キーワード入力部１１Ｆが出力した評価サイトの情報を取得する。検索順位取得部２１Ｆは、検索結果における評価サイトの順位を取得し、取得した順位を予測言葉と対応付けて順次、評価結果生成部２２Ｆに出力する。 The search order acquiring unit 21F acquires the main content output by the main content extracting unit 15 and the information of the evaluation site output by the keyword input unit 11F. The search order obtaining unit 21F obtains the order of the evaluation site in the search result, and sequentially outputs the obtained order to the evaluation result generating unit 22F in association with the predicted word.

評価結果生成部２２Ｆは、検索順位取得部２１Ｆが出力した順位と予測言葉を用いて、各予測言葉に対する評価を行う。評価結果生成部２２Ｆは、評価結果に基づいて評価結果を生成し、生成した評価結果を評価結果出力部２３に出力する。または、評価結果生成部２２Ｆは、サジェスト取得部２４が出力した予測言葉を用いて評価結果を生成し、生成した評価結果を評価結果出力部２３に出力する。なお、評価結果については、後述する。 The evaluation result generation unit 22F evaluates each predicted word using the order and the predicted word output by the search order obtaining unit 21F. The evaluation result generation unit 22F generates an evaluation result based on the evaluation result, and outputs the generated evaluation result to the evaluation result output unit 23. Alternatively, the evaluation result generation unit 22F generates an evaluation result using the predicted words output by the suggestion acquisition unit 24, and outputs the generated evaluation result to the evaluation result output unit 23. The evaluation result will be described later.

＜評価処理の手順＞
次に、キーワード抽出装置１Ｆが行う評価処理の手順について説明する。
図３２は、本実施形態に係るキーワード抽出装置１Ｆが行う評価処理のフローチャートである。
（ステップＳ６０１）キーワード入力部１１Ｆは、利用者によって入力された検索キーワードと、評価サイトの情報を取得する。
（ステップＳ６０２）検索部１２Ｆは、キーワード入力部１１Ｆが出力した検索キーワードを検索エンジンに入力する。続けて、サジェスト取得部２４は、提案された予測言葉を取得する。なお、取得する予測言葉の個数は、提示される全てであってもよく、または、予め定められた個数であってもよい。 <Evaluation procedure>
Next, a procedure of an evaluation process performed by the keyword extracting device 1F will be described.
FIG. 32 is a flowchart of an evaluation process performed by the keyword extraction device 1F according to the present embodiment.
(Step S601) The keyword input unit 11F acquires a search keyword input by a user and information on an evaluation site.
(Step S602) The search unit 12F inputs the search keyword output by the keyword input unit 11F to the search engine. Subsequently, the suggestion obtaining unit 24 obtains the proposed predicted words. It should be noted that the number of predicted words to be acquired may be all presented or may be a predetermined number.

（ステップＳ６０３）検索部１２Ｆ、検索順位取得部２１Ｆ、評価結果生成部２２Ｆは、ステップＳ６０４〜ステップＳ６０６の処理を、予測言葉毎に行う。
（ステップＳ６０４）検索部１２Ｆは、サジェスト取得部２４が出力した予測言葉のうちから１つを選択し、選択した予測言葉を検索エンジンに入力して検索する。 (Step S603) The search unit 12F, the search order acquisition unit 21F, and the evaluation result generation unit 22F perform the processing of steps S604 to S606 for each predicted word.
(Step S604) The search unit 12F selects one of the predicted words output from the suggestion acquisition unit 24, and inputs the selected predicted word to a search engine to search.

（ステップＳ６０５）検索順位取得部２１Ｆは、検索部１２Ｆが出力した検索結果において、キーワード入力部１１Ｆが出力した評価サイトの順位を取得する。
（ステップＳ６０６）評価結果生成部２２Ｆは、検索順位取得部２１Ｆが出力した順位と予測言葉を用いて、各予測言葉に対する判定を行う。評価結果生成部２２Ｆは、例えば、順位が１位〜１０位の場合に「独占」であると評価し、順位が１１位以下である場合に「未発掘」であると判定するようにしてもよい。 (Step S605) The search order obtaining unit 21F obtains the order of the evaluation site output by the keyword input unit 11F in the search result output by the search unit 12F.
(Step S606) The evaluation result generation unit 22F makes a determination on each predicted word using the rank and the predicted word output by the search order obtaining unit 21F. For example, the evaluation result generation unit 22F may determine “exclusive” when the rank is 1st to 10th, and “undiscovered” when the rank is 11th or lower.

(Step S607) When the processing of steps S604 to S606 has finished for the predicted words acquired in step S602, the search unit 12F, the search rank acquisition unit 21F, and the evaluation result generation unit 22F proceed to step S608.
(Step S608) The evaluation result generation unit 22F generates the evaluation result from the determinations made for the predicted words. The evaluation result output unit 23 then provides the evaluation result output by the evaluation result generation unit 22F, for example on the Web.
This completes the evaluation process.
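
The flow of steps S601 to S608 can be illustrated with a minimal Python sketch. It is not the implementation described in the patent: the functions `suggest` and `search` are hypothetical callables standing in for the suggestion acquisition unit 24 and the search unit 12F, matching the evaluation site by substring is an assumption, and treating a site that does not appear in the results as “undiscovered” is also an assumption. Only the example thresholds (1st to 10th, 11th or lower) come from step S606.

```python
from typing import Callable, Dict, List, Optional


def rank_of(site: str, results: List[str]) -> Optional[int]:
    """Return the 1-based position of the first result URL containing `site`, or None."""
    for position, url in enumerate(results, start=1):
        if site in url:  # substring match is an assumption of this sketch
            return position
    return None


def judge(rank: Optional[int]) -> str:
    """Example judgment from step S606: ranks 1-10 are "exclusive", 11 or lower "undiscovered"."""
    if rank is not None and rank <= 10:
        return "exclusive"
    return "undiscovered"  # a site missing from the results is also treated this way (assumption)


def evaluate(
    keyword: str,
    site: str,
    suggest: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    max_words: Optional[int] = None,
) -> List[Dict[str, object]]:
    predicted = suggest(keyword)               # S602: acquire predicted words
    if max_words is not None:
        predicted = predicted[:max_words]      # S602: optionally keep a predetermined number
    report = []
    for word in predicted:                     # S603: repeat S604-S606 per predicted word
        results = search(word)                 # S604: search with the predicted word
        rank = rank_of(site, results)          # S605: rank of the evaluation site
        report.append(                         # S606: judgment for this predicted word
            {"predicted_word": word, "rank": rank, "judgment": judge(rank)}
        )
    return report                              # S608: the evaluation result
```

Passing the search and suggestion functions in as parameters keeps the sketch independent of any particular search engine.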

<Example of evaluation result>
Next, an example of an evaluation result will be described.
FIG. 33 is a diagram showing an example of an evaluation result according to the present embodiment.
As shown in FIG. 33, the image g600 showing the evaluation result includes an image g601 of the search keyword, an image g602 showing the evaluation site, and an image g603 showing the search results and evaluation results.
The image g603 showing the search results and evaluation results includes an image g6031 showing the predicted words, an image g6032 showing the determination results, and an image g6033 showing the ranks.
As shown in the image g6032 showing the determination results, advice such as “exclusive”, “coexistence”, “undiscovered”, or “improvement” is displayed according to the rank obtained when the predicted word (image g6031) is input into the search engine and searched. The determination results shown in FIG. 33 are only an example; the evaluation result generation unit 22F may instead assign a determination label to each band of ten ranks, for example 1st to 10th, 11th to 20th, and so on. The evaluation result generation unit 22F may also color-code the words “exclusive”, “coexistence”, “undiscovered”, and “improvement”, color-code the rank text, attach a marker to each word or character, or use different typefaces for the words or characters.
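
Assigning one label per band of ten ranks could look like the following Python sketch. The function name `band_label`, the `fallback` value, and the idea of passing the labels as an ordered list are assumptions made for illustration. The text does not state which label belongs to which band beyond the example in step S606, so the caller supplies that mapping.

```python
from typing import Optional, Sequence


def band_label(
    rank: Optional[int],
    labels: Sequence[str],
    band_size: int = 10,
    fallback: str = "unranked",
) -> str:
    """Map a 1-based rank to the label of its band (1-10, 11-20, ... for band_size=10)."""
    if rank is None or rank < 1:
        return fallback                 # site not found in the results
    index = (rank - 1) // band_size     # ranks 1-10 -> 0, 11-20 -> 1, and so on
    if index < len(labels):
        return labels[index]
    return fallback                     # rank is beyond the last labeled band
```

For example, with `labels=["first band", "second band"]`, a rank of 10 falls in the first band and a rank of 11 falls in the second.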

Using such evaluation results, the operator of the evaluation site builds, for example, its own site so that the evaluation site also ranks highly when a search is performed using the predicted words. This has the effect of increasing the number of accesses to the operator's site.

In the examples shown in FIGS. 32 and 33, the predicted words are searched for with the search engine and the rank of the evaluation site is also obtained, but the present invention is not limited to this. The keyword extraction device 1F may acquire the predicted words corresponding to the search keyword and output the acquired predicted words as the evaluation result.

As described above, the keyword extraction device 1F according to the present embodiment further includes: a suggestion acquisition unit 24 that acquires predicted words based on the search keyword from the results of the search by the search unit 12F; a search rank acquisition unit 21F that selects one of the plurality of predicted words acquired by the suggestion acquisition unit and, from the results of searching for the selected predicted word with the search unit, acquires the search rank for the selected predicted word using the main content extracted by the main content extraction unit 15; and an evaluation result generation unit 22F that evaluates the web page under evaluation based on the rank acquired by the search rank acquisition unit.

With this configuration, the present embodiment evaluates the quality of the evaluation site based on its rank when a search is performed using predicted words, which users of the search engine are considered to enter frequently. As a result, according to the present embodiment, the company's site can be evaluated using not only the search keyword that the operator of the site entered for the evaluation but also search keywords that users actually use.

[Fifth Embodiment]
In the present embodiment, an example will be described in which the content generation device 3 is connected to any one of the keyword extraction devices 1 and 1A to 1F.
FIG. 34 is a configuration diagram showing the content generation system 5 according to the present embodiment.
As shown in FIG. 34, the content generation system 5 includes a keyword extraction device (any one of 1 and 1A to 1F) and the content generation device 3. The content generation system 5 is connected to the network 2.
The following description uses the keyword extraction device 1A as an example.

The keyword list output unit 19 of the keyword extraction device 1A is a communication device.
The keyword extraction device 1A extracts a plurality of keywords based on the input search keyword, sorts the extracted keywords, and outputs the resulting keyword list information to the content generation device 3.

The content generation device 3 includes a content template storage unit 31, a content generation unit 32, and a content output unit 33.
The content template storage unit 31 stores content templates. A content template is, for example, a web page template, a catalog template, a pamphlet template, or an instruction manual template; for example, a template is stored for each product.

The content generation unit 32 generates content using the extracted keywords and a content template stored in the content template storage unit 31, and outputs the generated content to the content output unit 33. Here, content means a web page, a catalog, a pamphlet, an instruction manual, or the like.
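
As a rough illustration of combining a stored template with extracted keywords, the following Python sketch uses the standard library's `string.Template`. The template text, the `product-x` key, the placeholder names `$title` and `$keywords`, and the `max_keywords` limit are all hypothetical; the text does not specify a template format.

```python
from string import Template
from typing import Dict, List

# Hypothetical per-product template store, standing in for the content template storage unit 31.
TEMPLATES: Dict[str, Template] = {
    "product-x": Template("<h1>$title</h1>\n<p>Topics covered: $keywords</p>"),
}


def generate_content(
    product_id: str, title: str, keywords: List[str], max_keywords: int = 5
) -> str:
    """Fill the product's template with the first `max_keywords` entries of the keyword list."""
    template = TEMPLATES[product_id]
    return template.substitute(
        title=title,
        keywords=", ".join(keywords[:max_keywords]),
    )
```

Because the keyword list is assumed to be already sorted, taking the first few entries uses the highest-ranked keywords.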

The content output unit 33 is, for example, at least one of an information providing unit on the Web, a display device, a printer device, and a communication device. The content output unit 33 provides the content output by the content generation unit 32, for example on the Web.

In the present embodiment, an example has been described in which content is generated using a content template and the extracted keywords, but the present invention is not limited to this. The content generation device 3 may generate content from the keywords extracted by the keyword extraction device (any one of 1, 1A, 1B, 1C, and 1D) using, for example, a well-known program that automatically generates text. In this case, the number of keywords used in one sentence may be set in advance. The number of times each keyword is used in the content may also be set in advance based on, for example, its importance or its number of occurrences.
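
One way to set a per-keyword usage count from importance is a proportional split, sketched below in Python. The proportional rule, the `minimum` floor, and the function name `usage_counts` are assumptions for illustration. The text only says that the count may be set based on, for example, importance.

```python
from typing import Dict


def usage_counts(
    importance: Dict[str, float], total_uses: int, minimum: int = 1
) -> Dict[str, int]:
    """Split `total_uses` across keywords in proportion to their importance scores.

    Each keyword gets at least `minimum` uses, so the counts may sum to more
    than `total_uses` when low-importance keywords are rounded up.
    """
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    return {
        keyword: max(minimum, round(total_uses * score / total))
        for keyword, score in importance.items()
    }
```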

As described above, the content generation system 5 of the present embodiment includes a keyword extraction device (any one of 1 and 1A to 1F) and a content generation device 3 that generates predetermined content using the plurality of keywords extracted by the keyword extraction device.

With this configuration, the present embodiment can generate content using the plurality of keywords extracted by the keyword extraction device (any one of 1 and 1A to 1F). As a result, according to the present embodiment, content that uses the information users want to know can be provided.

In the first and second embodiments described above, an example was described in which the keyword extraction device (any one of 1 and 1A to 1F) searches the network 2 for web pages corresponding to the search keyword, but the present invention is not limited to this. For example, web pages corresponding to the search keyword may be searched for on a server (not shown) connected to the keyword extraction device (any one of 1 and 1A to 1F). In this case, the server stores information on a plurality of web pages corresponding to the search keyword.
In the first to fourth embodiments described above, the domain DB 13 (or 13D) and the tag DB 16 (or 16B) may be located on the network 2.

Part or all of the keyword extraction device (any one of 1, 1A, 1B, 1C, 1D, 1E, and 1F) or the content generation device 3 in the embodiments described above may be realized by a computer. In that case, a program for realizing the functions of these devices may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on the recording medium. The “computer system” here is a computer system built into the recognition data transmission device and includes an OS and hardware such as peripheral devices. The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into the computer system. The “computer-readable recording medium” may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period, such as volatile memory inside the computer system that serves as the server or client in that case. The program may realize only some of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

1, 1A, 1B, 1C, 1D, 1E, 1F: keyword extraction device; 2: network; 3: content generation device; 5: content generation system; 11, 11D, 11E, 11F: keyword input unit; 12, 12B, 12D, 12E, 12F: search unit; 13, 13D: domain DB; 14: first noise removal unit; 15: main content extraction unit; 16, 16B: tag DB; 17, 17B, 17E: second noise removal unit; 18, 18D: keyword extraction unit; 19: keyword list output unit; 181: morphological analysis unit; 182: term extraction unit; 183: keyword list generation unit; 20: sentence extraction unit; 21, 21F: search rank acquisition unit; 22, 22F: evaluation result generation unit; 23: evaluation result output unit; 24: suggestion acquisition unit; 31: content template storage unit; 32: content generation unit; 33: content output unit

Claims

A keyword extraction device comprising:
a search unit that searches for a plurality of contents including main content based on a search keyword;
a first noise removal unit that removes, from the plurality of contents retrieved by the search unit, content of a predetermined domain that is not meaningful for keyword extraction;
a main content extraction unit that sequentially selects one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal unit, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content; and
a keyword extraction unit that extracts a plurality of keywords from the text of the main content extracted by the main content extraction unit.

The keyword extraction device according to claim 1, further comprising:
a second noise removal unit that removes unnecessary descriptions that are not meaningful for keyword extraction by removing information described by a predetermined tag from the information of the main content extracted by the main content extraction unit,
wherein the keyword extraction unit extracts keywords from the text of the main content after the information described by the predetermined tag has been removed by the second noise removal unit.

A keyword extraction device comprising:
a search unit that searches for a plurality of contents including main content based on a search keyword;
a first noise removal unit that removes, from the plurality of contents retrieved by the search unit, content of a predetermined domain that is not meaningful for keyword extraction;
a second noise removal unit that sequentially selects one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal unit, and removes unnecessary descriptions that are not meaningful for keyword extraction by removing information described by a predetermined tag from the selected content;
a main content extraction unit that sequentially selects one content from the contents from which the information described by the predetermined tag has been removed by the second noise removal unit, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content; and
a keyword extraction unit that extracts a plurality of keywords from the text of the main content extracted by the main content extraction unit.

The keyword extraction device according to any one of claims 1 to 3, wherein
the search unit searches for content of at least two different predetermined domains based on the search keyword, and
the keyword extraction unit extracts a plurality of keywords from the text of the content of each of the different domains, compares the keywords extracted from the texts of the content of the different domains, and generates a keyword list based on the comparison result.

The keyword extraction device according to any one of claims 1, 2, and 3, further comprising:
a sentence extraction unit that extracts at least one sentence from the main content extracted by the main content extraction unit;
a search rank acquisition unit that acquires the rank obtained when the search unit searches based on the sentence; and
an evaluation result generation unit that evaluates the target web page from which the sentence was extracted, based on the rank acquired by the search rank acquisition unit.

The keyword extraction device according to any one of claims 1, 2, and 3, further comprising:
a suggestion acquisition unit that acquires predicted words based on the search keyword from the results of the search by the search unit;
a search rank acquisition unit that selects one of the plurality of predicted words acquired by the suggestion acquisition unit and, from the results of searching for the selected predicted word with the search unit, acquires the search rank for the selected predicted word using the main content extracted by the main content extraction unit; and
an evaluation result generation unit that evaluates the web page to be evaluated based on the rank acquired by the search rank acquisition unit.

A content generation system comprising:
the keyword extraction device according to any one of claims 1 to 5; and
a content generation device that generates predetermined content using the plurality of keywords extracted by the keyword extraction device.

A keyword extraction method including:
a search procedure in which a search unit searches for a plurality of contents including main content based on a search keyword;
a first noise removal procedure in which a first noise removal unit removes, from the plurality of contents retrieved by the search procedure, content of a predetermined domain that is not meaningful for keyword extraction;
a main content extraction procedure in which a main content extraction unit sequentially selects one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal procedure, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content; and
a keyword extraction procedure of extracting a plurality of keywords from the text of the main content extracted by the main content extraction procedure.

A keyword extraction method including:
a search procedure in which a search unit searches for a plurality of contents including main content based on a search keyword;
a first noise removal procedure in which a first noise removal unit removes, from the plurality of contents retrieved by the search procedure, content of a predetermined domain that is not meaningful for keyword extraction;
a second noise removal procedure in which a second noise removal unit sequentially selects one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal procedure, and removes unnecessary descriptions that are not meaningful for keyword extraction by removing information described by a predetermined tag from the selected content;
a main content extraction procedure in which a main content extraction unit sequentially selects one content from the contents from which the information described by the predetermined tag has been removed by the second noise removal procedure, extracts information indicating a link destination from the selected content, compares the information of the extracted link destination with the information of the selected content, and extracts the main content by removing the similar information from the information of the selected content; and
a keyword extraction procedure in which a keyword extraction unit extracts a plurality of keywords from the text of the main content extracted by the main content extraction procedure.

A program that causes a computer to execute:
a search procedure of searching for a plurality of contents including main content based on a search keyword;
a first noise removal procedure of removing, from the plurality of contents retrieved by the search procedure, content of a predetermined domain that is not meaningful for keyword extraction;
a main content extraction procedure of sequentially selecting one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal procedure, extracting information indicating a link destination from the selected content, comparing the information of the extracted link destination with the information of the selected content, and extracting the main content by removing the similar information from the information of the selected content; and
a keyword extraction procedure of extracting a plurality of keywords from the text of the main content extracted by the main content extraction procedure.

A program that causes a computer to execute:
a search procedure of searching for a plurality of contents including main content based on a search keyword;
a first noise removal procedure of removing, from the plurality of contents retrieved by the search procedure, content of a predetermined domain that is not meaningful for keyword extraction;
a second noise removal procedure of sequentially selecting one content from the plurality of contents from which the content of the predetermined domain has been removed by the first noise removal procedure, and removing unnecessary descriptions that are not meaningful for keyword extraction by removing information described by a predetermined tag from the selected content;
a main content extraction procedure of sequentially selecting one content from the contents from which the information described by the predetermined tag has been removed by the second noise removal procedure, extracting information indicating a link destination from the selected content, comparing the information of the extracted link destination with the information of the selected content, and extracting the main content by removing the similar information from the information of the selected content; and
a keyword extraction procedure of extracting a plurality of keywords from the text of the main content extracted by the main content extraction procedure.