JP4534019B2

JP4534019B2 - Name and keyword grouping method, program, recording medium and apparatus thereof

Info

Publication number: JP4534019B2
Application number: JP2005252731A
Authority: JP
Inventors: 豊松尾; 洋平浅田; 浩一橋田
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2004-08-31
Filing date: 2005-08-31
Publication date: 2010-09-01
Anticipated expiration: 2025-08-31
Also published as: JP2006099753A

Description

The present invention relates to a method and related means for automatically grouping people's names and keywords that represent research fields and the like, in association with each other. Such grouping is useful for finding people related to a given research field, technical area, organization, or interest by using that field as a query. Groups of researchers are categorized autonomously from information on the Web, and the resulting groups make it easier for a user to identify research fields that are appropriate as queries.

The inventors of the present invention have already proposed a technique for automatically extracting human relationships from information on the Web (see Non-Patent Document 1). This technique searches for Web pages containing the names of two target persons, derives the strength of the relationship between them from the number of hits, extracts their relationship on that basis, and draws the result as a network. The network represents human relationships as an undirected graph in which each node represents a person and each edge connecting two nodes represents the relationship between those two people. The label attached to an edge shows what kind of relationship the two have; for researchers, for example, a co-authorship relationship, membership in the same laboratory, presentations at the same workshop, or work on the same project. The shorter the edge, the stronger the relationship.

Such a human relationship network is useful for getting an overview of the relationships within a field, area, or organization. For example, if you want to become acquainted with Mr. A at a certain academic society, you can examine the shortest path connecting you and Mr. A in the network extracted for the society's members, or look up the acquaintances you share with Mr. A. This shows at a glance the relationships with the other members surrounding the two of you, so you can easily find a suitable person who is likely to introduce you to Mr. A.

Non-Patent Document 1: Yutaka Matsuo, Hironori Tomobe, Koichi Hashida and Mitsuru Ishizuka, "Mining Social Network of Conference Participants from the Web," 12th International WWW Conference, poster, 2003
Non-Patent Document 2: Henry Kautz, Bart Selman, and Mehul Shah, "Referral Web: combining social networks and collaborative filtering," Communications of the ACM, vol. 40, no. 3, 1997
Non-Patent Document 3: H. Kautz, B. Selman, and A. Milewski, "Agent amplified communication," Proceedings of the National Conference on Artificial Intelligence, pp. 3-9, 1999
Non-Patent Document 4: H. Kautz, B. Selman, and M. Shah, "The Hidden Web," AI Magazine, 18(2), pp. 27-36, 1997
Non-Patent Document 5: G. Salton and M. J. McGill, "Introduction to Modern Retrieval," McGraw-Hill Book Company, 1983
Non-Patent Document 6: I. S. Dhillon, "Co-clustering documents and words using Bipartite Spectral Graph Partitioning," Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2001

However, when a user wants to find an expert in a particular field, for example, the human relationship network above requires the user to know the expert's name in advance. This is because the nodes are people's names, and the network cannot be searched by anything other than a name.

In this case, it must be possible to search from a field to the people related to it. One way to achieve this is to label each name with the fields related to that person, so that people can be looked up in reverse from the label. For researchers, for example, assigning a research field to each researcher allows the researchers in a field to be retrieved from that field.

However, a person's research field differs depending on the context in which that person is involved. For example, a researcher who is recognized as a "natural language processing" researcher among researchers in "artificial intelligence" may be recognized more specifically as a "syntactic analysis" researcher among researchers in "natural language processing". The research field is thus determined by context.

An important factor, therefore, is the granularity at which research fields should be defined, after surveying all of the target researchers, so that the resulting database is efficient and searches leave nothing out. In the example above, the question is whether defining research fields at the granularity of "natural language processing" or at the finer granularity of "syntactic analysis" allows more efficient searching.

In other words, the granularity of grouping must be decided before grouping is performed.

In view of the circumstances described above, an object of the present invention is to propose a method and related means for grouping names and keywords that can appropriately and automatically group a set of keywords, such as research fields, and a set of names, such as related people, in correspondence with each other.

To solve the above problem, the present invention provides, first, a name and keyword grouping method comprising: a step in which a processing unit accepts input of names and keywords from an input unit; a step in which the processing unit creates a co-occurrence matrix of the input names and keywords; and a step in which the processing unit clusters the names and keywords in the created co-occurrence matrix.

Second, it provides the name and keyword grouping method in which the processing unit creates the co-occurrence matrix using the number of public data items that contain both a name and a keyword.

Third, it provides the name and keyword grouping method in which the processing unit weights the number of public data items.

Fourth, it provides the name and keyword grouping method further comprising: a step in which the processing unit accepts, from the input unit, input of a label representing each group obtained by the clustering; and a step in which the processing unit stores each input label in storage means in association with its group.

Fifth, it provides a name and keyword grouping program for causing a computer to execute the name and keyword grouping method.

Sixth, it provides a computer-readable recording medium on which the name and keyword grouping program is recorded.

Seventh, it provides a name and keyword grouping apparatus comprising: means for inputting names and keywords; means for creating a co-occurrence matrix of the input names and keywords; and means for clustering the names, the keywords, or both in the created co-occurrence matrix.

Eighth, it provides the name and keyword grouping apparatus in which the means for creating a co-occurrence matrix creates the co-occurrence matrix using the number of public data items that contain both a name and a keyword.

Ninth, it provides the name and keyword grouping apparatus in which the number of public data items is a weighted number of public data items.

Tenth, it provides the name and keyword grouping apparatus further comprising: means for inputting a label representing each group obtained by the clustering; and means for storing each input label in association with its group.

According to the name and keyword grouping methods of the first to fourth aspects, names and keywords are grouped by creating their co-occurrence matrix and clustering it. By referring to a keyword group, a label can be assigned to the corresponding group of people.

That is, groups of people can be categorized autonomously from information published in various databases, of which the Web is representative.

As a result, when researchers and research fields are grouped, for example, the group name assigned to each research field group (that is, the research category) can be used as a query to retrieve all researchers belonging to that group at once. Because the retrieved researchers can be presumed to be experts in the field, experts can be retrieved comprehensively from a coarse-grained keyword representing a field, as described above.

For clustering, it is possible to use a technique called co-clustering (see Non-Patent Document 6), which obtains clusters of names (cluster = group; the same applies hereinafter) and clusters of keywords simultaneously, or techniques that obtain name clusters and keyword clusters separately, such as the K-means method, the minimum distance method, and the maximum distance method. In every case, clusters of people and the corresponding clusters of keywords can be obtained. Any clustering technique that makes this possible may be used, without particular limitation.

According to the grouping program and recording medium of the fifth and sixth aspects, a computer program that provides the same effects as the grouping methods of the first to fourth aspects, and a recording medium such as a flexible disk, CD, or DVD on which it is recorded, are realized. According to the grouping apparatus of the seventh to ninth aspects, an apparatus that provides the same effects as the grouping methods of the first to fourth aspects is realized.

Several studies have sought to discover experts using information on the Web. For example, an information retrieval system called Referral Web, which uses a social network, is known (see Non-Patent Documents 2 to 4). Given a user's name, this system displays the social network around that user. Each user has an agent, and the agent finds experts on the social network whose topic words match the user's interests.

Whereas Referral Web focuses on each user individually and extracts that user's topic words, the present invention first takes in the whole set of target people and then characterizes each person within that set on the basis of co-occurrence with keywords. This is an important feature.

That is, the present invention calculates a co-occurrence matrix of names and keywords, and does so on the basis of the number of public data items, such as the number of Web pages, that contain both a name and a keyword. In other words, it defines the strength of the relationship between a name and a keyword on the basis of their co-occurrence in various databases such as the Web. It also weights the co-occurrence matrix, which amounts to representing a name as a weighted vector of keywords. These features are not found in any prior art, and they make it possible to group names and keywords appropriately and automatically in association with each other.

The names targeted by the present invention are not limited to personal names. Various kinds of names, such as company names, organization names, occupation names, product names, names of animals and plants, book titles, song titles, country names, and place names, can be handled, and these names and their related keywords can be grouped in the same way as personal names. Moreover, not only proper nouns but also various common nouns can be treated as names and grouped with keywords, so that any word or expression can be a target.

The specific grouping process is described in more detail below with reference to FIGS. 1 and 2 as appropriate. FIG. 1 is a flowchart of the grouping process, and FIG. 2 is a system configuration diagram of a grouping apparatus that executes the grouping process.

The system configuration of FIG. 2 comprises a display unit (1), an input unit (2), a processing unit (CPU) (3), a storage unit (memory) (4), a communication control unit (5), a matrix database (6), and a bus (7). The storage unit (4) stores the grouping program and various data. The processing unit (3), which is connected to the storage unit (4) by the bus (7), executes the grouping process in accordance with the instructions of the grouping program. The processing unit (3) is also connected by the bus (7) to the display unit (1), such as a display that shows input screens and various data, and to the input unit (2), such as a keyboard and mouse used to enter names, keywords, and the like. The matrix database (6) accumulates the co-occurrence matrices calculated by the processing unit (3).

<Step S1>
First, the processing unit (3) accepts, from the input unit (2), input of the names and keywords to be grouped.

More specifically, to group the researchers belonging to an academic society together with their research topics, for example, the target researchers are selected from a membership list or national convention program published on the Web or elsewhere, and each name is entered. Keywords that suggest research content are selected from published papers and the like, and are entered as well. The processing unit (3) accepts this input from the input unit (2).

Keywords may also be extracted automatically from papers using a known keyword extraction technique. In that case, the processing unit (3) may perform the extraction according to its own keyword extraction program, or it may have a separate keyword extraction engine or the like, accessible via a network (8), perform the extraction on papers published on Web pages and receive the result.

<Step S2>
Next, the processing unit (3) creates a co-occurrence matrix of the input names and keywords.

In information retrieval, an entire document collection is often represented as a document-word matrix. This approach is well known as the vector space model (see Non-Patent Document 5). In the vector space model, content words are extracted as features from the whole document collection, and each document is represented as a weight vector expressing how strongly it is related to each feature. In general, the weight w_ij of word j in document i is calculated by the tf-idf (term frequency-inverse document frequency) method, as in the following equation.
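The equation referred to here (Equation 1) is not reproduced in this text. Assuming the conventional tf-idf definition, which is consistent with the symbols tf_ij, df_j, and N used in this description, it would take the following form. The exact form, including the logarithm base, is an assumption.

```latex
w_{ij} = tf_{ij} \times \log\frac{N}{df_{j}}
```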

Here, tf_ij is the frequency of word j in document i (term frequency), df_j is the number of documents in which word j appears (document frequency), and N is the total number of documents. Qualitatively, the tf factor gives a larger weight to words that appear often in a document, and the idf (inverse document frequency) factor gives a larger weight, within a document, to words that appear only locally in particular documents.

In the same way, consider representing a name as a weight vector of keywords. When the targets are researchers, the fact that a researcher's name frequently co-occurs with a research keyword suggests some relationship between that researcher and that keyword. The strength of the relationship between a researcher and a research keyword is therefore defined on the basis of the co-occurrence of the researcher's name and the keyword on the Web. If co-occurrence is defined as appearing in the same Web page, then an AND search for a researcher name and a research keyword returns, as its hit count, the number of Web pages in which that name and keyword co-occur. Performing this search for every researcher name and research keyword yields a co-occurrence matrix with names as rows and keywords as columns (or, of course, the reverse).
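The construction of the co-occurrence matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function `hit_count` and the small in-memory `PAGES` list are hypothetical stand-ins for an AND search against a real search engine, and the names and keywords are invented.

```python
# Hypothetical stand-in for the Web: each string plays the role of one Web page.
PAGES = [
    "Researcher A publishes on parsing and machine translation",
    "Researcher A and Researcher B discuss parsing",
    "Researcher B presents work on robot vision",
    "A tutorial on robot vision by Researcher B",
]

def hit_count(name, keyword, pages=PAGES):
    """Count pages containing both terms (an AND search; assumed helper)."""
    return sum(1 for page in pages if name in page and keyword in page)

def build_cooccurrence(names, keywords):
    """Return a matrix with one row per name and one column per keyword."""
    return [[hit_count(n, k) for k in keywords] for n in names]

names = ["Researcher A", "Researcher B"]
keywords = ["parsing", "robot vision"]
co = build_cooccurrence(names, keywords)
print(co)
```

In a real system `hit_count` would issue one query per name-keyword pair, so the number of searches grows with the product of the two list lengths.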

If this number of co-occurring Web pages is used directly as the weight, undesirable effects can arise, such as relatively general research keywords receiving large weights.

More specifically:
1. The distribution for a researcher with few co-occurrence hits cannot be measured accurately.
2. For a researcher with many co-occurrence hits, the hit counts for words of little relevance also become relatively large.
3. When a general word is included, the weights in its column become relatively large.

It is therefore preferable to calculate the statistical bias of co-occurrence rather than to use the raw co-occurrence hit count.

For example, by analogy with Equation 1, the weight w_ij of research keyword j for researcher i is defined as in the following equation.
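Equation 2 is likewise not reproduced in this text. Given the stated analogy with tf-idf and the symbols co_ij, pf_j, and N explained in the following paragraph, it would presumably read as follows. This is a reconstruction, not a quotation of the original equation.

```latex
w_{ij} = co_{ij} \times \log\frac{N}{pf_{j}}
```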

Here, co_ij is the number of Web pages in which researcher name i and research keyword j co-occur, pf_j is the number of researchers with whom research keyword j co-occurs (referred to here as the person frequency), and N is the total number of researchers.

Calculating the weights according to Equation 2 represents each researcher as a weight vector of research keywords and yields a researcher name-research keyword co-occurrence matrix.
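The weighting step can be sketched in code. The sketch assumes that Equation 2 has the tf-idf-like form w_ij = co_ij * log(N / pf_j), which is inferred from the symbol definitions above rather than quoted from the source; the function names and sample counts are illustrative.

```python
import math

def person_frequency(co, j):
    """Number of names (rows) whose co-occurrence count with keyword j is nonzero."""
    return sum(1 for row in co if row[j] > 0)

def weight_matrix(co):
    """Apply w_ij = co_ij * log(N / pf_j); an assumed reading of Equation 2."""
    n_people = len(co)
    n_keywords = len(co[0])
    pf = [person_frequency(co, j) for j in range(n_keywords)]
    return [
        [row[j] * math.log(n_people / pf[j]) if pf[j] else 0.0
         for j in range(n_keywords)]
        for row in co
    ]

# Rows are researchers; columns are a general keyword and a specific keyword.
co = [[50, 8], [40, 0], [60, 0]]
w = weight_matrix(co)
print(w)
```

In this example the general keyword co-occurs with every researcher, so its weights fall to zero, while the specific keyword keeps a positive weight for the one researcher it co-occurs with.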

The χ² value is one commonly used statistic for the strength of a bias. It can also be used to express the strength of the co-occurrence bias and thereby weight the research keywords.

The χ² value is formulated as in the following equation.
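The equation is not reproduced in this text. Assuming the usual per-cell form of the χ² statistic, with O_ij the observed co-occurrence count and E_ij the count expected if names and keywords were independent (this definition of E_ij is an assumption), it would be:

```latex
\chi^2_{ij} = \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{\left(\sum_{j'} O_{ij'}\right)\left(\sum_{i'} O_{i'j}\right)}{\sum_{i'}\sum_{j'} O_{i'j'}}
```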

In the above equation, however, the χ² value is positive even when O_ij - E_ij is negative. Because the purpose here is to weight a researcher's name by keywords related to that person's research field, the χ² value is set to 0 when the observed value is smaller than the expected value. That is, it becomes as in the following equation.
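The modified equation is not reproduced in this text, but the rule itself is stated: keep the χ² value where the observed count exceeds the expected count, and use 0 otherwise. A minimal sketch of that rule follows. The per-cell formula and the independence-based expected count are assumptions, and the input is assumed to contain at least one nonzero count.

```python
def chi_square_weights(observed):
    """Per-cell chi-square weight, set to 0 where observed <= expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    weights = []
    for i, row in enumerate(observed):
        out = []
        for j, o in enumerate(row):
            # Expected count under independence (an assumed definition of E_ij).
            e = row_totals[i] * col_totals[j] / grand_total
            out.append((o - e) ** 2 / e if e > 0 and o > e else 0.0)
        weights.append(out)
    return weights

observed = [[30, 10], [10, 30]]
weights = chi_square_weights(observed)
print(weights)
```

Without the clipping, all four cells in this example would receive the same value, and a keyword that a researcher uses less than expected would be weighted as strongly as one used more than expected.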

These weighting methods are examples of ways to measure a statistically significant bias. Various other weighting methods can also be adopted, such as normalizing the matrix elements as probabilities and taking the cosine or mutual information, or measuring the similarity between matrices with the Kullback-Leibler statistic.

In the above processing, the processing unit (3) may perform the search by executing its own search program, or it may send the names and keywords to a separate search engine on the network (8), have it perform the search, and receive the results from the search engine via the network (8).

The Web page hit counts and the co-occurrence matrix are stored, in association with each name and keyword, in storage means such as the storage unit (4) or the separate matrix database (6).

Because many people on the Web share the same family and given names, Web pages about people other than the intended person may be retrieved. Search precision can be improved by adding to the search query, along with the person's name, a word that identifies that person. For example, if the name of an affiliated organization, such as a company, research institution, or university, is used as the identifying word, the search query is the AND of name N and affiliation name A. When a person has multiple affiliations, has changed affiliation, or belongs to an institution with several names or abbreviations, the search query is

name N AND (affiliation name A OR affiliation name B OR affiliation abbreviation C).

The identifying word is of course not limited to a word expressing affiliation information such as an organization name; any word that distinguishes the intended person from others with the same name may be used.

Adding this handling of the same-name problem also makes it possible to learn, for example, what research topics a person worked on at a past affiliation when the affiliation has changed, or how the person divides research topics among multiple affiliations.

A name may also have both a formal form and an abbreviation (this is especially common for the names of organizations such as companies). In that case, search precision can be improved further by the following check: if the words Y1 contained in the documents hit by a search for one name X1, such as the formal name, and the words Y2 contained in the documents hit by a search for another name X2, such as the abbreviation, are close to each other, X1 and X2 are judged to be names of the same organization.

In the description above, Web pages published on the Web are the search target and the number of search hits is called the "number of Web pages". Needless to say, the present invention can also treat various other published data, such as document data, as search targets, and the broader term "number of public data items" can be used.
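The check for formal names and abbreviations described above, which judges X1 and X2 to name the same organization when the words in their hit documents are close, can be sketched as follows. The source does not say how closeness is measured; the Jaccard coefficient and the 0.5 threshold used here are assumptions, as are the sample word lists.

```python
def jaccard(words_a, words_b):
    """Jaccard coefficient of two word collections (0.0 when both are empty)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def same_entity(words_y1, words_y2, threshold=0.5):
    """Judge two names to denote the same entity if their context words are close."""
    return jaccard(words_y1, words_y2) >= threshold

y1 = ["robotics", "institute", "tsukuba", "research"]  # words from hits for X1
y2 = ["robotics", "institute", "research", "grants"]   # words from hits for X2
y3 = ["bakery", "recipes", "bread"]                    # words from an unrelated name
print(same_entity(y1, y2), same_entity(y1, y3))
```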

Public data is not limited to data that is open to the general public. A body of data stored in a specific database (for example, a database that is not open to the public but is accessible only within an organization) can also be a search target.

<Step S3>
Next, the processing unit (3) clusters this co-occurrence matrix.

The case in which names and keywords are clustered simultaneously is described here. Co-clustering (see Non-Patent Document 6) can be used as the clustering technique in this case. Co-clustering is a clustering algorithm that obtains clusters of documents and clusters of words simultaneously. Once the number k of clusters to be found is fixed, it automatically produces clusters of documents and the corresponding clusters of words (and vice versa) without prior knowledge of what clusters exist.

Consider representing the document-word matrix as a graph whose vertices are documents and words, with an edge between a document and a word whenever the weight of that word in that document is nonzero. Because there are no edges between documents or between words, the graph is bipartite. The basic idea of co-clustering is to find a partition of the vertices (V_1, V_2, ..., V_k) that minimizes the sum of the weights E of the edges that cross between clusters (called the cut). This is formulated as in the following equation.
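The equation is not reproduced in this text. A form consistent with the description, writing E_uv for the weight of the edge between vertices u and v (this notation is an assumption), would be:

```latex
\mathrm{cut}(V_1, \ldots, V_k) = \sum_{a < b} \; \sum_{u \in V_a,\; v \in V_b} E_{uv},
\qquad
\min_{V_1, \ldots, V_k} \; \mathrm{cut}(V_1, \ldots, V_k)
```

Minimizing the raw cut alone is satisfied trivially by placing every vertex in one cluster, so practical formulations such as the one in Non-Patent Document 6 minimize a normalized variant of the cut.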

If "document" is read as "name" and "word" as "keyword" in the discussion above, co-clustering can also be applied to the co-occurrence matrix. As a result, clusters of researchers and the corresponding clusters of research keywords can be obtained simultaneously without knowing in advance what research fields exist. Note, however, that in co-clustering a vertex never belongs to more than one cluster, so each researcher is assigned to exactly one research field.
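The cut objective for a name-keyword matrix can be evaluated with a few lines of code. The sketch below only scores a given joint partition; it does not perform the co-clustering optimization itself. The function name, the label encoding (one cluster index per name and per keyword), and the sample weights are assumptions made for illustration.

```python
def cut_weight(matrix, row_labels, col_labels):
    """Sum the weights of edges whose name and keyword lie in different clusters."""
    return sum(
        weight
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if row_labels[i] != col_labels[j]
    )

# Rows: three names; columns: three keywords (illustrative weights).
matrix = [
    [5.0, 4.0, 0.0],
    [4.0, 6.0, 1.0],
    [0.0, 1.0, 7.0],
]
good = cut_weight(matrix, row_labels=[0, 0, 1], col_labels=[0, 0, 1])
bad = cut_weight(matrix, row_labels=[0, 1, 1], col_labels=[0, 1, 1])
print(good, bad)
```

Because each name and each keyword carries exactly one label, the encoding reflects the property noted above that a researcher is assigned to a single cluster. The partition that keeps strongly co-occurring names and keywords together has the smaller cut.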

As illustrated in FIG. 3, this yields clusters of names (name groups) and the corresponding clusters of keywords (keyword groups).

<Step S4>
After that, a group-name label suited to describing its content is attached to each keyword group, and the label is stored in a database in association with the name group and the keyword group. The label can then be used as a query to search exhaustively for the people belonging to the keyword group bearing that label.
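The association between labels, name groups, and keyword groups can be sketched as a simple in-memory index. This is only an illustration under assumed names (`Group`, `build_label_index`, `search_by_label`) and made-up sample data; the source does not specify how the database is organized.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Group:
    names: List[str]      # name group
    keywords: List[str]   # corresponding keyword group


def build_label_index(groups: Dict[int, Group], labels: Dict[int, str]) -> Dict[str, Group]:
    """Associate each label with the name group and keyword group of its cluster."""
    return {labels[cid]: group for cid, group in groups.items() if cid in labels}


def search_by_label(index: Dict[str, Group], label: str) -> List[str]:
    """Return every name in the group carrying the given label, or [] if unknown."""
    group = index.get(label)
    return list(group.names) if group is not None else []


# Hypothetical clustering result and labels.
groups = {
    0: Group(names=["Researcher A", "Researcher B"], keywords=["parsing", "corpus"]),
    1: Group(names=["Researcher C"], keywords=["sensor", "wearable"]),
}
labels = {0: "natural language processing", 1: "ubiquitous"}

index = build_label_index(groups, labels)
hits = search_by_label(index, "natural language processing")
```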

In terms of the earlier example, looking at the keyword groups produced by co-clustering makes it possible to judge whether a group should be labeled "natural language processing" or, more specifically, "syntactic analysis".

The labels entered from the input unit (2) may be stored, in association with each calculated cluster, in a storage means such as the storage unit (4) or a separate matrix database (6). The processing unit (3) can also extract an appropriate label automatically, for example from the degree of association (such as a co-occurrence relationship) between existing research-field labels and a research group.

<Separate clustering of names and keywords in the co-occurrence matrix>
Steps S3 to S4 above described co-clustering, which obtains clusters of names (here called clusters of people) and clusters of keywords at the same time, as the clustering applied to the co-occurrence matrix. However, any other clustering method gives the same effect as long as it yields clusters of people and the corresponding clusters of keywords.

For example, people alone can be clustered with a general clustering method such as the K-means method, the minimum distance method, or the maximum distance method. In that case, to find the keyword cluster corresponding to a cluster of people, one can, for example, compute the centroid vector of the cluster of people (expressed as a vector of keyword weights) and group the keywords with large weights in that vector into a cluster. As with co-clustering, this yields clusters of people and the corresponding clusters of keywords.
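The centroid step can be sketched as follows, assuming the clusters of people have already been obtained by some method such as K-means. The function names, the cutoff `top_n`, and the sample weights are assumptions for illustration; the source does not state how many keywords count as having a large weight.

```python
from typing import Dict, List


def centroid(rows: List[List[float]]) -> List[float]:
    """Component-wise mean of the keyword-weight vectors of the cluster members."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def keywords_for_cluster(
    matrix: Dict[str, List[float]],
    keywords: List[str],
    members: List[str],
    top_n: int = 2,
) -> List[str]:
    """Pick the top_n keywords with the largest weights in the cluster centroid."""
    c = centroid([matrix[m] for m in members])
    ranked = sorted(range(len(keywords)), key=lambda j: c[j], reverse=True)
    return [keywords[j] for j in ranked[:top_n]]


# Hypothetical co-occurrence weights: one row per person, one column per keyword.
keywords = ["parsing", "corpus", "sensor", "wearable"]
matrix = {
    "A": [4.0, 2.0, 0.0, 0.0],
    "B": [2.0, 3.0, 0.0, 1.0],
    "C": [0.0, 0.0, 5.0, 3.0],
}
person_clusters = {0: ["A", "B"], 1: ["C"]}  # assumed output of a prior clustering step

keyword_clusters = {
    cid: keywords_for_cluster(matrix, keywords, members)
    for cid, members in person_clusters.items()
}
```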

Similarly, clusters of keywords and the corresponding clusters of people can be obtained by first clustering the keywords alone and then grouping into a cluster the people with large weights in the centroid vector of each keyword cluster (expressed as a vector of person weights).

Grouping was actually carried out for the participants of a research meeting held by an academic society, and the results are described here.

For the names, the first 200 participants in Japanese syllabary (a-i-u-e-o) order were used.

For the keywords, nouns were extracted by morphological analysis from the titles of all published papers, and the tfidf value of each noun was computed. The nouns were sorted in descending order of that value; words that were too general and, conversely, words that were too specific in meaning (such as the names of particular systems) were removed; and the remaining 170 words were used.
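The ranking and filtering described above can be sketched as follows. The source does not give the exact tfidf definition, so this sketch assumes a common one (total term frequency multiplied by log(N / document frequency)) and assumes the titles are already split into nouns; the function names, sample titles, and the excluded-word set are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_by_tfidf(titles: List[List[str]]) -> List[Tuple[str, float]]:
    """Score each noun by tf * log(N / df) and sort in descending order of score."""
    n = len(titles)
    tf = Counter(term for title in titles for term in title)
    df = Counter(term for title in titles for term in set(title))
    scores = {term: tf[term] * math.log(n / df[term]) for term in tf}
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def select_keywords(ranked: Iterable[Tuple[str, float]], excluded: Set[str], limit: int) -> List[str]:
    """Drop words judged too general or too specific, then keep the first `limit` words."""
    kept = [term for term, _ in ranked if term not in excluded]
    return kept[:limit]


# Hypothetical noun lists extracted from four paper titles.
titles = [
    ["agent", "parsing"],
    ["agent", "parsing", "corpus"],
    ["agent", "sensor"],
    ["agent", "parsing", "parsing"],
]
ranked = rank_by_tfidf(titles)
selected = select_keywords(ranked, excluded={"agent"}, limit=2)
```

A noun that appears in every title, such as "agent" in this sample, scores zero under this definition and falls to the bottom of the ranking.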

The co-occurrence of these names and keywords on the Web was examined with a publicly known search engine, the weights were calculated by Equation 2 above, and a 200 × 170 co-occurrence matrix was created. Co-clustering was then applied to this co-occurrence matrix to partition it into 10 clusters.
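Building the matrix can be sketched as below. The callable `hit_count` stands in for a search-engine query that returns the number of pages containing both terms, and `weight` stands in for the weighting formula, which is not reproduced in this section; both are placeholders assumed for illustration, and the default weight simply passes the raw count through.

```python
from typing import Callable, List


def build_cooccurrence_matrix(
    names: List[str],
    keywords: List[str],
    hit_count: Callable[[str, str], int],
    weight: Callable[[int], float] = float,
) -> List[List[float]]:
    """Return a len(names) x len(keywords) matrix of weighted co-occurrence counts."""
    return [[weight(hit_count(name, kw)) for kw in keywords] for name in names]


# Hypothetical hit counts standing in for real search results.
fake_hits = {
    ("Researcher A", "parsing"): 12,
    ("Researcher B", "parsing"): 3,
    ("Researcher B", "sensor"): 7,
}


def fake_hit_count(name: str, keyword: str) -> int:
    """Look up the number of pages containing both terms; 0 if none recorded."""
    return fake_hits.get((name, keyword), 0)


names = ["Researcher A", "Researcher B"]
keywords = ["parsing", "sensor"]
cooc = build_cooccurrence_matrix(names, keywords, fake_hit_count)
```

With 200 names and 170 keywords, the same call would produce the 200 × 170 matrix described above.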

Table 2 shows the results: the top 10 keywords and the number of people (that is, the number of names) in each of groups #1 to #10. The top 10 keywords are the 10 research keywords with the largest sums of weights within each cluster. For clusters with fewer than 10 research keywords, all keywords are shown.
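The selection of top keywords for one cluster can be sketched as follows, reading "sum of weights within each cluster" as the sum over the names in the cluster of each keyword's weight; that reading, the function name, and the sample matrix are assumptions made for illustration.

```python
from typing import List


def top_keywords(
    matrix: List[List[float]],
    keywords: List[str],
    name_rows: List[int],
    keyword_cols: List[int],
    top_n: int = 10,
) -> List[str]:
    """Rank a cluster's keywords by their summed weight over the cluster's names."""
    totals = {j: sum(matrix[i][j] for i in name_rows) for j in keyword_cols}
    ranked = sorted(totals, key=lambda j: (-totals[j], j))
    # Slicing returns all keywords when the cluster has fewer than top_n.
    return [keywords[j] for j in ranked[:top_n]]


# Hypothetical weights: rows are names, columns are keywords.
keywords = ["parsing", "corpus", "sensor", "grammar"]
matrix = [
    [4.0, 2.0, 0.0, 1.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
]

# One cluster containing names 0 and 1 and keywords 0, 1 and 3.
top = top_keywords(matrix, keywords, name_rows=[0, 1], keyword_cols=[0, 1, 3])
top2 = top_keywords(matrix, keywords, name_rows=[0, 1], keyword_cols=[0, 1, 3], top_n=2)
```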

From Table 2, for example, the research field of cluster 3 can be labeled "ubiquitous" and that of cluster 4 "cognitive science", and the other clusters can likewise be given suitable labels that describe their content. The result also clearly shows a characteristic of co-clustering: rather than assigning each researcher to a research field, it gathers researchers with similar keyword co-occurrence patterns into one cluster, and as a result each cluster represents a research field.

If, for example, cluster 3 is labeled "ubiquitous", the research field of each member of the researcher cluster (21 people) corresponding to this keyword cluster is "ubiquitous". Labels that appropriately describe the content are assigned to all clusters in the same way. After that, searching with this label, that is, the research category, as the query returns the names of the researchers belonging to the corresponding cluster.

The examples above concern researchers and research fields, but the present invention is of course not limited to these and can be applied to any people, fields, domains, organizations, and so on. For example, taking writers and their related keywords as the target, the group of people corresponding to a keyword group such as "trick, shocking ending, orthodox detective fiction" can be labeled "mystery writer", and the group of people corresponding to a keyword group such as "Edo, Bakumatsu, Shinsengumi" can be labeled "historical novelist".

As described earlier, the present invention can handle many kinds of names besides personal names. For companies, for example, company names can serve as the names and words expressing the type of business, developed technologies, and the like as the keywords; for products, product names can serve as the names and words expressing the product type, functions, and the like as the keywords.

As described in detail above, the present invention realizes a computer program, a recording medium for it, a computer apparatus, and the like that can autonomously categorize groups of anything, such as people, companies, products, and animals and plants, from information published in various databases, of which the Web is representative, by creating a co-occurrence matrix and clustering it.

A processing flow diagram for explaining grouping according to the present invention.
A system configuration diagram of a grouping device (also called a cluster creation device) according to an embodiment of the present invention.
A diagram for explaining the clustering of names and keywords by co-clustering.

Explanation of symbols

1 Display unit
2 Input unit
3 Processing unit
4 Storage unit
5 Communication control unit
6 Matrix database
7 Bus
8 Network

Claims

A method for grouping names and keywords, comprising:
a step in which a processing unit receives, from an input unit, an input of a plurality of names i and keywords j to be grouped and stores them in a storage unit;
a step in which the processing unit creates a co-occurrence matrix made of, for each stored name i and keyword j, the number of public data in which that name i and keyword j co-occur, obtained by an AND search on the name i and keyword j, and stores the co-occurrence matrix in the storage unit; and
a step in which the processing unit clusters the names i and keywords j in the stored co-occurrence matrix by calculating a partition (V₁, V₂, ..., V_k) of the vertices (names i and keywords j) that minimizes the cut, that is, the sum of the weights E of the edges crossing clusters.

The processing unit uses each stored name i and keyword j and the number of public data,

The name and keyword grouping method according to claim 1, wherein weighting is performed on the number of public data by calculating.

The name and keyword grouping method according to claim 1, wherein weighting is performed on the number of public data by calculating.

The name and keyword grouping method according to claim 1, further comprising: a step in which the processing unit receives, from an input unit, an input of a label representing each group obtained by the clustering; and a step in which the processing unit stores the input label in a storage unit in association with each group.

5. A name and keyword grouping program for causing a computer to execute the name and keyword grouping method according to claim 1.

An apparatus for grouping names and keywords, comprising:
means for inputting a plurality of names i and keywords j to be grouped;
means for storing each input name i and keyword j;
means for creating a co-occurrence matrix made of, for each stored name i and keyword j, the number of public data in which that name i and keyword j co-occur, obtained by an AND search on the name i and keyword j;
means for storing the number of public data and the co-occurrence matrix; and
means for clustering the names i and keywords j in the stored co-occurrence matrix by calculating a partition (V₁, V₂, ..., V_k) of the vertices (names i and keywords j) that minimizes the cut, that is, the sum of the weights E of the edges crossing clusters.

Using each stored name i and keyword j and the number of public data,

7. The name and keyword grouping apparatus according to claim 6, further comprising means for weighting the number of public data by calculating.

Using each stored name i and keyword j and the number of public data,

7. The name and keyword grouping apparatus according to claim 6, further comprising means for weighting the number of public data by calculating.

The name and keyword grouping apparatus according to any one of claims 6 to 8, further comprising: means for inputting a label representing each group obtained by the clustering; and means for storing the input label in association with each group.