JP2011103059A

JP2011103059A - Technical term extraction device and program

Info

Publication number: JP2011103059A
Application number: JP2009257660A
Authority: JP
Inventors: Hiroyuki Onuma; 宏行大沼; Shuhei Gokouchi; 脩平後河内
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2009-11-11
Filing date: 2009-11-11
Publication date: 2011-05-26

Abstract

PROBLEM TO BE SOLVED: To extract a keyword that features an individual by excluding a general term, on the basis of the speech of a member in a community.

SOLUTION: A technical term extraction device 1 is provided with: a morphological analytic part 20 for morphologically analyzing a document input in response to the operation of a contributor; a deviation score calculation part 30 for calculating deviation scores between words included in the document, between the word and the contributor, and between the word and a contribution destination group to which the contributor belongs; a general term extraction part 40 for extracting a general term included in the document according to the value of the deviation score; and an index extraction part 50 for extracting a keyword showing individual features by excluding the general term extracted by the general term extraction part 40 from the document.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a technical term extraction device and a program.

In recent years, a wide range of topics has been discussed in communities on social network services and newsgroups, with active communication among participants. Experts in a particular field sometimes take part, and knowing which field each individual is familiar with is important for smooth communication. Social network services often disclose the communities to which a user belongs. However, community membership alone is not sufficient; it is desirable to know which topics each individual is actually interested in.

One function for finding out which field an individual is familiar with is the Know-who function (expert search function). There are two approaches to realizing it. One is to judge, among a group of documents retrieved with a specific keyword, the person with the highest weight as an author of those documents to be the expert. The other is to explicitly extract keywords that characterize an individual.

For example, Patent Document 1 discloses a technical term extraction device that extracts technical terms from a set of categorized documents. It treats a document whose content has been annotated with a department name, a person's name, an e-mail address, or the like as a categorized document, and discloses a method of extracting terms closely related to each category. In particular, it discloses a method of extracting technical terms from a document set in which multiple categories are assigned to a document.

When multiple categories are assigned to one document, the bias of the words appearing in each category decreases, so a term may be missed when, for example, only words scoring at or above a fixed threshold are judged to be technical terms. Patent Document 1 prevents such omissions.

Patent Document 1: JP 2007-79948 A

However, Patent Document 1 does not consider the case where words that are unlikely to be technical terms are ranked highly. Here, a word that is unlikely to be a technical term is a word whose score indicating technical-term likeness is at or above the threshold for multiple categories. Patent Document 1 also has the problem that it does not take into account differences specific to the nature of a category, such as whether the category is a person's name or an organization.

The present invention has been made in view of the above problems, and its object is to provide a new and improved technical term extraction device and program capable of extracting keywords that characterize an individual, with general terms excluded, on the basis of the statements of members in a community.

To solve the above problems, according to one aspect of the present invention, there is provided a technical term extraction device comprising: a morphological analysis unit that performs morphological analysis on a document input in response to a poster's operation; a bias score calculation unit that calculates bias scores between words contained in the document, between a word and the poster, and between a word and the posting destination group to which the poster belongs; a general term extraction unit that extracts general terms contained in the document according to the values of the bias scores; and an index extraction unit that extracts keywords indicating the characteristics of an individual by excluding, from the document, the general terms extracted by the general term extraction unit.

The technical term extraction device may further include a storage unit that stores, in association with one another, a document input in response to a poster's operation, the poster, and the posting destination group to which the poster belongs.

The bias score calculation unit may calculate the bias score as a chi-square value.

The general term extraction unit may extract, as a general term, a word that, among the combinations of a poster and a word or of a posting destination group to which the poster belongs and a word, has a bias score at or below a predetermined value and is associated with multiple posters or multiple posting destination groups.
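The rule above can be sketched as follows. This is only an illustrative reading, not the patent's implementation: the function name `extract_general_terms`, the dictionary layout, the threshold value, and the interpretation of "associated with multiple posters or groups" as "low-scoring for at least two distinct categories" are all assumptions.

```python
from collections import defaultdict


def extract_general_terms(scores, threshold, min_categories=2):
    """Return words treated as general terms.

    scores maps (category_id, word) -> bias score, where category_id is a
    poster ID or a posting destination group ID (hypothetical layout).
    A word is treated as general when its score is at or below `threshold`
    for at least `min_categories` distinct categories.
    """
    low_bias_categories = defaultdict(set)
    for (category_id, word), score in scores.items():
        if score <= threshold:
            low_bias_categories[word].add(category_id)
    return {
        word
        for word, categories in low_bias_categories.items()
        if len(categories) >= min_categories
    }


# Made-up scores for three posters (m1, m2, m3).
scores = {
    ("m1", "feature"): 0.4,
    ("m2", "feature"): 0.7,
    ("m3", "feature"): 0.2,
    ("m1", "compiler"): 9.1,
    ("m2", "compiler"): 0.3,
}
general = extract_general_terms(scores, threshold=1.0)
```

Here "feature" is weakly associated with several posters and is therefore flagged, while "compiler" is low-scoring for only one poster and is kept.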

The bias score calculation unit may recalculate the bias scores with the words extracted as general terms excluded, and the general term extraction unit may then again extract, as a general term, a word that, among the combinations of a poster and a word or of a posting destination group to which the poster belongs and a word, has a recalculated bias score at or below a predetermined value and is associated with multiple posters or multiple posting destination groups.

The index extraction unit may extract, as a keyword indicating the characteristics of an individual, a word whose bias score between the poster and the word is at or above a predetermined value.
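A minimal sketch of this thresholding step, assuming the poster-to-word bias scores are held in a dictionary and the general terms are already known. The function name `extract_keywords`, the data layout, and the numeric values are hypothetical.

```python
def extract_keywords(poster_word_scores, general_terms, threshold):
    """Group high-scoring, non-general words by poster.

    poster_word_scores maps (poster_id, word) -> bias score.
    Returns {poster_id: [(word, score), ...]} sorted by descending score.
    """
    keywords = {}
    for (poster_id, word), score in poster_word_scores.items():
        if word in general_terms or score < threshold:
            continue  # skip general terms and weakly associated words
        keywords.setdefault(poster_id, []).append((word, score))
    for entries in keywords.values():
        entries.sort(key=lambda pair: pair[1], reverse=True)
    return keywords


# Made-up scores: "feature" is general, "meeting" is below the threshold.
poster_word_scores = {
    ("m1", "C++"): 12.0,
    ("m1", "feature"): 15.0,
    ("m1", "meeting"): 1.0,
    ("m2", "Java"): 8.5,
}
keywords = extract_keywords(poster_word_scores, {"feature"}, threshold=5.0)
```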

The index extraction unit may also extract words indicating the characteristics of a posting destination group and, for a posting destination group whose bias score between the poster and that group is at or above a predetermined value, extract the words indicating the characteristics of that group as keywords indicating the characteristics of the individual.

Among the extracted words indicating the characteristics of the posting destination group, the index extraction unit may extract a word as a keyword indicating the characteristics of the individual when the period during which documents containing that word were posted corresponds to the period during which the poster posted in the posting destination group.
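The text does not define when two periods "correspond". One simple reading, assumed here purely for illustration, is that the two time intervals overlap. The function name `periods_overlap` and the sample dates are hypothetical.

```python
from datetime import datetime


def periods_overlap(word_period, poster_period):
    """Return True when two closed (start, end) intervals share any instant.

    word_period:   when documents containing the group's word were posted.
    poster_period: when the poster was posting in that group.
    Treating "corresponds" as "overlaps" is an assumption of this sketch.
    """
    word_start, word_end = word_period
    poster_start, poster_end = poster_period
    return word_start <= poster_end and poster_start <= word_end


word_period = (datetime(2008, 12, 1), datetime(2008, 12, 31))
active_poster = (datetime(2008, 12, 11), datetime(2009, 1, 15))
late_poster = (datetime(2009, 2, 1), datetime(2009, 3, 1))

overlaps_active = periods_overlap(word_period, active_poster)
overlaps_late = periods_overlap(word_period, late_poster)
```

Under this reading, a group word is attributed only to posters who were active while the word was in use, so a member who joined after a topic died out does not inherit its vocabulary.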

To solve the above problems, according to another aspect of the present invention, there is provided a program for causing a computer to function as a technical term extraction device comprising: a morphological analysis unit that performs morphological analysis on a document input in response to a poster's operation; a bias score calculation unit that calculates bias scores between words contained in the document, between a word and the poster, and between a word and the posting destination group to which the poster belongs; a general term extraction unit that extracts general terms contained in the document according to the values of the bias scores; and an index extraction unit that extracts keywords indicating the characteristics of an individual by excluding, from the document, the general terms extracted by the general term extraction unit.

As described above, according to the present invention, keywords that characterize an individual can be extracted, with general terms excluded, on the basis of the statements of members in a community.

FIG. 1 is a block diagram showing the functional configuration of the technical term extraction device according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the contents of the categorized document storage unit according to the embodiment.
FIG. 3 is an explanatory diagram of the contents of the morpheme temporary storage unit according to the embodiment.
FIG. 4 is an explanatory diagram of the contents of the general term temporary storage unit according to the embodiment.
FIG. 5 is an explanatory diagram of the contents of the index storage unit according to the embodiment.
FIG. 6 is an explanatory diagram of an example of the index confirmation screen according to the embodiment.
FIG. 7 is a flowchart showing the details of the technical term extraction process according to the embodiment.
FIG. 8 is a flowchart showing the details of the process of calculating the appearance count of each combination according to the embodiment.
FIG. 9 is an explanatory diagram of the contents of the co-occurrence list temporary storage unit according to the embodiment.
FIG. 10 is an explanatory diagram of the contents of the appearance count temporary storage unit according to the embodiment.
FIG. 11 is an explanatory diagram of the contents of the bias score temporary storage unit according to the embodiment.
FIG. 12 is a flowchart showing the details of the process of extracting the technical terms of a community according to the second embodiment of the present invention.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

The embodiments are described in the following order.
[1] Purpose of this embodiment
[2] First embodiment
[2-1] Functional configuration of the technical term extraction device
[2-2] Details of the technical term extraction process
[3] Second embodiment
[3-1] Functional configuration of the technical term extraction device
[3-2] Details of the technical term extraction process

[1] Purpose of this embodiment
First, the purpose of the embodiment of the present invention is described. In recent years, a wide range of topics has been discussed in communities on social network services and newsgroups, with active communication among participants. Experts in a particular field sometimes take part, and knowing which field each individual is familiar with is important for smooth communication. Social network services often disclose the communities to which a user belongs. However, community membership alone is not sufficient; it is desirable to know which topics each individual is actually interested in.

For example, a technical term extraction device that extracts technical terms from a set of categorized documents has been disclosed. That device treats a document whose content has been annotated with a department name, a person's name, an e-mail address, or the like as a categorized document, and extracts terms closely related to each category. In particular, a method of extracting technical terms from a document set in which multiple categories are assigned to a document has been disclosed.

When multiple categories are assigned to one document, the bias of the words appearing in each category decreases, so a term may be missed when, for example, only words scoring at or above a fixed threshold are judged to be technical terms; the above device prevents such omissions. However, it does not consider the case where words that are unlikely to be technical terms are ranked highly. Here, a word that is unlikely to be a technical term is a word whose score indicating technical-term likeness is at or above the threshold for multiple categories. The above device also has the problem that it does not take into account differences specific to the nature of a category, such as whether the category is a person's name or an organization.

In view of these circumstances, the technical term extraction device 1 according to the embodiment of the present invention was created. The technical term extraction device 1 makes it possible to extract keywords that characterize an individual, with general terms excluded, on the basis of the statements of members in a community. In this embodiment, a community means a posting destination group to which a poster belongs. For example, in social networks and bulletin boards, members belong to communities, so the technical terms of a community can be said to indicate the specialties of its members. Accordingly, the device takes into account the relationships between a community and the words contained in members' statements, between members and words, and between words, and expands the keywords that characterize an individual while excluding general terms.

[2] First embodiment
The purpose of the embodiment of the present invention has been described above. Next, the functional configuration of the technical term extraction device 1 according to this embodiment is described with reference to FIG. 1, with reference to FIGS. 2 to 9 as appropriate. The technical term extraction device 1 may be, for example, a computer such as a personal computer (notebook or desktop), but is not limited to this example and may be a mobile phone, a PDA (Personal Digital Assistant), or the like.

[2-1] Functional configuration of the technical term extraction device
Before describing the functional configuration of FIG. 1, an example of the hardware configuration of the technical term extraction device 1 is described. The technical term extraction device 1 includes, for example, a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), an input device, an output device, and a storage device (HDD).

The CPU functions as an arithmetic processing device and a control device, and controls the overall operation of the technical term extraction device 1 according to various programs. The CPU may be a microprocessor. The ROM stores programs, calculation parameters, and the like used by the CPU. The RAM temporarily stores programs used in execution by the CPU and parameters that change as appropriate during that execution. These components are interconnected by a host bus composed of a CPU bus or the like.

The input device includes, for example, input means with which the user enters information, such as a mouse, keyboard, touch panel, buttons, microphone, switches, and levers, and an input control circuit that generates an input signal based on the user's input and outputs it to the CPU.

The output device includes, for example, display devices such as a CRT (Cathode Ray Tube) display device, a liquid crystal display (LCD) device, an OLED (organic light-emitting diode) display device, and lamps, and audio output devices such as speakers and headphones.

The storage device can include a storage medium, a recording device that records data on the storage medium, a reading device that reads data from the storage medium, a deletion device that deletes data recorded on the storage medium, and the like. The storage device is composed of, for example, an HDD (Hard Disk Drive). It drives a hard disk and stores programs executed by the CPU and various data.

The hardware configuration of the technical term extraction device 1 has been described above. Next, the functional configuration of the technical term extraction device 1 is described with reference to FIG. 1. As shown in FIG. 1, the technical term extraction device 1 includes an input unit 10, a morphological analysis unit 20, a bias score calculation unit 30, a general term extraction unit 40, an index extraction unit 50, a confirmation display unit 60, a categorized document storage unit 70, a morpheme temporary storage unit 80, a co-occurrence list temporary storage unit 90, an appearance count temporary storage unit 100, a bias score temporary storage unit 110, a general term temporary storage unit 120, an index storage unit 130, and the like.

The input unit 10 is composed of the input device described above. By operating the input unit 10, the user of the technical term extraction device 1 can input various data to the device and instruct it to perform processing. Specifically, the input unit 10 accepts, in response to a user operation, a request to extract keywords that characterize an individual.

When the morphological analysis unit 20 receives a technical term extraction request from the user via the input unit 10, it performs morphological analysis on the text information stored in the categorized document storage unit 70. Categories include the author who posted a document and the community or topic to which the document was posted. The contents of the categorized document storage unit 70 are described here with reference to FIG. 2, which is an explanatory diagram of those contents.

As shown in FIG. 2, the categorized document storage unit 70 holds the following items: document ID, community ID, topic ID, poster ID, posting time, and posted content. Posts to newsgroups and social network systems are made to topics within a community. A topic corresponds to an individual subject posted in a community (or newsgroup), and multiple comments can be posted on one topic. Each topic is identified by a topic ID, and each post is identified by a document ID.

The poster ID item stores the ID that identifies the poster, and the posting time item stores the time at which the post was made. The posted content item stores the posted text. For example, record R71 (document ID = d1) in FIG. 2 indicates that poster m1 posted the content "(Special Feature) Programming language.." to topic t1 at 12:00 on December 11, 2008.
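The record layout described above can be pictured as a small data class. This is a sketch only: the class name `PostRecord` and the field names are hypothetical, and the community ID "c1" is an assumed placeholder because the text does not state the community of record R71.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PostRecord:
    """One row of the categorized document storage (hypothetical names)."""
    document_id: str   # identifies the individual post
    community_id: str  # posting destination group
    topic_id: str      # topic within the community
    poster_id: str     # who posted
    posted_at: datetime
    content: str       # posted text


# Record R71 as described in the text; "c1" is an assumed placeholder.
r71 = PostRecord(
    document_id="d1",
    community_id="c1",
    topic_id="t1",
    poster_id="m1",
    posted_at=datetime(2008, 12, 11, 12, 0),
    content="(Special Feature) Programming language..",
)
```

The community ID, topic ID, and poster ID act as the categories attached to each document, which is what later allows bias scores to be computed per poster and per community.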

Returning to FIG. 1, the morphological analysis unit 20 performs morphological analysis on the text of posted documents and stores the result in the morpheme temporary storage unit 80. Of the morphological analysis results, it stores only general nouns and sa-hen nouns (verbal nouns that form verbs with "suru") in the morpheme temporary storage unit 80.
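The part-of-speech filtering step can be sketched as below. The sketch assumes a morphological analyzer has already produced (surface form, part-of-speech) pairs; the tag names `general_noun` and `sahen_noun`, the function name `keep_index_nouns`, and the sample tokens are hypothetical and do not correspond to the tag set of any particular analyzer.

```python
# Hypothetical part-of-speech tags that are kept for later scoring.
KEPT_POS = {"general_noun", "sahen_noun"}


def keep_index_nouns(tokens):
    """Keep only surface forms tagged as general or sa-hen nouns.

    tokens: iterable of (surface, pos) pairs from a morphological analyzer.
    """
    return [surface for surface, pos in tokens if pos in KEPT_POS]


# Made-up analyzer output (English glosses stand in for Japanese morphemes).
tokens = [
    ("feature", "general_noun"),
    ("of", "particle"),
    ("language", "general_noun"),
    ("search", "sahen_noun"),
    ("do", "verb"),
]
kept = keep_index_nouns(tokens)
```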

The contents of the morpheme temporary storage unit 80 are described here with reference to FIG. 3, which is an explanatory diagram of those contents. As shown in FIG. 3, the morphological analysis result 805 is stored in association with the document ID 801, community ID 802, topic ID 803, and poster ID 804. Record R81 stored in the morpheme temporary storage unit 80 shown in FIG. 3 is the result of morphologically analyzing record R71 stored in the categorized document storage unit 70 shown in FIG. 2. The morphological analysis unit 20 provides the analysis results to the bias score calculation unit 30.

Returning to FIG. 1, the bias score calculation unit 30 uses the morphological analysis results produced by the morphological analysis unit 20 to record which words appeared in each document in the co-occurrence list temporary storage unit 90 and the appearance count temporary storage unit 100. It then calculates, as bias scores, chi-square values or Simpson coefficients between words and between words and categories. The contents of the co-occurrence list temporary storage unit 90 and the appearance count temporary storage unit 100 are described in detail later.
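For reference, the Simpson (overlap) coefficient of two sets is the size of their intersection divided by the size of the smaller set. The sketch below applies it to sets of document IDs; how the patent applies the coefficient in detail is not stated in this passage, so the function name `simpson_coefficient` and the document-set representation are assumptions.

```python
def simpson_coefficient(docs_a, docs_b):
    """Overlap coefficient |A & B| / min(|A|, |B|) for two sets of document IDs.

    Returns 0.0 when either set is empty to avoid division by zero.
    """
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / min(len(docs_a), len(docs_b))


# Documents containing word A, and documents posted by poster B (made up).
docs_with_word = {"d1", "d2", "d3"}
docs_by_poster = {"d2", "d3", "d4", "d5"}
overlap = simpson_coefficient(docs_with_word, docs_by_poster)  # 2 / 3
```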

To calculate chi-square values for combinations of words, combinations of a poster ID and a word, and combinations of a community ID and a word, the bias score calculation unit 30 first calculates the appearance count of each combination. The calculation of appearance counts is described in detail later. The bias score calculation unit 30 then uses the contents of the co-occurrence list temporary storage unit 90 and the appearance count temporary storage unit 100 to calculate the chi-square values for these combinations.
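The patent's own chi-square procedure is described later in the specification. As a rough illustration of the idea only, the sketch below counts co-occurrences over documents and applies the standard Pearson chi-square statistic for a 2x2 contingency table, which may differ from the patent's exact formula. The function names and the representation of each document as a set mixing poster IDs and words are simplifying assumptions.

```python
def contingency_counts(documents, x, y):
    """Count documents containing both x and y, only x, only y, or neither."""
    both = only_x = only_y = neither = 0
    for items in documents:
        has_x, has_y = x in items, y in items
        if has_x and has_y:
            both += 1
        elif has_x:
            only_x += 1
        elif has_y:
            only_y += 1
        else:
            neither += 1
    return both, only_x, only_y, neither


def chi_square_2x2(both, only_x, only_y, neither):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = both + only_x + only_y + neither
    denominator = (
        (both + only_x) * (only_y + neither)
        * (both + only_y) * (only_x + neither)
    )
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association is measurable
    return n * (both * neither - only_x * only_y) ** 2 / denominator


# Each document is a set holding its poster ID and its extracted words.
documents = [
    {"m1", "C++", "feature"},
    {"m1", "C++"},
    {"m2", "feature"},
    {"m2", "Java"},
]
score_cpp = chi_square_2x2(*contingency_counts(documents, "m1", "C++"))          # 4.0
score_feature = chi_square_2x2(*contingency_counts(documents, "m1", "feature"))  # 0.0
```

In this toy data, "C++" appears only in posts by m1 and scores high, while "feature" is spread evenly across posters and scores zero, which is the behavior the general term extraction relies on.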

The method of calculating the chi-square value is described in detail later. The bias score calculation unit 30 stores the calculated bias scores in the bias score temporary storage unit 110, whose contents are also described in detail later, and provides the calculated results to the general term extraction unit 40.

The general term extraction unit 40 uses the results calculated by the bias score calculation unit 30 to extract words with little bias as general terms. It extracts, from the records stored in the bias score temporary storage unit 110, those that satisfy a predetermined condition. This record extraction process is described in detail later. The general term extraction unit 40 stores the extracted general terms in the general term temporary storage unit 120. As shown in FIG. 4, the general term temporary storage unit 120 stores, for example, the word "special feature" as a general term. The general term extraction unit 40 also provides the extracted general terms to the index extraction unit 50.

Returning to FIG. 1, the index extraction unit 50 extracts the index of each category from the bias scores between categories and words and the bias scores between words, as calculated by the bias score calculation unit 30. The indexes extracted by the index extraction unit 50 are the keywords that characterize an individual. The index extraction unit 50 then stores the extracted indexes in the index storage unit 130.

As shown in FIG. 5, the index storage unit 130 stores a poster ID 1301, a technical term 1302, and a score 1304 in association with one another. In this way, the words (technical terms 1302) that characterize an individual (each poster 1301) are extracted, and the score 1304 indicates how closely the individual and the word are related. The index storage unit 130 also stores an approval status 1305 indicating whether the index (technical term 1302) has been approved by the person to whom it was assigned.

Returning to FIG. 1, the confirmation display unit 60 asks the person to whom an index has been assigned, via a display screen, whether the index stored in the index storage unit 130 is correct. For example, as shown in FIG. 6, the confirmation screen asks whether to approve indexes (technical terms) such as "○○ language" and "C++" as indexes that characterize that person.

When "Approve" is selected by the user, the approval status item in the index storage unit 130 is updated from "unapproved" to "approved". When "Reject" is selected, the approval status item is updated from "unapproved" to "rejected".
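The approval update can be sketched as a small state change on an index entry. The class name `IndexEntry`, the function name `record_decision`, the status strings, and the score value are hypothetical; the sketch only mirrors the transition from "unapproved" to "approved" or "rejected" described above.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """One row of the index storage (hypothetical field names)."""
    poster_id: str
    term: str
    score: float
    approval: str = "unapproved"


def record_decision(entry, decision):
    """Update an unapproved entry with the poster's decision."""
    if decision not in ("approved", "rejected"):
        raise ValueError(f"unknown decision: {decision!r}")
    if entry.approval != "unapproved":
        raise ValueError("entry has already been reviewed")
    entry.approval = decision


entry = IndexEntry(poster_id="m1", term="C++", score=12.0)  # made-up score
record_decision(entry, "approved")
```

Rejecting repeated or unknown decisions is a design choice of this sketch rather than something the text requires.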

The morphological analysis unit 20, the bias score calculation unit 30, the general term extraction unit 40, and the index extraction unit 50 are implemented by a computer; their operations are executed by the CPU based on a program stored in the ROM described above. The index storage unit 130 is implemented by the storage device (HDD) described above. The category-added document storage unit 70, the morpheme temporary storage unit 80, the co-occurrence list temporary storage unit 90, the appearance number temporary storage unit 100, the bias score temporary storage unit 110, and the general term temporary storage unit 120 are implemented by the storage device (HDD) or the RAM.

[2-2] Details of Technical Term Extraction Processing
The functional configuration of the technical term extraction device 1 according to the present embodiment has been described above. Next, the details of the technical term extraction processing in the technical term extraction device 1 are described with reference to FIGS. 7 and 8, and to FIGS. 9 to 11 where appropriate. FIG. 7 is a flowchart showing the details of the technical term extraction processing.

As shown in FIG. 7, the input unit 10 first receives an index creation request (S100). When the request is received in step S100, the morphological analysis unit 20 performs morphological analysis on the text data in the category-added document storage unit 70 of FIG. 2 (S110).

The morphological analysis unit 20 performs morphological analysis on the posted content item of the category-added document storage unit 70. From the analysis results, general nouns and sa-hen nouns (verbal nouns) are stored in the morpheme temporary storage unit 80 shown in FIG. 3. As described above, record R81 in the morpheme temporary storage unit 80 of FIG. 3 is the result of morphologically analyzing record R71 in the category-added document storage unit 70 of FIG. 2.

The bias score calculation unit 30 calculates chi-square values for combinations of two words, of a contributor ID and a word, of a community ID and a word, and of a community ID and a contributor ID. To do so, it first counts the occurrences of each combination. The counting process is described here with reference to FIG. 8, which is a flowchart showing the details of the processing for counting the occurrences of each combination.

As shown in FIG. 8, for each record stored in the morpheme temporary storage unit 80, the bias score calculation unit 30 forms the set over which co-occurrence counts are calculated (S1200). The elements of the set are the community ID, the contributor ID, and the words. For example, record R81 in FIG. 3 yields the set {c1, m1, special feature, programming, language}.

Then, from the set formed in step S1200, all two-element subsets (the members of the power set that contain exactly two elements) are created (S1210). For example, record R81 in FIG. 3 yields {{c1, m1}, {c1, special feature}, {c1, programming}, {c1, language}, {m1, special feature}, {m1, programming}, {m1, language}, {special feature, programming}, {special feature, language}, {language, programming}}.
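The pair generation in step S1210 can be sketched as follows. This is a minimal illustration, assuming Python; the helper name `two_element_subsets` is hypothetical, and the patent does not prescribe any particular implementation.

```python
from itertools import combinations

def two_element_subsets(elements):
    """Return every unordered pair drawn from the given elements (step S1210)."""
    # Deduplicate while preserving order, since the input is treated as a set.
    unique = list(dict.fromkeys(elements))
    return [frozenset(pair) for pair in combinations(unique, 2)]

# Set formed from record R81 in step S1200.
record_r81 = ["c1", "m1", "special feature", "programming", "language"]
pairs = two_element_subsets(record_r81)
print(len(pairs))  # 10
```

A five-element set produces 5 × 4 / 2 = 10 pairs, which matches the ten subsets listed for record R81.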

Then, the occurrence count is incremented for each element calculated in step S1210 (S1220). Specifically, each time such an element occurs, the count in the corresponding record of the co-occurrence list temporary storage unit 90 shown in FIG. 9 is incremented. At the same time, the counts in the corresponding records of the appearance number temporary storage unit 100 shown in FIG. 10 are also incremented.

As shown in FIG. 9, in each record of the co-occurrence list temporary storage unit 90, two of the four items community ID 901, contributor ID 902, word 1 (903), and word 2 (904) hold values, and those two values form the combination. The number of appearances 905 indicates how many times that combination occurred.

As shown in FIG. 10, in the appearance number temporary storage unit 100, the items community ID 1001, contributor ID 1002, and word 1 (1003) hold the name of the corresponding element, and the number of appearances 1004 indicates how many times that element occurred.

For example, record R91 of the co-occurrence list temporary storage unit 90 shown in FIG. 9 indicates that, when the two-element subsets were created in step S1210 over all documents, the combination (community c1, programming) occurred twice. This concludes the description of the processing for counting the occurrences of each combination.
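Steps S1200 to S1220 together amount to counting pairs and single elements across all records. The following is a minimal sketch, assuming Python. The names `count_occurrences`, `pair_counts`, and `single_counts` are hypothetical stand-ins for the co-occurrence list temporary storage unit 90 and the appearance number temporary storage unit 100, and the second sample record is invented for illustration.

```python
from collections import Counter
from itertools import combinations

def count_occurrences(records):
    """Count pairs and single elements over all records (steps S1200-S1220)."""
    pair_counts = Counter()    # plays the role of storage unit 90
    single_counts = Counter()  # plays the role of storage unit 100
    for community_id, contributor_id, words in records:
        # S1200: one set per record (community ID, contributor ID, words).
        elements = sorted({community_id, contributor_id, *words})
        # S1220: count each single element once per record.
        single_counts.update(elements)
        # S1210 + S1220: count each two-element subset once per record.
        pair_counts.update(combinations(elements, 2))
    return pair_counts, single_counts

# The first record mirrors record R81; the second is invented.
records = [
    ("c1", "m1", ["special feature", "programming", "language"]),
    ("c1", "m2", ["programming", "search"]),
]
pair_counts, single_counts = count_occurrences(records)
print(pair_counts[("c1", "programming")])  # 2
```

Sorting the elements before pairing gives each pair a canonical key, so (c1, programming) and (programming, c1) are never counted separately.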

Returning to FIG. 7, the description of the technical term extraction processing continues. After the combination counts are calculated in step S120, the bias score calculation unit 30 calculates chi-square values for combinations of words and for combinations of a community ID and a word. The chi-square value of word X (or community X, or contributor X) and word Y is calculated by the following Formula 1.
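The image of Formula 1 is not reproduced in this text. Judging from the worked calculation later in this section, it appears to be the standard chi-square statistic for a 2×2 contingency table, which can be written as:

```latex
\chi^2 = \frac{N\,(O_{11}O_{22} - O_{12}O_{21})^2}
              {(O_{11}+O_{12})\,(O_{11}+O_{21})\,(O_{12}+O_{22})\,(O_{21}+O_{22})}
```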

Here, N is the total number of elements; in this embodiment, it is the total number of records in the co-occurrence list temporary storage unit 90 of FIG. 9. O₁₁, O₁₂, O₂₁, and O₂₂ are defined as follows.

O₁₁: the number of elements in which word X and word Y co-occur (or in which community X and word Y co-occur, or in which contributor X and word Y co-occur). This corresponds to the number of appearances of each record in the co-occurrence list temporary storage unit 90 of FIG. 9.
O₂₂: the number of elements in which neither word X nor word Y appears (or the number of elements in documents outside community X in which word Y does not appear, or the number of elements in documents not posted by contributor X in which word Y does not appear). That is, O₂₂ = N − O₁₁ − O₁₂ − O₂₁.
O₁₂: the number of elements in which word X appears but word Y does not (or the number of elements in community X documents in which word Y does not appear, or the number of elements in documents posted by contributor X in which word Y does not appear).
O₂₁: the number of elements in which word Y appears but word X does not (or the number of elements in documents outside community X in which word Y appears, or the number of elements in documents not posted by contributor X in which word Y appears).

The chi-square value of community X and contributor Y is calculated in the same way. In this case:
O₁₁: the number of elements in community X documents that belong to contributor Y
O₂₂: the number of elements in documents outside community X that belong to contributors other than Y
O₁₂: the number of elements in community X documents that belong to contributors other than Y
O₂₁: the number of elements in documents outside community X that belong to contributor Y

O₁₁, O₁₂, O₂₁, and O₂₂ can be calculated from the co-occurrence list temporary storage unit 90 of FIG. 9 and the appearance number temporary storage unit 100 of FIG. 10. For example, in FIG. 10, the number of elements for the word "programming" is 22, and the number of elements for community "c1" is 80. The values of O₁₁, O₁₂, O₂₁, and O₂₂ used to calculate the chi-square value of community "c1" and the word "programming" are then as follows, assuming that the total number of elements is 1000.

O₁₁: 2 (number of appearances of record R91 in FIG. 9)
O₁₂: 78 (number of elements in community "c1" documents that do not include "programming": number of appearances of record R101 in FIG. 10 − number of appearances of record R91 in FIG. 9 = 80 − 2)
O₂₁: 20 (number of elements in documents outside community "c1" that include "programming": number of appearances of record R102 in FIG. 10 − number of appearances of record R91 in FIG. 9 = 22 − 2)
O₂₂: 900 (elements not counted in O₁₁, O₁₂, or O₂₁ = 1000 − 2 − 78 − 20)

Therefore, the chi-square value is
{1000 × (2 × 900 − 78 × 20)²} / {(2 + 78) × (2 + 20) × (78 + 900) × (20 + 900)} ≈ 0.036.
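The worked calculation above can be checked with a few lines of code. This is a minimal sketch, assuming Python; the function names `chi_square_2x2` and `table_from_counts` are hypothetical and not taken from the patent.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Chi-square statistic of a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        raise ValueError("a row or column of the table sums to zero")
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator

def table_from_counts(n, count_x, count_y, count_xy):
    """Derive O11, O12, O21, O22 from totals, as in the worked example."""
    o11 = count_xy
    o12 = count_x - count_xy
    o21 = count_y - count_xy
    o22 = n - o11 - o12 - o21
    return o11, o12, o21, o22

# Community "c1" (80 elements) and word "programming" (22 elements),
# co-occurring twice among 1000 elements.
table = table_from_counts(1000, 80, 22, 2)
score = chi_square_2x2(*table)
print(table, round(score, 3))  # (2, 78, 20, 900) 0.036
```

A value this close to zero means the word is spread almost evenly inside and outside the community, which is the signature of a general term.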

The chi-square calculation above is performed for combinations of two words, of a word and a contributor, of a word and a community, and of a community and a contributor. However, combinations for which O₁₁ is 1 or less are excluded from the calculation, because they hardly ever co-occur. The resulting chi-square values, which serve as the bias scores, are stored in the bias score temporary storage unit 110 shown in FIG. 11.

After the bias scores are calculated in step S130, the general term extraction unit 40 extracts general terms based on the chi-square values calculated in step S130 (S140). First, records satisfying the following two conditions are extracted from the bias score temporary storage unit 110 as records that indicate general terms.
(Condition 1) The chi-square value is 1.0 or less, the word 1 item has a value, and the community ID item or the contributor ID item has a value.
(Condition 2) When the records satisfying Condition 1 are aggregated with the word 1 item as the key, three or more records share the same word 1 value.

Records satisfying the above conditions can be obtained, for example, with the following SQL.
select word1, count(*) from bias_score_temp
where chi_square <= 1.0 and (community_id is not null or contributor_id is not null) and word1 is not null
group by word1 having count(*) >= 3;

For example, in FIG. 11, records R111, R112, and R113 satisfy Condition 1. Because Condition 2 is also satisfied, the word "special feature" is obtained as a general term. The obtained word is stored in the general term temporary storage unit 120 shown in FIG. 4.
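The two conditions can be tried out against an in-memory database. The following is a minimal sketch, assuming Python's standard `sqlite3` module. The table name `bias_score_temp` and the column names are hypothetical English names for the bias score temporary storage unit 110 and its items, and the sample rows are invented rather than taken from FIG. 11.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table bias_score_temp ("
    "community_id text, contributor_id text, word1 text, word2 text, chi_square real)"
)
# Invented rows: "special feature" scores low against three different
# communities/contributors, "XX language" scores high against one community.
rows = [
    ("c1", None, "special feature", None, 0.5),
    ("c2", None, "special feature", None, 0.8),
    (None, "m1", "special feature", None, 0.3),
    ("c1", None, "XX language", None, 100.0),
    (None, None, "special feature", "programming", 0.2),  # word pair: ignored
]
conn.executemany("insert into bias_score_temp values (?, ?, ?, ?, ?)", rows)

general_terms = [
    word for word, _count in conn.execute(
        "select word1, count(*) from bias_score_temp "
        "where chi_square <= 1.0 "
        "and (community_id is not null or contributor_id is not null) "
        "and word1 is not null "
        "group by word1 having count(*) >= 3"
    )
]
print(general_terms)  # ['special feature']
```

The word-pair row is filtered out by Condition 1 because it has neither a community ID nor a contributor ID, so exactly three rows remain for "special feature".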

Next, the bias score temporary storage unit 110 is cleared so that the bias scores can be recalculated with the general terms extracted in step S140 excluded (S150). The bias score calculation unit 30 then recalculates the chi-square values, excluding the general terms extracted in step S140 (S160). The recalculation in step S160 is the same as the processing in steps S120 and S130, so a detailed description is omitted. However, when the co-occurrence list temporary storage unit 90 and the appearance number temporary storage unit 100 are created in step S120, no element is created for a combination whose word 1 item or word 2 item contains a word stored in the general term temporary storage unit 120.

The chi-square values recalculated in step S160 are stored in the bias score temporary storage unit 110. For example, because the word "special feature" has been judged to be a general term, the combination (c1, special feature) of record R114 in FIG. 11 is no longer stored.

Next, the general term extraction unit 40 extracts general terms under the same conditions as in step S140 (S170), and the general terms obtained in step S170 are added to the general term temporary storage unit 120. It is then determined whether to recalculate the bias scores again (S180). Suppose, for example, that the condition in step S180 is "recalculate the bias score only once". If this condition is satisfied in step S180, step S190 is executed. If it is not satisfied, the processing from step S150 to step S170 is repeated.
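The control flow of steps S120 to S180 can be sketched as a small loop. This is a minimal illustration, assuming Python. The callables `compute_scores` and `find_general_terms` are hypothetical placeholders for the bias score calculation unit 30 and the general term extraction unit 40, and the toy scores are invented.

```python
def extract_general_terms(compute_scores, find_general_terms, recalculations=1):
    """Alternate score calculation and general-term removal (S120-S180).

    compute_scores(excluded) -> scores computed without the excluded words
    find_general_terms(scores) -> set of words judged to be general terms
    """
    general_terms = set()
    scores = compute_scores(general_terms)            # S120-S130
    general_terms |= find_general_terms(scores)       # S140
    for _ in range(recalculations):                   # S180: stop condition
        scores = compute_scores(general_terms)        # S150-S160
        general_terms |= find_general_terms(scores)   # S170
    return scores, general_terms

# Toy stand-ins: each word has a fixed score; low-scoring words are "general".
toy_scores = {"special feature": 0.5, "XX language": 100.0, "C++": 60.0}

def compute_scores(excluded):
    return {w: s for w, s in toy_scores.items() if w not in excluded}

def find_general_terms(scores):
    return {w for w, s in scores.items() if s <= 1.0}

scores, general_terms = extract_general_terms(compute_scores, find_general_terms)
print(sorted(scores), general_terms)
```

With `recalculations=1` the loop corresponds to the "recalculate only once" condition given as the example for step S180.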

In step S190, the index extraction unit 50 extracts the technical terms of each community. To do so, it sorts the data in the bias score temporary storage unit 110 in descending order of bias score for each community. This can be done, for example, with the following SQL.

select community_id, word1, chi_square from bias_score_temp where chi_square > 1.0 and community_id is not null and word1 is not null order by community_id, chi_square desc;

For example, from the data stored in FIG. 11, results such as the following are obtained, in the order community ID, word 1, chi-square value:
c1, XX language, 100
c3, search system, 80
These results show that the technical term of community c1 is "XX language" and that the technical term of community c3 is "search system".

Next, the index extraction unit 50 extracts the technical terms of each member (S200). In step S200, the data in the bias score temporary storage unit 110 is sorted in descending order of bias score for each contributor ID. This can be done, for example, with the following SQL.

select contributor_id, word1, chi_square from bias_score_temp where chi_square > 1.0 and contributor_id is not null and word1 is not null order by contributor_id, chi_square desc;

For example, from the data stored in FIG. 11, results such as the following are obtained, in the order contributor ID, word 1, chi-square value:
m1, XX language, 100
The technical term of member m1 is therefore "XX language". The index extraction unit 50 stores the obtained data in the index storage unit 130. As described above, the index storage unit 130 holds the data shown in FIG. 5, and the chi-square value is stored in the score 1304 of FIG. 5. For example, with the data shown in FIG. 11, record R131 of FIG. 5 is added.

The index extraction unit 50 then determines the degree of relevance between members and communities (S210). In step S210, the data in the bias score temporary storage unit 110 is sorted in descending order of bias score for each combination of community ID and contributor ID. This can be done, for example, with the following SQL.

select community_id, contributor_id, chi_square from bias_score_temp where chi_square > 1.0 and community_id is not null and contributor_id is not null order by contributor_id, community_id, chi_square desc;

For example, from the data stored in FIG. 11, results such as the following are obtained, in the order community ID, contributor ID, chi-square value:
c1, m9, 70
Community c1 and member m9 can therefore be judged to be closely related.

For each community judged to be closely related in step S210, the index extraction unit 50 stores the community's technical terms obtained in step S190 in the index storage unit 130 (S220).

For example, because community c1 and member m9 are closely related as described above, the technical term "XX language" of community c1 extracted in step S190 is stored as a technical term of member m9. The data added here is record R132 of FIG. 5. The score 1304 stores the chi-square value between community c1 and member m9.
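Step S220 can be sketched as a join between the community terms and the community-member links. This is a minimal illustration, assuming Python. The function name `inherit_community_terms` and the row layout are hypothetical, and starting every new row as "unapproved" is an assumption based on the approval flow described earlier.

```python
def inherit_community_terms(community_terms, member_links):
    """Give each member the terms of closely related communities (S220).

    community_terms: {community_id: [term, ...]} as obtained in step S190
    member_links: [(community_id, contributor_id, chi_square), ...] from S210
    Returns index rows (contributor_id, term, score, approval_status).
    """
    index_rows = []
    for community_id, contributor_id, chi_square in member_links:
        for term in community_terms.get(community_id, []):
            # The score is the community-member chi-square value.
            index_rows.append((contributor_id, term, chi_square, "unapproved"))
    return index_rows

rows = inherit_community_terms({"c1": ["XX language"]}, [("c1", "m9", 70)])
print(rows)  # [('m9', 'XX language', 70, 'unapproved')]
```

The resulting row corresponds to record R132: member m9 receives "XX language" with the score 70 even if m9 never used the term.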

For the data added up to step S220, the index extraction unit 50 extracts word combinations that are strongly related and adds them to the index storage unit 130 (S230). Strongly related word combinations can be obtained, for example, with the following SQL. In this example the chi-square threshold is 50, but the threshold is not limited to this value.

select word1, word2, chi_square from bias_score_temp
where chi_square > 50 and word2 is not null and word1 is not null and (word1 in ('XX language') or word2 in ('XX language')) order by chi_square desc;

As a result of the above SQL, record R115 of FIG. 11 is extracted, and "C++" is obtained as a related word. The obtained related words are stored in the index storage unit 130 as shown in records R133 and R134 of FIG. 5.

This concludes the description of registering indexes in the index storage unit 130. As described above, the index storage unit 130 shown in FIG. 5 has an approval status 1305 item for each registered entry. When the approval status 1305 is "unapproved", the member is asked for confirmation. For example, when the member logs in to the social network system, the confirmation display unit 60 displays the index confirmation screen shown in FIG. 6 and requests confirmation input from the member.

This concludes the details of the technical term extraction processing. In this embodiment, chi-square values are calculated as values representing the relationships between a community and a word, between a member and a word, and between words. A word that appears to be a technical term in multiple communities or for multiple members is judged, based on predetermined conditions, to be unlikely to be a technical term (that is, to be a general term), and the relationships between communities and words, members and words, and words are then recalculated. This makes it possible to extract technical terms with high accuracy.

Moreover, a member's technical terms can be extracted not only from the strength of the relationship between the member and a word, but also by taking into account the relationship between a community and the member and the relationship between the community and a word. In other words, a technical term that does not appear in the documents a member posted can still be assigned to that member through the community, so multiple technical terms characterizing an individual can be associated with that individual.

[3] Second Embodiment
Next, a second embodiment of the present invention is described. In the first embodiment, words closely related to a community are set as words characterizing members who are closely related to that community. However, the first embodiment does not consider when a word appeared. For example, in the first embodiment, a word is set as characterizing a member even if it never appeared during the period in which the member was posting. The present embodiment therefore differs from the first embodiment in that it also considers the period in which the member posted in the community when extracting words that characterize the member. The following description focuses on the differences from the first embodiment; detailed description of the common points is omitted.

[3-1] Functional Configuration of Technical Term Extraction Device
The functional configuration of the technical term extraction device 2 according to the present embodiment is substantially the same as that of the technical term extraction device 1 according to the first embodiment, so a detailed description is omitted. The present embodiment differs from the first embodiment in that the index extraction unit 50 considers the member's posting period in the community when extracting the technical terms of that community. The community technical term extraction performed by the index extraction unit 50 is described in detail in the explanation of the technical term extraction processing below.

[3-2] Details of Technical Term Extraction Processing
The technical term extraction processing in the present embodiment differs from that of the first embodiment in how the technical terms of a community are extracted. The differences are described in detail, and detailed description of the processing shared with the first embodiment is omitted.

In the technical term extraction processing of the present embodiment, the processing shown in FIG. 12 is executed in place of the community technical term extraction of step S190 in FIG. 7 of the first embodiment. FIG. 12 is a flowchart showing the details of the community technical term extraction processing in the present embodiment.

As shown in FIG. 12, for each community judged to be closely related in step S210, the index extraction unit 50 stores the community's technical terms obtained in step S190 in the index storage unit 130, subject to the following condition.
(Condition) The period in which the technical term appeared in the community overlaps the period in which the member posted in that community.

To check this condition, the index extraction unit 50 first calculates the appearance period of the community's technical term (S2000). For example, the technical term "XX language" of community c1 appears in the records for documents d3 and d4 in the morpheme temporary storage unit 80 of FIG. 3. Because the document IDs in the morpheme temporary storage unit 80 of FIG. 3 correspond to those in the category-added document storage unit 70 of FIG. 2, the appearance period runs from the posting time "December 11, 2008" to "December 14, 2008" in FIG. 2.

Next, the member's (contributor's) posting period in the community is calculated (S2010). In the category-added document storage unit 70 of FIG. 2, contributor m1 posted in community c1 from "December 11, 2008" (document d1) to "December 14, 2008" (document d4).

If the period obtained in step S2000 overlaps the period obtained in step S2010, the community's technical term is stored in the index storage unit 130. In the example above, the appearance period of the community's technical term overlaps the member's posting period, so the technical term "XX language" is stored in the index storage unit 130.
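The overlap condition can be sketched as an interval test. This is a minimal illustration, assuming Python; the helper name `periods_overlap` is hypothetical, and treating both periods as closed date ranges (end dates included) is an assumption, since the text does not state how boundary days are handled.

```python
from datetime import date

def periods_overlap(start_a, end_a, start_b, end_b):
    """Return True when two closed date ranges share at least one day."""
    return start_a <= end_b and start_b <= end_a

# Periods from the example: both run from 2008-12-11 to 2008-12-14.
term_period = (date(2008, 12, 11), date(2008, 12, 14))    # S2000
member_period = (date(2008, 12, 11), date(2008, 12, 14))  # S2010
store_term = periods_overlap(*term_period, *member_period)
print(store_term)  # True
```

Two ranges overlap exactly when each one starts no later than the other ends, which avoids enumerating the individual cases of partial and full containment.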

The second embodiment has been described above. The present embodiment takes into account a temporal condition, the appearance period of a word, that the first embodiment does not consider. This further improves the accuracy with which a word closely related to a community is set as a word characterizing a member who is strongly related to that community.

Preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field to which the present invention belongs could conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to fall within the technical scope of the present invention.

For example, in the above embodiments the chi-square value is used as the value indicating the relationships between a community and a member, between a member and a word, and between words, but the present invention is not limited to this example. For instance, mutual information or a similar measure may be used to calculate the strength of the relationship between words.
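As one illustration of such an alternative, pointwise mutual information (PMI) can be computed from the same counts used for the chi-square value. This is a minimal sketch, assuming Python. PMI is only one common reading of "mutual information" for word association; the text does not specify which variant is intended, and the function name is hypothetical.

```python
import math

def pointwise_mutual_information(n, count_x, count_y, count_xy):
    """PMI in bits: log2( P(x, y) / (P(x) * P(y)) )."""
    if min(n, count_x, count_y, count_xy) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((count_xy * n) / (count_x * count_y))

# Same counts as the chi-square example: N = 1000, "c1" = 80,
# "programming" = 22, co-occurrences = 2.
pmi = pointwise_mutual_information(1000, 80, 22, 2)
```

A PMI of zero means the two items co-occur exactly as often as independence would predict; larger positive values indicate a stronger association.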

When the bias scores are recalculated with general terms excluded, the thresholds, appearance frequencies, and similar values may also be recalculated.

In the above embodiments, words characterizing a community are extracted for each community individually, but the present invention is not limited to this example; words characterizing a group of related communities may be extracted for those communities together.

Further, for example, the steps in the processing of the technical term extraction device 1 in this specification do not necessarily have to be processed in time series in the order described in the flowcharts. That is, the steps in the processing of the technical term extraction device 1 may be executed in parallel, even when they are different processes.

It is also possible to create a computer program that causes hardware such as the CPU, ROM, and RAM built into the technical term extraction device 1 or the like to perform functions equivalent to those of the components of the technical term extraction device 1 described above. A storage medium storing the computer program is also provided.

DESCRIPTION OF SYMBOLS
1, 2 Technical term extraction device
10 Input unit
20 Morphological analysis unit
30 Bias score calculation unit
40 General term extraction unit
50 Index extraction unit
60 Confirmation display unit
70 Categorized document storage unit
80 Morpheme temporary storage unit
90 Co-occurrence list temporary storage unit
100 Appearance count temporary storage unit
110 Bias score temporary storage unit
120 General term temporary storage unit
130 Index storage unit

Claims

A technical term extraction device comprising:
a morphological analysis unit that performs morphological analysis on a document input in response to a poster's operation;
a bias score calculation unit that calculates bias scores between words included in the document, between a word and the poster, and between a word and a posting destination group to which the poster belongs;
a general term extraction unit that extracts general terms included in the document according to the values of the bias scores; and
an index extraction unit that extracts keywords indicating individual characteristics by removing, from the document, the general terms extracted by the general term extraction unit.

The technical term extraction device according to claim 1, further comprising a storage unit that stores the document input in response to the poster's operation, the poster, and the posting destination group to which the poster belongs.

The technical term extraction device according to claim 1, wherein the bias score calculation unit calculates the bias score as a chi-square value.

The technical term extraction device according to claim 1, wherein the general term extraction unit extracts, as a general term, a word that, among combinations of a poster and a word or of a posting destination group to which the poster belongs and a word, has a bias score less than or equal to a predetermined value and is related to a plurality of posters or a plurality of posting destination groups.

The technical term extraction device according to claim 4, wherein
the bias score calculation unit recalculates the bias scores with the words extracted as general terms excluded, and
the general term extraction unit again extracts, as a general term, a word that, among combinations of a poster and a word or of a posting destination group to which the poster belongs and a word, has a bias score less than or equal to a predetermined value and is related to a plurality of posters or a plurality of posting destination groups.

The technical term extraction device according to claim 1, wherein the index extraction unit extracts, as a keyword indicating an individual characteristic, a word whose bias score between the poster and the word is equal to or greater than a predetermined value.

The technical term extraction device according to claim 1, wherein the index extraction unit extracts words indicating the characteristics of the posting destination group, and extracts, as keywords indicating individual characteristics, the words indicating the characteristics of a posting destination group selected according to the value of the bias score between the poster and the posting destination group to which the poster belongs.

The technical term extraction device according to claim 7, wherein the index extraction unit extracts, as a keyword indicating an individual characteristic, a word from among the extracted words indicating the characteristics of the posting destination group when the period in which documents including the word were posted corresponds to the period in which the poster posted in the posting destination group.

A program for causing a computer to function as a technical term extraction device comprising:
a morphological analysis unit that performs morphological analysis on a document input in response to a poster's operation;
a bias score calculation unit that calculates bias scores between words included in the document, between a word and the poster, and between a word and a posting destination group to which the poster belongs;
a general term extraction unit that extracts general terms included in the document according to the values of the bias scores; and
an index extraction unit that extracts keywords indicating individual characteristics by removing, from the document, the general terms extracted by the general term extraction unit.