JP2006293767A

JP2006293767A - Sentence categorizing device, sentence categorizing method, and categorization dictionary creating device

Info

Publication number: JP2006293767A
Application number: JP2005114841A
Authority: JP
Inventors: Koichi Hirano; 耕一平野; Noriya Furubayashi; 紀哉古林; Osamu Oshima; 修大島
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2005-04-12
Filing date: 2005-04-12
Publication date: 2006-10-26

Abstract

PROBLEM TO BE SOLVED: To resolve the problem that it is difficult to categorize sentence data into a plurality of categories.

SOLUTION: A categorization dictionary storing part 24 stores dictionary data in which a component of a sentence is associated, for each of a plurality of predetermined categories, with the appearance frequency of that component in sentences categorized into the category. A sentence receiving part 36 receives un-categorized sentence data, which is a new categorization target, from an external device. A sentence analyzing part 18 analyzes the un-categorized sentence data in accordance with predetermined rules to extract sentence components. A category attribute probability calculating part 26 refers to the dictionary data to calculate, for each category, an attribute probability indicating the probability that the un-categorized sentence data including the extracted components belongs to that category. A determining part 28 refers to the attribute probabilities to determine a category of the un-categorized sentence data.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to sentence classification technology, and more particularly to technology for classifying uncategorized sentence data into a plurality of predetermined categories.

As the Internet has spread widely through society, the amount of text circulating on networks has increased dramatically. As a result, it is becoming difficult to perform by hand tasks such as sorting web pages collected over a network into appropriate categories for display, or sorting large volumes of e-mail into appropriate folders. Sentence classification techniques that automatically classify sentence data into predetermined categories have therefore been devised.

For example, there is a sentence classification technique that uses the feature vector method (see, for example, Patent Document 1). This technique classifies a sentence in the following steps. First, from a collection of example sentences belonging to category i, a vector W_i = {w_ij} is generated that expresses the importance w_ij of each word j to category i. Next, a feature vector W of the uncategorized sentence is generated from the words that appear in it. Finally, the vector W_n closest to the feature vector W is found, and the sentence is classified into the corresponding category n.
Patent Document 1: Japanese Unexamined Patent Publication No. 2000-222431

However, sentence classification based on the feature vector method handles one kind of condition poorly: the case where a sentence containing a certain word is very likely to be classified into a specific category, or very likely not to be classified into it. Such conditions cannot be reflected well in the classification, so misclassifications increase.

The present invention has been made in view of these circumstances, and its object is to provide a technique for automatically classifying sentence data according to a predetermined classification system.

One aspect of the present invention is a sentence classification device. The device comprises: a classification dictionary holding unit that holds dictionary data in which a sentence component is associated, for each of a plurality of predetermined categories, with the appearance frequency at which that component appears in sentences to be classified into the category; a sentence receiving unit that receives, from an external device, uncategorized sentence data to be newly classified; a sentence decomposition unit that analyzes the uncategorized sentence data according to a predetermined rule and extracts sentence components; a category attribution probability calculation unit that refers to the dictionary data and calculates, for each category, an attribution probability representing the probability that the uncategorized sentence data containing the extracted components belongs to that category; and a determination unit that refers to the attribution probabilities and determines the category into which the uncategorized sentence data is classified.

According to this aspect, the category to which the uncategorized sentence data belongs is determined from the attribution probability calculated for each category. A single piece of sentence data can therefore be classified into one category or into two or more categories. Furthermore, classification is not limited to categories based on a single viewpoint: categories based on several viewpoints can be mixed, and the attribution probabilities for all of them can be calculated together.

The "attribution probability" is the probability that uncategorized sentence data containing certain components should be classified into a given category, and it can be calculated for each category. It can be obtained as follows. For each component extracted from the uncategorized sentence data, the appearance frequency in the category under consideration is retrieved from the classification dictionary holding unit and an appearance probability is calculated. The appearance probabilities calculated for the individual components are then combined to give the attribution probability for that category.
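The calculation just described can be sketched in Python. This is only an illustration under stated assumptions: the passage does not say how the per-component appearance probabilities are combined, so the sketch assumes a naive-Bayes-style product evaluated in log space. The function names, the `(in_count, out_count)` dictionary layout, the clamping constant `eps`, and the sample counts are all hypothetical.

```python
import math

def appearance_probability(in_count, out_count, eps=0.01):
    """Estimate how strongly a component points to a category, from the number of
    times it appears in that category's examples versus its non-category examples."""
    total = in_count + out_count
    if total == 0:
        return 0.5  # unseen component: treated as uninformative (assumption)
    p = in_count / total
    return min(max(p, eps), 1.0 - eps)  # clamp so the logarithms stay finite

def attribution_probability(components, dictionary, category):
    """Combine per-component probabilities into one value for `category`.
    Assumed rule: P = prod(p) / (prod(p) + prod(1 - p))."""
    log_in = 0.0
    log_out = 0.0
    for c in components:
        in_count, out_count = dictionary.get(c, {}).get(category, (0, 0))
        p = appearance_probability(in_count, out_count)
        log_in += math.log(p)
        log_out += math.log(1.0 - p)
    diff = log_out - log_in
    if diff > 700:  # avoid overflow in math.exp for very long inputs
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical dictionary data: component -> category -> (in_count, out_count)
dictionary = {
    "election": {"politics": (40, 2), "sports": (1, 50)},
    "goal": {"politics": (3, 30), "sports": (45, 5)},
}
probs = {cat: attribution_probability(["election"], dictionary, cat)
         for cat in ("politics", "sports")}
```

Because each category is scored independently against its own category and non-category counts, the resulting values need not sum to one, which is consistent with a text being assignable to two or more categories.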

Another aspect of the present invention is a classification dictionary creation device. The device comprises: a sentence example storage unit that stores, for each of a plurality of predetermined categories, a category sentence example data group containing sentence data that should be classified into the category and a non-category sentence example data group containing sentence data that is not classified into the category; a sentence decomposition unit that analyzes sentence data according to a predetermined rule and extracts sentence components; a classification dictionary creation unit that counts, for each category, the appearance frequency at which the components extracted by the sentence decomposition unit appear in the sentence data of the category sentence example data group and the non-category sentence example data group; a classification dictionary holding unit that holds dictionary data in which each component is associated with its appearance frequency in each category; and a dictionary providing unit that provides the dictionary data to an external device.

According to this aspect, when an external device requests the dictionary data in order to calculate the attribution probability of uncategorized sentence data for each category, the classification dictionary creation device can transmit the dictionary data held in the classification dictionary holding unit to that external device.

Any combination of the above components, and any expression of the present invention as a method, device, system, recording medium, or computer program, is also valid as an aspect of the present invention.

According to the present invention, uncategorized sentence data can be classified automatically into predetermined categories.

One embodiment of the present invention is a sentence classification device that first creates dictionary data for classification from a large number of example sentence data items, each of which has been determined in advance to belong or not to belong to a specific category. For sentence data to be classified, the device then calculates, for each category, the probability that the data belongs to that category, and thereby determines the category into which the sentence data should be classified. The sentence classification device according to this embodiment is described below with reference to the drawings.

FIG. 1 shows an example of how the sentence classification device 10 according to this embodiment is used. The sentence classification device 10 is connected to client terminals 82 and servers 84 via a network 80. It classifies sentence data transmitted from a client terminal 82 or a server 84 into one of a plurality of preset categories. It also classifies into one of the categories sentence data that a web crawler (also called a search robot; not shown) has collected from the many client terminals 82 and servers 84 connected to the network. One feature of the sentence classification device 10 according to this embodiment is that it can also determine that the sentence data to be classified belongs to two or more categories.

FIG. 2 is a functional block diagram of the sentence classification device 10. The sentence classification device 10 includes a sentence example storage unit 12, a sentence decomposition unit 18, a sorting unit 20, a classification dictionary creation unit 22, a classification dictionary holding unit 24, a category attribution probability calculation unit 26, a determination unit 28, an element narrowing criteria providing unit 30, a determination result storage unit 32, and a sentence receiving unit 36. In hardware terms, these components can be realized by the CPU, memory, and other LSIs of any computer; in software terms, they are realized by programs loaded into memory. What is drawn here are the functional blocks realized by the cooperation of the two. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms: by hardware alone, by software alone, or by a combination of both.

The functional blocks of the sentence classification device 10 can be divided into those used in the learning stage, in which the dictionary data is created, and those used in the classification stage, in which uncategorized sentence data is classified using the dictionary data once it has been created. The functional blocks used in the learning stage are described first.

The sentence example storage unit 12 stores, for each of a plurality of predetermined categories, a category sentence example 14 containing sentence data that should be classified into that category and a non-category sentence example 16 containing sentence data that is not classified into it. A "category" here means a grouping used to classify sentence data according to a specific criterion. Categories can be set in many ways, depending on how the classification results will be used. For example, if the results from the sentence classification device 10 are used on a news distribution site, categories such as "politics", "economy", "society", and "sports" are conceivable. If the results are offered on a directory-type search site, categories such as "shopping", "travel", "movies", and "music" are conceivable. If the device classifies questionnaire responses and the like sent from client terminals, categories such as "women" and "young people" are conceivable.

"Sentence data" refers to data containing text information, such as plain text, HTML, XML, or XHTML files; the data format is not limited.

If M categories (category 1, ..., category M) are defined, a category sentence example 14 and a non-category sentence example 16 are prepared in the sentence example storage unit 12 for each category. A category sentence example is a file that accumulates one or more sentence data items that should be classified into a given category. A non-category sentence example is a file that accumulates one or more sentence data items that are not classified into that category. Which category sentence example or non-category sentence example a given sentence data item belongs to is decided manually. One pair consisting of a category sentence example and a non-category sentence example is prepared for each category.

The sentence data in the non-category sentence example for one category may differ from, or be identical to, the sentence data in the non-category sentence example for another category. In other words, the sentence data in the "non-category 1 sentence example" and the "non-category 2 sentence example" may overlap. The same holds for category sentence examples: a sentence data item may, for instance, be classified into both the "category 1 sentence example" and the "category 2 sentence example". Within a single category, however, it is desirable that no sentence data item be classified into both the category sentence example and the non-category sentence example.

The sentence decomposition unit 18 analyzes the sentence data contained in the category sentence examples 14 and non-category sentence examples 16 according to a predetermined rule and extracts sentence components. It may use any known sentence decomposition algorithm. Depending on the algorithm, it extracts words, word and part-of-speech pairs, phrases, simple sentences, and so on from the sentence data as components. Examples of sentence decomposition algorithms used by the sentence decomposition unit 18 are described later.

The sorting unit 20 rearranges, according to a predetermined rule, the components that the sentence decomposition unit 18 extracted from the sentence data of the category sentence examples 14 and non-category sentence examples 16. For example, the components may be sorted by their readings in Japanese syllabary (gojuon) order, or by the ASCII code of their first character. If the sentence decomposition unit 18 extracted word and part-of-speech pairs as components, they may be ordered by part of speech. Sorting the components makes it easier to search the dictionary data using a component as the key, as described later, and so shortens the time needed to create the classification dictionary.
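One way to see why sorting helps with keyed lookup is a binary search over sorted keys. The Python sketch below is illustrative only: it sorts by full Unicode code point order as a stand-in for the sort rules mentioned above, and the function names and sample components are hypothetical.

```python
import bisect

def sort_components(components):
    """Sort components by Unicode code point order (an assumed stand-in sort rule)."""
    return sorted(components)

def find_entry(sorted_keys, component):
    """Binary-search a sorted key list; return the index, or -1 if absent."""
    i = bisect.bisect_left(sorted_keys, component)
    if i < len(sorted_keys) and sorted_keys[i] == component:
        return i
    return -1

keys = sort_components(["spring", "agency", "region", "announce"])
hit = find_entry(keys, "region")
miss = find_entry(keys, "winter")
```

With sorted keys each lookup takes a logarithmic number of comparisons instead of a linear scan, which is the kind of saving the sorting step is meant to provide.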

The classification dictionary creation unit 22 calculates how often each component extracted by the sentence decomposition unit 18 appears in the sentence data of the category sentence example 14 and the non-category sentence example 16 for each category, and creates dictionary data that associates each component with its appearance frequency for each category. The "frequency" here may be a simple count of occurrences or the ratio of occurrences to the total number of words. Alternatively, it may be the ratio of occurrences to the number of sentence data items in the category sentence example or non-category sentence example. These are referred to collectively below as the "appearance frequency". In any case, any numerical value may be used as long as it expresses the degree to which a component appears in the category sentence example and the non-category sentence example for a category.
The dictionary data created by the classification dictionary creation unit 22 is stored in the classification dictionary holding unit 24. The configuration and functions of the classification dictionary creation unit 22 are described in more detail later with reference to FIG. 4.
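The three frequency measures mentioned above can be compared in a short Python sketch. It assumes each example has already been decomposed into a list of components. The third measure is interpreted here as the fraction of examples that contain the component, which is one possible reading. The function name, the output keys, and the sample data are hypothetical.

```python
from collections import Counter

def appearance_frequencies(documents):
    """documents: list of component lists, one list per example sentence data item."""
    raw = Counter()        # total occurrences of each component
    doc_count = Counter()  # number of examples that contain each component
    total = 0              # total number of components across all examples
    for comps in documents:
        raw.update(comps)
        doc_count.update(set(comps))
        total += len(comps)
    n_docs = len(documents)
    return {
        c: {
            "count": raw[c],
            "ratio_to_components": raw[c] / total,
            "ratio_to_documents": doc_count[c] / n_docs,
        }
        for c in raw
    }

# Hypothetical category sentence example, already decomposed into components
category_examples = [["rain", "forecast", "rain"], ["forecast", "wind"]]
freq = appearance_frequencies(category_examples)
```

Running the same function over the non-category sentence example for the same category gives the second set of values that the dictionary entry pairs with the first.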

The element narrowing criteria providing unit 30 stores selection criteria for excluding some of the components extracted by the sentence decomposition unit 18. A selection criterion is a condition that, for example, specifies particular parts of speech (such as nouns and verbs, nouns only, or particles only), specifies an upper limit on the number of characters, or specifies components consisting only of hiragana or only of kanji. Several conditions may be combined into one selection criterion.
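A combined selection criterion of this kind could be modelled as a small filter object, as in the Python sketch below. It assumes the criteria name the components to keep (the text leaves open whether a criterion lists what is kept or what is dropped) and covers only the part-of-speech and character-count conditions. The class names, field names, and part-of-speech labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Component:
    text: str
    pos: str  # part-of-speech label, e.g. "noun"

@dataclass(frozen=True)
class SelectionCriteria:
    allowed_pos: Optional[FrozenSet[str]] = None  # None means no restriction
    max_chars: Optional[int] = None               # None means no restriction

    def accepts(self, c):
        """Return True only if the component satisfies every configured condition."""
        if self.allowed_pos is not None and c.pos not in self.allowed_pos:
            return False
        if self.max_chars is not None and len(c.text) > self.max_chars:
            return False
        return True

def narrow(components, criteria):
    """Keep only the components that satisfy the selection criteria."""
    return [c for c in components if criteria.accepts(c)]

components = [
    Component("気象庁", "noun"),
    Component("は", "particle"),
    Component("発表", "noun"),
    Component("し", "verb"),
]
criteria = SelectionCriteria(allowed_pos=frozenset({"noun", "verb"}), max_chars=10)
kept = [c.text for c in narrow(components, criteria)]
```

Dropping particles such as は in this way shrinks the dictionary without discarding the content words that carry most of the category signal.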

By using the selection criteria provided by the element narrowing criteria providing unit 30 to narrow down the components for which dictionary data is created, the classification dictionary creation unit 22 can create dictionary data that is effective for classification while limiting the amount of dictionary data stored in the classification dictionary holding unit 24.
The sentence classification device 10 of this embodiment does not restrict the language of the sentence data to be classified, but the element narrowing criteria providing unit 30 is particularly useful when processing Japanese sentence data.

Next, the functional blocks used in the classification stage are described.

The sentence receiving unit 36 receives sentence data 34 to be classified (hereinafter "uncategorized sentence data 34") from an external device (not shown). As above, the data format of the uncategorized sentence data 34 is not restricted. The external device is, for example, a client terminal or server connected to the network, but is not limited to these.

The sentence decomposition unit 18 receives the uncategorized sentence data 34 from the sentence receiving unit 36 and extracts sentence components in the same way as described above. The sorting unit 20 rearranges the extracted components according to a predetermined rule, preferably the same rule that was used to sort the components extracted from the category and non-category sentence examples. This sorting makes it easier to search the dictionary data in the classification dictionary holding unit 24 using a component as the key, and so speeds up processing in the category attribution probability calculation unit 26 described below.

The category attribution probability calculation unit 26 refers to the dictionary data stored in the classification dictionary holding unit 24 and calculates, for each category, the probability that uncategorized sentence data containing certain components should be classified into that category. This probability is referred to below as the "attribution probability".

Like the classification dictionary creation unit 22, the category attribution probability calculation unit 26 may limit the components on which the attribution probability calculation is based, according to the selection criteria provided by the element narrowing criteria providing unit 30.
The configuration and functions of the category attribution probability calculation unit 26 are described in more detail later with reference to FIG. 7.

The determination unit 28 obtains the attribution probability calculated for each category by the category attribution probability calculation unit 26 and, based on those values, decides which category the uncategorized sentence data is classified into. More specifically, the determination unit 28 classifies the uncategorized sentence data into the category with the highest attribution probability. Alternatively, it may classify the data into every category whose attribution probability is at or above a preset threshold. In this way, a single piece of uncategorized sentence data can be classified into two or more categories in one series of operations. If no category reaches the threshold, the determination unit 28 may either judge that the uncategorized sentence data belongs to no category or classify it into the category with the highest attribution probability. The result of the determination is stored in the determination result storage unit 32 or output to an external device (not shown).
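The decision rules described above (highest probability, threshold with multiple categories, and the two fallbacks when nothing reaches the threshold) can be sketched in Python as follows. The function name, its parameters, and the sample probabilities are hypothetical.

```python
def determine_categories(probabilities, threshold=None, fallback_to_max=False):
    """Return the list of categories assigned to one piece of sentence data.

    probabilities: dict mapping category name -> attribution probability.
    threshold=None selects only the category with the highest probability.
    """
    if not probabilities:
        return []
    best = max(probabilities, key=probabilities.get)
    if threshold is None:
        return [best]
    selected = [c for c, p in probabilities.items() if p >= threshold]
    if selected:
        return selected
    # No category reached the threshold: either "no category" or the best one.
    return [best] if fallback_to_max else []

probs = {"politics": 0.91, "economy": 0.84, "sports": 0.05}
single = determine_categories(probs)
multi = determine_categories(probs, threshold=0.8)
none = determine_categories(probs, threshold=0.95)
fallback = determine_categories(probs, threshold=0.95, fallback_to_max=True)
```

Here `multi` contains both "politics" and "economy", illustrating how one pass over the per-category probabilities can place a text in more than one category.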

Next, the sentence decomposition algorithms used by the sentence decomposition unit 18 are outlined.

(1) Morphological analysis
FIG. 3 shows an example in which sentence data is decomposed into components by morphological analysis. The sentence data used is 「気象庁は２３日、関東地方で春一番が吹いたと発表した。」 ("The Japan Meteorological Agency announced on the 23rd that haru-ichiban, the first strong wind of spring, had blown in the Kanto region."). As shown in FIG. 3, this sentence is decomposed into 18 elements: 「気象庁／は／２／３／日／、／関東／地方／で／春一番／が／吹い／た／と／発表／し／た／。」. Morphological analysis can determine the inflected form 50, the base form 52, and the part of speech 54 of each element in the target sentence. Of these, the pair (base form + part of speech) may be used as the component, or the base form alone. The inflected form may also be used in place of the base form.

Components extracted by morphological analysis divide the text at a fine granularity, so classification into categories based on dictionary data built from these components is expected to be highly accurate. Morphological analysis is a well-known technique, so further explanation is omitted.

(2) Syntactic analysis
Next, syntactic analysis is described. Syntactic analysis decomposes a sentence into phrases (bunsetsu). Decomposing the same sentence data as in the example of FIG. 3 by syntactic analysis yields six components: 「気象庁は／２３日、／関東地方で／春一番が／吹いたと／発表した。」.

Syntactic analysis yields far fewer components than morphological analysis, which makes it suitable for high-speed classification, although classification accuracy is lower. Syntactic analysis is a well-known technique, so further explanation is omitted.

(3) Minimal constituent sentences
Next, an example of extracting minimal constituent sentences from text using morphological and syntactic analysis is described. A "minimal constituent sentence" is a sentence that carries a minimal unit of meaning. Details are given in Takahisa Ota and Shigeru Masuyama, "Devising an inter-document similarity measure for detecting imitated reports" (模倣レポート判定に用いる文書間類似度の考案), Proceedings of the 10th Annual Meeting of the Association for Natural Language Processing, pp. 729-732, 2004.

Extracting minimal constituent sentences from the same sentence data as in the example of FIG. 3 yields three: 「気象庁は発表した。」 ("The Meteorological Agency announced."), 「２３日、発表した。」 ("Announced on the 23rd."), and 「関東地方で春一番が吹いたと発表した。」 ("Announced that haru-ichiban had blown in the Kanto region."). When dictionary data is created with these minimal constituent sentences as components, the meaning of a word in its context can be captured. This enables more advanced classification, such as assigning text that contains words with multiple meanings to the appropriate category, but the computational cost is higher.

The part-of-speech information obtained from morphological analysis may also be used to extract minimal constituent sentences consisting only of the base forms of nouns, adjectives, and verbs. For the same example, this yields three minimal constituent sentences: 「気象庁・発表する」, 「２３日・発表する」, and 「関東地方・春一番・吹く・発表する」.

In this way, extracting components with different sentence decomposition algorithms in the sentence decomposition unit 18 allows the classification dictionary creation unit 22 to create dictionary data with different tendencies. Selecting a sentence decomposition algorithm suited to the kinds of categories in use can therefore improve classification accuracy and processing speed.

FIG. 4 is a detailed functional block diagram of the classification dictionary creation unit 22. The classification dictionary creation unit 22 includes a component receiving unit 102, a narrowing information receiving unit 104, a category information providing unit 106, a component selection unit 108, a dictionary data search unit 110, and a dictionary data update unit 112.

構成要素受付部１０２は、ソート部２０から所定の規則にしたがって並べ替えられた構成要素を受け取り、構成要素選択部１０８に渡す。カテゴリ情報提供部１０６は、構成要素受付部１０２で受け取られた構成要素が抽出されたカテゴリ文例および非カテゴリ文例の属するカテゴリについての情報を、要素絞り込み基準提供部３０に伝える。絞り込み情報受付部１０４は、要素絞り込み基準提供部３０から選択基準を受け取り、構成要素選択部１０８に渡す。構成要素選択部１０８は、選択基準と構成要素とを比較して、選択基準を満たす構成要素を選択して辞書データ検索部１１０に渡す。辞書データ検索部１１０は、分類辞書保持部２４に保持されている辞書データのなかから、選択基準を満たした構成要素と同一の構成要素についての辞書データがあるか検索し、対応する辞書データがある場合は、辞書データ更新部１１２に渡す。辞書データ更新部１１２は、選択基準を満たした各構成要素の数をカウントし、その数を辞書データに追加し、分類辞書保持部２４に格納する。構成要素が新規であるときは、新たな辞書データを作成して分類辞書保持部２４に格納する。 The component reception unit 102 receives, from the sorting unit 20, the components rearranged according to a predetermined rule and passes them to the component selection unit 108. The category information provision unit 106 informs the element refinement criterion providing unit 30 of the categories to which the category sentence examples and non-category sentence examples belong, from which the components received by the component reception unit 102 were extracted. The narrowing information reception unit 104 receives the selection criterion from the element refinement criterion providing unit 30 and passes it to the component selection unit 108. The component selection unit 108 compares the components with the selection criterion, selects the components that satisfy it, and passes them to the dictionary data search unit 110. The dictionary data search unit 110 searches the dictionary data held in the classification dictionary holding unit 24 for an entry for the same component as each selected component and, if one exists, passes it to the dictionary data update unit 112. The dictionary data update unit 112 counts the occurrences of each component that satisfied the selection criterion, adds the count to the dictionary data, and stores the result in the classification dictionary holding unit 24. When a component is new, a new dictionary data entry is created and stored in the classification dictionary holding unit 24.

図５は、分類辞書保持部２４に格納されている辞書データのデータ構造図である。辞書データ４０においては、構成要素４２と、その構成要素がカテゴリ１〜Ｍのカテゴリ文例および非カテゴリ文例に含まれる文章中に出現する出現頻度４４とが関連付けされている。構成要素をＷ_ｎ（ｎ＝１〜Ｎ）、Ｗ_ｎがカテゴリｍ（ｍ＝１〜Ｍ）のカテゴリ文例または非カテゴリ文例に含まれる文章中の出現頻度をそれぞれＸ_ｎｍ、Ｙ_ｎｍと表記すると、ある構成要素Ｗ_ｎについての辞書データ４６は、（Ｗ_ｎ，Ｘ_ｎ１，Ｙ_ｎ１，Ｘ_ｎ２，Ｙ_ｎ２，・・・，Ｘ_ｎＭ，Ｙ_ｎＭ）と表すことができる。 FIG. 5 is a data structure diagram of the dictionary data stored in the classification dictionary holding unit 24. In the dictionary data 40, each component 42 is associated with appearance frequencies 44, that is, the frequencies with which the component appears in the sentences included in the category sentence examples and the non-category sentence examples of categories 1 to M. Let the components be W_n (n = 1 to N), and let X_nm and Y_nm be the frequencies with which W_n appears in the sentences included in the category sentence examples and the non-category sentence examples, respectively, of category m (m = 1 to M). The dictionary data 46 for a given component W_n can then be expressed as (W_n, X_n1, Y_n1, X_n2, Y_n2, ..., X_nM, Y_nM).

この実施の形態では、各構成要素Ｗ_ｎについて、（カテゴリｍのカテゴリ文例に含まれる文章中の出現頻度）と（カテゴリｍの非カテゴリ文例に含まれる文章中の出現頻度）の２つの値をペアで保持している。これは、カテゴリ文例または非カテゴリ文例に新たな文章データを追加して分類辞書保持部２４内の辞書データを拡充しようとした場合に、頻度情報の書き換えを容易にするためである。 In this embodiment, two values are held as a pair for each component W_n: the appearance frequency in the sentences included in the category sentence examples of category m, and the appearance frequency in the sentences included in the non-category sentence examples of category m. This makes it easy to rewrite the frequency information when new sentence data is added to the category sentence examples or the non-category sentence examples in order to expand the dictionary data in the classification dictionary holding unit 24.
別の実施の形態では、構成要素Ｗ_ｎの出現頻度を単一の値で保持してもよい。構成要素Ｗ_ｎのカテゴリｍについての出現頻度をＦ_ｎｍと表記すると、Ｆ_ｎｍ＝Ｘ_ｎｍ／Ｙ_ｎｍとしてもよいし、Ｆ_ｎｍ＝Ｘ_ｎｍ／（Ｘ_ｎｍ＋Ｙ_ｎｍ）としてもよい。この場合、ある構成要素Ｗ_ｎについての辞書データ４６は、（Ｗ_ｎ，Ｆ_ｎ１，Ｆ_ｎ２，・・・，Ｆ_ｎＭ）と表すことができる。 In another embodiment, the appearance frequency of the component W_n may be held as a single value. Writing the appearance frequency of the component W_n for category m as F_nm, either F_nm = X_nm / Y_nm or F_nm = X_nm / (X_nm + Y_nm) may be used. In this case, the dictionary data 46 for a given component W_n can be expressed as (W_n, F_n1, F_n2, ..., F_nM).
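The paired representation described above can be sketched as a small record type. This is a minimal illustration, not the patent's implementation: the class name `DictionaryEntry`, its fields, and the choice to return 0.0 for a category with no counts are all assumptions made for the example. The counts 10 and 2 are the "pachinko" values for the category "gambling" given in the worked example later in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DictionaryEntry:
    """Frequency record for one component W_n (class and field names are illustrative)."""
    component: str
    # category name -> [X_nm, Y_nm]
    counts: Dict[str, List[int]] = field(default_factory=dict)

    def add(self, category: str, in_category: bool, n: int = 1) -> None:
        """Add n occurrences to X_nm (category examples) or Y_nm (non-category examples)."""
        pair = self.counts.setdefault(category, [0, 0])
        pair[0 if in_category else 1] += n

    def single_value(self, category: str) -> float:
        """F_nm = X_nm / (X_nm + Y_nm); returning 0.0 when there are no counts is an assumption."""
        x, y = self.counts.get(category, [0, 0])
        return x / (x + y) if x + y else 0.0


entry = DictionaryEntry("パチンコ")
entry.add("gambling", in_category=True, n=10)   # X = 10
entry.add("gambling", in_category=False, n=2)   # Y = 2
print(entry.counts["gambling"], entry.single_value("gambling"))
```

Keeping X and Y separately means that adding new example sentences only increments a counter, whereas a stored ratio F would have to be recomputed from counts that are no longer available.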

図６は、要素絞り込み基準提供部３０に格納されている選択基準のデータ構造図である。選択基準は、カテゴリ種類に対応して準備される。図６では、カテゴリ種類として、「テーマ分け」「文体」「年代」が含まれる。要素絞り込み基準提供部３０は、構成要素を抽出したカテゴリ文例の情報をカテゴリ情報提供部１０６から受け取り、図中の左欄５６に示す特定のカテゴリの場合には、右欄５８の選択基準を返す。カテゴリ情報提供部１０６から受け取ったカテゴリが左欄５６に存在しない場合は、標準的な「名詞」という選択基準を返す。 FIG. 6 is a data structure diagram of the selection criteria stored in the element refinement criterion providing unit 30. Selection criteria are prepared for each category type. In FIG. 6, the category types are "theme classification", "writing style", and "age group". The element refinement criterion providing unit 30 receives, from the category information provision unit 106, information on the category sentence examples from which the components were extracted; if the category is one of the specific categories shown in the left column 56 of the figure, it returns the selection criterion in the right column 58. If the category received from the category information provision unit 106 is not in the left column 56, it returns the standard selection criterion, "noun".

例えば、カテゴリ種類がテーマや話題の分類に関するもの、例えば「旅行」「音楽」「映画」などのカテゴリの場合は、「名詞」という基準を提供する。このようなテーマや話題の分類については、特定の名詞の存在がカテゴリ分類を決定付けることが多いからである。カテゴリの種類が文体に関するもの、例えば「フォーマル」「丁寧」「乱文」などのカテゴリの場合は、「形容詞または助詞」という基準を提供する。文体は、「てにをは」などの助詞や感情表現によって決定できる場合が多いからである。さらに、文章を作成した人の年代や性別に関するもの、例えば「女性」「若年層」などの場合は、「平仮名の名詞」という基準を提供する。このように、要素絞り込み基準提供部３０は、辞書データの作成対象となる構成要素が、いずれのカテゴリ文例または非カテゴリ文例に含まれる文章データから抽出されたかに応じて、カテゴリ毎に異なる選択基準を提供することができる。分類辞書作成部２２は、選択基準を参照して、辞書として準備される構成要素を絞り込んだ辞書データを作成することができる。 For example, when the category type concerns the classification of themes or topics, such as the categories "travel", "music", and "movies", the criterion "noun" is provided. This is because, for such theme and topic classifications, the presence of a specific noun often decides the category. When the category type concerns writing style, such as the categories "formal", "polite", and "sloppy writing", the criterion "adjective or particle" is provided. This is because writing style can often be determined from particles (the so-called te-ni-o-ha) and from emotional expressions. Furthermore, when the category concerns the age group or gender of the author, such as "female" or "young people", the criterion "noun written in hiragana" is provided. In this way, the element refinement criterion providing unit 30 can provide a different selection criterion for each category, depending on which category sentence examples or non-category sentence examples contain the sentence data from which the components targeted for dictionary data creation were extracted. By referring to the selection criterion, the classification dictionary creation unit 22 can create dictionary data in which the components prepared as the dictionary have been narrowed down.

要素絞り込み基準提供部３０は、構成要素の品詞を選択基準として提供する代わりに、文字数を選択基準として提供してもよい。これによって、分類辞書作成部２２は、一定字数以下の構成要素について辞書を作成することができる。あるいは、要素絞り込み基準提供部３０は、選択基準として特定の構成要素（例えば、「自動車」という名詞）を提供してもよい。分類辞書作成部２２は、それと一致する構成要素は辞書データの作成対象から除外するようにしてもよい。例えば、極めて多数の文章中で使用されるありふれた名詞（例えば、「私」「物」）などはカテゴリ分類に与える影響が少ないので、除外することが好ましい。 Instead of providing the part of speech of a component as the selection criterion, the element refinement criterion providing unit 30 may provide a number of characters. This allows the classification dictionary creation unit 22 to create a dictionary only for components of at most a certain number of characters. Alternatively, the element refinement criterion providing unit 30 may provide a specific component (for example, the noun "automobile") as the selection criterion, and the classification dictionary creation unit 22 may exclude any component that matches it from dictionary data creation. For example, common nouns used in a very large number of sentences (such as "I" and "thing") have little influence on category classification, so it is preferable to exclude them.
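The lookup behaviour described for FIG. 6 (a criterion per category type, with "noun" as the fallback) can be sketched as a small table plus a filter. The key names, the part-of-speech labels, and the `(surface, part_of_speech)` tuple format are assumptions for illustration; the description does not specify how components or criteria are encoded.

```python
# Category type -> accepted parts of speech (key and label names are illustrative).
SELECTION_CRITERIA = {
    "theme": {"noun"},
    "style": {"adjective", "particle"},
    "age_gender": {"hiragana_noun"},
}
DEFAULT_CRITERION = {"noun"}  # returned when the category type is not registered


def is_hiragana(text: str) -> bool:
    """True if every character lies in the Unicode hiragana block."""
    return bool(text) and all("\u3041" <= ch <= "\u309f" for ch in text)


def select_components(components, category_type):
    """Keep the (surface, pos) components that satisfy the criterion for category_type."""
    criterion = SELECTION_CRITERIA.get(category_type, DEFAULT_CRITERION)
    selected = []
    for surface, pos in components:
        if pos in criterion:
            selected.append(surface)
        elif "hiragana_noun" in criterion and pos == "noun" and is_hiragana(surface):
            selected.append(surface)
    return selected


tokens = [("パチンコ", "noun"), ("業界", "noun"), ("を", "particle"),
          ("健全", "noun"), ("に", "particle"), ("育成", "noun")]
print(select_components(tokens, "theme"))
print(select_components(tokens, "style"))
```

A character-count criterion or an exclusion list of very common nouns, as mentioned above, could be added as further predicates in the same filter.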

図７は、カテゴリ帰属確率計算部２６の詳細な機能ブロック図である。カテゴリ帰属確率計算部２６は、構成要素受付部１２２、絞り込み情報受付部１２４、構成要素選択部１２６、辞書データ検索部１２８、出現確率算出部１３０および帰属確率算出部１３２を含む。 FIG. 7 is a detailed functional block diagram of the category attribution probability calculator 26. The category attribution probability calculation unit 26 includes a component reception unit 122, a narrowing information reception unit 124, a component selection unit 126, a dictionary data search unit 128, an appearance probability calculation unit 130, and an attribution probability calculation unit 132.

構成要素受付部１２２は、ソート部２０から所定の規則にしたがって並べ替えられた構成要素を受け取る。絞り込み情報受付部１２４は、要素絞り込み基準提供部３０から選択基準を受け取り、構成要素選択部１２６に渡す。構成要素選択部１２６は、選択基準と構成要素とを比較して、選択基準を満たす構成要素を選択して辞書データ検索部１２８に渡す。辞書データ検索部１２８は、分類辞書保持部２４に保持されている辞書データのなかから、選択基準を満たした構成要素と同一の構成要素についての辞書データがあるか検索し、対応する辞書データがある場合は、出現確率算出部１３０に渡す。 The component receiving unit 122 receives the components rearranged from the sorting unit 20 according to a predetermined rule. The refinement information receiving unit 124 receives the selection criterion from the element refinement criterion providing unit 30 and passes it to the component selection unit 126. The component selection unit 126 compares the selection criterion with the component, selects a component that satisfies the selection criterion, and passes it to the dictionary data search unit 128. The dictionary data search unit 128 searches the dictionary data held in the classification dictionary holding unit 24 for dictionary data for the same component as the component that satisfies the selection criteria, and the corresponding dictionary data is found. If there is, it is passed to the appearance probability calculation unit 130.

出現確率算出部１３０は、各カテゴリｍについて、未分類データから抽出された各構成要素Ｗ_ｎの出現確率ａ_ｎｍを計算する。ここで、出現確率ａ_ｎｍは、上述したカテゴリ文例または非カテゴリ文例に含まれる文章中への出現頻度Ｘ_ｎｍ、Ｙ_ｎｍを使用して、次式により算出される。 The appearance probability calculation unit 130 calculates the appearance probability a _nm of each component W _n extracted from the unclassified data for each category m. Here, the appearance probability a _nm is calculated by the following equation using the appearance frequencies X _nm and Y _nm in the sentence included in the above-described category sentence example or non-category sentence example.
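Equation 1 itself is not reproduced in this text. One form that is consistent with the worked example later in the description (it reproduces the values 0.714 and 0.043 given there for "pachinko") normalizes each frequency by the number of example sentences it was counted over. The symbols S_m and T_m, for the number of sentences in the category sentence examples and the non-category sentence examples of category m, are introduced here for illustration and do not appear in the original; the expression should be read as an inferred reconstruction, not as the published equation.

```latex
a_{nm} = \frac{X_{nm} / S_m}{X_{nm} / S_m + Y_{nm} / T_m}
```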

図８は、数１により算出された、カテゴリ１に対する各構成要素Ｗ_１〜Ｗ_Ｎの出現確率ａ_１１〜ａ_Ｎ１を示す。 FIG. 8 shows the appearance probabilities a_11 to a_N1 of the components W_1 to W_N for category 1, calculated by Equation 1.
なお、出現確率の算出は、数１に限られない。例えば、上述したＦ_ｎｍをそのまま使用してもよい。 Note that the calculation of the appearance probability is not limited to Equation 1. For example, the above-described F_nm may be used as it is.

帰属確率算出部１３２は、算出された出現確率をすべての構成要素について総計して、未分類文章データについてカテゴリ毎の帰属確率を算出する。好ましくは、帰属確率算出部１３２は、ベイジアンフィルタ法を使用して、次式によりカテゴリｎへの帰属確率Ｅ_ｎを算出する。 The attribution probability calculation unit 132 combines the calculated appearance probabilities over all the components to calculate, for the uncategorized sentence data, an attribution probability for each category. Preferably, the attribution probability calculation unit 132 uses the Bayesian filter method to calculate the attribution probability E_n for category n by the following equation.
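Equation 2 is also missing from this text. Its form can be read off Equations 3 and 4 in the worked example, which divide the product of the appearance probabilities by the sum of that product and the product of their complements. It is written here with the category index m, to match the a_nm notation used for the appearance probabilities, and with the products running over the components of the uncategorized sentence data that were found in the dictionary:

```latex
E_m = \frac{\prod_{n} a_{nm}}{\prod_{n} a_{nm} + \prod_{n} \left(1 - a_{nm}\right)}
```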

なお、ベイジアンフィルタ法以外の手法を使用して帰属確率を算出してもよい。例えば、すべての構成要素の出現確率を単に掛け合わせて帰属確率を算出してもよいし、出現確率の平均値を帰属確率としてもよい。 Note that the attribution probability may be calculated using a method other than the Bayesian filter method. For example, the attribution probability may be calculated by simply multiplying the appearance probabilities of all the constituent elements, or the average value of the appearance probabilities may be used as the attribution probability.
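The three ways of combining appearance probabilities mentioned above can be compared side by side. This is a minimal sketch: the function names are illustrative, the Bayesian form follows the structure of Equations 3 and 4 in the worked example, and degenerate inputs (an empty list, or a mix of exact 0 and 1 values that makes the denominator zero) are deliberately not handled.

```python
from math import prod


def bayesian_combination(probs):
    """Product of the probabilities over (that product + product of the complements)."""
    p = prod(probs)
    q = prod(1.0 - a for a in probs)
    return p / (p + q)


def simple_product(probs):
    """Alternative: multiply all appearance probabilities together."""
    return prod(probs)


def mean_probability(probs):
    """Alternative: use the average appearance probability."""
    return sum(probs) / len(probs)


probs = [0.5, 0.5]  # made-up values, chosen only so the results are easy to check by hand
print(bayesian_combination(probs), simple_product(probs), mean_probability(probs))
```

The simple product shrinks as more components are found, so its values are only comparable between categories scored over the same components, whereas the Bayesian form and the mean stay in a range that does not depend on the number of components.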

図９は、分類辞書を作成する処理過程を示すフローチャートである。 FIG. 9 is a flowchart showing the process of creating a classification dictionary.
まず、文章分解部１８は、文例格納部１２から一対のカテゴリ文例または非カテゴリ文例を取得する（Ｓ１０）。次に、文章分解部１８は、所定の文章分解アルゴリズムに基づいて、カテゴリ文例および非カテゴリ文例中の文章データを構成要素に分解し、ソート部２０は分解された構成要素を所定の規則にしたがって並べ替える（Ｓ１２）。なお、この並べ替えの実行は本実施の形態に必須ではなく、分類辞書保持部からのデータ検索時間が長くなるため演算速度は低下しうるが、カテゴリ分類の精度に影響を及ぼすことはない。 First, the sentence decomposition unit 18 acquires a pair of category sentence examples or non-category sentence examples from the sentence example storage unit 12 (S10). Next, the sentence decomposition unit 18 decomposes the sentence data in the category sentence examples and the non-category sentence examples into components based on a predetermined sentence decomposition algorithm, and the sorting unit 20 rearranges the decomposed components according to a predetermined rule (S12). This rearrangement is not essential to the present embodiment. Omitting it lengthens data retrieval from the classification dictionary holding unit and may therefore reduce processing speed, but it does not affect the accuracy of category classification.

次に、分類辞書作成部２２は、抽出されたひとつの構成要素について、要素絞り込み基準提供部３０から受け取った選択基準と比較して、辞書データの作成対象の構成要素であるか否かを判定する（Ｓ１４）。辞書データの作成対象でなければ（Ｓ１４のＮ）、Ｓ２４に進む。辞書データの作成対象であれば（Ｓ１４のＹ）、分類辞書作成部２２は分類辞書保持部２４からその構成要素についての辞書データを検索する（Ｓ１６）。対応する辞書データが存在した場合は（Ｓ１８のＹ）、今回の文例データ中に存在した構成要素の数を、辞書データ中のそのカテゴリの頻度に追加する（Ｓ２０）。対応する辞書データが存在しない場合は（Ｓ１８のＮ）、新たな辞書データを作成する（Ｓ２２）。そして、文章分解部１８で分解されたすべての構成要素について処理したか否かを判定し（Ｓ２４）、処理が終了していなければ（Ｓ２４のＮ）、別の構成要素についてＳ１４からの処理を繰り返す。すべての構成要素についての辞書データの作成が終了すると（Ｓ２４のＹ）、このフローを終了する。 Next, the classification dictionary creation unit 22 compares one extracted component with the selection criterion received from the element refinement criterion providing unit 30 to determine whether it is a component for which dictionary data is to be created (S14). If it is not (N in S14), the process proceeds to S24. If it is (Y in S14), the classification dictionary creation unit 22 searches the classification dictionary holding unit 24 for dictionary data for that component (S16). If corresponding dictionary data exists (Y in S18), the number of occurrences of the component in the current example sentence data is added to the frequency for that category in the dictionary data (S20). If no corresponding dictionary data exists (N in S18), new dictionary data is created (S22). It is then determined whether all the components produced by the sentence decomposition unit 18 have been processed (S24); if not (N in S24), the processing from S14 is repeated for another component. When dictionary data has been created for all the components (Y in S24), the flow ends.
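The loop in S14 to S24 can be sketched as a single update function. This is an illustration under assumptions: the nested-dict layout (`component -> category -> [X, Y]`), the function name, and the `accept` callback standing in for the selection criterion are not from the source. The sketch also creates a missing entry and then adds the count in one step, whereas the flowchart treats the update (S20) and the creation (S22) as separate branches.

```python
from collections import Counter


def update_dictionary(dictionary, components, category, in_category, accept):
    """Fold one batch of components into dictionary: component -> category -> [X, Y]."""
    # S14: keep only the components that satisfy the selection criterion.
    counts = Counter(c for c in components if accept(c))
    for component, n in counts.items():
        entry = dictionary.get(component)         # S16: look up existing dictionary data
        if entry is None:                         # S18 N -> S22: create new dictionary data
            entry = dictionary[component] = {}
        pair = entry.setdefault(category, [0, 0])
        pair[0 if in_category else 1] += n        # S20: add the count to the frequency
    return dictionary


dictionary = {}
update_dictionary(dictionary, ["a", "b", "a"], "c1", True, accept=lambda c: c != "b")
update_dictionary(dictionary, ["a"], "c1", False, accept=lambda c: True)
print(dictionary)
```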

図１０は、未分類文章データをカテゴリに分類する処理過程を示すフローチャートである。 FIG. 10 is a flowchart showing the process of classifying uncategorized sentence data into categories.
文章受付部３６は、未分類文章データを受け取る（Ｓ３０）。文章分解部１８は、好ましくは図９のＳ１２と同じ文章分解アルゴリズムに基づいて、未分類文章データ中の文章を構成要素に分解し、ソート部２０は分解された構成要素を所定の規則にしたがって並べ替える（Ｓ３２）。次に、カテゴリ帰属確率計算部２６は、抽出されたひとつの構成要素について、要素絞り込み基準提供部３０から受け取った選択基準と比較して、未分類文章データの帰属確率を計算するために、その構成要素の出現確率を計算するか否かを判定する（Ｓ３４）。出現確率の計算対象であれば（Ｓ３４のＹ）、カテゴリ帰属確率計算部２６は、分類辞書保持部２４からその構成要素についての辞書データを検索し、対応する辞書データがある場合は（Ｓ３６のＹ）、辞書データに基づいて、その構成要素の出現確率を各カテゴリについて算出する（Ｓ３８）。Ｓ３４で構成要素が出現確率の計算対象でなかった場合（Ｓ３４のＮ）、またはＳ３６で対応する辞書データが存在しなかった場合（Ｓ３６のＮ）は、Ｓ３８をスキップする。 The sentence receiving unit 36 receives uncategorized sentence data (S30). The sentence decomposition unit 18 decomposes the sentences in the uncategorized sentence data into components, preferably using the same sentence decomposition algorithm as in S12 of FIG. 9, and the sorting unit 20 rearranges the decomposed components according to a predetermined rule (S32). Next, the category attribution probability calculation unit 26 compares one extracted component with the selection criterion received from the element refinement criterion providing unit 30 to determine whether the appearance probability of that component should be calculated for use in the attribution probability of the uncategorized sentence data (S34). If so (Y in S34), the category attribution probability calculation unit 26 searches the classification dictionary holding unit 24 for dictionary data for that component and, if corresponding dictionary data exists (Y in S36), calculates the appearance probability of the component for each category based on the dictionary data (S38). If the component is not a target of appearance probability calculation (N in S34), or if no corresponding dictionary data exists (N in S36), S38 is skipped.
続いて、文章分解部１８によって抽出されたすべての構成要素について処理したか否かを判定し（Ｓ４０）、処理が終了していなければ（Ｓ４０のＮ）、別の構成要素についてＳ３４からの処理を繰り返す。 It is then determined whether all the components extracted by the sentence decomposition unit 18 have been processed (S40); if not (N in S40), the processing from S34 is repeated for another component.

すべての構成要素についての処理が終了すると（Ｓ４０のＹ）、カテゴリ帰属確率計算部２６は、上述した手順にしたがって、カテゴリ毎に未分類文章データの帰属確率を算出し（Ｓ４２）、判定部２８は、帰属確率に基づいて未分類文章データが属するカテゴリを判定する（Ｓ４４）。 When the processing for all the constituent elements is completed (Y in S40), the category attribution probability calculation unit 26 calculates the attribution probability of uncategorized sentence data for each category according to the above-described procedure (S42), and the determination unit 28 Determines the category to which the uncategorized text data belongs based on the attribution probability (S44).

（実施例） (Example)
以下、具体的な実施例に基づいて、本実施の形態に係る文章分類装置１０の動作を説明する。この実施例では、説明を簡単にするために、カテゴリとして「ギャンブル」「教育」の２つのカテゴリが準備されているものとする。また、辞書データは作成済みのものを用いることとする。 The operation of the sentence classification device 10 according to the present embodiment is described below based on a specific example. To simplify the explanation, this example assumes that two categories, "gambling" and "education", have been prepared, and that dictionary data has already been created.

図１１は、この実施例で使用される辞書データを示し、上述の全体説明における図５に対応する。この辞書データは、カテゴリ「ギャンブル」について、カテゴリ文例に含まれる３０の文章と非カテゴリ文例に含まれる１５の文章から抽出された構成要素、および、カテゴリ「教育」について、カテゴリ文例に含まれる２０の文章と非カテゴリ文例に含まれる１８の文章から抽出された構成要素についてのものである。使用された文章数は、欄１５４に「文章数」として示されている。 FIG. 11 shows the dictionary data used in this example and corresponds to FIG. 5 in the general description above. For the category "gambling", the dictionary data covers the components extracted from the 30 sentences in the category sentence examples and the 15 sentences in the non-category sentence examples. For the category "education", it covers the components extracted from the 20 sentences in the category sentence examples and the 18 sentences in the non-category sentence examples. The numbers of sentences used are shown in column 154 as "number of sentences".

図示するように、この辞書には、「パチンコ」「万馬券」「青少年」「健全」「育成」などの単語が構成要素として含まれている。そして、それぞれの構成要素に対して、カテゴリ毎の出現頻度情報を有している。構成要素「パチンコ」を例としてみると、カテゴリ「ギャンブル」に対して、カテゴリ文例の文章中の出現頻度は１０回、非カテゴリ文例の文章中の出現頻度は２回である。また、カテゴリ「教育」に対しては、カテゴリ文例の文章中の出現頻度は１回、非カテゴリ文例の文章中の出現頻度は２０回である。他の構成要素についても同様である。 As shown in the figure, this dictionary contains words such as "pachinko", "manbaken" (a long-odds betting ticket), "youth", "sound", and "nurturing" as components, and holds appearance frequency information per category for each of them. Taking the component "pachinko" as an example, for the category "gambling" it appears 10 times in the sentences of the category sentence examples and 2 times in the sentences of the non-category sentence examples. For the category "education", it appears once in the sentences of the category sentence examples and 20 times in the sentences of the non-category sentence examples. The same applies to the other components.

「総計」欄１５２は、カテゴリ文例の文章数および非カテゴリ文例の文章数を、すべての構成要素について足し合わせた数である。 The “total” column 152 is a number obtained by adding the number of sentences in the category sentence example and the number of sentences in the non-category sentence example for all the constituent elements.

このような辞書データが分類辞書保持部２４に保持されていることを前提に、未分類文章データとして「パチンコ業界を健全に育成しましょう。」という文章が、２つのカテゴリのいずれに分類されるかを説明する。この文章から、文章分解部１８により構成要素が抽出される。この実施例では、文章分解部１８は形態素解析によって文章を分解し、その結果、「パチンコ／業界／を／健全／に／育成／し／ましょ／う／。」のように、１０の構成要素が抽出される。続いて、カテゴリ帰属確率計算部２６内の構成要素選択部１２６は、要素絞り込み基準提供部３０から「名詞」という選択基準を受け取り、抽出された構成要素から名詞のみを選択する。したがって、「パチンコ」「業界」「健全」「育成」の４つの構成要素が選択されることになる。 Assuming that this dictionary data is held in the classification dictionary holding unit 24, the following explains into which of the two categories the uncategorized sentence 「パチンコ業界を健全に育成しましょう。」 ("Let's nurture the pachinko industry soundly.") is classified. The sentence decomposition unit 18 extracts components from this sentence. In this example, the sentence decomposition unit 18 decomposes the sentence by morphological analysis, which yields ten components: 「パチンコ／業界／を／健全／に／育成／し／ましょ／う／。」 (pachinko / gyokai / o / kenzen / ni / ikusei / shi / masho / u / and the full stop). Next, the component selection unit 126 in the category attribution probability calculation unit 26 receives the selection criterion "noun" from the element refinement criterion providing unit 30 and selects only the nouns from the extracted components. As a result, four components are selected: "pachinko", "industry", "sound", and "nurturing".

辞書データ検索部１２８は、対応する辞書データを分類辞書保持部２４から検索する。この場合、「パチンコ」「健全」「育成」の３つの辞書データが得られる。出現確率算出部１３０は、この辞書データにおける３つの構成要素の出現頻度から、上記数１を使用して各カテゴリについての出現確率を算出する。その結果を図１２に示す。構成要素「パチンコ」を例としてみると、カテゴリ「ギャンブル」についての出現確率は０．７１４、カテゴリ「教育」についての出現確率は０．０４３である。他の構成要素についても同様である。 The dictionary data retrieval unit 128 retrieves corresponding dictionary data from the classification dictionary holding unit 24. In this case, three dictionary data of “pachinko”, “sound” and “nurturing” are obtained. The appearance probability calculation unit 130 calculates the appearance probability for each category using the above equation 1 from the appearance frequencies of the three components in the dictionary data. The result is shown in FIG. Taking the component “Pachinko” as an example, the appearance probability for the category “gambling” is 0.714, and the appearance probability for the category “education” is 0.043. The same applies to other components.
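The two "pachinko" values can be checked numerically. Because Equation 1 is not reproduced in this text, the function below assumes a form in which each frequency is divided by the number of example sentences it was counted over before the ratio is taken; that form is an inference from the figures in this example, and the function and parameter names are illustrative. A component with zero counts on both sides would cause a division by zero, which this sketch does not handle.

```python
def appearance_probability(x, y, n_category, n_non_category):
    """Assumed form of Equation 1: (X/S) / (X/S + Y/T), with S, T the sentence counts."""
    rate_in = x / n_category          # occurrences per category example sentence
    rate_out = y / n_non_category     # occurrences per non-category example sentence
    return rate_in / (rate_in + rate_out)


# "pachinko": X = 10, Y = 2 over 30 / 15 sentences for "gambling",
#             X = 1, Y = 20 over 20 / 18 sentences for "education".
gambling = appearance_probability(10, 2, 30, 15)
education = appearance_probability(1, 20, 20, 18)
print(round(gambling, 3), round(education, 3))  # 0.714 0.043
```

Under this assumed form, both values agree with the ones quoted for FIG. 12.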

帰属確率算出部１３２は、上記数２にしたがって、出現確率を使用して未分類文章データの帰属確率をカテゴリ毎に算出する。 The attribution probability calculation unit 132 calculates the attribution probability of the uncategorized sentence data for each category from the appearance probabilities, according to Equation 2 above.
具体的な数値を用いて説明する。図１３は、上記全体説明の図８に対応させて、カテゴリ「ギャンブル」「教育」についての出現確率ａ_ｎと（１−ａ_ｎ）をまとめた表である。カテゴリ「ギャンブル」に対しては、要素「パチンコ」の出現確率ａ_１１が０．７１４、要素「健全」の出現確率ａ_２１が０．２００、要素「育成」の出現確率ａ_３１が０．２７３であるから、数２にしたがって計算すると、未分類文章データのカテゴリ「ギャンブル」への帰属確率Ｅ_１は、以下のようにして算出される。 This is explained using specific numerical values. FIG. 13 is a table, corresponding to FIG. 8 of the general description, that summarizes the appearance probabilities a_n and (1 − a_n) for the categories "gambling" and "education". For the category "gambling", the appearance probability a_11 of the component "pachinko" is 0.714, the appearance probability a_21 of the component "sound" is 0.200, and the appearance probability a_31 of the component "nurturing" is 0.273. Calculating according to Equation 2, the attribution probability E_1 of the uncategorized sentence data to the category "gambling" is therefore obtained as follows.
（数３） (Equation 3)
E_1 = (0.714 × 0.200 × 0.273) / {(0.714 × 0.200 × 0.273) + (1 − 0.714) × (1 − 0.200) × (1 − 0.273)} ≈ 0.190

カテゴリ「教育」に対しては、要素「パチンコ」の出現確率ａ_１２が０．０４３、要素「健全」の出現確率ａ_２２が０．７８３、要素「育成」の出現確率ａ_３２が０．８４４であるから、数２にしたがって計算すると、未分類文章データのカテゴリ「教育」への帰属確率Ｅ_２は、以下のようにして算出される。 For the category "education", the appearance probability a_12 of the component "pachinko" is 0.043, the appearance probability a_22 of the component "sound" is 0.783, and the appearance probability a_32 of the component "nurturing" is 0.844. Calculating according to Equation 2, the attribution probability E_2 of the uncategorized sentence data to the category "education" is therefore obtained as follows.
（数４） (Equation 4)
E_2 = (0.043 × 0.783 × 0.844) / {(0.043 × 0.783 × 0.844) + (1 − 0.043) × (1 − 0.783) × (1 − 0.844)} ≈ 0.467

この結果、判定部２８は、「パチンコ業界を健全に育成しましょう。」という文章は、帰属確率の大きい方のカテゴリ「教育」に分類されると判定する。以上で、未分類文章データを分類する一連の処理が終了する。 As a result, the determination unit 28 determines that the sentence “Let's nurture the pachinko industry soundly” is classified into the category “education” having the higher attribution probability. Thus, a series of processes for classifying unclassified sentence data is completed.
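The arithmetic of Equations 3 and 4 and the final determination can be reproduced in a few lines. The appearance probabilities are the ones quoted in the example; the function and variable names are illustrative, and picking the single largest value is just the determination rule used in this particular example.

```python
from math import prod


def attribution_probability(probs):
    """Same structure as Equations 3 and 4: product / (product + product of complements)."""
    p = prod(probs)
    q = prod(1.0 - a for a in probs)
    return p / (p + q)


# Appearance probabilities of "pachinko", "sound", "nurturing" for each category.
appearance = {
    "gambling": [0.714, 0.200, 0.273],
    "education": [0.043, 0.783, 0.844],
}
scores = {category: attribution_probability(p) for category, p in appearance.items()}
best = max(scores, key=scores.get)
print({c: round(s, 3) for c, s in scores.items()}, best)
# {'gambling': 0.19, 'education': 0.467} education
```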

以上説明したように、本実施の形態の文章分類装置によれば、未分類文章データを予め定められたカテゴリに沿って自動的に分類することができる。 As described above, according to the sentence classification device of the present embodiment, uncategorized sentence data can be automatically classified according to a predetermined category.

ところで、従来から、ベイジアンフィルタ法を使用したスパムフィルタが知られている。このスパムフィルタは、スパムに属する文例集と、スパムに属さない文例集とから、各単語が含まれていた場合のスパム確率を算出しておき、検査対象の文章に出現する単語について、ベイズ理論にしたがってスパム確率を求めることによって、スパムメールを検出する。しかし、この方法では、ある文章が単一のカテゴリ、つまりこの場合ならば「スパムメール」というカテゴリに属するか否かの判定しかできない。 Spam filters using the Bayesian filter method are already known. Such a filter calculates in advance, from a collection of example sentences that are spam and a collection that are not, the spam probability associated with the presence of each word. It then detects spam mail by computing a spam probability, according to Bayesian theory, from the words that appear in the text under inspection. However, this method can only determine whether a text belongs to a single category, in this case the category "spam mail".

これに対し、本実施の形態の文章分類装置では、未分類文章データがいずれのカテゴリに属するかは、カテゴリ毎に算出される未分類文章データの帰属確率により判定される。したがって、ひとつの文章データをひとつのカテゴリに分類することもできるし、２つ以上のカテゴリに分類することもできる。また、ひとつの視点に基づくカテゴリについて分類するのみならず、複数の視点に基づくカテゴリを混合させておき、それらについてまとめて未分類文章データの帰属確率を算出することができる。具体的にいうと、一度の計算で、コンテンツの種類（例えば、政治／経済／社会）の分類と、文章のタイプ（ニュース記事／ブログ／エッセイ）のような分類とを同時に実行することができる。よって、「政治のニュース記事」「社会問題のエッセイ」というような、多軸的な視点に立った文章の分類も可能になる。 In contrast, in the sentence classification device of the present embodiment, the category to which uncategorized sentence data belongs is determined from the attribution probability calculated for each category. A single piece of sentence data can therefore be classified into one category or into two or more categories. In addition to classifying by categories based on a single viewpoint, categories based on several viewpoints can be mixed and the attribution probabilities of the uncategorized sentence data calculated for all of them together. Specifically, a single calculation can simultaneously classify by content type (for example, politics / economy / society) and by text type (news article / blog / essay). This makes it possible to classify texts along multiple axes, such as "political news article" or "essay on social issues".
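The description does not say how the determination unit decides that a text belongs to more than one category. One possible rule, shown here purely as an assumption, is to accept every category whose attribution probability reaches a threshold. The threshold of 0.5, the function name, and the scores below are all made up for illustration.

```python
def categories_above(scores, threshold=0.5):
    """Return every category whose attribution probability is at least the threshold."""
    return sorted(category for category, score in scores.items() if score >= threshold)


# Hypothetical scores for categories drawn from two viewpoints, computed in one pass.
scores = {
    "politics": 0.81, "economy": 0.12, "society": 0.30,   # content type
    "news article": 0.74, "blog": 0.22, "essay": 0.09,    # text type
}
print(categories_above(scores))  # ['news article', 'politics']
```

With categories from both viewpoints scored together, the accepted set reads as a combined label such as "political news article". Taking the top-scoring category within each viewpoint would be another reasonable rule.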

本実施の形態の文章分類装置は、以下に述べるような応用形態が想定される。 The sentence classification device of the present embodiment is assumed to be applied as described below.

応用形態１． Application 1.
ディレクトリ型の検索サイトを作成する際に、ウェブクローラが収集してきたウェブページのＨＴＭＬファイルを文章分類装置に与えることによって、ウェブページを様々な話題に基づくカテゴリに分類することができる。この分類結果を使用することで、ディレクトリ型の検索サイトの構築を容易にすることができる。なお、この応用形態では、文章データがＨＴＭＬファイルやＸＭＬファイルのヘッダ、タグ、本文のどの部分にあるかに応じて、分類辞書作成部が出現頻度の重み付けをしてもよい。 When creating a directory-type search site, the HTML files of web pages collected by a web crawler can be given to the sentence classification device to classify the web pages into categories based on various topics. Using this classification result makes it easier to build a directory-type search site. In this application, the classification dictionary creation unit may weight the appearance frequency according to where the text appears in the HTML or XML file: in the header, in a tag, or in the body.

応用形態２． Application 2.
電子掲示板システムにおいて、投稿者からネットワークを介して接続されたサーバに対して送信されてきた投稿データを文章分解装置に与えることによって、投稿データを内容に基づくカテゴリに分類することができる。これによって、投稿データを人手を介さずに自動的に分類して表示させることができる。また、投稿者も、投稿先を自ら選択することなく電子掲示板システムに対して投稿データを送信することができる。 In an electronic bulletin board system, post data sent by a contributor to a server connected via a network can be given to the sentence decomposition device to classify the post data into categories based on its content. This allows post data to be classified and displayed automatically, without human intervention. The contributor can also send post data to the electronic bulletin board system without having to choose where to post it.

応用形態３． Application 3.
カテゴリ文例としてスパムメールのデータを、非カテゴリ文例としてそれ以外のメールのデータを準備しておくことによって、スパムメールの検出フィルタとしても、文章分類装置を使用することができる。 By preparing spam mail data as the category sentence examples and other mail data as the non-category sentence examples, the sentence classification device can also be used as a spam mail detection filter.

以上、本発明をいくつかの実施の形態をもとに説明した。これらの実施の形態は例示であり、それらの各構成要素や各処理プロセスの組合せにいろいろな変形例が可能なこと、またそうした変形例も本発明の範囲にあることは当業者に理解されるところである。 The present invention has been described above based on several embodiments. Those skilled in the art will understand that these embodiments are illustrative, that various modifications can be made to the combinations of their components and processing steps, and that such modifications also fall within the scope of the present invention.

請求項に記載の各構成要件が果たすべき機能は、本実施例において示された各機能ブロックの単体もしくはそれらの連係によって実現されることも当業者には理解されるところである。 It should also be understood by those skilled in the art that the functions to be fulfilled by the constituent elements described in the claims are realized by the individual functional blocks shown in the present embodiment or their linkage.

図１４は、別の実施の形態に係る文章分類システムの構成を示す。この実施の形態では、カテゴリ文例および非カテゴリ文例から辞書を作成する辞書ユニット６０と、未分類文章データをカテゴリに分類する分類ユニット７０から文章分類システム１００が構築される。 FIG. 14 shows a configuration of a sentence classification system according to another embodiment. In this embodiment, a sentence classification system 100 is constructed from a dictionary unit 60 that creates a dictionary from category sentence examples and non-category sentence examples, and a classification unit 70 that classifies unclassified sentence data into categories.

辞書ユニット６０内の、文例格納部１２、文章分解部１８、ソート部２０、分類辞書作成部２２、分類辞書保持部２４および要素絞り込み基準提供部３０は、図２に関して説明したものと同様の機能を有する。辞書提供部６２は、分類ユニット７０などの外部装置から辞書データの提供が要求されたとき、分類辞書保持部２４から構成要素に対応する辞書データを検索して外部装置に送信する。 In the dictionary unit 60, the sentence example storage unit 12, the sentence decomposition unit 18, the sorting unit 20, the classification dictionary creation unit 22, the classification dictionary holding unit 24, and the element refinement criterion providing unit 30 have the same functions as those described with reference to FIG. 2. When an external device such as the classification unit 70 requests dictionary data, the dictionary providing unit 62 retrieves the dictionary data corresponding to the component from the classification dictionary holding unit 24 and transmits it to the external device.

The text receiving unit 36, the category attribution probability calculation unit 26, the determination unit 28, and the determination result storage unit 32 in the classification unit 70 likewise have the same functions as those described with reference to FIG. 2. The sentence decomposition unit 76 and the sorting unit 78 have the same functions as the sentence decomposition unit 18 and the sorting unit 20, respectively. For the constituent elements extracted from the unclassified sentence data 34, the category attribution probability calculation unit 26 requests dictionary data from the dictionary unit 60, receives the dictionary data transmitted from the dictionary providing unit 62, and executes the series of processes described above. The category attribution probability calculation unit 26 may also narrow down the constituent elements using narrowing information provided by the element narrowing unit 74.
Because the dictionary unit and the classification unit are configured separately in this way, the units can be placed at remote locations and connected via a network.
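
The separation of the two units can be illustrated with a minimal Python sketch. The class names, the `provide` and `lookup` methods, and the sample frequencies are hypothetical and chosen only for illustration. The transport between the units is reduced to a plain callable, which a networked deployment would replace with a remote call.

```python
from typing import Callable, Dict, Iterable

# constituent element -> {category -> appearance frequency}
DictionaryData = Dict[str, Dict[str, float]]


class DictionaryUnit:
    """Stands in for the dictionary unit 60 (hypothetical class name)."""

    def __init__(self, dictionary_data: DictionaryData) -> None:
        # Plays the role of the classification dictionary holding unit 24.
        self._data = dictionary_data

    def provide(self, components: Iterable[str]) -> DictionaryData:
        """Role of the dictionary providing unit 62: return only requested entries."""
        return {c: dict(self._data[c]) for c in components if c in self._data}


class ClassificationUnit:
    """Stands in for the classification unit 70 (hypothetical class name)."""

    def __init__(self, fetch: Callable[[Iterable[str]], DictionaryData]) -> None:
        # `fetch` hides the transport; over a network it would wrap a remote call.
        self._fetch = fetch

    def lookup(self, components: Iterable[str]) -> DictionaryData:
        # Deduplicate so that each constituent element is requested only once.
        return self._fetch(sorted(set(components)))


# Made-up sample data, not taken from the patent.
dictionary_unit = DictionaryUnit({
    "stock": {"finance": 12, "sports": 1},
    "goal": {"finance": 2, "sports": 9},
})
classifier = ClassificationUnit(dictionary_unit.provide)
entries = classifier.lookup(["stock", "goal", "stock", "unknown"])
```

Here `entries` contains only the elements that the dictionary unit knows about, so the classification unit never needs a local copy of the full dictionary.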

In another embodiment, data obtained when classifying unclassified sentence data may be reflected in the dictionary data in the classification dictionary holding unit 24. Specifically, when dictionary data corresponding to a constituent element extracted from the unclassified sentence data exists in the classification dictionary holding unit 24, the frequency information of that dictionary data is updated. In this way, the dictionary data is enriched each time unclassified sentence data is classified.
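
A minimal Python sketch of this feedback step follows. The text states only that the frequency information is updated, so the rule used here (add one to the frequency of the determined category for each occurrence of a known constituent element) is an assumption, as are the function and parameter names.

```python
from typing import Dict, Iterable

# constituent element -> {category -> appearance frequency}
DictionaryData = Dict[str, Dict[str, float]]


def reflect_classification(
    dictionary_data: DictionaryData,
    components: Iterable[str],
    category: str,
    increment: float = 1.0,
) -> int:
    """Add `increment` to the frequency of each known element in `category`.

    Elements without an existing dictionary entry are skipped, matching the
    condition in the text that the entry must already exist. The increment
    rule itself is an assumption. Returns the number of updates applied.
    """
    updated = 0
    for component in components:
        entry = dictionary_data.get(component)
        if entry is None:
            continue  # no dictionary data for this element; leave it out
        entry[category] = entry.get(category, 0.0) + increment
        updated += 1
    return updated
```

For example, calling `reflect_classification(data, ["stock", "unknown", "stock"], "finance")` on a dictionary that contains only `"stock"` applies two updates and leaves `"unknown"` out of the dictionary.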

When recording the appearance frequency for each category as dictionary data, the classification dictionary creation unit 22 may weight the constituent elements by part of speech. For example, the frequency may be recorded after being doubled if the constituent element is a noun, and after being multiplied by 0.1 if it is a particle. The category attribution probability calculation unit 26 may likewise weight the appearance probabilities according to the part of speech of each constituent element when calculating the attribution probability of the unclassified sentence data. This allows sentence data to be classified in a way that reflects how strongly each part of speech should influence the result.
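
The weighting at dictionary creation time can be sketched in Python as follows. The factors 2 for nouns and 0.1 for particles come from the example above. The weight of 1.0 for all other parts of speech, the tag names, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Tuple

# Factors from the example in the text: nouns count double, particles 0.1 times.
POS_WEIGHTS = {"noun": 2.0, "particle": 0.1}
DEFAULT_WEIGHT = 1.0  # assumption: other parts of speech are left unweighted

# (word, part of speech) -> {category -> weighted appearance frequency}
WeightedDictionary = DefaultDict[Tuple[str, str], DefaultDict[str, float]]


def record_weighted_frequencies(
    tokens: Iterable[Tuple[str, str]],
    category: str,
    dictionary_data: WeightedDictionary,
) -> None:
    """Add each token's part-of-speech weight to its frequency in `category`."""
    for word, pos in tokens:
        weight = POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
        dictionary_data[(word, pos)][category] += weight


def new_dictionary() -> WeightedDictionary:
    """Create an empty dictionary whose frequencies start at 0.0."""
    return defaultdict(lambda: defaultdict(float))
```

With this sketch, a noun that appears twice in a category's example sentences is recorded with a frequency of 4.0, while a particle that appears once is recorded with 0.1.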

FIG. 1 is a diagram showing an example of how the sentence classification device according to the present embodiment is used.
FIG. 2 is a functional block diagram of the sentence classification device.
FIG. 3 is a diagram showing an example in which sentence data is decomposed into constituent elements by morphological analysis.
FIG. 4 is a detailed functional block diagram of the classification dictionary creation unit.
FIG. 5 is a data structure diagram of the dictionary data stored in the classification dictionary holding unit.
FIG. 6 is a structure diagram of the narrowing criterion data stored in the element narrowing criterion providing unit.
FIG. 7 is a detailed functional block diagram of the category attribution probability calculation unit.
FIG. 8 is a diagram showing the appearance probabilities of constituent elements for one category.
FIG. 9 is a flowchart showing the process of creating the classification dictionary.
FIG. 10 is a flowchart showing the process of classifying unclassified sentence data into categories.
FIG. 11 is a diagram showing the dictionary data used in one example.
FIG. 12 is a diagram showing the result of calculating the appearance probability of each constituent element using the dictionary data of FIG. 11.
FIG. 13 is a diagram showing the result of calculating the attribution probability for each category using the appearance probabilities of FIG. 12.
FIG. 14 is a configuration diagram of a sentence classification system according to another embodiment.

Explanation of symbols

10 sentence classification device, 12 sentence example storage unit, 14 category sentence example data group, 16 non-category sentence example data group, 18 sentence decomposition unit, 20 sorting unit, 22 classification dictionary creation unit, 24 classification dictionary holding unit, 26 category attribution probability calculation unit, 28 determination unit, 30 element narrowing criterion providing unit, 32 determination result storage unit, 34 unclassified sentence data, 36 text receiving unit, 62 dictionary providing unit, 102 constituent element receiving unit, 104 narrowing information receiving unit, 106 category information providing unit, 108 constituent element selection unit, 110 dictionary data search unit, 112 dictionary data update unit, 122 constituent element receiving unit, 124 narrowing information receiving unit, 126 constituent element selection unit, 128 dictionary data search unit, 130 appearance probability calculation unit, 132 attribution probability calculation unit.

Claims

1. A sentence classification device comprising:
a classification dictionary holding unit that holds dictionary data in which, for a plurality of predetermined categories, each constituent element of a sentence is associated with the appearance frequency of that constituent element in sentences classified into each category;
a text receiving unit that receives, from an external device, unclassified sentence data to be newly classified;
a sentence decomposing unit that analyzes the unclassified sentence data according to a predetermined rule and extracts constituent elements of the sentence;
a category attribution probability calculation unit that refers to the dictionary data and calculates, for each category, an attribution probability representing the probability that the unclassified sentence data including the extracted constituent elements belongs to that category; and
a determination unit that refers to the attribution probabilities and determines the category into which the unclassified sentence data is classified.

2. The sentence classification device according to claim 1, wherein the category attribution probability calculation unit retrieves from the classification dictionary holding unit, for each constituent element extracted from the unclassified sentence data, the appearance frequency in the category for which the attribution probability is being calculated, calculates an appearance probability from that frequency, and obtains the attribution probability for the category by combining the appearance probabilities of the constituent elements.

3. The sentence classification device according to claim 1, wherein the classification dictionary holding unit holds dictionary data in which words and parts of speech obtained by morphological analysis of sentence data are associated with their appearance frequencies in each category.

4. The sentence classification device according to claim 1, wherein the classification dictionary holding unit holds the dictionary data in a state in which the constituent elements are rearranged according to a predetermined rule that facilitates searching.

5. The sentence classification device according to claim 3, further comprising an element narrowing criterion providing unit that stores a criterion for selecting constituent elements,
wherein the category attribution probability calculation unit receives the criterion from the element narrowing criterion providing unit and, among the constituent elements extracted from the unclassified sentence data, treats the constituent elements that satisfy the criterion as the targets of the appearance probability calculation.

6. The sentence classification apparatus according to claim 5, wherein the element narrowing-down criterion providing unit has a plurality of the criteria and gives different criteria to the category attribution probability calculating unit according to the type of category.

7. The sentence classification apparatus according to claim 5 or 6, wherein the category attribution probability calculation unit calculates the appearance probability after weighting the appearance frequency according to the part of speech of the corresponding constituent element.

8. A classification dictionary creation device comprising:
a sentence example storage unit that stores, for a plurality of predetermined categories, a category sentence example data group including sentence data to be classified into each category and a non-category sentence example data group including sentence data not classified into that category;
a sentence decomposition unit that analyzes sentence data according to a predetermined rule and extracts constituent elements of the sentence;
a classification dictionary creation unit that counts, for each category, the appearance frequency of each constituent element extracted by the sentence decomposition unit in the sentence data included in the category sentence example data group and the non-category sentence example data group;
a classification dictionary holding unit that holds dictionary data in which each constituent element is associated with its appearance frequency in each category; and
a dictionary providing unit that provides the dictionary data to an external device.

9. The classification dictionary creating apparatus according to claim 8, further comprising an element narrowing criterion providing unit that stores a criterion for selecting constituent elements,
wherein the classification dictionary creation unit receives the criterion from the element narrowing criterion providing unit and, among the constituent elements extracted from the sentence data included in the category sentence example data group and the non-category sentence example data group, treats the constituent elements that satisfy the criterion as the targets of dictionary data creation.

10. The classification dictionary creating apparatus according to claim 9, wherein the element narrowing-down criterion providing unit has a plurality of the criteria and gives different criteria to the category attribution probability calculating unit according to the type of category.

11. A sentence classification method comprising:
storing in storage means, for a plurality of predetermined categories, a category sentence example data group including sentence data to be classified into each category and a non-category sentence example data group including sentence data not classified into that category;
analyzing the sentence data according to a predetermined rule and extracting constituent elements of the sentence;
calculating, for each category, the appearance frequency of each extracted constituent element in the sentence data included in the category sentence example data group and the non-category sentence example data group;
storing dictionary data in which the frequency and the constituent element are associated;
receiving, from an external device, unclassified sentence data to be newly classified;
for each constituent element extracted from the unclassified sentence data, referring to the dictionary data and calculating, for each category, an attribution probability representing the probability that the unclassified sentence data including the extracted constituent elements belongs to that category; and
determining the category into which the unclassified sentence data is classified with reference to the attribution probabilities.