JP4075094B2 - Information classification device

Info

Publication number: JP4075094B2
Application number: JP09065697A
Authority: JP
Inventors: 研治水谷; 順小澤; 今中　　武
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1997-04-09
Filing date: 1997-04-09
Publication date: 2008-04-16
Anticipated expiration: 2017-04-09
Also published as: JPH10283366A

Description

【０００１】
【発明の属する技術分野】
本発明は、文字コードによって構成されるテキストを含むファイルをキーワードを付けて分類する装置に関するものである。
【０００２】
【従来の技術】
データベース・システムは一般に、検索目的を持った利用者が目的のファイルに容易に到達できるように、キーワード論理式などを入力するインタフェースを用意している。しかしながら、特に検索目的を持たず、データベースの中にどのようなファイルが収納されているかということに興味を持つ利用者にとっては、このようなインタフェースはあまり役に立たない。データベースの内容の一覧を提供するために、従来は、データベースの管理者があらかじめ固定的な概念体系を用意して、新しく追加するファイルの内容を理解してその体系における位置を決定したり、ファイルの提供者が位置を指定したり、あるいは、すでに手作業で分類したファイルとキーワードを比較して最も近い位置に自動分類して、利用者に分類結果を提供していた。
【０００３】
【発明が解決しようとする課題】
前述のあらかじめ固定的な概念体系を用意する方法では、新しい概念を持ったファイルが出現したときに利用者にその存在が伝わらないという問題が生じる。自動分類では、すでに分類されているファイルとキーワードが１つでも一致すれば概念的に近いと判断されて既存の概念に分類されるだけである。したがって、適当な時期に概念体系を修正して分類をやり直す必要があるが、その作業はデータベースの規模に比例して膨大な量になる。
【０００４】
本発明は、固定的な概念体系を利用するのではなく、ファイルに含まれるキーワードを概念として利用し、ファイルをそれに属する集合として自動分類して、ユーザにデータベースの内容の一覧を提供することを目的とする。
【０００５】
【課題を解決するための手段】
本発明は、文字コードによって構成されるテキストを含むファイルを格納するファイル格納手段と、前記ファイル格納手段が出力するファイルのテキスト部分に対して形態素解析を行って前記ファイルの識別子と共に出力する形態素解析手段と、前記ファイルを分類するためのキーワードを格納するキーワード格納手段と、前記キーワード格納手段の出力と前記形態素解析手段の出力とを入力として、前記形態素解析の結果の中から前記キーワード格納手段で格納されているキーワードと一致するキーワードを収集して前記ファイルの識別子と共に出力するキーワード収集手段と、前記キーワード収集手段で収集されたキーワードから、前記キーワードを含むファイルの数が所定の数より多くなるように、キーワードの組み合わせを選択する初期キーワード選択手段と、前記初期キーワード選択手段で選択されたキーワードと、選択された前記キーワードを含むテキストファイルの識別子から、ファイルをキーワードで分類して表示する表示手段と、利用者が前記初期キーワード選択手段で選択されたキーワードでの分類結果と異なる分類結果を要求する入力手段と、
前記キーワード収集手段で収集したキーワードでかつ前記初期キーワード選択手段で選択されたキーワード以外のキーワードを用いて、前記初期キーワード選択手段で選択されたキーワードとは異なるキーワードの組み合わせを選択する分類キーワード洗練手段と、
前記分類キーワード洗練手段で選択されたキーワードと前記キーワードが含まれる前記テキストファイルの識別子を要素とする集合を生成するファイル集合生成手段と、前記表示手段において、前記分類キーワード洗練手段で選択されたキーワードと、前記ファイル集合生成手段で生成されたファイルの識別子の集合とをキーワードで分類して表示しなおし、初期キーワード選択手段は、キーワード収集手段で収集したキーワードから、前記キーワードが出現するファイル数が多い順に並べ、最上位から一定数のキーワードを分類するキーワードとして選択し、所定のキーワードが含まれるテキストが、下位キーワードにも含まれた場合には、下位のキーワードにファイルの識別子を割り当てることを特徴とする情報分類装置である。
【０００６】
【発明の実施の形態】
本発明の一実施形態は、本発明は、文字コードによって構成されるテキストを含むファイルを格納するファイル格納手段と、前記ファイル格納手段が出力するファイルのテキスト部分に対して形態素解析を行って前記ファイルの識別子と共に出力する形態素解析手段と、前記ファイルを分類するためのキーワードを格納するキーワード格納手段と、前記キーワード格納手段の出力と前記形態素解析手段の出力とを入力として、前記形態素解析の結果の中から前記キーワード格納手段で格納されているキーワードと一致するキーワードを収集して前記ファイルの識別子と共に出力するキーワード収集手段と、前記キーワード収集手段で収集されたキーワードから、前記キーワードを含むファイルの数が多くなるように、キーワードの組み合わせを選択する初期キーワード選択手段と、前記初期キーワード選択手段で選択されたキーワードと、選択された前記キーワードを含むテキストファイルの識別子から、ファイルをキーワードで分類して表示する表示手段と、利用者が前記初期キーワード選択手段で選択されたキーワードでの分類結果と異なる分類結果を要求する入力手段と、前記キーワード収集手段で収集したキーワードでかつ前記初期キーワード選択手段で選択されたキーワード以外のキーワードを用いて、前記初期キーワード選択手段で選択されたキーワードとは異なるキーワードの組み合わせを選択する分類キーワード洗練手段と、前記分類キーワード洗練手段で選択されたキーワードと前記キーワードが含まれる前記テキストファイルの識別子を要素とする集合を生成するファイル集合生成手段と、前記表示手段において、前記分類キーワード洗練手段で選択されたキーワードと、前記ファイル集合生成手段で生成されたファイルの識別子の集合とをキーワードで分類して表示しなおすことを特徴とする情報分類装置である。
【０００７】
更に、初期キーワード選択手段は、キーワード収集手段で収集したキーワードから、前記キーワードが出現するファイル数が多い順に並べ、最上位から一定数のキーワードを分類するキーワードとして選択し、所定のキーワードが含まれるテキストが、下位キーワードにも含まれた場合には、下位のキーワードにファイルの識別子を割り当てるものである。
【０００８】
また、分類キーワード洗練手段において、前記キーワード収集手段で収集したキーワードでかつ前記初期キーワード選択手段で選択されたキーワード以外のキーワードを用いて、前記キーワードが出現するファイル数が多い順に並べ、前記初期キーワード選択手段で選択されたキーワードを、前記順に並べた後に並べ、最上位から一定数のキーワードを洗練されたキーワードとして選択するものである。
【００１１】
本発明の一実施の形態の情報分類装置全体の構成を表すブロック図を図１に示す。ファイル格納手段１０１は、文字コードによって構成されるテキストを含むファイルを格納する。形態素解析手段１０２は、ファイル格納手段１０１のファイルのテキスト部分に対して形態素解析を行ってファイルの識別子と共に出力する。キーワード格納手段１０３は、ファイルを分類するためのキーワードを格納する。キーワード収集手段１０４は、形態素解析の結果の中からキーワード格納手段１０３に格納されているキーワードだけを収集して、ファイルの識別子と共に出力する。情報分類手段１０５は、ファイルの識別子をそれに付随するキーワードによって分類する。表示手段１０６は、ファイル格納手段１０１のファイルの内容を分類結果に従って利用者に提供する。入力手段１０７は、利用者が提供された分類結果と異なる分類を希望するときに、情報分類手段１０５にその要求を伝える。
【００１２】
次に本実施の形態の動作を説明する。例として、図２に示すラーメンの飲食店について記述した５つのファイルがファイル格納手段１０１に格納されているとする。それぞれのファイルの識別子は、
file1, file2, file3, file4, file5
である。
【００１３】
形態素解析手段１０２は、ファイル格納手段１０１に格納されているファイルのテキスト部分について形態素解析を行い、ファイルの識別子と共に出力する。図２に示すファイルについて、形態素解析手段１０２が処理した結果の、名詞のみを取り出した結果を図３に示す。
【００１４】
キーワード格納手段１０３には、分類に使用するキーワードを列挙する。例を図４に示す。
【００１５】
キーワード収集手段１０４は、形態素解析手段１０２の出力の中から、キーワード格納手段１０３に格納されている単語だけを取り出して、ファイルの識別子と共に出力する。図３の形態素解析の結果を、キーワード収集手段１０４が処理した結果を図５に示す。
【００１６】
情報分類手段１０５は、キーワード収集手段１０４が出力するキーワードを持つファイルの識別子の集合を、キーワードで分類して出力する。情報分類手段１０５の詳細な構成を示すブロック図を図６に示す。
【００１７】
初期キーワード選択手段６０１は、キーワードをそれが出現するファイル数が多い順に並べ、最上位から一定数のキーワードを分類キーワード集合として選択する。図５のキーワード収集手段１０４の出力を、キーワードを横軸として出現ファイル数が多い順に左から並べた結果を図７に示す。分類キーワード集合として選択するキーワードの数を２とすると、分類キーワードの集合は、
｛ラーメン、しょうゆ味｝
となる。
【００１８】
分類キーワード洗練手段６０２は、初期キーワード選択手段６０１が出力する分類キーワード集合に含まれるキーワードが、より多くのファイルに出現するように他のキーワードと置換する。まず、分類キーワード集合に含まれるキーワードが出現するファイルの数を評価関数とする。そして、分類キーワード集合に含まれる１つのキーワードを、まだ分類キーワード集合に含まれたことがないキーワードと置換する操作を、評価関数の値が増加する限り繰り返す。図７の例で、分類キーワード集合が、
｛ラーメン、しょうゆ味｝
に設定されているとき、評価関数の値は４である。分類集合に含まれるキーワードの「ラーメン」と「しょうゆ味」を、まだ分類集合に含まれたことがないキーワードの「焼き豚」と置換し、評価関数の値を計算するといずれの場合も４である。したがって、評価関数の値が増加しないので、分類キーワード洗練手段６０２は分類キーワード集合を、
｛ラーメン、しょうゆ味｝
として出力する。
【００１９】
ファイル集合生成手段６０３は、分類キーワード洗練手段６０２が出力する分類キーワード集合に従って、ファイルの識別子を分類する。まず、分類キーワード集合に含まれるキーワードを、それが出現するファイル数が多い順に並べて、キーワードに割り当てるファイルの識別子の集合を、そのキーワードよりも下位のキーワードが１つも出現しないファイルの識別子に限定する。図７の例で分類キーワード集合が、
｛ラーメン、しょうゆ味｝
であれば、キーワード「ラーメン」が出現するファイルは、
{file1, file3, file5}
であるが、file1にはそれよりも下位のキーワード「しょうゆ味」が出現するので、各キーワードに割り当てるファイルの識別子の集合は、
ラーメン：{file3, file5}
しょうゆ味：{file1, file4}
となる。また、分類キーワード集合に含まれるキーワードが１つも出現しないファイルについては、特殊キーワード「その他」を分類キーワード集合に追加し、それにファイルの識別子を割り当てる。図７の例では、
その他：{file2}
となり、ファイル集合生成手段６０３から情報分類手段１０５の結果として、
｛ラーメン：{file3, file5}、しょうゆ味：{file1, file4}、その他：{file2}｝
が出力される。
【００２０】
再帰的分類制御手段６０４は、ファイル集合生成手段６０３が分類した結果をさらに細分類するときに使用する。すなわち、すでに分類されたファイルの識別子とそのファイルに出現するキーワードの集合を初期キーワード選択手段６０１に与えることで、分類されたファイルの識別子をさらにキーワードで分類する。
【００２１】
表示手段１０６は、情報分類手段１０５の結果を木構造に変換して、利用者にデータベースの内容の一覧を提供する。情報分類手段１０５の出力が、
｛ラーメン：{file3, file5}、しょうゆ味：{file1, file4}、その他：{file2}｝
のときは、図８に示すような出力結果が得られる。利用者は、この出力結果を見て、他の分類結果を要求したいときに入力手段１０７を用いる。入力手段１０７は情報分類手段１０５に接続され、初期キーワード選択手段６０１にその要求が伝えられる。
【００２２】
初期キーワード選択手段６０１は、キーワード収集手段１０４が出力した結果から最近選択した分類キーワード集合を記憶している。入力手段１０７から利用者の要求が伝えられると、最近並べたキーワードの列について、最近選択した分類キーワードに含まれるキーワードの列を、最下位に順序を保存して移動した後、最上位から一定数のキーワードを分類キーワード集合として選択して出力する。図７の例では、分類キーワードとして
｛ラーメン、しょうゆ味｝
を最近選択したので、それを順序を保存して最下位のキーワード「焼き豚」の次に移動し、図９のようなキーワードの列を作る。そして最上位から２つのキーワードを選択して、分類キーワード集合、
｛焼き豚、ラーメン｝
を選択して出力する。分類キーワード洗練手段６０２以降の処理は同様であり、情報分類装置の出力として、
｛ラーメン：{file1, file5}、焼き豚：{file2, file3}、その他：{file4}｝
が出力される。表示手段１０６には、前回の図８の分類結果とは異なる、図１０に示すようなデータベースの内容の一覧が利用者に提供される。
【００２３】
なお、本発明は文字コードによって構成されるテキストを含むファイルであればどのような種類のファイルでも分類することができる。ファイルをインターネット上のホームページを構成するＨＴＭＬファイル、ファイルの識別子をそのＵＲＬアドレスとすれば、本発明の情報分類装置をホームページの分類システムとして利用することができる。
【００２４】
【発明の効果】
以上述べたところから明らかなように、本発明は、キーワードを概念として利用し、文字コードによって構成されるテキストを含むファイルを自動分類するので、新しい概念を持ったファイルが出現しても、キーワードを保守するだけで容易に概念体系の更新が可能であり、利用者にデータベースの内容の一覧を迅速に提供できるという長所を有する。
【図面の簡単な説明】
【図１】本発明の一実施の形態の情報分類装置の全体の構成を表すブロック図
【図２】同実施の形態の動作を説明するための図１のファイル格納手段１０１の一例を示す図
【図３】同実施の形態の動作を説明するための図１の形態素解析手段１０２の出力の一例を示す図
【図４】同実施の形態の動作を説明するための図１のキーワード格納手段１０３の一例を示す図
【図５】同実施の形態の動作を説明するための図１のキーワード収集手段１０４の出力の一例を示す図
【図６】同実施の形態の動作を説明するための図１の情報分類手段１０５の詳細なブロック図
【図７】同実施の形態の動作を説明するための図６の初期キーワード選択手段６０１の内部状態の一例を示す図
【図８】同実施の形態の動作を説明するための図１の表示手段１０６の一例を示す図
【図９】同実施の形態の動作を説明するための図６の初期キーワード選択手段６０１の内部状態の一例を示す図
【図１０】同実施の形態の動作を説明するための図１の表示手段１０６の一例を示す図
【符号の説明】
１０１ファイル格納手段
１０２形態素解析手段
１０３キーワード格納手段
１０４キーワード収集手段
１０５情報分類手段
１０６表示手段
１０７入力手段
６０１初期キーワード選択手段
６０２分類キーワード洗練手段
６０３ファイル集合生成手段
６０４再帰的分類制御手段

[0001]
[Technical field of the invention]
The present invention relates to an apparatus that classifies files containing text composed of character codes by attaching keywords to them.
[0002]
[Prior art]
In general, a database system provides an interface for entering keyword logical expressions and the like, so that a user with a specific search goal can easily reach the target file. However, such an interface is of little use to a user who has no particular search goal and simply wants to know what kinds of files the database contains. To provide an overview of the database contents, conventional systems have presented a classification result to the user in one of three ways: the database administrator prepares a fixed concept system in advance, reads each newly added file, and decides its position in that system; the provider of the file specifies the position; or the file is automatically assigned to the closest position by comparing its keywords with those of files already classified by hand.
[0003]
[Problems to be solved by the invention]
In the method that prepares a fixed concept system in advance, when a file embodying a new concept appears, its existence is not conveyed to the user. In automatic classification, a file is judged conceptually close to the already classified files if even one keyword matches, and it is simply classified under an existing concept. The concept system therefore has to be revised and the classification redone at appropriate intervals, and the amount of work grows in proportion to the size of the database.
[0004]
The object of the present invention is to provide the user with an overview of the database contents by using the keywords contained in the files as concepts, instead of a fixed concept system, and automatically classifying each file into the set belonging to such a keyword.
[0005]
[Means for Solving the Problems]
The present invention is an information classification apparatus comprising: file storage means for storing files containing text composed of character codes; morphological analysis means for performing morphological analysis on the text portion of a file output by the file storage means and outputting the result together with the identifier of the file; keyword storage means for storing keywords for classifying the files; keyword collection means which takes the output of the keyword storage means and the output of the morphological analysis means as input, collects from the result of the morphological analysis the keywords that match keywords stored in the keyword storage means, and outputs them together with the identifier of the file; initial keyword selection means for selecting, from the keywords collected by the keyword collection means, a combination of keywords such that the number of files containing the keywords is greater than a predetermined number; display means for classifying and displaying the files by keyword, based on the keywords selected by the initial keyword selection means and the identifiers of the text files containing the selected keywords; input means by which the user requests a classification result different from the one based on the keywords selected by the initial keyword selection means;
classification keyword refinement means for selecting a combination of keywords different from the keywords selected by the initial keyword selection means, using keywords that were collected by the keyword collection means and are other than the keywords selected by the initial keyword selection means; and
file set generation means for generating sets whose elements are a keyword selected by the classification keyword refinement means and the identifiers of the text files containing that keyword; wherein the display means classifies by keyword and redisplays the keywords selected by the classification keyword refinement means and the sets of file identifiers generated by the file set generation means, and wherein the initial keyword selection means arranges the keywords collected by the keyword collection means in descending order of the number of files in which each keyword appears, selects a fixed number of keywords from the top as classification keywords, and, when text containing a given keyword also contains a lower-ranked keyword, assigns the identifier of the file to the lower-ranked keyword.
[0006]
[Embodiments of the invention]
One embodiment of the present invention is an information classification apparatus comprising: file storage means for storing files containing text composed of character codes; morphological analysis means for performing morphological analysis on the text portion of a file output by the file storage means and outputting the result together with the identifier of the file; keyword storage means for storing keywords for classifying the files; keyword collection means which takes the output of the keyword storage means and the output of the morphological analysis means as input, collects from the result of the morphological analysis the keywords that match keywords stored in the keyword storage means, and outputs them together with the identifier of the file; initial keyword selection means for selecting, from the keywords collected by the keyword collection means, a combination of keywords such that the number of files containing the keywords becomes large; display means for classifying and displaying the files by keyword, based on the keywords selected by the initial keyword selection means and the identifiers of the text files containing the selected keywords; input means by which the user requests a classification result different from the one based on the keywords selected by the initial keyword selection means; classification keyword refinement means for selecting a combination of keywords different from the keywords selected by the initial keyword selection means, using keywords that were collected by the keyword collection means and are other than the keywords selected by the initial keyword selection means; and file set generation means for generating sets whose elements are a keyword selected by the classification keyword refinement means and the identifiers of the text files containing that keyword; wherein the display means classifies by keyword and redisplays the keywords selected by the classification keyword refinement means and the sets of file identifiers generated by the file set generation means.
[0007]
Further, the initial keyword selection means arranges the keywords collected by the keyword collection means in descending order of the number of files in which each keyword appears and selects a fixed number of keywords from the top as classification keywords; when text containing a given keyword also contains a lower-ranked keyword, the file identifier is assigned to the lower-ranked keyword.
[0008]
In addition, the classification keyword refinement means takes the keywords that were collected by the keyword collection means and are other than the keywords selected by the initial keyword selection means, arranges them in descending order of the number of files in which each keyword appears, places the keywords selected by the initial keyword selection means after them, and selects a fixed number of keywords from the top as the refined keywords.
[0011]
FIG. 1 is a block diagram showing the overall configuration of an information classification apparatus according to one embodiment of the present invention. The file storage means 101 stores files containing text composed of character codes. The morphological analysis means 102 performs morphological analysis on the text portion of the files in the file storage means 101 and outputs the result together with the file identifier. The keyword storage means 103 stores the keywords used to classify files. The keyword collection means 104 collects, from the result of the morphological analysis, only the keywords stored in the keyword storage means 103 and outputs them together with the file identifier. The information classification means 105 classifies the file identifiers by the keywords associated with them. The display means 106 presents the contents of the files in the file storage means 101 to the user according to the classification result. The input means 107 passes a request to the information classification means 105 when the user wants a classification different from the one presented.
[0012]
Next, the operation of this embodiment is described. As an example, assume that the five files shown in FIG. 2, each describing a ramen restaurant, are stored in the file storage means 101. The identifiers of the files are:
file1, file2, file3, file4, file5
[0013]
The morphological analysis means 102 performs morphological analysis on the text portion of the files stored in the file storage means 101 and outputs the result together with the file identifier. FIG. 3 shows the nouns extracted from the result of processing the files of FIG. 2 with the morphological analysis means 102.
[0014]
The keyword storage means 103 holds a list of the keywords used for classification. An example is shown in FIG. 4.
[0015]
The keyword collection means 104 extracts, from the output of the morphological analysis means 102, only the words stored in the keyword storage means 103 and outputs them together with the file identifier. FIG. 5 shows the result of processing the morphological analysis result of FIG. 3 with the keyword collection means 104.
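The filtering step performed by the keyword collection means 104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `collect_keywords`, the English keyword labels, and the sample noun lists are all assumptions, since the contents of FIG. 3 and FIG. 4 are not reproduced in the text.

```python
def collect_keywords(morphemes_by_file, keyword_store):
    """Keep only the morphemes that are registered keywords, per file identifier."""
    allowed = set(keyword_store)
    collected = {}
    for file_id, morphemes in morphemes_by_file.items():
        kept = []
        for morpheme in morphemes:
            # Keep first occurrences only, in the order they appear in the file.
            if morpheme in allowed and morpheme not in kept:
                kept.append(morpheme)
        collected[file_id] = kept
    return collected


# Hypothetical noun lists standing in for the output of the morphological analysis.
morphemes = {
    "file1": ["ramen", "shop", "soy sauce flavor", "ramen"],
    "file2": ["roast pork", "station"],
}
# Hypothetical contents of the keyword storage.
keywords = ["ramen", "soy sauce flavor", "roast pork"]

collected = collect_keywords(morphemes, keywords)
print(collected)
```

Words such as "shop" and "station" are dropped because they are not in the keyword list, which is the only filtering criterion the description gives.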
[0016]
The information classification means 105 classifies, by keyword, the set of identifiers of the files that have the keywords output by the keyword collection means 104, and outputs the result. A block diagram showing the detailed configuration of the information classification means 105 is shown in FIG. 6.
[0017]
The initial keyword selection means 601 arranges the keywords in descending order of the number of files in which each appears, and selects a fixed number of keywords from the top as the classification keyword set. FIG. 7 shows the output of the keyword collection means 104 in FIG. 5 with the keywords on the horizontal axis, arranged from the left in descending order of the number of files in which they appear. If the number of keywords to select for the classification keyword set is 2, the classification keyword set is
{ramen, soy sauce flavor}
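A minimal sketch of this ranking and top-N selection is shown below. The per-file keyword lists are reconstructed from the worked example in the description (FIG. 5 itself is not reproduced), the English labels are illustrative, and the tie-breaking rule (first-seen order) is an assumption because the description does not specify one.

```python
from collections import Counter

# Keyword lists per file, reconstructed from the worked example in the text.
keywords_by_file = {
    "file1": ["ramen", "soy sauce flavor"],
    "file2": ["roast pork"],
    "file3": ["ramen", "roast pork"],
    "file4": ["soy sauce flavor"],
    "file5": ["ramen"],
}


def rank_keywords(keywords_by_file):
    """Order keywords by the number of files in which they appear, descending."""
    counts = Counter()
    for kws in keywords_by_file.values():
        for kw in dict.fromkeys(kws):  # count each keyword once per file
            counts[kw] += 1
    # sorted() is stable, so ties keep first-seen order (an assumption).
    return sorted(counts, key=lambda kw: -counts[kw])


def select_initial(keywords_by_file, n):
    """Take the top-n keywords as the classification keyword set."""
    return rank_keywords(keywords_by_file)[:n]


ranked = rank_keywords(keywords_by_file)
initial = select_initial(keywords_by_file, 2)
print(ranked)
print(initial)
```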
[0018]
The classification keyword refinement means 602 replaces keywords in the classification keyword set output by the initial keyword selection means 601 with other keywords so that the set appears in more files. First, the evaluation function is defined as the number of files in which a keyword included in the classification keyword set appears. Then the operation of replacing one keyword in the classification keyword set with a keyword that has never been included in the set is repeated as long as the value of the evaluation function increases. In the example of FIG. 7, when the classification keyword set is
{ramen, soy sauce flavor}
the value of the evaluation function is 4. Replacing either "ramen" or "soy sauce flavor" with "roast pork", the one keyword that has not yet been included in the set, gives an evaluation value of 4 in both cases. Since the value does not increase, the classification keyword refinement means 602 outputs the classification keyword set unchanged as
{ramen, soy sauce flavor}
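The refinement loop reads as a simple hill climb on file coverage, which could be sketched as below. The data is reconstructed from the worked example; the function names, the order in which candidate swaps are tried, and the choice to accept the first improving swap are assumptions that the description leaves open.

```python
keywords_by_file = {
    "file1": ["ramen", "soy sauce flavor"],
    "file2": ["roast pork"],
    "file3": ["ramen", "roast pork"],
    "file4": ["soy sauce flavor"],
    "file5": ["ramen"],
}


def coverage(keyword_set, keywords_by_file):
    """Evaluation function: files containing at least one keyword of the set."""
    return sum(
        1 for kws in keywords_by_file.values()
        if any(k in kws for k in keyword_set)
    )


def refine(initial, candidates, keywords_by_file):
    """Swap in never-used keywords while the evaluation value strictly increases."""
    current = list(initial)
    ever_used = set(initial)
    best = coverage(current, keywords_by_file)
    improved = True
    while improved:
        improved = False
        for cand in candidates:
            if cand in ever_used:
                continue
            for i in range(len(current)):
                trial = current[:i] + [cand] + current[i + 1:]
                score = coverage(trial, keywords_by_file)
                if score > best:  # accept the first improving swap (assumption)
                    current, best = trial, score
                    ever_used.add(cand)
                    improved = True
                    break
            if improved:
                break
    return current, best


refined, value = refine(
    ["ramen", "soy sauce flavor"],
    ["ramen", "soy sauce flavor", "roast pork"],
    keywords_by_file,
)
print(refined, value)
```

Both possible swaps with "roast pork" leave the coverage at 4, so the set is returned unchanged, as in the description. The loop terminates because each accepted swap strictly increases a value bounded by the number of files.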
[0019]
The file set generation means 603 classifies the file identifiers according to the classification keyword set output by the classification keyword refinement means 602. First, the keywords in the classification keyword set are arranged in descending order of the number of files in which they appear, and the set of file identifiers assigned to each keyword is restricted to the files in which no lower-ranked keyword appears. In the example of FIG. 7, if the classification keyword set is
{ramen, soy sauce flavor}
then the files in which the keyword "ramen" appears are
{file1, file3, file5}
but the lower-ranked keyword "soy sauce flavor" also appears in file1, so the sets of file identifiers assigned to the keywords are
ramen: {file3, file5}
soy sauce flavor: {file1, file4}
Files in which none of the keywords in the classification keyword set appears are handled by adding the special keyword "other" to the set and assigning those file identifiers to it. In the example of FIG. 7 this gives
other: {file2}
and the file set generation means 603 outputs, as the result of the information classification means 105,
{ramen: {file3, file5}, soy sauce flavor: {file1, file4}, other: {file2}}
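The assignment rule (each file goes to the lowest-ranked classification keyword it contains, or to "other" if it contains none) can be sketched as follows. The data is reconstructed from the worked example, and `generate_file_sets` is a hypothetical name; ties in the ranking keep the input order, which is an assumption.

```python
keywords_by_file = {
    "file1": ["ramen", "soy sauce flavor"],
    "file2": ["roast pork"],
    "file3": ["ramen", "roast pork"],
    "file4": ["soy sauce flavor"],
    "file5": ["ramen"],
}


def generate_file_sets(keyword_set, keywords_by_file):
    """Assign each file to the lowest-ranked classification keyword it contains."""
    counts = {
        k: sum(1 for kws in keywords_by_file.values() if k in kws)
        for k in keyword_set
    }
    # Descending by file count; stable sort keeps input order on ties.
    ordered = sorted(keyword_set, key=lambda k: -counts[k])
    result = {k: [] for k in ordered}
    others = []
    for file_id, kws in keywords_by_file.items():
        present = [k for k in ordered if k in kws]
        if present:
            result[present[-1]].append(file_id)  # last entry = lowest rank
        else:
            others.append(file_id)
    if others:
        result["other"] = others  # special keyword for unmatched files
    return result


first = generate_file_sets(["ramen", "soy sauce flavor"], keywords_by_file)
second = generate_file_sets(["roast pork", "ramen"], keywords_by_file)
print(first)
print(second)
```

The second call uses the keyword set {roast pork, ramen} that the description arrives at after a user requests a different classification, and it reproduces the result given there.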
[0020]
The recursive classification control unit 604 is used to further classify the results classified by the file set generation unit 603. That is, by providing the identifier of the already classified file and the set of keywords appearing in the file to the initial keyword selecting means 601, the identifier of the classified file is further classified by the keyword.
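One way to picture the recursive control is the sketch below, which feeds each classified group back through keyword ranking and file assignment. Several points are assumptions not stated in the description: keywords already used higher in the tree are excluded from the ranking, the refinement step is skipped for brevity, groups with a single file and the "other" group are not subdivided, and recursion stops at a fixed depth.

```python
keywords_by_file = {
    "file1": ["ramen", "soy sauce flavor"],
    "file2": ["roast pork"],
    "file3": ["ramen", "roast pork"],
    "file4": ["soy sauce flavor"],
    "file5": ["ramen"],
}


def classify(keywords_by_file, n=2, used=frozenset(), depth=1):
    """Classify files by their top-n keywords, then subclassify each group."""
    counts = {}
    for kws in keywords_by_file.values():
        for k in dict.fromkeys(kws):  # count each keyword once per file
            if k not in used:  # skip keywords used higher up (assumption)
                counts[k] = counts.get(k, 0) + 1
    ordered = sorted(counts, key=lambda k: -counts[k])[:n]

    groups = {k: [] for k in ordered}
    for file_id, kws in keywords_by_file.items():
        present = [k for k in ordered if k in kws]
        label = present[-1] if present else "other"
        groups.setdefault(label, []).append(file_id)

    tree = {}
    for label, files in groups.items():
        if depth > 0 and label != "other" and len(files) > 1:
            subset = {f: keywords_by_file[f] for f in files}
            tree[label] = classify(subset, n, used | {label}, depth - 1)
        else:
            tree[label] = files  # leaf: list of file identifiers
    return tree


tree = classify(keywords_by_file, n=2, depth=1)
print(tree)
```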
[0021]
The display means 106 converts the result of the information classification means 105 into a tree structure and presents the user with an overview of the database contents. When the output of the information classification means 105 is
{ramen: {file3, file5}, soy sauce flavor: {file1, file4}, other: {file2}}
the output shown in FIG. 8 is obtained. The user looks at this output and uses the input means 107 to request a different classification result. The input means 107 is connected to the information classification means 105, and the request is passed to the initial keyword selection means 601.
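As a rough illustration of converting the classification result into a tree for display, the sketch below prints an indented text tree. The actual layout of FIG. 8 is not reproduced in the text, so the root label "database", the indentation, and the function name `render_tree` are all hypothetical.

```python
def render_tree(classification, root="database"):
    """Render {keyword: [file ids]} as an indented text tree."""
    lines = [root]
    for keyword, files in classification.items():
        lines.append("  " + keyword)
        for file_id in files:
            lines.append("    " + file_id)
    return "\n".join(lines)


classification = {
    "ramen": ["file3", "file5"],
    "soy sauce flavor": ["file1", "file4"],
    "other": ["file2"],
}
text = render_tree(classification)
print(text)
```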
[0022]
The initial keyword selection means 601 remembers the classification keyword set it most recently selected from the output of the keyword collection means 104. When a user request arrives from the input means 107, it takes the most recently ordered keyword sequence, moves the keywords of the most recently selected classification keyword set to the bottom of the sequence while preserving their order, and then selects a fixed number of keywords from the top as the new classification keyword set and outputs it. In the example of FIG. 7, the most recently selected classification keyword set is
{ramen, soy sauce flavor}
so these keywords are moved, in the same order, to just after the lowest-ranked keyword "roast pork", producing the keyword sequence shown in FIG. 9. Selecting two keywords from the top then yields the classification keyword set
{roast pork, ramen}
which is output. Processing from the classification keyword refinement means 602 onward is the same as before, and the information classification apparatus outputs
{ramen: {file1, file5}, roast pork: {file2, file3}, other: {file4}}
The display means 106 thus presents the user with an overview of the database contents as shown in FIG. 10, which differs from the earlier classification result of FIG. 8.
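The reordering performed on a user request amounts to rotating the previously selected keywords to the end of the ranked sequence and taking the top N again. A minimal sketch, with a hypothetical function name and illustrative English keyword labels:

```python
def reselect(ranked, last_selected, n):
    """Move the last-selected keywords to the bottom, then take the top n."""
    remaining = [k for k in ranked if k not in last_selected]
    moved = [k for k in ranked if k in last_selected]  # relative order preserved
    new_ranked = remaining + moved
    return new_ranked, new_ranked[:n]


ranked = ["ramen", "soy sauce flavor", "roast pork"]
new_ranked, selected = reselect(ranked, ["ramen", "soy sauce flavor"], 2)
print(new_ranked)
print(selected)
```

Calling `reselect` again with the new sequence and the new selection would keep cycling through alternative keyword sets, which matches the idea of the user repeatedly asking for a different view.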
[0023]
In the present invention, any type of file can be classified as long as the file includes text composed of character codes. If the file is an HTML file constituting a homepage on the Internet and the file identifier is its URL address, the information classification apparatus of the present invention can be used as a homepage classification system.
[0024]
[Effect of the invention]
As is clear from the above description, the present invention uses keywords as concepts and automatically classifies files containing text composed of character codes. Therefore, even when a file embodying a new concept appears, the concept system can easily be updated simply by maintaining the keywords, and an overview of the database contents can be provided to the user quickly.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the overall configuration of an information classification apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the file storage means 101 of FIG. 1, for explaining the operation of the embodiment.
FIG. 3 is a diagram showing an example of the output of the morphological analysis means 102 of FIG. 1, for explaining the operation of the embodiment.
FIG. 4 is a diagram showing an example of the keyword storage means 103 of FIG. 1, for explaining the operation of the embodiment.
FIG. 5 is a diagram showing an example of the output of the keyword collection means 104 of FIG. 1, for explaining the operation of the embodiment.
FIG. 6 is a detailed block diagram of the information classification means 105 of FIG. 1, for explaining the operation of the embodiment.
FIG. 7 is a diagram showing an example of the internal state of the initial keyword selection means 601 of FIG. 6, for explaining the operation of the embodiment.
FIG. 8 is a diagram showing an example of the display means 106 of FIG. 1, for explaining the operation of the embodiment.
FIG. 9 is a diagram showing an example of the internal state of the initial keyword selection means 601 of FIG. 6, for explaining the operation of the embodiment.
FIG. 10 is a diagram showing an example of the display means 106 of FIG. 1, for explaining the operation of the embodiment.
[Explanation of reference numerals]
101 File storage means
102 Morphological analysis means
103 Keyword storage means
104 Keyword collection means
105 Information classification means
106 Display means
107 Input means
601 Initial keyword selection means
602 Classification keyword refinement means
603 File set generation means
604 Recursive classification control means

Claims

1. An information classification apparatus comprising:
file storage means for storing files containing text composed of character codes;
morphological analysis means for performing morphological analysis on the text portion of a file output by the file storage means and outputting the result together with the identifier of the file;
keyword storage means for storing keywords for classifying the files;
keyword collection means which takes the output of the keyword storage means and the output of the morphological analysis means as input, collects from the result of the morphological analysis the keywords that match keywords stored in the keyword storage means, and outputs them together with the identifier of the file;
initial keyword selection means for selecting, from the keywords collected by the keyword collection means, a combination of keywords such that the number of files containing the keywords is greater than a predetermined number;
display means for classifying and displaying the files by keyword, based on the keywords selected by the initial keyword selection means and the identifiers of the text files containing the selected keywords;
input means by which the user requests a classification result different from the one based on the keywords selected by the initial keyword selection means;
classification keyword refinement means for selecting a combination of keywords different from the keywords selected by the initial keyword selection means, using keywords that were collected by the keyword collection means and are other than the keywords selected by the initial keyword selection means; and
file set generation means for generating sets whose elements are a keyword selected by the classification keyword refinement means and the identifiers of the text files containing that keyword;
wherein the display means classifies by keyword and redisplays the keywords selected by the classification keyword refinement means and the sets of file identifiers generated by the file set generation means, and
wherein the initial keyword selection means arranges the keywords collected by the keyword collection means in descending order of the number of files in which each keyword appears, selects a fixed number of keywords from the top as classification keywords, and, when text containing a given keyword also contains a lower-ranked keyword, assigns the identifier of the file to the lower-ranked keyword.

2. The information classification apparatus according to claim 1, wherein the classification keyword refinement means takes the keywords that were collected by the keyword collection means and are other than the keywords selected by the initial keyword selection means, arranges them in descending order of the number of files in which each keyword appears, places the keywords selected by the initial keyword selection means after them, and selects a fixed number of keywords from the top as the refined keywords.