JP3502537B2

JP3502537B2 - Index deriving device and method, and computer-readable medium recording index deriving program

Info

Publication number: JP3502537B2
Application number: JP00246498A
Authority: JP
Inventors: 雅博奥; 良輔野田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-01-08
Filing date: 1998-01-08
Publication date: 2004-03-02
Anticipated expiration: 2018-01-08
Also published as: JPH11203296A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an index derivation device and method that can limit the number of derivations when new search indexes are derived from previously assigned search indexes, and to a computer-readable medium on which an index derivation program is recorded.

[0002]

[Prior Art] The inventors have previously proposed a technique (Japanese Patent Application No. H8-331039) for a database (the original database) in which one term is assigned in advance to each record as a search index. The technique derives new indexes from that search index and adds them as new search indexes, thereby improving the search efficiency of the database.

[0003] In that technique, morphological analysis divides the term assigned to each record of the original database as a search index (the original index term) into words. Staircase record derivation then creates, for each word obtained by the analysis, a term beginning with that word (a derived index term) by extracting the portion of the original index term from that word onward. In addition, for the words that the analysis finds at the end of original index terms, the technique examines their frequency of occurrence across all original index terms of the original database, and deletes from the new search indexes any term that exactly matches one of these high-frequency words, thereby suppressing the derivation of unnecessary indexes.
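
The earlier technique described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the segmented terms, the `threshold` parameter, and the function name `prior_art_filter` are hypothetical, and the cited application may define the frequency criterion differently.

```python
from collections import Counter

def prior_art_filter(segmented_terms, threshold):
    """Sketch of the earlier technique (names and threshold are assumptions).

    segmented_terms: list of original index terms, each already split into words.
    Returns the derived index terms that survive the last-word frequency filter.
    """
    # Count how often each word appears as the LAST word of an original term.
    last_word_freq = Counter(words[-1] for words in segmented_terms)
    frequent = {w for w, c in last_word_freq.items() if c >= threshold}

    kept = []
    for words in segmented_terms:
        for k in range(len(words)):
            term = "".join(words[k:])  # staircase derivation: drop the first k words
            # Only a derived term identical to a frequent word is dropped.
            if term not in frequent:
                kept.append(term)
    return kept

# Hypothetical segmentation of three original index terms.
terms = [["スナック", "センター"], ["建築", "センター"], ["関東", "建築", "センター"]]
kept = prior_art_filter(terms, threshold=3)
```

In this toy input the single word "センター" is filtered out, while the two-word term "建築センター" survives both times it is derived, because only single last words are counted.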

[0004]

[Problems to Be Solved by the Invention] However, this technique has the following problems: (1) a term composed of multiple words is not deleted if even one of its words has a low frequency of occurrence, so indexes that are unnecessary for searching remain; and (2) because high-frequency indexes are deleted, homographs (cases where the same word is used with a different meaning) are deleted without distinction, so indexes that are actually needed may also be deleted.

[0005] An object of the present invention is to provide an index derivation device and method, and a computer-readable medium recording an index derivation program, that can delete even a term composed of multiple words if it occurs frequently, can decide whether to delete a term while distinguishing homographs, and can effectively prevent the derivation of unnecessary indexes.

[0006]

[Means for Solving the Problems] To solve the above problems, the present invention provides an index derivation device in a database creation device. The database creation device comprises morphological analysis means for performing morphological analysis that divides the term assigned to each record of an original database as a search index (the original index term) into words, and staircase record derivation means for creating, for each word obtained by the analysis, a term beginning with that word (a derived index term) by extracting the portion of the original index term from that word onward. It creates a search target database by accumulating records that have at least one of the created derived index terms as a new search index. The index derivation device is characterized by comprising: derivation source information adding means for adding, when the staircase record derivation means creates a derived index term, derivation source information indicating from which original index term the derived index term was derived; term aggregation means for examining the frequency of occurrence of the derived index terms across the entire database; and unnecessary record deletion means consisting of forward position word acquisition means for acquiring the word located in front of a high-frequency term (the forward position word) from the original index term obtained from the derivation source information, and unnecessary record determination means for determining, according to the content of the morpheme information of the forward position word, whether the corresponding derived index term should be deleted from the search index, and for deleting from the search index the derived index terms that should be deleted.

[0007] With this configuration, records that occur frequently can be deleted in units larger than a word, so a term composed of multiple words can also be deleted if it occurs frequently. In addition, whether to delete a term can be decided according to the makeup of the whole multi-word term, that is, while distinguishing homographs.

[0008]

[0009]

[0010] In this case, the derived index terms to be deleted may be determined by referring to an unnecessary record determination rule group, which consists of rules that have morpheme information as a key and express a deletion condition in terms of the content of the morpheme information of the forward position word of a derived index term.

[0011] [Embodiment 1] FIG. 1 shows a first embodiment of the index derivation device of the present invention (which, however, is not covered by the claims). In the figure, 1 is the original database, 2 is the search index, 3 is the search target database, and 10 is the index derivation device.

[0012] The original database 1 is the database subject to index derivation, in which one term is assigned to each record as a search index (search key). The search index 2 is the target index generated by processing the original database 1 with the index derivation device 10. The search target database 3 is the database obtained by adding the search index 2 to the original database 1.

[0013] The index derivation device 10 comprises a data reading unit 11, a morphological analysis unit 12, a staircase record derivation unit 13, a term aggregation unit 14, a frequent term table 15, an unnecessary record deletion unit 16, and an information output unit 17.

[0014] The data reading unit 11 reads information from the original database 1 one record at a time. The morphological analysis unit 12 divides the information read by the data reading unit 11, that is, the term assigned to each record of the original database as a search index (the original index term), into words. The staircase record derivation unit 13 creates derived index terms (staircase records), each beginning with one of the words obtained by the morphological analysis unit 12, by extracting the portion of the original index term from that word onward.
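
As a minimal sketch of the staircase derivation just described, assuming the term has already been split into words by some morphological analyzer (the function name `derive_staircase` and the hard-coded segmentation are illustrative, not taken from the patent):

```python
def derive_staircase(words):
    """Return one derived index term per word: the original term from that word onward."""
    return ["".join(words[k:]) for k in range(len(words))]

# Assumed output of morphological analysis for "関東不燃建築センター".
words = ["関東", "不燃", "建築", "センター"]
derived = derive_staircase(words)

for term in derived:
    print(term)
```

Each derived term is one "step" shorter than the previous one, which is where the staircase name comes from.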

[0015] The term aggregation unit 14 tallies the frequency of occurrence, across the entire database, of each derived index term (staircase record) obtained by the staircase record derivation unit 13, counting either by surface form or by the pair of surface form and morpheme information, and creates the frequent term table 15 that stores the frequency of each term. The unnecessary record deletion unit 16 refers to the frequent term table 15 and deletes from the search index either the derived index terms that exactly match a high-frequency term, or the derived index terms whose surface form and morpheme information exactly match a high-frequency pair. The information output unit 17 outputs the search index.
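
The tallying and deletion performed by units 14 and 16 might look like the following sketch. The record layout (a `(surface, morpheme_info)` tuple), the function names, and the `threshold` parameter are assumptions made for illustration; the patent does not prescribe a data layout.

```python
from collections import Counter

def build_frequent_term_table(records, with_morph=True):
    """Count derived records by surface form, or by (surface, morpheme info) pair."""
    if with_morph:
        return Counter((surface, morph) for surface, morph in records)
    return Counter(surface for surface, _ in records)

def delete_frequent(records, table, threshold, with_morph=True):
    """Keep only the records whose tally is below the (assumed) frequency threshold."""
    kept = []
    for surface, morph in records:
        key = (surface, morph) if with_morph else surface
        if table[key] < threshold:
            kept.append((surface, morph))
    return kept

# Toy derived records; morpheme info is a tuple of part-of-speech labels.
records = [
    ("スナックセンター", ("prefix name", "proper noun")),
    ("センター", ("proper noun",)),
    ("建築センター", ("sa-hen noun", "suffix")),
    ("センター", ("suffix",)),
    ("センター", ("suffix",)),
]

# Counting by pair keeps the proper-noun "センター" and drops only the suffix one.
pair_table = build_frequent_term_table(records)
kept_by_pair = delete_frequent(records, pair_table, threshold=2)

# Counting by surface form alone drops every "センター".
surface_table = build_frequent_term_table(records, with_morph=False)
kept_by_surface = delete_frequent(records, surface_table, threshold=2, with_morph=False)
```

Comparing the two results shows why pairing the surface form with morpheme information helps keep homographs apart.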

[0016] The index derivation device 10 is realized by hardware such as a CPU, memory, and an external storage device, together with software (a program) implementing the procedure shown in the operation flowchart of FIG. 2.

[0017] The operation of the device is described below with reference to FIG. 2.

[0018] (Step s1) The data reading unit 11 reads one data record (original index term) from the original database 1 and sends it to the morphological analysis unit 12.

[0019] (Step s2) The morphological analysis unit 12 performs morphological analysis on the extracted data record, dividing it into its constituent words and attaching morpheme information such as part of speech to each word. It then sends the result to the staircase record derivation unit 13.

[0020] (Step s3) The staircase record derivation unit 13 counts the number of words in the data record currently being processed from the morphological analysis result it received and sets n to that value. It also sets the number of removed words k to its initial value of 0.

[0021] (Step s4) Next, based on the received data record and morphological analysis result, a record (derived index term) holding the information that remains after removing the first k words is derived and stored in memory.

[0022] (Step s5) Processing branches on whether the number of removed words k equals n - 1 (one less than the total number of words). If it does not, processing proceeds to step s6; if it does, processing proceeds to step s7.

[0023] (Step s6) 1 is added to the number of removed words k (k = k + 1), and processing returns to step s4.

[0024] (Step s7) Processing branches on whether all data records of the original database 1 have been processed. If unprocessed data records remain, control passes to the data reading unit 11 and processing proceeds to step s8; if all data records have been processed, control passes to the term aggregation unit 14 and processing proceeds to step s9.

[0025] (Step s8) The data reading unit 11 reads the next data record from the original database 1, sends it to the morphological analysis unit 12, and processing returns to step s2.

[0026] (Step s9) After the entire original database 1 has been processed, the term aggregation unit 14 tallies the derived records stored in memory in step s4, either by surface form or by the pair of surface form and morpheme information.

[0027] (Step s10) Based on the tally, the term aggregation unit 14 then creates the frequent term table 15, whose entries hold at least either the two items surface form and frequency of occurrence, or the three items surface form, morpheme information, and frequency of occurrence.

[0028] (Step s11) The unnecessary record deletion unit 16 refers to the frequent term table 15, deletes the frequent terms as unnecessary records from the derived records stored in memory, and passes control to the information output unit 17. The information output unit 17 outputs the derived records remaining in memory to the search index 2.

[0029] Through the above processing, only frequent terms that match down to the morpheme information are deleted from the search index, so the terms needed as indexes are correctly retained. In addition, because terms are tallied per term rather than per word, even a term consisting of two or more words can be deleted as an unnecessary record if it occurs frequently among the derived records.

[0030] [Embodiment 2] FIG. 3 shows a second embodiment of the index derivation device of the present invention. In the figure, components identical to those in FIG. 1 are denoted by the same reference numerals. That is, 1 is the original database, 2 is the search index, 3 is the search target database, and 20 is the index derivation device.

[0031] The index derivation device 20 comprises the data reading unit 11, the morphological analysis unit 12, the term aggregation unit 14, the frequent term table 15, the information output unit 17, a staircase record derivation unit 21, a derivation source information adding unit 22, an unnecessary record deletion unit 23, and an unnecessary record determination rule group 24.

[0032] The staircase record derivation unit 21 creates derived index terms (staircase records), each beginning with one of the words obtained by the morphological analysis unit 12, by extracting the portion of the original index term from that word onward. It also calls the derivation source information adding unit 22 to attach to each derived index term derivation source information indicating from which original index term it was derived, and stores the derived index terms and their derivation source information in memory.
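
One possible way to carry derivation source information alongside each derived record is sketched below. The patent does not say how this information is encoded, so the `DerivedRecord` class and its `source_id` and `removed` fields are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class DerivedRecord:
    surface: str             # derived index term
    morph: Tuple[str, ...]   # morpheme info of the words it contains
    source_id: int           # hypothetical: id of the original record it came from
    removed: int             # hypothetical: k, the number of leading words removed

def derive_with_source(source_id: int, words: List[str], tags: List[str]) -> List[DerivedRecord]:
    """Staircase derivation that tags every derived record with where it came from."""
    return [
        DerivedRecord("".join(words[k:]), tuple(tags[k:]), source_id, k)
        for k in range(len(words))
    ]

recs = derive_with_source(
    0,
    ["関東", "不燃", "建築", "センター"],
    ["place name", "common noun", "sa-hen noun", "suffix"],
)
```

Storing `source_id` and `removed` is enough to recover later which word stood immediately in front of the derived term in the original index term.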

[0033] The derivation source information adding unit 22 attaches derivation source information to derived index terms.

[0034] The unnecessary record deletion unit 23 consists of a forward position word acquisition unit 231 and an unnecessary record determination unit 232. The forward position word acquisition unit 231 acquires the word located in front of a high-frequency term (the forward position word) from the original index term obtained from the derivation source information. The unnecessary record determination unit 232 refers to the frequent term table 15 and the unnecessary record determination rule group 24, determines from the content of the morpheme information of the forward position word whether the corresponding derived index term should be deleted from the search index, and deletes from the search index the derived index terms that should be deleted.
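
The forward position word lookup could be sketched as follows, assuming that the derivation source information gives access to the segmented original term and to the number of words removed, and that "forward position word" means the word immediately preceding the derived term. Both points are assumptions; the function name is illustrative.

```python
def forward_position_word(original_words, original_tags, removed):
    """Return (word, morpheme info) of the word just before the derived term.

    removed is the number of leading words dropped to form the derived term.
    Returns None when the derived term is the whole original term.
    """
    if removed == 0:
        return None
    return original_words[removed - 1], original_tags[removed - 1]

words = ["関東", "不燃", "建築", "センター"]
tags = ["place name", "common noun", "sa-hen noun", "suffix"]

# For the derived term "センター" (three words removed) the forward word is "建築".
front = forward_position_word(words, tags, removed=3)
whole = forward_position_word(words, tags, removed=0)
```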
The unnecessary record determining unit 232 deletes the derived index term to be deleted from the search index.

[0035] The unnecessary record determination rule group 24 consists of multiple rules that have morpheme information as a key and express a deletion condition in terms of the content of the morpheme information of the forward position word of a derived index term.

[0036] The index derivation device 20 is realized by hardware such as a CPU, memory, and an external storage device, together with software (a program) implementing the procedures shown in the operation flowcharts of FIGS. 4 and 5.

[0037] The operation of the device is described below with reference to FIGS. 4 and 5.

[0038] (Step s21) The data reading unit 11 reads one data record from the original database 1 and sends it to the morphological analysis unit 12.

[0039] (Step s22) The morphological analysis unit 12 performs morphological analysis on the extracted data record, dividing it into its constituent words and attaching morpheme information such as part of speech to each word. It then sends the result to the staircase record derivation unit 21.

[0040] (Step s23) The staircase record derivation unit 21 counts the number of words in the data record currently being processed from the morphological analysis result it received and sets n to that value. It also sets the number of removed words k to its initial value of 0.

[0041] (Step s24) Based on the received data record and morphological analysis result, the staircase record derivation unit 21 derives a record holding the information that remains after removing the first k words. It then calls the derivation source information adding unit 22 to attach to the derived record its derivation source information, and stores the derived record and its derivation source information in memory.

[0042] (Step s25) Processing branches on whether the number of removed words k equals n - 1 (one less than the total number of words). If it does not, processing proceeds to step s26; if it does, processing proceeds to step s27.

[0043] (Step s26) 1 is added to the number of removed words k (k = k + 1), and processing returns to step s24.

[0044] (Step s27) Processing branches on whether all data records of the original database 1 have been processed. If unprocessed data records remain, control passes to the data reading unit 11 and processing proceeds to step s28; if all data records have been processed, control passes to the term aggregation unit 14 and processing proceeds to step s29.

[0045] (Step s28) The data reading unit 11 reads the next data record from the original database 1, sends it to the morphological analysis unit 12, and processing returns to step s22.

[0046] (Step s29) After the entire original database 1 has been processed, the term aggregation unit 14 tallies the derived records stored in memory in step s24, either by surface form or by the pair of surface form and morpheme information.

[0047] (Step s30) Based on the tally, the term aggregation unit 14 then creates the frequent term table 15, whose entries hold at least either the two items surface form and frequency of occurrence, or the three items surface form, morpheme information, and frequency of occurrence.

[0048] (Step s31) The unnecessary record deletion unit 23 refers to the frequent term table 15 and the unnecessary record determination rule group 24, deletes frequent terms as unnecessary records from the derived records stored in memory, and passes control to the information output unit 17. The information output unit 17 outputs the derived records remaining in memory to the search index 2.

[0049] FIG. 5 is a detailed operation flowchart of step s31 (unnecessary record deletion processing) in FIG. 4; the operation is described below with reference to it.

[0050] (Step s41) The forward position word acquisition unit 231, which is part of the unnecessary record deletion unit 23, first reads one frequent term from the frequent term table 15.

[0051] (Step s42) Next, using the surface form of that frequent term as a key, the forward position word acquisition unit 231 searches the derived records that the staircase record derivation unit 21 stored in memory (step s24).

[0052] (Step s43) Processing branches on whether any derived record matches the surface form of the frequent term. If one exists, processing proceeds to step s44; if none exists, processing returns to step s41 to handle the next frequent term.

[0053] (Step s44) All derived records that match the surface form of the frequent term are read together with their derivation source information.

[0054] (Step s45) One of the derived records read in step s44 is selected for processing.

[0055] (Step s46) The forward position word acquisition unit 231 obtains the morpheme information on the forward position word from the derivation source information of the derived record being processed, and passes control to the unnecessary record determination unit 232.

[0056] (Step s47) The unnecessary record determination unit 232 searches the unnecessary record determination rule group 24 using the morpheme information of the frequent term currently being processed.

[0057] (Step s48) Processing branches on whether a matching unnecessary record determination rule exists. If one exists, processing proceeds to step s49; if none exists, processing proceeds to step s51 to handle the next derived record.

[0058] (Step s49) Processing branches on whether the morpheme information on the forward position word, obtained from the derivation source information of the derived record currently being processed, matches any of the unnecessary record determination rules obtained in step s47, that is, on whether the record should be deleted. If it matches (the record should be deleted), processing proceeds to step s50; if it does not match (the record should not be deleted), processing proceeds to step s51.

[0059] (Step s50) The unnecessary record determination unit 232 deletes from memory the derived record that matched the unnecessary record determination rule group 24.

[0060] (Step s51) Processing branches on whether all derived records have been processed. If they have, processing proceeds to step s52; if not, control passes to the forward position word acquisition unit 231 and processing returns to step s45 to handle the next derived record.

[0061] (Step s52) Processing branches on whether all frequent terms have been processed. If they have, control passes to the information output unit 17 and processing proceeds to step s53; if not, control passes to the forward position word acquisition unit 231 and processing returns to step s41 to handle the next frequent term.

[0062] (Step s53) The information output unit 17 outputs the derived records that remain in memory after the above processing to the search index 2.

[0063] Through the above processing, whether a frequent term is deleted from the search index is decided by what kind of word it followed in the original data record (as described in the unnecessary record determination rule group 24), so the terms needed as indexes are correctly retained. In addition, because terms are tallied per term rather than per word, even a term consisting of two or more words can be deleted as an unnecessary record if it occurs frequently among the derived records.
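
The deletion procedure of steps s41 to s53 can be approximated by the sketch below. The rule representation (a dict from the frequent term's morpheme information to the set of forward-word part-of-speech labels that trigger deletion), the record layout, and the sample rule itself are all assumptions; the patent leaves the concrete rule format open.

```python
def delete_by_rules(derived, frequent_terms, rules):
    """Rule-based deletion of frequent derived records (illustrative sketch).

    derived:        list of dicts with 'surface', 'morph', 'forward_tag'
                    ('forward_tag' is None when no forward position word exists).
    frequent_terms: list of (surface, morph) pairs from the frequent term table.
    rules:          dict mapping a frequent term's morph to the set of forward-word
                    labels for which the derived record should be deleted.
    """
    kept = list(derived)
    for surface, morph in frequent_terms:                         # s41
        matches = [r for r in kept if r["surface"] == surface]    # s42 to s44
        conditions = rules.get(morph)                             # s47
        if conditions is None:                                    # s48: no rule applies
            continue
        for rec in matches:                                       # s45, s51
            if rec["forward_tag"] in conditions:                  # s46, s49
                kept.remove(rec)                                  # s50
    return kept                                                   # s53

derived = [
    {"surface": "センター", "morph": ("suffix",), "forward_tag": "sa-hen noun"},
    {"surface": "センター", "morph": ("proper noun",), "forward_tag": "prefix name"},
    {"surface": "建築センター", "morph": ("sa-hen noun", "suffix"), "forward_tag": "common noun"},
]
frequent_terms = [("センター", ("suffix",))]
rules = {("suffix",): {"sa-hen noun"}}  # hypothetical rule, not from the patent

kept = delete_by_rules(derived, frequent_terms, rules)
```

With this made-up rule, the "センター" that followed a sa-hen noun is removed, while the "センター" that followed a prefix name is kept even though its surface form is identical.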

[0064] [Specific Example 1] Next, the operation of the device of the first embodiment is described using a specific example. FIG. 6 shows the actual processing, and FIG. 7 shows an example of the frequent term table 15.

[0065] The data reading unit 11 reads one data record from the original database 1. Suppose it reads "スナックセンター" (Snack Center). The data reading unit 11 sends the data record "スナックセンター" to the morphological analysis unit 12 (step s1).

[0066] The morphological analysis unit 12 performs morphological analysis on the extracted data record "スナックセンター", dividing it into its constituent words and attaching morpheme information such as part of speech to each word (morphological analysis in FIG. 6). It then sends the result to the staircase record derivation unit 13 (step s2).

[0067] The staircase record derivation unit 13 counts the number of words in the data record currently being processed from the morphological analysis result it received and sets n to that value. Because "スナックセンター" consists of two words, n = 2. It also sets the number of removed words k to its initial value of 0 (step s3).

[0068] Next, based on the received data record "スナックセンター" and the morphological analysis result, a record holding the information that remains after removing k = 0 words from the beginning, namely "スナックセンター (prefix name, proper noun)", is derived and stored in memory (step s4).

[0069] At this point the number of removed words is k = 0 and n = 2, so n - 1 = 1 and k does not equal n - 1 (step s5). Therefore 1 is added to k, giving k = 1 (step s6).

[0070] Next, based on the received data record "スナックセンター" and the morphological analysis result, a record holding the information that remains after removing k = 1 word from the beginning, namely "センター (proper noun)", is derived and stored in memory (step s4).

[0071] Now the number of removed words is k = 1 and n = 2, so n - 1 = 1 and k equals n - 1 (step s5). Processing of the data record "スナックセンター" therefore ends.

[0072] As a result of the processing so far, two staircase records, "スナックセンター (prefix name, proper noun)" and "センター (proper noun)", are stored in memory together with their morpheme information (staircase record derivation in FIG. 6).
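
The trace above follows the flowchart variables n and k directly. A sketch that mirrors steps s3 to s6 literally is shown below; the function name, the tuple layout, and the guard for an empty word list are assumptions added for illustration, and the segmentation and part-of-speech labels are written in by hand rather than produced by an analyzer.

```python
def derive_records(words, tags):
    """Derive staircase records using the flowchart's n and k (steps s3 to s6)."""
    n = len(words)          # s3: number of words in the record
    if n == 0:              # guard added for the sketch; not part of the flowchart
        return []
    k = 0                   # s3: number of removed words starts at 0
    records = []
    while True:
        # s4: keep what remains after removing the first k words
        records.append(("".join(words[k:]), tuple(tags[k:])))
        if k == n - 1:      # s5: last word reached, stop
            break
        k += 1              # s6: remove one more word and repeat
    return records

records = derive_records(["スナック", "センター"], ["prefix name", "proper noun"])
for surface, morph in records:
    print(surface, morph)
```

The loop runs exactly n times, once for each value of k from 0 to n - 1, so a two-word term yields two derived records.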

[0073] Processing branches on whether all data records of the original database 1 have been processed. Here, suppose unprocessed data records remain (step s7).

[0074] The data reading unit 11 reads the next data record from the original database 1 (step s8). Assume that "Kanto Incombustible Building Center" is read. In the same manner as above (steps s2 to s8), four stepwise-derived records are stored in memory together with their morpheme information: "Kanto Incombustible Building Center (place name, common noun, sa-hen noun, suffix)", "Incombustible Building Center (common noun, sa-hen noun, suffix)", "Building Center (sa-hen noun, suffix)", and "Center (suffix)".

[0075] Assuming that all data records in the original database 1 have now been processed, control passes to the term aggregation unit 14 (step s7).

[0076] After the entire original database 1 has been processed, the term aggregation unit 14 tallies the derived records stored in memory in step s4, either by surface form alone or by the pair of surface form and morpheme information (step s9). Based on the tally, the term aggregation unit 14 then creates the frequent term table 15, whose entries hold at least the pair (surface form, frequency) or the triple (surface form, morpheme information, frequency) (step s10). Here, it is assumed that the contents of the frequent term table 15 shown in FIG. 7 are obtained.
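The two tallying modes of step s9 and the table rows of step s10 can be sketched with a counter. This is an illustrative sketch only: `tally_terms` and `frequent_term_table` are hypothetical names, the part-of-speech tags are English stand-ins, and the three sample records are toy data that do not reproduce the contents of FIG. 7.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Derived = Tuple[str, Tuple[str, ...]]  # (surface form, parts of speech)


def tally_terms(records: Iterable[Derived], use_morphemes: bool = True) -> Counter:
    """Step s9: count derived records by (surface, morphemes) or by surface only."""
    if use_morphemes:
        return Counter(records)
    return Counter(surface for surface, _ in records)


def frequent_term_table(tally: Counter) -> List[tuple]:
    """Step s10: emit (surface, frequency) or (surface, morphemes, frequency) rows."""
    rows = []
    for key, freq in tally.most_common():
        if isinstance(key, tuple):
            rows.append((key[0], key[1], freq))
        else:
            rows.append((key, freq))
    return rows


# Toy derived records (not the data behind FIG. 7).
sample = [
    ("センター", ("suffix",)),
    ("センター", ("suffix",)),
    ("センター", ("proper_noun",)),
]
table_with_morphemes = frequent_term_table(tally_terms(sample))
table_surface_only = frequent_term_table(tally_terms(sample, use_morphemes=False))
```

Tallying on the pair keeps the suffix and the proper noun apart, which is what later allows a record with the same surface form but a different part of speech to survive deletion.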

[0077] Suppose that terms with a frequency of 3000 or more are treated as frequent terms. The unnecessary record deletion unit 16 refers to the frequent term table 15 of FIG. 7 and deletes, from the derived records stored in memory, those that match the frequent terms "Center (suffix)", "Building Center (sa-hen noun, suffix)", "Company (title prefix)", and "Corporation (title prefix)" as unnecessary records (step s11).

[0078] For the data record "Snack Center", the derived record "Center (proper noun)" has the same surface form as a frequent term, but its part of speech, which is its morpheme information, is "proper noun". This does not match the unnecessary record "Center (suffix)", so the record is not deleted. Consequently, for the data record "Snack Center", both "Snack Center" and "Center" remain as derived records (unnecessary record deletion in FIG. 6).

[0079] For the data record "Kanto Incombustible Building Center", the two records "Building Center (sa-hen noun, suffix)" and "Center (suffix)" match the unnecessary records "Building Center (sa-hen noun, suffix)" and "Center (suffix)" down to the part of speech, which is their morpheme information, so they are deleted. As a result, for "Kanto Incombustible Building Center", the two records "Kanto Incombustible Building Center" and "Incombustible Building Center" remain as derived records.
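The deletion of step s11 in this first example amounts to an exact match on both surface form and part-of-speech sequence. The sketch below assumes that reading; `delete_unnecessary` is a hypothetical name, the tags are English stand-ins for the Japanese parts of speech, and the frequent terms are the four named in the text rather than values read from FIG. 7.

```python
from typing import List, Set, Tuple

Derived = Tuple[str, Tuple[str, ...]]  # (surface form, parts of speech)

# The four frequent terms named in the text (frequency 3000 or more).
FREQUENT_TERMS: Set[Derived] = {
    ("センター", ("suffix",)),                  # "Center"
    ("建築センター", ("sahen_noun", "suffix")),  # "Building Center"
    ("会社", ("title_prefix",)),                # "Company"
    ("株式会社", ("title_prefix",)),            # "Corporation"
}


def delete_unnecessary(records: List[Derived], frequent: Set[Derived]) -> List[Derived]:
    """Drop records whose surface form AND morpheme information match a frequent term."""
    return [record for record in records if record not in frequent]


# Derived records for "Snack Center" and "Kanto Incombustible Building Center".
derived_records: List[Derived] = [
    ("スナックセンター", ("title_prefix", "proper_noun")),
    ("センター", ("proper_noun",)),
    ("関東不燃建築センター", ("place_name", "common_noun", "sahen_noun", "suffix")),
    ("不燃建築センター", ("common_noun", "sahen_noun", "suffix")),
    ("建築センター", ("sahen_noun", "suffix")),
    ("センター", ("suffix",)),
]
remaining = delete_unnecessary(derived_records, FREQUENT_TERMS)
```

The proper-noun "Center" survives because only its surface form matches; the suffix "Center" and the two-word "Building Center" are removed.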

[0080] Finally, the information output unit 17 outputs the derived records remaining in memory, "Snack Center", "Center", "Kanto Incombustible Building Center", and "Incombustible Building Center", to the search index 2 (step s11).

[0081] As is clear from the above description, this device can distinguish the "Center" in "Snack Center", which is a proper noun, from the suffix "Center" in "Kanto Incombustible Building Center", even though both have the same surface form, and can therefore correctly retain what is needed as an index. In addition, because terms are tallied per term rather than per word, even a two-word term such as "Building Center" can be deleted as an unnecessary record if it appears frequently among the derived records.

[0082] [Specific Example 2] Next, the operation of the device of the second embodiment is described using a specific example. FIG. 8 shows the actual processing, and FIG. 9 shows an example of the unnecessary record determination rule group 24. The example of FIG. 7 is used as-is for the frequent term table 15. In the following, descriptions of operations that are the same as in Specific Example 1 are omitted.

[0083] After the same operations as in Specific Example 1, the stepwise record derivation unit 21 stores derived records in memory together with their morpheme information (morphological analysis and stepwise record derivation in FIG. 8). For the data record "Kanto Incombustible Building Center", it stores four records: "Kanto Incombustible Building Center (place name, common noun, sa-hen noun, suffix)", "Incombustible Building Center (common noun, sa-hen noun, suffix)", "Building Center (sa-hen noun, suffix)", and "Center (suffix)". For the data record "Corporation Building Center" (株式会社建築センター, in which the title prefix 株式会社, "Corporation", comes first), it stores three records: "Corporation Building Center (title prefix, sa-hen noun, suffix)", "Building Center (sa-hen noun, suffix)", and "Center (suffix)".

[0084] At this time, a pointer indicating the derivation-source information of each derived record is stored as well (shown by arrows in the stepwise record derivation result of FIG. 8) (step s24). The frequent term table 15 is obtained as in FIG. 7, in the same manner as in Specific Example 1 (steps s29 and s30).
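One way to picture the derivation-source pointer of step s24 is to store, with each derived record, a reference to the analyzed original term and the number of words removed; the forward position word is then the word just before the derived record begins. The sketch below is an assumption about representation only: `DerivedRecord`, `derive_with_source`, and `forward_position_word` are hypothetical names, and the tags are English stand-ins.

```python
from typing import List, NamedTuple, Optional, Tuple

Morpheme = Tuple[str, str]  # (surface form, part of speech)


class DerivedRecord(NamedTuple):
    surface: str
    pos: Tuple[str, ...]
    source: Tuple[Morpheme, ...]  # derivation-source pointer: the analyzed original term
    k: int                        # number of leading words removed


def derive_with_source(morphemes: List[Morpheme]) -> List[DerivedRecord]:
    """Stepwise derivation that also records where each record came from."""
    source = tuple(morphemes)
    records = []
    for k in range(len(source)):
        rest = source[k:]
        records.append(DerivedRecord(
            surface="".join(word for word, _ in rest),
            pos=tuple(p for _, p in rest),
            source=source,
            k=k,
        ))
    return records


def forward_position_word(record: DerivedRecord) -> Optional[Morpheme]:
    """Return the word located just before the derived record in its source term."""
    if record.k == 0:
        return None  # the record is the original term itself
    return record.source[record.k - 1]


# 株式会社建築センター: "Corporation" + "Building" + "Center".
corporation = [("株式会社", "title_prefix"), ("建築", "sahen_noun"), ("センター", "suffix")]
corp_records = derive_with_source(corporation)
```

With this layout, two derived records that share a surface form still carry different sources, so their forward position words can differ.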

[0085] Next, the unnecessary record deletion process (step s31) is described in detail.

[0086] The forward position word acquisition unit 231, which is part of the unnecessary record deletion unit 23, first reads one frequent term from the frequent term table 15. Suppose that terms with a frequency of 3000 or more are treated as frequent terms. Since the frequent term table 15 is as shown in FIG. 7, "Center (suffix)" is read (step s41).

[0087] Next, the forward position word acquisition unit 231 searches the derived records stored by the stepwise record derivation unit 21 (step s24), using the surface form "Center" of the frequent term as the key (step s42). From FIG. 8, this yields the "Center (suffix)" derived from "Corporation Building Center" and the "Center (suffix)" derived from "Kanto Incombustible Building Center" (steps s43 and s44).

[0088] First, the "Center (suffix)" derived from "Corporation Building Center" is taken as the processing target (step s45).

[0089] The forward position word acquisition unit 231 uses the derivation-source information of the target derived record "Center (suffix)" to acquire the morpheme information of its forward position word, "Building (sa-hen noun)", and passes control to the unnecessary record determination unit 232 (step s46).

[0090] The unnecessary record determination unit 232 searches the unnecessary record determination rule group 24 using the part of speech (suffix), which is the morpheme information of the frequent term "Center (suffix)" currently being processed (step s47). Since the unnecessary record determination rule group 24 is as shown in FIG. 9, the rule (suffix, part of speech is not "title prefix") is obtained (step s47).

[0091] The part of speech (sa-hen noun) of the forward position word "Building (sa-hen noun)", obtained from the derivation-source information of the derived record "Center (suffix)" currently being processed, satisfies the unnecessary record determination rule (suffix, part of speech is not "title prefix") obtained in step s47 (steps s48 and s49). The unnecessary record determination unit 232 therefore deletes from memory the derived record "Center (suffix)" derived from "Corporation Building Center", which matched the rule (step s50).
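The rule lookup of step s47 and the check of steps s48 and s49 can be pictured as a small table keyed by part of speech. The sketch below encodes only the two rules quoted in the text; the full contents of FIG. 9 are not reproduced, and the names `RULES` and `matches_deletion_rule`, the set-based encoding, and the English tags are assumptions for illustration.

```python
from typing import Dict, Set

# Key: part of speech of the frequent term's first word.
# Value: forward-word parts of speech that protect the record from deletion,
# i.e. the rule "(key, part of speech is not <value>)" as quoted in the text.
RULES: Dict[str, Set[str]] = {
    "suffix": {"title_prefix"},
    "sahen_noun": {"title_prefix"},
}


def matches_deletion_rule(key_pos: str, forward_pos: str,
                          rules: Dict[str, Set[str]] = RULES) -> bool:
    """Return True when the forward position word satisfies the deletion rule."""
    protected = rules.get(key_pos)
    if protected is None:
        return False  # assumption: no rule for this key means no deletion
    return forward_pos not in protected


# "Center (suffix)" preceded by "Building (sa-hen noun)": the rule is satisfied.
delete_center = matches_deletion_rule("suffix", "sahen_noun")
# "Building Center" preceded by "Corporation (title prefix)": the rule is not satisfied.
delete_building_center = matches_deletion_rule("sahen_noun", "title_prefix")
```

Keeping the rules as data rather than code mirrors the description of the rule group as something that is searched with morpheme information as the key.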

[0092] Similarly, the "Center (suffix)" derived from "Kanto Incombustible Building Center" is processed, and this derived record is also deleted from memory.

[0093] Assume that all derived records for the frequent term "Center (suffix)" have now been processed (step s51). Because unprocessed frequent terms remain, control returns to the forward position word acquisition unit 231 and its processing resumes (steps s52 and s41).

[0094] Since the frequent term table 15 is as shown in FIG. 7, "Building Center (sa-hen noun, suffix)" is read (step s41).

[0095] Next, the forward position word acquisition unit 231 searches the derived records stored by the stepwise record derivation unit 21 (step s24), using the surface form "Building Center" of the frequent term as the key (step s42). From FIG. 8, this yields the "Building Center (sa-hen noun, suffix)" derived from "Corporation Building Center" and the "Building Center (sa-hen noun, suffix)" derived from "Kanto Incombustible Building Center" (steps s43 and s44).

[0096] First, the "Building Center (sa-hen noun, suffix)" derived from "Corporation Building Center" is taken as the processing target (step s45).

[0097] The forward position word acquisition unit 231 uses the derivation-source information of the target derived record "Building Center (sa-hen noun, suffix)" to acquire the morpheme information of its forward position word, "Corporation (title prefix)", and passes control to the unnecessary record determination unit 232 (step s46).

[0098] The unnecessary record determination unit 232 searches the unnecessary record determination rule group 24 using the part of speech (sa-hen noun) of the first word of the frequent term "Building Center (sa-hen noun, suffix)" currently being processed (step s47). Since the unnecessary record determination rule group 24 is as shown in FIG. 9, the rule (sa-hen noun, part of speech is not "title prefix") is obtained (step s47).

[0099] The part of speech (title prefix) of the forward position word "Corporation", obtained from the derivation-source information of the derived record "Building Center" currently being processed, does not satisfy the unnecessary record determination rule (sa-hen noun, part of speech is not "title prefix") obtained in step s47. The derived record is therefore not deleted from memory (step s49).

[0100] Next, the derived record "Building Center (sa-hen noun, suffix)" derived from "Kanto Incombustible Building Center" is processed. As before, the unnecessary record determination rule (sa-hen noun, part of speech is not "title prefix") is obtained (step s47).

[0101] The part of speech (common noun) of the forward position word "Incombustible", obtained from the derivation-source information of the derived record "Building Center (sa-hen noun, suffix)" currently being processed, satisfies the unnecessary record determination rule (sa-hen noun, part of speech is not "title prefix") obtained in step s47 (steps s48 and s49). The unnecessary record determination unit 232 therefore deletes from memory the derived record "Building Center (sa-hen noun, suffix)" derived from "Kanto Incombustible Building Center", which matched the rule (step s50).

[0102] Assume that all derived records for the frequent term "Building Center (sa-hen noun, suffix)" have now been processed (step s51). Because unprocessed frequent terms remain, control returns to the forward position word acquisition unit 231 and its processing resumes (steps s52 and s41).

[0103] Next, "Company (title prefix)" is retrieved from the frequent term table 15, but this frequent term does not exist among the derived records. Likewise, "Corporation (title prefix)" does not exist among the derived records.

[0104] Once all frequent terms have been processed in this way, control passes to the information output unit 17 (step s52).

[0105] Finally, the information output unit 17 outputs the derived records remaining in memory after the above processing, "Corporation Building Center", "Building Center", "Kanto Incombustible Building Center", and "Incombustible Building Center", to the search index 2 (step s53).
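The whole deletion pass of Specific Example 2 (steps s41 to s52) can be traced end to end with a short sketch. It is an illustration under stated assumptions, not the patented implementation: all function names are hypothetical, the tags are English stand-ins, a record with no forward position word is assumed to be kept, and the frequent terms and rules are the ones named in the text rather than the full contents of FIG. 7 and FIG. 9.

```python
from typing import Callable, Dict, List, Optional, Tuple

Morpheme = Tuple[str, str]                 # (surface form, part of speech)
Record = Tuple[Tuple[Morpheme, ...], int]  # (analyzed source term, words removed k)


def surface(rec: Record) -> str:
    source, k = rec
    return "".join(word for word, _ in source[k:])


def forward_pos(rec: Record) -> Optional[str]:
    source, k = rec
    return source[k - 1][1] if k > 0 else None


def derive(source: List[Morpheme]) -> List[Record]:
    return [(tuple(source), k) for k in range(len(source))]


# Rules quoted in the text, keyed by the part of speech of the frequent term's first word.
RULES: Dict[str, Callable[[str], bool]] = {
    "suffix": lambda pos: pos != "title_prefix",
    "sahen_noun": lambda pos: pos != "title_prefix",
}

FREQUENT_TERMS = [
    ("センター", ("suffix",)),
    ("建築センター", ("sahen_noun", "suffix")),
    ("会社", ("title_prefix",)),
    ("株式会社", ("title_prefix",)),
]


def delete_by_rules(records: List[Record], frequent_terms, rules) -> List[Record]:
    deleted = set()
    for term_surface, term_pos in frequent_terms:  # step s41
        rule = rules.get(term_pos[0])              # step s47
        if rule is None:
            continue
        for i, rec in enumerate(records):          # steps s42 to s45
            if surface(rec) != term_surface:
                continue
            fpos = forward_pos(rec)                # step s46
            if fpos is not None and rule(fpos):    # steps s48 and s49
                deleted.add(i)                     # step s50
    return [rec for i, rec in enumerate(records) if i not in deleted]


corporation = [("株式会社", "title_prefix"), ("建築", "sahen_noun"), ("センター", "suffix")]
kanto = [("関東", "place_name"), ("不燃", "common_noun"),
         ("建築", "sahen_noun"), ("センター", "suffix")]
all_records = derive(corporation) + derive(kanto)
index_terms = [surface(r) for r in delete_by_rules(all_records, FREQUENT_TERMS, RULES)]
```

Only the "Building Center" whose forward position word is the title prefix 株式会社 survives, which matches the four records the text lists as output.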

[0106] As is clear from the above description, this device can distinguish the "Building Center" derived from "Corporation Building Center", whose forward position word has the morpheme information (title prefix), from the "Building Center" derived from "Kanto Incombustible Building Center", even though both have the same surface form, and can therefore correctly retain what is needed as an index.

[0107]

[Effects of the Invention] As described above, the present invention provides the following effects.
(1) Because frequency is tallied per term rather than per word, even a frequent term composed of multiple words can be deleted from the search index if its frequency is high.
(2) When indexes for high-frequency words are deleted, only those that match down to the morpheme information, or only those that satisfy the unnecessary record determination rules, are deleted. Homographs (cases where the same word is used with a different meaning) can therefore be distinguished when deciding whether to delete, and there is no risk of deleting what is needed as an index.

[Brief description of drawings]

FIG. 1 is a configuration diagram showing a first embodiment of the index derivation device of the present invention.

FIG. 2 is an operation flowchart of the device of FIG. 1.

FIG. 3 is a configuration diagram showing a second embodiment of the index derivation device of the present invention.

FIG. 4 is an operation flowchart of the device of FIG. 3.

FIG. 5 is a detailed operation flowchart of the unnecessary record deletion process in FIG. 4.

FIG. 6 is a diagram showing the actual processing in the first embodiment.

FIG. 7 is a diagram showing an example of the frequent term table.

FIG. 8 is a diagram showing the actual processing in the second embodiment.

FIG. 9 is a diagram showing an example of the unnecessary record determination rule group.

[Explanation of symbols]

1: original database; 2: search index; 3: search target database; 10, 20: index derivation device; 11: data reading unit; 12: morphological analysis unit; 13, 21: stepwise record derivation unit; 14: term aggregation unit; 15: frequent term table; 16, 23: unnecessary record deletion unit; 17: information output unit; 22: derivation information adding unit; 231: forward position word acquisition unit; 232: unnecessary record determination unit; 24: unnecessary record determination rule group.

Continuation of front page
(56) References: JP-A-7-78182 (JP, A); JP-A-6-309366 (JP, A); Tatsuo Kamio, "Automatic keyword extraction in a newspaper article database", Information Management (情報管理), Japan, July 1, 1989, Vol. 32, No. 4, pp. 283-293
(58) Fields searched (Int. Cl.7, DB name): G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. An index derivation device in a database creation device, the database creation device comprising morphological analysis means for performing morphological analysis that divides a term assigned as a search index to each record of an original database (an original index term) into word units, and stepwise record derivation means for creating terms each beginning with one of the words obtained by the analysis (derived index terms) by extracting, from the original index term, that beginning word and the words after it, the database creation device accumulating records that have at least one of the created derived index terms as a new search index to create a search target database, the index derivation device comprising: derivation-source information adding means for adding, when the stepwise record derivation means creates a derived index term, derivation-source information indicating from which original index term the derived index term was derived; term aggregation means for checking the appearance frequency of the derived index terms across the entire database; forward position word acquisition means for acquiring, from the original index term obtained from the derivation-source information, the word located in front of a word with a high appearance frequency (a forward position word); and unnecessary record deletion means for determining, according to the content of the morpheme information of the forward position word, whether to delete the corresponding derived index term from the search index, and for deleting the derived index term to be deleted from the search index.

2. The index derivation device according to claim 1, comprising unnecessary record determination means for determining the derived index term to be deleted by referring to an unnecessary record determination rule group that has morpheme information as a key and expresses deletion conditions according to the content of the morpheme information of the forward position word of a derived index term.

3. An index derivation method in a database creation method, the database creation method performing morphological analysis that divides a term assigned as a search index to each record of an original database (an original index term) into word units, creating terms each beginning with one of the words obtained by the analysis (derived index terms) by extracting, from the original index term, that beginning word and the words after it, and accumulating records that have at least one of the created derived index terms as a new search index to create a search target database, the index derivation method comprising: when creating a derived index term, adding derivation-source information indicating from which original index term the derived index term was derived; checking the appearance frequency of the derived index terms across the entire database; acquiring, from the original index term obtained from the derivation-source information, the word located in front of a term with a high appearance frequency (a forward position word); determining, according to the content of the morpheme information of the forward position word, whether to delete the corresponding derived index term from the search index; and deleting the derived index term to be deleted from the search index.

4. The index derivation method according to claim 3, wherein the derived index term to be deleted is determined by referring to an unnecessary record determination rule group that has morpheme information as a key and expresses deletion conditions according to the content of the morpheme information of the forward position word of a derived index term.

5. A computer-readable medium on which is recorded an index derivation program used in database creation, the database creation performing morphological analysis that divides a term assigned as a search index to each record of an original database (an original index term) into word units, creating terms each beginning with one of the words obtained by the analysis (derived index terms) by extracting, from the original index term, that beginning word and the words after it, and accumulating records that have at least one of the created derived index terms as a new search index to create a search target database, wherein the index derivation program, when read by a computer, causes the computer to execute operations of: when creating a derived index term, adding derivation-source information indicating from which original index term the derived index term was derived; checking the appearance frequency of the derived index terms across the entire database; acquiring, from the original index term obtained from the derivation-source information, the word located in front of a term with a high appearance frequency (a forward position word); determining, according to the content of the morpheme information of the forward position word, whether to delete the corresponding derived index term from the search index; and deleting the derived index term to be deleted from the search index.

6. The computer-readable medium on which the index derivation program is recorded according to claim 5, wherein the program causes the computer to execute an operation of determining the derived index term to be deleted by referring to an unnecessary record determination rule group that has morpheme information as a key and expresses deletion conditions according to the content of the morpheme information of the forward position word of a derived index term.