JP2003228571A

JP2003228571A - Method of counting appearance frequency of character string, and device for using the method

Info

Publication number: JP2003228571A
Application number: JP2002026458A
Authority: JP
Inventors: Kyoji Umemura; 恭司梅村
Original assignee: Individual
Current assignee: Individual
Priority date: 2001-11-28
Filing date: 2002-02-04
Publication date: 2003-08-15

Abstract

PROBLEM TO BE SOLVED: To address the lack, in the prior art, of an effective method for counting the number of documents that contain a character string two or more times.

SOLUTION: A suffix array generating part 12 obtains the suffixes of a plurality of document files 26 stored in a document database 24, generates a suffix array in which the set of all suffixes is arranged in dictionary order, and stores it in the document database 24 as a suffix array file 28. A character string frequency counting part 14 refers to the suffix array file 28, detects the classes into which the suffixes are classified, determines the multiplicity of character string occurrences within each class, and thereby counts the character string frequency with multiplicity. A document frequency calculating part 16 obtains the number of documents in which a character string appears k or more times from the difference between the multiplicity-qualified character string frequencies counted by the character string frequency counting part 14.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a technique for counting the frequency of occurrence of symbol strings, and more particularly to a method for counting the frequency of occurrence of character strings contained in documents and to an apparatus that can use the method.

[0002]

[Prior Art] With the spread of the Internet, databases of various kinds are increasingly used in networked environments, and practical applications such as electronic libraries built on literature databases are becoming widespread. To use a database that stores a large number of such electronic documents efficiently, it is necessary to extract appropriate keywords from the stored documents and either attach an index that makes the documents easy to search or classify the documents on the basis of the keywords.

[0003] In an agglutinative language such as Japanese, keyword extraction begins by decomposing the document into a sequence of words. Morphological analysis is sometimes used to determine word boundaries, but segmenting words correctly requires improving the accuracy of the analysis, for example by enriching the word dictionary or applying complex grammatical rules. In addition, extracting keywords that are suitable as index terms requires statistically analyzing the occurrence patterns of words and selecting useful words that characterize a document.

[0004] Such statistical analysis uses statistics such as the frequency of occurrence of character strings contained in documents. For example, given a set of documents, the total number of occurrences cf of a character string and the number df of documents that contain the character string one or more times are used as statistics. Mikio Yamamoto and Kenneth W. Church, "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus," Computational Linguistics, Vol. 27:1, pp. 1-30, MIT Press, discloses a method for obtaining the number df of documents that contain a character string one or more times.

[0005]

[Problem to Be Solved by the Invention] When statistics on such character strings are measured, the amount of computation becomes a problem. The total number of characters N in the set of documents to be handled (hereinafter, the corpus) is very large, and the corpus has N(N-1)/2 substrings, so a naive counting method requires more computation than an ordinary computer can handle.

[0006] The above paper by Mikio Yamamoto and Kenneth W. Church proposes a method that classifies the substrings into at most 2N-1 classes on the basis of their occurrence patterns and counts the document frequency. However, that method measures the number of documents in which a character string appears one or more times, and it cannot be applied to measuring the number of documents in which a character string appears two or more times. A more meaningful statistic about a character string can sometimes be obtained from the number of documents in which it appears two or more times, but no method for counting this efficiently is known.

[0007] The present invention has been made in view of these circumstances, and its object is to provide a counting method for efficiently measuring the frequency of occurrence of character strings and a counting apparatus that can use the method.

[0008]

[Means for Solving the Problem] One aspect of the present invention relates to a method of counting the frequency of occurrence of character strings. For a set of documents, this method counts the number of times a character string occurs k or more times (k is a natural number of 2 or more) and the number of times the character string occurs k+1 or more times, takes the difference between the two counts, and thereby obtains the number of documents that contain the character string k or more times.

[0009] The "document" referred to here may be in any language. "Document" is a broad concept that covers any file in which an arbitrary symbol string is recorded: it includes documents whose script or grammar has not necessarily been deciphered, such as ancient documents and encrypted documents, as well as sequences of arbitrary symbols such as gene sequences. Accordingly, a "character string" contained in a document is not limited to the alphabet of a language and includes any symbol string that forms a single symbol system.

[0010] The set of substrings contained in the documents may be classified into classes such that substrings belonging to the same class occur the same number of times in the same document, and the number of occurrences may be counted class by class. Using the hierarchical structure of the classes, the number of occurrences may be counted by adding the number counted in a lower class to the number counted in an upper class.

[0011] Another aspect of the present invention also relates to a method of counting the frequency of occurrence of character strings. For a corpus containing a plurality of documents, this method includes: a step of generating a suffix array, that is, the set of suffixes (character strings running from a character at some position in a document to the end of the document) arranged in dictionary order; a step of counting, among the occurrences of a given character string x, the frequency cf_k(x) of occurrences whose multiplicity is k or more (k is a natural number of 2 or more); and a step of obtaining the number df_k(x) of documents in which the character string x occurs k or more times as the difference between cf_k(x) and cf_{k+1}(x).

[0012] The suffix array is an array that stores the positions in the corpus at which the suffixes occur. Multiplicity is a value defined for each occurrence position of a character string x: an occurrence position of x has multiplicity k when there are k occurrence positions of x that belong to the same document and lie at or before that occurrence position in the index order of the suffix array.
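The link between multiplicity and document counts can be made explicit. The derivation below is an editorial sketch that relies only on the definition of multiplicity given above; the shorthand t_d, standing for the number of occurrences of x in document d, is introduced here for brevity and does not appear in the original.

```latex
% Within one document d, the t_d occurrences of x have multiplicities
% 1, 2, ..., t_d, so exactly max(0, t_d - k + 1) of them have multiplicity >= k.
\begin{align*}
cf_k(x) &= \sum_{d} \max(0,\; t_d - k + 1), \\
cf_k(x) - cf_{k+1}(x)
  &= \sum_{d} \bigl[\max(0,\; t_d - k + 1) - \max(0,\; t_d - k)\bigr] \\
  &= \sum_{d} [\, t_d \ge k \,]
   \;=\; \bigl|\{\, d \mid t_d \ge k \,\}\bigr|
   \;=\; df_k(x).
\end{align*}
```

For instance, if x occurs 3, 1, and 0 times in three documents, then cf_2(x) = 2 and cf_3(x) = 1, and the difference 1 is the number of documents that contain x at least twice.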

[0013] The method may further include a step of generating a classification of the suffix array such that character strings belonging to the same class occur the same number of times in the same document, and the frequency cf_k(x) may be counted for each class of the classification. The step of generating the classification may generate a classification that forms a hierarchical structure, and in the step of counting the frequency cf_k(x), the frequency cf_k(x) counted in a lower class may be added to the frequency counted in an upper class.

[0014] The determination of multiplicity in the step of counting the frequency cf_k(x) and the detection of classes in the step of generating the classification may be performed at the same time. When a class is defined by an interval of suffix array indices, the multiplicity may be determined for each detected class, and the frequency cf_k(x) counted, in the course of detecting the start and end of the interval of each new class.

[0015] Yet another aspect of the present invention relates to a counting apparatus. The apparatus includes a counting processing unit that counts the frequency of occurrence of character strings in a set of documents, and a recording unit that records data on the frequency of occurrence of the character strings. The counting processing unit includes: a suffix array generating unit that generates a suffix array, that is, the set of character strings running from a character at some position in a document to the end of the document, arranged in dictionary order; a character string frequency counting unit that determines the multiplicity of character string occurrences at the same time as it detects the classes of a classification of the suffix array, and thereby counts, class by class, the frequency with which a character string occurs repeatedly; and a document frequency calculating unit that calculates, from the frequency of occurrence of the character string, the frequency of documents in which the character string occurs.

[0016] Each suffix in the suffix array may have a pointer to the suffix that belongs to the same document and comes immediately before it in suffix-array order, and the character string frequency counting unit may determine the multiplicity according to whether this pointer can be followed successively as many times as the multiplicity.
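The pointer-following test can be illustrated with a small C sketch. It is not the patented implementation: the names `build_prev`, `count_cf_k`, and `count_df_k`, the use of array indices in place of pointers, and the per-suffix document-id array `doc` are all assumptions made for illustration. The sketch also assumes that an occurrence counts itself as its first occurrence, so reaching multiplicity k takes k-1 steps along the chain without leaving the interval [lo, hi] of suffix-array indices occupied by the string.

```c
#include <stdlib.h>

/* doc[p] is the document id (0 .. ndocs-1) of the suffix at suffix-array
   index p. On return, prev[p] is the nearest index q < p with
   doc[q] == doc[p], or -1 if there is none.
   Returns 0 on success, -1 on invalid input or allocation failure. */
int build_prev(const int *doc, int n, int ndocs, int *prev)
{
    if (ndocs <= 0) return -1;
    int *last = malloc((size_t)ndocs * sizeof *last);
    if (last == NULL) return -1;
    for (int d = 0; d < ndocs; d++) last[d] = -1;
    for (int p = 0; p < n; p++) {
        prev[p] = last[doc[p]];   /* most recent suffix of the same document */
        last[doc[p]] = p;
    }
    free(last);
    return 0;
}

/* Number of indices p in [lo, hi] whose multiplicity within [lo, hi]
   is at least k (k >= 1). */
int count_cf_k(const int *prev, int lo, int hi, int k)
{
    int count = 0;
    for (int p = lo; p <= hi; p++) {
        int q = p;
        int steps = 1;            /* the occurrence itself is the first one */
        while (steps < k && prev[q] >= lo) {
            q = prev[q];
            steps++;
        }
        if (steps >= k) count++;
    }
    return count;
}

/* Documents that contain the string at least k times: cf_k - cf_{k+1}. */
int count_df_k(const int *prev, int lo, int hi, int k)
{
    return count_cf_k(prev, lo, hi, k) - count_cf_k(prev, lo, hi, k + 1);
}
```

With document ids {0, 1, 0, 0, 1, 2} over the interval [0, 5], document 0 contributes three occurrences, document 1 two, and document 2 one, so `count_df_k(prev, 0, 5, 2)` counts the two documents that contain the string at least twice.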

[0017] The character string frequency counting unit may classify the suffix array into a hierarchical class structure and add the frequency of occurrence of the character string counted in a lower class to the frequency counted in an upper class.

[0018] The document frequency calculating unit may obtain the frequency of documents in which the character string occurs k or more times (k is a natural number) as the difference between the frequency of occurrences of the character string whose multiplicity is k or more and the frequency of occurrences whose multiplicity is k+1 or more.

[0019] The apparatus may further include a keyword extracting unit that extracts keywords from the documents using the frequency of documents in which the character string occurs k or more times (k is a natural number). In particular, the frequency of documents in which a character string occurs two or more times may be used.

[0020] Yet another aspect of the present invention relates to a computer program. For a corpus containing a plurality of documents, the program causes a computer to execute: a step of generating a suffix array, that is, the set of character strings running from a character at some position in a document to the end of the document, arranged in dictionary order; a step of detecting the classes of the hierarchical class structure of the suffix array and counting, class by class, the frequency cf_k(x) of those occurrences of a given character string x whose multiplicity is k or more (k is a natural number of 2 or more); a step of adding the frequency cf_k(x) counted in a lower class to the frequency counted in an upper class; and a step of obtaining the number df_k(x) of documents in which the character string x occurs k or more times as the difference between the frequency cf_k(x) of occurrences whose multiplicity is k or more and the frequency cf_{k+1}(x) of occurrences whose multiplicity is k+1 or more.

[0021] Yet another aspect of the present invention relates to a method of counting the frequency of occurrence of symbol strings. For a set containing a plurality of symbol sequences, this method classifies the set of partial symbol strings contained in the sequences into hierarchical classes such that partial symbol strings belonging to the same class occur the same number of times in the same sequence, counts class by class the number of times an arbitrary symbol string occurs k or more times (k is a natural number of 2 or more), and obtains the number of sequences that contain the symbol string k or more times.

[0022] Any combination of the above constituent elements, and any conversion of the expression of the present invention between a method, an apparatus, a server, a system, a computer program, a recording medium, and the like, is also effective as an aspect of the present invention.

[0023]

[Embodiments of the Invention] Before the counting method and counting apparatus according to the embodiment are described, the notation is defined and the underlying concepts are explained.

[0024] 1. Definition of Notation

tf(d, x) is defined as the number of occurrences of the character string x in the document d. The following two statistics can be computed from tf(d, x) and are commonly used in information retrieval.

cf(x): the number of times the character string x occurs in the corpus
cf(x) = Σ_d tf(d, x)

df(x): the number of documents in which the character string x occurs one or more times
df(x) = |{d | tf(d, x) ≥ 1}|

[0025] The following statistics used in the present invention are also obtained from tf(d, x).

df_2(x): the number of documents in which the character string x occurs two or more times
df_2(x) = |{d | tf(d, x) ≥ 2}|

df_3(x): the number of documents in which the character string x occurs three or more times
df_3(x) = |{d | tf(d, x) ≥ 3}|

In general,

df_k(x): the number of documents in which the character string x occurs k or more times
df_k(x) = |{d | tf(d, x) ≥ k}|
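The definitions of tf, cf, and df_k can be checked directly on a small corpus. The following C sketch is a naive reference implementation written for illustration only; it scans every document with `strstr` and is not the efficient method the invention is concerned with. Representing the corpus as an array of NUL-terminated strings and counting overlapping occurrences separately are assumptions of this sketch.

```c
#include <string.h>

/* tf(d, x): number of occurrences of x in document d.
   Overlapping occurrences are counted separately. */
int tf(const char *d, const char *x)
{
    int count = 0;
    if (*x == '\0') return 0;     /* the empty string is not counted */
    for (const char *p = strstr(d, x); p != NULL; p = strstr(p + 1, x))
        count++;
    return count;
}

/* cf(x): total number of occurrences of x over all documents. */
int cf(const char *const *docs, int ndocs, const char *x)
{
    int total = 0;
    for (int i = 0; i < ndocs; i++)
        total += tf(docs[i], x);
    return total;
}

/* df_k(x): number of documents in which x occurs at least k times.
   df(x) in the text corresponds to k = 1. */
int df_k(const char *const *docs, int ndocs, const char *x, int k)
{
    int n = 0;
    for (int i = 0; i < ndocs; i++)
        if (tf(docs[i], x) >= k)
            n++;
    return n;
}
```

For the three documents "abab", "ab", and "cd" and the string "ab", tf takes the values 2, 1, and 0, so cf is 3, df is 2, and df_2 is 1.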

[0026] The main use of df_2(x) is to compute df_2(x)/df(x) as "an estimate of the conditional probability, over the distribution of documents, that a document contains a character string two or more times given that it contains the string one or more times." As shown in reference [6], when applied to words this conditional probability can be used as an indicator of the nature of a word. df_3 and df_k are extensions of df_2; they can be used to estimate analogous conditional probabilities and to cross-check the estimates.

[0027] 2. Suffix Array

The suffix array is the data structure presented in reference [1]. As shown in the example of Fig. 1, given a text, this data structure takes the set of character strings that run from a character at some position in the text to the end of the corpus (each such string is called a suffix) and arranges that set in dictionary order. Every suffix is uniquely determined by its starting position, so, provided the body of the text is in memory, storing one suffix requires only a single integer that indicates where the string starts. In the example of the figure, for the text "abcabedcabcde", suffix[6] is "cabcde"; its starting position is position 7 of the text array (counting from 0), so the value of suffix[6] is 7. With a suffix array, the position of any substring can thus be found, yet the required storage is only O(N), where N is the size of the text.

[0028] A suffix array can be generated by the following routine.

    #include <stdlib.h>
    #include <string.h>

    /* size: number of characters in the corpus; text: pointer to the start
       of the corpus; suffix: array of size elements (declared elsewhere) */
    struct suffix_struct { int position; };

    int suffix_compare(const void *x, const void *y)
    {
        /* position is the offset in text at which each suffix starts */
        const struct suffix_struct *a = x, *b = y;
        return strcmp(text + a->position, text + b->position);
    }

    for (i = 0; i < size; i++) {
        suffix[i].position = i;
    }
    qsort(suffix, size, sizeof(struct suffix_struct), suffix_compare);
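As a self-contained illustration, the routine can be packaged as a function and applied to the example text "abcabedcabcde" of Fig. 1. In this sketch the suffix array is a plain array of `int` start positions rather than an array of structures, and the names `build_suffix_array`, `compare_suffixes`, and `g_text` are hypothetical; the file-scope pointer is used only because `qsort` offers no way to pass context to its comparator.

```c
#include <stdlib.h>
#include <string.h>

static const char *g_text;   /* text being indexed; read by the comparator */

/* Orders two start positions by the suffixes that begin there. */
static int compare_suffixes(const void *x, const void *y)
{
    int a = *(const int *)x;
    int b = *(const int *)y;
    return strcmp(g_text + a, g_text + b);
}

/* Fills sa[0..n-1] with the start positions of all suffixes of text
   (a NUL-terminated string of length n) in dictionary order. */
void build_suffix_array(const char *text, int *sa, int n)
{
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_text = text;
    qsort(sa, (size_t)n, sizeof sa[0], compare_suffixes);
}
```

For the 13-character text "abcabedcabcde" this produces the order {0, 8, 3, 1, 9, 4, 7, 2, 10, 6, 11, 12, 5}, so the entry at index 6 is 7 and the suffix starting there is "cabcde", in agreement with the example in the text.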

[0029] When document frequency is computed, the character strings in the corpus can be regarded as delimited document by document, provided the length of a document is bounded. Under this condition, building the data structure with the above algorithm takes O(N log N) time.

[0030] In general, when the corpus has no document structure, it is difficult to bound the cost of the string comparisons in the simple method, but an efficient algorithm for generating suffix arrays is known (reference [2]), and with that method the cost of building the suffix array is O(N log N).

[0031] With this data structure, given an arbitrary character string, a binary search can be used to locate where the string occurs, and the list of occurrence positions is obtained in O(log N) time. An occurrence position here is an integer value that corresponds one-to-one to a location in the corpus, and, by the nature of the suffix array, it also corresponds to exactly one suffix.
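The binary search can be sketched in C as a pair of bound searches over the suffix array. The function names `lower_bound`, `upper_bound`, and `find_occurrences` are hypothetical, and the suffix array is assumed to be a plain array of `int` start positions. Each probe compares at most strlen(x) characters, so the number of probes, not the total character work, is what grows as log N.

```c
#include <string.h>

/* First suffix-array index whose suffix, truncated to len characters,
   is not less than x. */
static int lower_bound(const char *text, const int *sa, int n,
                       const char *x, size_t len)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (strncmp(text + sa[mid], x, len) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* First suffix-array index whose suffix, truncated to len characters,
   is greater than x. */
static int upper_bound(const char *text, const int *sa, int n,
                       const char *x, size_t len)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (strncmp(text + sa[mid], x, len) <= 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Stores in *first and *last the interval of suffix-array indices whose
   suffixes start with x, and returns the number of occurrences.
   When the count is 0 the interval is empty (*last == *first - 1). */
int find_occurrences(const char *text, const int *sa, int n,
                     const char *x, int *first, int *last)
{
    size_t len = strlen(x);
    int lo = lower_bound(text, sa, n, x, len);
    int hi = upper_bound(text, sa, n, x, len);
    *first = lo;
    *last = hi - 1;
    return hi - lo;
}
```

With the suffix array of the text "abcabedcabcde", namely {0, 8, 3, 1, 9, 4, 7, 2, 10, 6, 11, 12, 5}, searching for "abc" yields the interval [0, 1], that is, two occurrences, whose start positions 0 and 8 are read from the array.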

[0032] 3. Classification of Character Strings

The counting algorithm according to the embodiment uses the string classification method of reference [3]. Character strings x are classified so that all strings in the same class have the same value of tf(d, x). The classification is defined in terms of suffixes. Because the suffixes in a suffix array are arranged in dictionary order, a suffix often shares its leading characters with the next suffix. Accordingly, let common[i] be the length of the common prefix of suffix[i] and suffix[i+1]. The definition of a class is given below.
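A direct way to obtain common[] is to compare each pair of adjacent suffixes character by character. The C sketch below does exactly that; it is a naive illustration, not an optimized construction, and the name `compute_common` and the representation of the suffix array as `int` start positions are assumptions. Only the entries for indices 0 to n-2 are written, since the last suffix has no successor.

```c
/* Fills common[0..n-2], where common[i] is the length of the common
   prefix of the suffixes starting at sa[i] and sa[i+1] in text.
   text must be NUL-terminated. */
void compute_common(const char *text, const int *sa, int n, int *common)
{
    for (int i = 0; i + 1 < n; i++) {
        const char *a = text + sa[i];
        const char *b = text + sa[i + 1];
        int len = 0;
        /* stop at the end of a or at the first differing character */
        while (a[len] != '\0' && a[len] == b[len])
            len++;
        common[i] = len;
    }
}
```

For the text "abcabedcabcde" and its suffix array {0, 8, 3, 1, 9, 4, 7, 2, 10, 6, 11, 12, 5}, the first two suffixes "abcabedcabcde" and "abcde" share the prefix "abc", so common[0] is 3.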

[0033] To keep the definition simple, min_{k=i}^{j-1} is taken to be ∞ when j-1 < i, and common[-1] = -1 and common[N] = -1. The larger of the common values at the boundaries of the interval [i, j] is defined as outgoing(i,j) = max(common[i-1], common[j]), and the smallest of the common values inside the interval [i, j] is defined as inner(i,j) = min_{k=i}^{j-1}(common[k]).

[0034] [Definition] The interval [i, j] forms a class when inner(i,j) > outgoing(i,j). inner(i,j) is the length of the character string common to the whole interval, and inner(i,j) > outgoing(i,j) means that widening the interval shortens the string common to the whole interval.

[0035] The example suffix array of Fig. 2 shows the value of common for each suffix. Consider whether the intervals [i, j] = [2, 2], [i, j] = [1, 4], and [i, j] = [1, 3] of this example form classes.

[0036]

outgoing(2,2) = max(common[1], common[2]) = max(6, 3) = 6
inner(2,2) = min_{k=2}^{1}(common[k]) = ∞
outgoing(1,4) = max(common[0], common[4]) = max(2, 0) = 2
inner(1,4) = min_{k=1}^{3}(common[k]) = 3
outgoing(1,3) = max(common[0], common[3]) = max(2, 6) = 6
inner(1,3) = min_{k=1}^{2}(common[k]) = 3

Thus the interval [2, 2] satisfies inner(2,2) > outgoing(2,2) and the interval [1, 4] satisfies inner(1,4) > outgoing(1,4), so both form classes, whereas the interval [1, 3] has inner(1,3) < outgoing(1,3) and does not form a class. Fig. 3 illustrates the case of the interval [1, 4]. Determining in this way whether each interval forms a class shows that classes are formed in the intervals shown in Fig. 4. The intervals [0, 0], [1, 1], [2, 2], [1, 2], and so on shown in the figure all form classes, and, as described later, the intervals that form classes make up a hierarchical structure.
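The class test can be written out in C and checked against the worked values. In this sketch the function names `common_at`, `outgoing`, `inner`, and `forms_class` are hypothetical, `INT_MAX` stands in for ∞, and any index outside the stored array is treated as -1, following the boundary convention of the definition. The array used in the usage note holds only the five common values quoted in the text (common[0..4] = 2, 6, 3, 6, 0); treating them as the whole array is an assumption, since Fig. 2 is not reproduced here.

```c
#include <limits.h>

/* common[k] for 0 <= k < m; indices outside that range yield -1. */
static int common_at(const int *common, int m, int k)
{
    return (k < 0 || k >= m) ? -1 : common[k];
}

/* Larger of the common values at the two boundaries of [i, j]. */
int outgoing(const int *common, int m, int i, int j)
{
    int a = common_at(common, m, i - 1);
    int b = common_at(common, m, j);
    return a > b ? a : b;
}

/* Minimum of common[i..j-1]; INT_MAX when the range is empty. */
int inner(const int *common, int m, int i, int j)
{
    int best = INT_MAX;
    for (int k = i; k < j; k++) {
        int c = common_at(common, m, k);
        if (c < best)
            best = c;
    }
    return best;
}

/* Nonzero when the interval [i, j] forms a class. */
int forms_class(const int *common, int m, int i, int j)
{
    return inner(common, m, i, j) > outgoing(common, m, i, j);
}
```

With the five values above, `forms_class` accepts the intervals [2, 2] and [1, 4] and rejects [1, 3], matching the calculation in the text.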

[0037] When the interval [i, j] forms a class, the set of substrings common to the interval [i, j] whose lengths range from outgoing(i,j)+1 to inner(i,j) is defined as the class of character strings corresponding to that interval. Fig. 5 shows the character strings that belong to the classes of the intervals [1, 4], [1, 2], and [1, 1] in the example of Fig. 4. For the interval [1, 4], outgoing is 2 and inner is 3, so the string "aab" is the element of the class. For the interval [1, 2], outgoing is 3 and inner is 6, so the strings "aabb", "aabbc", and "aabbcc" are the elements of the class. For the interval [1, 1], outgoing is 6 and inner is infinite, so the strings "aabbccd", "aabbccdd", and so on are the elements of the class.
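Enumerating the members of a class amounts to taking the prefixes of any suffix in the interval whose lengths run from outgoing+1 to inner. The C sketch below does this for one suffix; the function name `class_members`, the comma-separated output format, and the sample suffix "aabbccdd" (a stand-in for the suffix of Fig. 2, which is not reproduced here) are assumptions for illustration. An inner value larger than the suffix length, as in the infinite case, is clamped to the suffix length.

```c
#include <stdio.h>
#include <string.h>

/* Writes the prefixes of suffix with lengths outgoing+1 .. inner into buf,
   separated by commas, and returns how many were written in full.
   Stops early if buf (of bufsize bytes) is too small; the tail of buf
   may then hold a truncated entry. */
int class_members(const char *suffix, int outgoing, int inner,
                  char *buf, size_t bufsize)
{
    int len = (int)strlen(suffix);
    int count = 0;
    size_t used = 0;

    if (bufsize > 0)
        buf[0] = '\0';
    if (inner > len)
        inner = len;              /* clamp the "infinite" case */
    for (int n = outgoing + 1; n <= inner; n++) {
        int w = snprintf(buf + used, bufsize - used, "%s%.*s",
                         count ? "," : "", n, suffix);
        if (w < 0 || (size_t)w >= bufsize - used)
            break;                /* output did not fit */
        used += (size_t)w;
        count++;
    }
    return count;
}
```

Calling `class_members("aabbccdd", 3, 6, buf, sizeof buf)` with a sufficiently large buffer writes "aabb,aabbc,aabbcc", the three members listed for the interval [1, 2].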

[0038] In what follows, the occurrence pattern of a character string means the functional form of tf(d, x) given a character string x and a document d. In general, a text of length N has N(N-1)/2 substrings, and tabulating all of them is not practical, but grouping the substrings that have the same occurrence pattern reduces the number. According to reference [3], the number of classes is at most 2N-1, which is small enough that the classification table can actually be built and stored.

[0039] [Definition of Occurrence(C)] For the interval [i, j] determined by a class C, the set suffix[i], ..., suffix[j] is denoted Occurrence(C).

[0040] [Property 1] Given a class C, any two elements x and y of C, and any document d, tf(d, x) = tf(d, y). Proof: To compute tf(d, x), one can obtain, for the class C to which x belongs, the set Occurrence(C) of positions at which x occurs, and determine tf(d, x) from it. If y belongs to the same class as x, Occurrence(C) is the same, so tf(d, x) = tf(d, y).

[0041] By this property, for each class shown in Fig. 5, the number tf of times an arbitrary element of the class is contained in the same document is obtained as illustrated. For example, for the interval [1, 2], the number tf of times the elements "aabb", "aabbc", and "aabbcc" of its class are contained in the same document equals the length of this interval, 2.

[0042] [Property 2] Given a class C, the following hold for any two elements x and y of C:

df(x) = df(y)
df_2(x) = df_2(y)
df_n(x) = df_n(y) (n ≥ 3)
cf(x) = cf(y)

[0043] Proof: By Property 1, tf(d, x) = tf(d, y), so

df(x) = |{d | tf(d, x) ≥ 1}| = |{d | tf(d, y) ≥ 1}| = df(y)
df_2(x) = |{d | tf(d, x) ≥ 2}| = |{d | tf(d, y) ≥ 2}| = df_2(y)
df_n(x) = |{d | tf(d, x) ≥ n}| = |{d | tf(d, y) ≥ n}| = df_n(y)
cf(x) = Σ_d tf(d, x) = Σ_d tf(d, y) = cf(y)

[0044] [Property 3] All substrings can be classified into at most 2N-1 classes. The proof is given in reference [3]; in outline, it shows that there are N classes of character strings that occur only once and that, however much the strings repeat, at most N-1 classes of repeated strings can be formed.

[0045] 4. Hierarchical Relationship between Classes

When the interval [i1, j1] forms a class C1, the interval [i2, j2] forms a class C2, and the interval [i1, j1] is contained in the interval [i2, j2], C1 is defined to be a lower class of C2, and C2 is defined to be an upper class of C1.

【００４６】[Property 4] When two classes C1 and C2 intersect, C1 is either an upper class of C2 or a lower class of C2.

【００４７】Proof: If C1 and C2 intersect, one of the following holds:
i1 ≤ i2 ≤ j1 ≤ j2 (1)
i2 ≤ i1 ≤ j2 ≤ j1 (2)
i1 ≤ i2 ≤ j2 ≤ j1 (3)
i2 ≤ i1 ≤ j1 ≤ j2 (4)

【００４８】In case (1), assume i1 < i2. For the interval [i1, j1],
max(common[i1-1], common[j1]) < min_{k1=i1}^{j1-1} common[k1]
holds, so common[j1] < common[k1] (i1 ≤ k1 ≤ j1-1). For the interval [i2, j2], on the other hand, there exist k1 and k2 such that k1 = i2-1 and k2 = j1 (i2 ≤ k2 ≤ j2-1). Hence
common[k1] = common[i2-1] > common[k2] = common[j1]
and the interval [i2, j2] does not satisfy
max(common[i2-1], common[j2]) < min_{k2=i2}^{j2-1} common[k2].
Thus, when i1 < i2, the interval [i2, j2] does not form a class C2, and C1 and C2 do not intersect.

【００４９】When i1 = i2 ≤ j1 ≤ j2, by the definition of the class hierarchy, C2 is an upper class of C1 or the two classes are equal.

【００５０】Case (2) can be proved in the same way as case (1). In case (3), by the definition of the class hierarchy, C1 is an upper class of C2 or equal to it. In case (4), C2 is an upper class of C1 or equal to it.

【００５１】Therefore, when two classes intersect, one is an upper class or a lower class of the other.

【００５２】[Property 5] In the Suffix Array, all suffixes form a hierarchical structure by classes. Proof: Since common[-1] = common[N] = -1, the topmost class is the class that contains all suffixes. By Property 4, two classes do not intersect unless one is a subclass of the other. A subclass has a shorter interval than its upper class.

【００５３】From the above, the occurrence positions of all character strings form a hierarchical structure by character string class.

【００５４】[Property 6] For any interval [i, j], there is an interval containing [i, j] that forms a class. For an interval [i, i], outgoing(i, i) < ∞ and inner(i, i) = ∞, so inner(i, i) > outgoing(i, i), and the interval [i, i] forms a lowest class consisting of a single suffix. Proof: By Property 5, all suffixes of the Suffix Array form a hierarchical structure by classes.

【００５５】[Notation] For any interval [i, j], among the class-forming intervals that contain it, the lowest one is called the class determined by [i, j] and is written Class*([i, j]).

【００５６】The fact that the lowest class containing a given interval is uniquely determined is needed to construct the counting algorithm with reduced computational cost described later.

【００５７】5. Problems in measuring document frequency
Computing, by a simple method, the document frequency of the character strings belonging to every class is impractical on an ordinary computer. Even though the number of classes is at most 2N, if a set satisfying the definition of each statistic is built and its size measured, as when measuring df(x), df_2(x), and df_3(x), processing each x takes O(N) time, and with N such x the total is O(N^2) time. Given the size of a corpus, this cannot be carried out on an ordinary computer.

【００５８】With a Suffix Array, the occurrence positions of a character string can be identified, and the document frequency of any character string can be computed by counting the documents at those positions. This is faster than the simple method, but it is still impractical when the corpus contains many high-frequency character strings. Even when the occurrence positions are known, finding the number of documents that contain the string requires checking for duplicates, that is, whether another occurrence lies in the same document. This check takes time for strings with many occurrence positions. Computing the document frequency of all character strings includes high-frequency strings such as single characters, and these become the bottleneck of the computation.

【００５９】For plain occurrence frequency, totals can be accumulated along the class hierarchy, but the number of documents in which a character string appears cannot be summed directly. Consider the corpus of FIG. 6(a). The string "abc" occurs 6 times in the whole corpus, in 4 documents. The string "abx" occurs 7 times in the whole corpus, in 5 documents. If the only patterns following "ab" are "abc" and "abx", then, as shown in FIG. 6(b), the occurrence count cf(ab) is cf(abc) + cf(abx) = 13. However, computing the number of documents df(ab) as df(abc) + df(abx) gives 9, which is wrong, because documents in which both "abc" and "abx" appear are counted twice. The correct value is 6. If the condition is instead the number of documents in which the string occurs two or more times, the change in document frequency between upper and lower classes becomes more complicated still, and counting without duplication becomes difficult.

【００６０】6. Overlap degree of occurrence positions
In preparation for counting document frequency, an overlap degree is defined for each occurrence position of a character string. The purpose of introducing the overlap degree is to define a new statistic that can be summed while traversing the class hierarchy, so that document frequency can be obtained as a function of that statistic.

【００６１】The occurrence positions of all character strings can be ordered by their index in the Suffix Array. This order is called the suffix order and is used to define the overlap degree of an occurrence position.

【００６２】[Definition] The overlap degree of an occurrence position of a character string x is k when there are k occurrence positions of x that are at or before that position in suffix order and belong to the same document.

【００６３】FIGS. 7(a) and 7(b) show an example of overlap degrees. In the suffix order of the Suffix Array shown in FIG. 7(b), the positions at or before abx (suffix[3]) are abc (suffix[0]), abd (suffix[1]), abe (suffix[2]), and abx (suffix[3]). For suffix[3], the overlap degree k of the string "ab" in document #1 shown in FIG. 7(a) is 3, because the strings "abc", "abd", and "abx" occur in document #1. FIG. 7(b) shows the overlap degrees obtained in the same way for the other suffixes.

【００６４】[Property 7] When a character string x occurs t times in a document i, the overlap degrees of those t occurrence positions, arranged in suffix order, are 1, ..., t.

【００６５】7. Character string frequency with overlap condition
[Notation] For a character string x, the character string frequency with overlap degree is written cf_k(x), and the per-document character string frequency with overlap degree is written tf_k(d, x).
[Definition] cf_k(x) is the number of occurrences of x in the corpus whose overlap degree is k or more.
[Definition] tf_k(d, x) is the number of occurrences of x in document d whose overlap degree is k or more.

【００６６】[Property 8] ∀x, y ∈ C, ∀k, ∀d: tf_k(d, x) = tf_k(d, y)
Property 8 is clear because both sides equal the number of occurrences in Occurrence(C) that lie in d and have overlap degree k or more.

【００６７】8. Relationship between character string frequency with overlap condition and document frequency
Document frequency and character string frequency with overlap condition are related by the following simple identity.
[Theorem: relationship between character string frequency and document frequency]
df_k(x) = cf_k(x) - cf_{k+1}(x)

【００６８】Proof: When tf(d, x) = t, for k ≤ t,
tf_k(d, x) - tf_{k+1}(d, x) = 1.
When tf(d, x) = t, tf_t(d, x) = 1, tf_{t+1}(d, x) = 0, ..., so for k > t,
tf_k(d, x) - tf_{k+1}(d, x) = 0.
Since cf_k(x) = Σ_d tf_k(d, x),
cf_k(x) - cf_{k+1}(x) = Σ_d (tf_k(d, x) - tf_{k+1}(d, x)) = |{d | tf(d, x) ≥ k}| = df_k(x).

【００６９】FIGS. 8(a) and 8(b) show an example in which cf_k and df_k were actually computed for a corpus. As shown in FIG. 8(a), the corpus contains three documents #1, #2, and #3, and the occurrence frequencies of the strings "ab" and "abc" in each document are tf(1, ab) = 7, tf(1, abc) = 3, tf(2, ab) = 8, tf(2, abc) = 5, tf(3, ab) = 6, and tf(3, abc) = 3.

【００７０】FIG. 8(b) shows the character string frequencies with overlap degree cf_k obtained for the strings "ab" and "abc" in this corpus, together with the document frequencies derived from them. From cf_8(ab) = 1, cf_7(ab) = 3, and cf_6(ab) = 6, it follows that df_7(ab) = cf_7(ab) - cf_8(ab) = 3 - 1 = 2 and df_6(ab) = cf_6(ab) - cf_7(ab) = 6 - 3 = 3. Likewise, from cf_5(abc) = 1, cf_4(abc) = 2, and cf_3(abc) = 5, it follows that df_4(abc) = cf_4(abc) - cf_5(abc) = 2 - 1 = 1 and df_3(abc) = cf_3(abc) - cf_4(abc) = 5 - 2 = 3.

【００７１】[Property 9] For a class C and any elements x and y of C, cf_k(x) = cf_k(y) holds for every k. Proof: Since cf_k(x) = Σ_d tf_k(d, x) and tf_k(d, x) = tf_k(d, y),
cf_k(x) = Σ_d tf_k(d, x) = Σ_d tf_k(d, y) = cf_k(y).

【００７２】9. Data structure for determining overlap degree
For each suffix, the rank of the nearest preceding suffix in suffix order that belongs to the same document is recorded as its previous link. If there is no such position, the previous link is -1. The previous-link data structure is a memory area whose size is proportional to the size of the corpus.

【００７３】Whether the overlap degree of an occurrence position of a character string x is k or more can be determined by checking whether the previous link can be followed k times from that position and, if it can, whether the character string still occurs at the position reached after following it k times.

【００７４】FIGS. 9(a), 9(b), and 9(c) illustrate previous links. FIGS. 9(a) and 9(b) show two documents #234 and #235 contained in the corpus. FIG. 9(c) shows the Suffix Array of the corpus. In FIGS. 9(a) and 9(c), arrows show how the previous links of suffix[12405] are followed within document #234. Similarly, in FIGS. 9(b) and 9(c), arrows show how the previous links of suffix[12494] are followed within document #235.

【００７５】To build this previous-link data structure, an integer array with one entry per document is prepared. For each occurrence of a character string, the document number is obtained, the previous-link information is read from the array, and at the same time the array entry is updated to the current position.

【００７６】Building the previous-link data structure requires a memory area of the same size as the number of documents and a single scan of the entire corpus. It can be generated by the following program:
/* id_max: number of documents; size: number of characters in the corpus */
/* no suffix has been seen yet for any document */
for (i = 0; i < id_max; i++) {
    last_suffixes[i] = -1;
}
/* scan in suffix order, linking each suffix to the last one of its document */
for (i = 0; i < size; i++) {
    suffix[i].previous_suffix = last_suffixes[suffix[i].id];
    last_suffixes[suffix[i].id] = i;
}

【００７７】Overlap can be determined by whether this previous link can be followed k times and whether the character string is then still in the same document. To limit the amount of computation, the counting program according to the embodiment does not determine the overlap degree in isolation. As described later, it performs overlap-degree determination and class detection simultaneously.

【００７８】10. Class detection algorithm
The algorithm for detecting classes is outlined below.
(1) Examine the Suffix Array in increasing suffix order.
(2) Find the start position of a class.
(3) Look for the end position of the class.
(3-a) Because classes are nested, the start position of another class may be found before the end position of the current class. In that case, push that start position onto the stack.
(3-b) When the end position of a class is found, report it and restore the stack.

【００７９】While classes are being found by the above algorithm, a class whose start has been found but whose end has not yet been found is called a "class under computation". In the algorithm, the classes on the stack correspond to the current classes under computation.

【００８０】common[i] is an array of the same size as the character string of the corpus. Each entry is the length over which a suffix matches the next suffix in the Suffix Array. Computing this array is expensive when the matching lengths are long, but with the method of reference [2] common[i] is obtained at the same time as the Suffix Array is computed. The order of the computational cost therefore does not increase.

【００８１】Character string classes follow the rise and fall of common[i]. When common[i] increases, the class currently under computation is registered as a class whose computation is not yet finished, and processing proceeds as though a new class has started.

【００８２】When common[i] decreases, on the other hand, there are the following two cases: (1) the current class ends, but there is another class, besides the current one, that started at the same suffix position as the current class; (2) the current class ends and processing of the class at the top of the stack must be resumed.

【００８３】In the second case, when processing of the partially computed class at the top of the stack is resumed, it begins with a check of whether that class also ends immediately.

【００８４】To find classes, the end-of-class test is performed for each common[i]. In the second case the test is repeated once for each class under computation, but the total number of repetitions never exceeds the maximum number of classes. Given the maximum number of classes and the number of entries of common[i], this operation therefore completes with O(N) computation.

【００８５】The following property holds for the classes currently under computation. [Property 10] If the character strings that start at the suffix currently being processed and extend up to the document delimiter are grouped by the class to which they belong, those classes are the classes currently under computation.

【００８６】11. Simple counting of character string frequency with overlap condition
Given an overlap degree k, consider obtaining cf_k(x), the number of occurrences with overlap degree k or more, for every character string x. The overlap degree is a function of character string and position, but character strings in the same class never differ in overlap degree. In addition, cf_k(x) is equal for character strings in the same class. It is therefore possible to prepare as many counters as there are classes and to count by processing each suffix in turn. This is called the "simple method". The memory this method requires is O(N), but its computation time is a problem.

【００８７】The counting proceeds as follows: for a given position, find the set of classes that start there, prepare a counter for every class, determine for each class whether the overlap degree is k or more, and add 1 to its counter. If this is done naively, a single suffix may be associated with many classes, so the computation may not stay within O(N log N). Specifically, this happens with data in which the same character continues over a long run.

【００８８】12. Counting character string frequency with overlap condition
If the character string frequency with overlap degree is counted by the simple method, then for each suffix all the classes it belongs to must be found and the counter of every one of them updated. The following properties make it possible to avoid updating the counters of all classes.

【００８９】[Property 11] Given a position, the set of classes corresponding to the leading character strings of the suffix at that position is determined, and these classes have a unique hierarchical relationship. [Property 12] Given a position, if among the classes corresponding to the leading character strings of the suffix at that position the overlap degree for the character strings of some class is k, then the overlap degree for every class above that class is k or more.

【００９０】Because of these two properties, only one counter addition is needed for each suffix and each overlap degree k. That is, among the classes for which the overlap degree at a given suffix is k or more, only the counter of the lowest class is incremented. When counting for a lower class finishes, its count is added to the counter of its upper class, and in this way the counts of all classes are obtained. A counting algorithm that uses this property to fuse class detection with overlap-degree determination is described below.

【００９１】13. Class discovery and frequency computation
13.1 Processing when the start of a class is found
The start of a class is found when common[i] increases. At that point, the character string frequency with overlap degree currently being measured is saved on the stack together with the other class information, and the character string frequency with overlap degree is initialized to 0 and counted afresh.

【００９２】13.2 Fusing overlap-degree determination with class selection
At a given position, the operation of identifying the lowest class among the classes whose overlap degree is greater than k can be fused with overlap-degree determination. The overlap degree is determined by whether the interval between the position i, reached by following the previous link k times, and the current position j is contained in a single document. Conversely, the set of classes containing that interval can be obtained first, and the Class*([i, j]) introduced above can then be found among them.

【００９３】This operation can furthermore be carried out at the same time as class detection. This relies on the fact that the interval [i, j] defining "the Class*([i, j]) whose overlap degree is greater than k at a given position" has the current position j as its end, so that during detection the class is one whose computation is not yet finished.

【００９４】Specifically, the occurrence of the character string reached by following the previous link k times is found first. Next, from the character strings covering that occurrence position and the initial occurrence position, the Class*([i, j]) that is common to both and under computation is identified, and the character string frequency with overlap degree of that class is incremented.

【００９５】13.3 Processing when the end of a class is found
The end of a class is found when common[i] decreases. At that point the count is passed to the upper class. Adding the count of a lower class to the counter of its upper class when counting for the lower class finishes gives the same values as adding to every class.

【００９６】14. Embodiment
An embodiment of an apparatus for executing the counting algorithm described above is now explained. FIG. 10 is a configuration diagram of the counting device 100 according to the embodiment. In terms of hardware, this configuration can be realized by the CPU, memory, and other LSIs of any computer. In terms of software, it is realized by, for example, a program with a counting function loaded into memory. The figure depicts functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, by software alone, or by a combination of the two.

【００９７】The counting processing unit 10 reads the plurality of document files 26 stored in the document database 24 and counts statistics on the occurrence frequency of character strings in the document files 26. The plurality of document files 26 form the corpus. The counting processing unit 10 has a Suffix Array generation unit 12, a character string frequency counting unit 14, and a document frequency calculation unit 16. The Suffix Array generation unit 12 obtains the suffixes of each document file 26 and generates a Suffix Array in which the set of all suffixes is arranged in dictionary order. The Suffix Array generation unit 12 also creates the common values and previous links described above for each suffix and stores them in association with that suffix. The data on the Suffix Array generated in this way is stored in the document database 24 as the Suffix Array file 28.

【００９８】The character string frequency counting unit 14 refers to the Suffix Array file 28, detects the classes into which the suffixes are classified, determines the overlap degree of the character strings for each class, and counts the character string frequency with overlap degree. The document frequency calculation unit 16 obtains the frequency of documents in which a character string occurs k or more times from the difference between the character string frequencies with overlap degree measured by the character string frequency counting unit 14. The statistics measured by the character string frequency counting unit 14 and the document frequency calculation unit 16 are stored in the document database 24 as character string appearance frequency data 30.

【００９９】The inquiry unit 18 accepts an arbitrary character string specified by the user. In response to a request from the inquiry unit 18, the document frequency calculation unit 16 performs a binary search of the table of per-class statistics contained in the character string appearance frequency data 30 and outputs data on the occurrence frequency and document frequency of the queried character string. The inquiry unit 18 provides the user with the statistical data output by the document frequency calculation unit 16.

[0100] The keyword extraction unit 22 refers to the character string appearance frequency data 30 and extracts keywords that characterize each document file 26. As described later, keyword extraction is based on word-boundary detection using the document frequency df_k and on the pattern of word occurrences, and requires no word dictionary. The extracted keywords are stored in the keyword index file 32 as a search index for the document files 26. The search unit 20 accepts a character string specified by the user as a keyword, refers to the keyword index file 32, retrieves the document files 26 that contain the specified keyword, and provides them to the user.

[0101] FIG. 11 is a flowchart of the counting process performed by the counting device 100 configured as described above. The general flow of the main part of the counting algorithm is explained with reference to this flowchart. The flowchart shows the class detection and the counting of character string frequencies with degree of duplication that the character string frequency counting unit 14 performs after the Suffix Array generation unit 12 has generated the Suffix Array data structure including the previous links.

[0102] The variable i, which is the index into the Suffix Array, is initialized to 0 (S10). It is checked whether the current value of i is the start of a new class, that is, the beginning of the interval that defines a new class (S12). If it is (Y in S12), the counters of the character string frequency with degree of duplication cf_k of the new class are initialized to 0 (S14). If it is not (N in S12), step S14 is skipped.

[0103] Next, as the counting step, the appropriate class is selected for each degree of duplication and its character string frequency counter cf_k is incremented (S16). The appropriate class here is the lowest class that contains the interval defined by the current position i and the position j of the occurrence reached by following the previous link k times.

[0104] Next, as post-processing, the end of a class is detected. It is checked whether the current value of i is the end of a class, that is, the end of the interval that defines the class (S18). If a class ends (Y in S18), the character string frequency with degree of duplication cf_k of the class that ended is recorded (S20), and it is also added to the character string frequency counter of the parent class (S22). If no class ends (N in S18), steps S20 and S22 are skipped.

[0105] To examine the next suffix, the index variable i is incremented by 1 (S24). If i equals the size of the Suffix Array (Y in S26), the counting process ends; otherwise (N in S26), the process returns to step S12 and repeats.

[0106] In this way, the character string frequency counting unit 14 examines the suffixes one by one, performs class detection and determination of the degree of duplication at the same time, and measures the character string frequency with degree of duplication. The details of each step are as already described.
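The loop of steps S10 to S26 can be sketched in Python as follows. This is a minimal illustration, not the program the document refers to. The function name `count_classes`, the explicit stack of open classes, and the linear scan of that stack are assumptions of this sketch, as is detecting a class start by comparing common[i] with the length of the innermost open class. The `PREVIOUS` and `COMMON` arrays are the previous links and common values of the four-document sample (abcabcabc, abcd, abcde, bcde) used in the execution example.

```python
def count_classes(previous, common, kmax=5):
    """Detect classes and count cf_k in one pass over the Suffix Array.

    previous[i]: number of the preceding suffix of the same document (-1 if none)
    common[i]:   length of the prefix shared by suffix i and suffix i + 1
    Returns (start, end, length, counters) for every class spanning at least
    two suffixes, in the order in which the classes end.
    """
    stack = [[0, 0, [0] * kmax]]   # open classes as [start, length, counters]
    finished = []
    for i in range(len(common)):
        # S12/S14: a longer common prefix opens a new class at i.
        if common[i] > stack[-1][1]:
            stack.append([i, common[i], [0] * kmax])
        # S16: for each k, reach j by following the previous link k times and
        # increment c[k] of the innermost open class whose interval contains j.
        j = i
        for k in range(kmax):
            if j < 0:
                break              # the link cannot be followed k times
            for cls in reversed(stack):
                if cls[0] <= j:
                    cls[2][k] += 1
                    break
            j = previous[j]
        # S18-S22: close every class longer than common[i], record it, and
        # add its counters to the parent class.
        while stack[-1][1] > common[i]:
            start, length, counters = stack.pop()
            finished.append((start, i, length, counters))
            if stack[-1][1] < common[i]:
                # The closed class lies inside a class of length common[i]
                # that has not been opened yet; open it at the same start.
                stack.append([start, common[i], [0] * kmax])
            parent = stack[-1][2]
            for k in range(kmax):
                parent[k] += counters[k]
    return finished


# Previous links and common values of the 26 suffixes of the sample corpus.
PREVIOUS = [-1, -1, -1, -1, 0, 4, 5, 1, 2, 6, 9, 10, 7,
            8, 3, 11, 15, 16, 12, 13, 14, 18, 19, 20, 22, 23]
COMMON = [0, 0, 0, 0, 3, 6, 3, 4, 0, 2, 5, 2, 3,
          4, 0, 1, 4, 1, 2, 3, 0, 1, 2, 0, 1, 0]

for start, end, length, counters in count_classes(PREVIOUS, COMMON):
    print("Class[%d, %d] L=%d c=%s" % (start, end, length, counters))
```

The linear scan of the stack keeps the sketch short; it does not attempt to meet the complexity bound stated for the actual algorithm.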

[0107] 15. Execution example
The counting algorithm described above is executed by the program described later. The result of running this program is explained here. The data processed as a sample is the following file, in which each line corresponds to one document.
abcabcabc
abcd
abcde
bcde

[0108] 15.1 First stage
In the first stage, the Suffix Array is created, common is computed, and the previous links are created. For the example above, the following data is produced. Each row shows, from left to right: the number of the suffix; the number of the document to which the suffix belongs; the number of the suffix that belongs to the same document and appeared immediately before; the length of the leading character string shared with the immediately following suffix; and the characters of the suffix itself.
0 0 -1 0:
1 1 -1 0:
2 2 -1 0:
3 3 -1 0:
4 0 0 3:abc
5 0 4 6:abcabc
6 0 5 3:abcabcabc
7 1 1 4:abcd
8 2 2 0:abcde
9 0 6 2:bc
10 0 9 5:bcabc
11 0 10 2:bcabcabc
12 1 7 3:bcd
13 2 8 4:bcde
14 3 3 0:bcde
15 0 11 1:c
16 0 15 4:cabc
17 0 16 1:cabcabc
18 1 12 2:cd
19 2 13 3:cde
20 3 14 0:cde
21 1 18 1:d
22 2 19 2:de
23 3 20 0:de
24 2 22 1:e
25 3 23 0:e
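The first-stage data can be reproduced with a short Python sketch. It is an illustration under stated assumptions only: the names `build_suffix_table` and `common_prefix_length` are hypothetical, the suffixes are sorted naively rather than with an efficient Suffix Array construction, and identical suffixes are assumed to be ordered by document number, which is what the listing suggests.

```python
def common_prefix_length(a, b):
    """Length of the leading character string shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_suffix_table(docs):
    """Return rows (number, document, previous, common, suffix).

    Every suffix of every document, including the empty suffix at the end
    of each document, is sorted in dictionary order (naive sort, assumed
    here only for brevity).
    """
    entries = sorted(
        (doc[pos:], doc_id)
        for doc_id, doc in enumerate(docs)
        for pos in range(len(doc) + 1)
    )
    last_seen = {}   # document number -> number of its latest suffix so far
    rows = []
    for i, (suffix, doc_id) in enumerate(entries):
        previous = last_seen.get(doc_id, -1)   # -1: no earlier suffix
        last_seen[doc_id] = i
        following = entries[i + 1][0] if i + 1 < len(entries) else ""
        rows.append((i, doc_id, previous,
                     common_prefix_length(suffix, following), suffix))
    return rows


docs = ["abcabcabc", "abcd", "abcde", "bcde"]
for row in build_suffix_table(docs):
    print("%d %d %d %d:%s" % row)
```

Keeping the most recent suffix number per document in a dictionary is enough to build the previous links in a single pass over the sorted suffixes.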

[0109] 15.2 Second stage
In the second stage, classes are detected and frequencies are counted. The execution result is shown in Table 1. Each <start length c0 c1 c2 c3 c4> in the execution result shows the state of a class whose calculation is still in progress. start is the number of the suffix at which the interval corresponding to the class begins, and length is the length of the longest character string belonging to the class. c0, c1, c2, c3, and c4 are the frequency counters for the character strings of that class with degree of duplication 1, 2, 3, 4, and 5, respectively.

[0110] A class of character strings is determined by the length of its shortest character string and the length of its longest character string, but a class still under calculation carries no shortest length. The length of the shortest character string belonging to a class is ultimately obtained from the hierarchical relationship among the classes. Because the shortest length is not yet known partway through a front-to-back scan, it cannot be included.

[0111] Each time before a suffix is examined, all classes under calculation are output on a single line. These are the parts of the execution result in Table 1 that begin with "<". The suffix is output next. The numbers shown for each suffix have the same meaning as in the first-stage example.

[0112] Each time one suffix is examined, the following two processes are performed.
(1) For each multiplicity, the "appropriate class" for that multiplicity is selected and its occurrence count is incremented. In the execution result this is the part that begins with "Incrementing".
(2) When a class whose interval is 2 or more is found, it is recorded and its occurrence counts are added to the parent class. In the execution result this is the part that begins with "---> Class".

[0113] Here, the "appropriate class" is the class with the shortest interval among those that contain the interval defined by (a) the current position and (b) the position reached by following the previous link as many times as the multiplicity. If the previous link cannot be followed that many times, there is no class to increment.
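The choice of the "appropriate class" can be illustrated in isolation with the Python sketch below. The function name `appropriate_class` and the representation of the open classes by their start positions are assumptions made for this example; k is the counter index c[k] used in the execution result, and `PREVIOUS` holds the previous links of the sample corpus from the first stage.

```python
def appropriate_class(open_starts, previous, i, k):
    """Return the start of the class whose counter c[k] is incremented, or None.

    open_starts: start positions of the classes still under calculation,
                 outermost first (position 0 belongs to the empty string)
    previous:    previous links of the Suffix Array
    """
    j = i
    for _ in range(k):
        j = previous[j]
        if j < 0:
            return None          # the link cannot be followed k times
    # The innermost open class that starts at or before j contains [j, i],
    # so it is the class with the shortest interval containing it.
    for start in reversed(open_starts):
        if start <= j:
            return start
    return None


PREVIOUS = [-1, -1, -1, -1, 0, 4, 5, 1, 2, 6, 9, 10, 7,
            8, 3, 11, 15, 16, 12, 13, 14, 18, 19, 20, 22, 23]

# While suffix 6 ("abcabcabc") is examined, the open classes start at 0, 4 and 5.
for k in range(5):
    print(k, appropriate_class([0, 4, 5], PREVIOUS, 6, k))
```

For suffix 6 this selects the classes starting at 5, 5, 4, and 0 for k = 0 to 3 and no class for k = 4, which agrees with the "Incrementing" lines of Table 1 for that suffix.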

[0114] A class whose interval begins and ends at the same position is not specially recorded. A character string that is not recorded appears either once or zero times, and which of the two applies can be determined in O(log N), so no record is needed.

[0115] In the execution result in Table 1, the multiplicity counts of the classes under calculation change. They change in exactly two cases: when the class is the target of an increment, and when a class that has ended hands its result over to its parent class.

[Table 1]
<S0 L0 0 0 0 0 0>
0 0 -1 0:
  -> Incrementing c[0] of Class*[0,0] : (S0,L0) S=""
<S0 L0 1 0 0 0 0>
1 1 -1 0:
  -> Incrementing c[0] of Class*[1,1] : (S0,L0) S=""
<S0 L0 2 0 0 0 0>
2 2 -1 0:
  -> Incrementing c[0] of Class*[2,2] : (S0,L0) S=""
<S0 L0 3 0 0 0 0>
3 3 -1 0:
  -> Incrementing c[0] of Class*[3,3] : (S0,L0) S=""
<S0 L0 4 0 0 0 0>
4 0 0 3:abc
  -> Incrementing c[0] of Class*[4,4] : (S4,L3) S="abc"
  -> Incrementing c[1] of Class*[0,4] : (S0,L0) S=""
<S0 L0 4 1 0 0 0><S4 L3 1 0 0 0 0>
5 0 4 6:abcabc
  -> Incrementing c[0] of Class*[5,5] : (S5,L6) S="abcabc"
  -> Incrementing c[1] of Class*[4,5] : (S4,L3) S="abc"
  -> Incrementing c[2] of Class*[0,5] : (S0,L0) S=""
<S0 L0 4 1 1 0 0><S4 L3 1 1 0 0 0><S5 L6 1 0 0 0 0>
6 0 5 3:abcabcabc
  -> Incrementing c[0] of Class*[6,6] : (S5,L6) S="abcabc"
  -> Incrementing c[1] of Class*[5,6] : (S5,L6) S="abcabc"
  -> Incrementing c[2] of Class*[4,6] : (S4,L3) S="abc"
  -> Incrementing c[3] of Class*[0,6] : (S0,L0) S=""
---> Class[5, 6] L=6 c[0]=2 c[1]=1 c[2]=0 c[3]=0 c[4]=0 S="abcabc"
<S0 L0 4 1 1 1 0><S4 L3 3 2 1 0 0>
7 1 1 4:abcd
  -> Incrementing c[0] of Class*[7,7] : (S7,L4) S="abcd"
  -> Incrementing c[1] of Class*[1,7] : (S0,L0) S=""
<S0 L0 4 2 1 1 0><S4 L3 3 2 1 0 0><S7 L4 1 0 0 0 0>
8 2 2 0:abcde
  -> Incrementing c[0] of Class*[8,8] : (S7,L4) S="abcd"
  -> Incrementing c[1] of Class*[2,8] : (S0,L0) S=""
---> Class[7, 8] L=4 c[0]=2 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="abcd"
---> Class[4, 8] L=3 c[0]=5 c[1]=2 c[2]=1 c[3]=0 c[4]=0 S="abc"
<S0 L0 9 5 2 1 0>
9 0 6 2:bc
  -> Incrementing c[0] of Class*[9,9] : (S9,L2) S="bc"
  -> Incrementing c[1] of Class*[6,9] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[5,9] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[4,9] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[0,9] : (S0,L0) S=""
<S0 L0 9 6 3 2 1><S9 L2 1 0 0 0 0>
10 0 9 5:bcabc
  -> Incrementing c[0] of Class*[10,10] : (S10,L5) S="bcabc"
  -> Incrementing c[1] of Class*[9,10] : (S9,L2) S="bc"
  -> Incrementing c[2] of Class*[6,10] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[5,10] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[4,10] : (S0,L0) S=""
<S0 L0 9 6 4 3 2><S9 L2 1 1 0 0 0><S10 L5 1 0 0 0 0>
11 0 10 2:bcabcabc
  -> Incrementing c[0] of Class*[11,11] : (S10,L5) S="bcabc"
  -> Incrementing c[1] of Class*[10,11] : (S10,L5) S="bcabc"
  -> Incrementing c[2] of Class*[9,11] : (S9,L2) S="bc"
  -> Incrementing c[3] of Class*[6,11] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[5,11] : (S0,L0) S=""
---> Class[10, 11] L=5 c[0]=2 c[1]=1 c[2]=0 c[3]=0 c[4]=0 S="bcabc"
<S0 L0 9 6 4 4 3><S9 L2 3 2 1 0 0>
12 1 7 3:bcd
  -> Incrementing c[0] of Class*[12,12] : (S12,L3) S="bcd"
  -> Incrementing c[1] of Class*[7,12] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[1,12] : (S0,L0) S=""
<S0 L0 9 7 5 4 3><S9 L2 3 2 1 0 0><S12 L3 1 0 0 0 0>
13 2 8 4:bcde
  -> Incrementing c[0] of Class*[13,13] : (S13,L4) S="bcde"
  -> Incrementing c[1] of Class*[8,13] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[2,13] : (S0,L0) S=""
<S0 L0 9 8 6 4 3><S9 L2 3 2 1 0 0><S12 L3 1 0 0 0 0><S13 L4 1 0 0 0 0>
14 3 3 0:bcde
  -> Incrementing c[0] of Class*[14,14] : (S13,L4) S="bcde"
  -> Incrementing c[1] of Class*[3,14] : (S0,L0) S=""
---> Class[13, 14] L=4 c[0]=2 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="bcde"
---> Class[12, 14] L=3 c[0]=3 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="bcd"
---> Class[9, 14] L=2 c[0]=6 c[1]=2 c[2]=1 c[3]=0 c[4]=0 S="bc"
<S0 L0 15 11 7 4 3>
15 0 11 1:c
  -> Incrementing c[0] of Class*[15,15] : (S15,L1) S="c"
  -> Incrementing c[1] of Class*[11,15] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[10,15] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[9,15] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[6,15] : (S0,L0) S=""
<S0 L0 15 12 8 5 4><S15 L1 1 0 0 0 0>
16 0 15 4:cabc
  -> Incrementing c[0] of Class*[16,16] : (S16,L4) S="cabc"
  -> Incrementing c[1] of Class*[15,16] : (S15,L1) S="c"
  -> Incrementing c[2] of Class*[11,16] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[10,16] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[9,16] : (S0,L0) S=""
<S0 L0 15 12 9 6 5><S15 L1 1 1 0 0 0><S16 L4 1 0 0 0 0>
17 0 16 1:cabcabc
  -> Incrementing c[0] of Class*[17,17] : (S16,L4) S="cabc"
  -> Incrementing c[1] of Class*[16,17] : (S16,L4) S="cabc"
  -> Incrementing c[2] of Class*[15,17] : (S15,L1) S="c"
  -> Incrementing c[3] of Class*[11,17] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[10,17] : (S0,L0) S=""
---> Class[16, 17] L=4 c[0]=2 c[1]=1 c[2]=0 c[3]=0 c[4]=0 S="cabc"
<S0 L0 15 12 9 7 6><S15 L1 3 2 1 0 0>
18 1 12 2:cd
  -> Incrementing c[0] of Class*[18,18] : (S18,L2) S="cd"
  -> Incrementing c[1] of Class*[12,18] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[7,18] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[1,18] : (S0,L0) S=""
<S0 L0 15 13 10 8 6><S15 L1 3 2 1 0 0><S18 L2 1 0 0 0 0>
19 2 13 3:cde
  -> Incrementing c[0] of Class*[19,19] : (S19,L3) S="cde"
  -> Incrementing c[1] of Class*[13,19] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[8,19] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[2,19] : (S0,L0) S=""
<S0 L0 15 14 11 9 6><S15 L1 3 2 1 0 0><S18 L2 1 0 0 0 0><S19 L3 1 0 0 0 0>
20 3 14 0:cde
  -> Incrementing c[0] of Class*[20,20] : (S19,L3) S="cde"
  -> Incrementing c[1] of Class*[14,20] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[3,20] : (S0,L0) S=""
---> Class[19, 20] L=3 c[0]=2 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="cde"
---> Class[18, 20] L=2 c[0]=3 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="cd"
---> Class[15, 20] L=1 c[0]=6 c[1]=2 c[2]=1 c[3]=0 c[4]=0 S="c"
<S0 L0 21 17 13 9 6>
21 1 18 1:d
  -> Incrementing c[0] of Class*[21,21] : (S21,L1) S="d"
  -> Incrementing c[1] of Class*[18,21] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[12,21] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[7,21] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[1,21] : (S0,L0) S=""
<S0 L0 21 18 14 10 7><S21 L1 1 0 0 0 0>
22 2 19 2:de
  -> Incrementing c[0] of Class*[22,22] : (S22,L2) S="de"
  -> Incrementing c[1] of Class*[19,22] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[13,22] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[8,22] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[2,22] : (S0,L0) S=""
<S0 L0 21 19 15 11 8><S21 L1 1 0 0 0 0><S22 L2 1 0 0 0 0>
23 3 20 0:de
  -> Incrementing c[0] of Class*[23,23] : (S22,L2) S="de"
  -> Incrementing c[1] of Class*[20,23] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[14,23] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[3,23] : (S0,L0) S=""
---> Class[22, 23] L=2 c[0]=2 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="de"
---> Class[21, 23] L=1 c[0]=3 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="d"
<S0 L0 24 20 16 12 8>
24 2 22 1:e
  -> Incrementing c[0] of Class*[24,24] : (S24,L1) S="e"
  -> Incrementing c[1] of Class*[22,24] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[19,24] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[13,24] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[8,24] : (S0,L0) S=""
<S0 L0 24 21 17 13 9><S24 L1 1 0 0 0 0>
25 3 23 0:e
  -> Incrementing c[0] of Class*[25,25] : (S24,L1) S="e"
  -> Incrementing c[1] of Class*[23,25] : (S0,L0) S=""
  -> Incrementing c[2] of Class*[20,25] : (S0,L0) S=""
  -> Incrementing c[3] of Class*[14,25] : (S0,L0) S=""
  -> Incrementing c[4] of Class*[3,25] : (S0,L0) S=""
---> Class[24, 25] L=1 c[0]=2 c[1]=0 c[2]=0 c[3]=0 c[4]=0 S="e"
<S0 L0 26 22 18 14 10>

[0116] 15.3 Third stage
From this execution result, the classes with a tf value of 2 or more are obtained, and the classes are sorted using the start position of the class as the first key and the length as the second key. At the same time, the document frequency df_k is obtained from the differences between the character string frequencies with degree of duplication cf_k. The result is shown in Table 2. In this example there are 14 classes in total with a tf value of 2 or more. For each class, the corresponding interval, the length, the tf and df_k statistics for the class, and the longest character string in the class are shown in that order.
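The third stage can be sketched in Python as follows. The relation tf = c[0] and df_k = c[k-1] - c[k] is not stated as a formula in the text; it is inferred here from comparing the "---> Class" lines of Table 1 with the rows of Table 2, and the function name `document_frequencies` is hypothetical.

```python
def document_frequencies(counters):
    """Turn the counters c[0..m] of one class into (tf, [df_1, ..., df_m]).

    Assumes tf = c[0] and df_k = c[k-1] - c[k], as the tables suggest.
    """
    tf = counters[0]
    df = [counters[k - 1] - counters[k] for k in range(1, len(counters))]
    return tf, df


# (start, end, length, counters, longest string), copied from three
# "---> Class" lines of Table 1 in the order in which they were recorded.
recorded = [
    (5, 6, 6, [2, 1, 0, 0, 0], "abcabc"),
    (7, 8, 4, [2, 0, 0, 0, 0], "abcd"),
    (4, 8, 3, [5, 2, 1, 0, 0], "abc"),
]

# First key: start of the interval; second key: length.
recorded.sort(key=lambda cls: (cls[0], cls[2]))

for start, end, length, counters, text in recorded:
    tf, df = document_frequencies(counters)
    print('Class[%d, %d] L=%d tf=%d df=%s S="%s"' % (start, end, length, tf, df, text))
```

For the class of "abc" the counters 5 2 1 0 0 give tf=5 and df_1 to df_4 of 3, 1, 1, 0, which matches the first row of Table 2.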

[0117] This list of classes does not include classes whose interval is 1. The shortest length of each class is not included either. As described later, this is handled at query time: the classes whose representative character string begins with the queried string are searched for, and the information of the class with the shortest longest-length among them is retrieved.

[0118] In sorting the classes, using the start of the interval as the first key puts them in roughly alphabetical order. When the starts of the intervals are the same, the shorter length takes precedence, and as a result the representative character strings of the classes are arranged in dictionary order.

[Table 2]
total=14
Class[ 4, 8] L=3 tf=5 df1=3 df2=1 df3=1 df4=0 S="abc"
Class[ 5, 6] L=6 tf=2 df1=1 df2=1 df3=0 df4=0 S="abcabc"
Class[ 7, 8] L=4 tf=2 df1=2 df2=0 df3=0 df4=0 S="abcd"
Class[ 9, 14] L=2 tf=6 df1=4 df2=1 df3=1 df4=0 S="bc"
Class[ 10, 11] L=5 tf=2 df1=1 df2=1 df3=0 df4=0 S="bcabc"
Class[ 12, 14] L=3 tf=3 df1=3 df2=0 df3=0 df4=0 S="bcd"
Class[ 13, 14] L=4 tf=2 df1=2 df2=0 df3=0 df4=0 S="bcde"
Class[ 15, 20] L=1 tf=6 df1=4 df2=1 df3=1 df4=0 S="c"
Class[ 16, 17] L=4 tf=2 df1=1 df2=1 df3=0 df4=0 S="cabc"
Class[ 18, 20] L=2 tf=3 df1=3 df2=0 df3=0 df4=0 S="cd"
Class[ 19, 20] L=3 tf=2 df1=2 df2=0 df3=0 df4=0 S="cde"
Class[ 21, 23] L=1 tf=3 df1=3 df2=0 df3=0 df4=0 S="d"
Class[ 22, 23] L=2 tf=2 df1=2 df2=0 df3=0 df4=0 S="de"
Class[ 24, 25] L=1 tf=2 df1=2 df2=0 df3=0 df4=0 S="e"

[0119] 15.4 Processing of a character string
For any given character string, the values of tf, df_1, df_2, df_3, and df_4 can be obtained by a binary search of Table 2 above. Because this is a binary search and the size of the table is O(N), the process completes in O(log N) time. The output obtained for a series of input strings is shown below. Each input string is followed by a line giving tf, df_1, df_2, df_3, df_4, and the string itself.
abc -- matches Class[4,8] (representative string)
5 3 1 1 0 abc
abcabc -- matches Class[5,6] (representative string)
2 1 1 0 0 abcabc
abcd -- matches Class[7,8] (representative string)
2 2 0 0 0 abcd
abca -- matches Class[5,6] (not the representative string)
2 1 1 0 0 abca
abcab -- matches Class[5,6] (not the representative string)
2 1 1 0 0 abcab
abcabc -- matches Class[5,6] (representative string)
2 1 1 0 0 abcabc
abcabca -- not in the table, present in the corpus
1 1 0 0 0 abcabca
abcabcab -- not in the table, present in the corpus
1 1 0 0 0 abcabcab
abcabcabc -- not in the table, present in the corpus
1 1 0 0 0 abcabcabc
abcabcabca -- not in the table, not present in the corpus
0 0 0 0 0 abcabcabca
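The lookup can be sketched in Python with the standard `bisect` module. This is an illustration under assumptions, not the program the document refers to: the names `TABLE`, `first_with_prefix`, and `lookup` are hypothetical, the query is assumed to be non-empty, and a naively sorted list of suffixes stands in for the Suffix Array when a string is not covered by the table.

```python
from bisect import bisect_left

# Table 2 as (representative string, tf, df1, df2, df3, df4), in dictionary order.
TABLE = [
    ("abc", 5, 3, 1, 1, 0), ("abcabc", 2, 1, 1, 0, 0), ("abcd", 2, 2, 0, 0, 0),
    ("bc", 6, 4, 1, 1, 0), ("bcabc", 2, 1, 1, 0, 0), ("bcd", 3, 3, 0, 0, 0),
    ("bcde", 2, 2, 0, 0, 0), ("c", 6, 4, 1, 1, 0), ("cabc", 2, 1, 1, 0, 0),
    ("cd", 3, 3, 0, 0, 0), ("cde", 2, 2, 0, 0, 0), ("d", 3, 3, 0, 0, 0),
    ("de", 2, 2, 0, 0, 0), ("e", 2, 2, 0, 0, 0),
]
KEYS = [row[0] for row in TABLE]

DOCS = ["abcabcabc", "abcd", "abcde", "bcde"]
# Stand-in for the Suffix Array: all non-empty suffixes, naively sorted.
SUFFIXES = sorted(doc[pos:] for doc in DOCS for pos in range(len(doc)))


def first_with_prefix(sorted_strings, query):
    """Binary-search the first string >= query; return its index if it starts with query."""
    pos = bisect_left(sorted_strings, query)
    if pos < len(sorted_strings) and sorted_strings[pos].startswith(query):
        return pos
    return None


def lookup(query):
    """Return (tf, df1, df2, df3, df4) for a non-empty query string."""
    pos = first_with_prefix(KEYS, query)
    if pos is not None:
        # The first representative string that begins with the query is the
        # shortest one, so this is the class the query belongs to.
        return TABLE[pos][1:]
    if first_with_prefix(SUFFIXES, query) is not None:
        return (1, 1, 0, 0, 0)   # not in the table, present in the corpus
    return (0, 0, 0, 0, 0)       # not present in the corpus


for q in ["abc", "abca", "abcabca", "abcabcabca"]:
    print(q, *lookup(q))
```

Both searches are binary searches over sorted lists, so each query costs O(log N) string comparisons.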

[0120] 16. Computational complexity and memory usage
Compared with a method that obtains df_k for every substring without classification, or a method that classifies but examines each class independently without using the hierarchical structure of the classes, the execution time of this algorithm is practical even for a large corpus. With N the number of characters in the text and kMAX the largest k required, after O(N log N + kMAX × N) preprocessing the algorithm obtains "the number of documents in which an arbitrary character string appears k or more times" in O(log N). Memory usage is O(N) both during preprocessing and when values are subsequently obtained for an arbitrary character string. Experiments measuring the execution time of the program described later, which implements this counting algorithm, have confirmed that the counting can be performed in practice.

【０１２１】17. Application to Keyword Extraction
In the probability space of documents, df_2/df can be used as an estimate of the probability that a character string appears two or more times in a document, given that it appears in that document. Reference [6] shows that, for English, this probability can statistically distinguish the nature of words.
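Written as a formula, and assuming that tf_d(s) denotes the number of occurrences of the string s in document d (a notation introduced here only for illustration), the estimate reads as follows. The numerical example uses the values df = 288 and df_2 = 186 that Table 3 lists for the string ＣＡＩ.

```latex
\hat{P}\bigl(\mathrm{tf}_d(s) \ge 2 \,\big|\, \mathrm{tf}_d(s) \ge 1\bigr)
  \;=\; \frac{df_2(s)}{df(s)},
\qquad
\frac{186}{288} \approx 0.65 .
```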

【０１２２】The idea of treating the IDF (Inverse Document Frequency) of arbitrary character strings as a potential and segmenting text at the best division points of that potential is presented in reference [4]. Using df_2 as well improves the accuracy of keyword extraction. These keyword extraction tasks require df and df_2 to be computed for arbitrary substrings, and using this algorithm improves the operating speed of such a system.

【０１２３】Measuring df_2 thus has several applications. Among them, an application to keyword extraction without a dictionary is shown here.

【０１２４】Suppose that a corpus containing a number of document files such as the following is given.

gakkai-0000000001 電気回路演習用ＣＡＩとその改良大学等での基礎的な電気回路演習を支援するＣＡＩソフトウェアとその改良について述べている。本ＣＡＩはコンピュータが出題される回路を学習者各人のレベルに応じて自動的に作成すること、解答を数式で入力することが大きな特長である。また、誤った解答に対しては、原因の検討を容易にするメッセージが表示されるなど、効果的な個別学習が限られた設備・要員で実施可能であるよう配慮した。昨年度の学生による本ＣＡＩの使用結果のアンケート、および発表の場における質疑等を参考に、操作を容易とし、効果を上げるための改良を行った。電気回路演習

In English, this document reads: gakkai-0000000001 CAI for electric circuit exercises and its improvement. This paper describes CAI software that supports basic electric circuit exercises at universities and similar institutions, and improvements to it. Its main features are that the computer automatically generates the circuits to be posed according to each learner's level, and that answers are entered as mathematical expressions. For incorrect answers, a message that makes it easier to examine the cause is displayed, among other measures, so that effective individual learning can be carried out with limited equipment and staff. With reference to a questionnaire on last year's students' use of this CAI and to questions raised at presentations, improvements were made to simplify operation and increase effectiveness. Electric circuit exercises

【０１２５】Table 3 shows part of the results of obtaining tf, N, df, df_2, df_3, df_4 and df_5 for the substrings of the corpus.

[Table 3]
    tf      N     df   df_2   df_3   df_4   df_5             substring
 27511  40000  40000  40000  40000  40000  40000  1  20  21  回
 14384  40000   5224   3132   2094   1431    938  1  20  22  回路
     6  40000      3      2      1      0      0  1  20  23  回路演
     6  40000      3      2      1      0      0  1  20  24  回路演習
 23604  40000  40000  40000  40000  40000  40000  1  21  22  路
     7  40000      4      2      1      0      0  1  21  23  路演
     7  40000      4      2      1      0      0  1  21  24  路演習
  3325  40000   1555    702    408    245    162  1  22  23  演
   192  40000     75     41     25     18     12  1  22  24  演習
     7  40000      6      1      0      0      0  1  22  25  演習用
     4  40000      4      0      0      0      0  1  22  26  演習用Ｃ
     4  40000      4      0      0      0      0  1  22  27  演習用ＣＡ
     4  40000      4      0      0      0      0  1  22  28  演習用ＣＡＩ
  4245  40000   1308    860    640    462    333  1  23  24  習
    29  40000     23      5      1      0      0  1  23  25  習用
     5  40000      5      0      0      0      0  1  23  26  習用Ｃ
     5  40000      5      0      0      0      0  1  23  27  習用ＣＡ
     5  40000      5      0      0      0      0  1  23  28  習用ＣＡＩ
 72462  40000  40000  40000  40000  40000  40000  1  24  25  用
   139  40000    121     16      2      0      0  1  24  26  用Ｃ
    51  40000     40     10      1      0      0  1  24  27  用ＣＡ
    27  40000     21      6      0      0      0  1  24  28  用ＣＡＩ
 38995  40000  40000  40000  40000  40000  40000  1  25  26  Ｃ
  3749  40000   1852    837    499    275    142  1  25  27  ＣＡ
   790  40000    288    186    148     86     43  1  25  28  ＣＡＩ
     9  40000      9      0      0      0      0  1  25  29  ＣＡＩと
 44832  40000  40000  40000  40000  40000  40000  1  26  27  Ａ
  1899  40000    877    445    276    157     76  1  26  28  ＡＩ
    10  40000     10      0      0      0      0  1  26  29  ＡＩと
 44675  40000  40000  40000  40000  40000  40000  1  27  28  Ｉ
   180  40000    158     16      6      0      0  1  27  29  Ｉと
     6  40000      6      0      0      0      0  1  27  30  Ｉとそ
     5  40000      5      0      0      0      0  1  27  31  Ｉとその
158324  40000  40000  40000  40000  40000  40000  1  28  29  と
  2093  40000   1939    141     13      0      0  1  28  30  とそ
  1789  40000   1669    112      8      0      0  1  28  31  とその
    14  40000     13      1      0      0      0  1  28  32  とその改
     6  40000      5      1      0      0      0  1  28  33  とその改良

【０１２６】In Table 3, for example, the character string "ＣＡＩ" has df = 288, df_2 = 186 and df_3 = 148, whereas the character string "ＣＡＩと" has df = 9 but df_2 = 0 and df_3 = 0. A character string that contains a word boundary thus shows a sharp drop from df to df_2 and beyond. Using this property to detect word boundaries and split the character string of the document file gives the following result.
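The sharp drop described above can be turned into a simple test, sketched here in C. The source does not spell out the exact decision rule, so the rule below is a hypothetical one: starting from one position, the string is extended one character at a time, and the extension stops at the first string whose df_2/df ratio falls below a threshold. The names `ext_stat` and `cohesive_prefix_length` and the threshold value are assumptions made for illustration, and the sketch is not claimed to reproduce the segmentation shown in this document.

```c
#include <stddef.h>

/* Statistics for one extension of a string (hypothetical layout). */
struct ext_stat {
    int df;   /* documents containing the string at least once  */
    int df2;  /* documents containing the string at least twice */
};

/* ext[i] holds the statistics of the string of length i + 1 that starts
   at a fixed position. Returns the length of the longest extension
   reached before df2/df first drops below min_ratio. */
size_t cohesive_prefix_length(const struct ext_stat *ext, size_t n,
                              double min_ratio)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (ext[i].df <= 0)
            break;                      /* string does not occur at all */
        if ((double)ext[i].df2 / ext[i].df < min_ratio)
            break;                      /* sharp drop: likely word boundary */
        len = i + 1;
    }
    return len;
}
```

Fed with the Table 3 rows for Ｃ, ＣＡ, ＣＡＩ and ＣＡＩと and a threshold of 0.3 (an arbitrary choice), the function returns 3, that is, it stops after "ＣＡＩ"; for 演, 演習, 演習用 and 演習用Ｃ it returns 2.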

【０１２７】gakkai-000000/0001/電気回路/演習/用/ＣＡＩ/と/その/改良//大学/等/で/の基礎/的/な/電気回路/演習/を/支援/する/ＣＡＩ/ソフトウエア/と/その/改良/について述べ/ている。/
本/ＣＡＩ/は/コンピュータ/が/出/題/され/る/回路/を/学習者/各/人の/レベル/に応じて/自動/的/に/作成/する/こと、/解答/を/数式/で/入力/する/ことが/大き/な/特/長/である。/
ま/た/、/誤/った/解答/に対しては、/原因/の/検討/を容易に/する/メッセージ/が/表示/され/る/など/、/効果/的/な/個別/学習/が/限/ら/れ/た/設備/・/要員/で/実施/可能/である/よ/う/配/慮/した。/
昨年度/の/学生/による/本/ＣＡＩ/の/使用/結/果/の/アンケート/、/および/発/表/の/場における/質/疑/等/を/参/考/に/、/操作/を/容易/とし、/効果/を上げる/ため/の改良/を行った。/
電気回路//演習/

【０１２８】By selecting, from the document file segmented in this way, the strings whose distribution suggests that they are likely to be keywords, the following keywords are extracted.

演習 (exercise), ＣＡＩ, 演習 (exercise), 支援 (support), ＣＡＩ, ソフトウエア (software), ＣＡＩ, 学習者 (learner), 自動 (automatic), メッセージ (message), 表示 (display), 学習 (learning), 学生 (student), ＣＡＩ, 演習 (exercise)

【０１２９】As a result, keywords can be extracted from Japanese text with no word dictionary at all. The method is language-independent and can be applied to documents for which no dictionary has been compiled, such as ancient manuscripts, and to undeciphered documents such as encrypted texts. It can also be applied to genetic information for the analysis of DNA sequences.

【０１３０】Some other applications are described next. Measuring df_2 can be applied to information retrieval. In a bigram-based information retrieval system, it is not clear how to select terms in the way that stop lists do in word-based retrieval. One conceivable application is therefore to separate bigrams by df_2/df and select only the bigrams whose df_2/df is high. According to reference [8], df_2/df tends to pick out keywords, and using it for selection was observed to improve retrieval performance.
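The bigram selection suggested above can be sketched in C as an in-place filter. This is only an illustration under stated assumptions: the structure `bigram_stat`, the function `select_bigrams` and the cut-off value are hypothetical, and reference [8] may weight terms differently.

```c
#include <stddef.h>

/* Document-frequency statistics for one bigram (hypothetical layout). */
struct bigram_stat {
    const char *bigram;
    int df;   /* documents containing the bigram at least once  */
    int df2;  /* documents containing the bigram at least twice */
};

/* Keeps, in place and in their original order, the entries whose
   df2/df ratio is at least min_ratio. Returns the number kept. */
size_t select_bigrams(struct bigram_stat *v, size_t n, double min_ratio)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i].df > 0 && (double)v[i].df2 / v[i].df >= min_ratio)
            v[kept++] = v[i];
    }
    return kept;
}
```

Applied to the nine two-character rows of Table 3 with a cut-off of 0.4 (an arbitrary value), the filter keeps 回路, 路演, 演習, ＣＡ and ＡＩ and discards 習用, 用Ｃ, Ｉと and とそ.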

【０１３１】Application to software tools is also conceivable. As an application of suffix array classification, reference [7] reports a system that analyzes the class structure of a program, measures places that do not appear anywhere else, and automatically detects such peculiar places in the program. That system uses only tf as its statistic, but using df and df_2 as well can be expected to improve its performance further as a program measurement tool.

【０１３２】As described above, the document frequency counting technique presented in this algorithm uses the existing suffix array data structure, divides the suffix array into classes, and counts df_k by means of those classes. A method of using such classes to count df, the number of documents in which a character string appears one or more times, is known from reference [3]. To make efficient counting of df_k possible, where df_k is the number of documents in which a character string appears k or more times (k being a natural number of 2 or more), this algorithm newly defines a character string frequency with multiplicity and computes the number of documents in which the character string appears k or more times from the differences between these frequencies with multiplicity. The amount of computation is kept down by performing class detection and multiplicity determination at the same time. Furthermore, by exploiting the nested structure of the class hierarchy and adding the frequencies with multiplicity obtained for a lower class to those being obtained for its upper class, the appearance frequencies are accumulated as the computation proceeds, which further reduces the amount of computation. Together, these techniques make it possible to count document frequencies in a computation time that is practical relative to the size of the corpus.
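The difference calculation summarized above can be checked on a small example. The sketch below, in C, assumes that the frequency with multiplicity j counts the occurrences of a string that have at least j earlier occurrences in the same document, which is how the c[j] counters of the Table 4 listing appear to be incremented; the function names and the per-document counts are hypothetical and serve only as an illustration.

```c
#include <stddef.h>

#define MAX_C 5  /* largest k to be obtained, plus 1 (as in Table 4) */

/* per_doc[d] is the number of occurrences of one string in document d.
   Fills c[j] with the number of occurrences that have at least j earlier
   occurrences in the same document; c[0] is the term frequency tf. */
void multiplicity_counts(const int *per_doc, size_t ndocs, int c[MAX_C])
{
    for (int j = 0; j < MAX_C; j++) {
        c[j] = 0;
        for (size_t d = 0; d < ndocs; d++)
            if (per_doc[d] > j)
                c[j] += per_doc[d] - j;
    }
}

/* df_k as a difference of adjacent counts, valid for 1 <= k < MAX_C. */
int df_from_counts(const int c[MAX_C], int k)
{
    return c[k - 1] - c[k];
}

/* Direct definition for comparison: documents with at least k occurrences. */
int df_direct(const int *per_doc, size_t ndocs, int k)
{
    int n = 0;
    for (size_t d = 0; d < ndocs; d++)
        if (per_doc[d] >= k)
            n++;
    return n;
}
```

For a string that occurs 3, 1 and 1 times in three documents, the counts are c = {5, 2, 1, 0, 0}, so tf = 5, df_1 = 3, df_2 = 1, df_3 = 1 and df_4 = 0, and each difference agrees with the direct count.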

【０１３３】The present invention has been described above on the basis of an embodiment. Those skilled in the art will understand that the embodiment is illustrative, that various modifications are possible in the combination of its constituent elements and processing steps, and that such modifications also fall within the scope of the present invention. Such modifications are described below.

【０１３４】In the embodiment above, Japanese was used as the example for running the counting algorithm, but the algorithm is language-independent. The embodiment also measured the appearance frequency of character strings using documents as the example, but the counting algorithm is not limited to documents; it can also be used, for instance, to measure the appearance frequency of gene sequences in genetic information. In particular, because the present invention can cut out meaningful words from appearance-frequency information without using a dictionary, applying it to the decoding of genetic information makes it possible to extract gene sequences that have specific functions.

【０１３５】The references cited in this specification are listed below.
[1] Udi Manber and Gene Myers, Suffix arrays: A new method for on-line string searches, In the First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319-327, 1990.
[2] Hideo Ito (伊藤秀夫), An efficient method for constructing suffix arrays (in Japanese), Transactions of Information Processing Society of Japan, Vol.41, No.SIG1(TOD5), pp.31-39 (2000).
[3] Mikio Yamamoto and Kenneth W. Church, Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, Computational Linguistics, Vol.27:1, pp.1-30, MIT Press.
[4] Tomohiro Ozawa, Mikio Yamamoto, Kyoji Umemura and Kenneth W. Church, Japanese word segmentation using similarity measure for IR, In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 1999, pp.89-96.
[5] Tomohiro Ozawa, Mikio Yamamoto, Eiko Yamamoto and Kyoji Umemura (小澤智裕、山本幹雄、山本英子、梅村恭司), Word segmentation of search request sentences using a similarity measure for information retrieval (in Japanese), Proceedings of the Annual Meeting of the Association for Natural Language Processing, March 1999, pp.305-308.
[6] Kenneth W. Church, Empirical Estimates of Adaptation: The chance of Two Noriega's is closer to p/2 than p^2, Coling 2000, pp.173-179.
[7] Hiroyuki Yoshikawa, Toshiro Kijima and Kyoji Umemura (吉川裕之、貴島寿郎、梅村恭司), Detection of defects in programs by applying an n-gram analysis method (in Japanese), Transactions of Information Processing Society of Japan, Vol.9, No.12, pp.3294-3303.
[8] Kyoji Umemura and Kenneth W. Church, Empirical Term Weighting and Expansion Frequency, Empirical Methods in Natural Language Processing and Very Large Corpora, pp.117-123.
[9] Yoshiyuki Takeda and Kyoji Umemura (武田善行、梅村恭司), Document frequency analysis for realizing keyword extraction (in Japanese), Mathematical Linguistics (計量国語学), Vol.23, No.2, pp.65-90.

【０１３６】Table 4 shows the listing of a program that implements the counting algorithm according to the embodiment described above.

[Table 4]
/* Maximum value of k for the df_k to be obtained, plus 1 */
#define MAX_C 5

/* Document separator character */
#define SEPARATOR '\n'

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
/* Type of the comparison function to be passed to qsort */
typedef int (*sortfn)(const void *, const void*);

#define MESSAGE_FILE stdout

static char * text;     /* source data of the text */

static int size;        /* number of bytes in the text */

static int id;          /* number of the document being read */

static int id_max;      /* total number of documents */

static int common_max;  /* maximum number of characters shared by adjacent suffixes */


struct suffix_struct
{ int position;         /* start index in text */
  int common;           /* length of the part shared with the next suffix */
  int id;               /* number of the corresponding document */
  int previous_suffix;  /* nearest preceding suffix that belongs to the same document */
};

static struct suffix_struct * suffix; /* suffix array */

static int * last_suffixes;  /* array used to build previous_suffix */

static FILE * text_file;     /* data file */

struct pending_struct
{ int start_suffix;
  int length;
  int c[MAX_C];
};


static struct pending_struct * pendings;
static int level;

static int pending_level(int suffix)
{
  int min, max, mid;
  min = 0; max = level;
  while(min + 1 < max) {
    mid = (min + max) / 2;
    if(pendings[mid].start_suffix <= suffix){
      min = mid;
    } else {
      max = mid;
    }
  }
  if(pendings[max].start_suffix <= suffix) return max;
  if(pendings[min].start_suffix <= suffix) return min;
  fprintf(stderr, "internal error(pending_level)\n");
  exit(2);
  return -1;
}


#ifdef DEBUG
/* Display the state of the specified suffix */
static void debug_suffix(int i)
{
  int j;
  fprintf(MESSAGE_FILE,
          "%5d %3d %5d %2d:",
          i,
          suffix[i].id,
          suffix[i].previous_suffix,
          suffix[i].common);
  for(j=0;text[suffix[i].position+j]!='\n';j++) {
    fputc(text[suffix[i].position+j], MESSAGE_FILE);
  }
  fputc('\n', MESSAGE_FILE);
  fflush(MESSAGE_FILE);
}

/* Display the state of the suffix array */
static void debug_output()
{ int i;
  fprintf(MESSAGE_FILE, "size = %d\n", size);
  fprintf(MESSAGE_FILE, "id_max = %d\n", id_max);
  fprintf(MESSAGE_FILE, "common_max = %d\n", common_max);
  for(i=0;i<size;i++) {
    debug_suffix(i);
  }
}

/* Display the state of the pending classes */
static void debug_pending()
{
  int i; int j;
  for(i = 0; i<=level; i++) {
    fprintf(MESSAGE_FILE, "<S%d L%d",
            pendings[i].start_suffix,
            pendings[i].length);
    for(j=0;j<MAX_C;j++) {
      fprintf(MESSAGE_FILE, " %d", pendings[i].c[j]);
    }
    fprintf(MESSAGE_FILE, ">");
  }
  fprintf(MESSAGE_FILE, "\n");
}
#endif

static void error_alloc(void)
{
  fprintf(MESSAGE_FILE,
          "text %p, suffix %p, last_suffixes %p, classes %p \n",
          (void *)text,
          (void *)suffix,
          (void *)last_suffixes,
          (void *)pendings);
  exit(1);
}

/* Function that decides the order of suffixes.
   It is a lexicographic comparison of strings,
   except that it stops at a document boundary.
*/
static int my_strcmp(char *x1, char *y1)
{
  register unsigned char * x;
  register unsigned char * y;
  x = (unsigned char *) x1; y= (unsigned char *)y1;
  while(*x && *y) {
    if(*x < *y) return -1;
    if(*x > *y) return 1;
    if(*x == '\n') break;
    if(*y == '\n') break;
    x++; y++;
  }
  return 0;
}

static int string_sub(char *s1, char *s2)
{
  while(*s2) {
    if(*s1 != *s2) { return 0; }
    s1++;
    s2++;
  }
  return *s1;
}

/* Function that defines the order of the strings indicated by suffixes.
   Any ordering may be used, provided that when the leading parts
   are the same the order is decided by the next character.
*/
static int suffix_order(struct suffix_struct * x, struct suffix_struct * y)
{
  return my_strcmp(text + x->position, text + y->position);
}

/* Function that returns the number of characters in common from the beginning. */

static int common_length(char * x, char * y)
{int i;
  i = 0;
  while((*x == *y) && (*x) && (*x !='\n')) { i++; x++; y++;};
  return i;
}

struct class_struct
{ struct class_struct * next;
  int first;
  int last;
  int length;
  int c[MAX_C];
};

static struct class_struct * class_list = 0;
static int class_count = 0;

static struct class_struct * class_table;

static void register_class(int first, int last, int length, int c[])
{ int i;
  struct class_struct * p;
#ifdef DEBUG
  printf(" ---> Class[%d, %d] L=%d", first, last, length);
  for(i=0;i<MAX_C;i++) {
    printf(" c[%d]=%d", i, c[i]);
  }
  printf(" S=\"");
  for(i=0;i<length;i++) { putchar(text[suffix[first].position + i]); }
  printf("\"\n");
#endif
  p = (struct class_struct *) malloc(sizeof(struct class_struct));
  if(p == 0) {
    fprintf(stderr, "register_class\n");
    exit(2);
  }
  p->first = first;
  p->last = last;
  p->length = length;
  for(i=0;i<MAX_C;i++) {
    p->c[i] = c[i];
  }
  p->next = class_list;
  class_list = p;
  class_count++;
}

int class_order(struct class_struct * x, struct class_struct * y)
{
  if(x->first < y->first) return -1;
  if(x->first > y->first) return 1;
  if(x->length < y->length) return -1;
  if(x->length > y->length) return 1;
  return 0;
}

/* Display the location, length, characters and counts of a class */
#ifdef DEBUG
static void output_class(int first, int last, int length, int c[])
{
  int i; int j;
  printf("Class[%4d,%4d] L=%d tf=%d",
         first,
         last,
         length,
         c[0]);
  for(j=1;j<MAX_C;j++) {
    printf(" df%d=%d", j, c[j-1]-c[j]);
  }
  printf(" S=\"");
  for(i=0;i<length;i++) { putchar(text[suffix[first].position + i]); }
  printf("\"\n");
}
#endif


static void make_class_table(void)
{ struct class_struct * p;
  int i;
  class_table = (struct class_struct *)
    malloc (sizeof(struct class_struct) * class_count) ;
  if(class_table == 0) {
    fprintf(stderr, "make_class_table\n");
    exit(1);
  }
  p = class_list;
  for(i=0;i<class_count;i++) {
    class_table[i] = *p;
    p = p->next;
  };
  qsort(class_table, class_count, sizeof(struct class_struct), (sortfn)class_order);
}

static void clear_class_list(void)
{ int i;
  struct class_struct * p;
  struct class_struct * q;
  p = class_list;
  for(i=0;i<class_count;i++) {
    q = p;
    p = q->next;
    q->next = 0;
    free(q);
  }
  class_list = 0;
}


void df_setup(char * file)
{ char * p; int i; int j; int ch; int previous;
#ifdef DEBUG
  int ii;
#endif
  text_file = fopen(file, "r");
  if(text_file == 0) {
    fprintf(stderr, "File %s not found\n", file);
    exit(1);
  }

  /* Count the total number of characters in the data */
  size = 0;
  while(EOF != (ch = fgetc(text_file))) {
    size++;
  };

  /* Allocate the required data areas from the total number of characters */
  text = (char *) malloc (sizeof(char) * (size + 1));
  if(text == 0) error_alloc();
  suffix = (struct suffix_struct *)
    malloc( sizeof(struct suffix_struct) * (size + 1));
  if(suffix == 0) error_alloc();
  fseek(text_file, 0, SEEK_SET);
  p = text;
  id = 0;

  /* While reading the data into memory,
     mark the document boundaries */
  id_max = 0;
  for(i=0;i<size;i++) {
    ch = fgetc(text_file);
    text[i] = ch;
    suffix[i].position = i;
    suffix[i].id = id;
    id_max = id + 1;
    if(ch == '\n') { id ++; }
  };
  text[size]=0;


  /* Routine that builds the suffixes */
  /* The following is one example. */
  common_max = 0;
  for(i=0;i<size;i++) { suffix[i].position = i; }
  /* The sort is expected to take most of the computation time */
  qsort(suffix, size, sizeof(struct suffix_struct), (sortfn) suffix_order);
  suffix[size].position = size;
  for(i=0;i<size;i++) {
    int c;
    c = common_length(text+suffix[i].position, text+suffix[i+1].position);
    suffix[i].common = c;
    if(c > common_max) common_max = c;
  }
  suffix[size].common = 0;


  /* Build the data structure used for the multiplicity computation */
  last_suffixes = (int *) malloc( id_max * sizeof(int));
  if(last_suffixes == 0) error_alloc();
  for(i=0;i<id_max;i++) { last_suffixes[i] = -1; }
  for(i=0;i<size;i++) {
    suffix[i].previous_suffix = last_suffixes[suffix[i].id];
    last_suffixes[suffix[i].id] = i;
  }

#ifdef DEBUG
  debug_output();
#endif

  /* Extract the class structure */

  pendings = (struct pending_struct *)
    malloc(sizeof(struct pending_struct) * (common_max+1));
  if(pendings == 0) error_alloc();
  level = 0;
  pendings[level].length = 0;
  pendings[level].start_suffix = 0;
  for(j=0;j<MAX_C;j++) {
    pendings[level].c[j] = 0;
  }
  for(i=0;i<size;i++) {
#ifdef DEBUG
    debug_pending();
    debug_suffix(i);
#endif
    /* Pre-processing: check whether a new class begins here */
    if(suffix[i].common > pendings[level].length) {
      /* Create a new class starting at the current position. */
      /* A new class starts with its counts at 0 */
      level++;
      pendings[level].start_suffix = i;
      pendings[level].length = suffix[i].common;
      for(j=0;j<MAX_C;j++) { pendings[level].c[j] = 0; }
    }

    /* Counting: find the appropriate class for this occurrence of the string and count it */

    previous = suffix[i].previous_suffix;
    pendings[level].c[0]++;
#ifdef DEBUG
    printf(" -> Incrementing c[%d] of Class*[%d,%d] : (S%d,L%d) S=\"",
           0,
           i,
           i,
           pendings[level].start_suffix,
           pendings[level].length
           );
    for(ii=0;ii<pendings[level].length;ii++) {
      putchar(text[suffix[pendings[level].start_suffix].position + ii]);
    }
    printf("\"\n");
#endif
    for(j=1;j<MAX_C;j++) {
      int plev;
      if(previous < 0) break;
      plev = pending_level(previous);
      pendings[plev].c[j]++;
#ifdef DEBUG
      printf(" -> Incrementing c[%d] of Class*[%d,%d] : (S%d,L%d) S=\"",
             j,
             previous,
             i,
             pendings[plev].start_suffix,
             pendings[plev].length
             );
      for(ii=0;ii<pendings[plev].length;ii++) {
        putchar(text[suffix[pendings[plev].start_suffix].position + ii]);
      }
      printf("\"\n");
#endif
      previous = suffix[previous].previous_suffix;
    }

    /* Post-processing: detect the end of classes */
    while(suffix[i].common < pendings[level].length) {
      int common = suffix[i].common;
      /* The end of a class has been found */
      register_class(
                     pendings[level].start_suffix, /* start suffix */
                     i,                            /* final suffix */
                     pendings[level].length,       /* maximum class length */
                     pendings[level].c);
      if(level <= 0) { fprintf(stderr, "internal level\n"); exit(2); }
      if( common > pendings[level-1].length) {
        /* A class that had not been registered as pending exists.
           Start processing the upper class that begins at the same
           place as the class whose computation has just finished */
        pendings[level].length = common;
        /* The counts are handed over to the upper class; however,
           since the upper class is at the same place as the class
           currently being computed, no actual operation is needed. */

      }
      if( common <= pendings[level-1].length) {
        /* The upper class starts earlier than here and is already pending */
        for(j=0;j<MAX_C;j++) {
          pendings[level-1].c[j] += pendings[level].c[j];
        }
        level --;
      }
      /* After the termination processing, check again whether a class has ended. */
    }
  }
#ifdef DEBUG
  debug_pending();
#endif
  make_class_table();
  clear_class_list();
}


static int df_class_string_length(char *s)
{ int i;
  i = 0;
  while(*s++) i++;
  return i;
}


static int df_class_compare_string(char *x, char *s)
{
  if(string_sub(x, s) != 0) return 0;
  return( my_strcmp(x, s) );
}

static int df_class_compare(int m, char *s)
{
  char *x; int r;
  x = text + suffix[class_table[m].first].position;
  r = df_class_compare_string(x, s);
  if(r != 0) return r;
  if (class_table[m].length > df_class_string_length(s)) return 1;
  if (class_table[m].length < df_class_string_length(s)) return -1;
  return 0;
}



static int df_class_binary(char * s)
{
  int min; int max; int mid; int cmp;
  /* Search for s among the classes with cf >= 2 */
  min = 0;
  max = class_count-1;
  while(min+1<max) {
    mid = (max + min) / 2;
    cmp = df_class_compare(mid, s);
    if(cmp < 0) { min = mid; } else {max =mid; }
  };
  if((string_sub(text+suffix[class_table[min].first].position, s) != 0) &&
     (string_sub(text+suffix[class_table[min].last].position, s) != 0)
     ) return min;
  if((string_sub(text+suffix[class_table[max].first].position, s) != 0) &&
     (string_sub(text+suffix[class_table[max].last].position, s) != 0)
     ) return max;
  /* Search for whether cf = 1 */
  min = 0;
  max = size-1;
  while(min+1<max) {
    mid = (max + min) / 2;
    cmp = df_class_compare_string(text+suffix[mid].position, s);
    if(cmp < 0) { min = mid; } else { max = mid; }
  }
  if(string_sub(text+suffix[min].position, s) != 0) return -1;
  if(string_sub(text+suffix[max].position, s) != 0) return -1;
  return -2;
}

static int df_class(char * s)
{
  int c;
  c = df_class_binary(s);
#ifdef DOCUMENTATION
  if(c != df_class_simple(s)) {
    fprintf(stderr, "%d %d %s\n", c, df_class_simple(s), s);
  }
#endif
  return c;
}


int cf(char *s)
{
  int c;
  c = df_class(s);
  if(c < -1) return 0;
  if(c < 0) return 1;
  return class_table[c].c[0];
}


static int df1(char *s)
{
  int c;
  c = df_class(s);
  if(c < -1) return 0;
  if(c < 0) return 1;
  return class_table[c].c[0] - class_table[c].c[1];
}

int dfn(int k, char *s)
{ int c;
  if(k>= MAX_C) {
    fprintf(stderr, "%d: dfn K too large\n", k);
    return 0;
  }
  if(k==1) return df1(s);
  c = df_class(s);
  if(c< 0) return 0;
  return class_table[c].c[k-1] - class_table[c].c[k];
}

char line[1024];

int main(int argc, char ** argv)
{ int i;
  if(argc != 2) {
    fprintf(stderr, "Usage %s filename > output\n", argv[0]);
    exit(1);
  }
  df_setup(argv[1]);
#ifdef DEBUG
  fprintf(stdout, "total=%d\n", class_count);
  for(i=0;i<class_count;i++) {
    output_class( class_table[i].first,
                  class_table[i].last,
                  class_table[i].length,
                  class_table[i].c
                  );
  }
#endif
  while(fgets(line, sizeof(line), stdin)) {
    i = strlen(line);
    if(i > 0 && line[i-1] == '\n') line[i-1] = 0;
    fprintf(stdout, "%d ", cf(line));
    for(i=1;i<MAX_C;i++) {
      fprintf(stdout, "%d ", dfn(i, line));
    }
    fprintf(stdout, "%s\n", line);
  }
  return 0;
}
160 * / 161 static int suffix_order (struct suffix_struct * x, struct suffix_st ruct * y) 162 { 163 return my_strcmp (text + x-> position, text + y-> position); 164} 165 166 / * Calculate the number of common characters from the beginning. Function * / 167 168 static int common_length (char * x, char * y) 169 {int i; 170 i = 0; 171 while ((* x == * y) && (* x) && (* x! = '\ N')) {i ++; x ++; y ++;}; 172 return i; 173} 174 175 struct class_struct 176 {struct class_struct * next; 177 int first; 178 int last; 179 int length; 180 int c [MAX_C]; 181}; 182 183 static struct class_struct * class_list = 0; 184 static int class_count = 0; 185 186 static struct class_struct * class_table; 187 188 static void register_class (int first, int last, int length, int c [ ]) 189 {int i; 190 struct class_struct * p; 191 #ifdef DEBUG 192 printf ("---> Class [% d,% d] L =% d", first, last, length); 193 for (i = 0; i <MAX_C; i ++) { 194 printf ("c [% d] =% d", i, c [i]); 195} 196 printf ("S = \" "); 197 for (i = 0; i <length; i ++) {putchar (text [suffix [first] .position + i] );} 198 printf ("\" \ n "); 199 #endif 200 p = (struct class_struct *) malloc (sizeof (struct class_struct)); 201 if (p == 0) { 202 fprintf (stderr, "register_class \ n"); 203 exit (2); 204} 205 p-> first = first; 206 p-> last = last; 207 p-> length = length; 208 for (i = 0; i <MAX_C; i ++) { 209 p-> c [i] = c [i]; 210} 211 p-> next = class_list; 212 class_list = p; 213 class_count ++; 214} 215 216 int class_order (struct class_struct * x, struct class_struct * y) 217 { 218 if (x-> first <y-> first) return -1; 219 if (x-> first> y-> first) return 1; 220 if (x-> length <y-> length) return -1; 221 if (x-> length> y-> length) return 1; 222 return 0; 223} 224 225 / * Show location, length, characters, counts for class * / 226 #ifdef DEBUG 227 static void output_class (int first, int last, int length, int c [] ) 228 { 229 int i; int j; 230 printf ("Class [% 4d,% 4d] L =% d tf =% d", 231 first, 232 last, 233 length, 234 
c [0]); 235 for (j = 1; j <MAX_C; j ++) { 236 printf ("df% d =% d", j, c [j-1] -c [j]); 237} 238 printf ("S = \" "); 239 for (i = 0; i <length; i ++) {putchar (text [suffix [first] .position + i] );} 240 printf ("\" \ n "); 241} 242 #endif 243 244 245 static void make_class_table (void) 246 {struct class_struct * p; 247 int i; 248 class_table = (struct class_struct *) 249 malloc (sizeof (struct class_struct) * class_count) ; 250 if (class_table == 0) { 251 fprintf (stderr, "make_class_table \ n"); 252 exit (1); 253} 254 p = class_list; 255 for (i = 0; i <class_count; i ++) { 256 class_table [i] = * p; 257 p = p-> next; 258}; 259 qsort (class_table, class_count, sizeof (struct class_struct), (so rtfn) class_order); 260} 261 262 static void clear_class_list (void) 263 {int i; 264 struct class_struct * p; 265 struct class_struct * q; 266 p = class_list; 267 for (i = 0; i <class_count; i ++) { 268 q = p; 269 p = q-> next; 270 q-> next = 0; 271 free (q); 272} 273 class_list = 0; 274} 275 276 277 void df_setup (char * file) 278 {char * p; int i; int j; int ch; int previous; 279 #ifdef DEBUG 280 int ii; 281 #endif 282 text_file = fopen (file, "r"); 283 if (text_file == 0) { 284 fprintf (stderr, "File% s not found \ n", file); 285 exit (1); 286} 287 288 / * Calculate the total number of data characters * / 289 size = 0; 290 while (EOF! 
= (Ch = fgetc (text_file))) { 291 size ++; 292}; 293 294 / * Generate the required data area from the total number of characters * / 295 text = (char *) malloc (sizeof (char) * (size + 1)); 296 if (text == 0) error_alloc (); 297 suffix = (struct suffix_struct *) 298 malloc (sizeof (struct suffix_struct) * (size + 1)); 299 if (suffix == 0) error_alloc (); 300 fseek (text_file, 0, SEEK_SET); 301 p = text; 302 id = 0; 303 304 / * Delimit document at the same time as reading into memory 305 Put on * / 306 id_max = 0; 307 for (i = 0; i <size; i ++) { 308 ch = fgetc (text_file); 309 text [i] = ch; 310 suffix [i] .position = i; 311 suffix [i] .id = id; 312 id_max = id + 1; 313 if (ch == '\ n') {id ++;} 314}; 315 text [size] = 0; 316 317 318 / * Routine to create Suffix * / 319 / * The following is an example. * / 320 common_max = 0; 321 for (i = 0; i <size; i ++) {suffix [i] .position = i;} 322 / * sort processing is expected to take up most of the computation time * / 323 qsort (suffix, size, sizeof (struct suffix_struct), (sortfn) suffi x_order); 324 suffix [size] .position = size; 325 for (i = 0; i <size; i ++) { 326 int c; 327 c = common_length (text + suffix [i] .position, text + suffix [i + 1] .po sition); 328 suffix [i] .common = c; 329 if (c> common_max) common_max = c; 330} 331 suffix [size] .common = 0; 332 333 334 / * Generate data structure for duplicate calculations * / 335 last_suffixes = (int *) malloc (id_max * sizeof (int)); 336 if (last_suffixes == 0) error_alloc (); 337 for (i = 0; i <id_max; i ++) {last_suffixes [i] = -1;} 338 for (i = 0; i <size; i ++) { 339 suffix [i] .previous_suffix = last_suffixes [suffix [i] .id]; 340 last_suffixes [suffix [i] .id] = i; 341} 342 343 #ifdef DEBUG 344 debug_output (); 345 #endif 346 347 / * Extract class structure * / 348 349 pendings = (struct pending_struct *) 350 malloc (sizeof (struct pending_struct) * (common_max + 1)); 351 if (pendings == 0) error_alloc (); 352 level = 0; 353 pendings [level] .length = 0; 354 
pendings [level] .start_suffix = 0; 355 for (j = 0; j <MAX_C; j ++) { 356 pendings [level] .c [j] = 0; 357} 358 for (i = 0; i <size; i ++) { 359 #ifdef DEBUG 360 debug_pending (); 361 debug_suffix (i); 362 #endif 363 / * pre-processing, check if new class starts * / 364 if (suffix [i] .common> pendings [level] .length) { 365 / * Generate new class from current location. * / 366 / * New class starts with count 0 * / 367 level ++; 368 pendings [level] .start_suffix = i; 369 pendings [level] .length = suffix [i] .common; 370 for (j = 0; j <MAX_C; j ++) {pendings [level] .c [j] = 0;} 371} 372 373 / * Counting, searching and counting the appropriate class for the occurrence of a string * / 374 375 previous = suffix [i] .previous_suffix; 376 pendings [level] .c [0] ++; 377 #ifdef DEBUG 378 printf ("-> Incrementing c [% d] of Class * [% d,% d]: (S% d, L% d) S = \ "", 379 0, 380 i, 381 i, 382 pendings [level] .start_suffix, 383 pendings [level] .length 384); 385 for (ii = 0; ii <pendings [level] .length; ii ++) { 386 putchar (text [suffix [pendings [level] .start_suffix] .position + ii]) ; 387} 388 printf ("\" \ n "); 389 #endif 390 for (j = 1; j <MAX_C; j ++) { 391 int plev; 392 if (previous <0) break; 393 plev = pending_level (previous); 394 pendings [plev] .c [j] ++; 395 #ifdef DEBUG 396 printf ("-> Incrementing c [% d] of Class * [% d,% d]: (S% d, L% d) S = \ "", 397 j, 398 previous, 399 i, 400 pendings [plev] .start_suffix, 401 pendings [plev] .length 402); 403 for (ii = 0; ii <pendings [plev] .length; ii ++) { 404 putchar (text [suffix [pendings [plev] .start_suffix] .position + ii]); 405} 406 printf ("\" \ n "); 407 #endif 408 previous = suffix [previous] .previous_suffix; 409} 410 411 / * Post-processing: end of class detection * / 412 while (suffix [i] .common <pendings [level] .length) { 413 int common = suffix [i] .common; 414 / * When the end of class is found * / 415 register_class ( 416 pendings [level] .start_suffix, / * start suffix * / 417 i, / * final 
suffix * / 418 pendings [level] .length, / * maximum class length * / 419 pendings [level] .c); 420 if (level <= 0) {fprintf (stderr, "internal level \ n"); exit (2 );} 421 if (common> pendings [level-1] .length) { 422 / * There was a class that was not registered as being calculated. 423 The upper class starting from the same place as the class for which the calculation is completed 424 class processing started * / 425 pendings [level] .length = common; 426 / * Inherit count to higher class, but 427 The upper class is currently in the same location as the class being calculated, so 428 No actual operation is required. * / 429 430} 431 if (common <= pendings [level-1] .length) { 432 / * The starting place of the higher class is pending before now thing */ 433 for (j = 0; j <MAX_C; j ++) { 434 pendings [level-1] .c [j] + = pendings [level] .c [j]; 435} 436 level-; 437} 438 / * After termination processing, check whether it has terminated again. * / 439} 440} 441 #ifdef DEBUG 442 debug_pending (); 443 #endif 444 make_class_table (); 445 clear_class_list (); 446} 447 448 449 static int df_class_string_length (char * s) 450 {int i; 451 i = 0; 452 while (* s ++) i ++; 453 return i; 454} 455 456 457 static int df_class_compare_string (char * x, char * s) 458 { 459 if (string_sub (x, s)! = 0) return 0; 460 return (my_strcmp (x, s)); 461} 462 463 static int df_class_compare (int m, char * s) 464 { 465 char * x; int r; 466 x = text + suffix [class_table [m] .first] .position; 467 r = df_class_compare_string (x, s); 468 if (r! 
= 0) return r; 469 if (class_table [m] .length> df_class_string_length (s)) return 1; 470 if (class_table [m] .length <df_class_string_length (s)) return -1 ; 471 return 0; 472} 473 474 475 476 static int df_class_binary (char * s) 477 { 478 int min; int max; int mid; int cmp; 479 / * Search for a class in cf> = 2 * / 480 min = 0; 481 max = class_count-1; 482 while (min + 1 <max) { 483 mid = (max + min) / 2; 484 cmp = df_class_compare (mid, s); 485 if (cmp <0) {min = mid;} else {max = mid;} 486}; 487 if ((string_sub (text + suffix [class_table [min] .first] .position, s) ! = 0) && 488 (string_sub (text + suffix [class_table [min] .last] .position, s)! = 0) 489) return min; 490 if ((string_sub (text + suffix [class_table [max] .first] .position, s) ! = 0) && 491 (string_sub (text + suffix [class_table [max] .last] .position, s)! = 0) 492) return max; 493 / * Search whether cf = 1 * / 494 min = 0; 495 max = size-1; 496 while (min + 1 <max) { 497 mid = (max + min) / 2; 498 cmp = df_class_compare_string (text + suffix [mid] .position, s); 499 if (cmp <0) {min = mid;} else {max = mid;} 500} 501 if (string_sub (text + suffix [min] .position, s)! = 0) return -1; 502 if (string_sub (text + suffix [max] .position, s)! = 0) return -1; 503 return -2; 504} 505 506 static int df_class (char * s) 507 { 508 int c; 509 c = df_class_binary (s); 510 #ifdef DOCUMENTATION 511 if (c! 
= Df_class_simple (s)) { 512 fprintf (stderr, "% d% d% s \ n", c, df_class_simple (s), s); 513} 514 #endif 515 return c; 516} 517 518 519 int cf (char * s) 520 { 521 int c; 522 c = df_class (s); 523 if (c <-1) return 0; 524 if (c <0) return 1; 525 return class_table [c] .c [0]; 526} 527 528 529 static int df1 (char * s) 530 { 531 int c; 532 c = df_class (s); 533 if (c <-1) return 0; 534 if (c <0) return 1; 535 return class_table [c] .c [0]-class_table [c] .c [1]; 536} 537 538 int dfn (int k, char * s) 539 {int c; 540 if (k> = MAX_C) { 541 fprintf (stderr, "% d: dfn K too large \ n", k); 542} 543 if (k == 1) return df1 (s); 544 c = df_class (s); 545 if (c <0) return 0; 546 return class_table [c] .c [k-1]-class_table [c] .c [k]; 547} 548 549 char line [1024]; 550 551 int main (int argc, char ** argv) 552 {int i; 553 if (argc! = 2) { 554 fprintf (stderr, "Usage% s filename> output \ n", argv [0]); 555 exit (1); 556} 557 df_setup (argv [1]); 558 #ifdef DEBUG 559 fprintf (stdout, "total =% d \ n", class_count); 560 for (i = 0; i <class_count; i ++) { 561 output_class (class_table [i] .first, 562 class_table [i] .last, 563 class_table [i] .length, 564 class_table [i] .c 565); 566} 567 #endif 568 while (fgets (line, sizeof (line), stdin)) { 569 i = strlen (line); 570 line [i-1] = 0; 571 fprintf (stdout, "% d", cf (line)); 572 for (i = 1; i <MAX_C; i ++) { 573 fprintf (stdout, "% d", dfn (i, line)); 574} 575 fprintf (stdout, "% s \ n", line); 576} 577 return 0; 578}

[0137]

[Effects of the Invention] According to the present invention, the appearance frequency of a character string can be counted efficiently.

[Brief description of drawings]

FIG. 1 is a diagram explaining the data structure of the Suffix Array used in the counting method according to the embodiment.

FIG. 2 is a diagram showing the common value for each suffix of the Suffix Array.

FIG. 3 is a diagram illustrating the sections of the Suffix Array that form classes.

FIG. 4 is a diagram showing the hierarchical structure of the classes of the Suffix Array.

FIG. 5 is a diagram explaining the substrings belonging to the classes of the Suffix Array in FIG. 4.

FIG. 6(a) is a diagram showing an example of a corpus including six documents, and FIG. 6(b) is a diagram showing a calculation example of the character string frequency and the document frequency for character strings included in the corpus of FIG. 6(a).

FIG. 7(a) is a diagram showing one of the documents included in the corpus, and FIG. 7(b) is a diagram showing the degree of duplication of each suffix of the Suffix Array.

FIG. 8(a) is a diagram showing a corpus including three documents, and FIG. 8(b) is a diagram showing a calculation example of the character string frequency with degree of duplication and the document frequency for character strings included in the corpus of FIG. 8(a).

FIGS. 9(a), 9(b), and 9(c) are explanatory diagrams of the previous links.

FIG. 10 is a configuration diagram of the counting device according to the embodiment.

FIG. 11 is a flowchart of the counting procedure according to the embodiment.

[Explanation of symbols]

10 count processing unit, 12 Suffix Array generation unit, 14 character string frequency counting unit, 16 document frequency calculation unit, 18 inquiry unit, 20 search unit, 22 keyword extraction unit, 24 document database, 26 document file, 28 Suffix Array file, 30 character string appearance frequency data, 32 keyword index file, 100 counting device.

Claims


1. A counting device comprising a count processing unit that, for a set of documents, counts the number of occurrences of a character string that are duplicated k or more times (k being a natural number of 2 or more) and the number of occurrences of the character string that are duplicated k + 1 or more times, and obtains the difference between the two counts, thereby acquiring the number of documents in which the character string is included k or more times.
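The relation stated in claim 1 can be checked on a small example. The following C sketch is not the count processing unit of this specification; it is a brute-force illustration that assumes each document is a NUL-terminated string and that overlapping occurrences are counted separately. The names count_occurrences, cf_k and df_k are hypothetical and chosen only for this example.

```c
#include <stddef.h>
#include <string.h>

/* Count the occurrences (overlapping ones included) of x in one document. */
static int count_occurrences(const char *doc, const char *x)
{
    int n = 0;
    if (*x == '\0') return 0;
    for (const char *p = strstr(doc, x); p != NULL; p = strstr(p + 1, x)) {
        n++;
    }
    return n;
}

/* cf_k: number of occurrences of x that are at least the k-th occurrence
   within their own document. A document holding n occurrences contributes
   n - k + 1 such occurrences when n >= k, and none otherwise. */
int cf_k(const char *const docs[], int ndocs, const char *x, int k)
{
    int total = 0;
    for (int d = 0; d < ndocs; d++) {
        int n = count_occurrences(docs[d], x);
        if (n >= k) total += n - k + 1;
    }
    return total;
}

/* df_k: number of documents containing x k or more times,
   obtained as the difference cf_k - cf_{k+1}. */
int df_k(const char *const docs[], int ndocs, const char *x, int k)
{
    return cf_k(docs, ndocs, x, k) - cf_k(docs, ndocs, x, k + 1);
}
```

For the documents "abab", "ab", "ba" and "ababab" and the string "ab", the per-document counts are 2, 1, 0 and 3, so cf_1 = 6, cf_2 = 3 and cf_3 = 1, which gives df_1 = 3, df_2 = 2 and df_3 = 1.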

2. The counting device according to claim 1, wherein the count processing unit counts the number of appearances for each class while classifying the set of partial character strings included in the documents into classes such that partial character strings belonging to the same class appear the same number of times in the same document.

3. The counting device according to claim 2, wherein the count processing unit, utilizing the hierarchical structure of the classes, counts the number of appearances by adding the number of appearances counted in a lower class to the number of appearances counted in an upper class.
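As an illustration of how counts from a lower class can be folded into an upper class, the following C sketch walks an array of common-prefix lengths with a stack of pending classes, in the manner of the pendings array in Table 4, but tracks only a single count per class. It assumes common[i] is the length of the prefix shared by entries i and i + 1 of the suffix array, with common[n - 1] equal to 0. The types class_info and pending and the function enumerate_classes are hypothetical names introduced for this sketch.

```c
#include <stdlib.h>

struct class_info { int first, last, length, count; };
struct pending    { int start, length, count; };

/* Reports every class (a run of adjacent suffixes sharing a prefix of the
   given length) into out[], which must have room for n entries.
   Returns the number of classes found, or -1 if allocation fails. */
int enumerate_classes(const int *common, int n, struct class_info *out)
{
    struct pending *stack = malloc(sizeof *stack * (size_t)(n + 1));
    int level = 0, found = 0;
    if (stack == NULL) return -1;
    stack[0].start = 0; stack[0].length = 0; stack[0].count = 0;
    for (int i = 0; i < n; i++) {
        /* A longer common prefix opens a new, lower class at suffix i. */
        if (common[i] > stack[level].length) {
            level++;
            stack[level].start = i;
            stack[level].length = common[i];
            stack[level].count = 0;
        }
        stack[level].count++;
        /* A shorter common prefix closes every class that ends at suffix i. */
        while (common[i] < stack[level].length) {
            out[found].first = stack[level].start;
            out[found].last = i;
            out[found].length = stack[level].length;
            out[found].count = stack[level].count;
            found++;
            if (common[i] > stack[level - 1].length) {
                /* The upper class starts at the same suffix, so the entry
                   is reused and the count is inherited as it stands. */
                stack[level].length = common[i];
            } else {
                /* The upper class is already pending: add the lower
                   class's count to it and pop the lower class. */
                stack[level - 1].count += stack[level].count;
                level--;
            }
        }
    }
    free(stack);
    return found;
}
```

Because each class passes its total upward when it closes, every suffix is counted once at the deepest class containing it, and upper classes obtain their totals by addition rather than by rescanning their ranges.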

4. A method of counting the appearance frequency of a character string, comprising: a step of generating, for a corpus containing a plurality of documents, a suffix array, which is a set of character strings each ranging from a character at a certain position in a document to the end of that document; a step of counting the appearance frequency cf_k(x) of those occurrences of a given character string x whose degree of duplication is k or more (k being a natural number of 2 or more); and a step of obtaining the number of documents df_k(x) in which the character string x appears k or more times as the difference between cf_k(x) and cf_{k+1}(x).
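The first step of claim 4 can be sketched in C as follows. This is a simplified illustration, not the routine of Table 4: it assumes the corpus is one NUL-terminated string with documents separated by '\n', it orders suffixes with plain strcmp instead of a comparison that stops at the document boundary, and it only computes the plain frequency of a query string that contains no '\n'. The names g_text, build_suffix_array, first_index and term_frequency are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

static const char *g_text;   /* text seen by the qsort comparator */

static int compare_suffixes(const void *a, const void *b)
{
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

/* Returns a malloc'd array holding the n start positions of the suffixes
   of text in lexicographic order, or NULL if allocation fails. */
int *build_suffix_array(const char *text, int n)
{
    int *sa = malloc(sizeof *sa * (size_t)(n > 0 ? n : 1));
    if (sa == NULL) return NULL;
    for (int i = 0; i < n; i++) sa[i] = i;
    g_text = text;
    qsort(sa, (size_t)n, sizeof *sa, compare_suffixes);
    return sa;
}

/* Binary search for the first suffix whose leading strlen(x) characters
   compare >= x (strictly_greater == 0) or > x (strictly_greater != 0). */
static int first_index(const char *text, const int *sa, int n,
                       const char *x, int strictly_greater)
{
    size_t len = strlen(x);
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strncmp(text + sa[mid], x, len);
        if (cmp < 0 || (strictly_greater && cmp == 0)) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Number of suffixes that begin with x, i.e. the number of occurrences of x. */
int term_frequency(const char *text, const int *sa, int n, const char *x)
{
    if (*x == '\0') return 0;
    return first_index(text, sa, n, x, 1) - first_index(text, sa, n, x, 0);
}
```

All suffixes that begin with the same string are adjacent in the sorted array, which is why the frequency is simply the width of a range found by two binary searches.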

5. The counting method according to claim 4, further comprising a step of generating a classification of the suffix array in which character strings belonging to the same class appear the same number of times in the same document, wherein the appearance frequency cf_k(x) is counted for each class of the classification.

6. The counting method according to claim 5, wherein the step of generating the classification generates a classification having a hierarchical structure, and, in the step of counting the appearance frequency cf_k(x), the appearance frequency cf_k(x) counted in a lower class is added to the appearance frequency counted in an upper class.

7. The counting method according to claim 5, wherein the determination of the degree of overlap in the step of counting the appearance frequency cf_k(x) and the detection of the classes in the step of generating the classification are performed at the same time.

8. A counting device comprising: a count processing unit that counts the appearance frequency of a character string in a set of documents; and a recording unit that records data regarding the appearance frequency of the character string, wherein the count processing unit includes: a suffix array generation unit that generates a suffix array in which a set of character strings, each ranging from a character at a certain position in a document to the end of that document, is arranged in lexicographic order; a character string frequency counting unit that detects classes in a classification of the suffix array and, simultaneously with the detection, determines the degree of overlap of the appearances of the character string, thereby counting for each class the frequency with which the character string appears in duplicate; and a document frequency calculation unit that calculates, from the appearance frequency of the character string, the frequency of documents in which the character string appears.

9. The counting device according to claim 8, wherein each suffix of the suffix array has a pointer to the suffix that belongs to the same document and immediately precedes it in the order of the suffix array, and the character string frequency counting unit determines the degree of overlap by determining whether the pointer can be followed sequentially as many times as the degree of overlap.
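The pointer of claim 9 corresponds to the previous_suffix field that Table 4 fills in with the help of last_suffixes. The C sketch below shows the idea in simplified form and is not the routine of Table 4: it assumes doc_id[i] gives the document number of the i-th entry of the suffix array, and it tests the degree of overlap of one entry within a range by following the links while they stay inside the range. The names build_previous_links, has_degree_at_least and count_with_degree are hypothetical.

```c
#include <stdlib.h>

/* previous[i] = index of the nearest earlier suffix-array entry that belongs
   to the same document as entry i, or -1 if there is none.
   Returns a malloc'd array, or NULL if allocation fails. */
int *build_previous_links(const int *doc_id, int n, int ndocs)
{
    int *previous = malloc(sizeof *previous * (size_t)(n > 0 ? n : 1));
    int *last = malloc(sizeof *last * (size_t)(ndocs > 0 ? ndocs : 1));
    if (previous == NULL || last == NULL) {
        free(previous);
        free(last);
        return NULL;
    }
    for (int d = 0; d < ndocs; d++) last[d] = -1;
    for (int i = 0; i < n; i++) {
        previous[i] = last[doc_id[i]];   /* most recent entry of this document */
        last[doc_id[i]] = i;
    }
    free(last);
    return previous;
}

/* Entry i has degree of overlap >= k within the range starting at first
   if the link can be followed k - 1 times without leaving the range. */
int has_degree_at_least(const int *previous, int first, int i, int k)
{
    int p = i;
    for (int step = 1; step < k; step++) {
        p = previous[p];
        if (p < first) return 0;   /* also covers the "no link" value -1 */
    }
    return 1;
}

/* Number of entries in [first, last] whose degree of overlap is k or more. */
int count_with_degree(const int *previous, int first, int last, int k)
{
    int count = 0;
    for (int i = first; i <= last; i++) {
        if (has_degree_at_least(previous, first, i, k)) count++;
    }
    return count;
}
```

For document numbers 0, 1, 0, 0, 2, 1 over a range of six entries, the counts for degrees 1, 2 and 3 are 6, 3 and 1, so the differences give 3 documents containing the string at least once, 2 at least twice and 1 at least three times.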

10. The counting device according to claim 8 or 9, wherein the character string frequency counting unit classifies the suffix array into a hierarchical class structure and adds the appearance frequency of the character string counted in a lower class to the appearance frequency counted in an upper class.

11. The counting device according to any one of claims 8 to 10, wherein the document frequency calculation unit calculates the frequency of documents in which the character string appears k or more times (k being a natural number) as the difference between the appearance frequency of the character string with a degree of duplication of k or more and the appearance frequency with a degree of duplication of k + 1 or more.

12. The counting device according to claim 8, further comprising a keyword extraction unit that extracts a keyword from the documents using the frequency of documents in which the character string appears twice or more.

13. A computer program comprising: a step of generating, for a corpus including a plurality of documents, a suffix array in which a set of character strings, each ranging from a character at a certain position in a document to the end of that document, is arranged in lexicographic order; a step of detecting the classes of a hierarchical class structure of the suffix array and counting, for each class, the appearance frequency cf_k(x) of those occurrences of a given character string x whose degree of duplication is k or more (k being a natural number of 2 or more); a step of adding the appearance frequency cf_k(x) counted in a lower class to the appearance frequency counted in an upper class; and a step of obtaining the number of documents df_k(x) in which the character string x appears k or more times as the difference between the appearance frequency cf_k(x) with a degree of duplication of k or more and the appearance frequency cf_{k+1}(x) with a degree of duplication of k + 1 or more.

14. A method of counting the frequency of appearance of a symbol string, wherein, for a set including a plurality of symbol sequences, the set of partial symbol strings included in the sequences is classified into hierarchical classes such that partial symbol strings belonging to the same class appear the same number of times in the same sequence, and, while so classifying, the number of sequences in which an arbitrary symbol string is included k or more times (k being a natural number of 2 or more) is calculated.