JP6303148B2

JP6303148B2 - Document feature extraction device, document feature extraction method, document classification device, document classification method, document search device, document search method, computer program, and recording medium on which computer program is recorded

Info

Publication number: JP6303148B2
Application number: JP2015113018A
Authority: JP
Inventors: 明潮田
Original assignee: 明潮田
Priority date: 2015-06-03
Filing date: 2015-06-03
Publication date: 2018-04-04
Anticipated expiration: 2035-06-03
Also published as: JP2016224847A

Description

The present invention relates to a document feature extraction device and a document feature extraction method that extract the features of a target document when documents are classified or searched using machine learning. It also relates to a document classification device, a document classification method, a document search device, a document search method, a computer program, and a recording medium on which the computer program is recorded, all of which use those features to classify or search documents. The term "document" as used in the present invention is not limited by length or purpose and includes things such as a single sentence or a search expression. It also covers not only documents consisting of character strings but also documents composed of other data, for example audio, video, images, and other data.

In machine learning, a case is often represented by a vector whose elements are the features of the target document (a "feature vector"; the Japanese original gives the two synonymous terms 素性ベクトル and 特徴ベクトル). In that case, the similarity between cases and the distribution of cases are given by the similarity between, and the distribution of, the feature vectors that represent the cases.

For example, in document classification the cases to be classified are individual documents, and classification is performed by assigning zero, one, or more document categories from a given set of categories to each document. The feature vector for each document is a vector whose elements are, for instance, the presence or absence (0/1) of each word in the document, the frequency of each word in the document, or a word weight (evaluation value) calculated from the frequency and the importance of the word. One word evaluation value that is widely used in practice is tfidf. tfidf is calculated from two indices, tf (the frequency of a word within a document) and idf (inverse document frequency). There are several variants of the tfidf calculation, such as tf-idf-cf, tfc, ltc, atc, ntc, ann, bnn, lnn, and nnn; it can be obtained, for example, by the following formula.
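The formula itself did not survive in this text. As an assumption, and not as the patent's verbatim formula, a common tf-idf definition that uses exactly the symbols explained in the paragraph that follows is:

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}},
\qquad
\mathrm{idf}_{i} = \log \frac{|D|}{\left|\{\, d : d \ni t_i \,\}\right|}
```

The sum over k runs over all words in document d_j, so tf is a relative frequency, and idf grows as the word appears in fewer documents.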

Here, n_{i,j} is the number of occurrences of word t_i in document d_j, |D| is the total number of documents, and |{d : d ∋ t_i}| is the number of documents that contain word t_i.

Although the term "word" is used here, as already stated, a "document" in the present invention includes not only documents consisting of character strings but also documents composed of other data such as audio, video, images, and other data. Accordingly, a "word" in the present invention is not limited to a word in a natural language; it is the broader concept of a unit of data that carries some meaning.

The element of the document vector of document d_j that corresponds to word t_i is as follows.
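The expression for the vector element is missing from this text. The sketch below assumes the usual product tfidf_{i,j} = tf_{i,j} × idf_i, with tf as the relative frequency n_{i,j} / Σ_k n_{k,j} and idf as log(|D| / |{d : d ∋ t_i}|). The function name `tfidf_vectors` and the toy corpus are hypothetical and only illustrate how such a document vector could be built.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)                                   # |D|
    # Number of documents containing each word: |{d : d contains t_i}|
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)                            # n_{i,j}
        total = sum(counts.values())                     # sum_k n_{k,j}
        vectors.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in counts.items()
        })
    return vectors

# Toy corpus (hypothetical), already split into words.
docs = [["image", "processing", "device"],
        ["image", "search"],
        ["document", "search", "search"]]
vectors = tfidf_vectors(docs)
```

In this toy corpus "device" occurs in one document and "image" in two, so within the first document "device" receives the larger weight.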

An example of a document classification method that uses tfidf as the word weight is the supervised learning method described in Non-Patent Document 1. In supervised document classification, a learner is first trained on training data consisting of a set of documents to which document categories have been assigned, and document categories are then assigned to arbitrary documents. In the first step of training, each document is converted into a set of feature and feature-value pairs. In the method of Non-Patent Document 1, an SVM (Support Vector Machine) is used as the learner. The features are the words that occur three or more times in the training data, excluding stop words such as "and", "or", and "the", which are words excluded from search. The feature value is tfc, a variant of tfidf. Non-Patent Document 1 reports that, in a document classification experiment on the Reuters corpus that compared the classification accuracy of five machine learners including SVM, SVM achieved the highest accuracy and the k-nearest neighbor classifier (KNN) the second highest.

There are other reports that SVM and KNN achieved high accuracy in document classification using machine learning. Many reports have also been made on feature selection methods, that is, how to select the words to use as features from an enormous number of words. However, it has not yet been made sufficiently clear what kinds of character strings should be used as features to obtain high accuracy, or how feature values should be computed to obtain high accuracy.

For example, when classifying Japanese documents, the text must first be segmented into words before feature selection, but differences in how words are segmented can cause feature mismatches between training data and test data. As an example, as shown in Table 1, suppose the training data contains a document with the string "医療科学部" (Faculty of Medical Science) and the test data contains a document with the string "健康医療科学部" (Faculty of Health and Medical Science), and assume that the two documents are closely related. If "医療科学部" in the training data is segmented as "医療科" + "学部" and "健康医療科学部" in the test data is segmented as "健康医療" + "科学" + "部", the two strings share no word, and the relevance between the training data and the test data is estimated to be lower than it actually is.
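The mismatch can be checked directly. The snippet below is a minimal sketch that uses only the two segmentations given in the example; the variable names are hypothetical.

```python
# Segmentations taken from the example in the text.
train_tokens = ["医療科", "学部"]            # "医療科学部" in the training data
test_tokens = ["健康医療", "科学", "部"]      # "健康医療科学部" in the test data

# Word-level features: the two documents have no word in common.
shared = set(train_tokens) & set(test_tokens)

# Character-level view: the training string is contained in the test string.
contained = "".join(train_tokens) in "".join(test_tokens)

print(shared, contained)
```

The word sets do not intersect even though one string is a substring of the other, which is the loss of relevance the paragraph describes.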

Thus, when classifying Japanese documents, the conventional approach of segmenting words by deterministic analysis and using the segmented words directly as features (or feature candidates) cannot fully capture the relationships between documents.

In document classification, information retrieval, and various other language processing systems that treat a document as a set of words and estimate the characteristics and identity of the document from those of its words, longer words tend to represent the characteristics of a document better. For example, consider measuring how close the content of document A is to that of document B, or how close search expression A is to document B in the set of documents being searched. The contents of document A / search expression A and document B can be estimated to be closer when the word "color image information processing device" appears in both than when the words "color", "image", "information", "processing", and "device" each appear in both. Therefore, when discriminative power as a feature is emphasized, extracting longer units, technical terms in particular, works better as features in these language processing systems.

However, if the word "color image information processing device" appears only in document A / search expression A and the word "image information processing device" appears only in document B, and each is extracted as a single word, an important element common to document A / search expression A and document B is lost. Therefore, to avoid such misses, that is, to increase coverage, shorter extraction works better as features, for example extracting "image", "information", "processing", and "device" as words in both document A / search expression A and document B. The prior art has not effectively resolved this trade-off.

As a conventional technique that replaces or supplements word-based features, document classification methods that use word n-grams as features have been proposed. Patent Documents 1 and 2 propose document classification methods that extract feature information including word n-grams and use it as the features of a vector. However, they do not show which word n-grams are to be extracted from a document. In other prior art as well, when word n-grams are used as features they are specified only in the form "word n-grams (n = 1, 2, 3, ...)". Word n-grams formed simply by lining up words differ greatly in their discriminative power as features. There is also the problem that the number of features grows explosively as n increases. Thus, simply limiting the value of n, as in conventional methods, is insufficient for extracting word n-grams that are effective as features.

Patent Documents 3 and 4 describe combining adjacent phrases (bunsetsu) in a sentence composed of phrases, extracting part of the sentence as a phrase n-gram, and using it as a feature. However, the techniques of Patent Documents 3 and 4 have the same problems as those of Patent Documents 1 and 2.

Patent Document 1: JP 2014-067154 A
Patent Document 2: JP 2014-123286 A
Patent Document 3: JP 2013-171550 A
Patent Document 4: JP 2015-011426 A
Patent Document 5: Japanese Patent No. 4478042
Non-Patent Document 1: T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in: Proceedings of 10th European Conference on Machine Learning (ECML), Chemnitz, Germany, pp. 137-142, 1998
Non-Patent Document 2: Yiming Yang and Jan O. Pedersen, "A comparative study on feature selection in text categorization," ICML, Vol. 97, 1997
Non-Patent Document 3: Information Retrieval (Z39.50): Application Service Definition and Protocol Specification, ANSI/NISO Z39.50-2003, http://www.loc.gov/z3950/agency/Z39-50-2003.pdf

The present invention provides a document feature extraction device and a document feature extraction method that extract features of a target document with which high accuracy can be obtained when documents are classified or searched using machine learning. It also provides a document classification device, a document classification method, a document search device, and a document search method that use such features. The term "document" here includes not only documents consisting of character strings but also documents composed of other data, for example audio, video, images, and other data.

The prior-art discussion described a trade-off: longer word extraction is better when discriminative power as a feature is emphasized, whereas shorter extraction works better as features when misses are to be avoided, that is, when coverage is to be increased. The following two techniques are proposed as means of adjusting this trade-off.

The first technique uses word strings within a specific region as features. For example, consider the case where the word "color image information processing device" appears only in document A / search expression A and the word "image information processing device" appears only in document B. First, the words "color", "image", "information", "processing", "device", and so on are extracted from document A, and concatenating them yields word strings such as "color image information processing device", "image information processing device", "information processing device", "image", "information", "processing", and "device". Next, the words "image", "information", "processing", "device", and so on are extracted from document B, and concatenating them yields word strings such as "image information processing device", "information processing device", "image", "information", "processing", and "device". Comparing the word strings obtained shows that identical word strings exist on both sides, and detecting them makes it possible to judge whether document A / search expression A and document B are similar.

Note that a "word string" here includes not only the concatenation of two or more adjacent words but also each word before concatenation. In the example above, the word strings obtained by concatenating the words "color", "image", "information", "processing", and "device" extracted from document A accordingly include "image", "information", "processing", and "device".
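The comparison between document A and document B can be sketched as follows. This is an illustration under assumptions: the text lists only some of the word strings ("and so on"), whereas the hypothetical helper `word_strings` below generates every run of one or more adjacent words, and the English words are joined with spaces for readability.

```python
def word_strings(words, sep=" "):
    """All runs of one or more adjacent words, joined into strings."""
    return {sep.join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}

doc_a = ["color", "image", "information", "processing", "device"]
doc_b = ["image", "information", "processing", "device"]

# Word strings found on both sides act as shared features.
common = word_strings(doc_a) & word_strings(doc_b)
longest = max(common, key=len)
```

Here `common` contains both the short strings such as "image" and the long string "image information processing device", so the match is not lost even though the full compound of document A never appears in document B.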

In addition, a technique is also proposed that uses the fact that multiple word strings are present together within a specific region, that is, co-occurrence. For example, suppose document A is a patent document and one wants to know the technical field of the technology it describes. If the word string "本発明は" ("the present invention") and the word string "自然言語処理" ("natural language processing") both appear in the same sentence of document A, it is more likely that the technical field of document A is natural language processing than if "本発明は" appears in one sentence and "自然言語処理" appears in a different sentence.

The explanation above assumes that multiple word strings are present together, that is, co-occurrence. However, there may be cases where, for example, word string α is ambiguous between two meanings: it has one meaning when α and β co-occur, and usually the other meaning when α is present but β is absent. To cover such cases, the term "co-occurrence" is used below not only for simple co-occurrence but in an extended sense that also includes cases such as α being present while β is absent, or only one of α and β being present. An example of this "co-occurrence" is described in detail in Example 3.
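A minimal sketch of simple and extended co-occurrence within one region follows. The details of Example 3 are not part of this section, so the function `cooccurrence_features`, the idea of a `watched` list, and the feature labels ("a&b", "a&!b") are all hypothetical.

```python
from itertools import combinations

def cooccurrence_features(region_strings, watched):
    """region_strings: set of word strings found in one region (e.g. one sentence).
    watched: word strings whose joint presence or absence is of interest.
    Returns labels for simple co-occurrence ("a&b") and for the extended
    case where one string is present and another is absent ("a&!b")."""
    present = [w for w in watched if w in region_strings]
    absent = [w for w in watched if w not in region_strings]
    features = {f"{a}&{b}" for a, b in combinations(present, 2)}
    features |= {f"{a}&!{b}" for a in present for b in absent}
    return features

watched = ["本発明は", "自然言語処理"]

# Both word strings in the same sentence.
same_sentence = cooccurrence_features({"本発明は", "自然言語処理", "に"}, watched)

# The two word strings in different sentences.
first = cooccurrence_features({"本発明は", "装置"}, watched)
second = cooccurrence_features({"自然言語処理", "装置"}, watched)
```

With this labeling, the case where both strings share a sentence and the case where they are split across sentences produce different features, so a learner can weight them differently.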

Using these techniques achieves both discriminative power and coverage, and as a result high classification accuracy and search accuracy can be achieved.

FIG. 1 is a flowchart of Example 1.
FIG. 2 is a diagram showing how a character string to be processed is handled in Example 1.
FIG. 3 is a diagram showing the word string generation performed in step S104 of Example 1.
FIG. 4 is a diagram showing the parse tree created in Example 2.
FIG. 5 is a flowchart of Example 2.
FIG. 6 is a flowchart of Example 3.
FIG. 7 is a diagram showing the word string co-occurrence processing performed in step S605 of Example 3.
FIG. 8 is a block diagram of the document classification device of Example 5.
FIG. 9 is a flowchart of document classification in the document classification device of Example 5.
FIG. 10 is a flowchart of creating the feature statistics used by the document classification device of Example 5.
FIG. 11 is a flowchart of creating the classification model used by the document classification device of Example 5.
FIG. 12 is a block diagram of the document search device of Example 6.
FIG. 13 is a flowchart of document search in the document search device of Example 6.
FIG. 14 is a flowchart of creating the feature statistics used by the document search device of Example 6.
FIG. 15 is a flowchart for processing a search expression in the document search device of Example 6.

Embodiments for carrying out the present invention are described in detail below with reference to the drawings.

Example 1 is a technique that uses word strings within a specific region as features.

FIG. 1 is a flowchart of Example 1. FIG. 2 is a diagram showing how a character string to be processed is handled in Example 1.

First, in Example 1, the document from which features are to be extracted is read in reading step S101 of FIG. 1.

Next, in morphological analysis step S102, the read document is morphologically analyzed using a dictionary and knowledge of the grammar of the target language, and is divided into words. The unit of division in morphological analysis step S102 need not be the word, and other units pose no particular problem. The "word" here is therefore better understood as a "first unit", and morphological analysis step S102 as a "first boundary division step".

Then, in boundary detection step S103, phrase (bunsetsu) boundaries are extracted from the analyzed document.

The extracted phrase boundaries are used when word strings are created in the next step, S104. Their purpose is to keep the range over which word strings are created as short as possible, and thereby to prevent the number of word-string combinations extracted in word string extraction step S104 from becoming very large. Boundary detection step S103 therefore only needs to extract boundaries that no word straddles. Instead of phrase boundaries, other boundaries such as sentence boundaries or punctuation boundaries can be used. Depending on the language being processed, boundaries other than phrase or sentence boundaries may also be used, as long as no word straddles them.

In addition, consider applying the processing of Example 1 to an entire document to extract features. Suppose a boundary extracted in boundary detection step S103 at some point in the document is incorrect and splits a compound word, so that the result of word string extraction step S104 around that boundary is not entirely accurate. Important statements usually appear several times in a document, so a single error does not carry much weight for the document as a whole. Consequently, even if boundary detection step S103 simply places a boundary every fixed number of characters, while still satisfying the premise that no word straddles a boundary, the final result of classification or search is not necessarily greatly affected.

In other words, the "phrase" here means a unit longer than a word, that is, a "second unit longer than the first unit", and boundary detection step S103 can likewise be called a "second boundary division step".

In rare cases, the words on either side of a phrase boundary are semantically related, so using a unit longer than a phrase as the second unit may achieve higher classification accuracy or search accuracy.

The next step, word string extraction step S104, performs the most important processing in Example 1: it extracts word strings from within a specific region of the character string.

With reference to FIG. 2, the following shows how word string extraction step S104 processes a character string contained in a document or search expression from which features are to be extracted. FIG. 2 is also used to explain the processing performed by morphological analysis step S102 and boundary detection step S103, which precede word string extraction step S104 and are prerequisites for it.

First, in reading step S101, the character string "対地作業機のローリング制御装置における対地追従性能を向上させる" ("improve the ground-following performance of a rolling control device for a ground work machine") is read as the character string to be processed.

Next, in morphological analysis step S102, morphological analysis is performed and the input string is divided into words. Because word string extraction step S104 is performed afterward, the morphological analysis in step S102 of Example 1 does not need to treat compound words such as "カラー画像情報処理装置" (color image information processing device) or "対地作業機" (ground work machine) as single words. As shown in FIG. 2, there is no problem in dividing finely, for example "対地作業機" into "対", "地", "作業", "機" and "ローリング制御装置" into "ローリング", "制御", "装置". The grammar knowledge and dictionary needed for this morphological analysis can therefore be kept simple, and the processing time for morphological analysis can be shortened.

Then, in boundary detection step S103, phrase boundaries are extracted. In FIG. 2, the extracted phrase boundaries are indicated by "/".

In word string extraction step S104, each region enclosed by the phrase boundaries extracted in boundary detection step S103 is processed. Within the region, adjacent words obtained by the morphological analysis of step S102 are concatenated and extracted as word strings, and these, together with the words before concatenation, are all output as word strings. For example, as shown in FIG. 2, "対地作業機の" is extracted as a phrase, and "対", "地", "作業", "機", "の" are extracted from it as words. The concatenations of adjacent words, "対地", "対地作業", "対地作業機", "対地作業機の", "地作業", "地作業機", "地作業機の", "作業機", "作業機の", and "機の", are generated and output as word strings together with the unconcatenated words "対", "地", "作業", "機", "の". FIG. 3 shows this concatenation process schematically: for a phrase consisting of four words in the order αβγδ, it shows the word strings that are output when the word string generation of step S104 is applied to that phrase.
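The word string generation of step S104 can be sketched as enumerating every run of adjacent words inside one phrase. The function name `extract_word_strings` is hypothetical, and the input is assumed to be already split into words (step S102) and confined to one phrase (step S103).

```python
def extract_word_strings(phrase_words):
    """Sketch of step S104: every run of one or more adjacent words
    inside a single phrase, concatenated without separators."""
    n = len(phrase_words)
    return ["".join(phrase_words[i:j])
            for i in range(n)
            for j in range(i + 1, n + 1)]

# The phrase "対地作業機の" from the text, divided into five words.
strings = extract_word_strings(["対", "地", "作業", "機", "の"])

# The four-word phrase αβγδ used for FIG. 3.
abcd = extract_word_strings(["α", "β", "γ", "δ"])

print(len(strings), len(abcd))
```

A phrase of n words yields n(n+1)/2 word strings (15 for five words, 10 for four), which is why keeping regions short with the boundaries of step S103 limits the number of combinations.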

Because word strings are generated in this way, almost all of the words that may be contained in the target document are detected, albeit with overlap.

As already stated, the "word" here means a "first unit", and morphological analysis step S102 may divide the text into units other than words, so word string extraction step S104 can also be called a "string extraction step".

Furthermore, because word string extraction step S104 performs this concatenation, the morphological analysis in step S102 need not segment compound words such as "カラー画像情報処理装置" or "対地作業機" as single words. The grammar knowledge and dictionary needed for the morphological analysis in step S102 can be kept simple, the time needed for morphological analysis can be shortened, and the processing time of Example 1 as a whole can be reduced.

Then, in feature output step S105, the word strings extracted in word string extraction step S104, as shown in FIG. 2, are output as the features of the document.

By using the word strings generated in Example 1 as document features for search and classification, both discriminative power and coverage are obtained, and as a result high classification accuracy and search accuracy can be achieved.

Note that Patent Document 5 describes extracting the words within a specific region in a manner that permits overlap; judged by its output alone, it can be said to resemble the first embodiment.
However, the morpheme lattice 80 generated by the morpheme lattice generation unit 10 of Patent Document 5 is, as described in paragraph 0045 and FIGS. 2 and 6 of that document, produced by a morphological analysis that presupposes that 「東」, 「京」, 「都」, 「東京」 (Tokyo), 「京都」 (Kyoto), and so on all exist as words in the dictionary used for morphological analysis. In contrast, with the morphological analysis performed in step S102 of the first embodiment, 「東」, 「京」, 「都」, 「東京」, 「京都」, 「東京都」 (Tokyo Metropolis), and so on are obtained as word strings even when only 「東」, 「京」, and 「都」 exist as words in the dictionary and 「東京」 and 「京都」 do not. Even if the final results can be said to be similar, the method used to obtain them is different.

The second embodiment is an example in which the first embodiment is applied to English. Beyond simply applying the first embodiment to English, the second embodiment adds processing adapted to English grammar.
FIGS. 4 and 5 illustrate the processing of an English sentence according to the second embodiment.

FIG. 4 shows a parse tree obtained by parsing an input sentence. In FIG. 4, NP^[2], V^[4], NP^[5], NP^[7], V^[9], NP^[10], P^[12], NP^[13], and so on indicate the type of each phrase, and "(congress)", "(sent)", "(president)" beside them indicate the head, which is the main part of the phrase.

FIG. 5 is a flowchart corresponding to FIG. 1 of the first embodiment.
The second embodiment is described using, as an example, the input sentence shown in FIG. 4: "Congress sent the president an $ 18.4 billion fiscal 1990 Treasury and Postal Service bill providing $ 5.5 billion for the Internal Revenue Service".

First, the input document is parsed, divided into the units of words, phrases, and sentences, and a parse tree is created. This processing is performed in the reading step S501, the parsing step S502, and the boundary detection step S503. As in the first embodiment, "word" means no more than a "first unit", "phrase" means no more than a "second unit longer than the first unit", and "sentence" means no more than a "third unit longer than both the first unit and the second unit". The boundary detection step S503 may likewise be called a second boundary division step.

As explained for the boundary detection step S103 of the first embodiment, depending on the target language the boundaries extracted in the boundary detection step S503 may be boundaries other than phrase or sentence boundaries. Moreover, even if the boundary detection step S503 places a boundary every fixed number of characters, while satisfying the premise that no word straddles a boundary, this does not necessarily have a large effect on the final result of classification or search.

In parallel with this, the parsing step S502 extracts all the words in the sentence. In the example of FIG. 4, the words "Congress", "sent", "the", "president", "an", "$", "18.4", "billion", "fiscal", "1990", "Treasury", "and", "Postal", "Service", "bill", "providing", "$", "5.5", "billion", "for", "the", "Internal", "Revenue", "Service" are extracted.

Then, in the phrase processing step S504, based on the outputs of the parsing step S502 and the boundary detection step S503, the phrases are classified into noun phrases (NP), verb phrases (VP), adjective phrases (AP), adverb phrases (AdvP), prepositional phrases (PP), conjunction phrases (ConjP), quantifier phrases (QP), and so on, and the head of each phrase is extracted. As described below, these two kinds of processing performed in the phrase processing step S504 are not necessarily essential. The processing from step S502 to step S504 may also be performed in a single step by using a parser based on a context-free grammar annotated with head information.

Next, in the head string extraction step S505, among the extracted heads, those assigned to internal nodes that have a leaf node of the parse tree (that is, a node holding a single word) as a child node are concatenated. In the parse tree of FIG. 4 these are the heads assigned to NP^[2], V^[4], NP^[5], NP^[7], V^[9], NP^[10], P^[12], and NP^[13], and they are extracted as a head string such as "Congress sent president bill providing billion for Service". Note that, according to some grammatical theories, the head of a prepositional phrase (PP) consisting of a preposition (P) and a noun phrase (NP) is the preposition (P), but in this embodiment the noun phrase (NP) was treated as the head.
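The selection rule of step S505 can be sketched on a toy tree. This is only an illustration under stated assumptions: the `Node` class, the `head_string` function, and the small tree below are hypothetical and are not the parse tree of FIG. 4; the traversal is a simple pre-order walk, which matches sentence order only when a node's leaf children precede its sub-phrases.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    label: str                      # phrase type, e.g. "NP" or "V"
    head: str                       # head word assigned to this phrase
    children: List[Union["Node", str]] = field(default_factory=list)  # str = leaf word


def head_string(node: Node) -> List[str]:
    """Collect the heads of internal nodes that have at least one leaf (word) child."""
    heads = []
    if any(isinstance(child, str) for child in node.children):
        heads.append(node.head)
    for child in node.children:
        if isinstance(child, Node):
            heads.extend(head_string(child))
    return heads


# Hypothetical tree for "Congress sent the president".
tree = Node("S", "sent", [
    Node("NP", "Congress", ["Congress"]),
    Node("VP", "sent", [
        Node("V", "sent", ["sent"]),
        Node("NP", "president", ["the", "president"]),
    ]),
])
print(" ".join(head_string(tree)))  # Congress sent president
```

The S and VP nodes contribute nothing because none of their children is a leaf, which is how function words such as "the" drop out of the head string.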

Further, in the word string extraction step S506, word strings are generated as follows from each phrase of the input sentence and from the head string generated in the head string extraction step S505.
First, for each noun phrase (NP) in the input sentence, adjacent words within the noun phrase are successively concatenated to generate word strings. For example, from the noun phrase "the Internal Revenue Service", the word strings "the", "Internal", "Revenue", "Service", "the Internal", "the Internal Revenue", "the Internal Revenue Service", "Internal Revenue", "Internal Revenue Service", and "Revenue Service" are generated. Although word strings are generated here only from adjacent words within noun phrases, adjacent words may of course be concatenated for all phrases regardless of type. In that case, the classification of phrases in the phrase processing step S504 described above is not needed.

Then, adjacent words within the head string are successively concatenated to generate word strings. That is, from a head string such as "Congress sent president bill providing billion for Service", word strings such as "Congress", "Congress sent", "Congress sent president", "Congress sent president bill", "Congress sent president bill providing", "Congress sent president bill providing billion", "Congress sent president bill providing billion for", "Congress sent president bill providing billion for Service", "sent", "sent president", "sent president bill", "sent president bill providing", "sent president bill providing billion", "sent president bill providing billion for", "sent president bill providing billion for Service", "president", "president bill", "president bill providing", ... are generated.

As a result, from the single sentence "Congress sent the president an $ 18.4 billion fiscal 1990 Treasury and Postal Service bill providing $ 5.5 billion for the Internal Revenue Service", two sets of word strings are generated: those generated from the noun phrases (NP) and those generated from the head string. Duplicates are removed as necessary, and the two sets are combined into the word strings for the sentence.
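The merging and de-duplication described for step S506 might look like the following sketch. It assumes that the noun phrases and the head string have already been produced by a parser (steps S502 to S505); the names `word_strings` and `sentence_word_strings` are illustrative, and only one of the sentence's noun phrases is used to keep the example short.

```python
from typing import Dict, List


def word_strings(words: List[str], sep: str = " ") -> List[str]:
    """Every run of adjacent words, single words included."""
    return [sep.join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]


def sentence_word_strings(noun_phrases: List[List[str]], heads: List[str]) -> List[str]:
    """Word strings from each NP and from the head string, duplicates removed."""
    seen: Dict[str, None] = {}          # a dict keeps first-seen order
    for words in noun_phrases + [heads]:
        for s in word_strings(words):
            seen.setdefault(s, None)
    return list(seen)


noun_phrases = [["the", "Internal", "Revenue", "Service"]]
heads = ["Congress", "sent", "president", "bill", "providing", "billion", "for", "Service"]
features = sentence_word_strings(noun_phrases, heads)
print(len(features))  # 45
```

The noun phrase contributes 10 word strings and the eight-word head string 36; "Service" occurs in both sets, so 45 distinct word strings remain.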

Of course, the generation of word strings from the head string may be omitted. In that case, the extraction of heads in the phrase processing step S504, the head string extraction step S505, and the generation of word strings from the head string and the removal of duplicates in the word string extraction step S506 are not needed.

Finally, in the feature output step S507, the word strings extracted in the word string extraction step S506 are output as the features of the document.
Although the second embodiment has been described using English as an example, it can of course also be applied to languages other than English, including Japanese.

The third embodiment is a technique that, for word strings extracted as features as described in the first and second embodiments, mainly exploits the fact that multiple word strings exist together within a specific region, that is, their co-occurrence.

The term "co-occurrence" is used here because the typical case envisaged in the third embodiment is that multiple word strings exist together, that is, co-occur; the embodiment is not necessarily limited to that case. The description of the third embodiment below also centers on co-occurrence. However, as already noted, there can be cases in which, for example, a word string α is ambiguous between two meanings, carrying one meaning when α and a word string β co-occur and usually the other meaning when α is present but β is absent. In what follows, therefore, "co-occurrence" is extended beyond simple co-occurrence to cover cases such as "word string α is present but word string β is absent" and "only one of word string α and word string β is present".

The "co-occurrence" referred to here can also be regarded as a kind of proximity operation, and applying proximity operations to plain words is well known, as described for example in Non-Patent Document 3, pp. 97-98, "3.7.2 Proximity". In the third embodiment, however, a synergistic effect is obtained by applying "co-occurrence" to word strings rather than to words.

FIG. 6 is a flowchart of the third embodiment.
In the third embodiment, the document from which features are to be extracted is first read in the reading step S601 of FIG. 6. Next, in step S602, the read document is morphologically analyzed as a character string, using grammar knowledge and a dictionary of the target language, and is split into words. The reading step S601 is the same as the reading step S101 of the first embodiment. As in the first embodiment, the unit of division in the morphological analysis step S602 need not be limited to words, and other units pose no particular problem. "Word" here therefore means no more than a "first unit", and the morphological analysis step S602 may equally be called a first boundary division step.

Next, in the boundary detection step S603, phrase boundaries and sentence boundaries are extracted from the analyzed document. As in the first embodiment, the extracted phrase boundaries delimit what may be called a "second unit longer than the first unit" and are used in the word string extraction step S604. The extracted sentence boundaries, by contrast, are used in the word string co-occurrence detection step S605 as the range within which co-occurrence relations are examined. As explained for sentence boundaries in the boundary detection step S503 of the second embodiment, the sentence is likewise a unit longer than words and phrases, that is, a "third unit longer than both the first unit and the second unit". The boundary detection step S603 may thus also be called a second boundary division step.

As explained for the boundary detection step S103 of the first embodiment and the boundary detection step S503 of the second embodiment, depending on the target language the boundaries extracted in the boundary detection step S603 may be boundaries other than phrase or sentence boundaries. Moreover, even if the boundary detection step S603 places a boundary every fixed number of characters, while satisfying the premise that no word straddles a boundary, this does not necessarily have a large effect on the final result of classification or search.

The word string extraction step S604 corresponds to the word string extraction step S104 of the first embodiment and operates in exactly the same way, so its description is omitted.
The word string co-occurrence detection step S605 performs the most important processing in the third embodiment: it detects that multiple word strings co-occur within a specific region of the character string. As already noted, "word" here means no more than a "first unit"; since the morphological analysis step S602 may split the text into units other than words, the word string co-occurrence detection step may equally be called a sequence co-occurrence detection step.

With reference to FIG. 7, the co-occurrence to be detected by the word string co-occurrence detection step S605 is now described, including its extension, as noted above, to cases such as "word string α is present but word string β is absent" and "only one of word string α and word string β is present".

FIG. 7(a) shows examples of "co-occurrence" for a sentence to be processed that consists of the three word strings "αβγ". The examples are: "word string α and word string β both exist", in which the presence of α and the presence of β are in a logical product (AND) relation; "word string α exists but word string β does not", in which the presence of α and the absence of β are in a logical product (AND) relation; "either or both of word string α and word string β exist", in which the presence of α and the presence of β are in a logical sum (OR) relation; and "exactly one of word string α and word string β exists", in which the presence of α and the presence of β are in an exclusive OR (ExOR) relation. The output of the word string co-occurrence detection step S605 only indicates whether the combination of word strings to be detected exists, so the value finally obtained is either "0" or "1". Therefore, even when the sentence to be processed contains the same word string more than once, as with "αβγαβ" in FIG. 7(b), "word string α and word string β both exist" outputs "1".
The reason for processing in this way is that, when judging "word string α and word string β both exist" for the sentence "αβγαβ", it is difficult to decide whether the second word string "β" or the fifth word string "β" is the appropriate co-occurring partner of the first word string "α", or whether both are. On the other hand, this approach cannot correctly reflect the fact that the same combination of word strings co-occurred more than once within a single sentence. In this case, therefore, the step may instead be configured to output "2", the number of nearer combinations, or "4", the number of all combinations.
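The two count-based alternatives can be illustrated as follows. The text does not define "nearer combinations" precisely, so the pairing rule below (each occurrence of the first word string is paired with its nearest occurrence of the second) is an assumption made for illustration, as are the function names.

```python
from typing import List, Tuple


def positions(units: List[str], target: str) -> List[int]:
    """Indices at which a word string occurs in the sentence."""
    return [i for i, unit in enumerate(units) if unit == target]


def all_pairs(units: List[str], a: str, b: str) -> List[Tuple[int, int]]:
    """Every (position of a, position of b) combination."""
    return [(i, j) for i in positions(units, a) for j in positions(units, b)]


def nearest_pairs(units: List[str], a: str, b: str) -> List[Tuple[int, int]]:
    """Assumed reading of 'nearer combinations': pair each a with its closest b."""
    b_pos = positions(units, b)
    if not b_pos:
        return []
    return [(i, min(b_pos, key=lambda j: abs(j - i))) for i in positions(units, a)]


sentence = ["α", "β", "γ", "α", "β"]         # the "αβγαβ" example of FIG. 7(b)
print(len(all_pairs(sentence, "α", "β")))      # 4
print(nearest_pairs(sentence, "α", "β"))       # [(0, 1), (3, 4)]
```

Under this reading the nearest-pair count is 2 and the all-pairs count is 4, matching the values given in the text; other definitions of "nearer" are possible.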

When either "0" or "1" is output, the actual processing of the word string co-occurrence detection step S605 is as follows. It first detects whether each word string to be detected, for example word string "α" and word string "β", exists within a specific region of the character string being processed, for example "αβγ". It then applies to those detection results a logical (Boolean) operation, typified by logical product (AND), logical sum (OR), negation (NOT), and exclusive OR (ExOR).
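A minimal sketch of this two-stage processing (presence detection followed by Boolean operations) is shown below. The function name and the keys "AND", "AND_NOT", "OR", and "XOR" are illustrative labels for the four relations of FIG. 7(a), not terms defined in the patent.

```python
from typing import Dict, Iterable


def cooccurrence_features(region: Iterable[str], a: str, b: str) -> Dict[str, int]:
    """Return 0/1 co-occurrence values for word strings a and b within one region."""
    present = set(region)
    has_a = a in present            # stage 1: presence detection
    has_b = b in present
    return {                        # stage 2: Boolean operations on the results
        "AND": int(has_a and has_b),
        "AND_NOT": int(has_a and not has_b),
        "OR": int(has_a or has_b),
        "XOR": int(has_a != has_b),
    }


print(cooccurrence_features(["α", "β", "γ"], "α", "β"))
# {'AND': 1, 'AND_NOT': 0, 'OR': 1, 'XOR': 0}
print(cooccurrence_features(["α", "β", "γ", "α", "β"], "α", "β")["AND"])
# 1
```

Because presence is tested on a set, repeated occurrences such as those in "αβγαβ" still yield 1, which is the 0/1 behavior described for FIG. 7(b).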

The "co-occurrence" detected in the word string co-occurrence detection step S605 is not limited to pairs of word strings. It also includes "co-occurrence" among three or more word strings, as typified by "word string α, word string β, and word string γ all exist" shown in FIG. 7.

Finally, in the feature output step S606, the word strings extracted in the word string extraction step S604 and the "co-occurrences" detected in the word string co-occurrence detection step S605 are output together as the features of the document. Treating the word strings themselves and the co-occurrences of word strings together as features in this way is hereinafter referred to as using "composite features".

Multiple word strings appearing in proximity can be considered to express a more restricted meaning than a single word string appearing alone. Using the presence or absence of word string co-occurrence output by the word string co-occurrence detection step S605 as a feature can therefore be said to be more precise than using the mere presence or absence of a word string. By using the presence or absence of word string co-occurrence as shown in FIG. 7, generated in the third embodiment, as features for search and classification, both discriminative power and coverage can be achieved, and as a result high classification accuracy and search accuracy can be achieved.

It has been explained that the first and second embodiments, by using word strings such as those shown in FIG. 2 as features, and the third embodiment, by using word string co-occurrences such as those shown in FIG. 7 as features, achieve both discriminative power and coverage in search and classification, and as a result high classification accuracy and search accuracy. Building on this, the fourth embodiment describes one example of how to use tfidf when, as described in the third embodiment, word strings and word string co-occurrences are used together as "composite features".

Equations (1) to (4) of the Background Art show one example of how to obtain tfidf for words and the definition of a document vector based on it. Because a word string is a concatenation of words, as described in the first and second embodiments, when tfidf is applied to word strings each extracted word string can be treated as-is as the word t_i in equations (1) to (4).

In contrast, the part that uses the word string co-occurrence of the third embodiment as features is handled as follows.
First, consider a word string α and a word string β as the word strings subject to co-occurrence. Basically, one introduces the concept of a word string pair P that combines word string α and word string β, and treats the co-occurrence of this pair P as the word t_i in equations (1) to (4), in the same way as a word string. However, as FIG. 7 shows, several kinds of co-occurrence exist for word string α and word string β. The actual number of co-occurrences for word string pairs P is therefore the number of combinations of word string α and word string β subject to co-occurrence detection multiplied by the number of kinds of co-occurrence to be detected.
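The count stated above (combinations of word strings multiplied by kinds of co-occurrence) can be checked with a short enumeration. This is a sketch under assumptions: the relation labels and the tuple layout `(relation, α, β)` are illustrative, and the pairs are unordered, so an asymmetric relation such as "α present, β absent" is counted once per pair here.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

RELATIONS = ("AND", "AND_NOT", "OR", "XOR")   # illustrative labels for FIG. 7(a)


def pair_feature_keys(word_strings: Sequence[str],
                      relations: Sequence[str] = RELATIONS) -> List[Tuple[str, str, str]]:
    """One feature key per (kind of co-occurrence, word string pair P)."""
    pairs = combinations(sorted(set(word_strings)), 2)
    return [(rel, a, b) for a, b in pairs for rel in relations]


keys = pair_feature_keys(["α", "β", "γ"])
print(len(keys))  # 12
```

Three word strings give 3 unordered pairs, and 4 kinds of co-occurrence give 3 × 4 = 12 co-occurrence features, each of which would then play the role of t_i.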

As also explained in the third embodiment, "co-occurrence" is not limited to pairs of word strings; it also includes "co-occurrence" among three or more word strings, as typified by "word string α, word string β, and word string γ all exist" shown in FIG. 7. The word string pair P therefore does not refer only to a combination of two word strings; it is a concept that also covers combinations of three word strings, four word strings, and so on.

In short, when word strings as shown in the first and second embodiments and word string co-occurrences as described in the third embodiment are used together as features, each word string and each word string co-occurrence for a word string pair P is treated as the word t_i in equations (1) to (4); tfidf is obtained for each, and the document vector is finally obtained.
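As a rough sketch of this procedure, the code below builds composite features for one document and weights them. Equations (1) to (4) are not reproduced in this section, so the weight used here (raw count in the document multiplied by a precomputed idf value) is an assumed, commonly used form and may differ from the patent's equations; the function names and the idf values are hypothetical, and only the AND kind of co-occurrence is used.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, List


def composite_features(sentences: List[List[str]]) -> List[Hashable]:
    """Word strings plus within-sentence AND co-occurrences of word string pairs."""
    features: List[Hashable] = []
    for strings in sentences:
        features.extend(strings)                      # word strings themselves
        for a, b in combinations(sorted(set(strings)), 2):
            features.append(("AND", a, b))            # co-occurrence of pair P
    return features


def tfidf_vector(features: List[Hashable], idf: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Assumed weighting: term frequency in the document times idf."""
    tf = Counter(features)
    return {f: tf[f] * idf[f] for f in tf if f in idf}


doc = [["α", "β"], ["α"]]                            # two sentences of word strings
idf = {"α": 0.5, "β": 1.0, ("AND", "α", "β"): 2.0}   # hypothetical idf values
vector = tfidf_vector(composite_features(doc), idf)
print(vector)
```

Word strings and pair co-occurrences share one feature space, so a downstream learner sees them as components of a single document vector.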

As the fifth embodiment, an example of a document classification apparatus that uses document vectors obtained from the composite features shown in the fourth embodiment is described with reference to FIGS. 8 to 11.
FIG. 8 is a block diagram of the document classification apparatus of the fifth embodiment.

The input document receiving unit 801 receives the input of a document to be classified.
The document analysis unit 802 and the appearance frequency generation unit 803 apply to the document to be classified the word string generation described in the first and second embodiments and the word string co-occurrence processing, and thereby obtain composite features.

The document vector generation unit 804 obtains, for the document to be classified, a document vector as described in the fourth embodiment, using the idf stored in the feature statistics storage unit 811.
The feature statistics generation document storage unit 807 stores the documents used to obtain this idf.

The document analysis unit 808 and the appearance frequency generation unit 809 apply to the documents for obtaining idf, stored in the feature statistics generation document storage unit 807, the word string generation described in the first and second embodiments and the word string co-occurrence processing, and thereby obtain composite features.

The feature statistics calculation unit 810 obtains the idf from the total number |D| of documents stored in the feature statistics generation document storage unit 807 and from the documents stored there, and stores it in the feature statistics storage unit 811.
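The role of the feature statistics calculation unit 810 can be sketched as below. The idf formula of the Background Art is not shown in this section, so the form log(|D| / df), with df the number of documents containing the feature, is an assumption; the function name and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def feature_statistics(corpus: List[List[Hashable]]) -> Tuple[int, Dict[Hashable, float]]:
    """Return |D| and an idf table computed from the composite features of each document."""
    total = len(corpus)                       # |D|: number of stored documents
    df: Counter = Counter()
    for features in corpus:
        df.update(set(features))              # count each feature once per document
    idf = {f: math.log(total / n) for f, n in df.items()}   # assumed idf form
    return total, idf


corpus = [["α", "β"], ["α"], ["α", "γ"]]      # composite features of three documents
total, idf = feature_statistics(corpus)
print(total, idf["α"])  # 3 0.0
```

A feature that appears in every document, such as "α" here, receives an idf of 0 under this assumed formula and so contributes nothing to the document vector.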

The learning document storage unit 812 stores the learning documents used to generate the rules for classifying documents.
The document analysis unit 813 and the appearance frequency generation unit 814 apply to the learning documents stored in the learning document storage unit 812 the word string generation described in the first and second embodiments and the word string co-occurrence processing, and thereby obtain composite features. The document vector generation unit 815 obtains, for each learning document, a document vector as described in the fourth embodiment, using the idf stored in the feature statistics storage unit 811.

The learning data generation unit 816, the machine learning unit 817, the learning result storage unit 818, and the classification model unit 819 create and store a classification model by processing the learning documents with machine learning. The algorithm used for this machine learning is preferably one that can be used without assuming independence of the features, such as SVM or CRF (Conditional Random Fields), but other learning algorithms may also be used.

The category ranking unit 805 and the category assigning unit 806 classify the document vector of the document to be classified, obtained by the document vector generation unit 804, into a category using the classification model stored in the classification model unit 819, and output that category.

FIG. 9 is a flowchart of document classification in the document classification apparatus of the fifth embodiment. It is executed by the input document receiving unit 801, the document analysis unit 802, the appearance frequency generation unit 803, the document vector generation unit 804, the category ranking unit 805, and the category assigning unit 806 shown in FIG. 8.

The morphological analysis step S902, the boundary detection step S903, the word string extraction step S904, the word string co-occurrence detection step S905, and the feature output step S906 apply the processing described in the first and second embodiments to the classification target document read in the classification target document reading step S901. Their details are the same as those of the reading step S601, the morphological analysis step S602, the boundary detection step S603, the word string extraction step S604, the word string co-occurrence detection step S605, and the feature output step S606 described in the third embodiment, so their description is omitted.

In the appearance frequency generation step S907, the feature statistics reading step S908, and the document vector generation step S909, the tfidf values are obtained as described in the fourth embodiment using the idf read from the feature statistics storage unit 811 in the feature statistics reading step S908, and the document vector of the classification target document is generated from those values.
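As a rough illustration of steps S907 to S909, the following Python sketch builds a sparse document vector by multiplying each feature's appearance frequency by an idf value read in advance. The function name `tfidf_vector`, the dictionary representation, the sample idf values, and the choice to skip features missing from the idf table are all assumptions made for this example; the exact weighting defined in the fourth embodiment may differ.

```python
from collections import Counter


def tfidf_vector(features, idf):
    """Build a sparse tf-idf vector for one document.

    features: list of feature strings extracted from the document.
    idf: dict mapping feature -> idf value (read beforehand, as in step S908).
    Features absent from the idf table are skipped (an assumption of this sketch).
    """
    tf = Counter(features)  # appearance frequency of each feature
    return {f: count * idf[f] for f, count in tf.items() if f in idf}


# Hypothetical idf table and extracted features, for illustration only.
idf_table = {"document": 0.5, "classification": 1.2, "document classification": 2.0}
doc_features = ["document", "classification", "document classification", "document"]

vector = tfidf_vector(doc_features, idf_table)
print(vector)
```

A dictionary keyed by feature keeps the vector sparse, which suits complex features whose vocabulary can become very large.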

In the document model reading step S910, the classification model is read, and in the document classification execution step S911, the classification target document is classified based on that classification model and the document vector.
FIG. 10 is a flowchart showing how the feature statistics generation document storage unit 807, the document analysis unit 808, the appearance frequency generation unit 809, and the feature statistics calculation unit 810 obtain the feature statistics stored in the feature statistics storage unit 811 of the document classification apparatus of the fifth embodiment.

The morphological analysis step S1002, the boundary detection step S1003, the word string extraction step S1004, the word string co-occurrence detection step S1005, and the feature output step S1006 apply the processing described in the first and second embodiments to the documents for generating feature statistics, which are read from the feature statistics generation document storage unit 807 in the feature statistics document reading step S1001. Their details are the same as those of the reading step S601, the morphological analysis step S602, the boundary detection step S603, the word string extraction step S604, the word string co-occurrence detection step S605, and the feature output step S606 described in the third embodiment, so their description is omitted.

The feature statistics generation step S1007 obtains the idf as described in the fourth embodiment, and the obtained idf is written to the feature statistics storage unit 811 in the feature statistics writing step S1008.
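The feature statistics generation step can be pictured with the short Python sketch below. It assumes the widely used definition idf = log(|D| / df), where |D| is the number of documents and df is the number of documents containing the feature; the formula actually used in the fourth embodiment may differ, and the function name `compute_idf` and the sample corpus are hypothetical.

```python
import math


def compute_idf(documents):
    """Compute idf values from a collection of documents.

    documents: list of feature lists, one list per document.
    Returns a dict mapping feature -> log(|D| / document frequency).
    """
    total = len(documents)  # |D|, the total number of documents
    df = {}
    for features in documents:
        for f in set(features):  # count each feature once per document
            df[f] = df.get(f, 0) + 1
    return {f: math.log(total / n) for f, n in df.items()}


# Hypothetical corpus of already-extracted features.
corpus = [
    ["patent", "search"],
    ["patent", "classification"],
    ["patent", "search", "vector"],
    ["vector"],
]
idf = compute_idf(corpus)
print(idf)
```

Features that occur in few documents receive a larger idf, so they weigh more heavily in the document vectors generated later.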

FIG. 11 is a flowchart for creating the classification model used in the document classification apparatus of the fifth embodiment. It is executed by the learning document storage unit 812, the document analysis unit 813, the appearance frequency generation unit 814, the document vector generation unit 815, the learning data generation unit 816, the machine learning unit 817, the learning result storage unit 818, and the classification model unit 819 shown in FIG. 8.

The morphological analysis step S1102, the boundary detection step S1103, the word string extraction step S1104, the word string co-occurrence detection step S1105, and the feature output step S1106 apply the processing described in the first and second embodiments to the learning documents read from the learning document storage unit 812 in the learning document reading step S1101. Their details are the same as those of the reading step S601, the morphological analysis step S602, the boundary detection step S603, the word string extraction step S604, the word string co-occurrence detection step S605, and the feature output step S606 described in the third embodiment, so their description is omitted.

In the appearance frequency generation step S1107, the feature statistics reading step S1108, and the document vector generation step S1109, as in the appearance frequency generation step S907, the feature statistics reading step S908, and the document vector generation step S909, the tfidf values are obtained as described in the fourth embodiment using the idf read from the feature statistics storage unit 811 in the feature statistics reading step S1108, and the document vectors of the learning documents are generated from those values.

In the machine learning step S1110 and the classification model creation step S1111, a classification model is created by performing machine learning on the generated document vectors of the learning documents.
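To make the learning and classification flow concrete, the Python sketch below trains a simple perceptron on sparse document vectors and then classifies new vectors. The perceptron is only a stand-in chosen to keep the example self-contained; it is not the SVM or CRF named in the description, and it handles two categories rather than a ranking over many. All names and sample vectors are hypothetical.

```python
def train_perceptron(examples, epochs=10):
    """Train a binary linear classifier on sparse vectors.

    examples: list of (vector, label) pairs, vector a dict feature -> weight,
    label +1 or -1. Returns a weight dict with the bias under "__bias__".
    """
    weights = {"__bias__": 0.0}
    for _ in range(epochs):
        for vec, label in examples:
            score = weights["__bias__"] + sum(
                weights.get(f, 0.0) * v for f, v in vec.items()
            )
            if label * score <= 0:  # misclassified: move weights toward the label
                weights["__bias__"] += label
                for f, v in vec.items():
                    weights[f] = weights.get(f, 0.0) + label * v
    return weights


def classify(weights, vec):
    """Return +1 or -1 for a sparse document vector."""
    score = weights["__bias__"] + sum(weights.get(f, 0.0) * v for f, v in vec.items())
    return 1 if score > 0 else -1


# Hypothetical learning data: +1 for one category, -1 for another.
training = [
    ({"engine": 1.0, "fuel": 0.5}, 1),
    ({"engine": 0.8}, 1),
    ({"protein": 1.0, "cell": 0.7}, -1),
    ({"cell": 0.9}, -1),
]
model = train_perceptron(training)
print(classify(model, {"engine": 1.0}), classify(model, {"cell": 1.0}))
```

Like SVM, a linear model of this kind does not assume that the features are independent, which matters here because word strings and their co-occurrences overlap heavily.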

By using complex features based on the word strings shown in the first and second embodiments and on the word string co-occurrence shown in the third embodiment, the document classification apparatus of the fifth embodiment achieves both discriminative power and coverage, and as a result can achieve high classification accuracy.
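The complex features mentioned above can be sketched in Python as follows. The sketch assumes that morphological analysis and boundary detection have already been done, so that a sentence arrives as a list of phrases and each phrase as a list of words. It emits every word and every run of adjacent words inside one phrase as a word string feature, then pairs distinct features that occur in the same sentence as a logical AND co-occurrence feature. The input layout, the function names, and the choice to emit runs of every length are assumptions made for illustration.

```python
from itertools import combinations


def word_string_features(phrases):
    """Extract words and runs of adjacent words within each phrase.

    phrases: list of phrases, each a list of words.
    Words are never connected across a phrase boundary.
    """
    features = []
    for words in phrases:
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                features.append(" ".join(words[start:end]))
    return features


def cooccurrence_features(features):
    """Pair distinct features from one sentence as logical AND features."""
    unique = sorted(set(features))
    return [f"{a} AND {b}" for a, b in combinations(unique, 2)]


# One sentence made of two phrases (hypothetical, already segmented).
sentence = [["document", "vector"], ["search"]]
first = word_string_features(sentence)
second = cooccurrence_features(first)
print(first)
print(second)
```

Longer word strings sharpen discrimination, while the single words and co-occurrence pairs keep coverage, which is the balance the embodiment aims for.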

As a sixth embodiment, an example of a document search apparatus that uses document vectors obtained from the complex features shown in the fourth embodiment will be described with reference to FIGS. 12 to 15.
FIG. 12 is a block diagram of the document search apparatus of the sixth embodiment.

The search target document storage unit 1201 stores the documents to be searched.
The document analysis unit 1202 and the appearance frequency generation unit 1203 obtain complex features by performing the word string generation and word string co-occurrence processing described in the first and second embodiments on the search target documents. The document vector generation unit 1204 obtains, for each search target document, a document vector as described in the fourth embodiment, using the idf stored in the feature statistics storage unit 1211.

The feature statistics generation document storage unit 1207, the document analysis unit 1208, the appearance frequency generation unit 1209, and the feature statistics calculation unit 1210 are the same as the feature statistics generation document storage unit 807, the document analysis unit 808, the appearance frequency generation unit 809, and the feature statistics calculation unit 810 of the fifth embodiment. They determine the total number |D| of documents stored in the feature statistics generation document storage unit 1207, obtain the idf from those documents, and store it in the feature statistics storage unit 1211.

A search expression is entered in the search expression input unit 1212. Here, a natural-language sentence is assumed as the search expression.
The search expression analysis unit 1213 and the appearance frequency generation unit 1214 perform the word string generation and word string co-occurrence processing described in the first and second embodiments on the search expression entered in the search expression input unit 1212. Based on those results, the search vector generation unit 1215 obtains a document vector as described in the fourth embodiment as the search vector for the search expression, using the idf stored in the feature statistics storage unit 1211.

The search unit 1205 and the search result output unit 1206 compare the document vectors of the search target documents obtained by the document vector generation unit 1204 with the search vector obtained by the search vector generation unit 1215, and output the similar search target documents in ranked order.
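The comparison and ranking can be illustrated with the Python sketch below, which scores each search target document by cosine similarity between its document vector and the search vector. The description only states that the vectors are compared; cosine similarity is an assumption chosen here because it is a common measure for tf-idf vectors. The function names and sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict feature -> weight)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (score, document id) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical document vectors and search vector.
docs = {
    "doc_a": {"feature extraction": 2.0, "document": 1.0},
    "doc_b": {"image": 1.5, "document": 0.5},
    "doc_c": {"speech": 1.0},
}
query = {"feature extraction": 1.0, "document": 1.0}

ranking = rank_documents(query, docs)
for score, doc_id in ranking:
    print(doc_id, round(score, 3))
```

Because the search expression is turned into a vector by the same feature extraction as the documents, a natural-language query and a document that share word strings or co-occurrences score highly.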

FIG. 13 is a flowchart of document search in the document search apparatus of the sixth embodiment. It is executed by the search target document storage unit 1201, the document analysis unit 1202, the appearance frequency generation unit 1203, the document vector generation unit 1204, the search unit 1205, and the search result output unit 1206 shown in FIG. 12.

The morphological analysis step S1302, the boundary detection step S1303, the word string extraction step S1304, the word string co-occurrence detection step S1305, and the feature output step S1306 apply the processing described in the first and second embodiments to the search target documents read in the search target document reading step S1301. Their details are the same as those of the reading step S601, the morphological analysis step S602, the boundary detection step S603, the word string extraction step S604, the word string co-occurrence detection step S605, and the feature output step S606 described in the third embodiment, so their description is omitted.

In the appearance frequency generation step S1307, the feature statistics reading step S1308, and the document vector generation step S1309, the tfidf values are obtained as described in the fourth embodiment using the idf read from the feature statistics storage unit 1211 in the feature statistics reading step S1308, and the document vectors of the search target documents are generated from those values.

In the search vector reading step S1310, the search vector is read. In the search execution step S1311, based on the search vector and the document vectors, documents that match or are similar to the search vector are selected from the search target documents, ranked, and output.

FIG. 14 is a flowchart showing how the feature statistics generation document storage unit 1207, the document analysis unit 1208, the appearance frequency generation unit 1209, and the feature statistics calculation unit 1210 obtain the feature statistics stored in the feature statistics storage unit 1211 of the document search apparatus of the sixth embodiment. The feature statistics document reading step S1401, the morphological analysis step S1402, the boundary detection step S1403, the word string extraction step S1404, the word string co-occurrence detection step S1405, the feature output step S1406, the feature statistics generation step S1407, and the feature statistics writing step S1408 perform exactly the same processing as the feature statistics document reading step S1001, the morphological analysis step S1002, the boundary detection step S1003, the word string extraction step S1004, the word string co-occurrence detection step S1005, the feature output step S1006, the feature statistics generation step S1007, and the feature statistics writing step S1008 shown in FIG. 10 of the fifth embodiment, so their description is omitted.

FIG. 15 is a flowchart showing how the document search apparatus of the sixth embodiment analyzes a search expression and obtains a search vector. It is executed by the search expression input unit 1212, the search expression analysis unit 1213, the appearance frequency generation unit 1214, and the search vector generation unit 1215 shown in FIG. 12.

The morphological analysis step S1502, the boundary detection step S1503, the word string extraction step S1504, the word string co-occurrence detection step S1505, and the feature output step S1506 apply the processing described in the first and second embodiments to the search expression read in the search expression reading step S1501. Their details are the same as those of the reading step S601, the morphological analysis step S602, the boundary detection step S603, the word string extraction step S604, the word string co-occurrence detection step S605, and the feature output step S606 described in the third embodiment, so their description is omitted.

In the appearance frequency generation step S1507, the feature statistics reading step S1508, and the search vector generation step S1509, as in the appearance frequency generation step S1307, the feature statistics reading step S1308, and the document vector generation step S1309, the tfidf values are obtained as described in the fourth embodiment using the idf read from the feature statistics storage unit 1211 in the feature statistics reading step S1508, and the search vector is generated from those values.

By using complex features based on the word strings shown in the first and second embodiments and on the word string co-occurrence shown in the third embodiment, the document search apparatus of the sixth embodiment achieves both discriminative power and coverage, and as a result can achieve high search accuracy.

The data feature extraction apparatus, data classification apparatus, data classification method, and data search apparatus according to embodiments of the present invention have been described above together with several examples. The embodiments described above are merely illustrations to facilitate understanding of the present invention and are not intended to limit its interpretation. Although these examples were described for documents consisting of character strings, as already stated, a "document" in the present invention includes not only documents consisting of character strings but also those composed of other data, such as audio, video, images, and other data. Likewise, a "word" in the present invention is not limited to a word in a natural language; it denotes the broader concept of a unit of data that carries some meaning. It is therefore clear that the present invention can also be applied to documents composed of data other than character strings by modifying the necessary parts to suit the characteristics of that data. For example, by selecting phoneme or syllable boundaries instead of phrase or sentence boundaries as the boundaries to be extracted in the first and second embodiments, and selecting phonemes, syllables, or the like instead of words, the present invention can evidently be applied to documents consisting of speech instead of documents consisting of character strings. Similarly, by using image area separation or scene change detection as the boundary extraction algorithm of the first and second embodiments, using the separated image areas or scenes instead of words, and using the image features extracted from those image areas or scenes, the present invention can evidently be applied to documents composed of still images or moving images instead of documents consisting of character strings.

In addition, the present invention can be modified and improved without departing from its gist, and it goes without saying that the present invention includes equivalents thereof.

S101 Reading step
S102 Morphological analysis step
S103 Boundary detection step
S104 Word string extraction step
S105 Feature output step
S501 Reading step
S502 Syntax analysis step
S503 Boundary detection step
S504 Phrase processing step
S505 Head word string extraction step
S506 Word string extraction step
S507 Feature output step
S601 Reading step
S602 Morphological analysis step
S603 Boundary detection step
S604 Word string extraction step
S605 Word string co-occurrence detection step
S606 Feature output step
801 Input document reception unit
802 Document analysis unit
803 Appearance frequency generation unit
804 Document vector generation unit
805 Category ranking unit
806 Category assigning unit
807 Feature statistics generation document storage unit
808 Document analysis unit
809 Appearance frequency generation unit
810 Feature statistics calculation unit
811 Feature statistics storage unit
812 Learning document storage unit
813 Document analysis unit
814 Appearance frequency generation unit
815 Document vector generation unit
816 Learning data generation unit
817 Machine learning unit
818 Learning result storage unit
819 Classification model unit
S901 Classification target document reading step
S902 Morphological analysis step
S903 Boundary detection step
S904 Word string extraction step
S905 Word string co-occurrence detection step
S906 Feature output step
S907 Appearance frequency generation step
S908 Feature statistics reading step
S909 Document vector generation step
S910 Document model reading step
S911 Document classification execution step
S1001 Feature statistics document reading step
S1002 Morphological analysis step
S1003 Boundary detection step
S1004 Word string extraction step
S1005 Word string co-occurrence detection step
S1006 Feature output step
S1007 Feature statistics generation step
S1008 Feature statistics writing step
S1101 Learning document reading step
S1102 Morphological analysis step
S1103 Boundary detection step
S1104 Word string extraction step
S1105 Word string co-occurrence detection step
S1106 Feature output step
S1107 Appearance frequency generation step
S1108 Feature statistics reading step
S1109 Document vector generation step
S1110 Machine learning step
S1111 Classification model creation step
1201 Search target document storage unit
1202 Document analysis unit
1203 Appearance frequency generation unit
1204 Document vector generation unit
1205 Search unit
1206 Search result output unit
1207 Feature statistics generation document storage unit
1208 Document analysis unit
1209 Appearance frequency generation unit
1210 Feature statistics calculation unit
1211 Feature statistics storage unit
1212 Search expression input unit
1213 Search expression analysis unit
1214 Appearance frequency generation unit
1215 Search vector generation unit
S1301 Search target document reading step
S1302 Morphological analysis step
S1303 Boundary detection step
S1304 Word string extraction step
S1305 Word string co-occurrence detection step
S1306 Feature output step
S1307 Appearance frequency generation step
S1308 Feature statistics reading step
S1309 Document vector generation step
S1310 Search vector reading step
S1311 Search execution step
S1401 Feature statistics document reading step
S1402 Morphological analysis step
S1403 Boundary detection step
S1404 Word string extraction step
S1405 Word string co-occurrence detection step
S1406 Feature output step
S1407 Feature statistics generation step
S1408 Feature statistics writing step
S1501 Search expression reading step
S1502 Morphological analysis step
S1503 Boundary detection step
S1504 Word string extraction step
S1505 Word string co-occurrence detection step
S1506 Feature output step
S1507 Appearance frequency generation step
S1508 Feature statistics reading step
S1509 Search vector generation step

Claims

1. A document feature extraction apparatus comprising:
an input means for inputting a document;
a first boundary dividing means for analyzing the document and dividing it into first units;
a second boundary dividing means for dividing the document, which has been divided into first units by the first boundary dividing means, into second units longer than the first units;
a sequence extraction means for connecting adjacent first units within the range of each second unit divided by the second boundary dividing means to form a sequence, and extracting the sequence, together with the first units before connection, as a first feature; and
a feature output means for outputting the first feature as a feature of the document.

2. The document feature extraction apparatus according to claim 1, wherein the sequence extraction means determines, according to the type of each second unit divided by the second boundary dividing means, whether or not to connect adjacent first units within the range of that second unit.

3. The document feature extraction apparatus according to claim 1, wherein the second boundary dividing means, in addition to dividing the document divided into first units by the first boundary dividing means into second units longer than the first units, also divides it into third units longer than the second units.

4. The document feature extraction apparatus according to claim 3, further comprising a means for extracting the main first unit within the range of each second unit, connecting the extracted main first units within the range of a third unit, and outputting them as a main word sequence,
wherein the sequence extraction means further generates a second sequence by connecting adjacent first units included in the main word sequence, and generates the first feature by combining the first sequence and the second sequence.

5. The document feature extraction apparatus according to claim 4, wherein the sequence extraction means generates the first feature by eliminating overlap between the first sequence and the second sequence.

6. The document feature extraction apparatus according to any one of claims 3 to 5, further comprising a sequence co-occurrence detecting means for detecting, for the first features output by the sequence extraction means, whether different first features are present within the range of each third unit divided by the second boundary dividing means, performing a logical operation on whether the different first features are within the range of the same third unit, and outputting the result of the logical operation as a second feature,
wherein the feature output means outputs the first feature and the second feature together as features of the document.

7. The document feature extraction apparatus according to claim 6, wherein the logical operation includes at least one of logical product (AND), logical sum (OR), negation (NOT), and exclusive logical sum (ExOR).

8. The document feature extraction apparatus according to claim 1, wherein the document is a document composed of a character string written in a natural language.

9. The document feature extraction apparatus according to claim 8, wherein the first unit is a word.

10. The document feature extraction apparatus according to claim 8 or 9, wherein the second unit is a segment, a phrase, or a predetermined number of characters.

11. The document feature extraction apparatus according to claim 1, wherein the document is a document including audio, video, or an image.

12. The document feature extraction apparatus according to claim 1, wherein a tfidf value is used as the value of the feature.

13. A document classification apparatus that uses the document features extracted by the document feature extraction apparatus according to any one of claims 1 to 12.

14. A document search apparatus that uses the document features extracted by the document feature extraction apparatus according to any one of claims 1 to 12.

15. A document feature extraction method comprising:
an input step of inputting a document;
a first boundary dividing step of analyzing the document and dividing it into first units;
a second boundary dividing step of dividing the document, which has been divided into first units in the first boundary dividing step, into second units longer than the first units;
a sequence extraction step of connecting adjacent first units within the range of each second unit divided in the second boundary dividing step to form a sequence, and extracting the sequence, together with the first units before connection, as a first feature; and
a feature output step of outputting the first feature as a feature of the document.

16. The document feature extraction method according to claim 15, wherein, in the sequence extraction step, whether or not to connect adjacent first units within the range of each second unit divided in the second boundary dividing step is determined according to the type of that second unit.

17. The document feature extraction method according to claim 15 or 16, wherein the second boundary dividing step, in addition to dividing the document divided into first units in the first boundary dividing step into second units longer than the first units, also divides it into third units longer than the second units.

18. The document feature extraction method according to claim 17, further comprising a step of extracting the main first unit within the range of each second unit, connecting the extracted main first units within the range of a third unit, and outputting them as a main word sequence,
wherein the sequence extraction step further generates a second sequence by connecting adjacent first units included in the main word sequence, and generates the first feature by combining the first sequence and the second sequence.

19. The document feature extraction method according to claim 18, wherein the sequence extraction step generates the first feature by eliminating overlap between the first sequence and the second sequence.

20. The document feature extraction method according to any one of claims 17 to 19, further comprising a sequence co-occurrence detecting step of detecting, for the first features output in the sequence extraction step, whether different first features are present within the range of each third unit divided in the second boundary dividing step, performing a logical operation on whether the different first features are within the range of the same third unit, and outputting the result of the logical operation as a second feature,
wherein the feature output step outputs the first feature and the second feature together as features of the document.

21. The document feature extraction method according to claim 20, wherein the logical operation includes at least one of logical product (AND), logical sum (OR), negation (NOT), and exclusive logical sum (ExOR).

22. The document feature extraction method according to any one of claims 15 to 21, wherein the document is a document composed of a character string written in a natural language.

23. The document feature extraction method according to claim 22, wherein the first unit is a word.

24. The document feature extraction method according to claim 22 or 23, wherein the second unit is a segment, a phrase, or a predetermined number of characters.

25. The document feature extraction method according to any one of claims 15 to 21, wherein the document is a document including audio, video, or an image.

26. The document feature extraction method according to any one of claims 15 to 25, wherein a tfidf value is used as the value of the feature.

27. A document classification method using the document features extracted by the document feature extraction method according to any one of claims 15 to 26.

28. A document search method using the document features extracted by the document feature extraction method according to any one of claims 15 to 26.

29. A program for executing the method according to any one of claims 15 to 28.

30. A recording medium on which a program for executing the method according to any one of claims 15 to 28 is recorded.