JP2007172616A

JP2007172616A - Document search method and device

Info

Publication number: JP2007172616A
Application number: JP2006340412A
Authority: JP
Inventors: Yaojie Ruu; ヤオジエルゥ; Ganmei You; ガンメイヨウ; Xiaoxia Wang; シャオシアワン; Gang Li; ガンリ; Yan Riu; ヤンリウ
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2005-12-19
Filing date: 2006-12-18
Publication date: 2007-07-05
Also published as: CN1987849A; CN100419753C

Abstract

PROBLEM TO BE SOLVED: To provide a document search method and device that can increase search precision by effectively using classified digital information.

SOLUTION: A target document is searched for in a digital data set having a hierarchical classification structure, according to hierarchical class information, by: (a) extracting a keyword string including at least one keyword from a search request sentence; (b) acquiring class information about the current hierarchy from the digital data set; (c) calculating the dissimilarity (degree of difference) of each keyword with respect to each class belonging to the current hierarchy; (d) calculating, from the dissimilarity, the probability that the target document is present in each class belonging to the current hierarchy; (e1) designating the next hierarchy as the current hierarchy and repeating steps (c) to (e1) until the number of processed hierarchies reaches a predetermined number; (e2) combining the dissimilarity calculated for each class of each hierarchy with the probability that the target document is present in each class of each hierarchy, to compute a combined dissimilarity of each keyword across the predetermined number of hierarchies; and (f) searching for the target document according to the combined dissimilarity.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a method and apparatus for searching for a target document in a hierarchically classified digital data set, based on the classification information of each hierarchy level.

In recent years, more and more document files have appeared on the Internet, in digital libraries, in news services, and on company LANs (Local Area Networks). To manage such digitized data, document retrieval is becoming increasingly important. Digital information retrieval today is no longer the closed, fixed-form activity it once was, and it is becoming increasingly intelligent. Current digital information is also open, quickly updated, and usually distributed. Meanwhile, the users of digital information systems have expanded from specialized searchers to general users such as business people, administrators, and students. This places various special demands on digital information systems, and personalization and intelligence have become new requirements for digital information retrieval systems.

Current digital information has the following important feature: much of it has already been classified in advance. Examples include digital library classifications (ACM, IEEE, etc.) and Web classifications (Yahoo, Google, Sina, etc.). However, conventional digital information retrieval systems cannot use such classified digital information to improve search accuracy.

The present invention has been made in view of the above problem, and its object is to provide a document search method and apparatus that can make effective use of classified digital information and improve search accuracy.

According to a first aspect of the present invention, the document search method searches for a predetermined target document in a digital data set having a hierarchical classification structure, formed by hierarchically classifying a set of digital data of a plurality of documents, based on the classification information of a predetermined number of hierarchy levels. The method comprises: (a) extracting a keyword string including at least one keyword from a search request sentence input by a user; (b) obtaining classification information of the current level from the digital data set; (c) calculating, for each class belonging to the current level, the dissimilarity of each keyword in the keyword string; (d) calculating, based on the dissimilarity, the probability that the target document exists in each class belonging to the current level; (e1) if the number of processed levels is smaller than the predetermined number, setting the next level as the current level and executing steps (c), (d), and (e1) again; (e2) if the number of processed levels is equal to or greater than the predetermined number, combining the dissimilarity calculated for each class of each level with the probability that the target document exists in each class of each level, to obtain a combined dissimilarity of each keyword over the predetermined number of levels; and (f) searching for the target document according to the combined dissimilarity.

According to a second aspect of the present invention, the document search apparatus searches for a predetermined target document in a digital data set having a hierarchical classification structure, formed by hierarchically classifying a set of digital data of a plurality of documents, based on the classification information of a predetermined number of hierarchy levels. The apparatus includes: word extraction means for extracting a keyword string including at least one keyword from a search request sentence input by a user; classification selection and subdivision means for obtaining classification information of the current level from the digital data set; dissimilarity calculation means for calculating, for each class belonging to the current level, the dissimilarity of each keyword in the keyword string; target document estimation means for calculating, based on the dissimilarity, the probability that the target document exists in each class belonging to the current level; dissimilarity combining means for combining the dissimilarity for each class with the probability that the target document exists in each class of each level, to obtain a combined dissimilarity of each keyword; and a search engine for searching for the target document according to the combined dissimilarity. If the number of processed levels is smaller than the predetermined number, the dissimilarity calculation means and the target document estimation means calculate the dissimilarity and the probability for each level in turn. If the level number of the current level is equal to or greater than the predetermined number, the dissimilarity combining means obtains the combined dissimilarity and the search engine performs the search.

The document search method and apparatus of the present invention use auxiliary information in the digital data set, such as its classification, to estimate accurate keyword weights and thereby effectively improve search accuracy.

Next, embodiments of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a document search apparatus 100 according to an embodiment of the present invention.

The document search apparatus 100 shown in FIG. 1 searches for a target document in, for example, a digital data set 110, which is a set of digital data of a plurality of documents. The digital data set 110 is divided into a plurality of classes, each class is further subdivided into a plurality of sub-classes, and each sub-class is in turn subdivided into further classes. The digital data set 110 is thus classified hierarchically and has, for example, a hierarchical classification structure with N levels (the first to the N-th level). At each level, classification information is given for the classes belonging to that level.

The document search apparatus 100 searches for the target document according to, for example, the classification information of the first to M-th levels of the digital data set 110. Here, M is the number of levels to be searched, set by the user as needed. That is, although the digital data set 110 has N levels, the user may search only M of them. M and N are integers with N ≥ M ≥ 1.

The document search apparatus 100 includes word extraction means (TE) 101, a keyword selection module (TSM) 102, a classification selection and subdivision module (CSM) 103, a dissimilarity calculator (DPC) 104, target document estimation means (PRE) 105, a dissimilarity combining module (DIM) 106, classification determination means (CL) 107, a weight merging module (TWC) 108, and a search engine 109.

The word extraction means 101 extracts a keyword string including at least one keyword from the search request sentence input by the user and sets it as the current keyword string.

The keyword selection module 102 removes noise keywords from the current keyword string, based on the keyword string of the current level, the corresponding dissimilarities, and the appearance frequencies of the keywords, and uses the result as the keyword string for the next level (that is, the level one rank below the current level).

The classification selection and subdivision module 103 obtains the classification information of the current level from the digital data set 110.

The dissimilarity calculator 104 calculates, for each class belonging to the current level, the dissimilarity of each keyword in the current keyword string.

The target document estimation means 105 calculates, based on the keyword dissimilarities, the probability that the target document exists in each class belonging to the current level.

The dissimilarity combining module 106 combines the dissimilarity of each keyword obtained by the dissimilarity calculator 104 with the probability, obtained by the target document estimation means 105, that the target document exists in each class of each level, and obtains, for each keyword in the current keyword string, a combined dissimilarity with respect to the target document over the first to M-th levels.

The classification determination means 107 removes noise classes based on the probability that the target document exists in each class of each level, and uses the result as the classification information for the next level.

The weight merging module 108 combines the global dissimilarity.

The search engine 109 searches for the target document based on the combined dissimilarity.

When the document search apparatus 100 searches for the target document according to the classification information of the first to M-th levels, the dissimilarity calculator 104 and the target document estimation means 105 calculate the dissimilarity and the probability for each level in turn while the level number of the current level is smaller than M. Once the level number of the current level is M or greater, the dissimilarity combining module 106 obtains the combined dissimilarity and the search engine 109 performs the search.

With the above configuration, the document search apparatus 100 uses auxiliary information in the digital data set 110, such as its classification, and adopts a classification-based keyword weight calculation method, so that it can estimate accurate keyword weights and effectively improve search accuracy.

The configuration shown in FIG. 1 is a specific example for explaining the present invention and does not limit it. For example, the keyword selection module 102 improves accuracy and shortens response time by removing keyword noise, but it may be omitted; that is, the dissimilarity calculator 104 may receive the keyword string directly from the word extraction means 101. Likewise, the classification determination means 107 and the weight merging module 108 improve search accuracy and shorten response time by removing class noise, and improve search accuracy and the generality of the system by additionally using a global keyword weight calculation method when calculating keyword weights, but they too may be omitted. That is, the target document estimation means 105 need not feed noise-removal information back to the classification selection and subdivision module 103 via the classification determination means 107, and the search engine 109 may receive the combined dissimilarity of the keyword string directly from the dissimilarity combining module 106. In this case, a keyword is, for example, a single character or a single phrase.

The document search apparatus 100 described above can also be applied to an unclassified digital data set (that is, one in which the number of levels and the number of classes are both 1). In addition, to improve the generality of the system, the dissimilarity combining module 106 combines all the dissimilarities calculated by the dissimilarity calculator 104. Here, the global keywords are preferably calculated by a statistical method.

The discriminating power of a keyword with respect to a class is obtained according to the following criteria.
(1) It is estimated based on the discriminating power of the keyword with respect to the class.
(2) It is estimated based on the differences in the expressive power of the keyword across different classes.
(3) It is determined taking into account the appearance frequency of the keyword in the class and the attributes of the class itself.

FIG. 2 is a flowchart showing a document search method according to an embodiment of the present invention.

For example, the target document is searched for based on the classification information of the first to M-th levels of the digital data set 110 described above.

As shown in FIG. 2, in step S201, a keyword string including at least one keyword is extracted from the search request sentence input by the user and is set as the current keyword string.

In step S202, the classification information of the current level is obtained from the digital data set 110.

In step S203, the dissimilarity of each keyword in the current keyword string is calculated for each class belonging to the current level.

In step S204, the probability that the target document exists in each class belonging to the current level is calculated based on the dissimilarity of each keyword.

In step S205, if the level number of the current level is smaller than M, the process proceeds to step S206. If it is M or greater, the process proceeds to step S209.

In step S206, the next level is set as the current level.

In step S207, noise is removed from the current keyword string based on the keyword string of the current level, the corresponding dissimilarities, and the appearance frequencies of the keywords, and the result is used as the keyword string for the next level.

In step S208, noise is removed based on the probability that the target document exists in each class belonging to the current level, and the result is used as the classification information for the next level.

In step S209, the dissimilarity of each keyword obtained in step S203 is combined with the probability, obtained in step S204, that the target document exists in each class of each level, and a combined dissimilarity with respect to the target document over the first to M-th levels is obtained for each keyword in the current keyword string.

In step S210, the global dissimilarity is combined.

In step S211, the target document is searched for according to the combined dissimilarity.

In the above document search method, auxiliary information in the digital data set 110, such as its classification, is used to estimate accurate keyword weights, so that search accuracy can be effectively improved.
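The control flow of steps S203 to S208 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `run_levels` and the callables `calc_dp`, `calc_pc`, and `subdivide` are hypothetical placeholders for the dissimilarity calculation, the probability estimation, and the noise removal plus subdivision. The optional keyword filtering of step S207 is left out, and the classes of a level are handled as one flat list.

```python
from typing import Callable, Dict, List, Tuple

# One processed level: (classes, DP per keyword, PC per class).
Level = Tuple[List[str], Dict[str, float], Dict[str, float]]


def run_levels(
    keywords: List[str],
    classes: List[str],
    m: int,
    calc_dp: Callable[[List[str], List[str]], Dict[str, float]],
    calc_pc: Callable[[List[str], Dict[str, float], List[str]], Dict[str, float]],
    subdivide: Callable[[List[str], Dict[str, float]], List[str]],
) -> List[Level]:
    """Process levels 1..m and return (classes, DP, PC) for each processed level."""
    history: List[Level] = []
    level = 1
    while True:
        dp = calc_dp(keywords, classes)        # S203: dissimilarity per keyword
        pc = calc_pc(keywords, dp, classes)    # S204: probability per class
        history.append((classes, dp, pc))
        if level >= m:                         # S205: M levels processed
            break
        level += 1                             # S206: move to the next level
        classes = subdivide(classes, pc)       # S208: drop noise, go one level down
        if not classes:                        # nothing left to subdivide
            break
    return history
```

The returned history is the input that the combining and search steps (S209 to S211) would consume.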

The procedure shown in FIG. 2 is a specific example for explaining the present invention and does not limit it. For example, step S207 improves accuracy and shortens response time by removing keyword noise during the search process, but it may be omitted; that is, the process may proceed directly from step S206 to step S208. Likewise, steps S208 and S210 improve search accuracy and shorten response time by removing class noise, and improve search accuracy and the generality of the system by additionally using a global keyword weight calculation method when calculating keyword weights, but they too may be omitted; that is, the process may return directly from step S206 to step S203, and may proceed directly from step S209 to step S211. In this case, a keyword is, for example, a single character or a single phrase.

The document search method described above can also be applied to an unclassified digital data set. In addition, to improve the generality of the system, all the dissimilarities calculated in step S203 are combined. Here, the global keywords are preferably calculated by a statistical method.

FIG. 3 is a flowchart showing the operation of searching for a target document according to the embodiment of the present invention.

First, the user inputs a search request sentence, which expresses the user's search intent. As the search request sentence, the user may enter, for example, a single sentence, a paragraph, or even a whole passage.

The word extraction means 101 first performs keyword extraction on the user's search request sentence and obtains a keyword string T consisting of the keywords t_1, t_2, ..., t_m.

T = (t_1, t_2, ..., t_m)

Here, the data set to be searched is assumed to be the digital data set 110 described above. That is, the digital data set 110 has a hierarchical classification structure.

First, the classification selection and subdivision module (CSM) 103 selects the class sequence C of the first level. The class sequence C consists of the sub-classes c_1, c_2, ..., c_n.

C = (c_1, c_2, ..., c_n)

Each keyword in the keyword string T has a different discriminating power for different documents. Here, the discriminating power of a word with respect to a document is called the "weight" of the word. Estimating keyword weights is important for a search system. The present invention realizes a system that estimates keyword weights in a hierarchical classification structure. In this system, the classified data is subdivided repeatedly, so that the estimate gradually approaches the final weight of each keyword.

Next, the keyword string T = (t_1, t_2, ..., t_m) and the class sequence C = (c_1, c_2, ..., c_n) are input to the dissimilarity calculator (DPC) 104, which calculates the discriminating power of each keyword in T with respect to C and obtains the dissimilarity DP.

DP = (dp_1, dp_2, ..., dp_m).
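The text at this point does not give a formula for DP. As a rough illustration only, one stand-in that uses the keyword's appearance frequency in each class and the size of each class is the spread of the keyword's per-class relative frequency. Everything in the sketch below is an assumption, including the function name `dissimilarity`, the input dictionaries, and the choice of the population standard deviation as the spread measure.

```python
from statistics import pstdev
from typing import Dict, List


def dissimilarity(
    keywords: List[str],
    class_term_counts: Dict[str, Dict[str, int]],
    class_sizes: Dict[str, int],
) -> Dict[str, float]:
    """Return one dp value per keyword (assumed measure, not the patent's).

    class_term_counts[c][t] is the number of documents in class c that
    contain keyword t; class_sizes[c] is the number of documents in c (> 0).
    """
    dp: Dict[str, float] = {}
    for t in keywords:
        # Relative frequency of keyword t in every class of the current level.
        rates = [class_term_counts[c].get(t, 0) / class_sizes[c]
                 for c in class_term_counts]
        # A keyword spread evenly over all classes gets 0; one concentrated
        # in a few classes gets a larger value.
        dp[t] = pstdev(rates)
    return dp
```

With two classes of ten documents each, a keyword found in eight documents of one class and none of the other scores higher than a keyword found in five documents of both.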

Next, the dissimilarity DP is input to the keyword selection module (TSM) 102, which filters out keywords that act as noise. This yields a new keyword string T and the dissimilarity DP of each of its keywords with respect to the class sequence C.

Next, the new keyword string T and the dissimilarity DP are input to the target document estimation means (PRE) 105, which calculates the probability PC that the target document being searched for exists in each class of the class sequence C.

PC = (pc_1, pc_2, ..., pc_n)

In practice, even though the target document being searched for belongs to sub-class c_k, the keywords in the search request sentence input by the user often cause other sub-classes to be retrieved by mistake. Such classes are called noise classes.
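How PC is derived from DP is not spelled out at this point. One way to picture it, purely as an assumption, is to score each class by the dissimilarity-weighted relative frequency of the keywords in that class and then normalize the scores so that they sum to 1. The function name `class_probabilities` and its inputs are hypothetical.

```python
from typing import Dict


def class_probabilities(
    dp: Dict[str, float],
    class_term_counts: Dict[str, Dict[str, int]],
    class_sizes: Dict[str, int],
) -> Dict[str, float]:
    """Return an assumed estimate of pc for each class.

    dp maps each keyword to its dissimilarity; class_term_counts[c][t] is the
    number of documents in class c containing keyword t; class_sizes[c] > 0.
    """
    scores: Dict[str, float] = {}
    for c, counts in class_term_counts.items():
        # Keywords with high dissimilarity contribute most to the class score.
        scores[c] = sum(w * counts.get(t, 0) / class_sizes[c]
                        for t, w in dp.items())
    total = sum(scores.values())
    if total == 0:
        # No keyword discriminates: fall back to a uniform distribution.
        return {c: 1.0 / len(scores) for c in scores}
    return {c: s / total for c, s in scores.items()}
```

A class whose probability stays low under such a score would be a candidate noise class.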

To suppress the influence of such noise classes, the classification determination means (CL) 107 is used to remove them. This yields the following new class sequence C.

C = (c_1, c_2, ..., c_q)

The classification selection and subdivision module (CSM) 103 further subdivides each sub-class c_k in the class sequence C.

C_k = (c_k1, c_k2, ..., c_kn).
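Noise-class removal followed by subdivision can be pictured with a small sketch. The threshold rule, the function name `next_level`, and the representation of the classification as a dictionary from a class to its sub-classes are all assumptions made for illustration; the text does not state how noise classes are identified beyond using the probability PC.

```python
from typing import Dict, List


def next_level(
    tree: Dict[str, List[str]],
    pc: Dict[str, float],
    threshold: float = 0.1,
) -> Dict[str, List[str]]:
    """Drop noise classes and return the sub-classes C_k of each survivor.

    tree maps a class c_k to its sub-classes (c_k1, c_k2, ...); pc maps each
    class of the current level to its probability. Classes whose probability
    is below the (assumed) threshold are treated as noise.
    """
    return {c: tree.get(c, []) for c, p in pc.items() if p >= threshold}
```

For example, with `tree = {"energy": ["solar", "wind"], "software": ["databases"]}` and `pc = {"energy": 0.9, "software": 0.05}`, only the sub-classes of "energy" are carried to the next level.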

In the same way as for the higher-level class sequence C described above, the keyword string T = (t_1, t_2, ..., t_m) and the class sequence C_k = (c_k1, c_k2, ..., c_kn) are input to the dissimilarity calculator (DPC) 104, which calculates the discriminating power of each keyword in T with respect to C_k and obtains the dissimilarity DP. Next, the target document estimation means (PRE) 105 calculates the probability PC that the target document the user is searching for exists in each class of C_k, and the classification determination means (CL) 107 selects the target classes.

If necessary, each class in C_k is further subdivided and the above operations are repeated at the next level. The number of levels is determined, for example, by the structure of the data set and the required accuracy.

In this way, the keyword string for the user's search is obtained, the dissimilarity DP of each keyword is obtained for the classes belonging to each level, and the probability PC that the target document exists in each class is calculated.

The results obtained above are input to the dissimilarity combining module (DIM) 106, which calculates the final dissimilarity of each keyword from them.
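The combining rule itself is not given here. One plausible reading, offered only as an assumption, is a probability-weighted sum: each DP vector is weighted by the probability that the target document lies under the class whose sub-classes produced it (1.0 for the first level). The function name `combine_dissimilarity` and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def combine_dissimilarity(
    levels: List[Tuple[float, Dict[str, float]]],
) -> Dict[str, float]:
    """Return an assumed combined dissimilarity per keyword.

    Each entry of levels is (weight, dp): dp holds the per-keyword
    dissimilarities computed for one class sequence, and weight is the
    probability that the target document lies under the parent class of
    that sequence (1.0 for the top level).
    """
    combined: Dict[str, float] = {}
    for weight, dp in levels:
        for t, value in dp.items():
            combined[t] = combined.get(t, 0.0) + weight * value
    return combined
```

Under this rule, a dissimilarity measured inside an unlikely class contributes little to the final value.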

This dissimilarity discriminates strongly between different classes, so using it makes it easy to identify the class to which the target document belongs. However, if a keyword is a term commonly used in the documents of the target class (that is, one with a high appearance frequency), it is difficult to pick out the desired target document from that class with that keyword. In such cases, a keyword weighting method based on another statistical technique is used in combination to calculate the final keyword weight.

For example, a TF*IDF weight calculator is used to calculate the keyword weight by the following formula.

Here, N is the total number of documents and n_t is the number of documents that contain keyword t. The above formula is a variant of the well-known Robertson/Sparck Jones formula.
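The formula itself does not survive in this text, so it is not reconstructed here. For orientation only, the sketch below shows the common textbook TF*IDF weight, tf × log(N / n_t), using the same N and n_t. It is not claimed to be the patent's variant of the Robertson/Sparck Jones formula; the function name `tf_idf` and the parameter `tf` (the term frequency of the keyword) are assumptions.

```python
import math


def tf_idf(tf: int, n_docs: int, n_t: int) -> float:
    """Textbook TF*IDF weight (an assumed stand-in, not the patent's formula).

    tf     -- number of occurrences of keyword t in the text being weighted
    n_docs -- N, the total number of documents
    n_t    -- number of documents containing keyword t (must be > 0)
    """
    return tf * math.log(n_docs / n_t)
```

A keyword that appears in every document (n_t equal to N) receives weight 0, which reflects the point made above that very common terms are poor at singling out one document.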

The weight obtained by the TF*IDF method as described above and the overall dissimilarity obtained earlier are input to the weight merging module (TWC) 108, which calculates the final keyword weight.
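How the two quantities are merged is not specified. A simple possibility, shown purely as an assumption, is a convex combination controlled by a mixing parameter. The function name `merge_weights` and the parameter `alpha` are hypothetical.

```python
from typing import Dict


def merge_weights(
    dp: Dict[str, float],
    tfidf: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend combined dissimilarity and TF*IDF weight per keyword (assumed rule).

    alpha = 1.0 uses only the dissimilarity; alpha = 0.0 uses only TF*IDF.
    Keywords missing from tfidf are treated as having TF*IDF weight 0.
    """
    return {t: alpha * dp[t] + (1.0 - alpha) * tfidf.get(t, 0.0) for t in dp}
```

In practice the two inputs would likely need to be brought to a comparable scale before blending; that step is omitted here.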

Next, each component of the search apparatus of this embodiment will be described.

First, the word extraction means (TE) 101 will be described.

The word extraction means (TE) 101 extracts a keyword string from the search request sentence input by the user.

Specifically, the word extraction means (TE) 101:
(1) segments the user's search request sentence into words;
(2) performs initial filtering by part of speech to remove words that will not be used, for example classifiers (measure words) and numerals;
(3) removes some noise words by referring to a stop word table, for example "use" and "effect";
(4) reduces noise by removing words that could produce erroneous search results. Specifically, a threshold ts is set, and words whose appearance frequency is below ts are removed.

A specific example is described below.

The user inputs the following search request sentence.

"The present invention discloses (mistyped as 「快事」) a sunlight admission device that can track automatically, always collects sunlight with the maximum area of the light collector, and always propagates the light at a constant angle to a constant position so that it enters the room and is diffused. The present invention provides (mistyped as 「供える」) good applicability."
The above text contains two typographical errors: 「快事」 (kaiji, "a pleasant event") written for 「開示」 (kaiji, "disclose"), and 「供える」 (sonaeru, "to offer") written for 「備える」 (sonaeru, "to be provided with"). The former is a rarely used word, and the latter is a relatively common one.

First, the following word string is obtained (particles and the two mistyped words are left in the original script): "this, invention, は, automatic, tracking, can, always, light collector, の, maximum, area, sunlight, を, collect, furthermore, light, constant, angle, で, position, に, propagate, indoor, ingress, diffuse, entry, room, device, 快事, do, good, applicability, 供える".

Next, only nouns, verbs, and adjectives are kept, which gives the following word string: "this, invention, automatic, tracking, can, light collector, maximum, area, sunlight, collect, constant, angle, position, propagate, indoor, ingress, diffuse, entry, room, device, 快事, do, good, applicability, 供える".

Next, words present in the stop word table, such as 「する」 ("do") and 「でき」 ("can"), are removed.

Next, because the occurrence frequency of 「快事」 is very low, this word is removed. The following word string is thus obtained: "this, invention, automatic, tracking, light collector, maximum, area, sunlight, collect, constant, angle, position, propagate, indoor, ingress, diffuse, entry, room, device, good, applicability, 供える".
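
The filtering steps (2) to (4) of the word extraction means can be sketched as follows. This is a minimal illustration, not the implementation described here: it assumes the request has already been segmented and part-of-speech tagged (step (1)), and the names `extract_keywords`, `KEPT_POS`, `stop_words`, and `corpus_freq` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Assumption: only these part-of-speech tags survive step (2),
# following the worked example (nouns, verbs, adjectives).
KEPT_POS: Set[str] = {"noun", "verb", "adjective"}


def extract_keywords(
    tagged_words: Iterable[Tuple[str, str]],
    stop_words: Set[str],
    corpus_freq: Dict[str, int],
    ts: int,
) -> List[str]:
    """Filter (word, part_of_speech) pairs into a keyword string."""
    keywords: List[str] = []
    for word, pos in tagged_words:
        if pos not in KEPT_POS:
            continue  # step (2): drop particles, numerals, measure words, ...
        if word in stop_words:
            continue  # step (3): drop stop-table entries
        if corpus_freq.get(word, 0) < ts:
            continue  # step (4): drop words rarer than the threshold ts
        keywords.append(word)
    return keywords
```

With a rare mistyped word such as 「快事」 given a very low corpus frequency, step (4) drops it, while a common mistyped word such as 「供える」 passes through, as in the example.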

Next, the classification selection subdivision module (CSM) 103 is described.

The classification selection subdivision module (CSM) 103 obtains classification information from the digital data set 110. The input to the CSM 103 may be null or a classification. The CSM 103 also decides whether an input classification is to be subdivided further. Consequently, a different digital data set requires a different CSM 103.

Specifically, when the input to the CSM 103 is null, the CSM 103 determines whether any classification exists in the digital data set 110.

If a classification exists, the CSM 103 reads the configuration information and determines whether dissimilarity information is required.

If dissimilarity information is required, the CSM 103 outputs the classification information of the first layer (the top-ranked layer).

If it determines that no classification exists in the digital data set 110, the CSM 103 notifies the search system that the dissimilarity calculation is complete.

On the other hand, when the input to the CSM 103 is not null but classification information, the CSM 103 determines whether the classification has sub-classifications.

If sub-classifications exist, the CSM 103 reads the configuration information and determines whether to subdivide further.

If it subdivides further, the CSM 103 outputs the classification information of the next layer.

For example, suppose that the digital data set 110 is a patent database organized according to the IPC, and that the search system calculates the dissimilarity down to the third layer.

For example, when the input to the CSM 103 is N/A, the CSM 103 outputs the first-layer IPC classifications, namely:
Section A: Human necessities
Section B: Performing operations; transporting
Section C: Chemistry; metallurgy
Section D: Textiles; paper
Section E: Fixed constructions
Section F: Mechanical engineering; lighting; heating; weapons; blasting
Section G: Physics
Section H: Electricity
1) If the input to the CSM 103 is "Section H: Electricity", the output is as follows.
H01 Basic electric elements
H02 Generation, conversion, or distribution of electric power
H03 Basic electronic circuitry
H04 Electric communication technique
H05 Electric techniques not otherwise provided for
2) If the input to the CSM 103 is "H01C Resistors", this classification belongs to the third layer, so the corresponding output is N/A, which indicates that no further classification exists. When such an output is received, the dissimilarity calculation ends.
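
The behaviour of the CSM on this IPC example can be sketched as below. This is an illustrative sketch under stated assumptions: the tree is held as a hypothetical `CHILDREN` mapping (with `None` standing for the null input), `None` as a return value stands for the N/A output, and the function name `select_subdivision` and the tree fragment itself are invented for the example.

```python
from typing import Dict, List, Optional

# Hypothetical fragment of the IPC tree: parent code -> child codes.
# The key None stands for the null (N/A) input.
CHILDREN: Dict[Optional[str], List[str]] = {
    None: ["A", "B", "C", "D", "E", "F", "G", "H"],
    "H": ["H01", "H02", "H03", "H04", "H05"],
    "H01": ["H01B", "H01C"],
    "H01C": ["H01C1/00"],
}


def select_subdivision(
    current: Optional[str],
    children: Dict[Optional[str], List[str]],
    max_layers: int,
) -> Optional[List[str]]:
    """Return the next layer's classifications, or None (N/A) to stop."""
    parents = {kid: parent for parent, kids in children.items() for kid in kids}
    # Count how many layers lie between the root and the current classification.
    layer = 0
    node = current
    while node is not None:
        node = parents[node]
        layer += 1
    kids = children.get(current, [])
    if not kids or layer >= max_layers:
        return None  # no classification exists, or the layer limit is reached
    return kids
```

With `max_layers=3`, a null input yields the eight sections, "H" yields H01 to H05, and "H01C" yields `None`, matching cases 1) and 2) above.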

Next, the dissimilarity calculator (DPC) 104 is described.

The input to the DPC 104 is a keyword string and a classification string, and its output is the dissimilarity values.

Every document has its own characteristics, and most of them are expressed by words, the most basic units of semantic expression. In other words, each word conveys a characteristic of the document. Whether a word appears in the document, how often it appears, where it appears, and so on: all such information can be used to represent the attributes of the document.

The keywords t in the keyword string T differ in their ability to represent a classification. This representational ability is denoted Pt. For every t_i in the keyword string T and every c_j in the classification string C (that is, t_i ∈ T, c_j ∈ C), the following matrix Pt is obtained.

Here,

Here, n_ti is the number of times the keyword t appears in the classification c_i.
l_j is the length of the classification c_j (that is, the number of words in the classification c_j).

Also,

l_ave is the average length of the classifications.

f_ij is the occurrence frequency of a word in a classification, that is, the number of times a given word appears in a given classification.

If a word is concentrated in a single classification, the value of f_ij is relatively high. However, the relationship between the occurrence frequency and the length of the classification must also be taken into account. The above calculation considers both the length of the classification and the average classification length.

If a word has the same occurrence frequency in every classification, then the shorter a classification is, the more strongly the word represents that classification. The parameters k and b are factors that balance the occurrence frequency against the classification length.

In this way, a PT representing the keyword's power to represent the classifications is obtained for each keyword, as follows.

Here,

PT makes it possible to estimate the dissimilarity of a word with respect to the classifications belonging to the current layer. The dissimilarity is calculated using the following formula.

Here, n is the number of classifications.

The above formula assumes that the keywords are uniformly distributed across the classifications and that the target documents are also uniformly distributed across the classifications.

The value of dp reflects how differently a word is distributed across the classifications. If a word appears in only one or a few classifications, dp is relatively high (if the word appears in only one classification, dp is 1); that is, the word tends to steer the search as a whole toward those classifications. Conversely, if a word appears in all classifications, dp is relatively small (if the word has the same distribution in every classification, dp takes its minimum value, 0).

FIG. 4 is a flowchart showing the operation of the dissimilarity calculator (DPC) 104 according to the embodiment of the present invention.

For example, suppose that the input keyword string and classification string are T = (t_1, t_2, t_3) and C = (c_1, c_2, c_3).

According to the data set 110, the classifications c_1, c_2, and c_3 contain 2000, 3000, and 5000 words respectively, and the initialization parameters are set to k = 2 and b = 0.7.

Searching the data set 110 gives the following results:
t_1 appears 100 times in c_1, 300 times in c_2, and 200 times in c_3;
t_2 appears 70 times in c_1, 500 times in c_2, and 1000 times in c_3;
t_3 appears 200 times in c_1, 300 times in c_2, and 500 times in c_3.

The following matrix Pt is therefore obtained.

Accordingly, the following dissimilarity sequence is obtained.

DP = (0.430, 0.518, 0.0528).

Next, the keyword selection module (TSM) 102 is described.

The TSM 102 mainly removes noise from the keyword string. Its input is the keyword string, the corresponding dissimilarities, and the keyword frequencies. Its output is the selected keyword string and the corresponding dissimilarities.

The TSM 102 removes every word whose dissimilarity falls below a threshold. The threshold is Max(dp) * parameter, where parameter is a value set in advance.

Through this filtering step, the TSM 102 outputs a new keyword string and the corresponding dissimilarities.
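
The threshold rule of the TSM can be sketched as follows. This is a minimal illustration: the function name `select_keywords` is hypothetical, the value 0.5 used for `parameter` in the usage note is an assumption, and the keyword frequencies that the TSM also receives are omitted because the rule stated here does not use them.

```python
from typing import List, Sequence, Tuple


def select_keywords(
    keywords: Sequence[str],
    dp: Sequence[float],
    parameter: float,
) -> Tuple[List[str], List[float]]:
    """Keep only keywords whose dissimilarity reaches Max(dp) * parameter."""
    if len(keywords) != len(dp):
        raise ValueError("keywords and dp must have the same length")
    if not keywords:
        return [], []
    threshold = max(dp) * parameter
    kept = [(t, d) for t, d in zip(keywords, dp) if d >= threshold]
    return [t for t, _ in kept], [d for _, d in kept]
```

Applied to the sequence DP = (0.430, 0.518, 0.0528) from the DPC example with an assumed `parameter` of 0.5, the threshold is 0.259, so t_1 and t_2 are kept and t_3 is removed.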

Next, the target document estimation means (PRE) 105 is described.

The PRE 105 calculates the probability that the target document exists in each classification. Its input is the keyword string, the corresponding dissimilarities, and the classification information, and its output is the probability that the target document exists in each classification.

FIG. 5 is a flowchart showing the operation of the target document estimation means (PRE) 105 according to the embodiment of the present invention.

From the input keyword string and classification information, the matrix Pt is obtained in the same way as in the dissimilarity calculator (DPC) 104.

Each classification corresponds to one row of the matrix Pt. The following is further defined.

Here,

The target document estimation means 105 calculates, based on the dissimilarity, the probability that the target document belongs to each classification. It is defined as follows. For c_j ∈ C,

Here, PC_j^+ is the inversion of PC_j.

Also, DP = (dp_1, dp_2, ..., dp_m) is the input dissimilarity sequence.

For example, the following matrix is obtained from the input.

The following PC_j is therefore obtained.

Next, the classification determination means (CL) 107 is described.

Even though the search request input by the user belongs to the classification c_k, some of the extracted keywords may lean more toward a classification c_q. A classification such as c_q is called a noise classification.

These noise classifications are removed to raise the final precision of the search. Specifically, classifications whose pc value is smaller than tc_k, defined below, are removed.

Here, Max(PQ_k') is the maximum value, and pq_ik is the probability that the target document appears in the classification c_k. Note that pq_ik is calculated at the layer above. If the current layer is the first layer, however, pq = 1/n, where n is the number of classifications in the current layer.

The input to the CL 107 is the PQ sequence and the classification sequence, and the output is the selected classifications C = (c_1, c_2, ..., c_q).

Next, the dissimilarity synthesis module (DIM) 106 is described.

The DIM 106 combines the dissimilarities of the different layers.

The input to the DIM 106 is the dissimilarities and the probabilities of the target document in each classification, and the output is the combined dissimilarity tw_i.

Here, k1, k2, and k3 are parameters.

tw_i is the weight of the i-th keyword.

dp_i is the dissimilarity of the i-th keyword over the first-layer classifications.

dp_ti is the dissimilarity of the i-th keyword over the sub-classifications of Ct.

dq_t is the probability that the target document belongs to the classification Ct.

dp_tut is the dissimilarity of the i-th keyword over the sub-classifications of Ct.

dq_tu is the probability that the target document belongs to a sub-classification of Ct.

Next, the weight merging module 108 is described.

As noted above, the dissimilarity makes it easy to identify the classification of the target document. However, when a keyword is used frequently in the target documents, the search cannot be narrowed further with that keyword. In this case, the method of the present invention needs to be used together with another global weight calculation method.

The input to the weight merging module 108 is the keyword string, the corresponding dissimilarities, and the weights calculated by a global weight calculation method. The output is the keyword weights, which are calculated using the following formula.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these embodiments, and any modification that does not depart from the spirit of the present invention falls within its scope.

FIG. 1 is a block diagram showing the configuration of a document search apparatus 100 according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a document search method according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of searching for a target document according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the dissimilarity calculator (DPC) 104 according to the embodiment of the present invention.
FIG. 5 is a flowchart showing the operation of the target document estimation means (PRE) 105 according to the embodiment of the present invention.

Explanation of symbols

100 Document search apparatus
101 Word extraction means (TE)
102 Keyword selection module (TSM)
103 Classification selection subdivision module (CSM)
104 Dissimilarity calculator (DPC)
105 Target document estimation means (PRE)
106 Dissimilarity synthesis module (DIM)
107 Classification determination means (CL)
108 Weight merging module (TWC)
109 Search engine
110 Digital data set

Claims

A document search method for retrieving a predetermined target document, based on classification information of a predetermined number of layers, from a digital data set having a hierarchical classification structure obtained by hierarchically classifying a set of digital data of a plurality of documents, the method comprising:
(a) extracting a keyword string including at least one keyword from a search request sentence input by a user;
(b) obtaining classification information of the current layer from the digital data set;
(c) calculating the dissimilarity of each keyword in the keyword string with respect to each classification belonging to the current layer;
(d) calculating, based on the dissimilarity, the probability that the target document exists in each classification belonging to the current layer;
(e1) if the number of processed layers is smaller than the predetermined number, setting the next layer as the current layer and executing steps (c), (d), and (e1) again;
(e2) if the number of processed layers is equal to or greater than the predetermined number, combining the calculated dissimilarity with respect to each classification belonging to each layer and the probability that the target document exists in each classification belonging to each layer to obtain a combined dissimilarity of each keyword; and
(f) searching for the target document according to the combined dissimilarity.

The document search method according to claim 1, wherein the minimum number of layers in the digital data set is 1, and the minimum number of classifications in each layer is 1.

The document search method according to claim 1 or 2, wherein in step (e2), all the dissimilarities calculated in step (c) are combined.

The document search method according to any one of claims 1 to 3, further comprising combining global dissimilarities in the step (e2).

The document search method according to any one of claims 1 to 4, wherein in step (c), the dissimilarity is calculated based on the distinctiveness of the keyword for each classification.

The document search method according to claim 5, wherein the distinctiveness for each classification of the keyword is calculated by the expressive power for each classification of the keyword.

The document search method according to claim 6, wherein the power of expression for each classification of the keyword is calculated based on an appearance frequency in each classification of the keyword and an attribute of each classification.

The document search method according to claim 1, wherein, in step (e1), before the next layer is set as the current layer, (g) noise keywords are removed from the keyword string of the current layer, based on the keyword string of the current layer, the corresponding dissimilarities, and the occurrence frequencies, to obtain the keyword string of the next layer.

The document search method according to claim 4, wherein the global dissimilarity is calculated by a statistical method.

The document search method according to claim 1, wherein, in step (e1), before the next layer is set as the current layer, (h) noise classifications are removed, based on the probability that the target document exists in each classification belonging to the current layer, to obtain the classification information of the next layer.

The document search method according to any one of claims 1 to 10, wherein the keyword includes a word or a phrase.

A document search apparatus for retrieving a predetermined target document, based on classification information of a predetermined number of layers, from a digital data set having a hierarchical classification structure obtained by hierarchically classifying a set of digital data of a plurality of documents, the apparatus comprising:
word extraction means for extracting a keyword string including at least one keyword from a search request sentence input by a user;
classification selection subdivision means for obtaining classification information of the current layer from the digital data set;
dissimilarity calculation means for calculating the dissimilarity of each keyword in the keyword string with respect to each classification belonging to the current layer;
target document estimation means for calculating, based on the dissimilarity, the probability that the target document exists in each classification belonging to the current layer;
a dissimilarity synthesis unit for combining the dissimilarity with respect to each classification and the probability that the target document exists in each classification belonging to each layer to obtain a combined dissimilarity of each keyword; and
a search engine for searching for the target document according to the combined dissimilarity,
wherein, when the number of processed layers is smaller than the predetermined number, the dissimilarity calculation means and the target document estimation means calculate the dissimilarity and the probability for each layer in turn, and
when the number of processed layers is equal to or greater than the predetermined number, the dissimilarity synthesis unit obtains the combined dissimilarity and the search engine performs the search.

The document search apparatus according to claim 12, wherein the minimum number of layers in the digital data set is 1, and the minimum number of classifications in each layer is 1.

The document search apparatus according to claim 12 or 13, wherein the dissimilarity synthesis unit synthesizes all the dissimilarities calculated by the dissimilarity calculation unit.

The document search device according to any one of claims 12 to 14, wherein the dissimilarity synthesizing unit further synthesizes a global dissimilarity.

The document search device according to any one of claims 12 to 15, wherein the dissimilarity calculation means calculates the dissimilarity based on a distinctiveness for each classification of the keyword.

The document search apparatus according to claim 16, wherein the distinctiveness for each classification of the keyword is calculated by the expressive power for each classification of the keyword.

The document search apparatus according to claim 17, wherein the expression power for each classification of the keyword is calculated based on an appearance frequency in each classification of the keyword and an attribute of each classification.

The document search apparatus according to claim 12, further comprising keyword selection means for removing noise keywords from the keyword string of the current layer, based on the keyword string of the current layer, the corresponding dissimilarities, and the occurrence frequencies, to obtain the keyword string of the next layer.

The document search apparatus according to claim 15, wherein the global dissimilarity is calculated by a statistical method.

The document search apparatus according to claim 12, further comprising classification determination means for removing noise classifications, based on the probability that the target document exists in each classification belonging to the current layer, to obtain the classification information of the next layer.

The document search device according to any one of claims 12 to 21, wherein the keyword includes a word or a phrase.