JP5538830B2 - Document analysis system and dictionary creation system

Info

Publication number: JP5538830B2
Application number: JP2009259042A
Authority: JP
Inventors: 義行小林
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-11-12
Filing date: 2009-11-12
Publication date: 2014-07-02
Anticipated expiration: 2029-11-12
Also published as: JP2011107767A

Description

The present invention relates to a document analysis system and a dictionary creation system that extract quantitative expressions, such as numerical expressions, from electronic documents and automatically build a dictionary associating linguistically expressed quantities with numerical expressions.

As large volumes of digitized documents have accumulated and become freely available, the need to automatically collect and analyze useful information in documents has grown. To meet this need, natural language processing research has studied methods for analyzing document content. In such analysis, quantitative expressions, typified by the numerical expressions often used to describe the attributes of things, are important. For example, a numerically limited patent restricts the invention by quantities stated as numbers. To search for such patents properly, it is desirable to be able to narrow results appropriately by quantitative expression.

Quantitative expressions in documents are either numerical or verbal. A numerical expression uses a string of digits with a unit or counter word, for example “50 years old” or “five thousand yen”. A verbal expression uses a word that denotes a quantity, for example “middle-aged” or “low-priced”. When handling quantitative expressions in documents, it is desirable to associate numerical expressions with verbal ones, for example associating the verbal expression “middle-aged” with the numerical expression “40 to 50 years old”. With such an association, a search for documents that describe an “age” of “40 to 50 years old” can also retrieve documents containing “middle-aged”, which has the same meaning.

One conceivable way to associate words with numerical values is to define the correspondence manually. Patent Document 1 uses membership functions from fuzzy theory to associate the words used in quantitative expressions with numerical values.

Patent Document 1: Japanese Patent No. 2764766

Non-Patent Document 1: “Extracting attribute relationships from text,” Information Processing Society of Japan, Special Interest Group on Natural Language Processing, NL-164-4, pp. 19-24 (2004)

However, associating words with numerical values manually requires knowing the correspondence beforehand, and it is difficult to know such correspondences in advance for every quantitative expression.

Accordingly, an object of the present invention is to provide a document analysis system and a dictionary creation system for automatically building a dictionary that associates words with numerical values for quantitative expressions.

In the present invention, quantitative expressions presumed to be attribute values of the same thing are collected from documents, and the range and frequency of the values indicated by the numerical expressions are evaluated to determine automatically which numerical values a verbal expression corresponds to. In this way, a dictionary that associates words with numerical values for quantitative expressions is built automatically.

That is, the document analysis system of the present invention comprises: an input unit that inputs a document file; a document part extraction unit that extracts a document part from the document file; a thing attribute value set candidate extraction unit that extracts, from the document part, a thing name and attribute value candidates for the thing that the name refers to; a numerical expression processing unit that converts an attribute value candidate into a numerical value when the candidate is a numerical expression; an attribute value candidate aggregation unit that creates a set of attribute value candidates for the same thing name and aggregates the created set; and a correspondence evaluation unit that selects a verbal attribute value from the aggregated set and evaluates the relationship between the selected verbal attribute value and the attribute values given as numerical expressions.

The dictionary creation system of the present invention further comprises a dictionary output unit that, based on the evaluation results of the document analysis system, outputs verbal attribute values and their corresponding numerical expressions as a dictionary.

By collecting from documents the quantitative expressions presumed to be attribute values of the same thing, and by evaluating with a score or similar measure how appropriate the values indicated by the numerical expressions are, the numerical values that a verbal expression corresponds to can be determined automatically. This makes it possible to build automatically a dictionary that associates words with numerical values for quantitative expressions.

FIG. 1 is a diagram showing a configuration example of the numerical-equivalent expression dictionary creation system according to the present invention.
FIG. 2 is a diagram showing the configuration of a computer that implements the system of the present invention.
FIG. 3 is an explanatory diagram of the processing by the thing attribute value set candidate extraction unit.
FIG. 4 is an explanatory diagram of the processing by the attribute value candidate aggregation unit.
FIG. 5 is an explanatory diagram of the processing by the correspondence evaluation unit.
FIG. 6 is an explanatory diagram of the processing by the numerical expression processing unit.
FIG. 7 is an explanatory diagram of the processing by the conversion processing unit.
FIG. 8 is an explanatory diagram of the processing by the numerical range processing unit.
FIG. 9 is an explanatory diagram of the processing by the numerical range comparison processing unit.
FIG. 10 is a diagram explaining the flow of data.
FIG. 11 is a diagram showing another configuration example of the numerical-equivalent expression dictionary creation system according to the present invention.
FIG. 12 is an explanatory diagram of the processing by the correspondence evaluation unit when a unit is specified.

Embodiments of the present invention are described below with reference to the drawings, in terms of an example system configuration and the processing performed on that system.

As shown in FIG. 1, the dictionary creation system of the present invention includes a document file input unit 101, a document part extraction unit 102, a thing attribute value set candidate extraction unit 103, an attribute value candidate aggregation unit 104, a correspondence evaluation unit 105, a numerical expression processing unit 106, a conversion processing unit 107, a numerical range processing unit 108, a numerical range comparison unit 109, and a dictionary output unit 110.

As shown in FIG. 2, the system runs on an apparatus that has: a central processing device 201, which contains a central processing unit and processes information as a stored-program machine; a main storage device 202 made up of random access memory; an external storage device 203 that stores the documents to be processed 2031 and the resulting dictionary 2032; an input device 204, such as a CD-ROM reader 2041, a DVD reader 2042, or a keyboard 2043, used to input documents and the like; and an output device 205, such as a CD-ROM writer 2051, a DVD writer 2052, or a display 2053, used to output processing results such as the created dictionary. The system can also exchange information with an external information processing apparatus 207 over a network 206.

The document file input unit 101 accepts document files supplied from outside the system on storage media such as DVD or CD-ROM via the input device 204, or document files sent from the external information processing apparatus 207 over the network 206. Accepted document files are saved to the external storage device 203. A document file may be an electronic document created with a word processor or similar tool, or a document digitized with an optical character reader (OCR) or similar device. In either case, the document is assumed to be character-encoded; the character encoding is not restricted. Each document file is given a unique symbol for identification, referred to below as the document identifier.

The document part extraction unit 102 extracts the document part from an input document file. The document part is the portion regarded as the content of the document file; data written in natural language is what the present invention processes. Extracting the document part means removing formatting information and metadata. The extracted document part is held in the main storage device.

The thing attribute value set candidate extraction unit 103 exhaustively extracts, from the extracted document part, candidate pairs of a thing name and an attribute value that describes the thing. This function can be realized by modifying the technique disclosed in Non-Patent Document 1. Here, a “thing” means an object or an event. Objects are not limited to physical objects and include abstract ones.

Non-Patent Document 1 extracts triples of target, attribute name, and attribute value using sentence patterns and co-occurrence scores. For example, given the sentence “The elevation of Mt. Fuji is 3776 meters” (富士山の標高は３７７６メートル), the sentence pattern “<attribute name> of <target> is <attribute value>” (＜対象＞の＜属性名＞は＜属性値＞) yields the target “Mt. Fuji”, the attribute name “elevation”, and the attribute value “3776 meters”. Co-occurrence frequencies collected from a large document database are then used to evaluate the likelihood that the target “Mt. Fuji” has the attribute name “elevation”, and the likelihood that the attribute name “elevation” takes the attribute value “3776 meters”. The function of the thing attribute value set candidate extraction unit 103 corresponds to the means of Non-Patent Document 1, which extracts candidate triples of thing name, attribute name, and attribute value with sentence patterns and selects appropriate ones with co-occurrence scores, modified so that the pattern-based step extracts a thing name and an attribute value instead of a triple.

As shown in FIG. 3, the processing of the thing attribute value set candidate extraction unit 103 is realized by a thing name extraction pattern dictionary 301, morphological analysis (S302), noun phrase identification (S303), thing name and attribute value extraction (S304), and output (S305). Morphological analysis (S302) is a widely known technique, so a detailed description is omitted. Noun phrase identification (S303) is likewise a known technique, familiar as chunking, so a detailed description is also omitted.

The thing name extraction pattern dictionary 301 registers sentence patterns of thing names and attribute values that can appear in documents. A sentence pattern is, for example, “attribute value V の thing name O” (that is, “thing name O of attribute value V”). Following Non-Patent Document 1, sentence patterns can be collected from a corpus annotated with dependency relations, but they may also be extracted from a corpus without dependency annotation.

The thing name and attribute value extraction process (S304) extracts thing names using the thing name extraction pattern dictionary 301. Only noun phrases identified by noun phrase identification (S303) may match the “attribute value expression V”, “thing name O”, and “arbitrary phrase” slots of a sentence pattern. The thing name is taken to be the noun at the end of the phrase that matched the sentence pattern. The pattern dictionary is applied recursively to the same sentence.

The output process (S305) formats each extracted thing name and attribute value as a “document identifier, thing name, attribute value” triple and outputs it.

For example, suppose the thing attribute value set candidate extraction unit 103 processes the sentence “Selling shoes for women aged 40 to 50” (４０から５０歳の女性向けの靴を販売) contained in the document with the document identifier “DOC_ID_1”. The thing name extraction pattern “attribute value expression V の thing name O” extracts the thing name “woman” (女性) and the attribute value expression “40 to 50 years old” (４０から５０歳), and the triple “DOC_ID_1, woman, 40 to 50 years old” is output.

Further, if the document with the document identifier “DOC_ID_1” contains the sentence “Middle-aged women were made the new sales target” (中年女性を新たな販売ターゲットとした), the thing name extraction pattern “attribute value expression V thing name O” extracts the thing name “woman” (女性) and the attribute value expression “middle-aged” (中年), and the triple “DOC_ID_1, woman, middle-aged” is output.
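The pattern matching in S304 and the triple output in S305 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes morphological analysis and noun phrase identification (S302, S303) have already produced a list of chunks, and the names `Chunk`, `PATTERNS`, and `extract_triples` are hypothetical. Recursive application of the dictionary and selection of the final noun in a phrase are left out.

```python
from typing import List, Tuple

# A chunk is (text, is_noun_phrase); chunking (S302-S303) is assumed done upstream.
Chunk = Tuple[str, bool]
# A pattern is a sequence of slots: "V" (attribute value), "O" (thing name),
# or any other string, which must equal a non-noun chunk literally.
Pattern = Tuple[str, ...]

# Hypothetical stand-in for the thing name extraction pattern dictionary 301.
PATTERNS: List[Pattern] = [("V", "の", "O"), ("V", "O")]


def extract_triples(doc_id: str, chunks: List[Chunk],
                    patterns: List[Pattern] = PATTERNS) -> List[Tuple[str, str, str]]:
    """Slide each pattern over the chunks and emit (doc id, thing name, attribute value)."""
    triples = []
    for pattern in patterns:
        for start in range(len(chunks) - len(pattern) + 1):
            window = chunks[start:start + len(pattern)]
            slots = {}
            for slot, (text, is_np) in zip(pattern, window):
                if slot in ("V", "O"):
                    if not is_np:            # V and O may bind noun phrases only
                        break
                    slots[slot] = text
                elif is_np or text != slot:  # a literal slot must match exactly
                    break
            else:
                triples.append((doc_id, slots["O"], slots["V"]))
    return triples


sentence_1 = [("４０から５０歳", True), ("の", False), ("女性", True),
              ("向けの", False), ("靴", True), ("を販売", False)]
sentence_2 = [("中年", True), ("女性", True), ("を新たな販売ターゲットとした", False)]
print(extract_triples("DOC_ID_1", sentence_1))
print(extract_triples("DOC_ID_1", sentence_2))
```

With the two example sentences above chunked by hand, the first pattern produces the triple for “40 to 50 years old” and the second produces the triple for “middle-aged”.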

The attribute value candidate aggregation unit 104 aggregates all of the “document identifier, thing name, attribute value” triples that the thing attribute value set candidate extraction unit 103 created from the document set. FIG. 4 shows the flow of the aggregation.

The attribute value collection process (S401) gathers the data that share the same pair of document identifier and thing name, extracts their attribute values, and creates an attribute value set. The attribute value set of the thing name O_U extracted from the document with identifier U is written SET_V(O_U). At the same time, a thing name set SET_O, whose elements are all the thing names O_U, and a document identifier set SET_U, whose elements are all the document identifiers U, are created.

For example, suppose the following “document identifier, thing name, attribute value” triples are input: “DOC_ID_1, woman, middle-aged”, “DOC_ID_1, woman, 40 to 50 years old”, “DOC_ID_1, shoes, middle-aged”, “DOC_ID_1, shoes, 20 to 23 cm”, “DOC_ID_2, woman, young”, and “DOC_ID_2, woman, 150 to 160 cm”. The SET_V(O_U) created from the document identifier “DOC_ID_1” and the thing name “woman” is {40 to 50 years old, middle-aged}, and the SET_V(O_U) created from the document identifier “DOC_ID_1” and the thing name “shoes” is {20 to 23 cm, middle-aged}. The SET_V(O_U) created from the document identifier “DOC_ID_2” and the thing name “woman” is {150 to 160 cm, young}. SET_O is {woman, shoes}, and SET_U is {DOC_ID_1, DOC_ID_2}. In this description, a set containing the elements A and B is written {A, B}, and this notation is used below. A set whose elements are sets is written in nested form.
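The grouping in S401 can be sketched as below, assuming the triples arrive as plain tuples. The function name `collect_attribute_values` and the dictionary keyed by (document identifier, thing name) are illustrative choices, not part of the source.

```python
from collections import defaultdict


def collect_attribute_values(triples):
    """S401: group attribute values by (document identifier, thing name)."""
    set_v = defaultdict(set)   # (doc id, thing name) -> SET_V(O_U)
    set_o, set_u = set(), set()
    for doc_id, thing, value in triples:
        set_v[(doc_id, thing)].add(value)
        set_o.add(thing)       # SET_O: all thing names
        set_u.add(doc_id)      # SET_U: all document identifiers
    return dict(set_v), set_o, set_u


triples = [
    ("DOC_ID_1", "woman", "middle-aged"),
    ("DOC_ID_1", "woman", "40 to 50 years old"),
    ("DOC_ID_1", "shoes", "middle-aged"),
    ("DOC_ID_1", "shoes", "20 to 23 cm"),
    ("DOC_ID_2", "woman", "young"),
    ("DOC_ID_2", "woman", "150 to 160 cm"),
]
set_v, set_o, set_u = collect_attribute_values(triples)
print(set_v[("DOC_ID_1", "woman")])
```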

The attribute value set classification process (S402) divides each attribute value set SET_V(O_U) into two sets: SET_N(O_U), the attribute values given as numerical expressions, and SET_W(O_U), the attribute values given as words. Whether an attribute value is a numerical or a verbal expression is decided using the fact that numbers are written with a limited set of characters, such as Arabic numerals and kanji numerals. Any expression that is not numerical is judged to be a verbal expression.

With this process, SET_N(O_U) = {20 to 23 cm, 40 to 50 years old} and SET_W(O_U) = {middle-aged} are created from the SET_V(O_U) created from the document identifier “DOC_ID_1”. From the SET_V(O_U) created from the document identifier “DOC_ID_2”, SET_N(O_U) = {150 to 160 cm} and SET_W(O_U) = {young} are created.

The co-occurrence relation creation process (S403) creates the set SET_C(O_U), which consists of every combination of an element of SET_N(O_U) and an element of SET_W(O_U) produced by the attribute value set classification process.

With this process, SET_C(O_U) = {{20 to 23 cm, middle-aged}, {40 to 50 years old, middle-aged}} is created from the SET_N(O_U) and SET_W(O_U) created from the document identifier “DOC_ID_1”. From the SET_N(O_U) and SET_W(O_U) created from the document identifier “DOC_ID_2”, SET_C(O_U) = {{150 to 160 cm, young}} is created.
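The classification (S402) and co-occurrence (S403) steps can be sketched as below. The test for a numerical expression is an assumed heuristic (the value contains at least one Arabic or kanji numeral character); a word that happens to contain a numeral character would be misclassified by it. The names `NUMERAL_CHARS`, `is_numeric_expression`, and `classify_and_pair` are hypothetical, and each co-occurrence is stored as an ordered (numeric, word) tuple rather than a two-element set.

```python
from itertools import product

# Characters treated as numerals: ASCII, full-width, and kanji (assumed list).
NUMERAL_CHARS = set("0123456789０１２３４５６７８９〇一二三四五六七八九十百千万億兆京")


def is_numeric_expression(value: str) -> bool:
    """Heuristic for S402: any numeral character marks a numerical expression."""
    return any(ch in NUMERAL_CHARS for ch in value)


def classify_and_pair(set_v):
    """Split SET_V into SET_N and SET_W (S402), then build SET_C (S403)."""
    set_n = {v for v in set_v if is_numeric_expression(v)}
    set_w = set(set_v) - set_n
    set_c = set(product(set_n, set_w))   # every (numeric, word) combination
    return set_n, set_w, set_c


set_n, set_w, set_c = classify_and_pair(
    {"20 to 23 cm", "40 to 50 years old", "middle-aged"})
print(sorted(set_c))
```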

The set storage process (S404) stores the sets SET_N(O_U), SET_W(O_U), and SET_C(O_U) created for every thing name in the sets SET_N, SET_W, and SET_C, respectively.

With this process, the data created from the document identifiers “DOC_ID_1” and “DOC_ID_2” yield SET_N = {{20 to 23 cm, 40 to 50 years old}, {150 to 160 cm}}, SET_W = {{middle-aged}, {young}}, and SET_C = {{20 to 23 cm, middle-aged}, {40 to 50 years old, middle-aged}, {150 to 160 cm, young}}.

FIG. 5 shows how the correspondence evaluation unit 105 processes the sets SET_N, SET_W, and SET_C created by the attribute value candidate aggregation unit 104. The processing is described below using the example above.

One element of SET_W is taken out and assigned to SET_WTMP (S501). In the example, {middle-aged} is taken from SET_W = {{middle-aged}, {young}}, giving SET_WTMP = {middle-aged}.

If SET_WTMP has any elements, one is taken out, and 1 is added to the value stored in the hash table HASH_W under that element as the key. ALL_W is also incremented by 1 (S502, S503).

In the example, KEY = middle-aged, so “middle-aged” is registered in the hash table HASH_W and its value is incremented by 1. In other words, HASH_W stores the total number of occurrences of data with KEY = middle-aged. ALL_W is also incremented by 1, so ALL_W stores the number of times any value was assigned to KEY.

After S501 to S503 have been performed for every element of SET_W, ALL_W is 2 and the hash table holds two keys, “middle-aged” and “young”, each with the value 1.

The values in HASH_W are sorted in descending order of frequency. In that order, each HASH_W key is paired with its value divided by ALL_W, and the pairs are stored in the list WORD (S504).

With this process, the list WORD holds (<middle-aged, 0.5>, <young, 0.5>). In this description, a list containing the elements A and B is written (A, B), and this notation is used below.
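Steps S501 to S504 amount to a relative-frequency count, which might look like the sketch below. `word_frequencies` is a hypothetical name, and SET_W is passed as a list of lists so that the order of ties is reproducible. `Counter.most_common()` returns entries in descending order of count.

```python
from collections import Counter


def word_frequencies(set_w):
    """S501-S504: count each verbal attribute value and normalize by ALL_W."""
    hash_w = Counter()              # HASH_W: key -> occurrence count
    all_w = 0                       # ALL_W: total number of keys seen
    for set_wtmp in set_w:          # S501: take one element of SET_W
        for key in set_wtmp:        # S502: take one element of SET_WTMP
            hash_w[key] += 1        # S503
            all_w += 1
    # S504: sort by descending count and pair each key with count / ALL_W
    return [(key, count / all_w) for key, count in hash_w.most_common()]


print(word_frequencies([["middle-aged"], ["young"]]))
# [('middle-aged', 0.5), ('young', 0.5)]
```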

One element of SET_N is taken out and assigned to SET_NTMP (S505). In the example, {20 to 23 cm, 40 to 50 years old} is taken from SET_N = {{20 to 23 cm, 40 to 50 years old}, {150 to 160 cm}}, giving SET_NTMP = {20 to 23 cm, 40 to 50 years old}.

If SET_NTMP has any elements, one is taken out and processed by the numerical expression processing unit 106 (S506, S507). The processing in the numerical expression processing unit 106 is described with reference to FIG. 6. In the example, “20 to 23 cm” is taken from {20 to 23 cm, 40 to 50 years old}, and processing it with the numerical expression processing unit 106 yields a maximum of 23, a minimum of 20, and the unit cm.
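The shape of the result (maximum, minimum, unit) can be illustrated with a small sketch. It assumes a range written with Arabic digits and the particle から (“from ... to”), which is only one of the forms the numerical expression processing unit handles; kanji numerals are not covered here. `RANGE_RE` and `parse_range` are hypothetical names.

```python
import re

# <digits> から <digits> <unit>; \d also matches full-width digits in Python 3.
RANGE_RE = re.compile(r"^(\d+)から(\d+)(.*)$")


def parse_range(expr: str):
    """Return the minimum, maximum, and unit of a simple range expression."""
    m = RANGE_RE.match(expr)
    if m is None:
        return None                      # not a range in this simple form
    low, high = int(m.group(1)), int(m.group(2))
    return {"min": min(low, high), "max": max(low, high), "unit": m.group(3)}


print(parse_range("40から50歳"))
# {'min': 40, 'max': 50, 'unit': '歳'}
```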

The numerical expression processing unit 106 converts the numeral string contained in an extracted numerical expression (which may include a range expression) into concrete values (integers or real numbers). Because the numeral string may contain a range expression, the output is the maximum and minimum of the values that the numerical expression covers. For example, the numeral string 百から二百 (“one hundred to two hundred”) yields “maximum: 200, minimum: 100” (as numeric values, not characters). FIG. 6 shows the flow of the processing. The processing is described below using the numeral string 千二，三百 (“one thousand two or three hundred”) as an example.

First, the numeral string is obtained (S601). Next, it is checked for a comma (S602). If it contains a comma, processing goes to S603; otherwise it goes to S608. Since 千二，三百 contains a comma, processing goes to S603.

S603 checks whether the string follows the three-digit grouping format. If it does, the string with the commas removed is obtained (S605) and processing continues. If it does not, S604 checks whether the digits immediately before and after the comma are consecutive numbers. If they are not consecutive, the string is split at the comma (S606). If they are consecutive, processing goes to S607. The string 千二，三百 does not follow the three-digit grouping format, so processing goes to S604; the digits around the comma, 二 (two) and 三 (three), are consecutive, so processing goes to S607.

S607 uses the variables A, B, C, and D. The character immediately before the comma is assigned to A, and the character immediately after it to B. The part of the string from its beginning up to the second character before the comma is assigned to C, and the part from the second character after the comma to the end of the string is assigned to D. Two numeral strings are then created, one by concatenating C, A, and D in that order and the other by concatenating C, B, and D in that order, and the subsequent processing operates on these two strings. For 千二，三百, C is 千, A is 二, B is 三, and D is 百, so 千二百 (1,200) and 千三百 (1,300) are processed in the subsequent steps.
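The comma handling in S602 to S607 can be sketched as below. The regular expression used for the three-digit grouping test and the reading of “consecutive” as “the second digit is the first plus one” are assumptions; `DIGITS`, `GROUPED`, and `expand_commas` are hypothetical names.

```python
import re

# Digit characters (ASCII, full-width, kanji) mapped to their values.
DIGITS = {ch: v for forms in ("0123456789", "０１２３４５６７８９", "〇一二三四五六七八九")
          for v, ch in enumerate(forms)}
# Assumed form of three-digit grouping, e.g. "1,200" or "12,345,678".
GROUPED = re.compile(r"^[0-9０-９]{1,3}(,[0-9０-９]{3})+$")


def expand_commas(numeral: str) -> list:
    """Return the numeral strings to parse after resolving a comma (S602-S607)."""
    s = numeral.replace("，", ",")       # treat the full-width comma like ","
    if "," not in s:                     # S602: no comma, nothing to resolve
        return [s]
    if GROUPED.match(s):                 # S603: digit grouping
        return [s.replace(",", "")]      # S605: drop the commas
    pos = s.index(",")
    a, b = s[pos - 1:pos], s[pos + 1:pos + 2]
    if a in DIGITS and b in DIGITS and DIGITS[b] == DIGITS[a] + 1:  # S604
        c, d = s[:pos - 1], s[pos + 2:]
        return [c + a + d, c + b + d]    # S607: C+A+D and C+B+D
    return s.split(",")                  # S606: two separate numbers


print(expand_commas("千二，三百"))
# ['千二百', '千三百']
```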

The next step, S608, uses the variables V, C1, C2, and F, initialized to V = 0, C1 = 1, C2 = 1, and F = 0. Characters are then taken one at a time from the right end of the numeral string (S609) and processed as follows, depending on the kind of numeral.

S610 checks whether the character is a kanji or Arabic numeral with a value from 0 to 9. If so, (value of the numeral) × C1 × C2 is added to V, and C1 is multiplied by 10 (S614).

S611 checks whether the character is one of the kanji numerals 十 (ten), 百 (hundred), or 千 (thousand). If so, and if F is 1 in the test at S615, C1 × C2 is added to V (S616). Then the value of the numeral (10, 100, or 1000) is assigned to C1, and 1 is assigned to F (S617).

S612 checks whether the character is one of the kanji numerals 万, 億, 兆, 京, or 亥. If so, and if F is 1 in the test at S618, C1 × C2 is added to V (S619). Then the value of the numeral (10000, and so on) is assigned to C2, 1 is assigned to C1, and 0 is assigned to F (S620).

S621 checks whether any characters remain in the numeral string. If none remain, V is output and processing ends (S622); otherwise processing repeats from S609.

When 千二百 is processed by S608 to S622, S609 takes the characters in the order 百, 二, 千. The first character is 百, so the test at S611 is true and processing goes to S615. The variable F still has its initial value of 0, so processing goes to S617, where the numeric value of 百, 100, is assigned to C1 and 1 is assigned to F. The next character is 二, so the test at S610 is true and processing goes to S614. V becomes 0 + 100 × 1 × 2 = 200, and C1 becomes 100 × 10 = 1000. Finally, the character 千 is processed. The test at S611 is true and processing goes to S615. F is 1, so the test is true and processing goes to S616, where V becomes 200 + 1000 × 1 = 1200. In S617, the numeric value of 千, 1000, is assigned to C1 and 1 is assigned to F. The test at S621 is then false, the value of V, 1200, is output (S622), and processing ends.
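The loop in S608 to S622 can be transcribed almost line by line, as in the sketch below. It follows the steps as written and has been checked only against traces of the kind worked through above; other inputs may need handling that the steps do not state. The tables `DIGITS`, `SMALL_UNITS`, and `LARGE_UNITS` and the function `parse_numeral` are hypothetical names, and `LARGE_UNITS` lists only 万 through 京 for illustration.

```python
DIGITS = {ch: v for forms in ("0123456789", "０１２３４５６７８９", "〇一二三四五六七八九")
          for v, ch in enumerate(forms)}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12, "京": 10**16}


def parse_numeral(numeral: str) -> int:
    """Convert a numeral string to an integer, scanning right to left."""
    v, c1, c2, f = 0, 1, 1, 0                   # S608: initial values
    for ch in reversed(numeral):                # S609: one character at a time
        if ch in DIGITS:                        # S610: digit 0-9
            v += DIGITS[ch] * c1 * c2           # S614
            c1 *= 10
        elif ch in SMALL_UNITS:                 # S611: 十, 百, 千
            if f == 1:                          # S615
                v += c1 * c2                    # S616
            c1, f = SMALL_UNITS[ch], 1          # S617
        elif ch in LARGE_UNITS:                 # S612: 万, 億, ...
            if f == 1:                          # S618
                v += c1 * c2                    # S619
            c2, c1, f = LARGE_UNITS[ch], 1, 0   # S620
        else:
            raise ValueError(f"not a numeral character: {ch!r}")
    return v                                    # S622: output V


print(parse_numeral("千二百"))  # 1200, matching the trace above
```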

The conversion processing unit 107 converts the unit contained in a numerical expression into a standard unit (hereinafter, the reference unit). For example, when the reference unit of length is the meter, 10 centimeters is converted to 0.1 meter.

As shown in FIG. 7, the conversion processing unit 107 has a reference unit selection table 701, a unit structure analysis unit 702, a calculation unit 703, a prefix conversion table 704, and a unit conversion table 705.

As shown in FIG. 7, the reference unit selection table 701 is a table whose columns are the unit or counter word, the measure, and the unit that the measure takes. Measures and their units can be defined with reference to the Measurement Act and the Measurement Unit Order, but measures and units not covered by those regulations can also be defined. Given an attribute value and a unit or counter word, the corresponding reference unit is selected.

For counter words attached to the right of a number when counting objects (such as 個 in 1個 or 台 in 1台), the reference unit is 個 (pieces). For ratios such as content, the reference unit is %.

The unit structure analysis unit 702 analyzes the structure of a unit using a parser based on a context-free grammar. Parsers are a widely known technique and are not described here.

The calculation unit 703 uses the unit structure analysis result, the prefix conversion table, and the unit conversion table to convert the number in the extracted numerical expression so that it matches the reference unit.

As shown in the prefix conversion table 704 of FIG. 7, each prefix is registered together with the exponent N of its multiplier 10^N.

As shown in FIG. 7, the unit conversion table 705 registers a source unit, a reference unit, and a conversion formula. In the formula, the number extracted from the numerical expression is substituted for X. The data given as an example above, "maximum 30, minimum 20, unit centi", becomes "maximum: 0.3, minimum: 0.2, reference unit: meter".
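A prefix table that stores the exponent N of 10^N can be applied as in the following Python sketch. The table entries, the name `PREFIX_EXPONENT`, and the function `to_reference` are assumptions made for illustration; the actual contents of tables 704 and 705 appear only in FIG. 7. Exact rational arithmetic is used here simply to avoid floating-point noise in the example.

```python
from fractions import Fraction

# Hypothetical prefix table: prefix -> exponent N of the multiplier 10^N.
PREFIX_EXPONENT = {"kilo": 3, "centi": -2, "milli": -3}


def to_reference(value, prefix):
    """Scale a prefixed value to the unprefixed reference unit."""
    return Fraction(value) * Fraction(10) ** PREFIX_EXPONENT[prefix]


# The "maximum 30, minimum 20, unit centi" example, converted to meters.
hi = to_reference(30, "centi")
lo = to_reference(20, "centi")
```

With these assumed entries, `hi` and `lo` correspond to the maximum 0.3 and minimum 0.2 in meters given in the text.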

Returning to FIG. 5, for the triple of minimum value, maximum value, and reference unit output by the numerical processing, the respective values are set in MIN, MAX, and UNIT. The value in the two-dimensional hash table HASH_N, whose first key is the pair <MIN, MAX> and whose second key is UNIT, is incremented by 1. ALL_N is incremented by 1. UNIT is added to SET_R.

Through this processing, <0.3, 0.2> is registered in the hash table HASH_N as the first hash key and "meter" as the second hash key, and the corresponding value is incremented by 1. In other words, HASH_N holds the total number of occurrences of the data "maximum: 0.3, minimum: 0.2, reference unit: meter". ALL_N is also incremented by 1, so ALL_N holds the number of times any numerical expression was extracted. "Meter" is added to SET_R.

Performing S505 through S507 on every element of SET_N brings ALL_N to 3. Three combinations of first and second hash keys are registered in the hash table, "years, <50, 40>", "meter, <1.6, 1.5>", and "meter, <0.3, 0.2>", each with the value 1.
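The bookkeeping for HASH_N, ALL_N, and SET_R can be sketched in Python as follows. The function name `tally` and the tuple layout of `records` are assumptions; the two-dimensional hash is modeled as a `Counter` keyed by a (range pair, unit) tuple, with the range pair written in the <max, min> order that the worked example uses.

```python
from collections import Counter


def tally(records):
    """records: iterable of (minimum, maximum, reference_unit) triples."""
    hash_n = Counter()   # ((max, min), unit) -> number of occurrences
    all_n = 0            # total number of numerical expressions seen
    set_r = set()        # reference units encountered
    for lo, hi, unit in records:
        hash_n[((hi, lo), unit)] += 1
        all_n += 1
        set_r.add(unit)
    return hash_n, all_n, set_r


# The three numerical expressions of the running example.
records = [(40, 50, "years"), (1.5, 1.6, "meter"), (0.2, 0.3, "meter")]
hash_n, all_n, set_r = tally(records)
```

For the running example this yields three distinct keys, each counted once, with `all_n` equal to 3.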

SET_R, ALL_N, and HASH_N are passed to the numerical value range processing unit 108 as the first, second, and third arguments, and the resulting hash table HashTblN is output (S508).

The numerical value range processing unit 108 calculates scores from the distribution of the input numerical data. FIG. 8 shows the processing flow.

As described above, the first argument is assigned to SET, the second to ALL, and the third to HASH (S801).

With the running example, SET becomes {meter, years} and ALL becomes 3. HASH holds the combinations of first and second hash keys "years, <50, 40>", "meter, <1.6, 1.5>", and "meter, <0.3, 0.2>", each with the value 1 (S801).

One value is taken from the set SET and assigned to UNIT (S802). This is repeated until every element of SET has been processed. Concretely, each element of SET is a unit.

In the example, "meter" is assigned to UNIT first.

All data in HASH whose first hash key is the value of UNIT are taken out and set in hash. The hash keys of hash are the second hash keys of HASH (S803).

This registers in hash the entries whose hash keys are <1.6, 1.5> and <0.3, 0.2>, each with the value 1.

Among the pairs of maximum value MAX and minimum value MIN that serve as the hash keys of hash, the largest MAX is taken as MMax and the smallest MIN as MMin (S804). The ratio of MMax to MMin is then checked to see whether it exceeds 10. If it does, processing proceeds to S806; otherwise it proceeds to S811. The steps from S806 onward divide the range at each power of 10, such as 1 to 10 and 10 to 100, and count the frequency within each range. The steps from S811 onward divide the range within a single order of magnitude, such as 10 to 20 and 20 to 30, and count the frequency within each range.

In the example, MMax is 1.6 and MMin is 0.2. The ratio, 8, is less than 10, so processing proceeds to S811.

If the test at S805 is true, S806 checks whether numerical data remain, and if so, one pair of numerical data is taken out (S807). For MAX and MIN respectively, 10 raised to the power int(log₁₀ MAX) is assigned to MaxL, and 10 raised to the power int(log₁₀ MIN) is assigned to MinL. Here int is a function that truncates the fractional part to yield an integer. MaxL is then repeatedly divided by 10, and for as long as it is not smaller than MinL, the value of hash{<MAX, MIN>} is added to the entry of the hash HashTbl keyed by MaxL (S809, S810).

For example, when the pair of numerical data is MIN = 200 and MAX = 5000, int(log₁₀ 5000) is 3, so MaxL is 10³ = 1000, and int(log₁₀ 200) is 2, so MinL is 10² = 100. The value of hash{<5000, 200>} is added to the HashTbl entry whose hash key is the combination of "1000" and the value of UNIT. Dividing MaxL by 10 gives 100, which is not smaller than 100, so an entry whose hash key is the combination of "100" and the value of UNIT is also registered in HashTbl.
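The power-of-ten binning of S806 to S810 can be sketched in Python as below. The function names `decade_keys` and `add_decades` and the placeholder unit string are assumptions; the loop follows the worked example, in which a range from 200 to 5000 is counted under both 1000 and 100. The sketch iterates over exponents rather than dividing repeatedly, which gives the same keys for values of 1 or more.

```python
import math
from collections import Counter


def decade_keys(lo, hi):
    """Powers of ten from 10^int(log10(hi)) down to 10^int(log10(lo))."""
    e_hi = int(math.log10(hi))   # int() truncates the fractional part
    e_lo = int(math.log10(lo))
    return [10 ** e for e in range(e_hi, e_lo - 1, -1)]


def add_decades(table, unit, lo, hi, count):
    """Add `count` to every (decade, unit) entry covered by the range."""
    for key in decade_keys(lo, hi):
        table[(key, unit)] += count


hash_tbl = Counter()
unit = "unit"                    # placeholder; the example does not name the unit
add_decades(hash_tbl, unit, 200, 5000, 1)
keys = decade_keys(200, 5000)
```

For MIN = 200 and MAX = 5000, `keys` contains 1000 and 100, matching the two entries registered in the example.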

If, on the other hand, the test at S805 is false, S811 checks whether numerical data remain, and if so, one pair of numerical data is taken out (S812). 10 raised to the power int(log₁₀ MIN) is assigned to MaxL and MaxO, and the integer part of MIN divided by MaxL is assigned to MaxE. MaxO is then repeatedly added to MaxE, and for as long as the result is no greater than MAX, the value of hash{<MAX, MIN>} is added to the entry of the hash HashTbl (whose values are integers) keyed by the combination of MaxE and UNIT (S811, S812).

Continuing the example above, when the pair of numerical data is MIN = 1.5 and MAX = 1.6, int(log₁₀ 1.5) is 0, so MaxL is 10⁰ = 1, MaxO is 1, and MaxE is the integer part of 1.5 divided by 1, namely 1. The value of hash{<1.6, 1.5>} is added to the HashTbl entry whose hash key is the combination of "meter" and "1". Adding MaxO to MaxE gives 2, which is greater than MAX, so the next data item is processed. For the next item, MIN = 0.2 and MAX = 0.3: int(log₁₀ 0.2) is 0, so MaxL is 10⁰ = 1, MaxO is 1, and MaxE is the integer part of 0.2 divided by 1, namely 0. The value of hash{<0.3, 0.2>} is added to the HashTbl entry whose hash key is the combination of "meter" and "0". Adding MaxO to MaxE gives 1, which is greater than MAX, so processing would move to the next data item; since there is none, the next unit, "years", is read from SET into UNIT and processing moves on to <50, 40>. This concludes the description of the numerical value range processing unit 108.
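The first registration step of the S811 branch, computing MaxE from MIN, can be sketched in Python as follows. This deliberately covers only the initial bin key (the integer part of MIN divided by MaxL) and omits the subsequent loop that adds MaxO; the function name `leading_bin` and the data layout are assumptions.

```python
import math
from collections import Counter


def leading_bin(lo):
    """Integer part of MIN / 10^int(log10(MIN)), i.e. the initial MaxE."""
    max_l = 10 ** int(math.log10(lo))   # int() truncates toward zero
    return int(lo / max_l)


# (unit, MIN) for the three ranges of the running example.
ranges = [("meter", 1.5), ("meter", 0.2), ("years", 40)]

hash_tbl_n = Counter()
for unit, lo in ranges:
    hash_tbl_n[(unit, leading_bin(lo))] += 1
```

For the running example this produces the keys (meter, 1), (meter, 0), and (years, 4), each with the value 1.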

Returning to the correspondence degree evaluation unit 105, words W are taken from the list WORD in descending order of value. Up to an arbitrary number N of words may be taken. The frequency is regarded as an indicator of how appropriate a word is as an attribute value expression.

In the example, the words are "middle-aged" (中年) and "young" (若い). The following description takes the case where "middle-aged" is assigned to W.

All elements containing the word W are taken from SET_C, and the set formed from only their numerical expression parts is taken as SET_CTMP (S509, S510).

Through this processing, from SET_C = {{20 to 23 cm, middle-aged}, {40 to 50 years, middle-aged}, {150 to 160 cm, young}}, SET_CTMP is set to {{20 to 23 cm, middle-aged}, {40 to 50 years, middle-aged}}.
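The selection in S509 and S510 amounts to filtering co-occurrence pairs by word. A minimal Python sketch follows; the function name `numeric_parts_for_word` and the representation of SET_C as a list of (numerical expression, word) tuples are assumptions. The sketch keeps only the numerical expression part of each matching element, as the step description states.

```python
def numeric_parts_for_word(set_c, word):
    """Return the numerical expressions that co-occur with `word`."""
    return [numeric for numeric, w in set_c if w == word]


# SET_C of the running example as (numerical expression, word) pairs.
set_c = [
    ("20 to 23 cm", "middle-aged"),
    ("40 to 50 years", "middle-aged"),
    ("150 to 160 cm", "young"),
]
set_ctmp = numeric_parts_for_word(set_c, "middle-aged")
```

With W set to "middle-aged", `set_ctmp` contains the two numerical expressions paired with that word.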

If SET_CTMP has any elements, one is taken out and processed by the numerical expression processing unit 106. For the triple of minimum value, maximum value, and reference unit output by the numerical processing, the respective values are set in MIN, MAX, and UNIT. The value in the two-dimensional hash table HASH_C, whose first key is the pair <MIN, MAX> and whose second key is UNIT, is incremented by 1. ALL_C is incremented by 1. UNIT is added to SET_R (S511, S512). SET_R, ALL_C, and HASH_C are processed by the numerical value range processing unit 108, and the resulting hash table is taken as HashTblC (S513). Because these steps are the same as those already described, a detailed walk-through by example is omitted.

This processing yields HashTblN and HashTblC. In the example, HashTblN has the hash keys <meter, 1>, <meter, 0>, and <years, 4>, each with the value 1. HashTblC has the hash keys <meter, 0> and <years, 4>, each with the value 1.

HashTblN and HashTblC are processed by the numerical value range comparison unit 109 (S514).

FIG. 9 shows the processing flow of the numerical value range comparison unit 109.

The set of all keys of HashTblC is taken as SET_F (S901). One element is taken from SET_F as KEY (S902). The unit contained in KEY is taken as UNIT (S903).

The sum of the values in HashTblC whose hash keys contain UNIT is set in COUNT_C (S904). The sum of the values in HashTblN whose hash keys contain UNIT is set in COUNT_N (S905).

For KEY, the value in HashTblC and the value in HashTblN are each normalized by the corresponding COUNT, and the former is divided by the latter. The result is the score S (S906).

Of the pairs of score S and KEY, the KEY with the highest score is output. It is also possible to output the pairs of score S and KEY down to the M-th highest score, in which case the score can be used as a measure of confidence in the correctness of the KEY.

In the example, for <meter, 0> the value in HashTblC is 1, the value in HashTblN is 1, COUNT_C is 1, and COUNT_N is 2, so the score is (1/1) ÷ (1/2) = 2. For <years, 4> the value in HashTblC is 1, the value in HashTblN is 1, COUNT_C is 1, and COUNT_N is 1, so the score is (1/1) ÷ (1/1) = 1.
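The score of S904 to S906 is a ratio of two per-unit relative frequencies, which the following Python sketch computes for the running example. The function name `score_keys` and the (unit, bin) key layout are assumptions, and the sketch further assumes that every key of HashTblC also appears in HashTblN, as it does in the example.

```python
from collections import Counter
from fractions import Fraction


def score_keys(tbl_c, tbl_n):
    """Score each key of tbl_c as (C / COUNT_C) / (N / COUNT_N)."""
    scores = {}
    for key in tbl_c:
        unit = key[0]
        # Totals over all keys that share this unit.
        count_c = sum(v for (u, _), v in tbl_c.items() if u == unit)
        count_n = sum(v for (u, _), v in tbl_n.items() if u == unit)
        scores[key] = Fraction(tbl_c[key], count_c) / Fraction(tbl_n[key], count_n)
    return scores


hash_tbl_n = Counter({("meter", 1): 1, ("meter", 0): 1, ("years", 4): 1})
hash_tbl_c = Counter({("meter", 0): 1, ("years", 4): 1})
scores = score_keys(hash_tbl_c, hash_tbl_n)
```

For these tables the score of (meter, 0) is 2 and the score of (years, 4) is 1, in agreement with the values stated in the example.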

On receiving the result of the correspondence degree evaluation unit 105, the dictionary output unit 110 outputs the triple "word W, unit U, numerical pair <MIN, MAX>" as a dictionary.

In the example, the triple "middle-aged, years, <40, 50>" is registered in the dictionary. For simplicity this description used an example with only two document files; in actual operation the number of documents would be several hundred or more.

Through the processing described above, from the text "A new brand selling shoes for women aged 40 to 50 has been announced. The brand makes middle-aged women its new sales target", the attribute value "40 to 50 years" with the object name "women" and the attribute value "middle-aged" with the object name "women" are collected, and a correspondence such as "middle-aged" with "reference unit: years, minimum: 40, maximum: 50" is obtained together with a score. As a result, "middle-aged" and "40 to 50 years" can be associated automatically in the numerical-equivalent expression dictionary.

FIG. 10 summarizes the data flow of the processing described above. A document file is read from the input device and saved to the storage device; thereafter, reading saved data and saving processing results are repeated, and finally a dictionary collecting the correspondences between verbal quantitative expressions and numerical quantitative expressions is output. For each verbal quantitative expression, the dictionary may collect only the single numerical quantitative expression with the highest score, or several of the highest-scoring ones together with their scores.

The processing above establishes correspondences between verbal and numerical expressions automatically, but when sufficient data cannot be obtained, it can be difficult to achieve adequate accuracy with a fully automatic method. In that case, the correspondence degree evaluation unit 105 in FIG. 1 may be replaced by a unit designation correspondence evaluation unit 115, as shown in FIG. 11.

FIG. 12 shows the processing flow of the unit designation correspondence evaluation unit 115 in this case. The processing is the same as that of the correspondence degree evaluation unit 105 except that only the designated unit is processed. Steps S1201 through S1205 in FIG. 12 are the same as S501 through S505 in FIG. 5, and steps S1208 through S1214 in FIG. 12 are the same as S508 through S514 in FIG. 5. The difference lies in S1206 and S1207, which check whether the reference unit is identical to the designated unit.

It has thus been shown that the system of FIG. 1 or FIG. 11 can automatically associate verbal expressions with numerical expressions.

Description of Symbols
101 Document file input unit
102 Document part extraction unit
103 Object attribute value pair candidate extraction unit
104 Attribute value candidate aggregation unit
105 Correspondence degree evaluation unit
106 Numerical expression processing unit
107 Conversion processing unit
108 Numerical value range processing unit
109 Numerical value range comparison unit
110 Dictionary output unit
115 Unit designation correspondence evaluation unit
201 Processing device
202 Main storage device
203 External storage device
204 Input device
205 Output device
301 Object name extraction pattern dictionary
701 Reference unit selection table
702 Unit structure analysis unit
703 Calculation unit
704 Prefix conversion table
705 Unit conversion table

Claims

1. A document analysis system comprising:
an input unit for inputting a document file;
a document part extraction unit for extracting a document part from the document file;
an object attribute value pair candidate extraction unit that performs morphological analysis on the extracted document part, identifies noun phrases, and then, using an object name extraction pattern dictionary in which sentence patterns of attribute names and attribute values that can appear in the document part are registered, extracts from the identified noun phrases an object name and attribute value candidates for the object indicated by that object name;
a numerical expression processing unit that, if an attribute value candidate is a numerical expression, acquires a numeral string from the attribute value candidate; if the numeral string is delimited by a comma every three digits, generates a numeral string with the commas removed; if the numerals before and after a comma are consecutive numbers, divides the string into a numeral string represented by the numeral before the comma together with the numerals up to two characters before the comma and the numerals two characters after the comma, and a numeral string represented by the numeral after the comma together with the numerals two characters after the comma and the numerals preceding the comma; if the numerals before and after the comma are not consecutive numbers, divides the string into the numeral string before the comma and the numeral string after the comma; and extracts numeral information in order from the end of each resulting numeral string, sequentially generating values for the digit positions corresponding to the extracted numeral information, thereby converting the acquired numeral string into a numerical value;
an attribute value candidate aggregation unit that creates, from the attribute value candidates, a set of attribute value candidates for the same object name, and further creates from that set a set of attribute values expressed in words and a set of attribute values expressed numerically;
a correspondence degree evaluation unit that selects one word at a time, in order, from the set of attribute values expressed in words, calculates the co-occurrence frequency of the attribute value of the selected word and the attribute values included in the set of attribute values expressed numerically, and, based on that frequency, generates information indicating the correspondence between the attribute value of the selected word and the attribute values included in the set of attribute values expressed numerically; and
an output unit that, based on the information generated by the correspondence degree evaluation unit, outputs a dictionary collecting the correspondences between verbal quantitative expressions and numerical quantitative expressions.

2. The document analysis system according to claim 1, further comprising a conversion processing unit that converts the numerical value into a value in a reference unit, based on the unit contained in the numerical expression processed by the numerical expression processing unit, a predetermined reference unit, and unit conversion rules.

3. The document analysis system according to claim 1, wherein the correspondence degree evaluation unit processes only numerical expressions that include a designated unit.

4. A dictionary creation system comprising:
an input unit for inputting a document file;
a document part extraction unit for extracting a document part from the document file;
an object attribute value pair candidate extraction unit that performs morphological analysis on the extracted document part, identifies noun phrases, and then, using an object name extraction pattern dictionary in which sentence patterns of attribute names and attribute values that can appear in the document part are registered, extracts from the identified noun phrases an object name and attribute value candidates for the object indicated by that object name;
a numerical expression processing unit that, if an attribute value candidate is a numerical expression, acquires a numeral string from the attribute value candidate; if the numeral string is delimited by a comma every three digits, generates a numeral string with the commas removed; if the numerals before and after a comma are consecutive numbers, divides the string into a numeral string represented by the numeral before the comma together with the numerals up to two characters before the comma and the numerals two characters after the comma, and a numeral string represented by the numeral after the comma together with the numerals two characters after the comma and the numerals preceding the comma; if the numerals before and after the comma are not consecutive numbers, divides the string into the numeral string before the comma and the numeral string after the comma; and extracts numeral information in order from the end of each resulting numeral string, sequentially generating values for the digit positions corresponding to the extracted numeral information, thereby converting the acquired numeral string into a numerical value;
an attribute value candidate aggregation unit that creates, from the attribute value candidates, a set of attribute value candidates for the same object name, and further creates from that set a set of attribute values expressed in words and a set of attribute values expressed numerically;
a correspondence degree evaluation unit that selects one word at a time, in order, from the set of attribute values expressed in words, calculates the co-occurrence frequency of the attribute value of the selected word and the attribute values included in the set of attribute values expressed numerically, and, based on that frequency, generates information indicating the correspondence between the attribute value of the selected word and the attribute values included in the set of attribute values expressed numerically; and
a dictionary output unit that, based on the information generated by the correspondence degree evaluation unit, outputs, as a dictionary, attribute values expressed in words and the numerical expressions corresponding to them.

5. The dictionary creation system according to claim 4, further comprising a conversion processing unit that converts the numerical value into a value in a reference unit, based on the unit contained in the numerical expression processed by the numerical expression processing unit, a predetermined reference unit, and unit conversion rules.