JP5426868B2 - Numerical expression processing device

Info

Publication number: JP5426868B2
Application number: JP2008289164A
Authority: JP
Inventors: 義行小林; 康嗣森本; 順一谷本
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-11-11
Filing date: 2008-11-11
Publication date: 2014-02-26
Anticipated expiration: 2028-11-11
Also published as: JP2010117797A

Description

The present invention relates to named entity extraction technology, in which a computer automatically extracts useful expressions from electronic documents. In particular, it relates to a numerical expression extraction device that extracts numerical expressions together with incidental information related to them, and that searches for documents containing numerical expressions.

As electronic documents have accumulated in large volumes and become freely searchable, demand has grown for automatically collecting useful information from them. To meet this demand, research in natural language processing has produced technologies called named entity extraction and information extraction. Named entity extraction extracts proper nouns (person names, place names, and so on), dates, and time expressions from documents. Information extraction extracts only the useful information from documents. Named entity extraction can be regarded as one component of information extraction.

There are several kinds of named entities; the technique for extracting expressions that involve numbers is called numerical expression extraction (or quantity expression extraction). With numerical expression extraction, a document search query can use numeric values as well as words. This enables, for example, web search or intra-organization search based on specifications expressed as the price or quantity of an article.

One way to extract numerical expressions from documents automatically is to define various patterns for numerical expressions (Patent Document 1). This method defines patterns of the character strings that appear before and after Arabic and kanji numerals and uses them to extract the numerical expression. At the same time, it uses words appearing in the document to estimate what the number describes (for example, whether it is a weight or a length). Another method defines simple rules rather than patterns (Patent Document 2). It extracts numerical expressions using the few characters before and after a number and the punctuation marks as clues, and estimates what the number describes from its unit. A document search system that can search using numerical expressions is also known (Patent Document 3). This system performs document search using a keyword and a numeric value that characterizes the keyword.

Patent Document 1: Japanese Patent No. 3360617
Patent Document 2: JP 2006-350989 A
Patent Document 3: JP 2006-323575 A

First, consider an example of a document that conventional methods cannot handle properly. Conventional methods extract the part "20 to 30 years old" from the sentence "The suspect is about 20 to 30 years old" as a numerical expression and register it in the index as, for example, "lower limit: 20, upper limit: 30, unit: years old". Patent Documents 1 and 2 above register numerical information in the index in the same way.

Suppose the sentence above was extracted from a report written in the year 2000, and consider searching such reports with a numeric value in the query. As of 2008, it can be inappropriate for the query "20 years old" to retrieve this sentence. Likewise, as of 2008, it is inappropriate for the query "35 years old" to fail to retrieve it. The reason is that a suspect who was "20 to 30 years old" in 2000 should be "28 to 38 years old" in 2008.

This inappropriate behavior arises because the numeric values extracted from the document are registered in the database as they are. The extracted values need to be mapped to an absolute scale used in the real world before they are registered in the index.

In this example, instead of registering the age as it is, the age is converted to a year of birth on the Western calendar scale and registered in the index as "scale: Western calendar, attribute name: year of birth, lower limit: 1970, upper limit: 1980, unit: year". Then the query "20 years old" entered in 2008 becomes "scale: Western calendar, attribute name: year of birth, value: 1988, unit: year", which does not match the information registered in the index. The query "35 years old" entered in 2008 becomes "scale: Western calendar, attribute name: year of birth, value: 1973, unit: year", which does match.
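The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `BirthYearRange`, `age_range_to_birth_years`, and `age_query_matches` are hypothetical and do not appear in the patent, and the sketch ignores the month and day of birth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BirthYearRange:
    """Index entry on the 'Western calendar' scale (hypothetical structure)."""
    lower: int
    upper: int


def age_range_to_birth_years(age_low: int, age_high: int, document_year: int) -> BirthYearRange:
    # The older bound of the age range gives the earlier year of birth.
    return BirthYearRange(document_year - age_high, document_year - age_low)


def age_query_matches(entry: BirthYearRange, query_age: int, query_year: int) -> bool:
    # Convert the queried age to a year of birth as of the query year,
    # then compare on the common scale.
    birth_year = query_year - query_age
    return entry.lower <= birth_year <= entry.upper


# "20 to 30 years old" in a report from 2000.
entry = age_range_to_birth_years(20, 30, 2000)
```

With this entry, `age_query_matches(entry, 20, 2008)` is false and `age_query_matches(entry, 35, 2008)` is true, which is the behavior the text asks for.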

The above is an example of mapping the attribute of age onto the scale of the Western calendar. Another conceivable case is mapping a position described in relative terms, such as "10 kilometers from ○○", onto the scale of latitude and longitude.

Mapping onto a scale is also effective for physical quantities. For example, units of weight include pounds and ounces in addition to the kilogram, which is the SI standard. By mapping numerical expressions written in these different units of weight onto a single scale, documents can be searched without regard to which unit the text happens to use.
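As a minimal sketch of this idea, the snippet below maps weights given in kilograms, pounds, or ounces onto a single kilogram scale. The table name `KG_PER_UNIT`, the unit labels, and the function `to_kilograms` are assumptions made for illustration; the conversion factors are the standard definitions of the avoirdupois pound and ounce.

```python
# Kilograms per unit. Unit labels are hypothetical keys for this sketch.
KG_PER_UNIT = {
    "kg": 1.0,
    "lb": 0.45359237,       # international avoirdupois pound
    "oz": 0.028349523125,   # 1/16 of a pound
}


def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to the common scale (kilograms)."""
    return value * KG_PER_UNIT[unit]
```

Once every value is stored in kilograms, a query written in pounds can be converted the same way and compared directly with indexed values that originally appeared in ounces or kilograms.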

Thus, mapping numerical expressions to absolute scales is a problem that must be solved to make the handling of numerical expressions in document processing more appropriate. Solving it, however, requires solving a second problem: estimating the attribute name for which the numerical expression is the attribute value.

This is because selecting an appropriate scale requires knowing the attribute name whose value the numerical expression gives. For example, extracting only "value: 10, unit: meters" from the sentence "A wind speed of 10 meters was recorded" is not enough to map it to the appropriate scale, "speed". Once the information "attribute name: wind speed" is obtained, the value can be mapped to "scale: speed, unit: meters per second".

Patent Document 2 estimates the attribute name from the unit, so it applies only to cases such as inferring the attribute name "age" from the unit "years old" (歳). Patent Document 1 shows a method that uses words in the text to estimate whether "pound" is the British currency unit or a unit of weight. However, the method of Patent Document 1 cannot handle cases in which units with the same notation are used within one sentence as attribute values of different attribute names. For example, the sentence "The Shinkansen covers the roughly 500 kilo (約５００キロ) distance from Tokyo to Osaka in two and a half hours at a top hourly speed of 300 kilo (最高時速３００キロ)" should yield both "value: maximum 300 kilo, attribute: hourly speed, scale: speed" and "value: about 500 kilo, attribute: distance, scale: length". JP 2006-92193 A also describes an invention for determining the attribute of a numeric value, but like Patent Document 1 it makes the determination from words appearing in the document and therefore has the same problem.

In summary, the object of the present invention is to extract numerical expressions from documents and map each numeric value to an appropriate scale and, to that end, to detect each extracted numerical expression together with the attribute name whose attribute value it is.

A table for selecting an appropriate scale from a pair of attribute name and attribute value is used to map each numerical expression to an appropriate scale. The attribute name is estimated using a dictionary of the attribute names that a numeric unit can take and a dictionary of the attribute names that a thing can take.

A numerical expression processing apparatus according to the present invention has a numerical expression extraction unit that extracts a numeric value and its unit from an electronic document, an attribute name extraction unit that extracts from the document the attribute name whose attribute value is that numeric value, and a scale selection unit that maps the extracted value to a scale predetermined for each attribute name and converts it to a value on that scale. The scale selection unit has a table in which attribute names, the units of their numerical expressions, and scales are registered in association with one another, and it uses the information registered in the table to convert the extracted value to a value on the scale.

The apparatus also has a thing name extraction unit provided with a thing name extraction pattern dictionary, in which sentence patterns of attribute names, numerical expressions, and thing names are registered. The thing name extraction unit extracts from the document the name of the thing that has the pair consisting of the numerical expression extracted by the numerical expression extraction unit and the attribute name extracted by the attribute name extraction unit. The thing name extraction pattern dictionary registers co-occurrence scores between attribute names and units of numerical expressions, and between thing names and attribute names. Preferably, the thing name extraction unit has a function of estimating the attribute name whose value is the extracted numeric value, using the registered co-occurrence scores, based on the numeric value extracted from the document, its unit, and the name of the thing it describes.

Preferably, the attribute name extraction unit has a dictionary in which pairs of attribute name and representative name are registered so that synonymous attribute names map to the same representative name, and it uses this dictionary to obtain the representative name corresponding to an attribute name extracted from the document.

The numerical expression processing apparatus of the present invention also has a table that stores, for each numeric value contained in a document, the document name, the thing name, the representative name of the attribute name, and the numeric value converted to the scale associated with that attribute name.

The numerical expression processing apparatus of the present invention further has a search query input unit that accepts a thing name, an attribute name, and a numeric value as input, and an information search unit. The information search unit converts the attribute name entered in the search query input unit to its representative name, converts the entered numeric value to a value on the scale corresponding to the entered attribute name, and refers to the table to find document files that contain the entered thing name together with the converted representative name and value.

According to the present invention, a thing, an attribute name that characterizes the thing, and the numeric value of that attribute can be extracted from a document as a single triple, and the extracted value can be mapped to an appropriate scale. As a result, quantity expressions can be handled accurately, and text mining, document summarization, and document search involving numerical expressions can be performed more accurately than before. The invention could also be used to extract fact information from documents and build a fact information database.

An embodiment of the present invention is described below in two parts: the configuration of the embodiment, and the processing performed with that configuration. Document search is assumed as the concrete application that uses the information extracted from documents.

As shown in FIG. 1, the numerical expression processing apparatus 10 of the present invention has a document file input unit 100, a document part extraction unit 200, a numerical expression extraction unit 300, an attribute name extraction unit 400, a thing name extraction unit 500, a scale selection unit 600, an information storage unit 700, and an information search unit 800. These processing units can be implemented in software as programs executed on a computer, in hardware, or as a combination of software and hardware.

The document file input unit 100 is a processing unit for inputting documents. A document may be an electronic document created with a word processor or similar tool, or a document file digitized with an optical character reader (OCR) or similar device. In either case, the document is assumed to be an electronic document encoded as character codes. The type of character encoding is not restricted. The document part extraction unit 200 is a processing unit that extracts the document part from the input document file. The document part is the portion regarded as the content of the document file. Extracting the document part means removing format information and metadata.

The numerical expression extraction unit 300 is a processing unit that detects numerical expressions in the extracted document part. A numerical expression includes a numeric value written in Arabic or kanji numerals; a degree expression placed before the value (such as 約 "about" or だいたい "roughly"); a unit or counter word; an expression of degree that follows the value, unit, or counter word (such as くらい "or so", 強 "a little over", or 以下 "or less"; hereinafter called a degree expression); and an expression of range (such as 〜 or から "from"; hereinafter called a range expression). Known means, such as the means of Patent Document 2, can be used for the numerical expression extraction unit.
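The structure of a numerical expression described above can be illustrated with a small regular-expression sketch. It is not the method of Patent Document 2 or of the embodiment; the word lists, the pattern `NUMERIC_EXPRESSION`, and the function `extract_numeric_expressions` are hypothetical, and kanji numerals are deliberately left out to keep the example short.

```python
import re
import unicodedata

# Hypothetical word lists; a real system would use much larger dictionaries.
PRE_DEGREE = ["約", "だいたい"]           # "about", "roughly"
POST_DEGREE = ["くらい", "強", "以下"]     # "or so", "a little over", "or less"
RANGE_MARKERS = ["から", "〜", "~"]        # range expressions
UNITS = ["歳", "キロメートル", "キロ", "メートル", "cc"]


def _alternation(words):
    # Longest first, so that "キロメートル" is tried before "キロ".
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


_NUM = r"\d+(?:\.\d+)?"
_UNIT = _alternation(UNITS)
NUMERIC_EXPRESSION = re.compile(
    rf"(?P<pre>{_alternation(PRE_DEGREE)})?"
    rf"(?P<low>{_NUM})(?P<low_unit>{_UNIT})?"
    rf"(?:(?:{_alternation(RANGE_MARKERS)})(?P<high>{_NUM})(?P<high_unit>{_UNIT})?)?"
    rf"(?P<post>{_alternation(POST_DEGREE)})?"
)


def extract_numeric_expressions(text):
    # NFKC folds full-width digits and letters (２０, ｃｃ) to ASCII.
    normalized = unicodedata.normalize("NFKC", text)
    results = []
    for m in NUMERIC_EXPRESSION.finditer(normalized):
        high = m["high"]
        results.append({
            "pre": m["pre"],
            "low": float(m["low"]),
            "high": float(high) if high is not None else None,
            "unit": m["high_unit"] or m["low_unit"],
            "post": m["post"],
        })
    return results
```

For the sentence 「容疑者は年齢２０歳から３０歳くらい」 this yields one expression with lower value 20, upper value 30, unit 歳, and the trailing degree expression くらい.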

The attribute name extraction unit 400 is a processing unit that extracts from the document the attribute name whose value is the extracted numerical expression. This processing unit can be implemented with known means, for example by methods (1) to (5) below.
(1) Extract attributes and attribute values using a dictionary prepared in advance. The dictionary is built by starting from a small number of attributes and attribute names and collecting instances that fit sentence patterns such as "X of [attribute]" and "[attribute value] of X". ("Attribute-Evaluation Value Pair Identification by Machine Learning for Opinion Extraction", IPSJ SIG Natural Language Processing, NL-165-4: Reference 1)
(2) Estimate the numeric value and the attribute name from the dependency relations to the numerical expression, according to rules defined in advance. ("Quantity Expression Extraction Based on Dependency Constraints and Priority Rules", IPSJ SIG Natural Language Processing, NL-145-18)
(3) Extract triples of target, attribute, and attribute name using sentence patterns and co-occurrence scores. ("Extracting Attribute Relations from Text", IPSJ SIG Natural Language Processing, NL-164-4: Reference 2)
(4) Extract attributes and attribute values using HTML tags and sentence patterns such as "B of A". ("Automatic Discovery of Attribute Words from Web Documents and Criteria for Manual Evaluation", Journal of Natural Language Processing, Vol. 13, No. 4)
(5) Extract pairs of attribute and attribute name using the TABLE tags of HTML files. ("Extracting attributes and their values from web pages", ACL-02 Student Research Workshop)

The attribute name extraction unit 400 of this embodiment extends method (1) above, which extracts attribute names and attribute values using a dictionary. As shown in FIG. 2, it consists of an attribute name dictionary 401, an attribute name detection unit 402, an attribute name expansion unit 403, and an attribute name / numerical expression pair evaluation unit 404. The extensions are the suffix processing adapted to numerical expressions and the attribute name expansion processing.

The attribute name dictionary 401 registers the attribute names that can appear in documents. For each attribute name, its synonym relations and a representative name are registered. Synonyms of attribute names can be collected manually or automatically by computer. The representative name corresponds to a headword in a dictionary. In the example of FIG. 2, the synonymous attribute names "amount" (金額) and "price" (価格) are both registered in association with the unified representative name "price" (価格). Replacing the attribute name extracted from a document with its representative name before later processing reduces the complexity caused by synonyms.
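A minimal sketch of this normalization step follows. The mapping reproduces the 金額/価格 example from FIG. 2 as described in the text; the names `REPRESENTATIVE_NAME` and `to_representative` are hypothetical, and returning an unregistered name unchanged is an assumption of this sketch rather than something the patent specifies.

```python
# Attribute name -> representative name (headword).
# 金額 ("amount") and 価格 ("price") share the representative name 価格.
REPRESENTATIVE_NAME = {
    "金額": "価格",
    "価格": "価格",
}


def to_representative(attribute_name: str) -> str:
    # Assumption: names missing from the dictionary are passed through as-is.
    return REPRESENTATIVE_NAME.get(attribute_name, attribute_name)
```

Applying `to_representative` both when indexing and when interpreting a query means that a search for 金額 and a search for 価格 reach the same index entries.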

Information about suffixes that form part of attribute names is also registered. Using this information makes it possible to detect unknown attribute names as well. Suffix knowledge is registered manually. Possible suffixes include, for example, 率 (rate), 料 (fee), 量 (quantity), 比 (ratio), 費 (cost), 度 (degree), 長 (length), 重 (weight), 圧 (pressure), 強さ (strength), 硬さ (hardness), 長さ (length), 重さ (weight), 流 (flow), 速さ (speed), 数 (number), and 力 (force). Suffixes are added as needed.

The attribute name detection unit 402 extracts attribute names using the attribute name dictionary 401 and their position relative to the numerical information and punctuation marks. Selecting the appropriate attribute name is left to the attribute name / numerical expression pair evaluation unit 404 described later; the attribute name detection unit 402 extracts every attribute name that could be a candidate. When no attribute can be extracted from the attribute names registered in the dictionary alone, suffixes are used. A suffix is detected from its position relative to the numerical information and punctuation marks, and the attribute name dictionary is then searched for attribute names that end with the detected suffix; these become the candidate attribute names.
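The candidate detection with suffix fallback can be sketched roughly as follows. The text does not define the "relative positional relationship" precisely, so this sketch assumes that candidates are looked for in the clause between the previous punctuation mark and the numeric value. The dictionary contents and the function `candidate_attribute_names` are hypothetical.

```python
# Hypothetical dictionary contents for illustration.
ATTRIBUTE_NAMES = {"含有率", "排気量", "風速", "価格"}
SUFFIXES = ["率", "量", "長さ", "重さ", "速さ"]
CLAUSE_BREAKS = "、。"


def candidate_attribute_names(text: str, number_start: int) -> list:
    """Return all candidate attribute names for the number at number_start."""
    # Assumption: only the clause immediately before the number is searched.
    clause_start = max(text.rfind(ch, 0, number_start) for ch in CLAUSE_BREAKS) + 1
    clause = text[clause_start:number_start]

    # First try attribute names registered in the dictionary.
    direct = sorted(name for name in ATTRIBUTE_NAMES if name in clause)
    if direct:
        return direct

    # Fallback: detect a suffix in the clause, then collect the registered
    # attribute names that end with that suffix.
    candidates = set()
    for suffix in SUFFIXES:
        if suffix in clause:
            candidates.update(n for n in ATTRIBUTE_NAMES if n.endswith(suffix))
    return sorted(candidates)
```

For 「風速10メートルを記録」 with the number starting at index 2, the clause is 風速 and the only candidate is 風速. All candidates are returned on purpose, since choosing among them is the job of the later evaluation step.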

The attribute name expansion unit 403 takes each candidate attribute name extracted by the attribute name detection unit 402, adds character strings that appear immediately before or after it in the document, and treats the resulting string as a new attribute name. For example, after detecting the attribute name "content rate" (含有率) in the text "... carbon content rate ..." (炭素の含有率), it adds the preceding "carbon" (炭素の) and uses "carbon content rate" as the attribute name. This processing is performed when an attribute name registered in the attribute name dictionary needs information about what it refers to. The attribute name dictionary registers this "about what" information required by an attribute name as its target. In the attribute name dictionary of FIG. 2, an ID is registered as the target, and the contents of the ID are registered in a separate target table. Target information can be registered manually one entry at a time or collected automatically from documents. For example, automatic collection can exploit the fact that attribute names requiring this processing appear in documents with the accusative case particle を, as in "contains ○○" (○○を含有). In the case of "carbon content rate", detecting the description "contains carbon" (炭素を含有) in documents beforehand makes it possible to register in the attribute name dictionary that "content rate" takes "carbon" as a target.

The attribute name / numerical expression pair evaluation unit 404 evaluates how appropriate each pair of attribute name and unit or counter word in the numerical expression is. For this purpose, the method described in Reference 1 above, which uses machine learning to evaluate the likelihood of a pair of unit and attribute name, can be used.

The thing name extraction unit 500 is a processing unit that extracts from the document the name of the thing characterized by the extracted pair of numerical expression and attribute name. A thing here means an object or an event. Objects are not limited to physical objects and include abstract ones. This processing is implemented by adding patterns that extract pairs of thing name and attribute name to the pattern-based method of Reference 2 above for extracting pairs of attribute name and attribute value. As shown in FIG. 3, this processing unit consists of a thing name extraction pattern dictionary 501, a thing name detection unit 502, an attribute name complementing unit 503, an ontology-based attribute name complementing unit 504, and an ontology (part-whole relations) 505.

The thing name extraction pattern dictionary 501 registers sentence patterns of the attribute names, numerical expressions, and thing names that can appear in documents, together with co-occurrence scores. An example of a sentence pattern is "[attribute name A][numerical expression V] の [thing name O]" (roughly, "thing O with attribute A of value V"). To extract thing names from sentences in which the attribute name is omitted, sentence patterns of units or counter words and thing names that can appear in documents are also registered, for example "[numerical expression V] の [thing name O]". Co-occurrence scores are registered for two kinds of pairs: the unit or counter word of numerical expression V with attribute name A, and attribute name A with thing name O. They are used to evaluate the likelihood of a sentence pattern. Sentence patterns can be collected from a corpus annotated with dependency relations, following Reference 2, but extraction from a corpus without dependency relations is also acceptable. Reference 2 evaluates co-occurrence using the mutual information of thing name and attribute name and of attribute name and attribute value, but other evaluation methods may be used.

The thing name detection unit 502 extracts thing names using the thing name extraction pattern dictionary 501. For example, with the pattern "[attribute name A][numerical expression V] の [thing name O]", when the attribute name "displacement" (排気量) and the numerical expression "2000cc" are already known, the thing name "automobile" (自動車) can be extracted from 「排気量２０００ｃｃの自動車」 ("an automobile with a displacement of 2000 cc"). Likewise, with the pattern "[numerical expression V] の [thing name O]", when the numerical expression "2000cc" is known, the thing name "automobile" can be extracted from 「２０００ｃｃの自動車」 ("a 2000 cc automobile").
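The two sentence patterns in this example can be approximated with a regular expression, as in the sketch below. It assumes that the attribute name and numerical expression have already been found, and it approximates the extent of a thing name as a run of kanji and katakana, which is a simplification chosen for this illustration. The function name `extract_thing_name` is hypothetical.

```python
import re
import unicodedata
from typing import Optional

# Assumption for this sketch: a thing name is a run of kanji or katakana.
_THING = "(?P<thing>[一-龥ァ-ヶー]+)"


def extract_thing_name(text: str, numeric_expression: str,
                       attribute_name: Optional[str] = None) -> Optional[str]:
    """Apply "[A][V] の [O]" (or "[V] の [O]" when A is omitted)."""
    normalized = unicodedata.normalize("NFKC", text)
    prefix = re.escape(attribute_name) if attribute_name else ""
    pattern = re.compile(prefix + re.escape(numeric_expression) + "の" + _THING)
    match = pattern.search(normalized)
    return match.group("thing") if match else None
```

Calling `extract_thing_name("排気量２０００ｃｃの自動車", "2000cc", "排気量")` returns 自動車, and the same thing name is obtained from 「２０００ｃｃの自動車」 when the attribute name argument is omitted.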

The attribute name complementing unit 503 is a processing unit that determines the attribute name from the unit or counter word of the numerical expression and the thing name when no attribute name can be detected in the document. It uses the co-occurrence scores registered in the thing name extraction pattern dictionary 501, namely those between attribute names and units of numerical expressions and those between thing names and attribute names. The attribute name whose co-occurrence score is the highest and also exceeds a certain threshold is taken to be the omitted attribute name. For example, from "2000 cc engine" (２０００ｃｃのエンジン), the numerical expression "2000cc" and the thing name "engine" are extracted, but the attribute name is unknown. Using the co-occurrence scores in the thing name extraction pattern dictionary 501, the attribute name with a high co-occurrence score with both "engine" and "cc" is "displacement" (排気量), so the attribute name "displacement" can be inferred.
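The selection rule can be sketched as below. The text does not say how the two kinds of co-occurrence score are combined, so this sketch assumes they are simply added; the score values, the threshold, and all names (`UNIT_ATTRIBUTE_SCORE`, `THING_ATTRIBUTE_SCORE`, `complete_attribute_name`) are invented for illustration.

```python
from typing import Optional

# Illustrative co-occurrence scores; the values are made up for this sketch.
UNIT_ATTRIBUTE_SCORE = {
    ("cc", "排気量"): 2.1,   # cc / displacement
    ("cc", "容量"): 1.4,     # cc / capacity
}
THING_ATTRIBUTE_SCORE = {
    ("エンジン", "排気量"): 1.8,   # engine / displacement
    ("エンジン", "出力"): 1.5,     # engine / output
    ("瓶", "容量"): 1.9,           # bottle / capacity
}


def complete_attribute_name(unit: str, thing: str, threshold: float = 3.0) -> Optional[str]:
    """Pick the attribute name that co-occurs best with both unit and thing."""
    from_unit = {a for (u, a) in UNIT_ATTRIBUTE_SCORE if u == unit}
    from_thing = {a for (t, a) in THING_ATTRIBUTE_SCORE if t == thing}

    def combined(attribute: str) -> float:
        # Assumption: the two scores are combined by addition.
        return UNIT_ATTRIBUTE_SCORE[(unit, attribute)] + THING_ATTRIBUTE_SCORE[(thing, attribute)]

    best = max(from_unit & from_thing, key=combined, default=None)
    if best is None or combined(best) <= threshold:
        return None
    return best
```

With these illustrative scores, the unit cc and the thing name エンジン yield 排気量, while cc with 自動車 yields nothing because no attribute name is scored against 自動車.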

When the attribute name complementing unit 503 cannot obtain an attribute name, the ontology-based attribute name complementing unit 504 estimates it. This processing uses the ontology 505, in which part-whole relations are registered. Treating the thing name extracted from the document as the whole, it looks up the thing names that are its parts and runs the co-occurrence-based attribute name estimation with those thing names. For example, suppose no attribute name is obtained for the numerical expression "2000cc" and the thing name "automobile". The ontology 505 shows that "engine" is a part of "automobile". Using "engine" and "2000cc", the attribute name "displacement" is obtained by the method described above.
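The fallback through part-whole relations might look like the following sketch. To keep it self-contained, the co-occurrence-based estimation is replaced by a small lookup table (`KNOWN_ESTIMATES`); that table, the ontology contents, and the function names are all assumptions made for illustration.

```python
from typing import Optional

# Whole -> parts. Hypothetical contents of a part-whole ontology.
PART_OF = {"自動車": ["タイヤ", "エンジン"]}   # automobile: tire, engine

# Stand-in for the co-occurrence-based estimation of an attribute name
# from (unit, thing name); a real system would compute this from scores.
KNOWN_ESTIMATES = {("cc", "エンジン"): "排気量"}


def estimate_attribute_name(unit: str, thing: str) -> Optional[str]:
    return KNOWN_ESTIMATES.get((unit, thing))


def estimate_with_ontology(unit: str, thing: str) -> Optional[str]:
    # First try the thing name found in the document itself.
    direct = estimate_attribute_name(unit, thing)
    if direct is not None:
        return direct
    # Otherwise retry with each registered part of that thing.
    for part in PART_OF.get(thing, []):
        estimated = estimate_attribute_name(unit, part)
        if estimated is not None:
            return estimated
    return None
```

Here `estimate_with_ontology("cc", "自動車")` fails on 自動車 directly, skips タイヤ, and succeeds through エンジン, giving 排気量.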

The scale selection unit 600 uses the attribute name and the unit or classifier contained in the numerical expression to associate the numerical expression with a scale, and converts the numerical value in the expression into the unit of that scale. As shown in FIG. 4, the scale selection unit 600 consists of a scale selection table 601, a unit structure analysis unit 602, a conversion unit 603, a prefix conversion table 604, and a unit conversion table 605.

As shown in FIG. 4, the scale selection table 601 has the columns attribute name (representative name), unit/classifier, scale, and unit of the scale. Scales and their units can be defined with reference to the Measurement Act (計量法) and the Measurement Unit Order (計量単位令), but scales and units not covered by these statutes can also be defined. Given an attribute name (representative name) and a unit or classifier, the corresponding scale is selected. For example, when the numerical expression is an age, it is mapped to the calendar year of birth as its scale. When a numerical expression representing a distance from some point is obtained, it is mapped to the corresponding latitude and longitude. Unit conversion relations, such as mapping "kilometers" to "meters", can also be registered in the table. Abbreviated unit notations, such as "kilo" for "kilogram", can likewise be registered and associated with the scale "weight".
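A table lookup of this kind might look like the sketch below. The rows of `SCALE_TABLE` are hypothetical (FIG. 4 is not reproduced in this text), and `birth_year` simply illustrates the age-to-calendar-year mapping with a caller-supplied reference year.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rows of scale selection table 601:
# (representative attribute name, unit or classifier) -> (scale, unit of the scale)
SCALE_TABLE: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("age", "years old"): ("calendar year", "year"),
    ("height", "cm"): ("length", "m"),
    ("distance", "km"): ("length", "m"),
    ("weight", "kilo"): ("weight", "kg"),
}


def select_scale(attribute: str, unit: str) -> Optional[Tuple[str, str]]:
    """Return (scale, unit of the scale), or None when no row matches."""
    return SCALE_TABLE.get((attribute, unit))


def birth_year(age: int, reference_year: int) -> int:
    """Map an age onto the calendar-year scale as an approximate year of birth."""
    return reference_year - age
```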

A distance from some point is mapped to a latitude and longitude. When a rough location is sufficient, it can instead be mapped to a postal code. Implementing these mappings requires a database that registers the correspondence between place names and latitude/longitude, and a database that registers the correspondence between place names and postal codes. Such correspondence data is published or sold by public institutions and companies, so it is not described in detail here.

Once the place name, direction, and distance are known, the latitude and longitude of the reference point are first looked up from its place name. Given the latitude and longitude of the reference point, the latitude and longitude of a position at a known distance from it can be calculated with the equations below. Once that latitude and longitude are known, a place name can be looked up from them, and a postal code can be looked up from the place name.
Direction: R (in radians, measured from the east-west direction)
Distance: L (in meters)
Latitude of the reference point: I
Longitude of the reference point: K
Latitude to be found: i
Longitude to be found: k
Constant used in the calculation (equatorial radius): A (= 6378137 meters)
$$\left(\frac{L}{A}\right)^2 = (i - I)^2 + \{(k - K)\cos I\}^2$$
$$R = \tan^{-1}\left\{\frac{i - I}{(k - K)\cos I}\right\}$$
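Solving the two equations above for the unknown latitude i and longitude k gives i = I + (L/A) sin R and k = K + (L/A) cos R / cos I. The sketch below implements that solution. It assumes that all angles are in radians, that R = 0 points due east, and that the distance is small enough for the flat approximation in the equations to hold; the function and parameter names are illustrative.

```python
import math
from typing import Tuple

EQUATOR_RADIUS_M = 6378137.0  # constant A from the text


def offset_position(lat: float, lon: float, direction: float,
                    distance_m: float) -> Tuple[float, float]:
    """Return (latitude, longitude) of the point `distance_m` meters away.

    lat, lon and direction are in radians; direction is measured from east.
    """
    d = distance_m / EQUATOR_RADIUS_M           # L / A
    new_lat = lat + d * math.sin(direction)     # i = I + (L/A) sin R
    new_lon = lon + d * math.cos(direction) / math.cos(lat)  # k = K + (L/A) cos R / cos I
    return new_lat, new_lon
```

Substituting the result back into both equations reproduces L and R, which is a convenient self-check for any implementation.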

For classifiers attached to the right of a number when counting objects (such as "個" in "1個" or "台" in "1台"), the scale is "count" and the unit is "個" (pieces). For proportions such as content, the scale is "ratio" and the unit is %. For units that determine the scale even without an attribute name, the attribute name column is left blank.

The unit structure analysis unit 602 analyzes the structure of a unit. Part of the analysis rules is shown as an example in FIG. 5, written in the standard grammar notation BNF (Backus-Naur Form). A parser for rules such as those in FIG. 5 can be implemented in the same way as a programming language compiler. Note that the rules in FIG. 5 can parse valid units, but the range of derived units they define also includes inappropriate units. For example, when the unit "kg/m", which represents weight per unit length, is parsed, rule 4 recognizes the prefix "k", rule 2 recognizes the weight unit "g", rule 10 recognizes the derived weight unit "kg", rule 1 recognizes the length unit "m", rule 13 recognizes "kg" and "m" as units, rule 14 recognizes "kg" and "m" as partial derived units, and rule 16 recognizes that the two partial derived units and "/" form the derived unit "kg/m".
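The rules of FIG. 5 are not reproduced in this text, so the following is only a simplified sketch of the same idea: a unit is an optional prefix followed by a base unit, and a derived unit is a sequence of units separated by "/". The sets `PREFIXES` and `BASE_UNITS` are hypothetical subsets chosen for the examples.

```python
from typing import List, Tuple

PREFIXES = {"k", "c", "m"}        # kilo, centi, milli (hypothetical subset)
BASE_UNITS = {"m", "g", "dyne"}   # hypothetical subset of base units


def parse_unit(token: str) -> Tuple[str, str]:
    """Split one unit into (prefix, base unit); the prefix may be empty."""
    # Check bare base units first so that "m" is a meter, not the prefix milli.
    if token in BASE_UNITS:
        return ("", token)
    if token and token[0] in PREFIXES and token[1:] in BASE_UNITS:
        return (token[0], token[1:])
    raise ValueError(f"unknown unit: {token!r}")


def parse_derived_unit(text: str) -> List[Tuple[str, str]]:
    """Parse a derived unit such as "kg/m" into its partial units."""
    return [parse_unit(part) for part in text.split("/")]
```

Under these assumptions, "kg/m" parses into the partial units ("k", "g") and ("", "m"), mirroring the rule sequence described for FIG. 5.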

The conversion unit 603 uses the unit structure analysis result, the prefix conversion table, and the unit conversion table to convert the numerical value of the extracted numerical expression into the unit associated with the scale. As shown in the prefix conversion table 604 of FIG. 4, the exponent N of the multiplier 10^N is registered for each prefix. As shown in FIG. 4, the unit conversion table 605 registers the source unit, the associated scale, the target unit, and the conversion formula. In the formula, the numerical value extracted from the numerical expression is substituted for X.
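The two tables can be sketched as plain dictionaries, as below. The entries are illustrative assumptions rather than the contents of FIG. 4, and the conversion formula is modeled as a function of X.

```python
from typing import Callable, Dict, Tuple

# Prefix conversion table: prefix -> exponent N of the multiplier 10^N.
PREFIX_EXPONENT: Dict[str, int] = {"": 0, "k": 3, "c": -2, "m": -3}

# Unit conversion table: source unit -> (scale, target unit, formula in X).
UNIT_CONVERSION: Dict[str, Tuple[str, str, Callable[[float], float]]] = {
    "cc": ("volume", "m^3", lambda x: x * 1e-6),
}


def apply_prefix(value: float, prefix: str) -> float:
    """Scale a value by the power of ten registered for the prefix."""
    return value * 10.0 ** PREFIX_EXPONENT[prefix]


def convert_unit(value: float, source_unit: str) -> Tuple[str, str, float]:
    """Apply the registered formula; returns (scale, target unit, converted value)."""
    scale, target_unit, formula = UNIT_CONVERSION[source_unit]
    return scale, target_unit, formula(value)
```

For instance, `apply_prefix(170, "c")` turns 170 cm into approximately 1.7 m, and `convert_unit(2000, "cc")` yields a volume of approximately 2.0 × 10^-3 cubic meters. Floating-point results should be compared with a tolerance rather than for exact equality.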

The information storage unit 700 stores the information extracted from a document, and the information obtained by processing it, in association with the name of the source document file. The information is registered in the information storage table 701 shown in FIG. 6. The table fields are: thing name 702, start position of the thing name 703, end position of the thing name 704, attribute name 705, start position of the attribute name 706, end position of the attribute name 707, representative name of the attribute name in the attribute name dictionary 708, numerical value contained in the numerical expression 709, unit or classifier contained in the numerical expression 710, degree expression contained in the numerical expression 711, range end value when the numerical expression is a range 712, start position of the numerical expression 713, end position of the numerical expression 714, associated scale 715, value obtained by converting the numerical value 709 (716), and value obtained by converting the numerical value 712 (717). Start and end positions are expressed as byte offsets from the beginning of the document.

When the numerical expression is a range, the larger value of the range is registered as the range end value 712 and its converted value 717. In that case, the numerical value 709 and its converted value 716 hold the smaller value of the range. Infinity or infinitesimal values may also be registered in the numerical value 709 and its converted value 716, and in the range end value 712 and its converted value 717.

The search unit 800 searches the information accumulated in the information storage table 701 by the information storage unit, using the search query entered by the user. As shown in FIG. 7, the search unit consists of a search query input unit 801, a thing name search unit 802, an attribute name and numerical expression search unit 803, and a result output unit 804.

The search query input unit 801 receives a search query from the user. A search query consists of one or more thing names and zero or more pairs of an attribute name and a numerical expression. In this embodiment the query is entered on the user interface screen shown in FIG. 8, although the input method is not limited to this. On that screen the user enters the thing name, attribute name, and numerical value as character strings. In the degree field, the user enters an expression such as "about", "or more", or "less than".

The thing name search unit 802 searches the information storage table 701 for the thing name. The attribute name and numerical expression search unit 803 has the scale selection unit 600 select a scale from the attribute name and numerical expression and convert the numerical value, and then uses the result to search the information storage table 701 within the range narrowed down by the thing name search. The result output unit 804 outputs the search results, for example as a list of file names. A display, paper, a recording medium, or the like can be used as the output destination. The results can also be ranked by a statistical measure.

Next, the operation of the method of the present invention is described using four example sentences.
[Example sentence 1]
"Characteristics of the criminal
About 25 years old, about 170 cm tall, medium build"
[Example sentence 2]
"Refineries and filling stations will be required to reduce the carbon content of gasoline by 10% by year 20."
[Example sentence 3]
"Compared with a car with a displacement of 2000 cc, the insurance amount for a light car was about 60%, a very economical premium."
[Example sentence 4]
"The surface energy of the alignment film was set to be from 30 dyne/cm to 40 dyne/cm."

First, the document file input unit 100 reads the document files containing example sentences 1 to 4. Next, the document part extraction unit 200 extracts from each document file the document part containing the corresponding example sentence.

Next, the numerical expression extraction unit 300 extracts the numerical expressions. By detecting runs of Arabic numerals, it finds "25" and "170" in example sentence 1, "20" and "10" in example sentence 2, "2000" and "60" in example sentence 3, and "30" and "40" in example sentence 4. It then looks for a unit or classifier to the right of each run of numerals and finds "25 years old" and "170 cm" in example sentence 1, "year 20" and "10%" in example sentence 2, "2000 cc" and "60%" in example sentence 3, and "30 dyne/cm" and "40 dyne/cm" in example sentence 4. Finally, it looks for degree expressions and range expressions to the left and right of each detected span and finds "about 25 years old" and "about 170 cm" in example sentence 1, "by year 20" and "10%" in example sentence 2, "2000 cc" and "about 60%" in example sentence 3, and "30 dyne/cm to 40 dyne/cm" in example sentence 4. This completes the extraction of numerical expressions.
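The three passes described above (numerals, then a unit to the right, then a degree expression on either side) can be approximated with a single regular expression over the original Japanese text. The sketch below is an assumption-laden illustration: the unit and degree lists are small hypothetical samples, full-width characters are normalized with NFKC first, and range expressions such as "から" are not handled.

```python
import re
import unicodedata
from typing import List, NamedTuple


class NumericExpression(NamedTuple):
    value: float
    unit: str
    degree: str  # degree expression such as "約" or "くらい"; "" if none


UNITS = ["dyne/cm", "cm", "cc", "歳", "年", "%"]   # hypothetical sample
DEGREES_BEFORE = ["約"]                            # "about", placed before the number
DEGREES_AFTER = ["くらい", "までに"]                # "about", "by", placed after the unit

PATTERN = re.compile(
    "(" + "|".join(DEGREES_BEFORE) + ")?"
    r"(\d+(?:\.\d+)?)"
    "(" + "|".join(map(re.escape, UNITS)) + ")"
    "(" + "|".join(DEGREES_AFTER) + ")?"
)


def extract(text: str) -> List[NumericExpression]:
    """Extract (value, unit, degree) triples from Japanese text."""
    # NFKC turns full-width digits, letters and "％" into their ASCII forms.
    normalized = unicodedata.normalize("NFKC", text)
    results = []
    for match in PATTERN.finditer(normalized):
        before, number, unit, after = match.groups(default="")
        results.append(NumericExpression(float(number), unit, before or after))
    return results
```

A dictionary-driven implementation would replace the hard-coded lists with the unit and degree-expression dictionaries of the apparatus.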

The attribute name extraction unit 400 then extracts the attribute names. Using the attribute name dictionary 401, for example sentence 1 no attribute name is obtained for "about 25 years old", and the attribute name "height" is obtained for "about 170 cm". For example sentence 2, no attribute name is obtained for "year 20", and the attribute name "content" is obtained for "10%". For example sentence 3, the attribute name "displacement" is obtained for "2000 cc" and the attribute name "amount" for "about 60%". For example sentence 4, the attribute name "surface energy" is obtained for "30 dyne/cm to 40 dyne/cm". The attribute name expansion unit 403 then extends the attribute name of "10%" in example sentence 2 to "carbon content".

The thing name extraction unit 500 then extracts the thing names. Using the thing name extraction pattern dictionary 501, for example sentence 1 the thing name "criminal" is obtained for the numerical expression "about 25 years old", and the thing name "criminal" is also obtained for the numerical expression "about 170 cm" with the attribute name "height". For example sentence 2, no thing name is obtained for the numerical expression "year 20", and the thing name "gasoline" is obtained for the numerical expression "10%" with the attribute name "carbon content". For example sentence 3, the thing name "car" is obtained for the numerical expression "2000 cc" with the attribute name "displacement", and the thing name "insurance" for the numerical expression "about 60%" with the attribute name "amount". For example sentence 4, the thing name "alignment film" is obtained for the numerical expression "30 dyne/cm to 40 dyne/cm" with the attribute name "surface energy".

For example sentence 1, the attribute name complementing unit 503 obtains the attribute name "age" for the thing name "criminal" and the numerical expression "25 years old", from the co-occurrence score between the thing name "criminal" and the attribute name "age" and the co-occurrence score between the attribute name "age" and the unit "years old".

The scale selection unit 600 then associates each extracted pair of attribute name and numerical expression with a scale. For example sentence 1, for the attribute name "age" and the numerical expression "about 25 years old", the table of FIG. 4 gives the scale "calendar year" with the unit "year". The conversion unit uses the current year, 2008, to calculate 2008 - 25 = 1983, and this value is taken as the value on the scale "calendar year". The current year can be obtained from the computer's operating system, but instead of the current year, the document's property information or a year mentioned in the document can also be used. For the attribute name "height" and the numerical expression "about 170 cm", the table of FIG. 4 gives the scale "length" with the unit "meter". Analyzing the structure of the unit "cm" yields the prefix "c" and the unit "m", and the conversion value 10^-2 of the prefix "c" converts "170" to "1.7", giving "1.7 meters".

For example sentence 2, "year 20" has no thing name, so it is not processed any further. For the attribute name "carbon content" and the numerical expression "10%", the scale "ratio" with the unit "%" is obtained; the numerical expression already uses this unit, so no conversion is performed. For example sentence 3, for the attribute name "displacement" and the numerical expression "2000 cc", the scale "volume" with the unit "cubic meter" is obtained, the conversion factor from "cc" to "cubic meter" is 10^-6, and the conversion yields "2.0 × 10^-3 cubic meters". For the attribute name "amount" and "60%", no scale is obtained, so no conversion is performed.

For example sentence 4, for "30 dyne/cm to 40 dyne/cm" with the attribute name "surface energy", the scale "energy per unit area" with the unit "joule/square meter" is obtained. The structure analysis of "dyne/cm" gives "(newton × 10^-5) / (meter × 10^-2)" = "newton/meter" × "10^-3". The unit conversion table provides the conversion from "newton/meter" to "joule/square meter", which yields "3 × 10^-2 joule/square meter to 4 × 10^-2 joule/square meter".
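The arithmetic behind this conversion can be written out step by step. It relies only on the standard definitions 1 dyne = 10^-5 N, 1 cm = 10^-2 m, and 1 J = 1 N·m, so that N/m and J/m^2 are the same unit.

```latex
\begin{aligned}
1\ \mathrm{dyne/cm} &= \frac{10^{-5}\ \mathrm{N}}{10^{-2}\ \mathrm{m}}
  = 10^{-3}\ \mathrm{N/m} = 10^{-3}\ \mathrm{J/m^2} \\
30\ \mathrm{dyne/cm} &= 30 \times 10^{-3}\ \mathrm{J/m^2} = 3 \times 10^{-2}\ \mathrm{J/m^2} \\
40\ \mathrm{dyne/cm} &= 40 \times 10^{-3}\ \mathrm{J/m^2} = 4 \times 10^{-2}\ \mathrm{J/m^2}
\end{aligned}
```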

The information storage unit 700 then registers the extracted attribute names, numerical expressions, thing names, associated scales, and converted values in the information storage table 701. For example sentence 1, two records are registered: thing name "criminal", attribute name "age", numerical value "25", unit "years old", scale "calendar year", converted value "1983", degree expression "about"; and thing name "criminal", attribute name "height", numerical value "170", unit "cm", scale "length", converted value "1.7". For example sentence 2, thing name "gasoline", attribute name "carbon content", numerical value "10", unit "%", and scale "ratio" are registered. For example sentence 3, thing name "car", attribute name "displacement", numerical value "2000", unit "cc", scale "volume", and converted value "2.0 × 10^-3 cubic meters" are registered. For example sentence 4, thing name "alignment film", attribute name "surface energy", numerical values "30" and "40", unit "dyne/cm", scale "energy per unit area", and converted values "3 × 10^-2" and "4 × 10^-2" are registered. FIG. 9A and FIG. 9B show part of the information storage table after the information extracted from example sentences 1 to 3 has been registered.
Finally, the operation of the information search unit 800 is described. Suppose that the search query of example sentence 5 is entered.

[Example sentence 5]
Thing name: car, attribute name: total displacement, numerical value: 2000 cc
The search query input unit 801 receives the search query from the user. The search query of example sentence 5 is entered on the user interface screen shown in FIG. 8, resulting in the state shown in FIG. 10. Clicking the search button executes the search; specifically, the query is sent to the thing name search unit 802.

The thing name search unit 802 searches the tables of FIG. 9A and FIG. 9B for the thing name "car" of the search query. In the table of FIG. 9B, the thing name field of the record whose file name is "name of the file containing example sentence 3" contains the character string "car", so the data of each field of the record with that file name is extracted.

In the attribute name and numerical expression search unit 803, the attribute name "total displacement" is converted into the representative name "displacement" by the attribute name dictionary, and the numerical expression "2000 cc" is converted into "2.0 × 10^-3 cubic meters" by the scale selection unit. Among the data found by the thing name search unit 802, whose thing name contains the character string "car", the tables of FIG. 9A and FIG. 9B are searched for data whose attribute name is "displacement" and whose converted value is "2.0 × 10^-3 cubic meters".
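The two-stage search (thing name first, then representative attribute name and converted value) can be sketched as follows. The record layout, the file names, and the small lookup tables are hypothetical simplifications of the information storage table 701, the attribute name dictionary, and the unit conversion table. Comparing converted values with a tolerance is a design choice of this sketch, not something the text specifies.

```python
import math
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    file_name: str
    thing: str
    attribute: str          # representative attribute name
    converted_value: float  # value on the absolute scale


# Hypothetical synonym and conversion tables.
REPRESENTATIVE: Dict[str, str] = {"total displacement": "displacement"}
TO_CUBIC_METERS: Dict[str, float] = {"cc": 1e-6}


def search(records: List[Record], thing: str, attribute: str,
           value: float, unit: str) -> List[str]:
    """Return the file names of records matching the normalized query."""
    representative = REPRESENTATIVE.get(attribute, attribute)
    target = value * TO_CUBIC_METERS[unit]
    # Stage 1: narrow down by thing name (substring match).
    candidates = [r for r in records if thing in r.thing]
    # Stage 2: match representative attribute name and converted value.
    return [
        r.file_name
        for r in candidates
        if r.attribute == representative
        and math.isclose(r.converted_value, target, rel_tol=1e-9)
    ]
```

Because both the stored values and the query are normalized to the same representative name and absolute scale, a query for "total displacement 2000 cc" can match a document that says "displacement 2000 cc", or one that states the same volume in another unit.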

The result output unit 804 displays the search results on the display connected to the computer on which the search query was entered. The display screen looks, for example, like FIG. 11. In this example the search results contain three files. Below each file name, the part of the file that matched the query is shown.

FIG. 1 is a functional configuration diagram of the numerical expression processing apparatus according to the present invention.
FIG. 2 is a functional configuration diagram of the attribute name extraction unit.
FIG. 3 is a functional configuration diagram of the thing name extraction unit.
FIG. 4 is a functional configuration diagram of the scale selection unit.
FIG. 5 shows an example of unit analysis rules written in BNF.
FIG. 6 shows an example of the information storage table.
FIG. 7 is a functional configuration diagram of the information search unit.
FIG. 8 shows the user interface screen.
FIG. 9A shows an example of the information storage table with information registered.
FIG. 9B shows an example of the information storage table with information registered.
FIG. 10 shows an example of the user interface screen with a query entered.
FIG. 11 shows an example of the user interface screen with the results of a query.

Explanation of symbols

10 Numerical expression processing apparatus
100 Document file input unit
200 Document part extraction unit
300 Numerical expression extraction unit
400 Attribute name extraction unit
401 Attribute name dictionary
402 Attribute name detection unit
403 Attribute name expansion unit
404 Attribute name / numerical expression pair evaluation unit
500 Thing name extraction unit
501 Thing name extraction pattern dictionary
502 Thing name extraction
503 Attribute name complementing unit
504 Ontology-based attribute name complementing unit
505 Ontology (part-whole relations)
600 Scale selection unit
601 Scale selection table
602 Unit structure analysis unit
603 Conversion unit
604 Prefix conversion table
605 Unit conversion table
700 Information storage unit
701 Information storage table
702 Thing name
703 Start position of the thing name
704 End position of the thing name
705 Attribute name
706 Start position of the attribute name
707 End position of the attribute name
708 Representative name of the attribute name in the attribute name dictionary
709 Numerical value contained in the numerical expression
710 Unit or classifier contained in the numerical expression
711 Degree expression contained in the numerical expression
712 Range end value when the numerical expression is a range
713 Start position of the numerical expression
714 End position of the numerical expression
715 Associated scale
716 Value obtained by converting the numerical value 709
717 Value obtained by converting the numerical value 712
800 Information search unit
801 Search query input unit
802 Thing name search unit
803 Attribute name and numerical expression search unit
804 Result output unit

Claims

1. A numerical expression processing apparatus comprising:
a numerical expression extraction unit that extracts a numerical value and the unit of the numerical value from an electronic document;
an attribute name extraction unit that extracts, from the document, an attribute name that has the numerical value as its attribute value;
a scale selection unit that has a table in which an attribute name, a unit of the numerical expression of the attribute name, and an absolute scale are registered in association with each other, and that uses the information registered in the table to associate the numerical value extracted by the numerical expression extraction unit with the absolute scale predetermined for each attribute name and to convert the numerical value into a numerical value on the absolute scale; and
a thing name extraction unit that has a thing name extraction pattern dictionary in which sentence patterns of attribute names, numerical expressions, and thing names are registered, and that extracts from the document the name of the thing having the pair of the numerical expression extracted by the numerical expression extraction unit and the attribute name extracted by the attribute name extraction unit.

2. The numerical expression processing apparatus according to claim 1, wherein co-occurrence scores between attribute names and units of numerical expressions and co-occurrence scores between thing names and attribute names are registered in the thing name extraction pattern dictionary, and the thing name extraction unit uses the co-occurrence scores registered in the thing name extraction pattern dictionary to estimate, on the basis of the numerical value extracted from the document, the unit of the numerical value, and the name of the thing having the numerical value, an attribute name that has the numerical value as its value.

3. The numerical expression processing apparatus according to claim 2, wherein the thing name extraction unit estimates an attribute name that has the numerical expression as its value by using an ontology.

4. The numerical expression processing apparatus according to claim 1, wherein the attribute name extraction unit has a dictionary in which pairs of attribute names and representative names are registered such that synonymous attribute names are associated with the same representative name, and uses the dictionary to obtain the representative name corresponding to an attribute name extracted from the document.

5. The numerical expression processing apparatus according to claim 4, having a table that stores a document name together with, for a numerical value contained in the document, the thing name, the representative name of the attribute name, and the numerical value converted to the absolute scale associated with the attribute name.

6. The numerical expression processing apparatus according to claim 1, wherein the scale selection unit converts an age into a year of birth.

7. The numerical expression processing apparatus according to claim 1, wherein the scale selection unit converts a numerical expression representing a position into a latitude and longitude.

8. The numerical expression processing apparatus according to claim 1, wherein the scale selection unit converts a numerical expression representing a position into a postal code.

9. The numerical expression processing apparatus according to claim 1, wherein the attribute name extraction unit has a dictionary in which suffix information that forms part of an attribute name is registered, and estimates the attribute name using the suffix information.

10. The numerical expression processing apparatus according to claim 1, wherein the attribute name extraction unit has a dictionary in which attribute names and information on the targets that can take those attribute names are registered, and creates an extended attribute name by combining an attribute name detected in the document by matching against the dictionary with a character string that appears before or after the attribute name and is registered in the dictionary as a target that can take that attribute name.

11. The numerical expression processing apparatus according to claim 1, having a function of estimating, using information registered in a table, an attribute name that has the numerical expression as its value.

12. The numerical expression processing apparatus according to claim 1, wherein the regular notation of a unit is estimated from an abbreviated unit notation by using the attribute name.

13. The numerical expression processing apparatus according to claim 1, having a function of analyzing the structure of a derived unit.

14. The numerical expression processing apparatus according to claim 5, comprising: a search query input unit that receives input of a thing name, an attribute name, and a numerical value; and an information search unit that converts the attribute name input to the search query input unit into the representative name, converts the input numerical value into a numerical value on the absolute scale corresponding to the input attribute name, and refers to the table to search for document files containing the input thing name, the converted representative name, and the converted numerical value.