WO2018025317A1 - Natural language processing device and natural language processing method - Google Patents

Natural language processing device and natural language processing method Download PDF

Info

Publication number
WO2018025317A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
model
natural language
language processing
numerical
Prior art date
Application number
PCT/JP2016/072583
Other languages
French (fr)
Japanese (ja)
Inventor
康嗣 森本
利彦 柳瀬
芳樹 丹羽
利昇 三好
Original Assignee
Hitachi, Ltd. (株式会社日立製作所)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi, Ltd.
Priority to JP2018531005A priority Critical patent/JP6546703B2/en
Priority to PCT/JP2016/072583 priority patent/WO2018025317A1/en
Publication of WO2018025317A1 publication Critical patent/WO2018025317A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to a syntactic analysis technique for natural language processing.
  • in particular, the present invention relates to a technique for extracting attribute / attribute value pairs from text.
  • the amount of digitized documents accessible to users is increasing with the spread of personal computers and the Internet. Electronic documents are unstructured data and are difficult to handle with a computer. Therefore, in order to effectively use a large amount of digitized documents by structuring them, expectations for natural language processing are increasing.
  • One technique for structuring digitized documents is attribute / attribute value extraction.
  • the attribute / attribute value extraction is a technique for extracting pairs of an attribute such as “sex” and an attribute value such as “female” from text. It is used, for example, to extract from a variety of information sources the attributes describing a product's specifications, together with their attribute values, and to display them in an integrated manner. Among such attributes, an attribute whose attribute value is a numerical value is called a numerical attribute.
  • for product specifications, weight and size are numerical attributes. Such numerical attributes are quantitative and objective information, and are particularly valuable among attribute / attribute value information. In view of this importance, known techniques exist for automatically extracting attribute / attribute value pairs, including those of numerical attributes. These known techniques are described below.
  • Non-Patent Document 1 describes the most standard technique for extracting attribute / attribute value pairs of numerical attributes, which uses syntactic information as a clue.
  • Non-Patent Document 1 discloses, for example, a technique in which, for a sentence such as “Weight is 10 Kg”, the numeric character string “10 Kg” is first extracted as an attribute value candidate, and “Weight” is then recognized as the attribute from the result of parsing.
  • however, a parsing technique for obtaining syntactic information cannot always produce a correct analysis at the current technical level. Therefore, a semantic method that uses the strength of the relationship between attributes and attribute values as a clue is required.
  • as known techniques for extracting attribute / attribute value pairs using information other than syntactic information, there is a technique that uses “proximity” as a clue and a technique that uses the “correlation” between attributes and attribute values as a clue.
  • in Non-Patent Document 2, as a first step, phrases that are candidates for attributes or attribute values are extracted from the text. As a second step, the correspondence between attribute candidates and attribute value candidates is determined. In particular, for the second step, a technique is disclosed that uses “syntactic information”, “proximity”, and “correlation” in an appropriate combination. “Proximity” is an approximation of syntactic information, and “correlation” is semantic information.
  • Non-Patent Document 2 uses the correlation indicating the strength of the semantic relationship between attributes and attribute values, so that attributes and attribute values can be associated even when syntactic information cannot be used as a clue.
  • for attributes whose values are not numerical, the set of possible attribute values is finite and the number of value types is not very large. Therefore, the strength of the relationship between the attribute and the attribute value can be enumerated for all attribute values. For example, when “manufacturing company” is considered as an attribute of a certain product, attribute values such as “Company A”, “Company B”, and “Company C” exist, and the strength of the relationship can be considered for each attribute / attribute value pair, such as <manufacturing company, Company A>, <manufacturing company, Company B>, and <manufacturing company, Company C>. Specifically, a statistic such as pointwise (self-) mutual information may be calculated based on the appearance frequency of each pair. By learning the strength of such relationships from correct-answer data or the like, it becomes possible to extract attribute / attribute value pairs from text.
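  • The co-occurrence statistic mentioned above can be sketched as follows. This is a minimal illustration assuming pointwise mutual information over invented pair counts; it is not code from the patent itself.

```python
import math
from collections import Counter

def pmi_scores(pairs):
    """Pointwise mutual information for each observed attribute / attribute-value pair.

    pairs: (attribute, attribute value) occurrences, e.g. taken from correct-answer data.
    """
    pair_counts = Counter(pairs)
    attr_counts = Counter(a for a, _ in pairs)
    val_counts = Counter(v for _, v in pairs)
    n = len(pairs)
    return {(a, v): math.log2((c / n) / ((attr_counts[a] / n) * (val_counts[v] / n)))
            for (a, v), c in pair_counts.items()}

# Invented occurrence data.
pairs = ([("manufacturing company", "Company A")] * 4
         + [("manufacturing company", "Company B")] * 2
         + [("color", "red")] * 3
         + [("color", "Company B")])
scores = pmi_scores(pairs)
# A pair that co-occurs more often than chance gets a higher score.
assert scores[("manufacturing company", "Company A")] > scores[("color", "Company B")]
```

Pairs with a high PMI score are those whose attribute and attribute value co-occur more often than chance, which is exactly the strength-of-relationship signal described above.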
  • in contrast, the numerical values that serve as attribute values are in principle continuous, so there are infinitely many possible values.
  • even when the attribute value is represented by an integer and is therefore discrete, the number of distinct attribute values becomes very large. It therefore cannot be expected that all attribute values appear in the data used for learning, and it is difficult to learn in advance the strength of the relationship between an attribute and its attribute values by the method used for non-numerical attributes.
  • An object of the present invention is to extract attribute / attribute value pairs of numerical attributes accurately and at high speed by providing means for calculating the strength of the relationship between a numerical attribute and an attribute value.
  • a representative example of the present invention is a natural language processing apparatus that analyzes the document structure of input text data, comprising: an extraction unit that extracts candidates for pairs of a numerical attribute and an attribute value from the text data; a calculation unit that calculates values indicating the validity of the extracted numerical attribute and attribute value pair candidates, based on a model representing the distribution of attribute values corresponding to a predetermined numerical attribute;
  • and a determination unit that determines a pair of a numerical attribute and an attribute value from the extracted candidates, based on the calculated values indicating validity.
  • according to the present invention, attribute / attribute value pairs of numerical attributes appearing in text can be extracted with high accuracy and at high speed. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.
  • the structure of a numerical attribute extraction apparatus is shown. An outline of numerical attribute extraction processing executed by the numerical attribute extraction program will be described.
  • the flowchart of the numerical value attribute extraction process by a numerical value attribute extraction program is shown. A configuration example of correct text is shown.
  • the structural example of an attribute / attribute value pair list is shown.
  • the flowchart of an attribute / attribute value relationship model learning process is shown.
  • the structural example of an attribute / attribute value relationship model is shown.
  • the flowchart of a relationship model learning process between attributes is shown.
  • a configuration example of the inter-attribute relationship model is shown. An explanatory diagram of the attribute / attribute value pair determination process is shown.
  • the flowchart of an attribute / attribute value pair determination process is shown.
  • the flowchart of an attribute / attribute value pair determination process is shown.
  • a method of extracting attribute / attribute value pairs of numerical attributes from text is disclosed. More specifically, a distribution of attribute values for each attribute is learned as a model, and a numerical attribute is determined based on the model. A method of determining is disclosed.
  • One of the objects of the present disclosure is to extract attribute / attribute value pairs with high accuracy by learning the strength of the relationship between the attributes of attribute values and attribute values from correct data.
  • An example of the technology of the present disclosure is to learn the distribution of attribute values for each attribute from correct data as a model, and when a certain numerical value is given, the strength of the relationship between the given numerical value and each attribute is used as the posterior probability. Calculate and extract attributes based on the calculated posterior probabilities.
  • the accuracy of attribute / attribute value pair extraction can be improved by learning the strength of the relationship between attributes from correct data as a model and considering the order of appearance of attributes.
  • FIG. 1 shows the configuration of the numerical attribute extraction device 100 of the present embodiment.
  • the numerical attribute extraction device 100 includes a CPU 101, a main memory 102, an input / output device 103, and a disk device 110.
  • the CPU 101 is a processor, and executes various processes by executing programs stored in the main memory 102. Specifically, the CPU 101 loads a program stored in the disk device 110 into the main memory 102 and executes the program loaded in the main memory 102. The program may be loaded from the external server to the main memory 102 via the network.
  • the main memory 102 stores programs executed by the CPU 101 and data required by the CPU 101.
  • the input / output device 103 receives input of information from the user and outputs information in response to an instruction from the CPU 101.
  • the input / output device 103 is an input device such as a keyboard and a mouse, and an output device such as a display.
  • the disk device 110 is an auxiliary memory including a computer-readable non-transitory storage medium.
  • the disk device 110 stores various programs and various data. Specifically, the disk device 110 stores an OS 111, a numerical attribute extraction program 112, a correct text 113, a new text 114, an attribute / attribute value pair list 115, teacher data 116, and a numerical attribute extraction model 117.
  • the numerical attribute extraction model 117 includes an attribute / attribute value relationship model 1171, which is used for determining whether a pair of an attribute candidate and an attribute value candidate is truly an attribute / attribute value pair, and an inter-attribute relationship model 1172, which indicates which attributes are likely to appear together.
  • the numerical attribute extraction program 112 determines whether each word / phrase pair included in the new text 114 is an attribute / attribute value pair, and extracts the word / phrase pairs so determined as attribute / attribute value pairs of numerical attributes.
  • the numerical attribute extraction program 112 includes a correct answer pair extraction subprogram 1121, a teacher data creation subprogram 1122, a model learning subprogram 1123, an attribute / attribute value candidate extraction subprogram 1124, and an attribute / attribute value pair determination subprogram 1125. The processing of these programs will be described in detail with reference to FIG.
  • the CPU 101 realizes a predetermined function by executing the above program.
  • a program performs a predetermined process by being executed by the processor. Therefore, in the present disclosure, a description with a program as the subject may also be read as a description with the CPU 101 or the numerical attribute extraction device 100 as the subject.
  • the CPU 101 operates as a functional unit (means) that realizes a predetermined function by operating according to a program.
  • the CPU 101 functions as a numerical attribute extraction unit (numerical attribute extraction means) by operating according to the numerical attribute extraction program 112.
  • the numerical attribute extraction apparatus 100 is an apparatus including these functional units (means).
  • the correct text 113 is data input to the numerical attribute extraction program 112 and is used to learn a model for extracting numerical attribute attribute / attribute value pairs.
  • the correct text 113 is text annotated with the information necessary to specify the attributes of numerical attributes and the appearance positions of the corresponding attribute values, and may have an arbitrary format.
  • the correct text 113 is constructed, for example, by inserting tags indicating attributes or attribute values into the text, or by preparing, separately from the text, a table indicating the start and end positions of each attribute or attribute value.
  • the new text 114 is the text from which attribute / attribute value pairs of numerical attributes are to be extracted. Usually, a new text different from the correct text 113 is the target. After attribute / attribute value pair extraction has been executed for a new text, the new text can also be registered as correct text 113.
  • the attribute / attribute value pair list 115 is a list, in order of appearance position in the text, of the character string pairs indicating the attributes and attribute values extracted from the correct text 113.
  • the positional relationship between pairs included in different texts is arbitrary.
  • for each pair, the character strings, their appearance positions in the text, the unit of the attribute value, and the like are stored.
  • the teacher data 116 has the same format as the attribute / attribute value pair list.
  • Attribute normalization means unifying different notations such as “ANION GAP” and “Anion Gap”.
  • the normalization of attribute values aligns numeric character strings that have different appearances, such as “300 mg” and “0.3 g”, to a single unit.
  • the numerical attribute extraction model 117 is data used by the attribute / attribute value pair determination subprogram 1125 and includes an attribute / attribute value relationship model 1171 and an inter-attribute relationship model 1172.
  • the attribute / attribute value relationship model 1171 indicates a criterion for determining whether or not two words are a pair of corresponding attribute and attribute value. Specifically, the attribute / attribute value relationship model 1171 expresses a distribution of attribute values for each attribute.
  • the inter-attribute relationship model 1172 indicates the strength of the relationship between two attributes, that is, which other attributes are likely to appear when a certain attribute appears. Specifically, the inter-attribute relationship model 1172 expresses a conditional probability of a subsequent appearing attribute with respect to an appearing attribute.
  • FIG. 2 shows an outline of the numerical attribute extraction processing executed by the numerical attribute extraction device 100.
  • the correct answer pair extraction subprogram 1121 acquires attribute / attribute value pairs from the correct text 113 and generates an attribute / attribute value pair list 115.
  • the teacher data creation subprogram 1122 refers to the attribute / attribute value pair list 115, identifies the attribute and attribute value to be normalized, and normalizes the attribute and attribute value.
  • the model learning subprogram 1123 reads the teacher data 116 and learns the numerical attribute extraction model 117.
  • the attribute / attribute value pair candidate extraction subprogram 1124 reads the new text 114 and extracts character strings that are candidates for attributes and attribute values.
  • the attribute / attribute value pair determination subprogram 1125 receives a character string that is an attribute and attribute value candidate from the attribute / attribute value pair candidate extraction subprogram 1124, and uses the numeric attribute extraction model 117 to select the attribute candidate and the attribute value candidate. It is determined whether the pair is an attribute / attribute value pair.
  • the word / phrase pairs determined to be attribute / attribute value pairs are stored in the correct text 113, after a manual check if necessary.
  • FIG. 3 shows a flowchart of numerical attribute extraction processing by the numerical attribute extraction program 112.
  • the correct answer pair extraction subprogram 1121 acquires the attribute / attribute value pair of the numerical attribute from the correct answer text 113 as a correct answer, and outputs it as an attribute / attribute value pair list (S11).
  • FIG. 4 shows an example of the correct text. In the example of FIG. 4, the correct text consists of the text itself and a list of the appearance positions of attribute / attribute value pairs in the text. Although the correct text is created manually here, attribute / attribute value pairs may instead be extracted from the text by conventional techniques that use syntax analysis results or pattern matching, and a text in which only high-confidence pairs are treated as correct answers may be used.
  • FIG. 5 shows an example of the attribute / attribute value pair list 115.
  • the attribute / attribute value pair list 115 is data in which attributes, attribute values, and additional information extracted from the correct text are stored in each row.
  • the first line indicates that, in the sentence with document ID 1 and sentence ID 1, there is an attribute / attribute value pair whose numerical attribute is “ALT” (alanine aminotransferase), whose attribute value is “7”, and whose unit is “U/L”.
  • the teacher data creation subprogram 1122 generates teacher data 116 from the acquired attribute / attribute value pair list 115 (S12).
  • the attribute and the attribute value are normalized.
  • for attribute normalization, synonym extraction and spelling-variation extraction techniques known in the prior art are used. In the example of FIG. 5, “ALT” and “AlT (SGPT)” are normalized to “ALT”.
  • for the attribute value, the numeric character string extracted from the text is first converted into a numeric value. Then, referring to the units, if the units are not uniform, they are unified into a standard unit and the numeric values are converted accordingly. For example, if there are attribute values such as “300 mg” and “0.3 g”, the units “mg” and “g” are aligned to, for example, “mg”, and “0.3” is converted to “300”.
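  • A minimal sketch of this unit normalization, assuming a hypothetical unit table limited to mg / g / kg (a real system would cover every unit appearing in the teacher data):

```python
import re

# Hypothetical unit table mapping each unit to the base unit "mg".
UNIT_TO_MG = {"mg": 1.0, "g": 1000.0, "kg": 1_000_000.0}

def normalize_value(text):
    """'0.3g' -> (300.0, 'mg'); None if the text is not a number with a known unit."""
    m = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(mg|g|kg)\s*", text)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value * UNIT_TO_MG[unit], "mg"

value_mg, unit = normalize_value("0.3g")
assert unit == "mg" and abs(value_mg - 300.0) < 1e-9
```

After this step, “300 mg” and “0.3 g” compare equal, so the distribution of attribute values for an attribute can be learned in a single unit.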
  • the model learning subprogram 1123 learns the numerical attribute extraction model 117 from the teacher data 116 (S13).
  • the numerical attribute extraction model 117 includes an attribute / attribute value relationship model 1171 and an inter-attribute relationship model 1172. Details of step S13 will be described later with reference to FIG.
  • the attribute / attribute value candidate extraction subprogram 1124 extracts attributes or attribute value candidates from the new text 114 from which the attribute / attribute value pair is to be extracted (S14).
  • an information extraction technique known as a conventional technique can be used, and a description thereof will be omitted.
  • candidates for attributes or attribute values are extracted as entities.
  • the attribute / attribute value pair determination subprogram 1125 determines whether each attribute / attribute value candidate pair extracted in step S14 is a true attribute / attribute value pair, and extracts the attribute / attribute value pairs (S15). Details of step S15 will be described later with reference to FIGS.
  • the numerical attribute extraction model 117 consists of two models. One is the attribute / attribute value relationship model 1171, which indicates the strength of the relationship between attributes and attribute values.
  • the attribute / attribute value relationship model 1171 is a model indicating the validity of a certain value being a value of a certain attribute.
  • the other is the inter-attribute relationship model 1172, which represents the strength of the relationship between attributes.
  • the inter-attribute relationship model 1172 is a model indicating which other attributes are likely to appear when a certain attribute appears.
  • FIG. 6 shows a processing flow of learning processing of the attribute / attribute value relationship model 1171.
  • if all models have been processed, the process proceeds to S1305; if there is an unprocessed model, the process proceeds to S1304 (S1303).
  • the model refers to a type of probability distribution such as a normal distribution. Since the distribution of attribute values differs for each attribute, it is preferable to set different types of models in advance and select the most appropriate model.
  • for the unprocessed model, the most appropriate parameters are calculated from all the attribute values acquired in S1302 (S1304).
  • specifically, the parameters of each model are determined by applying maximum likelihood estimation to the attribute values acquired in S1302.
  • a normal distribution, a log normal distribution, and a rectangular distribution are used as models.
  • the normal distribution and lognormal distribution have two parameters, and the rectangular distribution has three parameters.
  • the parameters of the normal distribution and the lognormal distribution can be determined directly from statistics such as the mean and standard deviation.
  • for the rectangular distribution, appropriate parameters are searched for by varying the parameters.
  • the most appropriate model is selected from preset models based on the parameters calculated in S1304 and stored in the attribute / attribute value relationship model 1171 (S1305).
  • FIG. 7 shows an example of the attribute / attribute value relationship model 1171.
  • each row of the attribute / attribute value relationship model 1171 stores, for each attribute 11711, the model type 11712 and the model parameter group 11713 with the highest validity.
  • Various methods can be considered as a method for determining the validity of the model.
  • AIC is used as an example.
  • the lognormal distribution is selected as the most appropriate model from the maximum likelihood estimation value and the number of parameters of each model.
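  • Model selection by AIC can be sketched as follows, restricted for brevity to the normal and lognormal distributions (the rectangular distribution is omitted); the sample values are invented:

```python
import math

def normal_fit(xs):
    """Maximum-likelihood fit of a normal distribution; returns (log-likelihood, #params)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE variance
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return loglik, 2

def lognormal_fit(xs):
    """Fit a lognormal by fitting a normal to log(x); subtract the Jacobian term sum(log x)."""
    logs = [math.log(x) for x in xs]
    loglik, k = normal_fit(logs)
    return loglik - sum(logs), k

def select_model(xs):
    """Return the model with the smallest AIC = 2k - 2 * log-likelihood."""
    fits = {"normal": normal_fit, "lognormal": lognormal_fit}
    aic = {}
    for name, fit in fits.items():
        loglik, k = fit(xs)
        aic[name] = 2 * k - 2 * loglik
    return min(aic, key=aic.get), aic

# Right-skewed attribute values: the lognormal should win.
values = [1, 2, 2, 3, 4, 5, 8, 15, 40]
best, aic = select_model(values)
assert best == "lognormal"
```

Because both candidates here have two parameters, the AIC comparison reduces to comparing maximized log-likelihoods; the penalty term matters once models with different parameter counts, such as the rectangular distribution, are included.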
  • above, a parametric estimation method has been described, in which a probability distribution (a continuous distribution) is assumed and the model is selected based on an information criterion such as AIC.
  • a non-parametric estimation method that does not assume a specific distribution is also conceivable.
  • the K-nearest neighbor method is a technique that selects the K cases closest to a given case and performs classification or the like based on the classes of the selected cases.
  • specifically, the teacher data 116 is referred to, and the K rows whose attribute values are closest to the given value are selected.
  • then, the attribute column 1154 of the selected rows is acquired and tabulated. At this time, if a certain attribute occurs k times among the selected rows, k / K is the reliability of the given value for that attribute.
  • in this case, an attribute / attribute value relationship model is not explicitly created; instead, the necessary numerical value is obtained by performing the above calculation during the attribute / attribute value pair determination process in step S15.
  • a kernel function method may be used instead of the K-nearest neighbor method.
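  • The k / K reliability computation by the K-nearest neighbor method can be sketched as follows, with invented teacher rows:

```python
def knn_reliability(value, teacher_rows, k=3):
    """teacher_rows: (attribute, numeric attribute value) pairs from the teacher data.

    Returns {attribute: reliability}, where reliability is the fraction k/K of
    the K nearest teacher values that carry that attribute.
    """
    nearest = sorted(teacher_rows, key=lambda row: abs(row[1] - value))[:k]
    counts = {}
    for attr, _ in nearest:
        counts[attr] = counts.get(attr, 0) + 1
    return {attr: c / k for attr, c in counts.items()}

# Invented teacher rows (attribute column paired with numeric values).
rows = [("ALT", 7), ("ALT", 9), ("ALT", 12), ("AST", 20), ("AST", 25), ("AGE", 64)]
rel = knn_reliability(8, rows, k=3)
assert rel == {"ALT": 1.0}  # the three values nearest 8 (7, 9, 12) all carry ALT
```

No distributional model is stored; the reliability is recomputed from the teacher data at determination time, as described above.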
  • in the above, the attribute value is a numerical value, but it may sometimes be desirable to treat it as categorical data.
  • for example, consider drugs that contain different amounts of an active ingredient, such as “drug name 250 mg” and “drug name 100 mg”.
  • here, “drug name” is an attribute of “250 mg”, but since there are only a few types of attribute values, it is desirable to treat them in the same way as non-numerical attribute values. In such a case, the occurrence count of each numerical value, such as “250 mg” and “100 mg”, becomes large.
  • therefore, when the value obtained by dividing the total number of attribute-value occurrences by the number of distinct attribute values is greater than a predetermined threshold, the values can be treated as categorical data.
  • in that case, the appearance probability of each attribute value may be estimated by maximum likelihood, as is conventional.
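  • This categorical treatment can be sketched as follows; the threshold value is an illustrative assumption:

```python
from collections import Counter

def categorical_probabilities(values, threshold=3.0):
    """Decide whether numeric attribute values should be treated as categorical.

    If the average occurrence count per distinct value exceeds `threshold`
    (an illustrative cut-off), return maximum-likelihood appearance
    probabilities; otherwise return None so that a continuous model is used.
    """
    counts = Counter(values)
    if len(values) / len(counts) <= threshold:
        return None
    total = len(values)
    return {v: c / total for v, c in counts.items()}

# Drug doses: only two distinct values, each repeated many times.
doses = [250, 250, 250, 100, 100, 250, 100, 250] * 2
probs = categorical_probabilities(doses)
assert probs == {250: 0.625, 100: 0.375}
```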
  • Fig. 8 shows the processing flow of the learning process of the inter-attribute relationship model.
  • Attribute / attribute value pair is acquired from the attribute / attribute value pair list 115 (S1311).
  • Attribute 2-gram is extracted according to the order of appearance of the attribute value in the text (S1312).
  • the attribute 2-gram is an ordered pair; for example, <ALT, AST> is extracted when an attribute value with the attribute “AST” appears after an attribute value with the attribute “ALT”.
  • a high frequency of this 2-gram indicates that “AST” is likely to appear after “ALT”.
  • the frequency of 2-grams of the extracted attributes is totaled (S1313).
  • finally, the conditional probability related to the appearance order of attributes, that is, the conditional probability of the attribute of the next appearing attribute value given the attribute of the current attribute value, is calculated and stored in the inter-attribute relationship model 1172.
  • An example of the inter-attribute relationship model 1172 is shown in FIG. For example, all 2-grams having ALT as the first element are obtained and the sum of their frequencies is computed; the frequency of <ALT, AST> divided by this sum is the conditional probability that “AST” appears when “ALT” appears. That is, if this value is large, “AST” is likely to appear after “ALT”.
  • in this embodiment, the order of appearance is expressed using 2-grams. That is, a first-order (simple) Markov process is assumed, but a second-order or higher Markov process may also be used.
  • alternatively, a model in which the probability is high if the attribute appears at an appearance position within a certain distance can be used.
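  • The 2-gram learning steps above can be sketched as follows, using invented attribute sequences:

```python
from collections import Counter

def bigram_conditional(attr_sequences):
    """attr_sequences: for each text, the attributes in order of appearance.

    Returns {(a, b): P(b follows a)} estimated from attribute 2-gram frequencies.
    """
    bigrams = Counter()
    for seq in attr_sequences:
        bigrams.update(zip(seq, seq[1:]))   # ordered attribute 2-grams
    first_totals = Counter()                # frequency sums per first element
    for (a, _), c in bigrams.items():
        first_totals[a] += c
    return {(a, b): c / first_totals[a] for (a, b), c in bigrams.items()}

# Invented sequences of attributes in order of appearance.
seqs = [["ALT", "AST", "AGE"], ["ALT", "AST"], ["ALT", "AGE"]]
p = bigram_conditional(seqs)
assert abs(p[("ALT", "AST")] - 2 / 3) < 1e-9  # "AST" follows "ALT" in 2 of 3 texts
```

The resulting table is exactly the conditional-probability form of the inter-attribute relationship model described above.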
  • in S15, attribute / attribute value pairs are extracted as follows.
  • first, consider the case where there is one numerical value that is an attribute value candidate and there are a plurality of attribute candidates in the new text 114.
  • in this case, an appropriate attribute can be determined as follows using the attribute / attribute value relationship model.
  • specifically, the likelihood of each attribute is obtained by calculating the conditional probability of each attribute given the numerical value of the attribute value candidate in the text.
  • the attribute with the highest likelihood may then be selected as the attribute for the attribute value candidate.
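  • This likelihood-based attribute selection can be sketched as follows, assuming for illustration that each attribute's value distribution has been learned as a normal distribution (the distribution parameters and priors are invented):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def best_attribute(value, models):
    """models: {attribute: (prior, mu, sigma)} learned from correct-answer data.

    Returns the attribute with the highest posterior probability for `value`,
    together with the full posterior distribution.
    """
    scores = {a: prior * normal_pdf(value, mu, sigma)
              for a, (prior, mu, sigma) in models.items()}
    z = sum(scores.values())
    posteriors = {a: s / z for a, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Invented models: ALT values cluster around 20 U/L, AGE values around 50 years.
models = {"ALT": (0.5, 20.0, 10.0), "AGE": (0.5, 50.0, 15.0)}
attr, post = best_attribute(6, models)
assert attr == "ALT"
```

The value 6 is far more plausible under the ALT distribution than under the AGE distribution, so ALT is selected.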
  • supervised learning that uses grammatical dependency and the distance between the attribute and the attribute value (number of characters, number of words, distance on the syntax tree, etc.) as features can also be performed.
  • when there are a plurality of attribute value candidates, an appropriate combination is determined by the following two methods and combinations thereof.
  • the first method is based on the inter-attribute relationship model 1172.
  • the combination search can be made efficient based on information on whether the relationship between attributes is strong or weak.
  • a method that applies the Markov tagger used for part-of-speech tagging of sentences will be described.
  • FIG. 11 shows a processing flow of the attribute / attribute value pair determination process using the Markov tagger.
  • a series of attribute value candidates in order of appearance is acquired from the attribute value candidates extracted from the new text 114 (S1501). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
  • the conditional probability of the attribute is calculated for each attribute value candidate in the series, and an attribute with a high probability is acquired (S1502).
  • a list such as “ALT”, “AST”, and “AGE” is obtained for the attribute value candidate “6”.
  • the optimum attribute transition sequence is determined (S1504).
  • the determination of the optimum sequence can be executed by an algorithm called the Viterbi algorithm, and thus detailed description thereof is omitted.
  • the result that the attribute series [ALT, AST] is most likely to be the attribute value candidate series [6, 20] is obtained.
  • as a result, two attribute / attribute value pairs, <ALT, 6> and <AST, 20>, are obtained.
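  • A minimal sketch of this Viterbi-based tagging; the emission scores (standing in for the attribute / attribute value relationship model) and transition probabilities (standing in for the inter-attribute relationship model) are invented for the running example:

```python
import math

def viterbi(values, attrs, emission, transition, initial):
    """Most likely attribute sequence for a sequence of attribute-value candidates.

    emission[(attr, value)]  ~ validity from the attribute/attribute-value model
    transition[(prev, attr)] ~ conditional probability from the inter-attribute model
    Missing entries fall back to a small floor probability.
    """
    def lp(table, key):
        return math.log(table.get(key, 1e-9))

    best = {a: lp(initial, a) + lp(emission, (a, values[0])) for a in attrs}
    backpointers = []
    for v in values[1:]:
        nxt, ptr = {}, {}
        for a in attrs:
            prev = max(attrs, key=lambda p: best[p] + lp(transition, (p, a)))
            nxt[a] = best[prev] + lp(transition, (prev, a)) + lp(emission, (a, v))
            ptr[a] = prev
        best, backpointers = nxt, backpointers + [ptr]
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Invented scores for the running example [6, 20] with candidates ALT / AST / AGE.
emission = {("ALT", 6): 0.5, ("AST", 6): 0.2, ("AGE", 6): 0.3,
            ("ALT", 20): 0.2, ("AST", 20): 0.5, ("AGE", 20): 0.3}
transition = {("ALT", "AST"): 0.8, ("ALT", "AGE"): 0.2}
initial = {"ALT": 0.6, "AST": 0.2, "AGE": 0.2}
assert viterbi([6, 20], ["ALT", "AST", "AGE"], emission, transition, initial) == ["ALT", "AST"]
```

The transition term rewards the attribute order <ALT, AST> learned from the correct-answer data, which is how the inter-attribute relationship model steers the result.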
  • FIG. 10 is a conceptual explanatory diagram when performing numerical attribute extraction on the new text 114.
  • “ALT” and “AST” are extracted as attribute candidates, and “6” and “20” are extracted as attribute value candidates.
  • FIG. 10 shows a conceptual diagram of the Markov tagger that assigns an attribute sequence to this attribute value sequence.
  • there is a sequence of attribute value candidates extracted from the text, and for each attribute value, attribute candidates with high validity are acquired using the attribute / attribute value relationship model 1171.
  • the attributes “ALT”, “AST”, and “AGE” are acquired in the descending order of validity for the attribute value “6”.
  • if the attribute with the highest validity were simply selected for each attribute value, “ALT” would be chosen, and an incorrect result would be obtained.
  • processing is performed as follows using the inter-attribute relationship model 1172.
  • in the first method, the attribute candidates in the text are not explicitly used.
  • instead, priority is given to the order of appearance in past data such as the correct text 113.
  • attribute / attribute value pair determination processing by DP matching will be described.
  • FIG. 12 shows a processing flow of attribute / attribute value pair determination processing by DP matching.
  • a series of attribute value candidates in the order of appearance is acquired from the attribute value candidates extracted from the new text 114 (S1511). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
  • a series of attribute candidates in order of appearance is acquired from the attribute candidates extracted from the new text 114 (S1512).
  • a sequence such as [ALT, AST] is obtained.
  • the strength of the relationship between all attribute and attribute value pairs is acquired from the attribute / attribute value relationship model, and a matrix composed of the acquired numerical values is generated (S1513). Specifically, the score for the pair of the I-th attribute and the J-th attribute value is stored in the <I, J> element of the matrix.
  • the optimum attribute transition sequence is determined (S1514). Since the sequence can be determined by DP matching, a known technique, a detailed description is omitted.
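  • A minimal sketch of such a DP-matching alignment; it assumes an order-preserving alignment with skips allowed, and the matrix scores are invented:

```python
def dp_match(attrs, values, score):
    """Order-preserving alignment of an attribute sequence with an attribute-value
    sequence that maximizes the total relationship score (skips are allowed).

    score[(attr, value)]: strength taken from the attribute/attribute-value model.
    """
    n, m = len(attrs), len(values)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = best[i - 1][j - 1] + score.get((attrs[i - 1], values[j - 1]), 0.0)
            best[i][j] = max(pair, best[i - 1][j], best[i][j - 1])
    # backtrack to recover the chosen attribute/attribute-value pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = score.get((attrs[i - 1], values[j - 1]), 0.0)
        if s > 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((attrs[i - 1], values[j - 1]))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Invented matrix scores for the running example: attributes [ALT, AST], values [6, 20].
score = {("ALT", 6): 0.9, ("AST", 6): 0.4, ("ALT", 20): 0.3, ("AST", 20): 0.8}
assert dp_match(["ALT", "AST"], [6, 20], score) == [("ALT", 6), ("AST", 20)]
```

Cells with small matrix values produce low-scoring alignments, which matches the low-reliability portions mentioned in the combined method below.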
  • the attribute / attribute value pair determination process described in FIGS. 11 and 12 can be used in combination.
  • for example, the method described above with reference to FIG. 12, which uses the attribute candidates in the new text as a constraint, is executed, and portions of the correspondence obtained by DP matching where the reliability is low, for example, where the value set in the matrix is small, are identified.
  • alternatively, the attribute candidate information in the new text can be used not as a hard constraint but to temporarily modify the inter-attribute relationship model 1172.
  • specifically, a sequence of attribute candidates is created, and the values of the inter-attribute relationship model 1172 are increased by a certain ratio for every pair of attributes on this sequence.
  • an attribute pair that does not exist in the inter-attribute relationship model 1172 is newly added, with a predetermined value set as its probability.
  • "BP" denotes blood pressure; 141 is the systolic (highest) blood pressure and 70 is the diastolic (lowest) blood pressure.
  • As described above, attribute/attribute-value pairs for numerical attributes can be extracted at high speed and with high accuracy.
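The determination flow above (S1511–S1514) can be sketched as a small dynamic-programming alignment. This is a minimal illustration, not the patent's implementation: the function name, the monotonicity rule, and the toy score matrix are assumptions.

```python
def align(scores):
    """Monotone DP alignment of attribute-value candidates (columns) to
    attribute candidates (rows), maximizing the summed relation scores.
    scores[i][j] = strength of the relation between the i-th attribute
    candidate and the j-th attribute-value candidate (the <I, J> matrix).
    Returns, for each value j, the index of the attribute it is assigned to."""
    I, J = len(scores), len(scores[0])
    NEG = float("-inf")
    # best[i][j]: best total score with value j assigned to attribute i,
    # where assignments never move backwards through the attribute sequence
    best = [[NEG] * J for _ in range(I)]
    back = [[0] * J for _ in range(I)]
    for i in range(I):
        best[i][0] = scores[i][0]
    for j in range(1, J):
        for i in range(I):
            for prev in range(i + 1):  # previous value used attribute <= i
                if best[prev][j - 1] > NEG:
                    cand = best[prev][j - 1] + scores[i][j]
                    if cand > best[i][j]:
                        best[i][j] = cand
                        back[i][j] = prev
    # trace back from the best final cell
    i = max(range(I), key=lambda r: best[r][J - 1])
    assign = [i]
    for j in range(J - 1, 0, -1):
        i = back[i][j]
        assign.append(i)
    return assign[::-1]
```

For the [ALT, AST] / [6, 20] example, a matrix whose diagonal scores dominate yields the assignment 6→ALT, 20→AST.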

Abstract

The purpose of the present invention is to accurately extract, from text, an attribute-value pair for a numerical attribute, which takes a numerical value. This natural language processing device extracts attribute-value pairs from correct text and calculates the validity of association between each attribute and an unknown numerical value on the basis of the distribution of values of the attribute. The calculation of the validity for each attribute is accomplished by determining a distribution most similar to the distribution of values of the attribute in the correct text and then using this determined distribution. When a plurality of attribute values are subjected to this attribute-value validity calculation, relationships between attributes, as well as relationships between attributes and values, are learned from the correct text to determine appropriate pairs.

Description

Natural language processing apparatus and natural language processing method
 The present invention relates to syntax analysis techniques for natural language processing, and in particular to a technique for extracting pairs of an attribute and an attribute value for numerical attributes from text.
 The amount of digitized documents accessible to users is increasing with the spread of personal computers and the Internet. Electronic documents are unstructured data and are difficult to handle with a computer. Expectations for natural language processing are therefore growing as a way to make large volumes of digitized documents effectively usable by structuring them. One technique for structuring digitized documents is attribute/attribute-value extraction, which extracts from text pairs of an attribute such as "sex" and an attribute value such as "female". It is used, for example, to extract from various information sources the attributes that describe the specifications of a product together with their attribute values, and to display them in integrated form. Among such attributes, those whose attribute values are numerical are called numerical attributes; in the case of product specifications, weight and size are numerical attributes. Numerical attributes are quantitative, objective information and are particularly valuable among attribute/attribute-value information. In view of this importance, known techniques exist for automatically extracting attribute/attribute-value pairs that involve numerical attributes. These known techniques are described below.
 The most standard technique for extracting attribute/attribute-value pairs for numerical attributes uses syntactic information as a clue (Non-Patent Document 1). Non-Patent Document 1 discloses, for a sentence such as "The weight is 10 kg.", first extracting the numeric character string "10 kg" as an attribute-value candidate and then recognizing from the parsing result that "weight" is the attribute. However, depending on the document type, some documents offer almost no syntactic clues. Moreover, at the current state of the art, parsing techniques for obtaining syntactic information cannot always produce a correct analysis. A semantic method that uses the strength of the relationship between attributes and attribute values as a clue is therefore required.
 As known techniques that extract attribute/attribute-value pairs using information other than syntactic information, there are techniques that use "proximity" as a clue and techniques that use the "correlation" between an attribute and an attribute value as a clue (Non-Patent Document 2). In Non-Patent Document 2, the first step extracts words and phrases that are candidates for attributes or attribute values from the text. The second step then determines the correspondence between attribute-candidate and attribute-value-candidate pairs. In particular, for the second step, a technique that combines "syntactic information", "proximity", and "correlation" as appropriate is disclosed. "Proximity" is an approximation of syntactic information, while "correlation" is the semantic information.
US Pat. No. 7,925,652
 As described above, even when syntactic information is unavailable as a clue, the technique disclosed in Non-Patent Document 2 can associate attributes with attribute values by using correlation, which indicates the strength of the semantic relationship between an attribute and an attribute value. However, the prior art has the following problems.
 Attribute values for non-numerical attributes are finite, and the number of distinct values is not very large. The strength of the relationship between an attribute and its attribute values can therefore be enumerated over all attribute values. For example, consider "manufacturer" as an attribute of a product. Attribute values such as "Company A", "Company B", and "Company C" exist, and the strength of the relationship can be considered for each attribute/attribute-value pair, such as <manufacturer, Company A>, <manufacturer, Company B>, and <manufacturer, Company C>. Specifically, a statistic such as pointwise (self-) mutual information may be computed from the frequency of occurrence of each pair. By learning such relationship strengths from correct answer data or the like, attribute/attribute-value pairs can be extracted from text.
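For the non-numerical case described here, the pair statistic can be computed directly from co-occurrence counts. A minimal sketch of pointwise mutual information over observed attribute/attribute-value pairs (the sample data are illustrative, not from the patent):

```python
import math
from collections import Counter

def pmi(pairs):
    """Pointwise (self-) mutual information for each observed
    attribute/attribute-value pair, from co-occurrence frequencies."""
    pair_c = Counter(pairs)                # count of each <attribute, value> pair
    attr_c = Counter(a for a, _ in pairs)  # marginal count of each attribute
    val_c = Counter(v for _, v in pairs)   # marginal count of each value
    n = len(pairs)
    return {
        (a, v): math.log2((c / n) / ((attr_c[a] / n) * (val_c[v] / n)))
        for (a, v), c in pair_c.items()
    }

scores = pmi([("manufacturer", "Company A")] * 3
             + [("manufacturer", "Company B")]
             + [("color", "red")] * 2)
```

Pairs that co-occur more often than their marginals predict receive positive scores, which is exactly the kind of relationship strength the text describes learning from correct answer data.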
 For numerical attributes, on the other hand, the attribute values are in principle continuous, so infinitely many values exist. In practice, for an attribute such as blood pressure, attribute values are expressed as integers and are therefore discrete, but the number of distinct attribute values is still very large. It therefore cannot be expected that every attribute value appears in the data used for learning, and it is difficult to learn the strength of the relationship between an attribute and its attribute values in advance by the method used for non-numerical attributes.
 An object of the present invention is to provide means for computing the strength of the relationship between an attribute and an attribute value for numerical attributes, and thereby to extract numerical-attribute/attribute-value pairs accurately and at high speed.
 A representative example of the present invention is a natural language processing apparatus that analyzes the document structure of input text data, comprising: an extraction unit that extracts candidate pairs of a numerical attribute and an attribute value from the text data; a calculation unit that calculates, based on a model representing the distribution of attribute values for a given numerical attribute, a value indicating the validity of each extracted candidate pair; and a determination unit that determines numerical-attribute/attribute-value pairs from among the extracted candidates based on the calculated validity values.
 According to one aspect of the present invention, attribute/attribute-value pairs for numerical attributes appearing in text can be extracted with high accuracy and at high speed. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.
FIG. 1 shows the configuration of the numerical attribute extraction apparatus.
FIG. 2 outlines the numerical attribute extraction processing executed by the numerical attribute extraction program.
FIG. 3 shows a flowchart of the numerical attribute extraction processing by the numerical attribute extraction program.
FIG. 4 shows a configuration example of the correct text.
FIG. 5 shows a configuration example of the attribute/attribute-value pair list.
FIG. 6 shows a flowchart of the attribute/attribute-value relationship model learning process.
FIG. 7 shows a configuration example of the attribute/attribute-value relationship model.
FIG. 8 shows a flowchart of the inter-attribute relationship model learning process.
FIG. 9 shows a configuration example of the inter-attribute relationship model.
FIG. 10 is an explanatory diagram of the attribute/attribute-value pair determination process.
FIG. 11 shows a flowchart of the attribute/attribute-value pair determination process.
FIG. 12 shows a flowchart of the attribute/attribute-value pair determination process.
 Embodiments of the present invention are described below with reference to the accompanying drawings. It should be noted that the embodiments are merely examples for realizing the present invention and do not limit its technical scope. In the figures, common components are given the same reference numerals.
 The following discloses a method of extracting attribute/attribute-value pairs for numerical attributes from text. More specifically, the distribution of attribute values for each attribute is learned as a model, and the attribute for a given numerical value is determined based on the model. One object of the present disclosure is to extract attribute/attribute-value pairs with high accuracy by learning the strength of the relationship between numerical attributes and their attribute values from correct data.
 In one example of the disclosed technique, the distribution of attribute values for each attribute is learned from correct data as a model; when a numerical value is given, the strength of the relationship between the given value and each attribute is computed as a posterior probability, and attributes are extracted based on the computed posterior probabilities.
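The posterior computation described here can be sketched as follows. This is a minimal illustration under assumed per-attribute normal distributions with made-up parameters (the attribute names and numbers are assumptions; in the patent, the distributions are the learned models):

```python
import math

def posterior(value, models, priors):
    """P(attribute | value) ∝ p(value | attribute) * P(attribute).
    `models` maps each attribute to the parameters of its learned value
    distribution (here, normal (mean, std) as a stand-in)."""
    def normal_pdf(x, mu, sigma):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    # unnormalized posterior for each attribute
    score = {a: normal_pdf(value, mu, sd) * priors[a] for a, (mu, sd) in models.items()}
    z = sum(score.values())  # normalizing constant
    return {a: s / z for a, s in score.items()}

# Illustrative (made-up) value distributions for two blood-pressure attributes
models = {"systolic": (130.0, 15.0), "diastolic": (75.0, 10.0)}
post = posterior(141, models, {"systolic": 0.5, "diastolic": 0.5})
```

Given the value 141, the posterior mass concentrates on the attribute whose learned distribution makes 141 likely.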
 In addition, the accuracy of attribute/attribute-value pair extraction can be improved by learning the strength of the relationships between attributes from correct data as a model and taking the order of appearance of attributes into account.
 The configuration of the present disclosure is described concretely below. FIG. 1 shows the configuration of the numerical attribute extraction apparatus 100 of the present embodiment. The numerical attribute extraction apparatus 100 includes a CPU 101, a main memory 102, an input/output device 103, and a disk device 110.
 The CPU 101 is a processor that executes various processes by executing programs stored in the main memory 102. Specifically, the CPU 101 loads a program stored in the disk device 110 into the main memory 102 and executes it. A program may also be loaded into the main memory 102 from an external server via a network.
 The main memory 102 stores the programs executed by the CPU 101 and the data required by the CPU 101. The input/output device 103 receives input of information from the user and outputs information in response to instructions from the CPU 101. For example, the input/output device 103 comprises input devices such as a keyboard and a mouse and an output device such as a display.
 The disk device 110 is auxiliary memory including a computer-readable non-transitory storage medium. The disk device 110 stores various programs and various data: specifically, an OS 111, a numerical attribute extraction program 112, correct text 113, new text 114, an attribute/attribute-value pair list 115, teacher data 116, and a numerical attribute extraction model 117.
 The numerical attribute extraction model 117 includes an attribute/attribute-value relationship model 1171, used to determine whether a pair of an attribute candidate and an attribute-value candidate is truly an attribute/attribute-value pair, and an inter-attribute relationship model 1172, which indicates whether two attributes tend to appear together.
 The numerical attribute extraction program 112 determines whether pairs of words and phrases contained in the input text 114 are attribute/attribute-value pairs, and extracts the pairs determined to be attribute/attribute-value pairs of numerical attributes. The numerical attribute extraction program 112 includes a correct pair extraction subprogram 1121, a teacher data creation subprogram 1122, a model learning subprogram 1123, an attribute/attribute-value candidate extraction subprogram 1124, and an attribute/attribute-value pair determination subprogram 1125. The processing of these subprograms is described in detail with reference to FIG. 2.
 The CPU 101 realizes predetermined functions by executing the above programs; a program performs its defined processing by being executed by the processor. Accordingly, in the present disclosure, descriptions with a program as the subject may also be read with the CPU 101 or the numerical attribute extraction apparatus 100 as the subject.
 By operating according to a program, the CPU 101 operates as a functional unit (means) that realizes a predetermined function. For example, by operating according to the numerical attribute extraction program 112, the CPU 101 functions as a numerical attribute extraction unit (numerical attribute extraction means). The numerical attribute extraction apparatus 100 is an apparatus that includes these functional units (means).
 The correct text 113 is data input to the numerical attribute extraction program 112 and is used to learn the model for extracting attribute/attribute-value pairs of numerical attributes. The correct text 113 is text annotated with the information needed to identify the appearance positions of numerical attributes and their corresponding attribute values, and may have any format. For example, the correct text 113 may be constructed by inserting tags into the text that mark attributes or attribute values, or by preparing, separately from the text, a table indicating the start and end positions of each attribute or attribute value.
 The new text 114 is the text from which numerical-attribute/attribute-value pairs are to be extracted. Usually, the target is new text different from the correct text 113. After attribute/attribute-value pair extraction has been executed on new text, the text can also be registered as correct text 113.
 The attribute/attribute-value pair list 115 is a list, in order of appearance position in the text, of pairs of a character string indicating an attribute and a character string indicating its attribute value, extracted from the correct text 113. The positional relationship between pairs from different texts is arbitrary. As additional information, the appearance position of each character string, a character string indicating the unit of the attribute value, and the like are stored. The teacher data 116 has the same format as the attribute/attribute-value pair list, except that it stores the result of normalizing attributes and attribute-value numbers. Attribute normalization means unifying variant notations such as "ANION GAP" and "Anion Gap". Attribute-value normalization means aligning numeric character strings that look different but denote the same quantity, such as "300mg" and "0.3g", to a single unit by using the unit as a clue.
 The numerical attribute extraction model 117 is data used by the attribute/attribute-value pair determination subprogram 1125 and consists of the attribute/attribute-value relationship model 1171 and the inter-attribute relationship model 1172. The attribute/attribute-value relationship model 1171 provides the criterion for determining whether two words or phrases form a corresponding attribute and attribute-value pair; specifically, it represents the distribution of attribute values for each attribute. The inter-attribute relationship model 1172 indicates the strength of the relationship between two attributes, that is, which other attributes are likely to appear when a given attribute appears; specifically, it represents the conditional probability of a subsequently appearing attribute given the preceding attribute.
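The conditional probabilities held by the inter-attribute relationship model 1172 can be estimated from attribute bigrams observed in the correct data. A minimal sketch (the function name and the sample attribute sequences are illustrative assumptions, not the patent's exact data layout):

```python
from collections import Counter

def learn_transition_model(sequences):
    """Estimate P(next attribute | preceding attribute) from the attribute
    sequences observed in the correct data (attribute bigram frequencies)."""
    bigram = Counter()   # counts of (preceding, following) attribute pairs
    prev_c = Counter()   # counts of each attribute in the preceding position
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            bigram[(prev, nxt)] += 1
            prev_c[prev] += 1
    return {(p, n): c / prev_c[p] for (p, n), c in bigram.items()}

# Illustrative attribute sequences, one per document
model = learn_transition_model([["ALT", "AST", "BP"], ["ALT", "BP"]])
```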
 FIG. 2 outlines the numerical attribute extraction processing executed by the numerical attribute extraction apparatus 100. The correct pair extraction subprogram 1121 acquires attribute/attribute-value pairs from the correct text 113 and generates the attribute/attribute-value pair list 115.
 The teacher data creation subprogram 1122 refers to the attribute/attribute-value pair list 115, identifies the attributes and attribute values to be normalized, and normalizes them.
 The model learning subprogram 1123 reads the teacher data 116 and learns the numerical attribute extraction model 117. The attribute/attribute-value pair candidate extraction subprogram 1124 reads the new text 114 and extracts character strings that are candidates for attributes and attribute values. The attribute/attribute-value pair determination subprogram 1125 receives these candidate character strings from the attribute/attribute-value pair candidate extraction subprogram 1124 and uses the numerical attribute extraction model 117 to determine whether each pair of an attribute candidate and an attribute-value candidate is an attribute/attribute-value pair. Pairs of words and phrases determined to be attribute/attribute-value pairs are stored in the correct text 113, with a manual check if necessary.
 FIG. 3 shows a flowchart of the numerical attribute extraction processing by the numerical attribute extraction program 112. The correct pair extraction subprogram 1121 acquires attribute and attribute-value pairs of numerical attributes from the correct text 113 as correct answers and outputs them as the attribute/attribute-value pair list (S11). FIG. 4 shows an example of correct text. In the example of FIG. 4, the correct text consists of the text plus a list of the appearance positions of the attribute/attribute-value pairs in the text. Although the correct text is assumed here to be created manually, a text may instead be used in which attribute/attribute-value pairs are extracted from text by conventional techniques using parsing results or pattern matching, and only the pairs highly likely to be correct are taken as correct answers. Because the method of this embodiment determines attribute/attribute-value pairs using the distribution of attribute values of numerical attributes, which is independent of the information used in the prior art, such substitute text causes few problems. FIG. 5 shows an example of the attribute/attribute-value pair list 115, which is data in which an attribute, its attribute value, and additional information extracted from the correct text are stored in each row. In the example of FIG. 5, the first row indicates that, in the sentence with sentence ID 1 of the document with document ID 1, there is an attribute/attribute-value pair whose numerical attribute is "ALT" (alanine transaminase) and whose attribute value is the number 7, with "U/L" used as the unit of the attribute value.
 The teacher data creation subprogram 1122 generates the teacher data 116 from the acquired attribute/attribute-value pair list 115 (S12). In step S12, attributes and attribute values are normalized. Attribute normalization is performed using synonym extraction and variant-notation extraction techniques known in the prior art; in the example of FIG. 5, "ALT" and "AlT(SGPT)" are normalized to "ALT". For attribute values, the numeric character string extracted from the text is first converted into a number. The unit is then consulted, and if the units are inconsistent, they are unified into a standard unit and the numbers are converted. For example, if an attribute value such as "200mg" and an attribute value such as "0.3g" both occur, "mg" and "g" are unified into, say, "mg", and "0.3" is converted to "300".
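The attribute-value normalization in S12 can be sketched as below. The regular expression and the conversion table are simplified assumptions covering only the mg/g example from the text:

```python
import re

# Hypothetical conversion table: factor from each unit to the chosen
# standard unit "mg" (only mg/g are covered in this sketch)
TO_MG = {"mg": 1.0, "g": 1000.0}

def normalize_value(text):
    """Split a numeric string such as "0.3g" into number and unit, and
    convert the number to the standard unit. Returns (number, unit),
    or None when the string cannot be interpreted."""
    m = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z/]+)", text)
    if m is None or m.group(2) not in TO_MG:
        return None
    return float(m.group(1)) * TO_MG[m.group(2)], "mg"
```

With this table, "0.3g" and "200mg" both come out in mg, so their values become directly comparable.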
 The model learning subprogram 1123 learns the numerical attribute extraction model 117 from the teacher data 116 (S13). The numerical attribute extraction model 117 consists of the attribute/attribute-value relationship model 1171 and the inter-attribute relationship model 1172. Details of step S13 are described later with reference to FIG. 6.
 The attribute/attribute-value candidate extraction subprogram 1124 extracts attribute and attribute-value candidates from the new text 114, the target of attribute/attribute-value pair extraction (S14). Since information extraction techniques known in the prior art can be used to extract attribute and attribute-value candidates, their description is omitted. In this step, attribute and attribute-value candidates are extracted as entities.
 The attribute/attribute-value pair determination subprogram 1125 determines, for the pairs of attribute and attribute-value candidates extracted in step S14, whether each is a truly corresponding attribute/attribute-value pair, and extracts the attribute/attribute-value pairs (S15). Details of step S15 are described later with reference to FIGS. 10 to 12.
 以下では、S13の数値属性抽出モデルの学習処理について説明する。本実施例では、数値属性抽出モデルとして、以下の2種類のモデルを用いる。一つは、属性と属性値の関係性の強さを示す属性/属性値関係性モデル1171である。属性/属性値関係性モデル1171は、ある値がある属性の値であるかどうかの妥当性を示すモデルである。もう一つは、属性間の関係性の強さを表す属性間関係性モデル1172である。属性間関係性モデル1172は、ある属性が出現しているときに出現し易い他の属性を示すモデルである。 Hereinafter, the learning process of the numerical attribute extraction model in S13 will be described. In this embodiment, the following two types of models are used as the numerical attribute extraction model. One is an attribute / attribute value relationship model 1171 indicating the strength of the relationship between attributes and attribute values. The attribute / attribute value relationship model 1171 is a model indicating validity of whether a certain value is a value of an attribute. The other is an inter-attribute relationship model 1172 representing the strength of the relationship between attributes. The inter-attribute relationship model 1172 is a model indicating other attributes that are likely to appear when a certain attribute appears.
 図6に属性/属性値関係性モデル1171の学習処理の処理フローを示す。 FIG. 6 shows a processing flow of learning processing of the attribute / attribute value relationship model 1171.
 With reference to the teacher data 116, it is checked whether all attributes have been processed; if so, the entire processing ends. If an unprocessed attribute remains, the process proceeds to S1302 (S1301).
 One of the unprocessed attributes is selected, and all attribute values for the selected attribute are acquired (S1302).
 It is checked whether all of the at least one predetermined type of model have been processed; if so, the process proceeds to S1305. If an unprocessed model remains, the process proceeds to S1304 (S1303). Here, a model refers to a type of probability distribution, such as the normal distribution. Since the distribution of attribute values differs in character from attribute to attribute, it is preferable to prepare several types of models in advance and select the most appropriate one.
 For an unprocessed model, the most appropriate parameters are calculated from all the attribute values acquired in S1302 (S1304). In this step, the parameters of each model are determined by maximum likelihood estimation over the attribute values acquired in S1302. In this embodiment, a normal distribution, a log-normal distribution, and a rectangular distribution are used as models. The normal and log-normal distributions each have two parameters, and the rectangular distribution has three. For the normal and log-normal distributions, the distribution can be determined directly from statistics such as the mean and the standard deviation; for the rectangular distribution, appropriate parameters are found by search while varying the parameters.
 For each attribute, the most appropriate model is selected from the preset models based on the parameters calculated in S1304 and is stored in the attribute/attribute-value relationship model 1171 (S1305). FIG. 7 shows an example of the attribute/attribute-value relationship model 1171. Each row of the model holds, for an attribute 11711, the most appropriate model type 11712 and its parameter group 11713. Various methods can be used to judge the appropriateness of a model; in this embodiment, AIC is used as an example. For ALT in the illustrated example, the log-normal distribution is selected as the most appropriate model from each model's maximum likelihood value and number of parameters.
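The model fitting and AIC-based selection of S1304 and S1305 can be sketched as follows. This is a minimal illustration, not the embodiment's code: only the normal and log-normal models are fitted (both by closed-form maximum likelihood), the rectangular distribution and its parameter search are omitted, and all function names are hypothetical.

```python
import math

def fit_normal(xs):
    # Closed-form maximum-likelihood estimates: mean and (biased) std deviation.
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    ll = sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
             - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)
    return {"mu": mu, "sigma": sigma}, ll

def fit_lognormal(xs):
    # MLE on the log-transformed values; valid only for positive data.
    logs = [math.log(x) for x in xs]
    params, ll_log = fit_normal(logs)
    # log f(x) = log f_normal(log x) - log x  (Jacobian of the transform)
    ll = ll_log - sum(logs)
    return params, ll

def select_model(xs):
    # AIC = 2k - 2 ln L; the model with the smallest AIC wins.
    candidates = []
    for name, fitter, k in [("normal", fit_normal, 2),
                            ("lognormal", fit_lognormal, 2)]:
        params, ll = fitter(xs)
        candidates.append((2 * k - 2 * ll, name, params))
    aic, name, params = min(candidates)
    return name, params, aic
```

For strongly right-skewed samples, such as many laboratory values, the log-normal model attains the smaller AIC and is selected.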
 Although the above description assumed continuous probability distributions and described a parametric estimation method that selects among them by an information criterion such as AIC, a nonparametric estimation method that assumes no particular distribution is also conceivable.
 One example of a nonparametric estimation method uses the K-nearest-neighbor method, a technique that selects the K cases closest to a given case and performs classification and the like based on the classes of the selected cases. In this embodiment, for a given value, the teacher data 116 is consulted and the K rows whose values are closest to it are selected. The attribute column 1154 of the selected rows is acquired and tallied; if a certain attribute occurs k times among the selected rows, k/K is taken as the confidence of that value for that attribute. With this method, no attribute/attribute-value relationship model is built explicitly; the necessary figures are obtained by performing the above calculation within the attribute/attribute-value pair determination of S15. A nonparametric estimation method other than the K-nearest-neighbor method, such as a kernel method, may also be used.
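The K-nearest-neighbor confidence k/K described above might be sketched as below, under the assumption that the teacher data 116 can be reduced to rows of (numeric value, attribute); the names and the value of K are illustrative.

```python
def knn_attribute_confidence(value, labeled_values, K=5):
    """labeled_values: list of (numeric_value, attribute) rows from the
    teacher data.  Returns {attribute: k/K} tallied over the K rows whose
    values are closest to the given value."""
    nearest = sorted(labeled_values, key=lambda row: abs(row[0] - value))[:K]
    counts = {}
    for _, attr in nearest:
        counts[attr] = counts.get(attr, 0) + 1
    return {attr: count / K for attr, count in counts.items()}
```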
 As another example, there are cases in which an attribute value is numeric but is best treated as categorical data. For instance, versions of a drug with different amounts of active ingredient may exist, such as "drug name 250 mg" and "drug name 100 mg". In such a case, "drug name" is the attribute of "250 mg", but since only a few distinct attribute values exist, it is desirable to handle them in the same way as non-numeric attribute values. Each individual value, such as "250 mg" or "100 mg", then occurs many times; this can be detected by treating the data as categorical when the total number of observed attribute values divided by the number of distinct attribute values exceeds a predetermined threshold. For categorical data, the appearance probability of each attribute value can be obtained by conventional maximum likelihood estimation.
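The categorical-versus-continuous test described above (total occurrences divided by distinct values, compared against a threshold) is a one-liner; the threshold below is an arbitrary placeholder.

```python
def looks_categorical(values, threshold=10.0):
    """Heuristic from the text: an attribute whose values repeat heavily
    (total occurrences / distinct values above a threshold) is treated as
    categorical rather than continuous."""
    return len(values) / len(set(values)) > threshold
```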
 FIG. 8 shows the processing flow of the learning of the inter-attribute relationship model.
 Attribute/attribute-value pairs are acquired from the attribute/attribute-value pair list 115 (S1311).
 Attribute 2-grams are extracted according to the order in which the attribute values appear in the text (S1312). An attribute 2-gram is an ordered pair such as <ALT, AST>, obtained when an attribute value whose attribute is "AST" appears immediately after an attribute value whose attribute is "ALT". A high frequency for this 2-gram indicates that "AST" tends to appear after "ALT".
 The frequencies of the extracted attribute 2-grams are tallied (S1313).
 The conditional probabilities concerning the order of appearance of attributes, that is, the conditional probability of the attribute of the value that appears next given that a value of a certain attribute has appeared, are calculated and stored in the inter-attribute relationship model 1172. An example of the inter-attribute relationship model 1172 is shown in FIG. 9. For example, all 2-grams whose first element is ALT are collected and their frequencies summed; the frequency of <ALT, AST> divided by this sum is the conditional probability that "AST" appears given that "ALT" has appeared. The larger this value, the more likely "AST" is to appear after "ALT".
 In the above example, the order of appearance was represented with 2-grams, that is, a simple (first-order) Markov process was assumed; a second-order or higher Markov process may also be used. Furthermore, instead of expressing the relationship between attributes as an order relation, a model can be used in which, for example, the probability is raised whenever two attributes appear within a fixed distance of each other.
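Steps S1312 and S1313 and the conditional probabilities stored in the inter-attribute relationship model 1172 can be sketched as follows; the helper is hypothetical, assuming each document yields one ordered list of attributes.

```python
from collections import Counter

def bigram_conditional_probs(attribute_sequences):
    """attribute_sequences: one list of attributes per document, in order
    of appearance of their values.  Returns {(a, b): P(next=b | current=a)}."""
    bigrams = Counter()
    for seq in attribute_sequences:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
    # Sum of frequencies of all 2-grams sharing the same first element.
    totals = Counter()
    for (a, _), c in bigrams.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in bigrams.items()}
```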
 The processing of S15 is now described in detail. In S15, attribute/attribute-value pairs are extracted as follows. Consider first the simplest case, in which the new text 114 contains exactly one numeric value as an attribute-value candidate and several attribute candidates. In such a case, the appropriate attribute can be determined with the attribute/attribute-value relationship model as follows: applying each attribute's conditional probability to the numeric value of the candidate yields a likelihood for each attribute, and the attribute with the highest likelihood is selected as the attribute of the candidate value. Supervised learning can also be performed using not only the distribution of attribute values for each attribute but also the presence or absence of grammatical dependency and the distance between attribute and attribute value (the number of intervening characters or words, the distance in the syntax tree, and so on).
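For the simplest case described above, the decision reduces to an argmax over per-attribute likelihoods. A sketch, assuming purely for illustration that every attribute's value distribution is a normal with made-up parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def best_attribute(value, attribute_models):
    """attribute_models: {attribute: (mu, sigma)} -- each attribute is
    modelled here by a normal distribution only, for brevity.
    Returns the attribute whose distribution gives the value the
    highest likelihood."""
    return max(attribute_models,
               key=lambda a: normal_pdf(value, *attribute_models[a]))
```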
 In practice, however, multiple attribute candidates and multiple attribute-value candidates usually coexist. In such cases, the above method of selecting the highest-scoring attribute independently for each attribute-value candidate is prone to error, and an appropriate combination must be determined. Because exhaustive combination search requires a very large amount of computation, a method that determines an appropriate combination at a practical computational cost is needed. In this embodiment, an appropriate combination is determined by the following two methods and by their combination.
 The first method is based on the inter-attribute relationship model 1172. The combination search can be made efficient using information on whether the relationships between attributes were strong or weak in the correct-answer texts, that is, in past cases. The following describes a method that applies a Markov tagger of the kind used for part-of-speech tagging of sentences.
 FIG. 11 shows the processing flow of attribute/attribute-value pair determination by the Markov tagger.
 From the attribute-value candidates extracted from the new text 114, the sequence of candidates in order of appearance is acquired (S1501). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
 Using the attribute/attribute-value relationship model, the conditional probability of each attribute is calculated for every attribute-value candidate in the sequence, and the attributes with high probability are acquired (S1502). In the case of FIG. 10, a list such as "ALT", "AST", "AGE" is obtained for the attribute-value candidate "6".
 Using the inter-attribute relationship model, probabilities are acquired for the possible transitions between attributes (S1503). In the case of FIG. 10, transitions such as [ALT, ALT], [ALT, AST], [ALT, AGE], [AST, ALT], and [AST, AST] are possible, and the probability of each is obtained by consulting the inter-attribute relationship model.
 The optimal sequence of attribute transitions is determined (S1504). Since the optimal sequence can be computed with the Viterbi algorithm, a detailed description is omitted. Through this processing, the attribute sequence [ALT, AST] is found to be the most probable for the attribute-value candidate sequence [6, 20], and as a result the two attribute/attribute-value pairs <ALT, 6> and <AST, 20> are obtained.
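The S1501 to S1504 flow is standard Viterbi decoding, with the attribute/attribute-value relationship model supplying emission scores and the inter-attribute relationship model supplying transition scores. A minimal sketch; the smoothing floor for unseen attributes and transitions is an assumption:

```python
def viterbi(values, emission, transition, attributes):
    """emission[t][a]: score of attribute a for the t-th value;
    transition[(a, b)]: score of b following a.  Missing entries get a
    small floor so unseen transitions are penalised, not forbidden."""
    FLOOR = 1e-6
    # best[t][a] = (score of the best path ending in a at position t, backpointer)
    best = [{a: (emission[0].get(a, FLOOR), None) for a in attributes}]
    for t in range(1, len(values)):
        layer = {}
        for b in attributes:
            score, prev = max(
                (best[t - 1][a][0] * transition.get((a, b), FLOOR)
                 * emission[t].get(b, FLOOR), a)
                for a in attributes)
            layer[b] = (score, prev)
        best.append(layer)
    # Trace back from the highest-scoring final state.
    last = max(best[-1], key=lambda a: best[-1][a][0])
    path = [last]
    for t in range(len(values) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return list(reversed(path))
```

With emission scores resembling the FIG. 10 example, the decoder recovers [ALT, AST] even though "ALT" scores highest for the value "20" in isolation.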
 FIG. 10 is a conceptual illustration of numeric attribute extraction performed on the new text 114. In this example, "ALT" and "AST" have been extracted as attribute candidates and "6" and "20" as attribute values. FIG. 10 depicts a Markov tagger assigning a sequence of attributes to this sequence of attribute values. Given the sequence of attribute-value candidates extracted from the text, the likely attribute candidates for each value are obtained from the attribute/attribute-value relationship model 1171. In the example of FIG. 10, for the attribute value "6", the attributes "ALT", "AST", and "AGE" are obtained in descending order of plausibility. For the attribute value "20", however, the most plausible attribute is "ALT", so an erroneous result would be obtained.
 Therefore, the inter-attribute relationship model 1172 is used to perform the following processing.
 In the processing flow shown in FIG. 11, the attribute candidates in the text are not used explicitly; for the order between attributes, the orders that appeared frequently in past data such as the correct-answer texts 113 take priority. However, compared with determining the order of parts of speech in a sentence, for which Markov taggers are ordinarily used, there is far more freedom in how an order of attributes may be written. A method that makes more active use of the attribute-candidate information obtained from the new text 114 is therefore conceivable. Attribute/attribute-value pair determination by DP matching is described below.
 FIG. 12 shows the processing flow of attribute/attribute-value pair determination by DP matching.
 From the attribute-value candidates extracted from the new text 114, the sequence of candidates in order of appearance is acquired (S1511). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
 From the attribute candidates extracted from the new text 114, the sequence of candidates in order of appearance is acquired (S1512). In the case of FIG. 10, a sequence such as [ALT, AST] is obtained.
 The strength of the relationship of every attribute/attribute-value pair is acquired from the attribute/attribute-value relationship model, and a matrix of the acquired values is generated (S1513). Specifically, the score for the pair of the I-th attribute and the J-th attribute value is stored in element <I, J> of the matrix.
 The optimal sequence of attribute assignments is determined (S1514). Since the sequence can be determined by DP matching, the description is omitted.
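One way to realize S1511 to S1514 is an order-preserving alignment between the attribute sequence and the attribute-value sequence, computed by dynamic programming; unmatched items simply contribute nothing. The following sketch is an assumption about the intended DP-matching scheme, not the embodiment's exact procedure:

```python
def dp_match(attrs, values, score):
    """Align the attribute sequence to the value sequence, preserving order.
    score[(i, j)]: strength of pairing attrs[i] with values[j], taken from
    the attribute/attribute-value relationship model.  Unpaired items cost 0."""
    n, m = len(attrs), len(values)
    # best[i][j]: best total score using attrs[:i] and values[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j],          # skip attrs[i-1]
                             best[i][j - 1],          # skip values[j-1]
                             best[i - 1][j - 1] + score.get((i - 1, j - 1), 0.0))
    # Trace back to recover the chosen pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = score.get((i - 1, j - 1), 0.0)
        if s > 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((attrs[i - 1], values[j - 1]))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```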
 The attribute/attribute-value pair determination processes of FIGS. 11 and 12 can also be used in combination. For example, the method of FIG. 12, which uses the attribute candidates in the new text as constraints, is executed first, and the method of FIG. 11, which consults the order relations of attributes in past data, is applied only to the portions of low reliability, for example where the value set in the matrix is small or where the correspondence obtained by DP matching is not one-to-one.
 Alternatively, instead of using the attribute-candidate information in the new text as a hard constraint, it can be used to modify the inter-attribute relationship model 1172 temporarily. In this method, a sequence of attribute candidates is created, and for every pair of attributes on this sequence the corresponding value in the inter-attribute relationship model 1172 is increased by a fixed ratio; attribute pairs that did not exist in the model 1172 are newly added with a predetermined value as their probability. With this processing, transitions between attribute candidates present in the new text 114 are given priority, and attribute transitions not contained in the past data, namely the correct-answer texts 113, can also be handled appropriately.
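The temporary modification of the inter-attribute relationship model 1172 described here might look like the following; the boost factor and the default probability for newly added pairs are illustrative placeholders.

```python
def boost_transitions(transition, candidate_attrs, factor=2.0, default=0.01):
    """Raise by a fixed ratio the transition values between attributes that
    occur as candidates in the new text; pairs absent from the model are
    added with a predetermined default probability."""
    boosted = dict(transition)
    for a in candidate_attrs:
        for b in candidate_attrs:
            if (a, b) in boosted:
                boosted[(a, b)] *= factor
            else:
                boosted[(a, b)] = default
    return boosted
```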
 For simplicity, the above embodiments assumed that attribute/attribute-value pairs are one-to-one. However, cases such as <BP, 141/70> also exist: BP is blood pressure, 141 the systolic pressure, and 70 the diastolic pressure. Such cases can be handled by generating virtual labels such as BP_1 and BP_2 for BP and applying the processing from step 13 onward to the virtual labels.
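The virtual-label device for compound values such as <BP, 141/70> can be sketched as below; the split on "/" and the BP_1/BP_2 label format follow the example in the text, while the function name is hypothetical.

```python
def expand_compound(attribute, raw_value):
    """Split a compound value such as '141/70' for BP into virtual labels
    BP_1, BP_2 so that each part can be modelled independently."""
    parts = raw_value.split("/")
    if len(parts) == 1:
        return [(attribute, float(raw_value))]
    return [(f"{attribute}_{k}", float(p)) for k, p in enumerate(parts, start=1)]
```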
 With the above configuration and processing, attribute/attribute-value pairs whose attributes are numeric can be extracted quickly and with high accuracy.
100 numerical attribute extraction device
101 CPU
102 main memory
103 input/output device
110 disk device
111 OS
112 numerical attribute extraction program
1121 correct-pair extraction subprogram
1122 teacher data creation subprogram
1123 model learning subprogram
1124 attribute/attribute-value candidate extraction subprogram
1125 attribute/attribute-value pair determination subprogram
113 correct-answer text
114 new text
115 attribute/attribute-value pair list
116 teacher data
117 numerical attribute model
1171 attribute/attribute-value relationship model
1172 inter-attribute relationship model

Claims (10)

  1.  A natural language processing apparatus for analyzing the document structure of input text data, comprising:
     an extraction unit that extracts candidate pairs of a numerical attribute and an attribute value from the text data;
     a first calculation unit that calculates, based on a first model indicating the distribution of attribute values corresponding to a given numerical attribute, a value indicating the validity of each extracted candidate pair of a numerical attribute and an attribute value; and
     a determination unit that determines the pair of the numerical attribute and the attribute value from among the candidates based on the value indicating validity.
  2.  The natural language processing apparatus according to claim 1, wherein the first model is a model estimated by learning, as teacher data, text data containing information that identifies pairs of a numerical attribute and an attribute value.
  3.  The natural language processing apparatus according to claim 2, wherein the first model is a model estimated by a parametric estimation method or a nonparametric estimation method.
  4.  The natural language processing apparatus according to claim 3, further comprising:
     a second calculation unit that calculates, based on a second model, a value indicating the strength of the relationships between the numerical attributes,
     wherein the determination unit determines the pair of the numerical attribute and the attribute value based on the values calculated by the first and second calculation units.
  5.  The natural language processing apparatus according to claim 4, wherein the second model is a model calculated based on the frequency with which a plurality of numerical attributes contained in the teacher data appear within a predetermined distance of each other.
  6.  A natural language processing method for analyzing the document structure of text data input to a natural language processing apparatus, wherein the natural language processing apparatus:
     extracts candidate pairs of a numerical attribute and an attribute value from the input text data;
     calculates, based on a first model indicating the distribution of attribute values corresponding to a given numerical attribute, a value indicating the validity of each extracted candidate pair of a numerical attribute and an attribute value; and
     determines the pair of the numerical attribute and the attribute value from among the candidates based on the value indicating validity.
  7.  The natural language processing method according to claim 6, wherein the first model is a model estimated by learning, as teacher data, text data containing information that identifies pairs of a numerical attribute and an attribute value.
  8.  The natural language processing method according to claim 7, wherein the first model is a model estimated by a parametric estimation method or a nonparametric estimation method.
  9.  The natural language processing method according to claim 8, wherein the natural language processing apparatus further calculates, based on a second model, a value indicating the strength of the relationships between the numerical attributes, and determines the pair of the numerical attribute and the attribute value based on the two calculated values.
  10.  The natural language processing method according to claim 9, wherein the second model is a model calculated based on the frequency with which a plurality of numerical attributes contained in the teacher data appear within a predetermined distance of each other.