WO2018025317A1 - Natural language processing device and natural language processing method - Google Patents

Natural language processing device and natural language processing method Download PDF

Info

Publication number
WO2018025317A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
model
natural language
language processing
numerical
Prior art date
Application number
PCT/JP2016/072583
Other languages
French (fr)
Japanese (ja)
Inventor
康嗣 森本
利彦 柳瀬
芳樹 丹羽
利昇 三好
Original Assignee
Hitachi, Ltd. (株式会社日立製作所)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi, Ltd.
Priority to JP2018531005A priority Critical patent/JP6546703B2/en
Priority to PCT/JP2016/072583 priority patent/WO2018025317A1/en
Publication of WO2018025317A1 publication Critical patent/WO2018025317A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to a syntactic analysis technique for natural language processing.
  • in particular, the present invention relates to a technique for extracting attribute / attribute value pairs from text.
  • the amount of digitized documents accessible to users is increasing with the spread of personal computers and the Internet. Electronic documents are unstructured data and are difficult to handle with a computer. Therefore, in order to effectively use a large amount of digitized documents by structuring them, expectations for natural language processing are increasing.
  • One technique for structuring digitized documents is attribute / attribute value extraction.
  • the attribute / attribute value extraction is a technique for extracting pairs of an attribute such as “sex” and an attribute value such as “female” from text. It is used, for example, to extract from a variety of information sources the attributes describing a product's specifications, together with their attribute values, and to display them in an integrated manner. Among such attributes, an attribute whose attribute value is a numerical value is called a numerical attribute.
  • for product specifications, weight and size are numerical attributes. Such numerical attributes are quantitative and objective information, and are particularly valuable among attribute / attribute value information. In view of this importance, known techniques exist for automatically extracting attribute / attribute value pairs, including those of numerical attributes. These known techniques are described below.
  • Non-Patent Document 1 describes the most standard technique for extracting attribute / attribute value pairs of numerical attributes, which uses syntactic information as a clue.
  • Non-Patent Document 1 discloses, for example, a technique in which, for a sentence such as “Weight is 10 Kg”, the numeric character string “10 Kg” is first extracted as an attribute value candidate, and “Weight” is then recognized as the attribute from the result of parsing.
  • however, a parsing technique for obtaining syntactic information cannot always produce a correct analysis at the current technical level. Therefore, a semantic method that uses the strength of the relationship between attributes and attribute values as a clue is required.
  • as known techniques for extracting attribute / attribute value pairs using information other than syntactic information, there is a technique that uses “proximity” as a clue and a technique that uses the “correlation” between attributes and attribute values as a clue.
  • in Non-Patent Document 2, as a first step, phrases that are candidates for attributes or attribute values are extracted from the text. As a second step, the correspondence between attribute candidates and attribute value candidates is determined. In particular, for the second step, a technique is disclosed that uses “syntactic information”, “proximity”, and “correlation” in an appropriate combination. “Proximity” is an approximation of syntactic information, and “correlation” is semantic information.
  • Non-Patent Document 2 uses the correlation indicating the strength of the semantic relationship between attributes and attribute values, so that attributes and attribute values can be associated even when syntactic information cannot be used as a clue.
  • for attributes whose values are not numerical, the set of possible attribute values is finite and the number of value types is not very large. Therefore, the strength of the relationship between the attribute and the attribute value can be enumerated for all attribute values. For example, when “manufacturing company” is considered as an attribute of a certain product, attribute values such as “Company A”, “Company B”, and “Company C” exist, and the strength of the relationship can be considered for each attribute / attribute value pair, such as <manufacturing company, Company A>, <manufacturing company, Company B>, and <manufacturing company, Company C>. Specifically, a statistic such as pointwise (self-) mutual information may be calculated based on the appearance frequency of each pair. By learning the strength of such relationships from correct-answer data or the like, it becomes possible to extract attribute / attribute value pairs from text.
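  • The co-occurrence statistic mentioned above can be sketched as follows. This is a minimal illustration assuming pointwise mutual information over invented pair counts; it is not code from the patent itself.

```python
import math
from collections import Counter

def pmi_scores(pairs):
    """Pointwise mutual information for each observed attribute / attribute-value pair.

    pairs: (attribute, attribute value) occurrences, e.g. taken from correct-answer data.
    """
    pair_counts = Counter(pairs)
    attr_counts = Counter(a for a, _ in pairs)
    val_counts = Counter(v for _, v in pairs)
    n = len(pairs)
    return {(a, v): math.log2((c / n) / ((attr_counts[a] / n) * (val_counts[v] / n)))
            for (a, v), c in pair_counts.items()}

# Invented occurrence data.
pairs = ([("manufacturing company", "Company A")] * 4
         + [("manufacturing company", "Company B")] * 2
         + [("color", "red")] * 3
         + [("color", "Company B")])
scores = pmi_scores(pairs)
# A pair that co-occurs more often than chance gets a higher score.
assert scores[("manufacturing company", "Company A")] > scores[("color", "Company B")]
```

Pairs with a high PMI score are those whose attribute and attribute value co-occur more often than chance, which is exactly the strength-of-relationship signal described above.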
  • in contrast, the numerical values that serve as attribute values are in principle continuous, so there are infinitely many possible values.
  • even when the attribute value is represented by an integer and is therefore discrete, the number of distinct attribute values becomes very large. It therefore cannot be expected that all attribute values appear in the data used for learning, and it is difficult to learn in advance the strength of the relationship between an attribute and its attribute values by the method used for non-numerical attributes.
  • An object of the present invention is to extract attribute / attribute value pairs of numerical attributes accurately and at high speed by providing means for calculating the strength of the relationship between a numerical attribute and an attribute value.
  • a representative example of the present invention is a natural language processing apparatus that analyzes the document structure of input text data, comprising: an extraction unit that extracts candidates for pairs of a numerical attribute and an attribute value from the text data; a calculation unit that calculates values indicating the validity of the extracted numerical attribute and attribute value pair candidates, based on a model representing the distribution of attribute values corresponding to a predetermined numerical attribute;
  • and a determination unit that determines a pair of a numerical attribute and an attribute value from the extracted candidates, based on the calculated values indicating validity.
  • according to the present invention, attribute / attribute value pairs of numerical attributes appearing in text can be extracted with high accuracy and at high speed. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.
  • the structure of a numerical attribute extraction apparatus is shown. An outline of numerical attribute extraction processing executed by the numerical attribute extraction program will be described.
  • the flowchart of the numerical value attribute extraction process by a numerical value attribute extraction program is shown. A configuration example of correct text is shown.
  • the structural example of an attribute / attribute value pair list is shown.
  • the flowchart of an attribute / attribute value relationship model learning process is shown.
  • the structural example of an attribute / attribute value relationship model is shown.
  • the flowchart of a relationship model learning process between attributes is shown.
  • a configuration example of the inter-attribute relationship model is shown. An explanatory diagram of the attribute / attribute value pair determination process is shown.
  • the flowchart of an attribute / attribute value pair determination process is shown.
  • the flowchart of an attribute / attribute value pair determination process is shown.
  • a method of extracting attribute / attribute value pairs of numerical attributes from text is disclosed. More specifically, a distribution of attribute values for each attribute is learned as a model, and a numerical attribute is determined based on the model. A method of determining is disclosed.
  • One of the objects of the present disclosure is to extract attribute / attribute value pairs with high accuracy by learning the strength of the relationship between the attributes of attribute values and attribute values from correct data.
  • An example of the technology of the present disclosure is to learn the distribution of attribute values for each attribute from correct data as a model, and when a certain numerical value is given, the strength of the relationship between the given numerical value and each attribute is used as the posterior probability. Calculate and extract attributes based on the calculated posterior probabilities.
  • the accuracy of attribute / attribute value pair extraction can be improved by learning the strength of the relationship between attributes from correct data as a model and considering the order of appearance of attributes.
  • FIG. 1 shows the configuration of the numerical attribute extraction device 100 of the present embodiment.
  • the numerical attribute extraction device 100 includes a CPU 101, a main memory 102, an input / output device 103, and a disk device 110.
  • the CPU 101 is a processor, and executes various processes by executing programs stored in the main memory 102. Specifically, the CPU 101 loads a program stored in the disk device 110 into the main memory 102 and executes the program loaded in the main memory 102. The program may be loaded from the external server to the main memory 102 via the network.
  • the main memory 102 stores programs executed by the CPU 101 and data required by the CPU 101.
  • the input / output device 103 receives input of information from the user and outputs information in response to an instruction from the CPU 101.
  • the input / output device 103 is an input device such as a keyboard and a mouse, and an output device such as a display.
  • the disk device 110 is an auxiliary memory including a computer-readable non-transitory storage medium.
  • the disk device 110 stores various programs and various data. Specifically, the disk device 110 stores an OS 111, a numerical attribute extraction program 112, a correct text 113, a new text 114, an attribute / attribute value pair list 115, teacher data 116, and a numerical attribute extraction model 117.
  • the numerical attribute extraction model 117 includes an attribute / attribute value relationship model 1171, which is used for determining whether a pair of an attribute candidate and an attribute value candidate is truly an attribute / attribute value pair, and an inter-attribute relationship model 1172, which indicates which attributes are likely to appear together.
  • the numerical attribute extraction program 112 determines whether each word / phrase pair included in the new text 114 is an attribute / attribute value pair, and extracts the word / phrase pairs so determined as attribute / attribute value pairs of numerical attributes.
  • the numerical attribute extraction program 112 includes a correct answer pair extraction subprogram 1121, a teacher data creation subprogram 1122, a model learning subprogram 1123, an attribute / attribute value candidate extraction subprogram 1124, and an attribute / attribute value pair determination subprogram 1125. The processing of these programs will be described in detail with reference to FIG.
  • the CPU 101 realizes a predetermined function by executing the above program.
  • a program performs a predetermined process by being executed by the processor. Therefore, in the present disclosure, a description with a program as the subject may also be read as a description with the CPU 101 or the numerical attribute extraction device 100 as the subject.
  • the CPU 101 operates as a functional unit (means) that realizes a predetermined function by operating according to a program.
  • the CPU 101 functions as a numerical attribute extraction unit (numerical attribute extraction means) by operating according to the numerical attribute extraction program 112.
  • the numerical attribute extraction apparatus 100 is an apparatus including these functional units (means).
  • the correct text 113 is data input to the numerical attribute extraction program 112 and is used to learn a model for extracting numerical attribute attribute / attribute value pairs.
  • the correct text 113 is text annotated with the information necessary to specify the attributes of numerical attributes and the appearance positions of the corresponding attribute values, and may have an arbitrary format.
  • the correct text 113 is constructed, for example, by inserting tags indicating attributes or attribute values into the text, or by preparing, separately from the text, a table indicating the start and end positions of each attribute or attribute value.
  • the new text 114 is the text from which attribute / attribute value pairs of numerical attributes are to be extracted. Usually, a new text different from the correct text 113 is the target. After attribute / attribute value pair extraction has been executed for a new text, the new text can also be registered as correct text 113.
  • the attribute / attribute value pair list 115 is a list, in order of appearance position in the text, of the character string pairs indicating the attributes and attribute values extracted from the correct text 113.
  • the positional relationship between pairs included in different texts is arbitrary.
  • for each pair, the character strings, their appearance positions in the text, the unit of the attribute value, and the like are stored.
  • the teacher data 116 has the same format as the attribute / attribute value pair list.
  • Attribute normalization means unifying different notations such as “ANION GAP” and “Anion Gap”.
  • the normalization of attribute values aligns numeric character strings that have different appearances, such as “300 mg” and “0.3 g”, to a single unit.
  • the numerical attribute extraction model 117 is data used by the attribute / attribute value pair determination subprogram 1125 and includes an attribute / attribute value relationship model 1171 and an inter-attribute relationship model 1172.
  • the attribute / attribute value relationship model 1171 indicates a criterion for determining whether or not two words are a pair of corresponding attribute and attribute value. Specifically, the attribute / attribute value relationship model 1171 expresses a distribution of attribute values for each attribute.
  • the inter-attribute relationship model 1172 indicates the strength of the relationship between two attributes, that is, which other attributes are likely to appear when a certain attribute appears. Specifically, the inter-attribute relationship model 1172 expresses a conditional probability of a subsequent appearing attribute with respect to an appearing attribute.
  • FIG. 2 shows an outline of the numerical attribute extraction processing executed by the numerical attribute extraction device 100.
  • the correct answer pair extraction subprogram 1121 acquires attribute / attribute value pairs from the correct text 113 and generates an attribute / attribute value pair list 115.
  • the teacher data creation subprogram 1122 refers to the attribute / attribute value pair list 115, identifies the attribute and attribute value to be normalized, and normalizes the attribute and attribute value.
  • the model learning subprogram 1123 reads the teacher data 116 and learns the numerical attribute extraction model 117.
  • the attribute / attribute value pair candidate extraction subprogram 1124 reads the new text 114 and extracts character strings that are candidates for attributes and attribute values.
  • the attribute / attribute value pair determination subprogram 1125 receives a character string that is an attribute and attribute value candidate from the attribute / attribute value pair candidate extraction subprogram 1124, and uses the numeric attribute extraction model 117 to select the attribute candidate and the attribute value candidate. It is determined whether the pair is an attribute / attribute value pair.
  • the word / phrase pairs determined to be attribute / attribute value pairs are stored in the correct text 113, after a manual check if necessary.
  • FIG. 3 shows a flowchart of numerical attribute extraction processing by the numerical attribute extraction program 112.
  • the correct answer pair extraction subprogram 1121 acquires the attribute / attribute value pair of the numerical attribute from the correct answer text 113 as a correct answer, and outputs it as an attribute / attribute value pair list (S11).
  • FIG. 4 shows an example of the correct text. In the example of FIG. 4, the correct text consists of the text itself and a list of the appearance positions of attribute / attribute value pairs in the text. Although the correct text is created manually here, attribute / attribute value pairs may instead be extracted from the text by conventional techniques that use syntax analysis results or pattern matching, and a text in which only high-confidence pairs are treated as correct answers may be used.
  • FIG. 5 shows an example of the attribute / attribute value pair list 115.
  • the attribute / attribute value pair list 115 is data in which attributes, attribute values, and additional information extracted from the correct text are stored in each row.
  • the first line indicates that, in the sentence with document ID 1 and sentence ID 1, there is an attribute / attribute value pair whose numerical attribute is “ALT” (alanine aminotransferase), whose attribute value is “7”, and whose unit is “U/L”.
  • the teacher data creation subprogram 1122 generates teacher data 116 from the acquired attribute / attribute value pair list 115 (S12).
  • the attribute and the attribute value are normalized.
  • for attribute normalization, synonym extraction and spelling-variation extraction techniques known in the prior art are used. In the example of FIG. 5, “ALT” and “AlT (SGPT)” are normalized to “ALT”.
  • for the attribute value, the numeric character string extracted from the text is first converted into a numeric value. Then, referring to the units, if the units are not uniform, they are unified into a standard unit and the numeric values are converted accordingly. For example, if there are attribute values such as “300 mg” and “0.3 g”, the units “mg” and “g” are aligned to, for example, “mg”, and “0.3” is converted to “300”.
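  • A minimal sketch of this unit normalization, assuming a hypothetical unit table limited to mg / g / kg (a real system would cover every unit appearing in the teacher data):

```python
import re

# Hypothetical unit table mapping each unit to the base unit "mg".
UNIT_TO_MG = {"mg": 1.0, "g": 1000.0, "kg": 1_000_000.0}

def normalize_value(text):
    """'0.3g' -> (300.0, 'mg'); None if the text is not a number with a known unit."""
    m = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(mg|g|kg)\s*", text)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value * UNIT_TO_MG[unit], "mg"

value_mg, unit = normalize_value("0.3g")
assert unit == "mg" and abs(value_mg - 300.0) < 1e-9
```

After this step, “300 mg” and “0.3 g” compare equal, so the distribution of attribute values for an attribute can be learned in a single unit.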
  • the model learning subprogram 1123 learns the numerical attribute extraction model 117 from the teacher data 116 (S13).
  • the numerical attribute extraction model 117 includes an attribute / attribute value relationship model 1171 and an inter-attribute relationship model 1172. Details of step S13 will be described later with reference to FIG.
  • the attribute / attribute value candidate extraction subprogram 1124 extracts attributes or attribute value candidates from the new text 114 from which the attribute / attribute value pair is to be extracted (S14).
  • an information extraction technique known as a conventional technique can be used, and a description thereof will be omitted.
  • candidates for attributes or attribute values are extracted as entities.
  • the attribute / attribute value pair determination subprogram 1125 determines whether each attribute / attribute value candidate pair extracted in step S14 is a true attribute / attribute value pair, and extracts the attribute / attribute value pairs (S15). Details of step S15 will be described later with reference to FIGS.
  • the numerical attribute extraction model 117 consists of two models. One is the attribute / attribute value relationship model 1171, which indicates the strength of the relationship between attributes and attribute values.
  • the attribute / attribute value relationship model 1171 is a model indicating the validity of a certain value being a value of a certain attribute.
  • the other is the inter-attribute relationship model 1172, which represents the strength of the relationship between attributes.
  • the inter-attribute relationship model 1172 is a model indicating which other attributes are likely to appear when a certain attribute appears.
  • FIG. 6 shows a processing flow of learning processing of the attribute / attribute value relationship model 1171.
  • if all models have been processed, the process proceeds to S1305; if there is an unprocessed model, the process proceeds to S1304 (S1303).
  • the model refers to a type of probability distribution such as a normal distribution. Since the distribution of attribute values differs for each attribute, it is preferable to set different types of models in advance and select the most appropriate model.
  • for the unprocessed model, the most appropriate parameters are calculated from all the attribute values acquired in S1302 (S1304).
  • specifically, the parameters of each model are determined by applying maximum likelihood estimation to the attribute values acquired in S1302.
  • a normal distribution, a log normal distribution, and a rectangular distribution are used as models.
  • the normal distribution and lognormal distribution have two parameters, and the rectangular distribution has three parameters.
  • the parameters of the normal distribution and the lognormal distribution can be determined directly from statistics such as the mean and standard deviation.
  • for the rectangular distribution, appropriate parameters are searched for by varying the parameters.
  • the most appropriate model is selected from preset models based on the parameters calculated in S1304 and stored in the attribute / attribute value relationship model 1171 (S1305).
  • FIG. 7 shows an example of the attribute / attribute value relationship model 1171.
  • each row of the attribute / attribute value relationship model 1171 stores, for each attribute 11711, the model type 11712 and the model parameter group 11713 with the highest validity.
  • Various methods can be considered as a method for determining the validity of the model.
  • AIC is used as an example.
  • the lognormal distribution is selected as the most appropriate model from the maximum likelihood estimation value and the number of parameters of each model.
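  • Model selection by AIC can be sketched as follows, restricted for brevity to the normal and lognormal distributions (the rectangular distribution is omitted); the sample values are invented:

```python
import math

def normal_fit(xs):
    """Maximum-likelihood fit of a normal distribution; returns (log-likelihood, #params)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE variance
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return loglik, 2

def lognormal_fit(xs):
    """Fit a lognormal by fitting a normal to log(x); subtract the Jacobian term sum(log x)."""
    logs = [math.log(x) for x in xs]
    loglik, k = normal_fit(logs)
    return loglik - sum(logs), k

def select_model(xs):
    """Return the model with the smallest AIC = 2k - 2 * log-likelihood."""
    fits = {"normal": normal_fit, "lognormal": lognormal_fit}
    aic = {}
    for name, fit in fits.items():
        loglik, k = fit(xs)
        aic[name] = 2 * k - 2 * loglik
    return min(aic, key=aic.get), aic

# Right-skewed attribute values: the lognormal should win.
values = [1, 2, 2, 3, 4, 5, 8, 15, 40]
best, aic = select_model(values)
assert best == "lognormal"
```

Because both candidates here have two parameters, the AIC comparison reduces to comparing maximized log-likelihoods; the penalty term matters once models with different parameter counts, such as the rectangular distribution, are included.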
  • above, a parametric estimation method has been described, in which a probability distribution (a continuous distribution) is assumed and the model is selected based on an information criterion such as AIC.
  • a non-parametric estimation method that does not assume a specific distribution is also conceivable.
  • the K-nearest neighbor method is a technique that selects the K cases closest to a given case and performs classification or the like based on the classes of the selected cases.
  • specifically, the teacher data 116 is referred to, and the K rows whose attribute values are closest to the given value are selected.
  • then, the attribute column 1154 of the selected rows is acquired and tabulated. At this time, if a certain attribute occurs k times among the selected rows, k / K is the reliability of the given value for that attribute.
  • in this case, an attribute / attribute value relationship model is not explicitly created; instead, the necessary numerical value is obtained by performing the above calculation during the attribute / attribute value pair determination process in step S15.
  • a kernel function method may be used instead of the K-nearest neighbor method.
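  • The k / K reliability computation by the K-nearest neighbor method can be sketched as follows, with invented teacher rows:

```python
def knn_reliability(value, teacher_rows, k=3):
    """teacher_rows: (attribute, numeric attribute value) pairs from the teacher data.

    Returns {attribute: reliability}, where reliability is the fraction k/K of
    the K nearest teacher values that carry that attribute.
    """
    nearest = sorted(teacher_rows, key=lambda row: abs(row[1] - value))[:k]
    counts = {}
    for attr, _ in nearest:
        counts[attr] = counts.get(attr, 0) + 1
    return {attr: c / k for attr, c in counts.items()}

# Invented teacher rows (attribute column paired with numeric values).
rows = [("ALT", 7), ("ALT", 9), ("ALT", 12), ("AST", 20), ("AST", 25), ("AGE", 64)]
rel = knn_reliability(8, rows, k=3)
assert rel == {"ALT": 1.0}  # the three values nearest 8 (7, 9, 12) all carry ALT
```

No distributional model is stored; the reliability is recomputed from the teacher data at determination time, as described above.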
  • in the above, the attribute value is a numerical value, but it may sometimes be desirable to treat it as categorical data.
  • for example, consider drugs that contain different amounts of an active ingredient, such as “drug name 250 mg” and “drug name 100 mg”.
  • here, “drug name” is an attribute of “250 mg”, but since there are only a few types of attribute values, it is desirable to treat them in the same way as non-numerical attribute values. In such a case, the occurrence count of each numerical value, such as “250 mg” and “100 mg”, becomes large.
  • therefore, when the value obtained by dividing the total number of attribute-value occurrences by the number of distinct attribute values is greater than a predetermined threshold, the values can be treated as categorical data.
  • in that case, the appearance probability of each attribute value may be estimated by maximum likelihood, as is conventional.
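  • This categorical treatment can be sketched as follows; the threshold value is an illustrative assumption:

```python
from collections import Counter

def categorical_probabilities(values, threshold=3.0):
    """Decide whether numeric attribute values should be treated as categorical.

    If the average occurrence count per distinct value exceeds `threshold`
    (an illustrative cut-off), return maximum-likelihood appearance
    probabilities; otherwise return None so that a continuous model is used.
    """
    counts = Counter(values)
    if len(values) / len(counts) <= threshold:
        return None
    total = len(values)
    return {v: c / total for v, c in counts.items()}

# Drug doses: only two distinct values, each repeated many times.
doses = [250, 250, 250, 100, 100, 250, 100, 250] * 2
probs = categorical_probabilities(doses)
assert probs == {250: 0.625, 100: 0.375}
```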
  • Fig. 8 shows the processing flow of the learning process of the inter-attribute relationship model.
  • Attribute / attribute value pair is acquired from the attribute / attribute value pair list 115 (S1311).
  • Attribute 2-gram is extracted according to the order of appearance of the attribute value in the text (S1312).
  • the attribute 2-gram is an ordered pair; for example, <ALT, AST> is extracted when an attribute value with the attribute “AST” appears after an attribute value with the attribute “ALT”.
  • a high frequency of this 2-gram indicates that “AST” is likely to appear after “ALT”.
  • the frequency of 2-grams of the extracted attributes is totaled (S1313).
  • finally, the conditional probability related to the appearance order of attributes, that is, the conditional probability of the attribute of the next appearing attribute value given the attribute of the current attribute value, is calculated and stored in the inter-attribute relationship model 1172.
  • An example of the inter-attribute relationship model 1172 is shown in FIG. For example, all 2-grams having ALT as the first element are obtained and the sum of their frequencies is computed; the frequency of <ALT, AST> divided by this sum is the conditional probability that “AST” appears when “ALT” appears. That is, if this value is large, “AST” is likely to appear after “ALT”.
  • in this embodiment, the order of appearance is expressed using 2-grams. That is, a first-order (simple) Markov process is assumed, but a second-order or higher Markov process may also be used.
  • alternatively, a model in which the probability is high if the attribute appears at an appearance position within a certain distance can be used.
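  • The 2-gram learning steps above can be sketched as follows, using invented attribute sequences:

```python
from collections import Counter

def bigram_conditional(attr_sequences):
    """attr_sequences: for each text, the attributes in order of appearance.

    Returns {(a, b): P(b follows a)} estimated from attribute 2-gram frequencies.
    """
    bigrams = Counter()
    for seq in attr_sequences:
        bigrams.update(zip(seq, seq[1:]))   # ordered attribute 2-grams
    first_totals = Counter()                # frequency sums per first element
    for (a, _), c in bigrams.items():
        first_totals[a] += c
    return {(a, b): c / first_totals[a] for (a, b), c in bigrams.items()}

# Invented sequences of attributes in order of appearance.
seqs = [["ALT", "AST", "AGE"], ["ALT", "AST"], ["ALT", "AGE"]]
p = bigram_conditional(seqs)
assert abs(p[("ALT", "AST")] - 2 / 3) < 1e-9  # "AST" follows "ALT" in 2 of 3 texts
```

The resulting table is exactly the conditional-probability form of the inter-attribute relationship model described above.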
  • in S15, attribute / attribute value pairs are extracted as follows.
  • first, consider the case where there is one numerical value that is an attribute value candidate and there are a plurality of attribute candidates in the new text 114.
  • in this case, an appropriate attribute can be determined as follows using the attribute / attribute value relationship model.
  • specifically, the likelihood of each attribute is obtained by calculating the conditional probability of each attribute given the numerical value of the attribute value candidate in the text.
  • the attribute with the highest likelihood may then be selected as the attribute for the attribute value candidate.
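  • This likelihood-based attribute selection can be sketched as follows, assuming for illustration that each attribute's value distribution has been learned as a normal distribution (the distribution parameters and priors are invented):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def best_attribute(value, models):
    """models: {attribute: (prior, mu, sigma)} learned from correct-answer data.

    Returns the attribute with the highest posterior probability for `value`,
    together with the full posterior distribution.
    """
    scores = {a: prior * normal_pdf(value, mu, sigma)
              for a, (prior, mu, sigma) in models.items()}
    z = sum(scores.values())
    posteriors = {a: s / z for a, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Invented models: ALT values cluster around 20 U/L, AGE values around 50 years.
models = {"ALT": (0.5, 20.0, 10.0), "AGE": (0.5, 50.0, 15.0)}
attr, post = best_attribute(6, models)
assert attr == "ALT"
```

The value 6 is far more plausible under the ALT distribution than under the AGE distribution, so ALT is selected.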
  • supervised learning that uses grammatical dependency and the distance between the attribute and the attribute value (number of characters, number of words, distance on the syntax tree, etc.) as features can also be performed.
  • when there are a plurality of attribute value candidates, an appropriate combination is determined by the following two methods and combinations thereof.
  • the first method is based on the inter-attribute relationship model 1172.
  • the combination search can be made efficient based on information on whether the relationship between attributes is strong or weak.
  • a method that applies the Markov tagger used for part-of-speech tagging of sentences will be described.
  • FIG. 11 shows a processing flow of the attribute / attribute value pair determination process using the Markov tagger.
  • a series of attribute value candidates in order of appearance is acquired from the attribute value candidates extracted from the new text 114 (S1501). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
  • the conditional probability of the attribute is calculated for each attribute value candidate in the series, and an attribute with a high probability is acquired (S1502).
  • a list such as “ALT”, “AST”, and “AGE” is obtained for the attribute value candidate “6”.
  • the optimum attribute transition sequence is determined (S1504).
  • the determination of the optimum sequence can be executed by an algorithm called the Viterbi algorithm, and thus detailed description thereof is omitted.
  • the result that the attribute series [ALT, AST] is most likely to be the attribute value candidate series [6, 20] is obtained.
  • as a result, two attribute / attribute value pairs, <ALT, 6> and <AST, 20>, are obtained.
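  • A minimal sketch of this Viterbi-based tagging; the emission scores (standing in for the attribute / attribute value relationship model) and transition probabilities (standing in for the inter-attribute relationship model) are invented for the running example:

```python
import math

def viterbi(values, attrs, emission, transition, initial):
    """Most likely attribute sequence for a sequence of attribute-value candidates.

    emission[(attr, value)]  ~ validity from the attribute/attribute-value model
    transition[(prev, attr)] ~ conditional probability from the inter-attribute model
    Missing entries fall back to a small floor probability.
    """
    def lp(table, key):
        return math.log(table.get(key, 1e-9))

    best = {a: lp(initial, a) + lp(emission, (a, values[0])) for a in attrs}
    backpointers = []
    for v in values[1:]:
        nxt, ptr = {}, {}
        for a in attrs:
            prev = max(attrs, key=lambda p: best[p] + lp(transition, (p, a)))
            nxt[a] = best[prev] + lp(transition, (prev, a)) + lp(emission, (a, v))
            ptr[a] = prev
        best, backpointers = nxt, backpointers + [ptr]
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Invented scores for the running example [6, 20] with candidates ALT / AST / AGE.
emission = {("ALT", 6): 0.5, ("AST", 6): 0.2, ("AGE", 6): 0.3,
            ("ALT", 20): 0.2, ("AST", 20): 0.5, ("AGE", 20): 0.3}
transition = {("ALT", "AST"): 0.8, ("ALT", "AGE"): 0.2}
initial = {"ALT": 0.6, "AST": 0.2, "AGE": 0.2}
assert viterbi([6, 20], ["ALT", "AST", "AGE"], emission, transition, initial) == ["ALT", "AST"]
```

The transition term rewards the attribute order <ALT, AST> learned from the correct-answer data, which is how the inter-attribute relationship model steers the result.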
  • FIG. 10 is a conceptual explanatory diagram when performing numerical attribute extraction on the new text 114.
  • “ALT” and “AST” are extracted as attribute candidates, and “6” and “20” are extracted as attribute value candidates.
  • FIG. 10 shows a conceptual diagram of the Markov tagger that assigns an attribute sequence to this attribute value sequence.
  • there is a sequence of attribute value candidates extracted from the text, and for each attribute value, attribute candidates with high validity are acquired using the attribute / attribute value relationship model 1171.
  • the attributes “ALT”, “AST”, and “AGE” are acquired in the descending order of validity for the attribute value “6”.
  • if the attribute with the highest validity were simply selected for each attribute value, “ALT” would be chosen, and an incorrect result would be obtained.
  • processing is performed as follows using the inter-attribute relationship model 1172.
  • in the first method, the attribute candidates in the text are not explicitly used.
  • instead, priority is given to the order of appearance in past data such as the correct text 113.
  • attribute / attribute value pair determination processing by DP matching will be described.
  • FIG. 12 shows a processing flow of attribute / attribute value pair determination processing by DP matching.
  • a series of attribute value candidates in the order of appearance is acquired from the attribute value candidates extracted from the new text 114 (S1511). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
  • a series of attribute candidates in order of appearance is acquired from the attribute candidates extracted from the new text 114 (S1512).
  • a sequence such as [ALT, AST] is obtained.
  • the strength of the relationship between all attribute and attribute value pairs is acquired from the attribute / attribute value relationship model, and a matrix composed of the acquired numerical values is generated (S1513). Specifically, the score for the pair of the I-th attribute and the J-th attribute value is stored in the <I, J> element of the matrix.
  • the optimum attribute transition sequence is determined (S1514). Since the sequence can be determined by DP matching, a known technique, a detailed description is omitted.
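  • A minimal sketch of such a DP-matching alignment; it assumes an order-preserving alignment with skips allowed, and the matrix scores are invented:

```python
def dp_match(attrs, values, score):
    """Order-preserving alignment of an attribute sequence with an attribute-value
    sequence that maximizes the total relationship score (skips are allowed).

    score[(attr, value)]: strength taken from the attribute/attribute-value model.
    """
    n, m = len(attrs), len(values)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = best[i - 1][j - 1] + score.get((attrs[i - 1], values[j - 1]), 0.0)
            best[i][j] = max(pair, best[i - 1][j], best[i][j - 1])
    # backtrack to recover the chosen attribute/attribute-value pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = score.get((attrs[i - 1], values[j - 1]), 0.0)
        if s > 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((attrs[i - 1], values[j - 1]))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Invented matrix scores for the running example: attributes [ALT, AST], values [6, 20].
score = {("ALT", 6): 0.9, ("AST", 6): 0.4, ("ALT", 20): 0.3, ("AST", 20): 0.8}
assert dp_match(["ALT", "AST"], [6, 20], score) == [("ALT", 6), ("AST", 20)]
```

Cells with small matrix values produce low-scoring alignments, which matches the low-reliability portions mentioned in the combined method below.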
  • the attribute / attribute value pair determination process described in FIGS. 11 and 12 can be used in combination.
  • for example, the method described above with reference to FIG. 12, which uses the attribute candidates in the new text as a constraint, is executed, and portions of the correspondence obtained by DP matching where the reliability is low, for example, where the value set in the matrix is small, are identified.
  • alternatively, the attribute candidate information in the new text can be used not as a hard constraint but to temporarily modify the inter-attribute relationship model 1172.
  • specifically, a sequence of attribute candidates is created, and the values of the inter-attribute relationship model 1172 are increased by a certain ratio for every pair of attributes on this sequence.
  • an attribute pair that does not exist in the inter-attribute relationship model 1172 is newly added, with a predetermined value set as its probability.
  • "BP" denotes blood pressure; 141 is the systolic (highest) blood pressure and 70 is the diastolic (lowest) blood pressure.
  • As described above, attribute/attribute-value pairs for numerical attributes can be extracted at high speed and with high accuracy.
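The determination flow above (S1511–S1514) can be sketched as a small dynamic-programming alignment. This is a minimal illustration, not the patent's implementation: the function name, the monotonicity rule, and the toy score matrix are assumptions.

```python
def align(scores):
    """Monotone DP alignment of attribute-value candidates (columns) to
    attribute candidates (rows), maximizing the summed relation scores.
    scores[i][j] = strength of the relation between the i-th attribute
    candidate and the j-th attribute-value candidate (the <I, J> matrix).
    Returns, for each value j, the index of the attribute it is assigned to."""
    I, J = len(scores), len(scores[0])
    NEG = float("-inf")
    # best[i][j]: best total score with value j assigned to attribute i,
    # where assignments never move backwards through the attribute sequence
    best = [[NEG] * J for _ in range(I)]
    back = [[0] * J for _ in range(I)]
    for i in range(I):
        best[i][0] = scores[i][0]
    for j in range(1, J):
        for i in range(I):
            for prev in range(i + 1):  # previous value used attribute <= i
                if best[prev][j - 1] > NEG:
                    cand = best[prev][j - 1] + scores[i][j]
                    if cand > best[i][j]:
                        best[i][j] = cand
                        back[i][j] = prev
    # trace back from the best final cell
    i = max(range(I), key=lambda r: best[r][J - 1])
    assign = [i]
    for j in range(J - 1, 0, -1):
        i = back[i][j]
        assign.append(i)
    return assign[::-1]
```

For the [ALT, AST] / [6, 20] example, a matrix whose diagonal scores dominate yields the assignment 6→ALT, 20→AST.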

Abstract

The purpose of the present invention is to accurately extract, from text, an attribute-value pair for a numerical attribute, which takes a numerical value. This natural language processing device extracts attribute-value pairs from correct text and calculates the validity of association between each attribute and an unknown numerical value on the basis of the distribution of values of the attribute. The calculation of the validity for each attribute is accomplished by determining a distribution most similar to the distribution of values of the attribute in the correct text and then using this determined distribution. When a plurality of attribute values are subjected to this attribute-value validity calculation, relationships between attributes, as well as relationships between attributes and values, are learned from the correct text to determine appropriate pairs.

Description

Natural language processing apparatus and natural language processing method
 The present invention relates to syntax analysis techniques for natural language processing, and in particular to a technique for extracting pairs of an attribute and an attribute value for numerical attributes from text.
 The amount of digitized documents accessible to users is increasing with the spread of personal computers and the Internet. Electronic documents are unstructured data and are difficult to handle with a computer. Expectations for natural language processing are therefore growing as a way to make large volumes of digitized documents effectively usable by structuring them. One technique for structuring digitized documents is attribute/attribute-value extraction, which extracts from text pairs of an attribute such as "sex" and an attribute value such as "female". It is used, for example, to extract from various information sources the attributes that describe the specifications of a product together with their attribute values, and to display them in integrated form. Among such attributes, those whose attribute values are numerical are called numerical attributes; in the case of product specifications, weight and size are numerical attributes. Numerical attributes are quantitative, objective information and are particularly valuable among attribute/attribute-value information. In view of this importance, known techniques exist for automatically extracting attribute/attribute-value pairs that involve numerical attributes. These known techniques are described below.
 The most standard technique for extracting attribute/attribute-value pairs for numerical attributes uses syntactic information as a clue (Non-Patent Document 1). Non-Patent Document 1 discloses, for a sentence such as "The weight is 10 kg.", first extracting the numeric character string "10 kg" as an attribute-value candidate and then recognizing from the parsing result that "weight" is the attribute. However, depending on the document type, some documents offer almost no syntactic clues. Moreover, at the current state of the art, parsing techniques for obtaining syntactic information cannot always produce a correct analysis. A semantic method that uses the strength of the relationship between attributes and attribute values as a clue is therefore required.
 As known techniques that extract attribute/attribute-value pairs using information other than syntactic information, there are techniques that use "proximity" as a clue and techniques that use the "correlation" between an attribute and an attribute value as a clue (Non-Patent Document 2). In Non-Patent Document 2, the first step extracts words and phrases that are candidates for attributes or attribute values from the text. The second step then determines the correspondence between attribute-candidate and attribute-value-candidate pairs. In particular, for the second step, a technique that combines "syntactic information", "proximity", and "correlation" as appropriate is disclosed. "Proximity" is an approximation of syntactic information, while "correlation" is the semantic information.
US Pat. No. 7,925,652
 As described above, even when syntactic information is unavailable as a clue, the technique disclosed in Non-Patent Document 2 can associate attributes with attribute values by using correlation, which indicates the strength of the semantic relationship between an attribute and an attribute value. However, the prior art has the following problems.
 Attribute values for non-numerical attributes are finite, and the number of distinct values is not very large. The strength of the relationship between an attribute and its attribute values can therefore be enumerated over all attribute values. For example, consider "manufacturer" as an attribute of a product. Attribute values such as "Company A", "Company B", and "Company C" exist, and the strength of the relationship can be considered for each attribute/attribute-value pair, such as <manufacturer, Company A>, <manufacturer, Company B>, and <manufacturer, Company C>. Specifically, a statistic such as pointwise (self-) mutual information may be computed from the frequency of occurrence of each pair. By learning such relationship strengths from correct answer data or the like, attribute/attribute-value pairs can be extracted from text.
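For the non-numerical case described here, the pair statistic can be computed directly from co-occurrence counts. A minimal sketch of pointwise mutual information over observed attribute/attribute-value pairs (the sample data are illustrative, not from the patent):

```python
import math
from collections import Counter

def pmi(pairs):
    """Pointwise (self-) mutual information for each observed
    attribute/attribute-value pair, from co-occurrence frequencies."""
    pair_c = Counter(pairs)                # count of each <attribute, value> pair
    attr_c = Counter(a for a, _ in pairs)  # marginal count of each attribute
    val_c = Counter(v for _, v in pairs)   # marginal count of each value
    n = len(pairs)
    return {
        (a, v): math.log2((c / n) / ((attr_c[a] / n) * (val_c[v] / n)))
        for (a, v), c in pair_c.items()
    }

scores = pmi([("manufacturer", "Company A")] * 3
             + [("manufacturer", "Company B")]
             + [("color", "red")] * 2)
```

Pairs that co-occur more often than their marginals predict receive positive scores, which is exactly the kind of relationship strength the text describes learning from correct answer data.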
 For numerical attributes, on the other hand, the attribute values are in principle continuous, so infinitely many values exist. In practice, for an attribute such as blood pressure, attribute values are expressed as integers and are therefore discrete, but the number of distinct attribute values is still very large. It therefore cannot be expected that every attribute value appears in the data used for learning, and it is difficult to learn the strength of the relationship between an attribute and its attribute values in advance by the method used for non-numerical attributes.
 An object of the present invention is to provide means for computing the strength of the relationship between an attribute and an attribute value for numerical attributes, and thereby to extract numerical-attribute/attribute-value pairs accurately and at high speed.
 A representative example of the present invention is a natural language processing apparatus that analyzes the document structure of input text data, comprising: an extraction unit that extracts candidate pairs of a numerical attribute and an attribute value from the text data; a calculation unit that calculates, based on a model representing the distribution of attribute values for a given numerical attribute, a value indicating the validity of each extracted candidate pair; and a determination unit that determines numerical-attribute/attribute-value pairs from among the extracted candidates based on the calculated validity values.
 According to one aspect of the present invention, attribute/attribute-value pairs for numerical attributes appearing in text can be extracted with high accuracy and at high speed. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.
FIG. 1 shows the configuration of the numerical attribute extraction apparatus.
FIG. 2 outlines the numerical attribute extraction processing executed by the numerical attribute extraction program.
FIG. 3 shows a flowchart of the numerical attribute extraction processing by the numerical attribute extraction program.
FIG. 4 shows a configuration example of the correct text.
FIG. 5 shows a configuration example of the attribute/attribute-value pair list.
FIG. 6 shows a flowchart of the attribute/attribute-value relationship model learning process.
FIG. 7 shows a configuration example of the attribute/attribute-value relationship model.
FIG. 8 shows a flowchart of the inter-attribute relationship model learning process.
FIG. 9 shows a configuration example of the inter-attribute relationship model.
FIG. 10 is an explanatory diagram of the attribute/attribute-value pair determination process.
FIG. 11 shows a flowchart of the attribute/attribute-value pair determination process.
FIG. 12 shows a flowchart of the attribute/attribute-value pair determination process.
 Embodiments of the present invention are described below with reference to the accompanying drawings. It should be noted that the embodiments are merely examples for realizing the present invention and do not limit its technical scope. In the figures, common components are given the same reference numerals.
 The following discloses a method of extracting attribute/attribute-value pairs for numerical attributes from text. More specifically, the distribution of attribute values for each attribute is learned as a model, and the attribute for a given numerical value is determined based on the model. One object of the present disclosure is to extract attribute/attribute-value pairs with high accuracy by learning the strength of the relationship between numerical attributes and their attribute values from correct data.
 In one example of the disclosed technique, the distribution of attribute values for each attribute is learned from correct data as a model; when a numerical value is given, the strength of the relationship between the given value and each attribute is computed as a posterior probability, and attributes are extracted based on the computed posterior probabilities.
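The posterior computation described here can be sketched as follows. This is a minimal illustration under assumed per-attribute normal distributions with made-up parameters (the attribute names and numbers are assumptions; in the patent, the distributions are the learned models):

```python
import math

def posterior(value, models, priors):
    """P(attribute | value) ∝ p(value | attribute) * P(attribute).
    `models` maps each attribute to the parameters of its learned value
    distribution (here, normal (mean, std) as a stand-in)."""
    def normal_pdf(x, mu, sigma):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    # unnormalized posterior for each attribute
    score = {a: normal_pdf(value, mu, sd) * priors[a] for a, (mu, sd) in models.items()}
    z = sum(score.values())  # normalizing constant
    return {a: s / z for a, s in score.items()}

# Illustrative (made-up) value distributions for two blood-pressure attributes
models = {"systolic": (130.0, 15.0), "diastolic": (75.0, 10.0)}
post = posterior(141, models, {"systolic": 0.5, "diastolic": 0.5})
```

Given the value 141, the posterior mass concentrates on the attribute whose learned distribution makes 141 likely.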
 In addition, the accuracy of attribute/attribute-value pair extraction can be improved by learning the strength of the relationships between attributes from correct data as a model and taking the order of appearance of attributes into account.
 The configuration of the present disclosure is described concretely below. FIG. 1 shows the configuration of the numerical attribute extraction apparatus 100 of the present embodiment. The numerical attribute extraction apparatus 100 includes a CPU 101, a main memory 102, an input/output device 103, and a disk device 110.
 The CPU 101 is a processor that executes various processes by executing programs stored in the main memory 102. Specifically, the CPU 101 loads a program stored in the disk device 110 into the main memory 102 and executes it. A program may also be loaded into the main memory 102 from an external server via a network.
 The main memory 102 stores the programs executed by the CPU 101 and the data required by the CPU 101. The input/output device 103 receives input of information from the user and outputs information in response to instructions from the CPU 101. For example, the input/output device 103 comprises input devices such as a keyboard and a mouse and an output device such as a display.
 The disk device 110 is auxiliary memory including a computer-readable non-transitory storage medium. The disk device 110 stores various programs and various data: specifically, an OS 111, a numerical attribute extraction program 112, correct text 113, new text 114, an attribute/attribute-value pair list 115, teacher data 116, and a numerical attribute extraction model 117.
 The numerical attribute extraction model 117 includes an attribute/attribute-value relationship model 1171, used to determine whether a pair of an attribute candidate and an attribute-value candidate is truly an attribute/attribute-value pair, and an inter-attribute relationship model 1172, which indicates whether two attributes tend to appear together.
 The numerical attribute extraction program 112 determines whether pairs of words and phrases contained in the input text 114 are attribute/attribute-value pairs, and extracts the pairs determined to be attribute/attribute-value pairs of numerical attributes. The numerical attribute extraction program 112 includes a correct pair extraction subprogram 1121, a teacher data creation subprogram 1122, a model learning subprogram 1123, an attribute/attribute-value candidate extraction subprogram 1124, and an attribute/attribute-value pair determination subprogram 1125. The processing of these subprograms is described in detail with reference to FIG. 2.
 The CPU 101 realizes predetermined functions by executing the above programs; a program performs its defined processing by being executed by the processor. Accordingly, in the present disclosure, descriptions with a program as the subject may also be read with the CPU 101 or the numerical attribute extraction apparatus 100 as the subject.
 By operating according to a program, the CPU 101 operates as a functional unit (means) that realizes a predetermined function. For example, by operating according to the numerical attribute extraction program 112, the CPU 101 functions as a numerical attribute extraction unit (numerical attribute extraction means). The numerical attribute extraction apparatus 100 is an apparatus that includes these functional units (means).
 The correct text 113 is data input to the numerical attribute extraction program 112 and is used to learn the model for extracting attribute/attribute-value pairs of numerical attributes. The correct text 113 is text annotated with the information needed to identify the appearance positions of numerical attributes and their corresponding attribute values, and may have any format. For example, the correct text 113 may be constructed by inserting tags into the text that mark attributes or attribute values, or by preparing, separately from the text, a table indicating the start and end positions of each attribute or attribute value.
 The new text 114 is the text from which numerical-attribute/attribute-value pairs are to be extracted. Usually, the target is new text different from the correct text 113. After attribute/attribute-value pair extraction has been executed on new text, the text can also be registered as correct text 113.
 The attribute/attribute-value pair list 115 is a list, in order of appearance position in the text, of pairs of a character string indicating an attribute and a character string indicating its attribute value, extracted from the correct text 113. The positional relationship between pairs from different texts is arbitrary. As additional information, the appearance position of each character string, a character string indicating the unit of the attribute value, and the like are stored. The teacher data 116 has the same format as the attribute/attribute-value pair list, except that it stores the result of normalizing attributes and attribute-value numbers. Attribute normalization means unifying variant notations such as "ANION GAP" and "Anion Gap". Attribute-value normalization means aligning numeric character strings that look different but denote the same quantity, such as "300mg" and "0.3g", to a single unit by using the unit as a clue.
 The numerical attribute extraction model 117 is data used by the attribute/attribute-value pair determination subprogram 1125 and consists of the attribute/attribute-value relationship model 1171 and the inter-attribute relationship model 1172. The attribute/attribute-value relationship model 1171 provides the criterion for determining whether two words or phrases form a corresponding attribute and attribute-value pair; specifically, it represents the distribution of attribute values for each attribute. The inter-attribute relationship model 1172 indicates the strength of the relationship between two attributes, that is, which other attributes are likely to appear when a given attribute appears; specifically, it represents the conditional probability of a subsequently appearing attribute given the preceding attribute.
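The conditional probabilities held by the inter-attribute relationship model 1172 can be estimated from attribute bigrams observed in the correct data. A minimal sketch (the function name and the sample attribute sequences are illustrative assumptions, not the patent's exact data layout):

```python
from collections import Counter

def learn_transition_model(sequences):
    """Estimate P(next attribute | preceding attribute) from the attribute
    sequences observed in the correct data (attribute bigram frequencies)."""
    bigram = Counter()   # counts of (preceding, following) attribute pairs
    prev_c = Counter()   # counts of each attribute in the preceding position
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            bigram[(prev, nxt)] += 1
            prev_c[prev] += 1
    return {(p, n): c / prev_c[p] for (p, n), c in bigram.items()}

# Illustrative attribute sequences, one per document
model = learn_transition_model([["ALT", "AST", "BP"], ["ALT", "BP"]])
```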
 FIG. 2 outlines the numerical attribute extraction processing executed by the numerical attribute extraction apparatus 100. The correct pair extraction subprogram 1121 acquires attribute/attribute-value pairs from the correct text 113 and generates the attribute/attribute-value pair list 115.
 The teacher data creation subprogram 1122 refers to the attribute/attribute-value pair list 115, identifies the attributes and attribute values to be normalized, and normalizes them.
 The model learning subprogram 1123 reads the teacher data 116 and learns the numerical attribute extraction model 117. The attribute/attribute-value pair candidate extraction subprogram 1124 reads the new text 114 and extracts character strings that are candidates for attributes and attribute values. The attribute/attribute-value pair determination subprogram 1125 receives these candidate character strings from the attribute/attribute-value pair candidate extraction subprogram 1124 and uses the numerical attribute extraction model 117 to determine whether each pair of an attribute candidate and an attribute-value candidate is an attribute/attribute-value pair. Pairs of words and phrases determined to be attribute/attribute-value pairs are stored in the correct text 113, with a manual check if necessary.
 FIG. 3 shows a flowchart of the numerical attribute extraction processing by the numerical attribute extraction program 112. The correct pair extraction subprogram 1121 acquires attribute and attribute-value pairs of numerical attributes from the correct text 113 as correct answers and outputs them as the attribute/attribute-value pair list (S11). FIG. 4 shows an example of correct text. In the example of FIG. 4, the correct text consists of the text plus a list of the appearance positions of the attribute/attribute-value pairs in the text. Although the correct text is assumed here to be created manually, a text may instead be used in which attribute/attribute-value pairs are extracted from text by conventional techniques using parsing results or pattern matching, and only the pairs highly likely to be correct are taken as correct answers. Because the method of this embodiment determines attribute/attribute-value pairs using the distribution of attribute values of numerical attributes, which is independent of the information used in the prior art, such substitute text causes few problems. FIG. 5 shows an example of the attribute/attribute-value pair list 115, which is data in which an attribute, its attribute value, and additional information extracted from the correct text are stored in each row. In the example of FIG. 5, the first row indicates that, in the sentence with sentence ID 1 of the document with document ID 1, there is an attribute/attribute-value pair whose numerical attribute is "ALT" (alanine transaminase) and whose attribute value is the number 7, with "U/L" used as the unit of the attribute value.
 The teacher data creation subprogram 1122 generates the teacher data 116 from the acquired attribute/attribute-value pair list 115 (S12). In step S12, attributes and attribute values are normalized. Attribute normalization is performed using synonym extraction and variant-notation extraction techniques known in the prior art; in the example of FIG. 5, "ALT" and "AlT(SGPT)" are normalized to "ALT". For attribute values, the numeric character string extracted from the text is first converted into a number. The unit is then consulted, and if the units are inconsistent, they are unified into a standard unit and the numbers are converted. For example, if an attribute value such as "200mg" and an attribute value such as "0.3g" both occur, "mg" and "g" are unified into, say, "mg", and "0.3" is converted to "300".
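The attribute-value normalization in S12 can be sketched as below. The regular expression and the conversion table are simplified assumptions covering only the mg/g example from the text:

```python
import re

# Hypothetical conversion table: factor from each unit to the chosen
# standard unit "mg" (only mg/g are covered in this sketch)
TO_MG = {"mg": 1.0, "g": 1000.0}

def normalize_value(text):
    """Split a numeric string such as "0.3g" into number and unit, and
    convert the number to the standard unit. Returns (number, unit),
    or None when the string cannot be interpreted."""
    m = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z/]+)", text)
    if m is None or m.group(2) not in TO_MG:
        return None
    return float(m.group(1)) * TO_MG[m.group(2)], "mg"
```

With this table, "0.3g" and "200mg" both come out in mg, so their values become directly comparable.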
 The model learning subprogram 1123 learns the numerical attribute extraction model 117 from the teacher data 116 (S13). The numerical attribute extraction model 117 consists of the attribute/attribute-value relationship model 1171 and the inter-attribute relationship model 1172. Details of step S13 are described later with reference to FIG. 6.
 The attribute/attribute-value candidate extraction subprogram 1124 extracts attribute and attribute-value candidates from the new text 114, the target of attribute/attribute-value pair extraction (S14). Since information extraction techniques known in the prior art can be used to extract attribute and attribute-value candidates, their description is omitted. In this step, attribute and attribute-value candidates are extracted as entities.
 The attribute/attribute-value pair determination subprogram 1125 determines, for the pairs of attribute and attribute-value candidates extracted in step S14, whether each is a truly corresponding attribute/attribute-value pair, and extracts the attribute/attribute-value pairs (S15). Details of step S15 are described later with reference to FIGS. 10 to 12.
 以下では、S13の数値属性抽出モデルの学習処理について説明する。本実施例では、数値属性抽出モデルとして、以下の2種類のモデルを用いる。一つは、属性と属性値の関係性の強さを示す属性/属性値関係性モデル1171である。属性/属性値関係性モデル1171は、ある値がある属性の値であるかどうかの妥当性を示すモデルである。もう一つは、属性間の関係性の強さを表す属性間関係性モデル1172である。属性間関係性モデル1172は、ある属性が出現しているときに出現し易い他の属性を示すモデルである。 Hereinafter, the learning process of the numerical attribute extraction model in S13 will be described. In this embodiment, the following two types of models are used as the numerical attribute extraction model. One is an attribute / attribute value relationship model 1171 indicating the strength of the relationship between attributes and attribute values. The attribute / attribute value relationship model 1171 is a model indicating validity of whether a certain value is a value of an attribute. The other is an inter-attribute relationship model 1172 representing the strength of the relationship between attributes. The inter-attribute relationship model 1172 is a model indicating other attributes that are likely to appear when a certain attribute appears.
 図6に属性/属性値関係性モデル1171の学習処理の処理フローを示す。 FIG. 6 shows a processing flow of learning processing of the attribute / attribute value relationship model 1171.
 With reference to the teacher data 116, it is checked whether all attributes have been processed; if so, the entire processing ends. If an unprocessed attribute remains, the process proceeds to S1302 (S1301).
 One of the unprocessed attributes is selected, and all attribute values for the selected attribute are acquired (S1302).
 It is checked whether all of the at least one predetermined type of model have been processed; if so, the process proceeds to S1305. If an unprocessed model remains, the process proceeds to S1304 (S1303). Here, a model refers to a type of probability distribution, such as the normal distribution. Since the distribution of attribute values differs in character from attribute to attribute, it is preferable to prepare several types of models in advance and select the most appropriate one.
 For an unprocessed model, the most appropriate parameters are calculated from all the attribute values acquired in S1302 (S1304). In this step, the parameters of each model are determined by maximum likelihood estimation over the attribute values acquired in S1302. In this embodiment, a normal distribution, a log-normal distribution, and a rectangular distribution are used as models. The normal and log-normal distributions each have two parameters, and the rectangular distribution has three. For the normal and log-normal distributions, the distribution can be determined directly from statistics such as the mean and the standard deviation; for the rectangular distribution, appropriate parameters are found by search while varying the parameters.
 For each attribute, the most appropriate model is selected from the preset models based on the parameters calculated in S1304 and is stored in the attribute/attribute-value relationship model 1171 (S1305). FIG. 7 shows an example of the attribute/attribute-value relationship model 1171. Each row of the model holds, for an attribute 11711, the most appropriate model type 11712 and its parameter group 11713. Various methods can be used to judge the appropriateness of a model; in this embodiment, AIC is used as an example. For ALT in the illustrated example, the log-normal distribution is selected as the most appropriate model from each model's maximum likelihood value and number of parameters.
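The model fitting and AIC-based selection of S1304 and S1305 can be sketched as follows. This is a minimal illustration, not the embodiment's code: only the normal and log-normal models are fitted (both by closed-form maximum likelihood), the rectangular distribution and its parameter search are omitted, and all function names are hypothetical.

```python
import math

def fit_normal(xs):
    # Closed-form maximum-likelihood estimates: mean and (biased) std deviation.
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    ll = sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
             - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)
    return {"mu": mu, "sigma": sigma}, ll

def fit_lognormal(xs):
    # MLE on the log-transformed values; valid only for positive data.
    logs = [math.log(x) for x in xs]
    params, ll_log = fit_normal(logs)
    # log f(x) = log f_normal(log x) - log x  (Jacobian of the transform)
    ll = ll_log - sum(logs)
    return params, ll

def select_model(xs):
    # AIC = 2k - 2 ln L; the model with the smallest AIC wins.
    candidates = []
    for name, fitter, k in [("normal", fit_normal, 2),
                            ("lognormal", fit_lognormal, 2)]:
        params, ll = fitter(xs)
        candidates.append((2 * k - 2 * ll, name, params))
    aic, name, params = min(candidates)
    return name, params, aic
```

For strongly right-skewed samples, such as many laboratory values, the log-normal model attains the smaller AIC and is selected.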
 Although the above description assumed continuous probability distributions and described a parametric estimation method that selects among them by an information criterion such as AIC, a nonparametric estimation method that assumes no particular distribution is also conceivable.
 One example of a nonparametric estimation method uses the K-nearest-neighbor method, a technique that selects the K cases closest to a given case and performs classification and the like based on the classes of the selected cases. In this embodiment, for a given value, the teacher data 116 is consulted and the K rows whose values are closest to it are selected. The attribute column 1154 of the selected rows is acquired and tallied; if a certain attribute occurs k times among the selected rows, k/K is taken as the confidence of that value for that attribute. With this method, no attribute/attribute-value relationship model is built explicitly; the necessary figures are obtained by performing the above calculation within the attribute/attribute-value pair determination of S15. A nonparametric estimation method other than the K-nearest-neighbor method, such as a kernel method, may also be used.
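The K-nearest-neighbor confidence k/K described above might be sketched as below, under the assumption that the teacher data 116 can be reduced to rows of (numeric value, attribute); the names and the value of K are illustrative.

```python
def knn_attribute_confidence(value, labeled_values, K=5):
    """labeled_values: list of (numeric_value, attribute) rows from the
    teacher data.  Returns {attribute: k/K} tallied over the K rows whose
    values are closest to the given value."""
    nearest = sorted(labeled_values, key=lambda row: abs(row[0] - value))[:K]
    counts = {}
    for _, attr in nearest:
        counts[attr] = counts.get(attr, 0) + 1
    return {attr: count / K for attr, count in counts.items()}
```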
 As another example, there are cases in which an attribute value is numeric but is best treated as categorical data. For instance, versions of a drug with different amounts of active ingredient may exist, such as "drug name 250 mg" and "drug name 100 mg". In such a case, "drug name" is the attribute of "250 mg", but since only a few distinct attribute values exist, it is desirable to handle them in the same way as non-numeric attribute values. Each individual value, such as "250 mg" or "100 mg", then occurs many times; this can be detected by treating the data as categorical when the total number of observed attribute values divided by the number of distinct attribute values exceeds a predetermined threshold. For categorical data, the appearance probability of each attribute value can be obtained by conventional maximum likelihood estimation.
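The categorical-versus-continuous test described above (total occurrences divided by distinct values, compared against a threshold) is a one-liner; the threshold below is an arbitrary placeholder.

```python
def looks_categorical(values, threshold=10.0):
    """Heuristic from the text: an attribute whose values repeat heavily
    (total occurrences / distinct values above a threshold) is treated as
    categorical rather than continuous."""
    return len(values) / len(set(values)) > threshold
```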
 FIG. 8 shows the processing flow of the learning of the inter-attribute relationship model.
 Attribute/attribute-value pairs are acquired from the attribute/attribute-value pair list 115 (S1311).
 Attribute 2-grams are extracted according to the order in which the attribute values appear in the text (S1312). An attribute 2-gram is an ordered pair such as <ALT, AST>, obtained when an attribute value whose attribute is "AST" appears immediately after an attribute value whose attribute is "ALT". A high frequency for this 2-gram indicates that "AST" tends to appear after "ALT".
 The frequencies of the extracted attribute 2-grams are tallied (S1313).
 The conditional probabilities concerning the order of appearance of attributes, that is, the conditional probability of the attribute of the value that appears next given that a value of a certain attribute has appeared, are calculated and stored in the inter-attribute relationship model 1172. An example of the inter-attribute relationship model 1172 is shown in FIG. 9. For example, all 2-grams whose first element is ALT are collected and their frequencies summed; the frequency of <ALT, AST> divided by this sum is the conditional probability that "AST" appears given that "ALT" has appeared. The larger this value, the more likely "AST" is to appear after "ALT".
 In the above example, the order of appearance was represented with 2-grams, that is, a simple (first-order) Markov process was assumed; a second-order or higher Markov process may also be used. Furthermore, instead of expressing the relationship between attributes as an order relation, a model can be used in which, for example, the probability is raised whenever two attributes appear within a fixed distance of each other.
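Steps S1312 and S1313 and the conditional probabilities stored in the inter-attribute relationship model 1172 can be sketched as follows; the helper is hypothetical, assuming each document yields one ordered list of attributes.

```python
from collections import Counter

def bigram_conditional_probs(attribute_sequences):
    """attribute_sequences: one list of attributes per document, in order
    of appearance of their values.  Returns {(a, b): P(next=b | current=a)}."""
    bigrams = Counter()
    for seq in attribute_sequences:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
    # Sum of frequencies of all 2-grams sharing the same first element.
    totals = Counter()
    for (a, _), c in bigrams.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in bigrams.items()}
```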
 The processing of S15 is now described in detail. In S15, attribute/attribute-value pairs are extracted as follows. Consider first the simplest case, in which the new text 114 contains exactly one numeric value as an attribute-value candidate and several attribute candidates. In such a case, the appropriate attribute can be determined with the attribute/attribute-value relationship model as follows: applying each attribute's conditional probability to the numeric value of the candidate yields a likelihood for each attribute, and the attribute with the highest likelihood is selected as the attribute of the candidate value. Supervised learning can also be performed using not only the distribution of attribute values for each attribute but also the presence or absence of grammatical dependency and the distance between attribute and attribute value (the number of intervening characters or words, the distance in the syntax tree, and so on).
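For the simplest case described above, the decision reduces to an argmax over per-attribute likelihoods. A sketch, assuming purely for illustration that every attribute's value distribution is a normal with made-up parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def best_attribute(value, attribute_models):
    """attribute_models: {attribute: (mu, sigma)} -- each attribute is
    modelled here by a normal distribution only, for brevity.
    Returns the attribute whose distribution gives the value the
    highest likelihood."""
    return max(attribute_models,
               key=lambda a: normal_pdf(value, *attribute_models[a]))
```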
 In practice, however, multiple attribute candidates and multiple attribute-value candidates usually coexist. In such cases, the above method of selecting the highest-scoring attribute independently for each attribute-value candidate is prone to error, and an appropriate combination must be determined. Because exhaustive combination search requires a very large amount of computation, a method that determines an appropriate combination at a practical computational cost is needed. In this embodiment, an appropriate combination is determined by the following two methods and by their combination.
 The first method is based on the inter-attribute relationship model 1172. The combination search can be made efficient using information on whether the relationships between attributes were strong or weak in the correct-answer texts, that is, in past cases. The following describes a method that applies a Markov tagger of the kind used for part-of-speech tagging of sentences.
 FIG. 11 shows the processing flow of attribute/attribute-value pair determination by the Markov tagger.
 From the attribute-value candidates extracted from the new text 114, the sequence of candidates in order of appearance is acquired (S1501). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
 Using the attribute/attribute-value relationship model, the conditional probability of each attribute is calculated for every attribute-value candidate in the sequence, and the attributes with high probability are acquired (S1502). In the case of FIG. 10, a list such as "ALT", "AST", "AGE" is obtained for the attribute-value candidate "6".
 Using the inter-attribute relationship model, probabilities are acquired for the possible transitions between attributes (S1503). In the case of FIG. 10, transitions such as [ALT, ALT], [ALT, AST], [ALT, AGE], [AST, ALT], and [AST, AST] are possible, and the probability of each is obtained by consulting the inter-attribute relationship model.
 The optimal sequence of attribute transitions is determined (S1504). Since the optimal sequence can be computed with the Viterbi algorithm, a detailed description is omitted. Through this processing, the attribute sequence [ALT, AST] is found to be the most probable for the attribute-value candidate sequence [6, 20], and as a result the two attribute/attribute-value pairs <ALT, 6> and <AST, 20> are obtained.
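The S1501 to S1504 flow is standard Viterbi decoding, with the attribute/attribute-value relationship model supplying emission scores and the inter-attribute relationship model supplying transition scores. A minimal sketch; the smoothing floor for unseen attributes and transitions is an assumption:

```python
def viterbi(values, emission, transition, attributes):
    """emission[t][a]: score of attribute a for the t-th value;
    transition[(a, b)]: score of b following a.  Missing entries get a
    small floor so unseen transitions are penalised, not forbidden."""
    FLOOR = 1e-6
    # best[t][a] = (score of the best path ending in a at position t, backpointer)
    best = [{a: (emission[0].get(a, FLOOR), None) for a in attributes}]
    for t in range(1, len(values)):
        layer = {}
        for b in attributes:
            score, prev = max(
                (best[t - 1][a][0] * transition.get((a, b), FLOOR)
                 * emission[t].get(b, FLOOR), a)
                for a in attributes)
            layer[b] = (score, prev)
        best.append(layer)
    # Trace back from the highest-scoring final state.
    last = max(best[-1], key=lambda a: best[-1][a][0])
    path = [last]
    for t in range(len(values) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return list(reversed(path))
```

With emission scores resembling the FIG. 10 example, the decoder recovers [ALT, AST] even though "ALT" scores highest for the value "20" in isolation.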
 FIG. 10 is a conceptual illustration of numeric attribute extraction performed on the new text 114. In this example, "ALT" and "AST" have been extracted as attribute candidates and "6" and "20" as attribute values. FIG. 10 depicts a Markov tagger assigning a sequence of attributes to this sequence of attribute values. Given the sequence of attribute-value candidates extracted from the text, the likely attribute candidates for each value are obtained from the attribute/attribute-value relationship model 1171. In the example of FIG. 10, for the attribute value "6", the attributes "ALT", "AST", and "AGE" are obtained in descending order of plausibility. For the attribute value "20", however, the most plausible attribute is "ALT", so an erroneous result would be obtained.
 Therefore, the inter-attribute relationship model 1172 is used to perform the following processing.
 In the processing flow shown in FIG. 11, the attribute candidates in the text are not used explicitly; for the order between attributes, the orders that appeared frequently in past data such as the correct-answer texts 113 take priority. However, compared with determining the order of parts of speech in a sentence, for which Markov taggers are ordinarily used, there is far more freedom in how an order of attributes may be written. A method that makes more active use of the attribute-candidate information obtained from the new text 114 is therefore conceivable. Attribute/attribute-value pair determination by DP matching is described below.
 FIG. 12 shows the processing flow of attribute/attribute-value pair determination by DP matching.
 From the attribute-value candidates extracted from the new text 114, the sequence of candidates in order of appearance is acquired (S1511). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
 From the attribute candidates extracted from the new text 114, the sequence of candidates in order of appearance is acquired (S1512). In the case of FIG. 10, a sequence such as [ALT, AST] is obtained.
 The strength of the relationship of every attribute/attribute-value pair is acquired from the attribute/attribute-value relationship model, and a matrix of the acquired values is generated (S1513). Specifically, the score for the pair of the I-th attribute and the J-th attribute value is stored in element <I, J> of the matrix.
 The optimal sequence of attribute assignments is determined (S1514). Since the sequence can be determined by DP matching, the description is omitted.
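One way to realize S1511 to S1514 is an order-preserving alignment between the attribute sequence and the attribute-value sequence, computed by dynamic programming; unmatched items simply contribute nothing. The following sketch is an assumption about the intended DP-matching scheme, not the embodiment's exact procedure:

```python
def dp_match(attrs, values, score):
    """Align the attribute sequence to the value sequence, preserving order.
    score[(i, j)]: strength of pairing attrs[i] with values[j], taken from
    the attribute/attribute-value relationship model.  Unpaired items cost 0."""
    n, m = len(attrs), len(values)
    # best[i][j]: best total score using attrs[:i] and values[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j],          # skip attrs[i-1]
                             best[i][j - 1],          # skip values[j-1]
                             best[i - 1][j - 1] + score.get((i - 1, j - 1), 0.0))
    # Trace back to recover the chosen pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = score.get((i - 1, j - 1), 0.0)
        if s > 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((attrs[i - 1], values[j - 1]))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```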
 The attribute/attribute-value pair determination processes of FIGS. 11 and 12 can also be used in combination. For example, the method of FIG. 12, which uses the attribute candidates in the new text as constraints, is executed first, and the method of FIG. 11, which consults the order relations of attributes in past data, is applied only to the portions of low reliability, for example where the value set in the matrix is small or where the correspondence obtained by DP matching is not one-to-one.
 Alternatively, instead of using the attribute-candidate information in the new text as a hard constraint, it can be used to modify the inter-attribute relationship model 1172 temporarily. In this method, a sequence of attribute candidates is created, and for every pair of attributes on this sequence the corresponding value in the inter-attribute relationship model 1172 is increased by a fixed ratio; attribute pairs that did not exist in the model 1172 are newly added with a predetermined value as their probability. With this processing, transitions between attribute candidates present in the new text 114 are given priority, and attribute transitions not contained in the past data, namely the correct-answer texts 113, can also be handled appropriately.
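The temporary modification of the inter-attribute relationship model 1172 described here might look like the following; the boost factor and the default probability for newly added pairs are illustrative placeholders.

```python
def boost_transitions(transition, candidate_attrs, factor=2.0, default=0.01):
    """Raise by a fixed ratio the transition values between attributes that
    occur as candidates in the new text; pairs absent from the model are
    added with a predetermined default probability."""
    boosted = dict(transition)
    for a in candidate_attrs:
        for b in candidate_attrs:
            if (a, b) in boosted:
                boosted[(a, b)] *= factor
            else:
                boosted[(a, b)] = default
    return boosted
```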
 For simplicity, the above embodiments assumed that attribute/attribute-value pairs are one-to-one. However, cases such as <BP, 141/70> also exist: BP is blood pressure, 141 the systolic pressure, and 70 the diastolic pressure. Such cases can be handled by generating virtual labels such as BP_1 and BP_2 for BP and applying the processing from step 13 onward to the virtual labels.
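The virtual-label device for compound values such as <BP, 141/70> can be sketched as below; the split on "/" and the BP_1/BP_2 label format follow the example in the text, while the function name is hypothetical.

```python
def expand_compound(attribute, raw_value):
    """Split a compound value such as '141/70' for BP into virtual labels
    BP_1, BP_2 so that each part can be modelled independently."""
    parts = raw_value.split("/")
    if len(parts) == 1:
        return [(attribute, float(raw_value))]
    return [(f"{attribute}_{k}", float(p)) for k, p in enumerate(parts, start=1)]
```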
 With the above configuration and processing, attribute/attribute-value pairs whose attributes are numeric can be extracted quickly and with high accuracy.
100 numerical attribute extraction device
101 CPU
102 main memory
103 input/output device
110 disk device
111 OS
112 numerical attribute extraction program
1121 correct-pair extraction subprogram
1122 teacher data creation subprogram
1123 model learning subprogram
1124 attribute/attribute-value candidate extraction subprogram
1125 attribute/attribute-value pair determination subprogram
113 correct-answer text
114 new text
115 attribute/attribute-value pair list
116 teacher data
117 numerical attribute model
1171 attribute/attribute-value relationship model
1172 inter-attribute relationship model

Claims (10)

  1.  A natural language processing apparatus for analyzing the document structure of input text data, comprising:
     an extraction unit that extracts candidate pairs of a numerical attribute and an attribute value from the text data;
     a first calculation unit that calculates, based on a first model indicating the distribution of attribute values corresponding to a given numerical attribute, a value indicating the validity of each extracted candidate pair of a numerical attribute and an attribute value; and
     a determination unit that determines the pair of the numerical attribute and the attribute value from among the candidates based on the value indicating validity.
  2.  The natural language processing apparatus according to claim 1, wherein the first model is a model estimated by learning, as teacher data, text data containing information that identifies pairs of a numerical attribute and an attribute value.
  3.  The natural language processing apparatus according to claim 2, wherein the first model is a model estimated by a parametric estimation method or a nonparametric estimation method.
  4.  The natural language processing apparatus according to claim 3, further comprising:
     a second calculation unit that calculates, based on a second model, a value indicating the strength of the relationships between the numerical attributes,
     wherein the determination unit determines the pair of the numerical attribute and the attribute value based on the values calculated by the first and second calculation units.
  5.  The natural language processing apparatus according to claim 4, wherein the second model is a model calculated based on the frequency with which a plurality of numerical attributes contained in the teacher data appear within a predetermined distance of each other.
  6.  A natural language processing method for analyzing the document structure of text data input to a natural language processing apparatus, wherein the natural language processing apparatus:
     extracts candidate pairs of a numerical attribute and an attribute value from the input text data;
     calculates, based on a first model indicating the distribution of attribute values corresponding to a given numerical attribute, a value indicating the validity of each extracted candidate pair of a numerical attribute and an attribute value; and
     determines the pair of the numerical attribute and the attribute value from among the candidates based on the value indicating validity.
  7.  The natural language processing method according to claim 6, wherein the first model is a model estimated by learning, as teacher data, text data containing information that identifies pairs of a numerical attribute and an attribute value.
  8.  The natural language processing method according to claim 7, wherein the first model is a model estimated by a parametric estimation method or a nonparametric estimation method.
  9.  The natural language processing method according to claim 8, wherein the natural language processing apparatus further calculates, based on a second model, a value indicating the strength of the relationships between the numerical attributes, and determines the pair of the numerical attribute and the attribute value based on the two calculated values.
  10.  The natural language processing method according to claim 9, wherein the second model is a model calculated based on the frequency with which a plurality of numerical attributes contained in the teacher data appear within a predetermined distance of each other.