WO2018025317A1 - Natural language processing device and method - Google Patents

Natural language processing device and method

Info

Publication number
WO2018025317A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
model
natural language
language processing
numerical
Prior art date
Application number
PCT/JP2016/072583
Other languages
English (en)
Japanese (ja)
Inventor
康嗣 森本
利彦 柳瀬
芳樹 丹羽
利昇 三好
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所
Priority to PCT/JP2016/072583 priority Critical patent/WO2018025317A1/fr
Priority to JP2018531005A priority patent/JP6546703B2/ja
Publication of WO2018025317A1 publication Critical patent/WO2018025317A1/fr

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to a syntactic analysis technique for natural language processing.
  • In particular, the present invention relates to a technique for extracting attribute / attribute value pairs from text.
  • With the spread of personal computers and the Internet, the amount of digitized documents accessible to users is increasing. Electronic documents are unstructured data and are difficult to handle with a computer. Expectations for natural language processing that structures large volumes of digitized documents so that they can be used effectively are therefore increasing.
  • One technique for structuring digitized documents is attribute / attribute value extraction.
  • Attribute / attribute value extraction is a technique for extracting pairs of an attribute such as "sex" and an attribute value such as "female" from text. It is used, for example, to extract various attributes describing product specifications and their attribute values from diverse information sources and to display them in an integrated manner. Among such attributes, an attribute whose value is a number is called a numerical attribute.
  • For example, weight and size are numerical attributes. Such numerical attributes carry quantitative, objective information and are particularly valuable among attribute / attribute value information. Reflecting this importance, techniques are known for automatically extracting attribute / attribute value pairs including numerical attributes; these known techniques are described below.
  • The most standard technique for extracting attribute / attribute value pairs of numerical attributes uses syntactic information as a clue.
  • Non-Patent Document 1, for example, discloses a technique in which, for a sentence such as "Weight is 10 Kg", the numeric string "10 Kg" is first extracted as an attribute value candidate, and "Weight" is then recognized as the attribute from the result of syntactic parsing.
  • However, at the current state of the art, parsing techniques for obtaining syntactic information cannot always produce a correct analysis. A semantic method that uses the strength of the relationship between attributes and attribute values as a clue is therefore required.
  • As known techniques for extracting attribute / attribute value pairs using information other than syntactic information, there are techniques that use "proximity" as a clue and techniques that use the "correlation" between attributes and attribute values as a clue.
  • In Non-Patent Document 2, a phrase that is a candidate attribute or attribute value is first extracted from the text; the correspondence between attribute candidates and attribute value candidates is then determined. For this second step in particular, a technique that combines "syntactic information", "proximity", and "correlation" appropriately is disclosed. "Proximity" is an approximation of syntactic information, while "correlation" is semantic information.
  • Non-Patent Document 2 can thus associate attributes with attribute values using a correlation that indicates the strength of their semantic relationship even when syntactic information cannot be used as a clue.
  • Attribute values of non-numerical attributes are finite in number and not especially numerous, so the strength of the relationship between the attribute and the attribute value can be enumerated for all attribute values. For example, when "manufacturer" is considered as an attribute of a certain product, attribute values such as "Company A", "Company B", and "Company C" exist, and the strength of the relationship can be considered for each attribute / attribute value pair such as <manufacturer, Company A>, <manufacturer, Company B>, and <manufacturer, Company C>. Specifically, a statistic such as pointwise mutual information can be calculated from the appearance frequency of each pair. By learning the strength of such relationships from correct answer data, attribute / attribute value pairs can be extracted from text.
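The pointwise-mutual-information statistic mentioned above can be sketched as follows; the co-occurrence counts and attribute names are invented for illustration only.

```python
from collections import Counter
from math import log2

def pmi_scores(pair_counts):
    """Pointwise mutual information for each <attribute, attribute value>
    pair, computed from co-occurrence frequencies."""
    total = sum(pair_counts.values())
    attr_freq, val_freq = Counter(), Counter()
    for (a, v), c in pair_counts.items():
        attr_freq[a] += c
        val_freq[v] += c
    return {
        (a, v): log2((c / total) / ((attr_freq[a] / total) * (val_freq[v] / total)))
        for (a, v), c in pair_counts.items()
    }

# Hypothetical co-occurrence counts, for illustration only.
counts = {
    ("manufacturer", "Company A"): 8,
    ("manufacturer", "Company B"): 2,
    ("color", "Company A"): 1,
    ("color", "red"): 9,
}
scores = pmi_scores(counts)
```

A pair that co-occurs more often than chance predicts receives a positive score, which is what makes the statistic usable as a relationship strength.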
  • In contrast, the numerical values that serve as attribute values are in principle continuous, so there are infinitely many possible values.
  • Even when an attribute value is an integer and therefore discrete, the number of distinct attribute values becomes very large, so not all attribute values can be expected to appear in the training data. It is therefore difficult to learn the strength of the relationship between attribute and attribute value in advance by the method used for non-numerical attributes.
  • An object of the present invention is to extract attribute / attribute value pairs of numerical attributes accurately and at high speed by providing means for calculating the strength of the relationship between a numerical attribute and its attribute value.
  • A representative example of the present invention is a natural language processing apparatus that analyzes the document structure of input text data, comprising: an extraction unit that extracts candidate pairs of a numerical attribute and an attribute value from the text data; a calculation unit that calculates, based on a model representing the distribution of attribute values for each numerical attribute, a value indicating the validity of each extracted candidate pair; and a determination unit that determines the numerical attribute / attribute value pairs from the candidates based on the calculated validity values.
  • According to the present invention, attribute / attribute value pairs of numerical attributes appearing in text can be extracted with high accuracy and at high speed. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.
  • The structure of a numerical attribute extraction apparatus is shown, and an outline of the numerical attribute extraction processing executed by the numerical attribute extraction program is described.
  • A flowchart of the numerical attribute extraction process performed by the numerical attribute extraction program is shown. A configuration example of the correct text is shown.
  • A configuration example of the attribute / attribute value pair list is shown.
  • A flowchart of the attribute / attribute value relationship model learning process is shown.
  • A configuration example of the attribute / attribute value relationship model is shown.
  • A flowchart of the inter-attribute relationship model learning process is shown.
  • A configuration example of the inter-attribute relationship model is shown, together with an explanatory diagram of the attribute / attribute value pair determination process.
  • A flowchart of the attribute / attribute value pair determination process is shown.
  • A flowchart of the attribute / attribute value pair determination process is shown.
  • In this embodiment, a method of extracting attribute / attribute value pairs of numerical attributes from text is disclosed. More specifically, a method is disclosed in which the distribution of attribute values for each attribute is learned as a model and numerical attributes are determined based on that model.
  • One of the objects of the present disclosure is to extract attribute / attribute value pairs with high accuracy by learning the strength of the relationship between attributes and attribute values from correct data.
  • One example of the technology of the present disclosure learns the distribution of attribute values for each attribute from correct data as a model; when a numerical value is given, the strength of the relationship between the given value and each attribute is calculated as a posterior probability, and attributes are extracted based on the calculated posterior probabilities.
  • Furthermore, the accuracy of attribute / attribute value pair extraction can be improved by learning the strength of the relationships between attributes from correct data as a model and considering the order of appearance of attributes.
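The posterior-probability idea above can be sketched as follows; this is not the patent's exact procedure, and the attribute names, normal-distribution choice, and parameters are all illustrative assumptions. Given a numeric value, Bayes' rule turns each attribute's learned value distribution into P(attribute | value).

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical learned models: each attribute's value distribution,
# here a normal distribution with assumed (mean, std dev) parameters.
models = {"ALT": (25.0, 10.0), "body weight": (60.0, 12.0)}
priors = {"ALT": 0.5, "body weight": 0.5}

def attribute_posterior(value):
    """P(attribute | value) by Bayes' rule over the learned models."""
    joint = {a: priors[a] * normal_pdf(value, mu, sd)
             for a, (mu, sd) in models.items()}
    z = sum(joint.values())
    return {a: p / z for a, p in joint.items()}

post = attribute_posterior(22.0)
```

The attribute with the highest posterior is then taken as the value's attribute.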
  • FIG. 1 shows the configuration of the numerical attribute extraction device 100 of the present embodiment.
  • The numerical attribute extraction device 100 includes a CPU 101, a main memory 102, an input / output device 103, and a disk device 110.
  • the CPU 101 is a processor, and executes various processes by executing programs stored in the main memory 102. Specifically, the CPU 101 loads a program stored in the disk device 110 into the main memory 102 and executes the program loaded in the main memory 102. The program may be loaded from the external server to the main memory 102 via the network.
  • the main memory 102 stores programs executed by the CPU 101 and data required by the CPU 101.
  • the input / output device 103 receives input of information from the user and outputs information in response to an instruction from the CPU 101.
  • the input / output device 103 is an input device such as a keyboard and a mouse, and an output device such as a display.
  • the disk device 110 is an auxiliary memory including a computer-readable non-transitory storage medium.
  • the disk device 110 stores various programs and various data. Specifically, the disk device 110 stores an OS 111, a numerical attribute extraction program 112, a correct text 113, a new text 114, an attribute / attribute value pair list 115, teacher data 116, and a numerical attribute extraction model 117.
  • The numerical attribute extraction model 117 includes an attribute / attribute value relationship model 1171, used for determining whether a pair of an attribute candidate and an attribute value candidate is truly an attribute / attribute value pair, and an inter-attribute relationship model 1172, which indicates which attributes are likely to appear together.
  • The numerical attribute extraction program 112 determines whether each word / phrase pair included in the new text 114 is an attribute / attribute value pair, and extracts the pairs so determined as attribute / attribute value pairs of numerical attributes.
  • The numerical attribute extraction program 112 includes a correct answer pair extraction subprogram 1121, a teacher data creation subprogram 1122, a model learning subprogram 1123, an attribute / attribute value candidate extraction subprogram 1124, and an attribute / attribute value pair determination subprogram 1125. The processing of these subprograms will be described in detail with reference to FIG.
  • the CPU 101 realizes a predetermined function by executing the above program.
  • Since a program performs predetermined processing by being executed by a processor, descriptions herein that take a program as the subject may also be read with the CPU 101 or the numerical attribute extraction device 100 as the subject.
  • the CPU 101 operates as a functional unit (means) that realizes a predetermined function by operating according to a program.
  • the CPU 101 functions as a numerical attribute extraction unit (numerical attribute extraction means) by operating according to the numerical attribute extraction program 112.
  • the numerical attribute extraction apparatus 100 is an apparatus including these functional units (means).
  • The correct text 113 is data input to the numerical attribute extraction program 112 and is used to learn a model for extracting attribute / attribute value pairs of numerical attributes.
  • The correct text 113 is text annotated with the information necessary to identify the appearance positions of numerical attributes and their corresponding attribute values, and may have any format.
  • For example, the correct text 113 is constructed by inserting tags marking attributes and attribute values into the text, or by preparing, separately from the text, a table giving the start and end positions of each attribute and attribute value.
  • The new text 114 is the text from which attribute / attribute value pairs of numerical attributes are to be extracted. Usually a new text different from the correct text 113 is targeted. After attribute / attribute value pair extraction has been executed for a new text, that text can also be registered as correct text 113.
  • The attribute / attribute value pair list 115 is a list, ordered by appearance position in the text, of the character string pairs indicating the attributes and attribute values extracted from the correct text 113.
  • The positional relationship between pairs contained in different texts is arbitrary.
  • For each pair, the character strings, their appearance positions, the unit of the attribute value, and the like are stored.
  • the teacher data 116 has the same format as the attribute / attribute value pair list.
  • Attribute normalization means unifying different notations such as “ANION GAP” and “Anion Gap”.
  • Normalization of attribute values means unifying units: numeric strings with different appearances, such as "300 mg" and "0.3 g", are converted to a single unit.
  • the numerical attribute extraction model 117 is data used by the attribute / attribute value pair determination subprogram 1125 and includes an attribute / attribute value relationship model 1171 and an inter-attribute relationship model 1172.
  • the attribute / attribute value relationship model 1171 indicates a criterion for determining whether or not two words are a pair of corresponding attribute and attribute value. Specifically, the attribute / attribute value relationship model 1171 expresses a distribution of attribute values for each attribute.
  • the inter-attribute relationship model 1172 indicates the strength of the relationship between two attributes, that is, which other attributes are likely to appear when a certain attribute appears. Specifically, the inter-attribute relationship model 1172 expresses a conditional probability of a subsequent appearing attribute with respect to an appearing attribute.
  • FIG. 2 shows an outline of the numerical attribute extraction processing executed by the numerical attribute extraction device 100.
  • the correct answer pair extraction subprogram 1121 acquires attribute / attribute value pairs from the correct text 113 and generates an attribute / attribute value pair list 115.
  • the teacher data creation subprogram 1122 refers to the attribute / attribute value pair list 115, identifies the attribute and attribute value to be normalized, and normalizes the attribute and attribute value.
  • the model learning subprogram 1123 reads the teacher data 116 and learns the numerical attribute extraction model 117.
  • the attribute / attribute value pair candidate extraction subprogram 1124 reads the new text 114 and extracts character strings that are candidates for attributes and attribute values.
  • The attribute / attribute value pair determination subprogram 1125 receives the candidate attribute and attribute value character strings from the attribute / attribute value pair candidate extraction subprogram 1124 and uses the numerical attribute extraction model 117 to determine whether each pair of attribute candidate and attribute value candidate is an attribute / attribute value pair.
  • Word / phrase pairs determined to be attribute / attribute value pairs are stored in the correct text 113, after a manual check if necessary.
  • FIG. 3 shows a flowchart of numerical attribute extraction processing by the numerical attribute extraction program 112.
  • the correct answer pair extraction subprogram 1121 acquires the attribute / attribute value pair of the numerical attribute from the correct answer text 113 as a correct answer, and outputs it as an attribute / attribute value pair list (S11).
  • FIG. 4 shows an example of the correct text. In the example of FIG. 4, the correct text consists of the text plus a list of the appearance positions of attribute / attribute value pairs in the text. Although the correct text is created manually here, attribute / attribute value pairs may instead be extracted from the text using conventional techniques based on syntactic analysis results or pattern matching, and text in which only the highly reliable pairs are treated as correct answers may be used.
  • FIG. 5 shows an example of the attribute / attribute value pair list 115.
  • the attribute / attribute value pair list 115 is data in which attributes, attribute values, and additional information extracted from the correct text are stored in each row.
  • For example, the first row indicates that the sentence with document ID 1 and sentence ID 1 contains an attribute / attribute value pair whose numerical attribute is "ALT" (alanine aminotransferase; alanine transaminase), whose attribute value is "7", and whose unit is "U/L".
  • Next, the teacher data creation subprogram 1122 generates the teacher data 116 from the acquired attribute / attribute value pair list 115 (S12).
  • Specifically, the attributes and attribute values are normalized.
  • For attribute normalization, synonym extraction and spelling-variation extraction techniques known in the prior art are used. In the example of FIG. 5, "ALT" and "AlT (SGPT)" are normalized to "ALT".
  • For attribute values, the numeric character strings extracted from the text are first converted to numbers. The units are then checked and, if they are not uniform, unified into a standard unit, converting the numbers accordingly. For example, if one attribute value is "200mg" and another is "0.3g", "mg" and "g" are unified to, say, "mg", and "0.3" is converted to "300".
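A minimal sketch of this unit normalization, assuming a hand-written conversion table (a real system would need a much fuller unit table and a parser for strings such as "0.3g"):

```python
# Conversion factors to a standard unit ("mg"); the factors are the
# usual metric ones, but the table itself is only an illustration.
TO_MG = {"mg": 1.0, "g": 1000.0, "kg": 1_000_000.0}

def normalize_value(value_str, unit):
    """Express a numeric attribute value string in the standard unit (mg)."""
    return float(value_str) * TO_MG[unit]
```

With this, "0.3 g" and "300 mg" map to the same number, so the teacher data counts them as the same attribute value.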
  • the model learning subprogram 1123 learns the numerical attribute extraction model 117 from the teacher data 116 (S13).
  • The numerical attribute extraction model 117 includes the attribute / attribute value relationship model 1171 and the inter-attribute relationship model 1172. Details of step S13 will be described later with reference to FIG.
  • the attribute / attribute value candidate extraction subprogram 1124 extracts attributes or attribute value candidates from the new text 114 from which the attribute / attribute value pair is to be extracted (S14).
  • an information extraction technique known as a conventional technique can be used, and a description thereof will be omitted.
  • candidates for attributes or attribute values are extracted as entities.
  • Finally, the attribute / attribute value pair determination subprogram 1125 determines whether the attribute / attribute value candidate pairs extracted in step S14 are true attribute / attribute value pairs, and extracts the attribute / attribute value pairs (S15). Details of step S15 will be described later with reference to FIGS.
  • One component of the numerical attribute extraction model 117 is the attribute / attribute value relationship model 1171, which indicates the strength of the relationship between an attribute and an attribute value.
  • The attribute / attribute value relationship model 1171 indicates how valid it is for a certain value to be a value of a certain attribute.
  • The other component is the inter-attribute relationship model 1172, which represents the strength of the relationships between attributes.
  • The inter-attribute relationship model 1172 indicates which other attributes are likely to appear when a certain attribute appears.
  • FIG. 6 shows a processing flow of learning processing of the attribute / attribute value relationship model 1171.
  • When all models have been processed, the process proceeds to S1305; if an unprocessed model remains, the process proceeds to S1304 (S1303).
  • the model refers to a type of probability distribution such as a normal distribution. Since the distribution of attribute values differs for each attribute, it is preferable to set different types of models in advance and select the most appropriate model.
  • Next, the most appropriate parameters are calculated from all the attribute values acquired in S1302 (S1304).
  • Specifically, for each model, the parameters are determined by maximum likelihood estimation from the attribute values acquired in S1302.
  • a normal distribution, a log normal distribution, and a rectangular distribution are used as models.
  • the normal distribution and lognormal distribution have two parameters, and the rectangular distribution has three parameters.
  • For the normal distribution and the lognormal distribution, the parameters can be determined directly from statistics such as the mean and standard deviation.
  • For the rectangular distribution, an appropriate setting is searched for while varying the parameters.
  • the most appropriate model is selected from preset models based on the parameters calculated in S1304 and stored in the attribute / attribute value relationship model 1171 (S1305).
  • FIG. 7 shows an example of the attribute / attribute value relationship model 1171.
  • Each row of the attribute / attribute value relationship model 1171 includes a model type 11712 and a model parameter group 11713 having the highest validity for each attribute 11711.
  • Various methods can be considered as a method for determining the validity of the model.
  • Here, AIC (the Akaike information criterion) is used as an example.
  • In the example of FIG. 7, the lognormal distribution is selected as the most appropriate model from the maximum likelihood value and the number of parameters of each model.
  • So far, a parametric estimation method has been described in which a continuous probability distribution is assumed and a model is selected based on an information criterion such as AIC.
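The fit-by-MLE-then-select-by-AIC loop can be sketched as follows for two of the candidate distributions; the function names and test data are illustrative, and the rectangular distribution is omitted for brevity. The lognormal fit reuses the normal fit on the log-transformed values, adding the Jacobian term to its log-likelihood.

```python
from math import log, pi, sqrt

def fit_normal(xs):
    """Maximum likelihood estimates (mean, std dev) for a normal."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, sqrt(var)

def normal_loglik(xs, mu, sigma):
    """Log-likelihood of the data under a normal distribution."""
    return sum(-0.5 * log(2 * pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def select_model(xs):
    """Fit normal and lognormal by MLE and keep the lower-AIC model."""
    mu, sd = fit_normal(xs)
    scores = {"normal": aic(normal_loglik(xs, mu, sd), 2)}
    logs = [log(x) for x in xs]          # lognormal = normal on log values
    lmu, lsd = fit_normal(logs)
    # the lognormal log-likelihood adds the -sum(log x) Jacobian term
    scores["lognormal"] = aic(normal_loglik(logs, lmu, lsd) - sum(logs), 2)
    return min(scores, key=scores.get)
```

Skewed positive data tends to select the lognormal model, while roughly symmetric data selects the normal.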
  • A non-parametric estimation method that does not assume a specific distribution is also conceivable.
  • The K-nearest neighbor method is a technique that selects the K cases closest to a given case and performs classification and the like based on the classes of the selected cases.
  • Specifically, the teacher data 116 is referred to and the K rows whose attribute values are closest to the given value are selected.
  • The attribute column 1154 of these rows is acquired and tabulated. If a certain attribute occurs k times among the selected rows, then k / K is the reliability of this value for that attribute.
  • In this case, an attribute / attribute value relationship model is not explicitly created; the necessary numerical values are obtained by executing the above calculation during the attribute / attribute value pair determination process in step S15.
  • A kernel function method may also be used instead of the K-nearest neighbor method.
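The K-nearest-neighbor estimate described above can be sketched as follows; the teacher rows and the choice of K are invented for illustration.

```python
def knn_attribute_confidence(teacher, value, K):
    """Non-parametric alternative: select the K teacher rows whose
    attribute values are closest to `value`; if an attribute occurs in
    k of them, its confidence for this value is k / K."""
    nearest = sorted(teacher, key=lambda row: abs(row[1] - value))[:K]
    counts = {}
    for attr, _ in nearest:
        counts[attr] = counts.get(attr, 0) + 1
    return {a: k / K for a, k in counts.items()}

# Hypothetical teacher data: (attribute, attribute value) rows.
teacher = [("ALT", 20), ("ALT", 25), ("ALT", 30),
           ("AGE", 60), ("AGE", 72), ("AGE", 80)]
conf = knn_attribute_confidence(teacher, 27, K=3)
```

No explicit model is built; the computation runs directly against the teacher data at determination time, matching the description above.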
  • So far the attribute value has been treated as a numerical value, but it may be desirable to treat it as categorical data.
  • For example, consider drugs that differ only in the amount of active ingredient, such as "drug name 250 mg" and "drug name 100 mg".
  • Here "drug name" is an attribute of "250 mg", but since there are only a few distinct attribute values, it is desirable to treat them in the same way as non-numerical attribute values. In such a case, the counts of the individual values such as "250 mg" and "100 mg" become large.
  • Therefore, when the total number of observed attribute values divided by the number of distinct attribute values exceeds a predetermined threshold, the values can be treated as categorical data.
  • In that case, the appearance probability of each attribute value may be estimated by maximum likelihood as before.
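The categorical-data test described above reduces to a one-line heuristic; the threshold value is an assumed illustration, not a value from the patent.

```python
def treat_as_categorical(values, threshold=5.0):
    """When the total number of observed values divided by the number of
    distinct values exceeds a threshold, each value repeats often enough
    to be handled as categorical data. Threshold is illustrative."""
    return len(values) / len(set(values)) > threshold
```

For example, a drug dose column containing only "250" and "100" many times over passes the test, while a column of mostly unique measurements does not.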
  • Fig. 8 shows the processing flow of the learning process of the inter-attribute relationship model.
  • First, the attribute / attribute value pairs are acquired from the attribute / attribute value pair list 115 (S1311).
  • Next, attribute 2-grams are extracted according to the order of appearance of the attribute values in the text (S1312).
  • For example, the attribute 2-gram is the ordered pair <ALT, AST> when an attribute value with attribute "AST" appears after an attribute value with attribute "ALT".
  • A high frequency of this 2-gram indicates that "AST" is likely to appear after "ALT".
  • the frequency of 2-grams of the extracted attributes is totaled (S1313).
  • Finally, the conditional probability related to the appearance order of attributes, that is, the conditional probability of the attribute of the next attribute value given that an attribute value of a certain attribute has appeared, is calculated and stored in the inter-attribute relationship model 1172.
  • An example of the inter-attribute relationship model 1172 is shown in FIG. For example, all 2-grams with "ALT" as the first element are collected and the sum of their frequencies is obtained; dividing the frequency of <ALT, AST> by this sum gives the conditional probability that "AST" appears after "ALT". That is, if this value is large, "AST" is likely to appear after "ALT".
  • Here, the order of appearance is expressed using 2-grams; that is, a first-order Markov process is assumed, but a second-order or higher Markov process may also be used.
  • Alternatively, a model in which the probability is high if an attribute appears within a certain distance can be used.
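Estimating the inter-attribute relationship model from attribute 2-grams can be sketched as follows; the attribute sequences are invented examples of appearance orders taken from correct texts.

```python
from collections import Counter

def attribute_bigram_model(sequences):
    """Estimate P(next attribute | current attribute) from the
    frequencies of attribute 2-grams in appearance order."""
    bigrams, firsts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            firsts[a] += 1   # times `a` occurred as the first element
    return {(a, b): c / firsts[a] for (a, b), c in bigrams.items()}

# Hypothetical appearance orders of attributes in correct texts.
sequences = [["ALT", "AST", "AGE"], ["ALT", "AST"], ["ALT", "AGE"]]
model = attribute_bigram_model(sequences)
```

Here "AST" follows "ALT" in two of the three texts, so the model assigns P(AST | ALT) = 2/3.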
  • In S15, attribute / attribute value pairs are extracted as follows.
  • First, consider the case where the new text 114 contains one numerical value as an attribute value candidate and a plurality of attribute candidates.
  • In this case, an appropriate attribute can be determined from the attribute / attribute value relationship model as follows.
  • The likelihood of each attribute is given by the conditional probability of that attribute given the numerical value of the attribute value candidate in the text.
  • The attribute with the highest likelihood may then be selected as the attribute for the attribute value candidate.
  • In addition, supervised learning using grammatical dependency and the distance between attributes and attribute values (number of characters, number of words, distance on the syntax tree, etc.) can also be performed.
  • When there are a plurality of attribute value candidates, an appropriate combination is determined by the following two methods and combinations thereof.
  • The first method is based on the inter-attribute relationship model 1172.
  • The combination search can be made efficient using information on whether the relationships between attributes are strong or weak.
  • First, a method applying the Markov tagger used for part-of-speech tagging of sentences will be described.
  • FIG. 11 shows the processing flow of the attribute / attribute value pair determination process by the Markov tagger.
  • a series of attribute value candidates in order of appearance is acquired from the attribute value candidates extracted from the new text 114 (S1501). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
  • the conditional probability of the attribute is calculated for each attribute value candidate in the series, and an attribute with a high probability is acquired (S1502).
  • a list such as “ALT”, “AST”, and “AGE” is obtained for the attribute value candidate “6”.
  • the optimum attribute transition sequence is determined (S1504).
  • The optimum sequence can be determined by an algorithm known as the Viterbi algorithm, so a detailed description is omitted.
  • In this example, the result is obtained that the attribute series [ALT, AST] is most likely for the attribute value candidate series [6, 20].
  • As a result, two attribute / attribute value pairs, <ALT, 6> and <AST, 20>, are obtained.
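The Viterbi decoding over attribute sequences can be sketched as follows, with stand-in validity and transition models; all attribute names, scores, and probabilities are illustrative assumptions, not values from the patent.

```python
def viterbi(values, attrs, validity, trans, start):
    """Most plausible attribute sequence for a sequence of numeric
    attribute value candidates.

    validity[a](v): score of value v under attribute a (standing in for
    the attribute / attribute value relationship model); trans[(a, b)]:
    P(b | a) (standing in for the inter-attribute relationship model);
    start[a]: initial probability."""
    # best[a] = (probability of the best path ending in a, that path)
    best = {a: (start[a] * validity[a](values[0]), [a]) for a in attrs}
    for v in values[1:]:
        best = {
            b: max(((p * trans.get((a, b), 1e-9) * validity[b](v), path + [b])
                    for a, (p, path) in best.items()),
                   key=lambda t: t[0])
            for b in attrs
        }
    return max(best.values(), key=lambda t: t[0])[1]

# Toy models: ALT values cluster near 6, AST near 20 (assumed numbers).
validity = {"ALT": lambda v: 1.0 / (1 + abs(v - 6)),
            "AST": lambda v: 1.0 / (1 + abs(v - 20))}
trans = {("ALT", "AST"): 0.9, ("AST", "ALT"): 0.1,
         ("ALT", "ALT"): 0.1, ("AST", "AST"): 0.1}
start = {"ALT": 0.5, "AST": 0.5}
seq = viterbi([6, 20], ["ALT", "AST"], validity, trans, start)
```

With these toy models, the value sequence [6, 20] is decoded into the attribute sequence [ALT, AST], mirroring the example in the text.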
  • FIG. 10 is a conceptual explanatory diagram when performing numerical attribute extraction on the new text 114.
  • Here, "ALT" and "AST" are extracted as attribute candidates, and "6" and "20" are extracted as attribute value candidates.
  • FIG. 10 shows a conceptual diagram of the Markov tagger that assigns an attribute sequence to this sequence of attribute values.
  • There is a sequence of attribute value candidates extracted from the text, and for each attribute value the most plausible attribute candidates are acquired from the attribute / attribute value relationship model 1171.
  • For example, for the attribute value "6", the attributes "ALT", "AST", and "AGE" are acquired in descending order of validity.
  • Here, the attribute with the highest validity is "ALT", and naively selecting the highest-validity attribute for every value would give an incorrect result.
  • To avoid this, processing is performed as follows using the inter-attribute relationship model 1172.
  • In the first method, the attribute candidates in the text are not used explicitly.
  • Instead, priority is given to the order of appearance observed in past data such as the correct text 113.
  • Next, attribute / attribute value pair determination processing by DP matching will be described.
  • FIG. 12 shows a processing flow of attribute / attribute value pair determination processing by DP matching.
  • a series of attribute value candidates in the order of appearance is acquired from the attribute value candidates extracted from the new text 114 (S1511). In the case of FIG. 10, a sequence such as [6, 20] is obtained.
  • a series of attribute candidates in order of appearance is acquired from the attribute candidates extracted from the new text 114 (S1512).
  • a sequence such as [ALT, AST] is obtained.
  • Next, the strength of the relationship between every attribute and attribute value pair is acquired from the attribute / attribute value relationship model, and a matrix of the acquired values is generated (S1513). Specifically, the score for the I-th attribute and the J-th attribute value is stored in element <I, J> of the matrix.
  • Finally, the optimum attribute transition sequence is determined (S1514). Since the sequence can be determined by DP matching, a detailed description is omitted.
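The DP-matching step can be sketched as an order-preserving sequence alignment; the score table below is a toy stand-in for the attribute / attribute value relationship model, with invented numbers.

```python
def dp_match(attrs, values, score):
    """Order-preserving alignment of attribute candidates to attribute
    value candidates. score(a, v) is the strength of the <a, v>
    relationship; skipped candidates contribute nothing."""
    n, m = len(attrs), len(values)
    # best[i][j]: best total score using attrs[:i] and values[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],                 # leave attrs[i-1] unmatched
                best[i][j - 1],                 # leave values[j-1] unmatched
                best[i - 1][j - 1] + score(attrs[i - 1], values[j - 1]),
            )
    pairs, i, j = [], n, m                      # trace back the best path
    while i > 0 and j > 0:
        s = score(attrs[i - 1], values[j - 1])
        if s > 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((attrs[i - 1], values[j - 1]))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Toy relationship scores (assumed numbers).
table = {("ALT", 6): 0.9, ("AST", 20): 0.8, ("ALT", 20): 0.2, ("AST", 6): 0.1}
pairs = dp_match(["ALT", "AST"], [6, 20], lambda a, v: table.get((a, v), 0.0))
```

The matrix `best` plays the role of the score matrix described in S1513, and the traceback recovers the matched pairs.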
  • the attribute / attribute value pair determination process described in FIGS. 11 and 12 can be used in combination.
  • the method of using the attribute candidates in the new text described above with reference to FIG. 12 as a constraint is executed, and the correspondence relationship obtained by DP matching is a part where the reliability is low, for example, a part where the value set in the matrix is small.
  • the attribute candidate information in the new text is not used as a strong constraint, but can be used to temporarily modify the inter-attribute relationship model 1172.
  • Specifically, a series of attribute candidates is created, and the values in the inter-attribute relationship model 1172 are increased by a certain ratio for any pair of attributes on this series.
  • An attribute pair that does not exist in the inter-attribute relationship model 1172 is newly added, with a predetermined value set as its probability.
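The temporary modification of the inter-attribute relationship model 1172 described in the two bullets above can be sketched as follows. The boost ratio, the default probability, and the dictionary representation of the model are illustrative assumptions, not values from the patent.

```python
from itertools import combinations

# Sketch: boost model values for attribute pairs observed on the
# candidate series, and add unseen pairs with a default probability.
def adjust_model(model, attr_series, boost=1.2, default_prob=0.05):
    """model: dict mapping (attr_a, attr_b) -> relation strength."""
    adjusted = dict(model)  # leave the original model untouched
    for pair in combinations(attr_series, 2):  # any pair on the series
        if pair in adjusted:
            adjusted[pair] = adjusted[pair] * boost  # raise by a fixed ratio
        else:
            adjusted[pair] = default_prob            # add with default value
    return adjusted
```

Because the adjustment is applied to a copy, the model reverts automatically once the current text has been processed, matching the "temporary" nature of the modification.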
  • For example, BP denotes blood pressure; in such an expression, 141 is the highest (systolic) blood pressure and 70 is the lowest (diastolic) blood pressure.
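As a concrete illustration of this example, a compound reading can be split into its numeric sub-attributes. The "BP 141/70" notation, the regular expression, and the sub-attribute labels below are assumptions for illustration; the patent's actual extraction procedure is not reproduced here.

```python
import re

# Sketch: split a compound blood-pressure reading such as "BP 141/70"
# into two attribute/attribute-value pairs, one per sub-attribute.
def parse_bp(text):
    m = re.search(r"\bBP\s*(\d+)\s*/\s*(\d+)", text)
    if not m:
        return None
    return {
        "highest (systolic) blood pressure": int(m.group(1)),
        "lowest (diastolic) blood pressure": int(m.group(2)),
    }
```

A single surface token thus yields two numeric attribute values, which can then be checked against each attribute's value distribution as described earlier in the document.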
  • In this way, attribute/attribute-value pairs involving numeric attributes can be extracted at high speed and with high accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention aims to accurately extract, from text, an attribute/attribute-value pair relating to a numeric attribute that takes a numerical value. The natural language processing device according to the present invention extracts attribute/attribute-value pairs from correct-answer text and computes the validity of associating each attribute with an unknown numerical value on the basis of the distribution of that attribute's values. The validity calculation for each attribute is performed by determining the distribution most similar to the distribution of the attribute's values in the correct-answer text and then using the determined distribution. When a plurality of attribute values are subject to this pair-validity calculation, relations between attributes and relations between attributes and values are learned from the correct-answer text so as to determine appropriate pairs.
PCT/JP2016/072583 2016-08-02 2016-08-02 Natural language processing device and method WO2018025317A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2016/072583 WO2018025317A1 (fr) 2016-08-02 2016-08-02 Natural language processing device and method
JP2018531005A JP6546703B2 (ja) 2016-08-02 2016-08-02 Natural language processing device and natural language processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/072583 WO2018025317A1 (fr) 2016-08-02 2016-08-02 Natural language processing device and method

Publications (1)

Publication Number Publication Date
WO2018025317A1 true WO2018025317A1 (fr) 2018-02-08

Family

ID=61073110

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/072583 WO2018025317A1 (fr) 2016-08-02 2016-08-02 Natural language processing device and method

Country Status (2)

Country Link
JP (1) JP6546703B2 (fr)
WO (1) WO2018025317A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005250682A (ja) * 2004-03-02 2005-09-15 Oki Electric Ind Co Ltd Information extraction system
JP2010117797A (ja) * 2008-11-11 2010-05-27 Hitachi Ltd Numerical expression processing device
JP2010182165A (ja) * 2009-02-06 2010-08-19 Hitachi Ltd Analysis system and information analysis method
JP2013527958A (ja) * 2010-04-21 2013-07-04 Microsoft Corp Product synthesis from multiple sources

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KATSUYUKI FUJIHATA ET AL.: "Extraction of Numerical Expressions by Constraints and Default Rules of Dependency Structure", IPSJ SIG NOTES, vol. 2001, no. 86, 11 September 2001 (2001-09-11), pages 119 - 125 *
RAYID GHANI ET AL.: "Text Mining for Product Attribute Extraction", SIGKDD EXPLORATIONS NEWSLETTER, vol. 8, no. 1, June 2006 (2006-06-01), pages 41 - 48, XP058283395 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020064482A (ja) * 2018-10-18 2020-04-23 Hitachi Ltd Attribute extraction device and attribute extraction method
JP7125322B2 (ja) 2018-10-18 2022-08-24 Hitachi Ltd Attribute extraction device and attribute extraction method
CN110046345A (zh) * 2019-03-12 2019-07-23 Tongdun Holdings Co., Ltd. Data extraction method and device

Also Published As

Publication number Publication date
JPWO2018025317A1 (ja) 2018-11-15
JP6546703B2 (ja) 2019-07-17

Similar Documents

Publication Publication Date Title
WO2020082560A1 (fr) Text keyword extraction method, apparatus and device, and computer-readable storage medium
CN105988990B (zh) Chinese zero-anaphora resolution device and method, model training method, and storage medium
US10210245B2 (en) Natural language question answering method and apparatus
US11544459B2 (en) Method and apparatus for determining feature words and server
US10496928B2 (en) Non-factoid question-answering system and method
US9619583B2 (en) Predictive analysis by example
US20140149102A1 (en) Personalized machine translation via online adaptation
JP5544602B2 (ja) Word semantic relation extraction device and word semantic relation extraction method
US11244009B2 (en) Automatic keyphrase labeling using search queries
US20200065387A1 (en) Systems and methods for report processing
WO2017198031A1 (fr) Semantic analysis method and apparatus
JP5234232B2 (ja) Synonymous expression determination device, method, and program
US20220245353A1 (en) System and method for entity labeling in a natural language understanding (nlu) framework
WO2022134779A1 (fr) Method, apparatus and device for extracting data associated with a character's actions, and storage medium
JP4534666B2 (ja) Text sentence search device and text sentence search program
US9547645B2 (en) Machine translation apparatus, translation method, and translation system
US10509812B2 (en) Reducing translation volume and ensuring consistent text strings in software development
CN111444713B (zh) Method and device for extracting entity relationships within news events
US20220222442A1 (en) Parameter learning apparatus, parameter learning method, and computer readable recording medium
WO2018025317A1 (fr) Natural language processing device and method
Khan et al. A clustering framework for lexical normalization of Roman Urdu
JP6867963B2 (ja) Summary evaluation device, method, program, and storage medium
JP2003108571A (ja) Document summarization device, control method therefor, control program therefor, and recording medium
JP5225219B2 (ja) Predicate-argument structure analysis method, device, and program
Al-Olimat et al. A practical incremental learning framework for sparse entity extraction

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2018531005

Country of ref document: JP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16911578

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16911578

Country of ref document: EP

Kind code of ref document: A1