WO2022168208A1 - Information processing device, conversion pattern determination method, name identification method, learning method, conversion pattern determination program, name identification program, and learning program
- Publication number
- WO2022168208A1 (PCT/JP2021/003965)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- conversion
- character string
- conversion pattern
- determination
- pairs
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/53—Processing of non-Latin text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Definitions
- The present invention relates to an information processing device and the like that performs name identification of character strings.
- For example, the Nikkei Stock Average is sometimes written as "Nikkei Stock Average", sometimes as "Nikkei", and sometimes, overseas, as "Nikkei225".
- Non-Patent Document 1 discloses a technique of calculating the degree of similarity between character strings and merging character strings whose calculated similarity is high.
- Non-Patent Document 2 discloses a technique in which a binary classifier determines whether or not two records indicate the same thing, using a feature amount that combines the character string vectors of the two target records.
- The techniques of Non-Patent Documents 1 and 2 are effective when the notations of the character strings to be collated are similar, but correct name identification is difficult when the notations are far apart.
- For example, "Nikkei Stock Average" and "Nikkei", whose notations are similar, can be matched correctly, but "Nikkei" and "NKK" (an abbreviation of Nikkei), whose notations are dissimilar, are difficult to match.
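To illustrate this limitation, the following minimal sketch scores string pairs with a generic character-level similarity. It uses Python's standard `difflib.SequenceMatcher` as a stand-in measure; the similarity measures actually used in Non-Patent Documents 1 and 2 are not stated here, so this choice is an assumption made purely for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a character-level similarity ratio in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


# Similar notations share a long common substring and score relatively high.
close = similarity("Nikkei Stock Average", "Nikkei")
# An abbreviation shares almost no characters with the full name and scores low.
far = similarity("Nikkei", "NKK")

print(f"similar notations:    {close:.2f}")
print(f"dissimilar notations: {far:.2f}")
```

Under this stand-in measure, the abbreviation pair scores lower than the pair with similar notations, which is the situation the conversion patterns described below are meant to address.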
- An object of one aspect of the present invention is to provide an information processing apparatus and the like that can correctly group even character strings whose notations are not similar.
- An information processing apparatus includes: data acquisition means for acquiring a data set including a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing; and conversion pattern determining means for determining, based on the results of trial conversions of the character string pairs included in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing.
- An information processing apparatus includes: conversion means for converting at least one of two character strings forming a character string pair to be collated by sequentially applying a plurality of conversion rules; and determining means for determining whether the converted character string pair indicates the same thing.
- An information processing apparatus includes: conversion means for converting, by sequentially applying a plurality of conversion rules, at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model that determines whether character string pairs to be identified indicate the same thing; and learning means for generating the judgment model by machine learning using the converted character string pair as teacher data.
- A method for determining a conversion pattern includes: acquiring, by at least one processor, a data set including a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing; and determining, based on the results of trial conversions of the character string pairs included in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing.
- A name identification method includes: converting, by at least one processor, at least one of two character strings forming a character string pair to be identified by sequentially applying a plurality of conversion rules; and determining whether the converted character string pair indicates the same thing.
- A learning method includes: converting, by at least one processor and by sequentially applying a plurality of conversion rules, at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model that judges whether or not character string pairs to be identified indicate the same thing; and generating the judgment model by machine learning using the converted character string pair as teacher data.
- A conversion pattern determination program causes a computer to function as: data acquisition means for acquiring a data set including a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing; and conversion pattern determining means for determining, based on the results of trial conversions of the character string pairs included in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing.
- A name identification program causes a computer to function as: conversion means for converting at least one of two character strings forming a character string pair to be identified by sequentially applying a plurality of conversion rules; and determination means for determining whether or not the converted character string pair indicates the same thing.
- A learning program causes a computer to function as: conversion means for converting, by sequentially applying a plurality of conversion rules, at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model that determines whether or not character string pairs to be identified indicate the same thing; and learning means for generating the judgment model by machine learning using the converted character string pair as teacher data.
- FIG. 1 is a block diagram showing the configuration of an information processing device according to exemplary Embodiment 1 of the present invention.
- FIG. 2 is a flow diagram showing the flow of a teacher data generation method, a name identification method, and a learning method according to exemplary Embodiment 1 of the present invention.
- FIG. 3 is an explanatory diagram of a determination system according to exemplary Embodiment 2 of the present invention.
- FIG. 4 is a block diagram showing the configuration of an information processing apparatus according to exemplary Embodiment 3 of the present invention.
- A flow diagram showing the flow of processing performed by the information processing apparatus during learning.
- A flow chart showing the flow of processing performed by the information processing apparatus when name identification is performed.
- A diagram showing an example of a computer that executes instructions of a program, which is software that implements each function of each device according to each exemplary embodiment of the present invention.
- FIG. 1 is a block diagram showing the configuration of information processing apparatuses 1-3.
- The information processing apparatus 1 includes a data acquisition unit (data acquisition means) 11 and a conversion pattern determination unit (conversion pattern determination means) 12.
- The data acquisition unit 11 acquires a data set including a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing. Then, the conversion pattern determination unit 12 determines, based on the results of trial conversions of the character string pairs included in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing.
- As described above, the information processing apparatus 1 adopts a configuration in which a data set including a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing is acquired, and a conversion pattern that increases the accuracy of determining whether the character string pairs included in the data set indicate the same thing is determined based on the results of trial conversions of the character string pairs included in the data set.
- The functions of the information processing apparatus 1 described above can also be realized by a program.
- The conversion pattern determination program according to this exemplary embodiment adopts a configuration that causes a computer to function as data acquisition means for acquiring a data set including a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing, and as conversion pattern determining means for determining, based on the results of trial conversions of the character string pairs included in the data set, a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing. Therefore, according to the conversion pattern determination program according to the present exemplary embodiment, correct name identification can be performed even for character strings whose notations are not similar.
- The information processing device 2 includes a conversion unit (conversion means) 21 and a determination unit (determination means) 22.
- The conversion unit 21 converts at least one of the two character strings forming a character string pair to be identified by sequentially applying a plurality of conversion rules. Then, the determination unit 22 determines whether or not the converted character string pair indicates the same thing.
- As described above, the information processing device 2 adopts a configuration in which at least one of the two character strings forming the character string pair to be identified is converted by sequentially applying a plurality of conversion rules, and it is determined whether or not the converted character string pair indicates the same thing.
- The name identification program according to this exemplary embodiment adopts a configuration that causes a computer to function as conversion means for converting at least one of two character strings forming a character string pair to be identified by sequentially applying a plurality of conversion rules, and as determination means for determining whether or not the converted character string pair indicates the same thing. Therefore, according to the name identification program according to the present exemplary embodiment, correct name identification can be performed even for character string pairs whose notations are not similar.
- The information processing device 3 includes a conversion unit (conversion means) 31 and a learning unit (learning means) 32.
- The conversion unit 31 converts, by sequentially applying a plurality of conversion rules, at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model for judging whether or not character string pairs to be identified indicate the same thing. Then, the learning unit 32 generates the judgment model by machine learning using the converted character string pair as teacher data.
- As described above, the information processing device 3 adopts a configuration in which at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model for judging whether or not character string pairs to be identified indicate the same thing is converted by sequentially applying a plurality of conversion rules, and the judgment model is generated by machine learning using the converted character string pair as teacher data.
- The learning program according to this exemplary embodiment adopts a configuration that causes a computer to function as conversion means for converting, by sequentially applying a plurality of conversion rules, at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model for determining whether or not character string pairs to be identified indicate the same thing, and as learning means for generating the judgment model by machine learning using the converted character string pairs as teacher data. Therefore, according to the learning program according to the present exemplary embodiment, it is possible to generate a judgment model capable of performing name identification with high accuracy on converted character string pairs.
- FIG. 2 is a flow diagram showing the flow of the teacher data generation method, name identification method, and learning method according to the first exemplary embodiment of the present invention.
- S11 and S12 indicate the conversion pattern determination method, S21 and S22 indicate the name identification method, and S31 and S32 indicate the learning method.
- At S11, at least one processor acquires a data set containing a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing.
- At S12, at least one processor determines, based on the results of trial conversions of the character string pairs included in the data set, a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing. This completes the conversion pattern determination method shown in FIG. 2.
- As described above, the conversion pattern determination method according to this exemplary embodiment adopts a configuration in which a data set including a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing is acquired, and a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing is determined based on the results of trial conversions of the character string pairs included in the data set.
- The execution subject of each step in this conversion pattern determination method may be a processor provided in the information processing device 1 or a processor provided in another device, and the steps may be executed by processors provided in different devices. This also applies to the name identification method and learning method described below.
- At S21, at least one processor converts at least one of the two character strings forming a character string pair to be identified by sequentially applying a plurality of conversion rules.
- At S22, at least one processor determines whether the converted character string pair indicates the same thing. This completes the name identification method shown in FIG. 2.
- As described above, the name identification method according to this exemplary embodiment adopts a configuration in which at least one of the two character strings forming a character string pair to be identified is converted by sequentially applying a plurality of conversion rules, and it is determined whether or not the converted character string pair indicates the same thing.
- At S31, at least one processor converts, by sequentially applying a plurality of conversion rules, at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model for determining whether or not character string pairs to be identified indicate the same thing.
- At S32, at least one processor generates the judgment model by machine learning using the converted character string pair as teacher data. This completes the learning method shown in FIG. 2.
- As described above, the learning method according to this exemplary embodiment adopts a configuration in which at least one of the two character strings forming a character string pair that constitutes teacher data for generating a judgment model for judging whether or not character string pairs to be identified indicate the same thing is converted by sequentially applying a plurality of conversion rules, and the judgment model is generated by machine learning using the converted character string pairs as teacher data.
- FIG. 3 is an illustration of a determination system 100 according to the exemplary embodiment.
- The judgment system 100 is a system for judging whether or not pairs of character strings to be collated indicate the same thing, and includes a conversion device (information processing device) 4 and a judgment device 5.
- The conversion device 4 determines a character string conversion pattern and converts character strings using the determined conversion pattern.
- The conversion device 4 includes a data acquisition unit (data acquisition means) 41, a conversion pattern determination unit (conversion pattern determination means) 42, and a conversion unit 43.
- the functions of these components are the same as those of data acquisition unit 11, conversion pattern determination unit 12, and conversion units 21 and 31 shown in FIG. 1, so description thereof will not be repeated here.
- the determination device 5 determines whether or not the pairs of character strings to be collated indicate the same thing.
- the determination device 5 also has a function of generating a determination model used for the determination.
- The determination device 5 includes a learning unit 51 and a determination unit 52. The functions of these components are the same as those of the learning unit 32 and the determination unit 22 shown in FIG. 1, so description thereof will not be repeated here.
- To determine whether or not character string pairs to be identified indicate the same thing, the determination system 100 first generates a determination model for that determination by machine learning using teacher data.
- The data acquisition unit 41 of the conversion device 4 acquires teacher data.
- The teacher data to be acquired is a data set that includes a plurality of character string pairs for which it is known whether or not the character strings of each pair indicate the same thing.
- In each item of teacher data, one character string of the pair is denoted $x_l$, the other is denoted $x_r$, and $y$ indicates whether or not those character strings indicate the same thing.
- The conversion pattern determination unit 42 determines, based on the results of trial conversions of the character string pairs included in the teacher data, a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the teacher data indicate the same thing. The details of the conversion pattern determination method will be described later.
- The conversion unit 43 converts each character string pair included in the teacher data using the conversion pattern determined by the conversion pattern determination unit 42.
- As a result, teacher data after conversion is generated.
- The generated teacher data after conversion is output to the determination device 5.
- The learning unit 51 performs machine learning using the converted teacher data acquired from the conversion device 4, and generates a determination model for determining whether or not character string pairs to be identified indicate the same thing. This completes the processing of the learning phase.
- the data acquisition unit 41 of the conversion device 4 acquires data to be identified.
- Data to be identified is data containing at least one character string pair for which it is desired to determine whether or not they indicate the same thing.
- Of the two character strings forming a character string pair included in the name identification target data, one can be expressed as $x_l$ and the other as $x_r$.
- The conversion unit 43 converts each character string pair included in the name identification target data using the conversion pattern determined by the conversion pattern determination unit 42 in the learning phase. As a result, name identification target data after conversion is generated. Then, the generated name identification target data after conversion is output to the determination device 5.
- The determination unit 52 uses the determination model generated in the learning phase to determine whether or not the character string pairs included in the converted name identification target data obtained from the conversion device 4 indicate the same thing. Then, the determination unit 52 outputs the determination result, that is, the name identification result. This completes the processing of the inference phase.
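The learning phase and the inference phase described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the method of the publication: the fixed conversion pattern in `convert`, the similarity feature from `difflib.SequenceMatcher`, the threshold "model" in `learn_threshold`, and the sample strings are all hypothetical stand-ins for the determined conversion pattern and the machine-learned determination model.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Pair = Tuple[str, str]


def convert(s: str) -> str:
    """Hypothetical fixed conversion pattern: initials of multi-word strings, then lower case."""
    words = s.split()
    if len(words) > 1:
        s = "".join(w[0] for w in words)
    return s.lower()


def score(pair: Pair) -> float:
    """Similarity of the pair after conversion (stand-in feature for the model)."""
    x_l, x_r = (convert(s) for s in pair)
    return SequenceMatcher(None, x_l, x_r).ratio()


def learn_threshold(teacher: List[Tuple[str, str, bool]]) -> float:
    """Learning phase: pick the similarity threshold with the best accuracy on teacher data."""
    scores = sorted({score((x_l, x_r)) for x_l, x_r, _ in teacher})
    # Candidate thresholds are midpoints between neighbouring distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])] or scores

    def accuracy(t: float) -> float:
        hits = sum((score((x_l, x_r)) >= t) == y for x_l, x_r, y in teacher)
        return hits / len(teacher)

    return max(candidates, key=accuracy)


def same_thing(pair: Pair, threshold: float) -> bool:
    """Inference phase: judge a converted pair with the learned threshold."""
    return score(pair) >= threshold


teacher_data = [
    ("Nikkei Stock Average", "NSA", True),
    ("Tokyo Stock Exchange", "TSE", True),
    ("Nikkei Stock Average", "TSE", False),
    ("Tokyo Stock Exchange", "NYSE", False),
]
threshold = learn_threshold(teacher_data)
print(same_thing(("New York Stock Exchange", "nyse"), threshold))
print(same_thing(("New York Stock Exchange", "NSA"), threshold))
```

The same `convert` function is applied in both phases, mirroring the point that the pattern determined during learning is reused on the name identification target data.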
- As the teacher data, character strings extracted from each target data table are paired, and correct-answer data indicating whether or not the character strings indicate the same thing is associated with each pair. Since the character strings used in the teacher data can be a part of the records included in the target data tables, the time and effort required to generate such teacher data is far less than when name identification is performed manually for all records of the target data tables.
- the conversion pattern determining unit 42 can determine effective conversion patterns for matching names between target data tables. Then, by converting other character string pairs extracted from the target data table using this conversion pattern, highly accurate name identification (unification of notation) between the target data tables becomes possible.
- a conversion pattern is determined to restore the character string before such replacement or omission.
- other records included in the target data table that have undergone replacement or omission as described above are returned to the character strings before replacement or omission using the determined conversion pattern.
- a conversion pattern for converting a character string may be a combination of a plurality of conversion rules in their order of application.
- a conversion rule is a rule for converting a character string into another character string.
- A conversion rule can be represented by a function (a mapping from the string space to the string space) that outputs a character string when a character string is input. For example, if a conversion rule is a function $f_1$, the character string obtained by converting a character string $x_l$ using this conversion rule is expressed as $f_1(x_l)$. The character string obtained by further converting the converted character string with another conversion rule (function $f_2$) is expressed as $f_2(f_1(x_l))$.
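The function view of conversion rules can be sketched directly in code. In this minimal example, the two rules `f1` and `f2` and the sample string are hypothetical choices made for illustration; the publication does not fix which rule each function denotes at this point.

```python
from typing import Callable

# A conversion rule maps a character string to a character string.
ConversionRule = Callable[[str], str]


def f1(s: str) -> str:
    """Hypothetical rule: extract the initial letter of each word."""
    return "".join(word[0] for word in s.split())


def f2(s: str) -> str:
    """Hypothetical rule: convert to lower case (a simple character-type conversion)."""
    return s.lower()


x_l = "Nikkei Stock Average"
print(f1(x_l))      # corresponds to f_1(x_l)
print(f2(f1(x_l)))  # corresponds to f_2(f_1(x_l))
```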
- Any conversion rule can be applied as long as it contributes to name identification.
- conversion of character type (e.g., conversion to hiragana, conversion to alphabet)
- extraction of initial letters
- conversion of Chinese numerals to Arabic numerals
- translation into other languages
- replacement of abbreviations
- replacement with specific symbols
- the translation may be performed using dictionary data or the like, or may be machine translation using a machine translation algorithm.
- the language to be translated into should be determined in advance.
- the replacement of abbreviations and the replacement of specific symbols may be performed using dictionary data or the like according to predetermined replacement rules.
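A dictionary-based replacement rule of the kind mentioned above might look like the following sketch. The table contents and the token-by-token matching strategy are assumptions for illustration; the publication only states that dictionary data and predetermined replacement rules may be used.

```python
from typing import Mapping

# Hypothetical dictionary data: abbreviation or symbol -> replacement.
REPLACEMENTS: Mapping[str, str] = {
    "Co.": "Company",
    "Ltd.": "Limited",
    "&": "and",
}


def replace_by_dictionary(s: str, table: Mapping[str, str] = REPLACEMENTS) -> str:
    """Replace each whitespace-separated token that appears in the dictionary."""
    return " ".join(table.get(token, token) for token in s.split())


print(replace_by_dictionary("Smith & Co. Ltd."))
```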
- The conversion pattern determined by the conversion pattern determination unit 42 may include at least one of the following conversion rules: translation into a character string of another language, extraction of initials, and conversion of character types.
- Each of these conversion rules is effective for converting a character string pair that indicates the same thing but whose notations are not similar into a character string pair whose notations are similar. Therefore, according to the above configuration, it is possible to improve the accuracy of name identification for character strings whose notations are not similar. For example, by translating into character strings of another language, it is possible to correctly merge character strings that indicate the same thing but are written in different languages and thus have dissimilar notations. The same applies to character type conversion. In addition, in records such as databases and data tables, character strings that combine the initials of multiple words are often used, so extracting the initials can be said to be one of the effective conversion rules.
- $f_1$: Extract initial letter
- $f_2$: Convert to hiragana
- $f_3$: Convert to alphabet
- The conversion rules are applied to $x_l$ in the order $f_1 \to f_2 \to f_3$.
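A conversion pattern of this form can be represented as an ordered list of rule functions. The sketch below is illustrative only: the katakana sample string, the toy romanization table `ROMAJI`, and the helper names are assumptions, and a real "convert to alphabet" rule would need a complete transliteration scheme.

```python
from functools import reduce
from typing import Callable, Sequence

ConversionRule = Callable[[str], str]


def extract_initials(s: str) -> str:
    """f_1: keep the first character of each whitespace-separated word."""
    return "".join(word[0] for word in s.split())


def to_hiragana(s: str) -> str:
    """f_2: map katakana (U+30A1..U+30F6) to hiragana by a fixed code point offset."""
    return "".join(chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c for c in s)


# Toy romanization table covering only the characters in the example below
# (hiragana ni, he, ka); this is not a complete transliteration scheme.
ROMAJI = {"\u306b": "ni", "\u3078": "he", "\u304b": "ka"}


def to_alphabet(s: str) -> str:
    """f_3: romanize hiragana with the toy table; unknown characters pass through."""
    return "".join(ROMAJI.get(c, c) for c in s)


def apply_pattern(pattern: Sequence[ConversionRule], s: str) -> str:
    """Apply the rules in order, so [f1, f2, f3] computes f3(f2(f1(s)))."""
    return reduce(lambda acc, rule: rule(acc), pattern, s)


pattern = [extract_initials, to_hiragana, to_alphabet]
x_l = "ニッケイ ヘイキン カブカ"  # hypothetical katakana input
print(apply_pattern(pattern, x_l))
```

Keeping the pattern as a plain list makes the order of application explicit, which matters because the same rules in a different order can produce a different string.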
- the conversion pattern determination unit 42 uses the teacher data acquired by the data acquisition unit 41 to determine a conversion pattern that can improve the determination accuracy in name identification.
- The conversion pattern determination unit 42 may, for each of a plurality of conversion patterns, try determining whether or not the character string pairs converted according to that conversion pattern indicate the same thing. Then, the conversion pattern determination unit 42 may determine the conversion pattern based on the result of evaluating the determination accuracy in each trial.
- With this configuration, the conversion pattern is determined based on the results of evaluating the determination accuracy for each of the conversion patterns, so a conversion pattern that enhances the determination accuracy can be determined with high accuracy.
- R N conversion patterns can be obtained by selecting and arranging N conversion rules from among the conversion rules. The conversion pattern determination unit 42 therefore only needs to convert each character string pair included in the teacher data according to each conversion pattern, determine whether or not the converted character string pairs indicate the same thing, and evaluate the determination accuracy.
- The method of determining whether or not the converted character string pairs indicate the same thing is not particularly limited: the determination may use a determination model generated by supervised learning, or a determination model based on unsupervised learning.
- The method of evaluating the determination accuracy is not particularly limited either. For example, the determination may be performed for all or part of the character string pairs included in the teacher data, and the percentage of correct answers may be used as the evaluation value. In this case, the conversion pattern determination unit 42 may determine the conversion pattern with the highest percentage of correct answers as the conversion pattern that can improve the determination accuracy.
- the conversion pattern determination unit 42 can determine a conversion pattern that can improve the determination accuracy in name identification for each character string pair included in the teacher data. As a result of the processing described above, a conversion pattern consisting of one conversion rule may be determined as the best conversion pattern. This also applies to Example 2 described below.
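The trial-and-evaluate procedure above can be sketched as an exhaustive search. The following is a minimal illustration under assumptions: the three rules, the toy teacher data, and the exact-match judge applied after converting only $x_l$ are hypothetical, and patterns are enumerated here as ordered selections without repetition.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

Rule = Callable[[str], str]
Example = Tuple[str, str, bool]  # (x_l, x_r, y)


def lower(s: str) -> str:
    return s.lower()


def initials(s: str) -> str:
    return "".join(w[0] for w in s.split())


def strip_spaces(s: str) -> str:
    return s.replace(" ", "")


def convert(pattern: Sequence[Rule], s: str) -> str:
    """Apply the rules of a pattern in order."""
    for rule in pattern:
        s = rule(s)
    return s


def accuracy(pattern: Sequence[Rule], data: List[Example]) -> float:
    """Share of pairs judged correctly; the judge is exact match after converting x_l."""
    hits = sum((convert(pattern, x_l) == x_r) == y for x_l, x_r, y in data)
    return hits / len(data)


def best_pattern(rules: Sequence[Rule], data: List[Example], max_len: int):
    """Try every ordered selection of up to max_len rules and keep the most accurate."""
    best, best_acc = (), accuracy((), data)
    for n in range(1, max_len + 1):
        for pattern in permutations(rules, n):
            acc = accuracy(pattern, data)
            if acc > best_acc:  # strict comparison keeps the first, shortest winner
                best, best_acc = pattern, acc
    return best, best_acc


teacher_data = [
    ("Nikkei Stock Average", "nsa", True),
    ("New York Stock Exchange", "nyse", True),
    ("Tokyo Stock Exchange", "nsa", False),
]
pattern, acc = best_pattern([lower, initials, strip_spaces], teacher_data, max_len=2)
print([rule.__name__ for rule in pattern], acc)
```

Because every candidate pattern is evaluated, the cost grows quickly with the number of rules and the pattern length, which is the motivation for the reinforcement learning variant described next.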
- the conversion pattern determination unit 42 may determine the conversion pattern by reinforcement learning with reward for determination accuracy when it is determined whether or not the character string pairs after conversion indicate the same thing. As a result, it is possible to determine with high accuracy a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set are the same. Moreover, there is also the advantage that the amount of calculation does not become enormous even when the number of conversion rules to be tried is large, compared to the case of evaluating the judgment accuracy for each of the conversion patterns.
- The "state" in the reinforcement learning may be the conversion rules selected so far and their order of application. The "action" in the reinforcement learning may be either to select a further conversion rule or to end the selection of conversion rules. As a result, a conversion pattern that increases the accuracy of determining whether or not each character string pair included in the teacher data indicates the same thing is determined based on the trial results of conversion for each character string pair included in the teacher data.
- For example, the state in which the conversion rules are applied in the order f3 → f1 → f9 is f9(f1(f3(xl))).
- In this state, the "action" that can be selected is either to select another conversion rule from among f1 to f20 or to end the selection of conversion rules.
- The "reward" is determined when the selection of conversion rules is completed. For example, when the selection is completed in the state f9(f1(f3(xl))), the determination accuracy obtained when conversion is performed with the conversion pattern f9(f1(f3(xl))) can be calculated, and the reward can be determined based on the calculated determination accuracy. By repeating such processing, it is possible to determine, for each character string pair included in the teacher data, a conversion pattern that can maximize the accuracy of determining whether or not the character string pair indicates the same thing.
- The method of calculating the determination accuracy is not particularly limited. For example, part of the teacher data is used as test data, each character string pair included in the test data is converted by the conversion pattern, and a predetermined determination method is used to determine whether or not the character string pairs after conversion indicate the same thing. Then, the rate of correct answers may be calculated from the determination results for the test data and used as the evaluation value of the determination accuracy.
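- The state, action, and reward described above can be sketched with tabular Q-learning. The text does not name a specific reinforcement learning algorithm, so Q-learning with epsilon-greedy exploration is an assumption made here, as are the three conversion rules, the exact-match determination, the length cap, and the sample data.

```python
import random

# Hypothetical conversion rules (assumptions for illustration only).
RULES = {
    "lower": str.lower,
    "strip_spaces": lambda s: s.replace(" ", ""),
    "drop_dots": lambda s: s.replace(".", ""),
}
STOP = "STOP"                 # action: end the selection of conversion rules
ACTIONS = list(RULES) + [STOP]
MAX_LEN = 3                   # assumed cap on the number of rules in a pattern

PAIRS = [
    ("N.E.C. Corp", "nec corp", True),
    ("Tokyo  Bank", "tokyobank", True),
    ("NEC", "NTT", False),
]


def accuracy(pattern, pairs):
    """Reward: rate of correct answers with exact match as the determination."""
    correct = 0
    for xl, xr, same in pairs:
        for name in pattern:
            xl, xr = RULES[name](xl), RULES[name](xr)
        correct += (xl == xr) == same
    return correct / len(pairs)


def train(pairs, episodes=300, epsilon=0.3, alpha=0.5, seed=0):
    """Learn Q-values; the state is the tuple of rules selected so far."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = ()
        while True:
            if len(state) >= MAX_LEN:
                action = STOP
            elif rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            old = q.get((state, action), 0.0)
            if action == STOP:
                # The reward is given only when the selection is completed.
                q[(state, action)] = old + alpha * (accuracy(state, pairs) - old)
                break
            nxt = state + (action,)
            best_next = max(q.get((nxt, a), 0.0) for a in ACTIONS)
            q[(state, action)] = old + alpha * (best_next - old)
            state = nxt
    return q


def greedy_pattern(q):
    """Read out a conversion pattern by following the highest Q-values."""
    state = ()
    while len(state) < MAX_LEN:
        action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
        if action == STOP:
            break
        state += (action,)
    return state


learned = greedy_pattern(train(PAIRS))
```

- Only the patterns visited during training are evaluated, so the number of accuracy calculations is bounded by the number of episodes rather than by the number of possible patterns.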
- FIG. 4 is a block diagram showing the configuration of the information processing device 6.
- The information processing device 6 includes a control unit 60 that centrally controls each unit of the information processing device 6 and a storage unit 61 that stores various data used by the information processing device 6.
- The information processing device 6 also includes an input unit 62 for receiving input to the information processing device 6 and an output unit 63 for the information processing device 6 to output information.
- The control unit 60 includes a data acquisition unit (data acquisition means) 601, a conversion pattern determination unit (conversion pattern determination means) 602, a conversion unit (conversion means) 603, a learning unit (learning means) 604, a conversion necessity determination unit 605, a first determination unit (determination means) 606, and a second determination unit 607.
- A conversion rule 611, a conversion pattern 612, and a determination model 613 are stored in the storage unit 61.
- the data acquisition unit 601 acquires data to be processed by the information processing device 6 . More specifically, the data acquisition unit 601 acquires teacher data used for determining the conversion pattern 612 and generating the judgment model 613 .
- This teacher data is a data set containing a plurality of character string pairs for which it is known whether or not the two character strings indicate the same thing.
- the data acquisition unit 601 also acquires data to be identified, that is, character string pairs for which it is unknown whether or not each character string indicates the same thing.
- Based on the trial results of conversion of the character string pairs included in the teacher data acquired by the data acquisition unit 601, the conversion pattern determination unit 602 determines a conversion pattern 612 that enhances the accuracy of determining whether or not the character string pairs included in the teacher data indicate the same thing. The method of determining the conversion pattern 612 is as described above, so the description will not be repeated here.
- the conversion unit 603 converts the character string pairs to be identified according to the conversion pattern 612 determined by the conversion pattern determination unit 602 .
- the learning unit 604 generates a determination model 613 that determines whether or not the character string pairs to be identified are the same by machine learning using the character string pairs after conversion by the conversion unit 603 as teacher data.
- the machine learning algorithm is not particularly limited as long as it can classify character string pairs into pairs indicating the same thing and pairs indicating different things.
- For example, the learning unit 604 may generate a determination model 613 based on logistic regression, random forest, SVM (Support Vector Machine), or a neural network. Further, the determination model 613 may take each character string constituting the character string pair as input data as it is, or may take as input data a feature amount calculated from each character string constituting the character string pair.
- each character string forming a character string pair may be represented by a vector, and a feature amount obtained by combining these vectors may be used as input data.
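- One way to build such a combined feature amount is sketched below. The character-count vectors, the fixed alphabet, and the combination by concatenation plus element-wise absolute difference are assumptions chosen for illustration; the text does not specify how the vectors are formed or combined.

```python
from collections import Counter


def char_vector(text, alphabet):
    """Represent a string as counts of each character in a fixed alphabet."""
    counts = Counter(text)
    return [counts[ch] for ch in alphabet]


def pair_features(xl, xr, alphabet):
    """Combine the two vectors: both vectors plus their absolute difference."""
    vl = char_vector(xl, alphabet)
    vr = char_vector(xr, alphabet)
    return vl + vr + [abs(a - b) for a, b in zip(vl, vr)]


features = pair_features("ab", "b", "ab")
```

- The resulting list could be passed to any classifier that accepts fixed-length numeric input.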
- The conversion necessity determination unit 605 determines whether or not the conversion unit 603 is to convert the character string pairs to be identified. This determination method is not particularly limited. For example, the conversion necessity determination unit 605 may let the user select whether or not to convert the character string pairs to be identified. In this case, the conversion necessity determination unit 605 may display the character string pair to be identified and the character string pairs used as teacher data on a display device (which may be provided in the information processing device 6 or may be a device external to the information processing device 6). The user may then decide whether or not to convert based on whether the character string pair to be identified and the character string pairs used as teacher data are similar combinations. For example, if both the character string pair to be identified and the character string pairs used as teacher data are combinations of a character string in Chinese characters (kanji) and a character string in uppercase alphabetic characters, the user may decide that conversion is to be performed and input that decision to the information processing device 6.
- Alternatively, the conversion necessity determination unit 605 may determine whether or not to convert using, for example, a determination model (a model generated by machine learning) that receives a character string pair as input and outputs data indicating whether or not to convert the character string pair. As another example, the conversion necessity determination unit 605 may decide to convert if the combination of character types of the character string pair to be identified is found among the character string pairs used as teacher data, and decide not to convert otherwise.
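- The character-type comparison can be sketched as follows. The set of character types, the use of Unicode character names to classify characters, and the sample strings are assumptions introduced for illustration.

```python
import unicodedata


def char_type(ch):
    """Classify one character into a coarse, hypothetical character type."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if ch.isascii() and ch.isupper():
        return "upper"
    if ch.isascii() and ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    return "other"


def type_combination(xl, xr):
    """Character types appearing in each string of the pair."""
    return (frozenset(map(char_type, xl)), frozenset(map(char_type, xr)))


def needs_conversion(pair, teacher_pairs):
    """Convert only if the same type combination occurs in the teacher data."""
    seen = {type_combination(xl, xr) for xl, xr in teacher_pairs}
    return type_combination(*pair) in seen


teacher = [("日本電気", "NEC")]
decision = needs_conversion(("東京銀行", "TKB"), teacher)
```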
- The first determination unit 606 determines whether or not the character string pairs converted by the conversion unit 603 (the converted character string pairs to be identified) indicate the same thing. More specifically, the first determination unit 606 inputs the converted character string pair to the determination model 613 and determines whether or not the character string pair indicates the same thing based on the output value of the determination model 613.
- the second determination unit 607 determines whether or not the character string pairs to be identified are the same. Second determination unit 607 differs from first determination unit 606 in that character string pairs that have not been converted by conversion unit 603 are subjected to determination.
- the determination method of the second determination unit 607 is not particularly limited. For example, the second determination unit 607 may calculate the degree of similarity of each character string forming the character string pair to be identified, and perform the determination based on the calculated degree of similarity. Further, for example, the second determination unit 607 may perform the determination using a determination model generated by machine learning similar to the determination model 613 (however, teacher data that has not been converted is used).
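- A similarity-based determination of this kind can be sketched as follows. The use of `difflib.SequenceMatcher` from the Python standard library and the threshold of 0.8 are assumptions; the text does not specify a similarity measure or a threshold.

```python
from difflib import SequenceMatcher


def similarity(xl, xr):
    """Degree of similarity between two strings, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, xl, xr).ratio()


def same_by_similarity(xl, xr, threshold=0.8):
    """Judge the pair as indicating the same thing if it is similar enough."""
    return similarity(xl, xr) >= threshold


close_pair = same_by_similarity("NEC Corporation", "NEC Corporatoin")
distant_pair = same_by_similarity("NEC", "日本電気")
```

- A pair such as a kanji name and its alphabetic abbreviation shares no characters, which is why unconverted pairs of that kind are hard to match by similarity alone.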
- the conversion rule 611 indicates the content of the conversion process, and is the basis of the conversion pattern 612.
- a conversion pattern 612 is composed of one or more conversion rules 611 .
- As the conversion rule 611, for example, the various conversion processes listed in the above-mentioned "example of conversion rule as the basis of conversion pattern" can be applied.
- The conversion pattern 612 is determined by the conversion pattern determination unit 602 and indicates the contents of the conversion processing to be performed on at least one of the character strings of a character string pair.
- a conversion pattern 612 can be determined by combining a plurality of conversion rules 611 in their order of application.
- the conversion pattern 612 may indicate, for example, a combination of conversion rules, the order of application thereof, and a conversion target (whether to convert xl or xr ).
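- A conversion pattern holding the rules, their order of application, and the conversion target could be represented as in the sketch below. The class name, the field names, and the target labels "xl", "xr", and "both" are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Rule = Callable[[str], str]


@dataclass(frozen=True)
class ConversionPattern:
    rules: Tuple[Rule, ...]   # conversion rules in their order of application
    target: str               # "xl", "xr", or "both" (hypothetical labels)

    def convert(self, text: str) -> str:
        """Apply every rule to one string, in order."""
        for rule in self.rules:
            text = rule(text)
        return text

    def convert_pair(self, xl: str, xr: str) -> Tuple[str, str]:
        """Convert only the string or strings named by the target."""
        new_xl = self.convert(xl) if self.target in ("xl", "both") else xl
        new_xr = self.convert(xr) if self.target in ("xr", "both") else xr
        return new_xl, new_xr


pattern = ConversionPattern(rules=(str.lower, str.strip), target="xl")
converted = pattern.convert_pair("  NEC ", "nec")
```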
- the judgment model 613 is for judging whether or not the character string pairs to be identified indicate the same thing, and is generated by the learning unit 604 . As described above, the determination model 613 is generated by learning using the converted teacher data, and uses input data obtained by converting character string pairs to be identified.
- As described above, the information processing device 6 includes a conversion unit 603 that converts character string pairs to be identified according to the conversion pattern determined by the conversion pattern determination unit 602, and a first determination unit 606 that determines whether the converted character string pairs indicate the same thing. Therefore, the information processing device 6 according to the present exemplary embodiment makes it possible to correctly perform name identification even for character strings whose notations are not similar.
- The information processing device 6 also includes the conversion unit 603, which converts character string pairs according to the conversion pattern determined by the conversion pattern determination unit 602, and a learning unit 604 that generates, by machine learning using the converted character string pairs as teacher data, a determination model 613 that determines whether or not character string pairs to be identified indicate the same thing. Therefore, in addition to the effects of the information processing device 1 according to the first exemplary embodiment, the information processing device 6 according to the present exemplary embodiment can generate a determination model 613 capable of highly accurate name identification of converted character string pairs.
- FIG. 5 is a flowchart showing the flow of processing performed by the information processing device 6 during learning.
- Of S61 to S64 shown in FIG. 5, S61 to S62 constitute the conversion pattern determination method, and S63 to S64 constitute the learning method.
- the processing of S61-S62 and the processing of S63-S64 do not necessarily have to be performed continuously.
- the data acquisition unit 601 acquires teacher data.
- This teacher data is a data set containing a plurality of character string pairs for which it is known whether or not the two character strings indicate the same thing. Any method can be used to acquire the teacher data.
- For example, the data acquisition unit 601 may acquire teacher data input by the user via the input unit 62, or may acquire teacher data recorded in a storage device or recording medium through wired or wireless communication.
- Based on the trial results of conversion of the character string pairs included in the teacher data acquired in S61, the conversion pattern determination unit 602 determines a conversion pattern that increases the accuracy of determining whether or not the character string pairs included in the teacher data indicate the same thing. Then, the conversion pattern determination unit 602 causes the storage unit 61 to store the determined conversion pattern. The conversion pattern stored in this way is the conversion pattern 612 shown in FIG. 4.
- the conversion pattern is generated by combining the conversion rules 611 stored in the storage unit 61.
- As the method of determining the conversion pattern, for example, the method described in the above-mentioned "Example 1 of the method of determining the conversion pattern" or "Example 2 of the method of determining the conversion pattern" can be applied.
- The conversion unit 603 applies the conversion pattern 612 determined in S62 to convert the teacher data acquired in S61. More specifically, the conversion unit 603 converts at least one of the two character strings forming each character string pair of the teacher data acquired in S61 by sequentially applying the plurality of conversion rules indicated in the conversion pattern 612, in the order indicated in the conversion pattern 612. A conversion pattern consisting of a single conversion rule may be determined in S62. In that case, in S63, conversion is performed by applying that single conversion rule.
- The learning unit 604 performs machine learning using each character string pair converted in S63 as teacher data, and generates a determination model for determining whether or not the character string pairs to be identified indicate the same thing. Then, the learning unit 604 causes the storage unit 61 to store the generated determination model.
- The determination model stored in this way is the determination model 613 shown in FIG. 4. With the above, the processing of FIG. 5 ends.
- a series of processes of acquiring teacher data (S61) and converting the acquired teacher data (S63) can be called a teacher data generation method.
- According to the method of generating teacher data according to the present exemplary embodiment, it is possible to generate teacher data for generating a determination model capable of highly accurate name identification of converted character string pairs.
- FIG. 6 is a flow chart showing the flow of processing performed by the information processing device 6 at the time of name identification.
- the data acquisition unit 601 acquires data to be identified.
- Data to be identified is a pair of character strings whose identity is unknown. Any method can be used to acquire the name identification target data.
- For example, the data acquisition unit 601 may acquire name identification target data input by a user via the input unit 62, or may acquire name identification target data recorded in a storage device or recording medium through wired or wireless communication.
- the conversion necessity determining unit 605 determines whether or not to convert the name identification target data acquired in S71. If it is determined to convert in S72 (YES in S72), the process proceeds to S74. On the other hand, if it is determined not to convert in S72 (NO in S72), the process proceeds to S73.
- The second determination unit 607 determines whether or not the character strings of the name identification target data acquired in S71 indicate the same thing.
- the second determination unit 607 determines whether or not the character string pairs of the name identification target data that have not been converted by the conversion unit 603 are the same. After the determination is completed, the process proceeds to S76.
- The conversion unit 603 applies the conversion pattern 612 determined in S62 of FIG. 5 to convert the name identification target data acquired in S71. More specifically, the conversion unit 603 converts at least one of the two character strings forming the character string pair of the name identification target data acquired in S71 by sequentially applying the plurality of conversion rules indicated in the conversion pattern 612, in the order indicated in the conversion pattern 612. If a conversion pattern consisting of a single conversion rule was determined in S62 of FIG. 5, conversion is performed in S74 by applying that single conversion rule.
- The first determination unit 606 uses the determination model 613 generated in S64 of FIG. 5 to determine whether or not the character string pairs of the name identification target data converted by the conversion unit 603 in S74 indicate the same thing. After the determination is completed, the process proceeds to S76.
- the determination result is output. Specifically, when the determination of S73 is performed, the second determination unit 607 causes the output unit 63 to output the determination result of S73. On the other hand, when the determination of S75 is made, the first determination unit 606 causes the output unit 63 to output the determination result of S75. Thus, the processing of FIG. 6 ends.
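- The branching of S72 to S76 can be summarized in a short sketch. The function and parameter names are hypothetical, and the callables passed in the usage example are stand-ins for the conversion necessity determination unit 605, the conversion unit 603, and the two determination units.

```python
def identify(pair, needs_conversion, convert, first_judge, second_judge):
    """Run the name identification flow for one character string pair."""
    if needs_conversion(pair):             # S72: decide whether to convert
        converted = convert(pair)          # S74: apply the conversion pattern
        result = first_judge(converted)    # S75: judge the converted pair
    else:
        result = second_judge(pair)        # S73: judge the unconverted pair
    return result                          # S76: output the determination


# Stand-in callables (assumptions for illustration only).
result = identify(
    ("NEC Corp", "nec corp"),
    needs_conversion=lambda p: p[0] != p[1],
    convert=lambda p: (p[0].lower(), p[1].lower()),
    first_judge=lambda p: p[0] == p[1],
    second_judge=lambda p: p[0] == p[1],
)
```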
- the information processing device 6 may perform a process of unifying the character strings constituting the name identification target data determined to indicate the same thing.
- the character strings may be unified by replacing one character string forming the name identification target data with the other character string.
- the character strings may be unified by replacing the two character strings that constitute the name identification target data with a higher-level character string that encompasses those character strings.
- the name identification method according to an aspect of the present invention may include unifying character strings that constitute name identification target data determined to indicate the same item. This is also the case for exemplary embodiments 1 and 2 described above.
- The conversion pattern determination unit 602 may determine a conversion pattern for one of the character strings that make up the character string pair, or may determine a conversion pattern for each of them. For example, if one character string of the pair is xl and the other is xr, the conversion pattern determination unit 602 may determine a conversion pattern for xl only or for xr only. Alternatively, it may determine both a conversion pattern for xl and a conversion pattern for xr.
- The conversion unit 603 may convert one or both of the character strings forming the name identification target data. If the conversion pattern 612 stored in the storage unit 61 does not specify a conversion target (which of xl and xr is to be converted), the conversion unit 603 determines the character string to be converted. This process is performed between S72 and S74 in FIG. 6.
- the method of determining the character string to be converted is not particularly limited.
- the conversion unit 603 may allow the user to select a character string to be converted.
- In this case, the conversion unit 603 may display the character string pair to be identified and the conversion pattern 612 on a display device (which may be included in the information processing device 6 or may be a device external to the information processing device 6).
- the user may select such that the character string to be identified is converted by the conversion pattern 612 that is considered to be effective for the character string.
- The conversion unit 603 may also determine the character string to be converted without relying on the user's selection. For example, the conversion unit 603 may determine, as the conversion target of the conversion pattern 612, a character string that can be converted by the conversion rule applied first among the conversion rules indicated by the conversion pattern 612. For example, if the character string pair is a combination of a kanji character string and an alphabetic character string, and the first conversion rule indicated by the conversion pattern 612 is hiragana conversion, the conversion unit 603 determines the kanji character string as the conversion target of the conversion pattern 612.
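- This selection can be sketched as follows, assuming each conversion rule comes with a predicate that reports whether it can convert a given string. The predicate `has_kanji`, which stands in for the applicability check of a hiragana conversion rule, and the return labels are hypothetical.

```python
def has_kanji(text):
    """Assumed applicability check for a hiragana conversion rule."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def choose_target(xl, xr, first_rule_applies):
    """Pick the string or strings that the first rule can convert."""
    left, right = first_rule_applies(xl), first_rule_applies(xr)
    if left and right:
        return "both"
    if left:
        return "xl"
    if right:
        return "xr"
    return None


target = choose_target("東京", "TOKYO", has_kanji)
```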
- Some or all of the functions of the information processing devices 1 to 3, the conversion device 4, the determination device 5, and the information processing device 6 may be implemented by hardware such as integrated circuits (IC chips), or may be implemented by software.
- In the latter case, the device is implemented, for example, by a computer that executes the instructions of a program, which is software that implements each function.
- An example of such a computer (hereinafter referred to as computer C) is shown in FIG.
- Computer C comprises at least one processor C1 and at least one memory C2.
- a program P for operating the computer C as the device is recorded in the memory C2.
- the processor C1 reads the program P from the memory C2 and executes it to realize each function of the device.
- As the processor C1, for example, a CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used.
- As the memory C2, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination thereof can be used.
- the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data.
- Computer C may further include a communication interface for transmitting and receiving data to and from other devices.
- Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
- The program P can be recorded on a non-transitory tangible recording medium M that is readable by the computer C.
- As the recording medium M, for example, a tape, disk, card, semiconductor memory, programmable logic circuit, or the like can be used.
- the computer C can acquire the program P via such a recording medium M.
- the program P can be transmitted via a transmission medium.
- As the transmission medium, for example, a communication network or broadcast waves can be used.
- Computer C can also acquire program P via such a transmission medium.
- (Appendix 1) Data acquisition means for acquiring a data set including a plurality of character string pairs in which it is known whether or not each character string indicates the same thing, and based on the trial result of conversion for the character string pairs included in the data set, An information processing apparatus comprising conversion pattern determination means for determining a conversion pattern that enhances the accuracy of determining whether or not character string pairs included in a data set are the same. According to this configuration, it is possible to correctly merge even character strings whose notations are not similar.
- The information processing apparatus according to any one of appendices 1 to 3, wherein the conversion pattern is a combination of a plurality of conversion rules in their order of application, and the conversion pattern determination means performs, for each of a plurality of different conversion patterns, a trial of determining whether or not the character string pairs converted according to that conversion pattern indicate the same thing, and determines the conversion pattern based on the evaluation results of determination accuracy in each trial. According to this configuration, it is possible to determine with high accuracy a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set are the same.
- (Appendix 5) The information processing apparatus according to any one of appendices 1 to 3, wherein the conversion pattern determination means determines the conversion pattern by reinforcement learning in which the reward is the determination accuracy obtained when determining whether or not the character string pairs after conversion indicate the same thing.
- An information processing apparatus comprising: conversion means for converting by sequentially applying a plurality of conversion rules to at least one of two character strings forming a character string pair to be identified; and determination means for determining whether or not the character string pair after the conversion indicates the same thing. According to this configuration, it is possible to correctly perform name identification even for character strings whose notations are not similar, and also for character string pairs that do not become similar character strings after a single conversion using a single conversion rule.
- An information processing apparatus comprising: conversion means for converting by sequentially applying a plurality of conversion rules to at least one of the two character strings forming a character string pair that constitutes teacher data for generating a determination model that determines whether or not a character string pair to be identified indicates the same thing; and learning means for generating the determination model by machine learning using the character string pairs after the conversion as teacher data. According to this configuration, it is possible to generate a determination model capable of highly accurate name identification of character string pairs after conversion. By using this determination model, it is possible to correctly perform name identification even for character strings whose notations are not similar.
- A conversion pattern determination method comprising: at least one processor acquiring a data set containing a plurality of character string pairs for which it is known whether or not the character strings indicate the same thing; and determining, based on the trial results of conversion for the character string pairs included in the data set, a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing. According to this configuration, it is possible to correctly perform name identification even for character strings whose notations are not similar.
- A name identification method comprising: at least one processor converting by sequentially applying a plurality of conversion rules to at least one of two character strings forming a character string pair to be identified; and determining whether or not the character string pair after the conversion indicates the same thing. According to this configuration, it is possible to correctly perform name identification even for character strings whose notations are not similar, and also for character string pairs that do not become similar character strings after a single conversion using a single conversion rule.
- A learning method comprising: at least one processor converting by sequentially applying a plurality of conversion rules to at least one of the two character strings forming a character string pair that constitutes teacher data for generating a determination model for determining whether or not a character string pair to be identified indicates the same thing; and generating the determination model by machine learning using the character string pairs after the conversion as teacher data.
- According to this configuration, it is possible to generate a determination model capable of highly accurate name identification of character string pairs after conversion.
- (Appendix 12) A conversion pattern determination program causing a computer to function as: data acquisition means for acquiring a data set containing a plurality of character string pairs for which it is known whether or not the character strings indicate the same thing; and conversion pattern determination means for determining, based on the trial results of conversion for the character string pairs included in the data set, a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing. According to this configuration, it is possible to correctly perform name identification even for character strings whose notations are not similar.
- A name identification program causing a computer to function as: conversion means for converting by sequentially applying a plurality of conversion rules to at least one of two character strings forming a character string pair to be identified; and determination means for determining whether or not the character string pair after the conversion indicates the same thing. According to this configuration, it is possible to correctly perform name identification even for character strings whose notations are not similar, and also for character string pairs that do not become similar character strings after a single conversion using a single conversion rule.
- An information processing apparatus comprising at least one processor, the processor executing: a process of acquiring a data set containing a plurality of character string pairs for which it is known whether or not the character strings indicate the same thing; and a process of determining, based on the trial results of conversion for the character string pairs included in the data set, a conversion pattern that enhances the accuracy of determining whether or not the character string pairs included in the data set indicate the same thing.
- The information processing apparatus may further include a memory, and the memory may store a program for causing the processor to execute the process of acquiring the data set and the process of determining the conversion pattern.
- This program may be recorded on a computer-readable non-transitory tangible recording medium.
- An information processing apparatus comprising at least one processor, the processor executing: a process of converting by sequentially applying a plurality of conversion rules to at least one of two character strings forming a character string pair to be identified; and a process of determining whether or not the character string pair after the conversion indicates the same thing.
- The information processing apparatus may further include a memory, and the memory may store a program for causing the processor to execute the converting process and the determining process.
- This program may be recorded on a computer-readable non-transitory tangible recording medium.
- An information processing apparatus comprising at least one processor, the processor executing: a process of converting by sequentially applying a plurality of conversion rules to at least one of the two character strings forming a character string pair that constitutes teacher data for generating a determination model for determining whether or not character string pairs to be identified indicate the same thing; and a process of generating the determination model by machine learning using the character string pairs after the conversion as teacher data.
- The information processing apparatus may further include a memory, and the memory may store a program for causing the processor to execute the converting process and the generating process.
- This program may be recorded on a computer-readable non-transitory tangible recording medium.
- 1 information processing device
- 11 data acquisition unit (data acquisition means)
- 12 conversion pattern determination unit (conversion pattern determination means)
- 2 information processing device
- 21 conversion unit (conversion means)
- 22 determination unit (determination means)
- 3 information processing device
- 31 conversion unit (conversion means)
- 32 learning unit (learning means)
- 4 conversion device (information processing device)
- 41 data acquisition unit (data acquisition means)
- 42 conversion pattern determination unit (conversion pattern determination means)
- 6 information processing device
- 601 data acquisition unit (data acquisition means)
- 602 conversion pattern determination unit (conversion pattern determination means)
- 603 conversion unit (conversion means)
- 604 learning unit (learning means)
- 606 first determination unit (determination means)
Description
本発明の第1の例示的実施形態について、図面を参照して詳細に説明する。本例示的実施形態は、後述する例示的実施形態の基本となる形態である。まず、本例示的実施形態に係る情報処理装置1~3の構成について、図1を参照して説明する。図1は、情報処理装置1~3の構成を示すブロック図である。
情報処理装置1は、データ取得部(データ取得手段)11と変換パターン決定部(変換パターン決定手段)12を備えている。データ取得部11は、各文字列が同じものを示すか否かが既知である複数の文字列ペアを含むデータセットを取得する。そして、変換パターン決定部12は、前記データセットに含まれる文字列ペアに対する変換の試行結果に基づき、前記データセットに含まれる文字列ペアが同じものを示すか否かの判定精度を高める変換パターンを決定する。
上述の情報処理装置1の機能は、プログラムによって実現することもできる。本例示的実施形態に係る変換パターン決定プログラムは、コンピュータを、各文字列が同じものを示すか否かが既知である複数の文字列ペアを含むデータセットを取得するデータ取得手段、および、前記データセットに含まれる文字列ペアに対する変換の試行結果に基づき、前記データセットに含まれる文字列ペアが同じものを示すか否かの判定精度を高める変換パターンを決定する変換パターン決定手段として機能させる、という構成が採用されている。このため、本例示的実施形態に係る変換パターン決定プログラムによれば、表記が類似していない文字列についても正しく名寄せすることが可能になるという効果が得られる。
情報処理装置2は、変換部(変換手段)21と判定部(判定手段)22を備えている。変換部21は、名寄せ対象の文字列ペアを構成する2つの文字列の少なくとも一方に対して複数の変換規則を順次適用して変換する。そして、判定部22は、前記変換後の文字列ペアが同じものを示すか否かを判定する。
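The functions of the information processing device 2 described above can also be implemented by a program. The name identification program according to this example embodiment causes a computer to function as: conversion means that converts at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules; and determination means that determines whether or not the converted character string pair indicates the same thing. The name identification program according to this example embodiment therefore makes it possible to perform name identification correctly even for character strings whose notations are not similar, and also for character string pairs that do not become similar after a single conversion with a single conversion rule.
The information processing device 3 includes a conversion unit (conversion means) 31 and a learning unit (learning means) 32. The conversion unit 31 converts, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in the training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing. The learning unit 32 then generates the determination model by machine learning that uses the converted character string pairs as training data.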
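To make the division of labor between the conversion unit 31 and the learning unit 32 concrete, the sketch below converts labeled pairs first and then fits a model on the converted pairs. It is only an illustration: a single similarity threshold chosen by grid search stands in for the determination model, which the source leaves unspecified, and the names `convert`, `similarity`, `fit_threshold`, the rules, and the sample pairs are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple

Rule = Callable[[str], str]
LabeledPair = Tuple[str, str, bool]  # (xl, xr, True if both indicate the same thing)


def lower(s: str) -> str:
    return s.lower()


def strip_dot(s: str) -> str:
    return s.rstrip(".")


def convert(text: str, rules: Sequence[Rule]) -> str:
    """Sequentially apply the conversion rules."""
    for rule in rules:
        text = rule(text)
    return text


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def fit_threshold(pairs: List[LabeledPair], rules: Sequence[Rule]) -> float:
    """Stand-in for machine learning: pick the threshold with the best training accuracy."""
    scored = [(similarity(convert(l, rules), convert(r, rules)), y) for l, r, y in pairs]

    def accuracy(t: float) -> float:
        return sum((s >= t) == y for s, y in scored) / len(scored)

    return max((i / 20 for i in range(21)), key=accuracy)


train = [
    ("ACME Corp.", "acme corp", True),
    ("Globex", "GLOBEX", True),
    ("ACME Corp.", "Initech", False),
]
rules = [lower, strip_dot]
threshold = fit_threshold(train, rules)
print(threshold)
```

Because the model only ever sees converted strings, the same rules would have to be applied to the target pairs at inference time.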
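The functions of the information processing device 3 described above can also be implemented by a program. The learning program according to this example embodiment causes a computer to function as: conversion means that converts, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in the training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing; and learning means that generates the determination model by machine learning that uses the converted character string pairs as training data. The learning program according to this example embodiment therefore makes it possible to generate a determination model that can perform name identification of converted character string pairs with high accuracy. Using this determination model makes it possible to perform name identification correctly even for character strings whose notations are not similar, and also for character string pairs that do not become similar after a single conversion with a single conversion rule.
FIG. 2 is a flow diagram showing the flow of the conversion pattern determination method, the name identification method, and the learning method according to the first example embodiment of the present invention. S11 and S12 show the conversion pattern determination method, S21 and S22 show the name identification method, and S31 and S32 show the learning method.
In S11, at least one processor acquires a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing.
In S21, at least one processor converts at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules.
In S31, at least one processor converts, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in the training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing.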
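A second example embodiment of the present invention is described with reference to FIG. 3. FIG. 3 is an explanatory diagram of a determination system 100 according to this example embodiment. The determination system 100 determines whether or not a pair of character strings subject to name identification indicates the same thing, and includes a conversion device (information processing device) 4 and a determination device 5.
To determine whether or not a pair of character strings subject to name identification indicates the same thing, the determination system 100 first generates a determination model for that determination by machine learning using training data. In the learning phase, the data acquisition unit 41 of the conversion device 4 first acquires the training data. The acquired training data is a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing.
In the inference phase, the data acquisition unit 41 of the conversion device 4 acquires name identification target data. The name identification target data contains at least one character string pair for which it is to be determined whether or not the strings indicate the same thing. As with the training data described above, each character string pair in the name identification target data can be expressed with one string as xl and the other as xr.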
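As a rough illustration of the two kinds of data, the following sketch represents a character string pair as a small record whose label is present in training data and absent in name identification target data. The class name `StringPair`, the field name `same`, and the sample strings are assumptions made for this example and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StringPair:
    xl: str                      # one string of the pair
    xr: str                      # the other string of the pair
    same: Optional[bool] = None  # known for training data, None for target data


# Learning phase: every pair carries a known label.
training_data: List[StringPair] = [
    StringPair("ACME Corp.", "acme corp", same=True),
    StringPair("ACME Corp.", "Initech", same=False),
]

# Inference phase: the label is what the system is asked to determine.
target_data: List[StringPair] = [StringPair("Globex Inc", "globex")]

labeled = [pair for pair in training_data if pair.same is not None]
print(len(labeled), target_data[0].same)
```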
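For example, suppose there are two target data tables, each consisting of a plurality of records, and the goal is to unify the notation of records that indicate the same thing but are written differently in the two tables. Each target data table contains a large number of records, so manual name identification requires a great deal of time and effort.
A conversion pattern for converting character strings may be a combination of a plurality of conversion rules arranged in the order in which they are applied. A conversion rule is a rule that converts one character string into another. It can be expressed as a function that takes a character string as input and outputs a character string (a mapping from the space of character strings to itself). For example, if a conversion rule is represented by a function f1, the character string obtained by converting the character string xl with this rule is written f1(xl). The character string obtained by further converting that result with another conversion rule (function f2) is written f2(f1(xl)).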
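The view of a conversion pattern as a composition of string-to-string functions can be sketched as follows. The helper `compose_pattern` and the two sample rules are hypothetical and serve only to show that a pattern built from f1 and then f2 yields f2(f1(x)).

```python
from functools import reduce
from typing import Callable

Rule = Callable[[str], str]


def compose_pattern(*rules: Rule) -> Rule:
    """Build one function that applies the given rules from left to right."""
    def pattern(text: str) -> str:
        return reduce(lambda acc, rule: rule(acc), rules, text)
    return pattern


def f1(s: str) -> str:
    """Hypothetical rule: remove surrounding whitespace."""
    return s.strip()


def f2(s: str) -> str:
    """Hypothetical rule: convert to upper case."""
    return s.upper()


pattern = compose_pattern(f1, f2)  # corresponds to f2(f1(x))
print(pattern("  nkk "))
```

Treating the pattern itself as a function means a multi-rule pattern can be used anywhere a single rule is accepted.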
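f1: extract the initial character
f2: convert to hiragana
f3: convert to alphabetic (Latin) characters
Suppose the conversion rules are applied to xl in the order f1→f2→f3. Then f1(xl) = “日”, f2(f1(xl)) = “にち”, and f3(f2(f1(xl))) = “Nichi”. The resulting string “Nichi” can hardly be called similar to xr = “NKK”, so the conversion pattern f1→f2→f3 is hardly effective for name identification of xl = “日経” and xr = “NKK”.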
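The chain above can be reproduced with a toy implementation. The two lookup tables are assumptions that cover only this one example; a real system would need a reading dictionary and a romanization routine, neither of which the source describes. The similarity score computed with `difflib.SequenceMatcher` is likewise just one possible way to quantify "not similar".

```python
from difflib import SequenceMatcher

# Toy lookup tables (assumptions covering only this example).
KANJI_TO_HIRAGANA = {"日": "にち"}
HIRAGANA_TO_LATIN = {"にち": "Nichi"}


def f1(s: str) -> str:
    """Extract the initial character."""
    return s[:1]


def f2(s: str) -> str:
    """Convert to hiragana (toy table lookup; unknown input is returned unchanged)."""
    return KANJI_TO_HIRAGANA.get(s, s)


def f3(s: str) -> str:
    """Convert to Latin characters (toy table lookup; unknown input is returned unchanged)."""
    return HIRAGANA_TO_LATIN.get(s, s)


xl, xr = "日経", "NKK"
step1 = f1(xl)
step2 = f2(step1)
step3 = f3(step2)
score = SequenceMatcher(None, step3, xr).ratio()
print(step1, step2, step3, score)
```

Only the leading "N" is shared between the converted string and xr, so the score stays low, which matches the conclusion that this order of rules does not help for this pair.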
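As described above, the order in which the conversion rules are applied affects the accuracy of name identification. The conversion pattern determination unit 42 therefore uses the training data acquired by the data acquisition unit 41 to determine a conversion pattern that can increase the determination accuracy in name identification.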
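One straightforward way to pick such a pattern is to try every ordered combination of rules on the labeled pairs and keep the most accurate one. The sketch below does this by brute force. The two rules, the labeled pairs, the threshold of 0.8, and the similarity-based determination are all assumptions chosen so that the order of application visibly matters.

```python
from difflib import SequenceMatcher
from itertools import permutations


def lower(s: str) -> str:
    return s.lower()


def drop_suffix(s: str) -> str:
    """Hypothetical rule: remove a trailing ' inc' (lower case only)."""
    return s[:-4] if s.endswith(" inc") else s


def accuracy(pairs, pattern, threshold=0.8):
    """Fraction of pairs judged correctly after applying the pattern to both strings."""
    correct = 0
    for xl, xr, same in pairs:
        for rule in pattern:
            xl, xr = rule(xl), rule(xr)
        predicted = SequenceMatcher(None, xl, xr).ratio() >= threshold
        correct += predicted == same
    return correct / len(pairs)


def best_pattern(pairs, rules):
    """Evaluate every ordered selection of rules and return the most accurate one."""
    candidates = [p for n in range(len(rules) + 1) for p in permutations(rules, n)]
    return max(candidates, key=lambda p: accuracy(pairs, p))


pairs = [("ACME INC", "acme", True), ("ACME INC", "apex", False)]
best = best_pattern(pairs, [drop_suffix, lower])
print([rule.__name__ for rule in best], accuracy(pairs, best))
```

Here lowercasing must come before the suffix removal for the suffix rule to fire, so only that order classifies both pairs correctly. The number of candidates grows factorially with the number of rules, which is the cost that exhaustive evaluation carries.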
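The conversion pattern determination unit 42 may determine the conversion pattern by reinforcement learning that uses, as the reward, the accuracy of determining whether or not the converted character string pairs indicate the same thing. This makes it possible to determine, with high reliability, a conversion pattern that increases the accuracy of determining whether or not the character string pairs in the data set indicate the same thing. It also has the advantage that, compared with evaluating the determination accuracy of each conversion pattern individually, the amount of computation does not become enormous even when many conversion rules are candidates for trial.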
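The source does not name a particular reinforcement learning algorithm. As one possible illustration, the sketch below uses tabular Q-learning with optimistic initial values: a state is the tuple of rules applied so far, an action either appends an unused rule or stops, and the reward on stopping is the determination accuracy of the resulting pattern. The rules, the labeled pairs, the threshold, and the similarity-based determination are assumptions for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def lower(s: str) -> str:
    return s.lower()


def drop_suffix(s: str) -> str:
    """Hypothetical rule: remove a trailing ' inc' (lower case only)."""
    return s[:-4] if s.endswith(" inc") else s


def accuracy(pairs, pattern, threshold=0.8):
    """Reward: fraction of pairs judged correctly after conversion."""
    correct = 0
    for xl, xr, same in pairs:
        for rule in pattern:
            xl, xr = rule(xl), rule(xr)
        correct += (SequenceMatcher(None, xl, xr).ratio() >= threshold) == same
    return correct / len(pairs)


STOP = None  # sentinel action: stop adding rules and score the pattern


def actions(state, rules):
    return [r for r in rules if r not in state] + [STOP]


def learn_pattern(pairs, rules, episodes=20):
    q = defaultdict(lambda: 1.0)  # optimistic start: accuracy never exceeds 1.0

    def greedy(state):
        return max(actions(state, rules), key=lambda a: q[(state, a)])

    for _ in range(episodes):
        state = ()
        while True:
            action = greedy(state)
            if action is STOP:
                q[(state, action)] = accuracy(pairs, state)  # terminal reward
                break
            nxt = state + (action,)
            # Deterministic transitions, so the value is the best follow-up value.
            q[(state, action)] = max(q[(nxt, a)] for a in actions(nxt, rules))
            state = nxt

    # Read out the pattern by following the learned greedy policy.
    state = ()
    while True:
        action = greedy(state)
        if action is STOP:
            return state
        state = state + (action,)


pairs = [("ACME INC", "acme", True), ("ACME INC", "apex", False)]
learned = learn_pattern(pairs, [drop_suffix, lower])
print([rule.__name__ for rule in learned])
```

In this small run the learner scores only the patterns it actually reaches, so some candidate patterns are never evaluated. With only two rules the saving is negligible; the sketch is meant to show the mechanism rather than a measured benefit.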
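(Configuration of information processing device 6)
The configuration of the information processing device 6 according to this example embodiment is described with reference to FIG. 4. FIG. 4 is a block diagram showing the configuration of the information processing device 6. As illustrated, the information processing device 6 includes a control unit 60 that performs overall control of the units of the information processing device 6, and a storage unit 61 that stores various data used by the information processing device 6. The information processing device 6 also includes an input unit 62 that accepts input to the information processing device 6, and an output unit 63 through which the information processing device 6 outputs information.
The flow of processing that the information processing device 6 according to this example embodiment performs during learning is described with reference to FIG. 5. FIG. 5 is a flow diagram showing that flow. Of S61 to S64 shown in FIG. 5, S61 and S62 constitute the conversion pattern determination method, and S63 and S64 constitute the learning method. The processing of S61 and S62 and the processing of S63 and S64 do not necessarily have to be performed in succession.
The flow of processing that the information processing device 6 according to this example embodiment performs during name identification (the name identification method) is described with reference to FIG. 6. FIG. 6 is a flow diagram showing that flow.
In S62 of FIG. 5, the conversion pattern determination unit 602 may determine a conversion pattern for one of the character strings constituting a character string pair, or may determine a conversion pattern for each of them. For example, when one string of the pair is xl and the other is xr, the conversion pattern determination unit 602 may determine a conversion pattern for xl only or for xr only. It may also determine both a conversion pattern for xl and a conversion pattern for xr.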
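The three options (a pattern for xl only, for xr only, or for both) can be expressed with a single helper that takes an independent pattern for each side. The function name `convert_pair`, its parameter names, and the sample rules are assumptions for this sketch.

```python
from typing import Callable, Sequence, Tuple

Rule = Callable[[str], str]


def convert_pair(
    xl: str,
    xr: str,
    left_pattern: Sequence[Rule] = (),
    right_pattern: Sequence[Rule] = (),
) -> Tuple[str, str]:
    """Apply one pattern to xl and another to xr; an empty pattern leaves that side unchanged."""
    for rule in left_pattern:
        xl = rule(xl)
    for rule in right_pattern:
        xr = rule(xr)
    return xl, xr


# Pattern for xl only.
print(convert_pair("ACME Corp", "acme corp", left_pattern=(str.lower,)))
# Different patterns for xl and xr.
print(convert_pair(" ACME ", "acme", left_pattern=(str.strip,), right_pattern=(str.upper,)))
```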
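Some or all of the functions of the information processing devices 1 to 3, the conversion device 4, the determination device 5, and the information processing device 6 (hereinafter referred to as "the device") may be implemented by hardware such as an integrated circuit (IC chip), or by software.
The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the embodiments described above are also included in the technical scope of the present invention.
Some or all of the embodiments described above can also be described as follows. However, the present invention is not limited to the aspects described below.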
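(Supplementary Note 1)
An information processing device comprising: data acquisition means that acquires a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing; and conversion pattern determination means that determines, based on results of trial conversions of the character string pairs contained in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing. With this configuration, name identification can be performed correctly even for character strings whose notations are not similar.
(Supplementary Note 2)
The information processing device according to Supplementary Note 1, comprising: conversion means that converts a character string pair subject to name identification in accordance with the conversion pattern determined by the conversion pattern determination means; and determination means that determines whether or not the character string pair converted by the conversion means indicates the same thing. With this configuration, name identification can be performed correctly even for character strings whose notations are not similar.
(Supplementary Note 3)
The information processing device according to Supplementary Note 1, comprising: conversion means that converts the character string pairs contained in the data set in accordance with the conversion pattern determined by the conversion pattern determination means; and learning means that generates, by machine learning using the converted character string pairs as training data, a determination model that determines whether or not a character string pair subject to name identification indicates the same thing. With this configuration, a determination model that can perform name identification of converted character string pairs with high accuracy can be generated.
(Supplementary Note 4)
The information processing device according to any one of Supplementary Notes 1 to 3, wherein the conversion pattern is a combination of a plurality of conversion rules arranged in the order in which they are applied, and the conversion pattern determination means performs, for each of a plurality of mutually different conversion patterns, a trial of determining whether or not the character string pairs converted in accordance with that conversion pattern indicate the same thing, and determines the conversion pattern based on the evaluation results of the determination accuracy in each trial. With this configuration, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing can be determined with high reliability.
(Supplementary Note 5)
The information processing device according to any one of Supplementary Notes 1 to 3, wherein the conversion pattern determination means determines the conversion pattern by reinforcement learning that uses, as the reward, the accuracy of determining whether or not the converted character string pairs indicate the same thing. With this configuration, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing can be determined with high reliability. There is also the advantage that, compared with evaluating the determination accuracy of each conversion pattern individually, the amount of computation does not become enormous even when many conversion rules are candidates for trial.
(Supplementary Note 6)
The information processing device according to any one of Supplementary Notes 1 to 5, wherein the conversion pattern includes at least one of the following conversion rules: translation into a character string in another language, extraction of initial characters, and conversion of character type. With this configuration, the accuracy of name identification for character strings whose notations are not similar can be increased.
(Supplementary Note 7)
An information processing device comprising: conversion means that converts at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules; and determination means that determines whether or not the converted character string pair indicates the same thing. With this configuration, name identification can be performed correctly even for character strings whose notations are not similar, and also for character string pairs that do not become similar after a single conversion with a single conversion rule.
(Supplementary Note 8)
An information processing device comprising: conversion means that converts, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing; and learning means that generates the determination model by machine learning using the converted character string pairs as training data. With this configuration, a determination model that can perform name identification of converted character string pairs with high accuracy can be generated. Using this determination model makes it possible to perform name identification correctly even for character strings whose notations are not similar.
(Supplementary Note 9)
A conversion pattern determination method comprising, by at least one processor: acquiring a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing; and determining, based on results of trial conversions of the character string pairs contained in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing. With this configuration, name identification can be performed correctly even for character strings whose notations are not similar.
(Supplementary Note 10)
A name identification method comprising, by at least one processor: converting at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules; and determining whether or not the converted character string pair indicates the same thing. With this configuration, name identification can be performed correctly even for character strings whose notations are not similar, and also for character string pairs that do not become similar after a single conversion with a single conversion rule.
(Supplementary Note 11)
A learning method comprising, by at least one processor: converting, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing; and generating the determination model by machine learning using the converted character string pairs as training data. With this configuration, a determination model that can perform name identification of converted character string pairs with high accuracy can be generated. Using this determination model makes it possible to perform name identification correctly even for character strings whose notations are not similar.
(Supplementary Note 12)
A conversion pattern determination program that causes a computer to function as: data acquisition means that acquires a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing; and conversion pattern determination means that determines, based on results of trial conversions of the character string pairs contained in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing. With this configuration, name identification can be performed correctly even for character strings whose notations are not similar.
(Supplementary Note 13)
A name identification program that causes a computer to function as: conversion means that converts at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules; and determination means that determines whether or not the converted character string pair indicates the same thing. With this configuration, name identification can be performed correctly even for character strings whose notations are not similar, and also for character string pairs that do not become similar after a single conversion with a single conversion rule.
(Supplementary Note 14)
A learning program that causes a computer to function as: conversion means that converts, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing; and learning means that generates the determination model by machine learning using the converted character string pairs as training data. With this configuration, a determination model that can perform name identification of converted character string pairs with high accuracy can be generated. Using this determination model makes it possible to perform name identification correctly even for character strings whose notations are not similar.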
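Some or all of the embodiments described above can further be expressed as follows.
11 data acquisition unit (data acquisition means)
12 conversion pattern determination unit (conversion pattern determination means)
2 information processing device
21 conversion unit (conversion means)
22 determination unit (determination means)
3 information processing device
31 conversion unit (conversion means)
32 learning unit (learning means)
4 conversion device (information processing device)
41 data acquisition unit (data acquisition means)
42 conversion pattern determination unit (conversion pattern determination means)
6 information processing device
601 data acquisition unit (data acquisition means)
602 conversion pattern determination unit (conversion pattern determination means)
603 conversion unit (conversion means)
604 learning unit (learning means)
606 first determination unit (determination means)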
Claims (14)
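- An information processing device comprising: data acquisition means for acquiring a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing; and
conversion pattern determination means for determining, based on results of trial conversions of the character string pairs contained in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing.
- The information processing device according to claim 1, comprising: conversion means for converting a character string pair subject to name identification in accordance with the conversion pattern determined by the conversion pattern determination means; and
determination means for determining whether or not the character string pair converted by the conversion means indicates the same thing.
- The information processing device according to claim 1, comprising: conversion means for converting the character string pairs contained in the data set in accordance with the conversion pattern determined by the conversion pattern determination means; and
learning means for generating, by machine learning using the converted character string pairs as training data, a determination model that determines whether or not a character string pair subject to name identification indicates the same thing.
- The information processing device according to any one of claims 1 to 3, wherein the conversion pattern is a combination of a plurality of conversion rules arranged in the order in which they are applied, and
the conversion pattern determination means performs, for each of a plurality of mutually different conversion patterns, a trial of determining whether or not the character string pairs converted in accordance with that conversion pattern indicate the same thing, and determines the conversion pattern based on the evaluation results of the determination accuracy in each trial.
- The information processing device according to any one of claims 1 to 3, wherein the conversion pattern determination means determines the conversion pattern by reinforcement learning that uses, as the reward, the accuracy of determining whether or not the converted character string pairs indicate the same thing.
- The information processing device according to any one of claims 1 to 5, wherein the conversion pattern includes at least one of the following conversion rules: translation into a character string in another language, extraction of initial characters, and conversion of character type.
- An information processing device comprising: conversion means for converting at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules; and
determination means for determining whether or not the converted character string pair indicates the same thing.
- An information processing device comprising: conversion means for converting, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing; and
learning means for generating the determination model by machine learning using the converted character string pairs as training data.
- A conversion pattern determination method comprising, by at least one processor:
acquiring a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing; and
determining, based on results of trial conversions of the character string pairs contained in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing.
- A name identification method comprising, by at least one processor:
converting at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules; and
determining whether or not the converted character string pair indicates the same thing.
- A learning method comprising, by at least one processor:
converting, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing; and
generating the determination model by machine learning using the converted character string pairs as training data.
- A conversion pattern determination program that causes a computer to function as:
data acquisition means for acquiring a data set containing a plurality of character string pairs for each of which it is known whether or not the character strings indicate the same thing; and
conversion pattern determination means for determining, based on results of trial conversions of the character string pairs contained in the data set, a conversion pattern that increases the accuracy of determining whether or not the character string pairs contained in the data set indicate the same thing.
- A name identification program that causes a computer to function as:
conversion means for converting at least one of the two character strings constituting a character string pair subject to name identification by sequentially applying a plurality of conversion rules; and
determination means for determining whether or not the converted character string pair indicates the same thing.
- A learning program that causes a computer to function as:
conversion means for converting, by sequentially applying a plurality of conversion rules, at least one of the two character strings constituting each character string pair in training data used to generate a determination model that determines whether or not a character string pair subject to name identification indicates the same thing; and
learning means for generating the determination model by machine learning using the converted character string pairs as training data.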
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022579218A JPWO2022168208A1 (ja) | 2021-02-03 | 2021-02-03 | |
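PCT/JP2021/003965 WO2022168208A1 (ja) | 2021-02-03 | 2021-02-03 | Information processing device, conversion pattern determination method, name identification method, learning method, conversion pattern determination program, name identification program, and learning program |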
US18/275,134 US20240104128A1 (en) | 2021-02-03 | 2021-02-03 | Information processing apparatus and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
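PCT/JP2021/003965 WO2022168208A1 (ja) | 2021-02-03 | 2021-02-03 | Information processing device, conversion pattern determination method, name identification method, learning method, conversion pattern determination program, name identification program, and learning program |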
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022168208A1 (ja) | 2022-08-11 |
Family
ID=82740958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
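PCT/JP2021/003965 WO2022168208A1 (ja) | Information processing device, conversion pattern determination method, name identification method, learning method, conversion pattern determination program, name identification program, and learning program | 2021-02-03 | 2021-02-03 |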
Country Status (3)
Country | Link |
---|---|
US (1) | US20240104128A1 (ja) |
JP (1) | JPWO2022168208A1 (ja) |
WO (1) | WO2022168208A1 (ja) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009223463A (ja) * | 2008-03-14 | 2009-10-01 | Nippon Telegr & Teleph Corp <Ntt> | Synonymy determination device, method therefor, program, and recording medium |
2021
- 2021-02-03 JP JP2022579218A patent/JPWO2022168208A1/ja active Pending
- 2021-02-03 US US18/275,134 patent/US20240104128A1/en active Pending
- 2021-02-03 WO PCT/JP2021/003965 patent/WO2022168208A1/ja active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009223463A (ja) * | 2008-03-14 | 2009-10-01 | Nippon Telegr & Teleph Corp <Ntt> | Synonymy determination device, method therefor, program, and recording medium |
Also Published As
Publication number | Publication date |
---|---|
JPWO2022168208A1 (ja) | 2022-08-11 |
US20240104128A1 (en) | 2024-03-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
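JP6265921B2 (ja) | | Method, apparatus, and product for semantic processing of text |
CN110851596A (zh) | | Text classification method, apparatus, and computer-readable storage medium |
CN115146488B (zh) | | Big-data-based intelligent modeling system for variable business processes and method thereof |
KR101939209B1 (ko) | | Apparatus for classifying text categories based on a neural network, method therefor, and computer-readable recording medium storing a program for performing the method |
CN113591457B (zh) | | Text error correction method, apparatus, device, and storage medium |
Gupta et al. | | Unsupervised self-training for sentiment analysis of code-switched data |
CN114564563A (zh) | | End-to-end joint entity and relation extraction method and system based on relation decomposition |
CN116821299A (zh) | | Intelligent question answering method, intelligent question answering apparatus, device, and storage medium |
JP2022151838A (ja) | | Extraction of open information from low-resource languages |
JP2020154668A (ja) | | Synonym determination method, synonym determination program, and synonym determination device |
JP2019082860A (ja) | | Generation program, generation method, and generation device |
De Araujo et al. | | Automatic cluster labeling based on phylogram analysis |
WO2022168208A1 (ja) | | Information processing device, conversion pattern determination method, name identification method, learning method, conversion pattern determination program, name identification program, and learning program |
WO2023132029A1 (ja) | | Information processing device, information processing method, and program |
JPWO2022168208A5 (ja) | ||
CN109299260B (zh) | | Data classification method, apparatus, and computer-readable storage medium |
CN114780577A (zh) | | SQL statement generation method, apparatus, device, and storage medium |
CN110633363B (zh) | | Text entity recommendation method based on NLP and fuzzy multi-criteria decision making |
Zouidine et al. | | A comparative study of pre-trained word embeddings for Arabic sentiment analysis |
Zagagy et al. | | ACKEM: automatic classification, using KNN based ensemble modeling |
JP7333891B2 (ja) | | Information processing device, information processing method, and information processing program |
WO2024034196A1 (ja) | | Trained model selection method, trained model selection device, and trained model selection program |
CN117034901B (zh) | | Data statistics system based on text generation templates |
CN115329158B (zh) | | Data association method based on multi-source heterogeneous power data |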
WO2022211099A1 (en) | Patent valuation using artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21924605; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 18275134; Country of ref document: US |
| | WWE | WIPO information: entry into national phase | Ref document number: 2022579218; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 21924605; Country of ref document: EP; Kind code of ref document: A1 |