WO2023233467A1 - Information identification device, information identification method, and program - Google Patents

Information identification device, information identification method, and program

Info

Publication number
WO2023233467A1
WO2023233467A1 PCT/JP2022/021943 JP2022021943W WO2023233467A1 WO 2023233467 A1 WO2023233467 A1 WO 2023233467A1 JP 2022021943 W JP2022021943 W JP 2022021943W WO 2023233467 A1 WO2023233467 A1 WO 2023233467A1
Authority
WO
WIPO (PCT)
Prior art keywords
distributed
type
unit
information
key
Prior art date
Application number
PCT/JP2022/021943
Other languages
French (fr)
Japanese (ja)
Inventor
遥香 小山内
優 酒井
彩 鈴木
翔 金丸
謙輔 高橋
悟 近藤
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2022/021943
Publication of WO2023233467A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present invention relates to an information identification device, an information identification method, and a program.
  • B2B2X services, in which multiple businesses in different industries collaborate, are increasing.
  • such services require distributing information such as customer information, contract information, and billing information between the collaborating businesses.
  • Non-Patent Document 1: morphological analysis
  • Non-Patent Document 2: distributed representation of characters
  • Non-Patent Document 3: confidence calculation technology for classifiers
  • Non-Patent Documents 1-3 do not consider identifying the same type of information when different formats of information are input. Therefore, using these non-patent documents, it is not possible to identify information in different formats as being of the same type.
  • the present invention has been made in view of the above circumstances, and an object of the present invention is to provide a technology that allows information to be identified as being of the same type when information in different formats is input.
  • one aspect of the present invention includes: a dividing unit that divides learning information including a key and a value into words; a distributed expression generating unit that generates a distributed expression of each word; a combination generation unit that generates a plurality of patterns of distributed expressions by combining the distributed expressions of the key words; a distributed expression concatenation unit that generates, for each pattern, a concatenated distributed expression by combining the distributed expression of the key of the pattern with the distributed expressions of the value words; a learning data generation unit that generates learning data including the concatenated distributed expression for each pattern and a type number of the type corresponding to the learning information; and a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
  • One aspect of the present invention includes: a dividing unit that divides input information including a key and a value into words; a distributed expression generating unit that generates a distributed expression of each word; a combination generation unit that generates a plurality of patterns of distributed expressions by combining the distributed expressions of the key words; a distributed expression concatenation unit that generates, for each pattern, a concatenated distributed expression by combining the distributed expression of the key of the pattern with the distributed expressions of the value words; a classifier that outputs a confidence level for each type for the concatenated distributed expression; and a type determination unit that uses the confidence levels that the classifier outputs for each type for each pattern to determine whether the type of the input information is one of the existing types or a new type.
  • One aspect of the present invention is an information identification method performed by an information identification device, the method comprising: dividing learning information including a key and a value into words; generating a distributed representation of each word; generating a plurality of patterns of distributed expressions by combining the distributed expressions of the key words; for each pattern, combining the distributed expression of the key of the pattern with the distributed expressions of the value words to generate a concatenated distributed expression; generating learning data including the concatenated distributed expression for each pattern and a type number of the type corresponding to the learning information; and performing machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
  • One aspect of the present invention is a program that causes a computer to function as the information identification device.
  • FIG. 1 shows an example of the configuration of an information identification device according to this embodiment.
  • FIG. 2 is an explanatory diagram illustrating the operation of the data generation section of the information identification device.
  • FIG. 3 is an explanatory diagram illustrating another operation of the data generation section of the information identification device.
  • FIG. 4 is a diagram showing the relationship between the distributed representations of each word and the sum of the distributed representations.
  • FIG. 5 is an explanatory diagram illustrating a pattern of distributed representation of keys.
  • FIG. 6 is a sequence diagram showing the operation of the information identification device in the learning phase.
  • FIG. 7 is an explanatory diagram illustrating the operation of the determination section of the information identification device.
  • FIG. 8 is an image diagram of the processing of the synonym conversion unit.
  • FIG. 9 is a sequence diagram showing the operation of the information identification device in the identification phase when distributed representation is possible.
  • FIG. 10 is a sequence diagram showing the operation of the information identification device in the identification phase when distributed representation is not possible.
  • FIG. 11 is an explanatory diagram for explaining a method of calculating the distance between a boundary surface and a point.
  • FIG. 12 is a schematic diagram showing the effects of this embodiment.
  • FIG. 13 is an example of a hardware configuration.
  • FIG. 1 shows a configuration example of an information identification device 1 of this embodiment.
  • the information identification device 1 is a device that identifies the type of information input during information distribution between multiple companies in different industries.
  • the illustrated information identification device 1 includes a data generation section A, a determination section B, a type generation section C, a type holding section 19, and a regular expression determination section 20.
  • the data generation section A includes a character string division section 10, a distributed expression generation section 11, a combination generation section 12, a distributed expression concatenation section 13, and a learning data generation section 14.
  • the determination unit B includes a type classification unit 15, a type determination unit 16, and a synonym conversion unit 21.
  • the type generation unit C includes a similar word extraction section 17 and a learning data update section 18.
  • the character string dividing unit 10 divides input information into words. Specifically, the character string division unit 10 performs morphological analysis on the character string of information and divides it into minimum unit words that have meaning by themselves (see Non-Patent Document 1). Information in this embodiment includes a key and a value.
  • the character string dividing unit 10 receives learning information including a key and value during learning, and receives input information including a key and value (information to be determined) during determination.
  • the distributed expression generation unit 11 generates a distributed expression for each word divided by the character string division unit 10.
  • a distributed representation is a natural language processing technique that represents a word as a high-dimensional real-valued vector (see Non-Patent Document 2). Expressing the meaning of a word numerically makes it possible to perform calculations on that meaning.
  • the combination generation unit 12 combines the distributed expressions of the key words to generate a plurality of distributed expression patterns.
  • the combination generation unit 12 may generate a pattern by combining the distributed expressions of each word of the key from the front or the rear.
  • for each pattern generated by the combination generation unit 12, the distributed expression concatenation unit 13 generates a concatenated distributed expression by combining the key distributed expression of the pattern and the distributed expression of the value word.
  • the distributed expression concatenation unit 13 may calculate the sum of the distributed expression of the key of the pattern and the distributed expression of each word of the value as the concatenated distributed expression.
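  • The sum-based concatenation described above can be sketched as follows. This is a minimal illustration, assuming toy three-dimensional vectors; the names `add_vectors`, `concatenate`, and the `embedding` table are hypothetical and do not appear in the original disclosure.

```python
from typing import Dict, List

Vector = List[float]


def add_vectors(vectors: List[Vector]) -> Vector:
    """Return the element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def concatenate(pattern_vector: Vector, value_vectors: List[Vector]) -> Vector:
    """Combine one key pattern with the value words by summation."""
    return add_vectors([pattern_vector] + value_vectors)


# Hypothetical toy embeddings; a real system would use learned vectors.
embedding: Dict[str, Vector] = {
    "name": [1.0, 0.0, 2.0],
    "Yamada": [0.5, 1.0, 0.0],
    "Taro": [0.0, 1.0, 1.0],
}

# Key pattern "name" combined with the value words "Yamada" and "Taro".
linked = concatenate(embedding["name"], [embedding["Yamada"], embedding["Taro"]])
```

  • Summation keeps the dimensionality of the result equal to that of a single word vector, so the classifier input size does not depend on how many words the key or the value contains.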
  • the learning data generation unit 14 generates learning data including a connected distributed expression for each pattern and a type number of the type corresponding to the learning information.
  • the type classification unit 15 includes a learning unit and a classifier.
  • the learning unit performs machine learning on the learning data to generate a classifier.
  • the classifier is a trained model for identifying input information including a key and a value into one of the types.
  • the classifier of this embodiment is input with a connected distributed representation for each pattern, which is generated from the input information.
  • the classifier may output a confidence level for each type for the input concatenated distributed representation.
  • the classifier uses, for example, an SVM (support-vector machine).
  • the SVM-based method aims to estimate with higher accuracy by focusing on the degree of confidence that a multi-class classifier has in its recognition.
  • the SVM creates K boundary surfaces (one per type) using the One-vs-All approach, and determines the classification type using the confidence calculated from the distance between the classification target and each boundary surface.
  • the distance from the SVM boundary surface is used to calculate the confidence of the classifier. When the classification target lies on the positive side of a boundary surface, the farther it is from that surface, the higher the confidence. The calculation of the confidence will be described later.
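  • As a rough illustration of a distance-based confidence, the sketch below computes the signed distance from a point to a linear boundary surface w·x + b = 0, which is (w·x + b) / ||w||. It assumes linear boundaries and made-up weights; the function name `signed_distance` and the `boundaries` table are hypothetical, and the confidence calculation actually used in the embodiment may differ.

```python
import math
from typing import Dict, List, Tuple


def signed_distance(weights: List[float], bias: float, point: List[float]) -> float:
    """Signed distance from `point` to the hyperplane weights . x + bias = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm


# One boundary surface per type (One-vs-All); the numbers are made up.
boundaries: Dict[int, Tuple[List[float], float]] = {
    0: ([3.0, 4.0], -5.0),
    1: ([0.0, -1.0], 1.0),
}

point = [3.0, 4.0]
confidences = {
    type_number: signed_distance(weights, bias, point)
    for type_number, (weights, bias) in boundaries.items()
}
```

  • A positive value means the point lies on the positive side of that type's boundary, and a larger value means it lies farther from the boundary.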
  • the type determination unit 16 determines whether the type of the input information is one of the existing types or a new type, using the confidence level that the classifier outputs for each type for each pattern. Specifically, the type determination unit 16 calculates, for each type, the average value of the confidence output from the classifier for each pattern. If the highest of the average values is positive, the type determination unit 16 determines that the type of the input information is the type with the highest value; if the highest value is negative, it determines that the type is a new type. Further, if the highest value is negative, the type determination unit 16 determines again whether the type of the input information is one of the existing types, using converted information in which each word of the key is replaced with a synonym by the synonym conversion unit 21.
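  • The averaging rule above can be sketched as follows. The function name `determine_type` and the confidence values are hypothetical; treating an average of exactly zero as "not positive" is an assumption, since the text only distinguishes positive from negative.

```python
from typing import Dict, List, Optional


def determine_type(confidences_per_pattern: List[Dict[str, float]]) -> Optional[str]:
    """Average the confidence per type over all patterns.

    Returns the type with the highest average if that average is positive,
    otherwise None (meaning: candidate for a new type).
    """
    count = len(confidences_per_pattern)
    averages = {
        type_name: sum(c[type_name] for c in confidences_per_pattern) / count
        for type_name in confidences_per_pattern[0]
    }
    best_type = max(averages, key=lambda name: averages[name])
    return best_type if averages[best_type] > 0 else None


# Two key patterns, each scored against the same existing types (made-up values).
known = determine_type([
    {"name": 0.8, "address": -0.4},
    {"name": 0.2, "address": -0.6},
])
unknown = determine_type([
    {"name": -0.1, "address": -0.5},
    {"name": -0.4, "address": -0.3},
])
```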
  • when the type determination unit 16 determines that the type of the input information is a new type, it generates a type number for the new type (new type number) and stores the type number in the type holding unit 19 together with the new type.
  • the synonym conversion unit 21 converts each key word of the input information into a synonym.
  • the synonym conversion unit 21 may convert into synonyms using, for example, a database such as a classification vocabulary table.
  • the Classified Vocabulary is a thesaurus (collection of synonyms) that classifies and organizes words according to their meaning, and the database is published by the Linguistic Resource Development Center of the National Institute for Japanese Language and Linguistics.
  • the synonym conversion unit 21 may convert each word into a synonym using, for example, a "classification item" included in a record of a classification vocabulary table.
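  • A table-based synonym conversion could look like the sketch below. The `SYNONYM_TABLE` dictionary is a hypothetical stand-in for a thesaurus lookup and does not reproduce any real database; leaving unknown words unchanged is also an assumption.

```python
from typing import Dict, List

# Hypothetical mapping from a word to a representative synonym.
SYNONYM_TABLE: Dict[str, str] = {
    "service": "price/cost",
    "name": "name",
}


def convert_key_words(words: List[str], table: Dict[str, str]) -> List[str]:
    """Replace each key word with its synonym; unknown words are kept as-is."""
    return [table.get(word, word) for word in words]


converted = convert_key_words(["service", "name"], SYNONYM_TABLE)
```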
  • the similar word extraction unit 17 extracts a similar word for the key of the input information (first similar word) and a similar word for the value of the input information (second similar word). Then, the distributed expression generation unit 11 generates a distributed expression of the similar word of the key and a distributed expression of the similar word of the value.
  • the distributed expression concatenation unit 13 combines the distributed expression of the similar word of the key and the distributed expression of the similar word of the value to generate an additional concatenated distributed expression. Further, the distributed expression concatenation unit 13 may generate an additional concatenated distributed expression by combining the distributed expression of the key and the distributed expression of the similar word of the value. Further, the distributed expression concatenation unit 13 may generate an additional concatenated distributed expression by combining the distributed expression of the similar word of the key and the distributed expression of the value.
  • the learning data updating unit 18 generates learning data (additional learning data) including a new type number. Specifically, the learning data updating unit 18 generates learning data including an additional concatenated distributed representation and a new type number. The learning unit of the type classification unit 15 retrains the classifier using the additional learning data generated by the learning data updating unit 18.
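  • The three combinations described above (similar key with similar value, original key with similar value, similar key with original value) can be sketched as follows. The function names and the toy vectors are hypothetical, and summation is assumed as the combining operation.

```python
from typing import List, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def additional_learning_data(
    key_vec: Vector,
    value_vec: Vector,
    key_similar_vec: Vector,
    value_similar_vec: Vector,
    new_type_number: int,
) -> List[Tuple[Vector, int]]:
    """Build extra training rows for a newly created type."""
    combined = [
        add(key_similar_vec, value_similar_vec),  # similar key + similar value
        add(key_vec, value_similar_vec),          # original key + similar value
        add(key_similar_vec, value_vec),          # similar key + original value
    ]
    return [(vector, new_type_number) for vector in combined]


rows = additional_learning_data([1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0], 3)
```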
  • the type holding unit 19 stores types and type numbers in association with each other.
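  • A minimal sketch of such a store is shown below. The class name `TypeStore` and the sequential numbering scheme starting at 0 are assumptions made for illustration only.

```python
from typing import Dict


class TypeStore:
    """Keeps types and their type numbers in association with each other."""

    def __init__(self) -> None:
        self._number_by_type: Dict[str, int] = {}

    def register(self, type_name: str) -> int:
        """Return the existing number, or assign the next unused number."""
        if type_name not in self._number_by_type:
            self._number_by_type[type_name] = len(self._number_by_type)
        return self._number_by_type[type_name]

    def number_of(self, type_name: str) -> int:
        """Look up the type number of an already registered type."""
        return self._number_by_type[type_name]


store = TypeStore()
first = store.register("name")
second = store.register("address")
```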
  • the regular expression determining unit 20 determines the type of input information when the value of the input information cannot be expressed in a distributed representation (in the case of a regular expression).
  • the information identification device 1 of the present embodiment described above can improve identification accuracy by generating a key pattern in consideration of duplicate words included in the key. Further, the information identification device 1 of this embodiment may perform synonym conversion for the key of input information. Identification accuracy can be improved by considering synonyms of Key.
  • the middle B system 3 is a system of a service provider.
  • the information identification device 1 of this embodiment is a device operated by middle B.
  • the operator terminal 7 is a terminal used by a middle B operator.
  • the First B system 5 is a system of a cooperating business entity related to the services provided by Middle B. In the example shown in FIG. 1, the information identification device 1 identifies the type of information input from the middle B system 3, and outputs the identification result to the first B system 5.
  • FIG. 2 is an explanatory diagram illustrating the operation of the data generation section A of the information identification device 1. The illustrated operations are performed both in the learning phase and in the identification phase (excluding the operation of the learning data generation unit 14).
  • the case of the learning phase will be explained below as an example.
  • the operator terminal 7 transmits the learning information (character string) input by the operator to the character string dividing unit 10 of the information identification device 1.
  • the character string dividing unit 10 divides input information into words by morphological analysis.
  • the character string division unit 10 divides the input "applicant name: Taro Yamada" into "application", "person", "name", "Yamada", and "Taro", and outputs them to the distributed expression generation unit 11. "Applicant name" is the key, and "Taro Yamada" is the value.
  • the distributed expression generation unit 11 generates a distributed expression for each word divided by the character string division unit 10 and outputs it to the combination generation unit 12. That is, the distributed expression generation unit 11 converts each word into a high-dimensional real vector.
  • the combination generation unit 12 generates a plurality of patterns by combining the distributed expressions of the key words.
  • the combination generation unit 12 generates a plurality of patterns by combining the distributed expressions of the divided key words one by one from the front, and outputs the patterns to the distributed expression concatenation unit 13.
  • in the illustrated example, three patterns are generated. This reduces the impact of including duplicate suffixes such as "...name".
  • for each pattern, the distributed expression concatenation unit 13 combines the distributed representation of the pattern with the distributed representation of the value word, generates a concatenated distributed representation, and outputs it to the learning data generation unit 14.
  • the learning data generation unit 14 receives the type number corresponding to the input learning information from the operator terminal 7, generates learning data including the concatenated distributed representations of the plurality of patterns and the type number, and outputs it to the type classification unit 15. In the illustrated learning data, a type number of "0" is set for each concatenated distributed representation.
  • the learning data is data for training the classifier of the type classification unit 15.
  • the learning section of the type classification section 15 generates a classifier by machine learning using learning data.
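  • The front-to-back pattern generation and the resulting learning data can be sketched as follows. This assumes that "combining one by one from the front" means taking every prefix of the key words; the function names and the two-dimensional toy embeddings are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def front_patterns(key_words: List[str]) -> List[List[str]]:
    """Every prefix of the key words: [w1], [w1, w2], [w1, w2, w3], ..."""
    return [key_words[:i] for i in range(1, len(key_words) + 1)]


def sum_words(words: List[str], embedding: Dict[str, Vector]) -> Vector:
    """Sum the vectors of the given words element-wise."""
    return [sum(components) for components in zip(*(embedding[w] for w in words))]


def build_learning_data(
    key_words: List[str],
    value_words: List[str],
    type_number: int,
    embedding: Dict[str, Vector],
) -> List[Tuple[Vector, int]]:
    """One (concatenated vector, type number) row per key pattern."""
    return [
        (sum_words(pattern + value_words, embedding), type_number)
        for pattern in front_patterns(key_words)
    ]


# Hypothetical toy embeddings for the words in the example.
embedding = {
    "application": [1.0, 0.0],
    "person": [0.0, 1.0],
    "name": [1.0, 1.0],
    "Yamada": [2.0, 0.0],
    "Taro": [0.0, 2.0],
}
patterns = front_patterns(["application", "person", "name"])
rows = build_learning_data(["application", "person", "name"], ["Yamada", "Taro"], 0, embedding)
```

  • With three key words this yields three patterns, and every row carries the same type number.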
  • FIG. 3 is an explanatory diagram illustrating another operation of the data generation section A of the information identification device.
  • the illustrated operations are performed both in the learning phase and in the identification phase (excluding the operation of the learning data generation unit 14).
  • the case of the learning phase will be explained below as an example.
  • in FIG. 3, the distributed expressions of the key words are combined from the rear to generate multiple patterns, which reduces the impact of including a prefix.
  • the character string division unit 10 divides the information transmitted from the operator terminal 7 into words by morphological analysis.
  • the character string division unit 10 divides the input "applicant name: Taro Yamada" into "o", "application", "person", "name", "Yamada", and "Taro", and outputs them to the distributed expression generation unit 11.
  • "Applicant name" is the key, and "Taro Yamada" is the value.
  • the distributed expression generation unit 11 generates a distributed expression for each divided word and outputs it to the combination generation unit 12.
  • the combination generation unit 12 generates a plurality of patterns by combining distributed representations of key words.
  • the combination generation unit 12 generates a plurality of patterns by combining the words of the divided keys one by one from the rear, and outputs the patterns to the distributed expression connection unit 13.
  • in the illustrated example, four patterns are generated. This reduces the impact of including prefixes such as "o".
  • for each pattern, a concatenated distributed representation is generated by combining the distributed representation of the pattern and the distributed representation of the value word, and is output to the learning data generation unit 14.
  • the learning data generation unit 14 receives the type number of the learning information from the operator terminal 7, generates learning data including the plurality of concatenated distributed representations and the type number, and outputs it to the type classification unit 15.
  • the learning section of the type classification section 15 generates a classifier by machine learning using learning data.
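  • The rear-to-front variant can be sketched in the same way. This assumes that "combining one by one from the rear" means taking every suffix of the key words; the function name `rear_patterns` is hypothetical.

```python
from typing import List


def rear_patterns(key_words: List[str]) -> List[List[str]]:
    """Every suffix of the key words: [wn], [wn-1, wn], ..., [w1, ..., wn]."""
    return [key_words[-i:] for i in range(1, len(key_words) + 1)]


patterns = rear_patterns(["o", "application", "person", "name"])
```

  • Only the longest pattern contains the leading prefix word, so the remaining patterns are unaffected by it.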
  • the information identification device 1 divides information including a key and a value into words, calculates a distributed representation of each word, and combines the distributed representations of the key words to generate patterns. For each pattern, it generates a concatenated distributed expression by combining the distributed expression of the key and the distributed expression of each word of the value, and generates learning data including the concatenated distributed expression and the type number.
  • a pattern is generated by combining the distributed expressions of each word of the key from the front or the rear, but the pattern is not limited to this.
  • the combination generation unit 12 can generate patterns of distributed expressions by various combinations of distributed expressions of key words.
  • FIG. 4 shows the distributed representation (vector) of each word ("surname", "name", "address", "Yamaguchi") and the concatenated distributed representations ("surname: Yamaguchi", "name: Yamaguchi", "address: Yamaguchi").
  • here, the keys are "surname", "name", and "address", which are words that cannot be divided any further. Therefore, the combination generation unit 12 generates one pattern for each key.
  • the figure indicates that the sums of distributed expressions of the same type are mapped to close positions (that is, clustered).
  • "Yamaguchi" combined with "name" and with "surname", which have similar meanings, is converted into similar distributed expressions.
  • "Yamaguchi" combined with "name" and with "address", which have different meanings, is converted into distributed representations that reflect the respective meanings.
  • FIG. 5 is an explanatory diagram illustrating a pattern of distributed expressions of keys generated by the combination generation unit 12 of this embodiment.
  • duplicate words (suffixes, prefixes, etc.) included in the key can have a large effect on identification, and correct identification may be difficult.
  • multiple patterns of combinations of the distributed expressions of the key words (morphemes) of the input information are generated, and the concatenated distributed expressions (vectors) are used as input to the type classification unit 15 (classifier). Generating multiple patterns reduces the influence of duplicate words.
  • the illustrated example shows a case where input information with "service name" as a key is input to the character string dividing unit 10 in a state where "company name" exists as an existing type.
  • a distributed representation (sum of distributed representations) that is a combination of the distributed representation of "service” and the distributed representation of "name” of "service name” is input to the classifier.
  • the cosine similarity between "company name” and "service name” is high (0.84), and as shown in the figure, the distributed representation of "company name” and the distributed representation of "service name” are mapped in close positions. Therefore, the classifier incorrectly determines that the type of input information with the key of "service name” is "company name.”
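  • The effect of a shared suffix on cosine similarity can be illustrated with toy vectors. The vectors below are made up so that the shared word "name" dominates the sum; they do not reproduce the 0.84 value reported for the figure, and the function name `cosine_similarity` is hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: "company" and "service" are orthogonal, "name" is large.
company, service, name = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]
company_name = [c + n for c, n in zip(company, name)]
service_name = [s + n for s, n in zip(service, name)]

similarity_with_suffix = cosine_similarity(company_name, service_name)
similarity_without_suffix = cosine_similarity(company, service)
```

  • In this toy setting the two keys look very similar once the shared suffix is added, even though the distinguishing words are unrelated, which is the situation the multiple-pattern approach is meant to mitigate.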
  • in order to reduce the influence of such a suffix or prefix, the combination generation unit 12 generates multiple patterns: a distributed representation of a single key word, and distributed representations that combine at least two key words. Here, two patterns are generated: a distributed representation that combines the distributed representation of "service" and the distributed representation of "name", and a distributed representation of only "service". Then, by inputting, for each pattern, the concatenated distributed expression combined with the distributed expression of the value word into the classifier, it is possible to reduce the influence of the suffix "name" and reduce misjudgment of the type. In the illustrated example, it can be determined that the type of service name is not a company name but a new type.
  • FIG. 6 is a sequence diagram showing the operation of the information identification device 1 in the learning phase.
  • the operator terminal 7 receives the operator's instructions and inputs learning information to the character string segmentation unit 10 of the information identification device 1 (step S21).
  • the learning information includes at least one piece of learning data including a key and a value.
  • the character string dividing unit 10 divides the input information into words (step S22).
  • the distributed expression generation unit 11 generates a distributed expression of each divided word and outputs it to the combination generation unit 12 (step S23).
  • the combination generation unit 12 combines the distributed expressions of the key words to generate a distributed expression pattern (step S24). Note that if the key is composed of one word, one pattern is generated.
  • the distributed expression concatenation unit 13 generates a concatenated distributed expression for each pattern and outputs it to the learning data generation unit 14 (step S25). Specifically, the distributed expression concatenation unit 13 generates a concatenated distributed expression for each pattern by combining the distributed expression of the pattern and the distributed expression of the value word.
  • the operator terminal 7 receives the operator's instruction and requests the type holding unit 19 for the type number of the information transmitted in S21 (step S26).
  • the type holding unit 19 transmits the type number corresponding to the information transmitted in S21 to the operator terminal 7 (step S27).
  • upon acquiring the type number, the operator terminal 7 transmits the type number to the learning data generation unit 14 (step S28).
  • the learning data generation unit 14 generates learning data including the concatenated distributed representation received in step S25 and the type number received in S28, and sends it to the type classification unit 15 (step S29).
  • the learning unit of the type classification unit 15 generates a classifier by performing machine learning on the learning data (step S30).
  • the operator terminal 7 requests the type holding unit 19 for the type number in step S26, but the learning data generation unit 14 may request the type holding unit 19 for the type number.
  • in that case, the type holding unit 19 sends the type number to the learning data generation unit 14 in step S27, and the learning data generation unit 14 generates the learning data using the concatenated distributed representation and the type number acquired from the type holding unit 19 (step S29).
  • the regular expression determination unit 20 of the information identification device 1 uses regular expressions to determine the type based on patterns of numbers and character strings; steps S31 and S32 register these patterns. Note that steps S31 and S32 are performed asynchronously with steps S21 to S30.
  • the operator terminal 7 receives the type and the corresponding regular expression pattern from the partner business operator of the First B system 5 (step S31), and transmits the type and the regular expression pattern to the type holding unit 19 of the information identification device 1 (step S32).
  • the type holding unit 19 stores the transmitted type and regular expression pattern in its own storage unit.
  • FIG. 6 illustrates a regular expression pattern whose type is postal code and a regular expression pattern whose type is telephone number.
  • the regular expression for postal codes indicates that the code starts with the postal mark (〒), followed by three digits, a hyphen (-), and four digits.
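  • A pattern of this shape could be written as in the sketch below. It assumes that the leading mark is the Japanese postal mark 〒 and that ASCII digits and an ASCII hyphen are used; the exact pattern shown in FIG. 6 is not reproduced here, and the names `POSTAL_CODE` and `is_postal_code` are hypothetical.

```python
import re

# Assumed shape: postal mark, three digits, hyphen, four digits.
POSTAL_CODE = re.compile(r"〒[0-9]{3}-[0-9]{4}")


def is_postal_code(value: str) -> bool:
    """True if the whole value matches the assumed postal code pattern."""
    return POSTAL_CODE.fullmatch(value) is not None


with_mark = is_postal_code("〒123-4567")
without_mark = is_postal_code("123-4567")
```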
  • the information identification device 1 determines whether the type of the input information can be identified as one of the types held in the type holding unit 19, using the confidence calculated by the classifier of the type classification unit 15.
  • FIG. 7 is a diagram showing an example of the overall flow of the information identification device 1 in the identification phase.
  • the concatenated distributed representation output from the data generation unit A is input to the classifier, and the average value of the confidence for each type obtained from the classifier is calculated. If the maximum of the average values is positive, the type corresponding to the maximum confidence value is determined to be the type of the input information. If the maximum of the average values is negative, each word of the key is converted into a synonym to generate a concatenated distributed representation, and the average value of the confidence is calculated again.
  • if the maximum of the recalculated average values is positive, the type corresponding to the maximum confidence value is determined to be the type of the input information. If it is negative, the input information is unidentifiable: it is determined that the type is a new type not held in the type holding unit 19, and the process moves to the type generation unit C.
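  • The overall decision flow with one synonym retry can be sketched as follows. The callables `classify` and `to_synonyms` are hypothetical stand-ins for the classifier (returning the average confidence per type) and the synonym conversion; the stub classifier and its values are made up purely to exercise the flow.

```python
from typing import Callable, Dict, List, Optional

Classify = Callable[[List[str], List[str]], Dict[str, float]]


def best_positive_type(averages: Dict[str, float]) -> Optional[str]:
    """Type with the highest average confidence, if that average is positive."""
    best = max(averages, key=lambda name: averages[name])
    return best if averages[best] > 0 else None


def identify(
    key_words: List[str],
    value_words: List[str],
    classify: Classify,
    to_synonyms: Callable[[List[str]], List[str]],
) -> Optional[str]:
    """Try the original key first, then the synonym-converted key once."""
    found = best_positive_type(classify(key_words, value_words))
    if found is not None:
        return found
    converted = to_synonyms(key_words)
    return best_positive_type(classify(converted, value_words))  # None: new type


def fake_classify(key_words: List[str], value_words: List[str]) -> Dict[str, float]:
    """Stub classifier with made-up averages."""
    if key_words == ["fee", "name"]:
        return {"fee name": 0.4, "address": -0.7}
    return {"fee name": -0.25, "address": -0.6}


resolved = identify(["charge", "title"], ["x"], fake_classify, lambda words: ["fee", "name"])
unresolved = identify(["charge", "title"], ["x"], fake_classify, lambda words: words)
```

  • A result of `None` corresponds to handing the input over to new-type generation.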
  • the middle B system 3 transmits arbitrary information to the data generation unit A of the information identification device 1.
  • for example, "service name: FLET'S Hikari" is input, where "service name" is the key and "FLET'S Hikari" is the value.
  • the character string dividing unit 10 of the data generating unit A divides input information into words.
  • the distributed expression generation unit 11 generates a distributed expression for each divided word.
  • the combination generation unit 12 combines the distributed expressions of the key words to generate a plurality of distributed expression patterns.
  • the combination generation unit 12 generates two patterns: (1) a distributed representation of "service” and (2) a distributed representation of "service” + distributed representation of "name”.
  • the distributed expression concatenation unit 13 combines the distributed expression of the pattern and the distributed expression of the value word, generates a concatenated distributed expression, and inputs the generated concatenated distributed expression to the determination unit B.
  • for example, for pattern (2), distributed representation of "service" + distributed representation of "name" + distributed representation of "FLET'S" + distributed representation of "Hikari" is generated.
  • the distributed expression concatenation unit 13 calculates the sum of distributed expressions of words as a concatenated distributed expression.
  • the classifier of the type classification unit 15 of the determination unit B outputs the confidence for each type with respect to the concatenated distributed representation of each input pattern (step S10).
  • the type determination unit 16 calculates the average value of the confidence for each type (step S11), and determines whether the highest of the average values is positive (step S12). If the highest value is positive (step S12: YES), the type determination unit 16 determines that the input information is of the type with the highest confidence (step S13). That is, the type determination unit 16 assigns the type with the highest confidence to the input information.
  • the type determination unit 16 adds the determined type (or type number) to the input information and outputs it to the First B system 5. Each of the functional units 10 to 21 is assumed to output the data it received as input together with the data it produced. The type determination unit 16 therefore acquires the information input to the character string division unit 10 via the distributed expression generation unit 11, the combination generation unit 12, the distributed expression concatenation unit 13, and the type classification unit 15.
  • if the highest average is not positive (step S12: NO), the type determination unit 16 determines whether the connected distributed expression has already been converted into synonyms (step S14).
  • the classifier outputs confidence levels for three types, "name," "address," and "company name," and the type with the highest average confidence is "name," but that value is negative (-0.25). If the type determination unit 16 determines that the highest confidence is negative (step S12: NO), it determines whether synonym conversion has already been performed using, for example, a synonym conversion flag.
  • the synonym conversion flag is set in the output data by the type determination unit 16 or the synonym conversion unit 21 when the highest confidence level is not positive.
  • the functional unit outputs the data processed by the functional unit, and also outputs the data input to the functional unit. Therefore, the type determining unit 16 can determine the presence or absence of the conversion flag.
  • the synonym conversion unit 21 converts each word of the key of the input information into a synonym (step S15).
  • the synonym conversion unit 21 converts the keys "service” and "name” into “price/cost” and “name” using the above-mentioned classification vocabulary table and the like. Then, the converted information (price/cost name: FLET'S Hikari) in which the key of the input information is replaced with the converted word is input to the data generation section A.
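A minimal sketch of the key conversion in step S15 follows. The source relies on a classification vocabulary table; the small dictionary here is a hypothetical stand-in that only reproduces the "service" to "price/cost" example, and words without an entry are kept unchanged.

```python
# Hypothetical synonym table standing in for the classification vocabulary
# table mentioned in the text. "price/cost" is stored as two words because the
# converted key is later split into "price" and "cost".
SYNONYMS = {"service": ["price", "cost"]}


def convert_key(key_words, table=SYNONYMS):
    """Replace each key word with its synonym(s); keep words with no entry."""
    converted = []
    for word in key_words:
        converted.extend(table.get(word, [word]))
    return converted


def convert_information(key_words, value_words):
    """Return the converted key together with the unchanged value."""
    return convert_key(key_words), list(value_words)
```

Only the key is rewritten; the value ("FLET'S Hikari" in the example) is passed through as is.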
  • the data generation unit A outputs the connected distributed expression for each pattern of the converted information to the type classification unit 15.
  • connected distributed representations (1) distributed representation of "price" + distributed representation of "cost" + distributed representation of "FLET'S" + distributed representation of "Hikari"
  • the classifier of the determination unit B outputs the certainty factor for each type for the input combination of distributed representations of each pattern (step S10).
  • the type determination unit 16 calculates, for each type, the average of the confidences output from the type classification unit 15 for each pattern (step S11), and determines whether the highest average is positive (step S12). If the highest average is positive (step S12: YES), the process advances to step S13.
  • if the highest average is not positive (step S12: NO), the type determination unit 16 determines whether synonym conversion has been completed (step S14). In the illustrated example, the confidence of "name," which has the highest value after synonym conversion, is negative (-0.35) (step S12: NO), and synonym conversion has been completed (step S14: YES).
  • the type determination unit 16 determines that the type of the input information cannot be identified by the current classifier. In this case, the type determination unit 16 adds the type number of a new type to the type holding unit 19 (step S16). The type determination unit 16 then outputs the distributed expression of the key and the distributed expression of the value to the similar word extraction unit 17 of the type generation unit C (step S17). Because each functional unit outputs the data it received as input together with the data it produced, the type determination unit 16 can obtain the distributed representations of the key and the value generated by the distributed expression generation unit 11.
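The branching across steps S10 to S16 can be summarized in a short control-flow sketch. The classifier and the synonym converter are passed in as callables so the example stays self-contained; their names, the integer type numbers, and the rule "new number = largest existing number + 1" are assumptions, and the retraining that follows step S16 is left out.

```python
def identify(key_words, value_words, average_confidences, convert_key, type_numbers):
    """Return (type_number, is_new) following the flow of steps S10 to S16.

    average_confidences(key_words, value_words) -> {type_number: average}
    convert_key(key_words) -> key words replaced by synonyms
    type_numbers: list of existing type numbers; extended for a new type.
    """
    converted = False  # plays the role of the synonym conversion flag
    while True:
        averages = average_confidences(key_words, value_words)  # S10, S11
        best = max(averages, key=averages.get)
        if averages[best] > 0:                                  # S12: YES
            return best, False                                  # S13
        if converted:                                           # S14: YES
            new_number = max(type_numbers) + 1                  # S16
            type_numbers.append(new_number)
            return new_number, True
        key_words = convert_key(key_words)                      # S15
        converted = True
```

The loop runs at most twice: once with the original key and once with the synonym-converted key.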
  • the similar word extraction unit 17 extracts similar words for the key and similar words for the value using the distributed expression of the key and the distributed expression of the value, and outputs the extracted words to the distributed expression generation unit 11.
  • the similar word extraction unit 17 uses the FastText model of Non-Patent Document 2 and cosine similarity to extract similar words for keys and values.
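Similar word extraction by cosine similarity can be illustrated without any embedding library. The sketch below ranks the words of a small vocabulary against a query vector; the vocabulary, its two-dimensional vectors, and the cutoff `k` are invented for the example, whereas the source obtains the vectors from a FastText model. Vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_words(query, vocabulary, k=2):
    """Return the k words whose vectors are most similar to the query."""
    ranked = sorted(
        vocabulary.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


# Invented toy vocabulary for illustration only.
VOCABULARY = {
    "fee": [1.0, 0.1],
    "charge": [0.9, 0.2],
    "address": [0.0, 1.0],
}
```

Calling `similar_words([1.0, 0.0], VOCABULARY)` ranks "fee" and "charge" ahead of "address", because their vectors point in nearly the same direction as the query.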
  • the distributed expression generation unit 11 generates distributed expressions for each similar word of key and each similar word of value.
  • the combination generation unit 12 combines the distributed expressions of the key words to generate a plurality of distributed expression patterns.
  • the distributed expression concatenation unit 13 generates an additional concatenated distributed expression by combining, for each pattern, the distributed expressions of the key and the similar words of the key, and the distributed expressions of the value and the similar words of the value. For example, the following combinations of distributed expressions are generated for each pattern.
  • the learning data update unit 18 generates additional learning data that includes the concatenated distributed representations generated by the distributed expression concatenation unit 13 and the new type number issued by the type determination unit 16 in step S16, and outputs it to the type classification unit 15.
  • the learning section of the type classification section 15 retrains the classifier using additional learning data.
  • when a new type is generated, the combination of the new type, the distributed representation of the key, and the distributed representation of the value is used as additional learning data. At that time, similar words of the key and the value are extracted, and learning data built from the sums of the distributed expressions of the extracted similar words is also used for training, so that the classifier generates a class for the new type and the identification accuracy for the new type improves.
  • FIG. 8 is an image diagram of the processing of the synonym conversion unit 21.
  • the illustrated example shows a case where input information using "last name” as a key is input to the information identification device 1 in a state where "name", “address”, and "company name” exist as existing types.
  • the distributed representation of "surname" and the distributed representation of "full name" have a low cosine similarity and are mapped to relatively distant positions.
  • the synonym conversion unit 21 converts "surname" into the synonym "first name", so that the distributed representation of "first name" and the distributed representation of "full name" have a high cosine similarity and are mapped to nearby positions. The classifier can thereby appropriately assign input information having the key "last name" to the existing type "full name".
  • the key of input information is converted into a synonym, and the converted key is used to determine whether the input information corresponds to any of the existing types.
  • even when the combination of key and value distributed expressions yields a low degree of similarity for synonymous words, making it difficult to determine that they are of the same type, the gap between word meaning and the similarity of the distributed expressions is bridged and the type can be determined correctly.
  • FIGS. 9 and 10 are sequence diagrams showing the operation of the information identification device 1 in the identification phase.
  • FIG. 9 shows the operation when the value of the input information can be converted into a distributed representation.
  • FIG. 10 shows the operation when the value of the input information cannot be converted into a distributed representation.
  • the middle B system 3 inputs information including a key and a value to the character string dividing unit 10 of the information identification device 1 (step S61).
  • the character string dividing unit 10 divides the information into words (character strings) (step S62).
  • the distributed expression generation unit 11 generates a distributed expression for each divided word and outputs it to the combination generation unit 12 (step S63).
  • the combination generation unit 12 combines the distributed expressions of the key words, generates a plurality of distributed expression patterns, and outputs them to the distributed expression concatenation unit 13 (step S64).
  • the distributed expression concatenation unit 13 combines the distributed expression of the pattern and the distributed expression of the value word, generates a concatenated distributed expression, and outputs it to the type classification unit 15 (step S65).
  • the classifier of the type classification unit 15 outputs the certainty factor for each type of connected distributed expression for each input pattern (step S66).
  • the classifier calculates K confidence levels for each pattern.
  • the type determination unit 16 calculates the average confidence for each type. If the maximum average confidence is positive, it determines that the type of the information input in S61 is the type with the maximum value. The type determination unit 16 then transmits the identification result, which includes the information input in S61 and the type (type name and/or type number) with the highest confidence, to the First B system 5 (step S67).
  • the type determination unit 16 registers the key of the information input in S61 in the type holding unit 19 (step S68).
  • the type holding unit 19 issues the type number of the registered key (step S69) and outputs a completion notification to the type determining unit 16 (step S70).
  • the type determination unit 16 may issue the type number of the key, and the key and type number may be registered in the type holding unit 19 in step S67.
  • when the type determination unit 16 receives the completion notification, it outputs the distributed expression of the key and the distributed expression of the value to the similar word extraction unit 17 (step S71).
  • the similar word extraction unit 17 extracts similar words similar to the distributed expression of key and similar words similar to the distributed expression of value, and outputs them to the distributed expression generation unit 11 (step S72).
  • the similar word extracting unit 17 also outputs the key distributed expression and the value distributed expression obtained in step S70 to the distributed expression generating unit 11 together with the similar word.
  • the distributed expression generation unit 11 generates a distributed expression for each input similar word (step S73).
  • the combination generation unit 12 combines the distributed expressions of the key words of the similar words to generate a plurality of distributed expression patterns (step S74).
  • the distributed expression concatenation unit 13 concatenates the distributed expressions of the key and the similar words of the key with the distributed expressions of the value and the similar words of the value, generates an additional concatenated distributed expression, and outputs it to the learning data update unit 18 (step S75).
  • the learning data updating unit 18 generates additional learning data including the additional connected distributed expression and the new type number issued in step S69, and outputs it to the type classification unit 15 (step S76).
  • the learning unit of the type classification unit 15 performs re-learning using additional learning data to update the classifier, and notifies the distributed expression concatenation unit 13 of the completion of re-learning (step S77).
  • the distributed representation concatenation unit 13 returns to step S65 and again inputs the concatenated distributed representation of the information input in step S61, this time to the retrained classifier of the type classification unit 15.
  • the classifier outputs the confidence level for each type (step S66).
  • the type determining unit 16 determines the type of the input information using the confidence output from the updated classifier.
  • the type determination unit 16 instructs the synonym conversion unit 21 to perform the synonym conversion.
  • the synonym conversion unit 21 converts each word of the key in the input information into a synonym, and inputs the converted key and the unconverted value to the character string division unit 10, thereby performing the processing from step S62 onwards.
  • the middle B system 3 inputs information including a key and a value to the character string segmentation unit 10 of the information identification device 1 (step S61).
  • the character string dividing unit 10 divides the information into words (step S62).
  • "telephone number: 090-1234-5678" key:value
  • the distributed expression generation unit 11 attempts to generate a distributed expression for each divided word, but an error occurs if there is a word that cannot be converted into a distributed expression.
  • the distributed expression generation unit 11 outputs “090-1234-5678” (value), which cannot be converted into a distributed expression, to the regular expression determination unit 20 (step S81).
  • when the value is input, the regular expression determining unit 20 requests and obtains from the type holding unit 19 all pairs of regular expressions and type numbers registered there (steps S82 and S83). The regular expression determining unit 20 then determines which of the obtained regular expressions matches the character string of the value input in step S82.
  • the regular expression determination unit 20 determines that the type of information input in S61 is the type of the regular expression of the matched pattern. Then, the regular expression determining unit 20 transmits the identification result including the information input in S61 and the type of the matched pattern to the First B system 5 (Step S84).
  • the regular expression determining unit 20 transmits an error indicating that the type cannot be specified to the middle B system 3 (step S85).
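The regular expression fallback of steps S81 to S85 amounts to testing the value against each registered pattern. In the sketch below, the registry contents, the type numbers, and the function name are hypothetical; only the matching logic reflects the description.

```python
import re

# Hypothetical contents of the type holding unit: (regular expression, type number).
REGISTERED = [
    (r"0\d{1,4}-\d{1,4}-\d{4}", 4),  # assumed pattern for telephone numbers
    (r"\d{3}-\d{4}", 5),             # assumed pattern for 7-digit postal codes
]


def match_type(value, registered=REGISTERED):
    """Return the type number of the first pattern matching the whole value.

    Returns None when no pattern matches, which corresponds to sending an
    error to the sender because the type cannot be specified.
    """
    for pattern, type_number in registered:
        if re.fullmatch(pattern, value):
            return type_number
    return None
```

`re.fullmatch` is used rather than `re.search` so that a pattern only counts as matching when it covers the entire value string.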
  • Confidence is calculated using the boundary surface for each class of a classifier that performs multiclass classification, the distance between the input information and the boundary surface, and whether the input information is on the positive or negative side when viewed from the boundary surface. That is, the classifier calculates a positive or negative confidence from the distance between the boundary surface and the input information.
  • FIG. 11 is an explanatory diagram for explaining a method of calculating the distance between a boundary surface (hyperplane) and a point. If the foot of the perpendicular drawn from point X (position vector x) to the boundary surface is H (position vector h), the vector from X to H is perpendicular to the boundary surface and is therefore parallel to the normal vector w of the boundary surface.
  • the distance d between the desired point and the boundary surface is calculated using the following formula.
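The formula referred to above did not survive in this text. As a hedged reconstruction, assume the boundary surface is the hyperplane $w \cdot x + b = 0$, where the offset $b$ is notation introduced here and not taken from the source. Because the vector from X to H is parallel to $w$, the standard point-to-hyperplane result follows:

```latex
\begin{aligned}
h &= x + t\,w, \qquad w \cdot h + b = 0
  \;\Longrightarrow\; t = -\frac{w \cdot x + b}{\lVert w \rVert^{2}},\\
d &= \lVert h - x \rVert = \lvert t \rvert\,\lVert w \rVert
   = \frac{\lvert w \cdot x + b \rvert}{\lVert w \rVert}.
\end{aligned}
```

Dropping the absolute value gives a signed distance that is positive on one side of the boundary and negative on the other, which is consistent with the description of positive and negative confidence.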
  • the information identification device 1 of this embodiment described above includes a character string dividing unit 10 that divides learning information including a key and a value into words, a distributed expression generation unit 11 that generates a distributed expression of each word, a combination generation unit 12 that generates a plurality of distributed expression patterns by combining the distributed expressions of the words of the key,
  • a distributed expression concatenation unit 13 that generates, for each pattern, a concatenated distributed expression by combining the distributed expression of the key of the pattern and the distributed expressions of the words of the value,
  • a learning data generation unit 14 that generates learning data including the concatenated distributed expression for each pattern and the type number of the type corresponding to the learning information,
  • and a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
  • the concatenated distributed representation for each pattern generated from the input information is input to the classifier, and the classifier outputs a confidence level for each type for each pattern.
  • a type determination unit 16 is provided which uses these confidence levels to determine whether the type of the input information is one of the existing types or a new type.
  • the combination generation unit 12 can reduce the influence and improve the identifiability.
  • providing the combination generation unit, which combines the distributed expressions of the words of the key to generate a plurality of distributed expression patterns, improves identification accuracy.
  • this embodiment includes a synonym conversion unit 21 that converts each word of the key of input information into a synonym.
  • a general-purpose computer system as shown in FIG. 13 can be used, for example.
  • the illustrated computer system includes a CPU (Central Processing Unit) 901, a memory 902, a storage 903 (HDD: Hard Disk Drive, SSD: Solid State Drive), a communication device 904, an input device 905, and an output device 906.
  • Memory 902 and storage 903 are storage devices.
  • the functions of the information identification device 1 are realized by the CPU 901 executing a predetermined program loaded onto the memory 902.
  • the information identification device 1 may be implemented by one computer or by multiple computers. Further, the information identification device 1 may be a virtual machine implemented in a computer.
  • the program of the information identification device 1 can be stored in a computer-readable recording medium such as an HDD, SSD, USB (Universal Serial Bus) memory, CD (Compact Disc), or DVD (Digital Versatile Disc), or can be distributed via a network.
  • the determination unit B includes the synonym conversion unit 21.
  • the data generation section A may include the synonym conversion section 21.
  • the synonym conversion unit 21 may receive an instruction from the type determination unit 16 and convert each word of the key into a synonym.
  • the synonym conversion unit 21 may convert the key word of the information input to the information identification device 1 into a synonym from the beginning in the identification phase.
  • the character string division unit 10 may divide the input information into words and input the words to the synonym conversion unit 21, which converts each word of the key into a synonym and outputs the result to the distributed expression generation unit 11.
  • the subsequent processing is similar to the embodiment.
  • 1: Information identification device; 10: Character string dividing unit (dividing unit); 11: Distributed expression generation unit; 12: Combination generation unit; 13: Distributed expression concatenation unit; 14: Learning data generation unit; 15: Type classification unit; 16: Type determination unit (determination unit); 17: Similar word extraction unit; 18: Learning data update unit; 19: Type holding unit; 20: Regular expression determination unit; 3: Middle B system; 5: First B system; 7: Operator terminal

Abstract

The present invention comprises: a character string dividing unit 10 that divides information for learning including keys and values into words; a distributed representation generation unit 11 that generates a distributed representation for each word; a combination generation unit 12 that generates a plurality of patterns of distributed representations by combining the distributed representations of words that are keys; a distributed representation coupling unit 13 that generates a coupled distributed representation for each of the patterns by combining the distributed representations of the keys of the pattern and the distributed representations of words that are the values; a learning data generation unit 14 that generates learning data including the coupled distributed representation of each pattern and the type number of the type corresponding to the information for learning; and a learning unit that machine-learns the learning data, and generates a classifier for identifying input information including keys and values as some type.

Description

Information identification device, information identification method, and program
 The present invention relates to an information identification device, an information identification method, and a program.
 B2B2X services, in which multiple businesses in different industries collaborate, are increasing. Providing such services requires distributing information such as customer information, contract information, and billing information among the collaborating businesses.
 Distributing information requires identifying it. Techniques related to information identification include morphological analysis (Non-Patent Document 1), distributed representation of characters (Non-Patent Document 2), and confidence calculation for classifiers (Non-Patent Document 3).
 Currently, to distribute information among collaborating businesses, an operator inspects the content of the input information, determines its type, and manually decides the distribution destination and converts the format of the information.
 To cope with the expected growth of services in which multiple companies in different industries collaborate, information distribution among the collaborating businesses needs to be automated. Automating information distribution requires that information input in different formats be identified as being of the same type. Specifically, to distribute information to the appropriate destination for each type, information of the same type must be automatically identified as such even when the existing systems used by each company differ and the information is input in different formats.
 Non-Patent Documents 1 to 3 do not consider identifying information input in different formats as being of the same type. Therefore, the techniques of these non-patent documents cannot identify information in different formats as being of the same type.
 The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a technology that can identify information as being of the same type even when it is input in different formats.
 To achieve the above object, one aspect of the present invention includes: a dividing unit that divides learning information including a key and a value into words; a distributed expression generation unit that generates a distributed expression of each word; a combination generation unit that combines the distributed expressions of the words of the key to generate a plurality of distributed expression patterns; a distributed expression concatenation unit that, for each pattern, combines the distributed expression of the key of that pattern with the distributed expressions of the words of the value to generate a concatenated distributed expression; a learning data generation unit that generates learning data including the concatenated distributed expression for each pattern and the type number of the type corresponding to the learning information; and a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
 One aspect of the present invention includes: a dividing unit that divides input information including a key and a value into words; a distributed expression generation unit that generates a distributed expression of each word; a combination generation unit that combines the distributed expressions of the words of the key to generate a plurality of distributed expression patterns; a distributed expression concatenation unit that, for each pattern, combines the distributed expression of the key of that pattern with the distributed expressions of the words of the value to generate a concatenated distributed expression; a classifier that outputs a confidence level for each type for the concatenated distributed expression; and a determination unit that uses the confidence levels output by the classifier for each type and each pattern to determine whether the type of the input information is one of the existing types or a new type.
 One aspect of the present invention is an information identification method performed by an information identification device, the method including: dividing learning information including a key and a value into words; generating a distributed expression of each word; combining the distributed expressions of the words of the key to generate a plurality of distributed expression patterns; for each pattern, combining the distributed expression of the key of that pattern with the distributed expressions of the words of the value to generate a concatenated distributed expression; generating learning data including the concatenated distributed expression for each pattern and the type number of the type corresponding to the learning information; and performing machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
 One aspect of the present invention is a program that causes a computer to function as the above information identification device.
 According to the present invention, it is possible to provide a technology that can identify information as being of the same type even when it is input in different formats.
 FIG. 1 shows a configuration example of the information identification device of this embodiment.
 FIG. 2 is an explanatory diagram illustrating the operation of the data generation unit of the information identification device.
 FIG. 3 is an explanatory diagram illustrating another operation of the data generation unit of the information identification device.
 FIG. 4 is a diagram showing the relationship between the distributed expression of each word and the sum of the distributed expressions.
 FIG. 5 is an explanatory diagram illustrating the patterns of the distributed expression of a key.
 FIG. 6 is a sequence diagram showing the operation of the information identification device in the learning phase.
 FIG. 7 is an explanatory diagram illustrating the operation of the determination unit of the information identification device.
 FIG. 8 is an image diagram of the processing of the synonym conversion unit.
 FIG. 9 is a sequence diagram showing the operation of the information identification device in the identification phase when conversion into a distributed expression is possible.
 FIG. 10 is a sequence diagram showing the operation of the information identification device in the identification phase when conversion into a distributed expression is not possible.
 FIG. 11 is an explanatory diagram for explaining a method of calculating the distance between a boundary surface and a point.
 FIG. 12 is a schematic diagram showing the effects of this embodiment.
 FIG. 13 shows an example of a hardware configuration.
 Hereinafter, embodiments of the present invention will be described with reference to the drawings.
 <Configuration of information identification device>
 FIG. 1 shows a configuration example of the information identification device 1 of this embodiment. The information identification device 1 identifies the type of input information when information is distributed among multiple companies in different industries. The illustrated information identification device 1 includes a data generation unit A, a determination unit B, a type generation unit C, a type holding unit 19, and a regular expression determination unit 20.
 データ生成部Aは、文字列分割部10と、分散表現生成部11と、組み合わせ生成部12と、分散表現連結部13と、学習データ生成部14とを備える。判定部Bは、種別分類部15と、種別判定部16と、類義語変換部21とを備える。種別生成Cは、類似語抽出部17と、学習データ更新部18とを備える。 The data generation section A includes a character string division section 10, a distributed expression generation section 11, a combination generation section 12, a distributed expression concatenation section 13, and a learning data generation section 14. The determination unit B includes a type classification unit 15, a type determination unit 16, and a synonym conversion unit 21. The type generation C includes a similar word extraction section 17 and a learning data update section 18.
 文字列分割部10(分割部)は、入力された情報を単語に分割する。具体的には、文字列分割部10は、情報の文字列を形態素解析し、それだけで意味を持つ、最小単位の単語に分割する(非特許文献1参照)。本実施形態の情報は、keyとvalueを含む。文字列分割部10には、学習時にはkeyとvalueを含む学習用の情報が入力され、判定時には、keyとvalueを含む入力情報(判定対象の情報)が入力される。 The character string dividing unit 10 (dividing unit) divides input information into words. Specifically, the character string division unit 10 performs morphological analysis on the character string of information and divides it into minimum unit words that have meaning by themselves (see Non-Patent Document 1). Information in this embodiment includes a key and a value. The character string dividing unit 10 receives learning information including a key and value during learning, and receives input information including a key and value (information to be determined) during determination.
The distributed representation generation unit 11 generates a distributed representation of each word produced by the character string division unit 10. Distributed representation is a natural language processing technique that represents a word as a high-dimensional real-valued vector (see Non-Patent Document 2). Expressing the meaning of a word mathematically makes it possible to perform arithmetic on word meanings.
The combination generation unit 12 combines the distributed representations of the words of the key to generate a plurality of distributed representation patterns. The combination generation unit 12 may generate the patterns by combining the distributed representations of the words of the key from the front or from the rear.
For each pattern generated by the combination generation unit 12, the distributed representation concatenation unit 13 combines the distributed representation of the key for that pattern with the distributed representations of the words of the value to generate a concatenated distributed representation. As the concatenated distributed representation, the distributed representation concatenation unit 13 may calculate the sum of the distributed representation of the key for the pattern and the distributed representations of the words of the value.
The learning data generation unit 14 generates learning data including the concatenated distributed representation of each pattern and the type number of the type corresponding to the learning information.
The type classification unit 15 includes a learning unit and a classifier. The learning unit generates the classifier by machine learning on the learning data. The classifier is a trained model for classifying input information including a key and a value into one of the types. The concatenated distributed representation of each pattern, generated from the input information, is input to the classifier of this embodiment. The classifier may output a confidence for each type for the input concatenated distributed representation.
An SVM (support-vector machine) may be used as the classifier (see Non-Patent Document 3). The method referred to here focuses on the confidence with which a multi-class classifier makes its recognition, and aims at more accurate estimation. When classifying into K classes, it follows the one-vs-all SVM approach: K boundary surfaces are created, and the type is decided using confidences calculated from the distance between the classification target and each boundary surface. In this embodiment, the confidence of the classifier is calculated using the distance from the SVM boundary surface. The confidence is larger the farther the target lies from the boundary surface on its positive side. The calculation of the confidence is described later.
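As a rough illustration of a confidence derived from the distance to a boundary surface, the following sketch computes the signed distance from a point to each of K one-vs-all hyperplanes. The hyperplane parameters, the toy input vector, and the name `signed_distance` are illustrative assumptions; this is not the confidence calculation of the embodiment, which is described later.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0.

    Positive values lie on the positive side of the boundary surface.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical one-vs-all boundary surfaces: type number -> (w, b).
hyperplanes = {
    0: ([1.0, 0.0], -1.0),
    1: ([0.0, 2.0], -4.0),
}

x = [3.0, 1.0]  # toy stand-in for a concatenated distributed representation
confidences = {t: signed_distance(w, b, x) for t, (w, b) in hyperplanes.items()}
print(confidences)
```

In this toy setting the point lies on the positive side of the type 0 surface and on the negative side of the type 1 surface, so only type 0 receives a positive confidence.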
The type determination unit 16 (determination unit) uses the confidences that the classifier outputs for each type for each pattern to determine whether the type of the input information is one of the existing types or a new type. Specifically, the type determination unit 16 calculates, for each type, the average of the confidences output by the classifier over the patterns. If the highest average is positive, it determines that the type of the input information is the type with the highest average; if the highest average is negative, it determines that the type is a new type. When the highest average is negative, the type determination unit 16 may instead repeat the determination using converted information in which each word of the key has been replaced with a synonym produced by the synonym conversion unit 21. When the type determination unit 16 determines that the input information has a new type, it generates a type number for the new type (a new type number) and stores that type number in the type holding unit 19 together with the new type.
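The averaging rule above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `determine_type`, the dictionary layout, and the use of `None` to signal a new type are hypothetical, and an average of exactly zero (a case the text does not specify) is treated here as a new type.

```python
def determine_type(confidences_per_pattern):
    """Pick a type from per-pattern confidences.

    confidences_per_pattern: list of {type_number: confidence}, one dict per pattern.
    Returns the type number with the highest average confidence if that
    average is positive, otherwise None (meaning "new type").
    """
    n = len(confidences_per_pattern)
    type_numbers = confidences_per_pattern[0].keys()
    averages = {
        t: sum(conf[t] for conf in confidences_per_pattern) / n
        for t in type_numbers
    }
    best = max(averages, key=averages.get)
    return best if averages[best] > 0 else None

# Two patterns, two existing types (toy confidence values).
known = determine_type([{0: 0.5, 1: -0.25}, {0: 1.5, 1: -0.75}])
unknown = determine_type([{0: -0.5, 1: -1.0}, {0: -1.5, 1: -2.0}])
print(known, unknown)
```

Averaging over patterns means that a single pattern dominated by a shared suffix or prefix cannot decide the type on its own.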
The synonym conversion unit 21 converts each word of the key of the input information into a synonym. For example, the synonym conversion unit 21 may perform the conversion using a database such as the classified vocabulary table (分類語彙表). The classified vocabulary table is a thesaurus (a collection of synonyms) that classifies and organizes words by meaning; its database is published by the Language Resource Development Center of the National Institute for Japanese Language and Linguistics. The synonym conversion unit 21 may, for example, convert each word into a synonym using the "classification item" (分類項目) contained in the records of the classified vocabulary table.
When the type determination unit 16 determines that the type is a new type, the similar word extraction unit 17 extracts similar words of the key of the input information (first similar words) and similar words of the value of the input information (second similar words). The distributed representation generation unit 11 then generates distributed representations of the similar words of the key and of the similar words of the value. The distributed representation concatenation unit 13 combines the distributed representations of the similar words of the key with the distributed representations of the similar words of the value to generate additional concatenated distributed representations. The distributed representation concatenation unit 13 may also generate additional concatenated distributed representations by combining the distributed representation of the key with the distributed representations of the similar words of the value, or by combining the distributed representations of the similar words of the key with the distributed representation of the value.
The learning data update unit 18 generates learning data containing the new type number (additional learning data). Specifically, the learning data update unit 18 generates learning data including the additional concatenated distributed representations and the new type number. The learning unit of the type classification unit 15 retrains the classifier using the additional learning data generated by the learning data update unit 18.
The type holding unit 19 stores types and type numbers in association with each other. The regular expression determination unit 20 determines the type of the input information when the value of the input information cannot be converted into a distributed representation (that is, when a regular expression is used).
The information identification device 1 of this embodiment described above can improve identification accuracy by generating key patterns that take into account words shared among keys. The information identification device 1 of this embodiment may also perform synonym conversion on the key of the input information. Taking synonyms of the key into account can improve identification accuracy.
The middle B system 3 is the system of a service provider. The information identification device 1 of this embodiment is operated by the middle B. The operator terminal 7 is a terminal used by an operator of the middle B. The first B system 5 is the system of a partner business operator related to the service provided by the middle B. In the example shown in FIG. 1, the information identification device 1 identifies the type of information input from the middle B system 3 and outputs the identification result to the first B system 5.
<Learning phase>
FIG. 2 is an explanatory diagram illustrating the operation of the data generation unit A of the information identification device 1. The illustrated operations are performed in both the learning phase and the identification phase (except for the operation of the learning data generation unit 14, which is performed only in the learning phase). The learning phase is used as the example in the description below.
The operator terminal 7 transmits the learning information (a character string) entered by the operator to the character string division unit 10 of the information identification device 1. The character string division unit 10 divides the input information into words by morphological analysis.
In the illustrated example, the character string division unit 10 divides the input 「申込者名:山田太郎」 ("applicant name: Taro Yamada") into 「申込」 (application), 「者」 (person), 「名」 (name), 「山田」 (Yamada), and 「太郎」 (Taro), and outputs them to the distributed representation generation unit 11. Here, 「申込者名」 is the key and 「山田太郎」 is the value.
The distributed representation generation unit 11 generates a distributed representation of each word produced by the character string division unit 10 and outputs it to the combination generation unit 12. That is, the distributed representation generation unit 11 converts each word into a high-dimensional real-valued vector.
The combination generation unit 12 combines the distributed representations of the words of the key to generate a plurality of patterns. In the illustrated example, the combination generation unit 12 combines the distributed representations of the divided words of the key one at a time from the front to generate a plurality of patterns, and outputs them to the distributed representation concatenation unit 13. This reduces the influence of shared suffixes such as 「~名」 ("... name"). Here, the following three patterns are generated.
(1) Distributed representation of 「申込」
(2) Distributed representation of 「申込」 + distributed representation of 「者」
(3) Distributed representation of 「申込」 + distributed representation of 「者」 + distributed representation of 「名」
For each generated pattern, the distributed representation concatenation unit 13 combines the distributed representation of the pattern with the distributed representations of the words of the value to generate a concatenated distributed representation, and outputs it to the learning data generation unit 14.
Here, the following three concatenated distributed representations are generated.
(1) Distributed representation of 「申込」 + distributed representation of 「山田」 + distributed representation of 「太郎」
(2) Distributed representation of 「申込」 + distributed representation of 「者」 + distributed representation of 「山田」 + distributed representation of 「太郎」
(3) Distributed representation of 「申込」 + distributed representation of 「者」 + distributed representation of 「名」 + distributed representation of 「山田」 + distributed representation of 「太郎」
Using combinations of distributed representations prevents multiple types from being assigned to a word that has multiple meanings, and improves classification accuracy. In this embodiment, the combination generation unit 12 and the distributed representation concatenation unit 13 calculate the sum of the distributed representations of the words as the combination of distributed representations.
The learning data generation unit 14 receives, from the operator terminal 7, the type number corresponding to the input learning information, generates learning data including the concatenated distributed representations of the plurality of patterns and the type number, and outputs the learning data to the type classification unit 15. In the illustrated learning data, the type number "0" is set for each concatenated distributed representation. The learning data is used to train the classifier of the type classification unit 15. The learning unit of the type classification unit 15 generates the classifier by machine learning using the learning data.
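The FIG. 2 flow (front-to-back key patterns, summation with the value words, and labelling with a type number) can be sketched as follows. The two-dimensional vectors are made-up toy values standing in for real distributed representations, and the function names are hypothetical.

```python
from itertools import accumulate

def vec_add(a, b):
    """Element-wise sum of two vectors of equal length."""
    return [x + y for x, y in zip(a, b)]

def forward_patterns(key_vectors):
    """Cumulative sums of the key word vectors, combined from the front."""
    return list(accumulate(key_vectors, vec_add))

def concatenated_representations(key_vectors, value_vectors):
    """Sum each key pattern with the sum of all value word vectors."""
    value_sum = [sum(column) for column in zip(*value_vectors)]
    return [vec_add(pattern, value_sum) for pattern in forward_patterns(key_vectors)]

# Toy embeddings (assumed values, not real word vectors).
emb = {
    "申込": [1.0, 0.0], "者": [0.0, 1.0], "名": [0.5, 0.5],
    "山田": [2.0, 0.0], "太郎": [0.0, 2.0],
}
key = [emb[w] for w in ("申込", "者", "名")]
value = [emb[w] for w in ("山田", "太郎")]

reps = concatenated_representations(key, value)
type_number = 0  # type number supplied for this learning information
learning_data = [(rep, type_number) for rep in reps]
print(learning_data)
```

Each key pattern yields one labelled sample, so a three-word key contributes three samples that share the same type number.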
FIG. 3 is an explanatory diagram illustrating another operation of the data generation unit A of the information identification device 1. The illustrated operations are performed in both the learning phase and the identification phase (except for the operation of the learning data generation unit 14). The learning phase is used as the example in the description below. In FIG. 3, to reduce the influence of a key containing a prefix such as 「お~」 (an honorific prefix), the distributed representations of the words of the key are combined from the rear to generate a plurality of patterns.
As in FIG. 2, the character string division unit 10 divides the information transmitted from the operator terminal 7 into words by morphological analysis. In the illustrated example, the character string division unit 10 divides the input 「お申込者名:山田太郎」 (the honorific form of "applicant name: Taro Yamada") into 「お」 (honorific prefix), 「申込」 (application), 「者」 (person), 「名」 (name), 「山田」 (Yamada), and 「太郎」 (Taro), and outputs them to the distributed representation generation unit 11. Here, 「お申込者名」 is the key and 「山田太郎」 is the value.
As in FIG. 2, the distributed representation generation unit 11 generates a distributed representation of each divided word and outputs it to the combination generation unit 12. The combination generation unit 12 combines the distributed representations of the words of the key to generate a plurality of patterns. In the illustrated example, the combination generation unit 12 combines the divided words of the key one at a time from the rear to generate a plurality of patterns, and outputs them to the distributed representation concatenation unit 13. This reduces the influence of prefixes such as 「お~」. Here, the following four patterns are generated.
(1) Distributed representation of 「名」
(2) Distributed representation of 「者」 + distributed representation of 「名」
(3) Distributed representation of 「申込」 + distributed representation of 「者」 + distributed representation of 「名」
(4) Distributed representation of 「お」 + distributed representation of 「申込」 + distributed representation of 「者」 + distributed representation of 「名」
As in FIG. 2, for each generated pattern, the distributed representation concatenation unit 13 combines the distributed representation of the pattern with the distributed representations of the words of the value to generate a concatenated distributed representation, and outputs it to the learning data generation unit 14.
As in FIG. 2, the learning data generation unit 14 receives the type number of the learning information from the operator terminal 7, generates learning data including the plurality of concatenated distributed representations and the type number, and outputs the learning data to the type classification unit 15. The learning unit of the type classification unit 15 generates the classifier by machine learning using the learning data.
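The rear-to-front combination of FIG. 3 can be sketched at the word level as follows; in the device, each pattern would then be turned into a sum of the corresponding word vectors. The function name `backward_patterns` is a hypothetical label for this illustration.

```python
def backward_patterns(key_words):
    """Suffixes of the key, shortest first: words are added one at a time from the rear."""
    return [key_words[i:] for i in range(len(key_words) - 1, -1, -1)]

patterns = backward_patterns(["お", "申込", "者", "名"])
for pattern in patterns:
    print(" + ".join(pattern))
```

Because the honorific prefix 「お」 appears only in the longest pattern, the three shorter patterns are identical to those that a key without the prefix would produce from the rear.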
As described above, in the learning phase of this embodiment, the information identification device 1 divides information including a key and a value into words, calculates a distributed representation of each word, generates patterns that combine the distributed representations of the words of the key, generates for each pattern a concatenated distributed representation that combines the distributed representation of the key with the distributed representations of the words of the value, and generates learning data including the concatenated distributed representations and the type number. As a result, in the identification phase described later, input containing words with multiple meanings or synonyms can still be identified as the appropriate type.
In FIGS. 2 and 3, the patterns were generated by combining the distributed representations of the words of the key from the front or from the rear, but the combination is not limited to these examples. The combination generation unit 12 can generate distributed representation patterns by combining the distributed representations of the words of the key in various ways.
FIG. 4 shows the distributed representations (vectors) of the individual words 「姓」 (surname), 「氏名」 (full name), 「住所」 (address), and 「山口」 (Yamaguchi), together with the concatenated distributed representations of 「姓:山口」, 「氏名:山口」, and 「住所:山口」. In the illustrated example, the keys 「姓」, 「氏名」, and 「住所」 are words that cannot be divided any further, so the combination generation unit 12 generates a single pattern for each key. As illustrated, sums of distributed representations of the same type are mapped to nearby positions (that is, they cluster). Here, 「山口」 used in the similar senses of 「氏名」 and 「姓」 is converted into similar distributed representations, whereas 「山口」 used in the different senses of 「氏名」 and 「住所」 is converted into distributed representations that reflect each meaning.
As FIG. 4 shows, using the sum of distributed representations as their combination maps information of the same type to nearby positions. Therefore, generating learning data for multi-class classification from the sum of the distributed representation of the key and the distributed representation of the value makes it possible to distinguish the meanings of a word that has multiple meanings and to recognize that different words belong to the same type.
In other words, taking the sum of distributed representations converts information containing synonyms into similar distributed representations and converts information containing polysemous words into distributed representations specific to each usage, which can be used to identify information containing synonyms or polysemous words.
FIG. 5 is an explanatory diagram illustrating the patterns of distributed representations of the key generated by the combination generation unit 12 of this embodiment.
If the data generation unit A simply output the sum of the distributed representations of the key and the value, words shared among keys (suffixes, prefixes, and the like) would strongly affect identification, and correct identification could be difficult. In this embodiment, as described above, multiple patterns of combinations of the distributed representations of the words (morphemes) of the key of the input information are generated, so that multiple concatenated distributed representations (vectors) are produced as input to the type classification unit 15 (classifier). This reduces the influence of the shared words.
The illustrated example shows a case in which input information whose key is 「サービス名」 (service name) is input to the character string division unit 10 while 「会社名」 (company name) exists as an existing type.
In the case of 5A in FIG. 5, the distributed representation obtained by combining (summing) the distributed representation of 「サービス」 (service) and that of 「名」 (name) in 「サービス名」 (service name) is input to the classifier. The cosine similarity between 「会社名」 (company name) and 「サービス名」 is high (0.84), and as illustrated, their distributed representations are mapped to nearby positions. As a result, the classifier erroneously determines that input information with the key 「サービス名」 has the type 「会社名」.
In contrast, the cosine similarity between 「会社」 (company) and 「サービス」 (service) is low (0.43), and as illustrated, their distributed representations are mapped to distant positions. This indicates that the high cosine similarity between the representation combining 「会社」 and 「名」 (name) and the representation combining 「サービス」 and 「名」 is due to the shared suffix 「名」.
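The effect of a shared suffix on cosine similarity can be illustrated with toy vectors. The three-dimensional vectors below are made-up assumptions chosen so the arithmetic is easy to follow; they do not reproduce the 0.84 and 0.43 values reported for FIG. 5.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

# Toy vectors: "company" and "service" are orthogonal; "name" is a shared suffix.
company = [1.0, 0.0, 0.0]
service = [0.0, 1.0, 0.0]
name = [0.0, 0.0, 2.0]

sim_without_suffix = cosine(company, service)
sim_with_suffix = cosine(vec_add(company, name), vec_add(service, name))
print(sim_without_suffix, round(sim_with_suffix, 2))
```

In this toy case two unrelated words become highly similar once the same suffix vector is added to both, which is the kind of distortion the multiple-pattern approach is meant to counteract.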
As shown in 5B of FIG. 5, to reduce the influence of such suffixes or prefixes, the combination generation unit 12 of this embodiment generates a plurality of patterns consisting of the distributed representation of a single word of the key and distributed representations that combine at least two words. Here, two patterns are generated: the distributed representation combining 「サービス」 (service) and 「名」 (name), and the distributed representation of 「サービス」 alone. For each pattern, the concatenated distributed representation obtained by combining it with the distributed representations of the words of the value is input to the classifier, which reduces the influence of the suffix 「名」 and reduces erroneous type determination. In the illustrated example, the type of 「サービス名」 (service name) can be determined to be a new type rather than 「会社名」 (company name).
FIG. 6 is a sequence diagram showing the operation of the information identification device 1 in the learning phase.
The operator terminal 7 receives an instruction from the operator and inputs learning information to the character string division unit 10 of the information identification device 1 (step S21). The learning information contains at least one piece of learning data including a key and a value. The character string division unit 10 divides the input information into words (step S22). The distributed representation generation unit 11 generates a distributed representation of each divided word and outputs it to the combination generation unit 12 (step S23).
The combination generation unit 12 combines the distributed representations of the words of the key to generate distributed representation patterns (step S24). When the key consists of a single word, a single pattern is generated.
The distributed representation concatenation unit 13 generates a concatenated distributed representation for each pattern and outputs it to the learning data generation unit 14 (step S25). Specifically, for each pattern, the distributed representation concatenation unit 13 combines the distributed representation of the pattern with the distributed representations of the words of the value to generate the concatenated distributed representation.
The operator terminal 7 receives an instruction from the operator and requests, from the type holding unit 19, the type number of the information transmitted in step S21 (step S26). The type holding unit 19 transmits the type number corresponding to the information transmitted in step S21 to the operator terminal 7 (step S27). On acquiring the type number, the operator terminal 7 transmits it to the learning data generation unit 14 (step S28).
The learning data generation unit 14 generates learning data including the concatenated distributed representations received in step S25 and the type number received in step S28, and sends the learning data to the type classification unit 15 (step S29). The learning unit of the type classification unit 15 generates the classifier by machine learning on the learning data (step S30).
In FIG. 6, the operator terminal 7 requests the type number from the type holding unit 19 in step S26, but the learning data generation unit 14 may request the type number from the type holding unit 19 instead. In that case, the type holding unit 19 sends the type number to the learning data generation unit 14 in step S27, and the learning data generation unit 14 generates the learning data using the concatenated distributed representations and the type number acquired from the type holding unit 19 (step S29).
Next, the operation of registering, in the type holding unit 19, information that cannot be converted into a distributed representation, such as postal codes and telephone numbers, is described (steps S31 and S32). For such information, in the identification phase described later, the regular expression determination unit 20 of the information identification device 1 uses regular expressions to determine the type from the pattern of digits and characters. Steps S31 and S32 are performed asynchronously with steps S21 to S30.
The operator terminal 7 receives a type and the corresponding regular expression pattern from the partner business operator of the first B system 5 (step S31), and transmits the type and the regular expression pattern to the type holding unit 19 of the information identification device 1 (step S32). The type holding unit 19 stores the transmitted type and regular expression pattern in its own storage unit.
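As a minimal sketch of how registered type and pattern pairs might be used for determination, the following matches a value against each stored regular expression. The dictionary `REGISTERED_PATTERNS`, the type name `postal_code`, and the function name are hypothetical; the postal code pattern assumes the format described for FIG. 6 (the 〒 mark, three digits, a hyphen, and four digits).

```python
import re

# Hypothetical contents of the type holding unit: type name -> regular expression.
REGISTERED_PATTERNS = {
    "postal_code": re.compile(r"〒[0-9]{3}-[0-9]{4}"),
}

def determine_by_regex(value, patterns=REGISTERED_PATTERNS):
    """Return the first registered type whose pattern matches the whole value, else None."""
    for type_name, pattern in patterns.items():
        if pattern.fullmatch(value):
            return type_name
    return None

print(determine_by_regex("〒123-4567"))
print(determine_by_regex("山田太郎"))
```

Using `fullmatch` rather than `search` ensures that a value which merely contains a postal code somewhere inside a longer string is not classified as one.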
 図6では、種別が郵便番号の正規表現のパターンと、種別が電話番号の正規表現のパターンとを例示している。郵便番号の正規表現では、〒マークではじまり、数字3桁と-(ハイフン)と数字4桁で終わることを表している。 FIG. 6 illustrates a regular expression pattern whose type is postal code and a regular expression pattern whose type is telephone number. The regular expression for postal codes indicates that the code starts with the 〒 mark, ends with three digits, a - (hyphen), and four digits.
 <Identification phase>
 In the identification phase, the information identification device 1 uses the confidence calculated by the classifier of the type classification unit 15 to determine whether the type of the input information can be identified as one of the types held in the type holding unit 19.
 FIG. 7 is a diagram showing an example of the overall flow of the information identification device 1 in the identification phase. In the illustrated identification phase, the concatenated distributed representations output from the data generation unit A are input to the classifier, and the average of the confidences obtained from the classifier is calculated for each type. If the maximum of these averages is positive, the type with that maximum is determined to be the type of the input information. If the maximum is negative, each word of the key is converted into a synonym, the concatenated distributed representations are generated again, and the average confidences are recalculated. If the maximum is then positive, the input information is determined to be of the corresponding type. If it is still negative, the input information is judged unidentifiable, that is, of a new type not held in the type holding unit 19, and processing moves to the type generation unit C.
 Specifically, the middle B system 3 transmits arbitrary information to the data generation unit A of the information identification device 1. In FIG. 7, "service name: FLET'S Hikari" is assumed to be input, where "service name" is the key and "FLET'S Hikari" is the value. The character string dividing unit 10 of the data generation unit A divides the input information into words. The distributed expression generation unit 11 generates a distributed representation of each word.
 The combination generation unit 12 combines the distributed representations of the key words to generate a plurality of patterns. Here, it generates two patterns: (1) the distributed representation of "service", and (2) the distributed representation of "service" + the distributed representation of "name". For each pattern, the distributed expression concatenation unit 13 combines the distributed representation of the pattern with the distributed representations of the value words to generate a concatenated distributed representation, and inputs it to the determination unit B. Here, the following are generated, where each quoted word stands for its distributed representation: (1) "service" + "FLET'S" + "Hikari", and (2) "service" + "name" + "FLET'S" + "Hikari". The distributed expression concatenation unit 13 calculates the concatenated distributed representation as the sum of the distributed representations of the words.
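 The pattern generation and summation described above can be sketched as follows. This is an illustration under stated assumptions: the patterns are taken to be cumulative prefixes of the key words (the text only shows the two-word case), the function names are hypothetical, and the two-dimensional vectors in the usage note are toy values rather than real embeddings.

```python
def key_patterns(key_words):
    """Return cumulative prefixes of the key words: [w1], [w1, w2], ...

    Assumption: the source only shows "service" and "service" + "name".
    """
    return [key_words[: i + 1] for i in range(len(key_words))]


def vector_sum(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def concatenated_representations(key_words, value_words, embed):
    """For each key pattern, sum the vectors of the pattern words and the value words."""
    return [
        vector_sum([embed[word] for word in pattern + value_words])
        for pattern in key_patterns(key_words)
    ]
```

 For example, with the toy mapping `{"service": [1.0, 0.0], "name": [0.0, 1.0], "FLET'S": [2.0, 2.0], "Hikari": [0.5, 0.5]}`, the call `concatenated_representations(["service", "name"], ["FLET'S", "Hikari"], embed)` yields one summed vector per pattern.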
 The classifier of the type classification unit 15 of the determination unit B outputs a confidence for each type for the concatenated distributed representation of each input pattern (step S10). The type determination unit 16 calculates the average confidence for each type (step S11) and determines whether the highest of these averages is positive (step S12). If it is positive (step S12: YES), the type determination unit 16 determines that the input information is of the type with the maximum confidence (step S13). That is, the type determination unit 16 assigns the type with the maximum confidence to the input information.
 The type determination unit 16 then adds the determined type (or type number) to the input information and outputs it to the first B system 5. Each of the functional units 10 to 21 outputs the data it has processed together with the data that was input to it. The type determination unit 16 therefore obtains the information input to the character string dividing unit 10 via the distributed expression generation unit 11, the combination generation unit 12, the distributed expression concatenation unit 13, and the type classification unit 15.
 On the other hand, if the highest confidence is not positive (step S12: NO), the type determination unit 16 determines whether the concatenated distributed representation has already undergone synonym conversion (step S14). In the illustrated example, the classifier outputs confidences for three types, "full name", "address", and "company name"; the highest per-type average is that of "full name", and its value is negative (-0.25). Having determined that the highest confidence is negative (step S12: NO), the type determination unit 16 determines whether synonym conversion has been performed, for example by using a synonym conversion flag. The synonym conversion flag is set in the output data by the type determination unit 16 or the synonym conversion unit 21 when the highest confidence is not positive. Because each functional unit outputs the data input to it together with the data it has processed, the type determination unit 16 can determine whether the flag is present.
 If synonym conversion has not been performed (step S14: NO), the synonym conversion unit 21 converts each word of the key of the input information into a synonym (step S15). Here, using the aforementioned classification vocabulary table or the like, the synonym conversion unit 21 converts the key words "service" and "name" into "price/cost" and "name". The converted information, in which the key of the input information is replaced with the converted words ("price/cost name: FLET'S Hikari"), is then input to the data generation unit A.
 As before, the data generation unit A outputs the concatenated distributed representation for each pattern of the converted information to the type classification unit 15. Here, the following are generated, where each quoted word stands for its distributed representation: (1) "price" + "cost" + "FLET'S" + "Hikari", and (2) "price" + "cost" + "name" + "FLET'S" + "Hikari".
 The classifier of the determination unit B outputs a confidence for each type for the concatenated distributed representation of each input pattern (step S10). The type determination unit 16 averages, for each type, the per-pattern confidences output from the type classification unit 15 (step S11) and determines whether the highest average is positive (step S12). If it is positive (step S12: YES), the process proceeds to step S13.
 If the highest value is not positive (step S12: NO), the type determination unit 16 determines whether synonym conversion has been performed (step S14). In the illustrated example, the confidence of "full name", which is the highest value after synonym conversion, is negative (-0.35) (step S12: NO), and synonym conversion has already been performed (step S14: YES).
 In this case, the type determination unit 16 determines that the type of the input information cannot be identified by the current classifier, and adds a type number for a new type to the type holding unit 19 (step S16). The type determination unit 16 then outputs the distributed representation of the key and the distributed representation of the value of the input information to the similar word extraction unit 17 of the type generation unit C (step S17). Because each functional unit outputs the data input to it together with the data it has processed, the type determination unit 16 can obtain the distributed representations of the key and the value generated by the distributed expression generation unit 11.
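 The decision flow of steps S11 to S16 can be summarized in a short sketch. The function names, the dictionary layout of the classifier output, and the returned labels are hypothetical choices made for illustration; a confidence of exactly zero is treated as "not positive", following the wording of step S12.

```python
def average_confidences(per_pattern):
    """Average the confidence of each type over all patterns (step S11).

    per_pattern is a list with one dict per pattern, mapping type -> confidence.
    """
    types = per_pattern[0].keys()
    return {t: sum(p[t] for p in per_pattern) / len(per_pattern) for t in types}


def decide(per_pattern, synonym_converted):
    """Return ("identified", type), ("retry_with_synonyms", None) or ("new_type", None)."""
    averages = average_confidences(per_pattern)
    best_type = max(averages, key=averages.get)
    if averages[best_type] > 0:          # step S12: YES -> step S13
        return ("identified", best_type)
    if not synonym_converted:            # step S14: NO -> step S15
        return ("retry_with_synonyms", None)
    return ("new_type", None)            # step S14: YES -> step S16
```

 Keeping the synonym-converted state as an explicit argument mirrors the synonym conversion flag mentioned in the text.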
 The similar word extraction unit 17 uses the distributed representations of the key and the value to extract similar words for the key and for the value, and outputs them to the distributed expression generation unit 11. For example, the similar word extraction unit 17 extracts the similar words of the key and the value using the FastText model of Non-Patent Document 2 and cosine similarity.
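 As a rough illustration of similar-word extraction by cosine similarity: the sketch below ranks a small in-memory vocabulary instead of querying a FastText model, so the vocabulary dictionary, the function names, and the `top_n` parameter are all assumptions. Vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_words(query_vector, vocabulary, top_n=2):
    """Return the top_n words whose vectors are most similar to query_vector."""
    ranked = sorted(
        vocabulary,
        key=lambda word: cosine_similarity(query_vector, vocabulary[word]),
        reverse=True,
    )
    return ranked[:top_n]
```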
 The distributed expression generation unit 11 generates a distributed representation of each similar word of the key and each similar word of the value. The combination generation unit 12 combines the distributed representations of the key words to generate a plurality of patterns. For each pattern, the distributed expression concatenation unit 13 combines the distributed representations of the key and its similar words with those of the value and its similar words to generate additional concatenated distributed representations. For example, the following combinations are generated for each pattern.
 ・The sum of the distributed representation of (key) and the distributed representation of (value)
 ・The sum of the distributed representation of (key) and the distributed representation of (similar word of value)
 ・The sum of the distributed representation of (similar word of key) and the distributed representation of (similar word of value)
 ・The sum of the distributed representation of (similar word of key) and the distributed representation of (value)
 The learning data update unit 18 generates additional learning data including the concatenated distributed representations generated by the distributed expression concatenation unit 13 and the new type number issued by the type determination unit 16 in step S16, and outputs it to the type classification unit 15. The learning unit of the type classification unit 15 retrains the classifier using the additional learning data.
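 The four kinds of sums listed above, paired with the new type number, could be produced along the following lines. This is only a sketch: the function names and the (vector, type number) row layout are hypothetical, and each key or value is represented by a single already-summed vector for brevity.

```python
from itertools import product


def additional_representations(key_vec, value_vec, key_similar_vecs, value_similar_vecs):
    """Sum every key-side vector with every value-side vector.

    The key side holds the key and its similar words; the value side holds
    the value and its similar words.
    """
    key_side = [key_vec] + key_similar_vecs
    value_side = [value_vec] + value_similar_vecs
    return [
        [a + b for a, b in zip(k, v)]
        for k, v in product(key_side, value_side)
    ]


def training_rows(representations, type_number):
    """Attach the new type number to each representation."""
    return [(rep, type_number) for rep in representations]
```

 With one similar word on each side this yields exactly the four combinations in the list; with more similar words the number of rows grows multiplicatively.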
 Taking the sum of distributed representations as their combination maps information of the same type to nearby positions, as shown in FIG. 4. Therefore, by generating additional learning data for the new type from the sum of the distributed representations of the new type's key and value and from the sums of the distributed representations of their similar words, and by training the classifier on that data, the new type becomes identifiable.
 That is, when a new type is generated, the combination of the distributed representation of the key and the distributed representation of the value of the new type is used as additional learning data. In addition, similar words of the key and the value are extracted, and the classifier is trained on learning data built from the sums of their distributed representations. This creates a class for the new type in the classifier and improves the identification accuracy for the new type.
 FIG. 8 is a conceptual diagram of the processing of the synonym conversion unit 21. The illustrated example shows a case where input information whose key is "surname" (姓) is input to the information identification device 1 while "full name" (氏名), "address", and "company name" exist as types. The distributed representations of "surname" and "full name" have a low cosine similarity and are mapped to relatively distant positions. When the synonym conversion unit 21 converts "surname" into its synonym "given name" (名), the distributed representations of "given name" and "full name" have a high cosine similarity and are mapped to nearby positions. As a result, the classifier can appropriately determine that input information with the key "surname" belongs to the existing type "full name".
 In this way, in this embodiment, the key of the input information is converted into synonyms, and the converted key is used to determine whether the input information corresponds to any of the existing types. Even when the combination of key and value distributed representations alone yields a low similarity between synonymous words, making it hard to judge them as the same type, this bridges the gap between word meaning and distributed-representation similarity and allows a correct determination.
 FIGS. 9 and 10 are sequence diagrams showing the operation of the information identification device 1 in the identification phase. FIG. 9 shows the operation when the value of the input information can be converted into a distributed representation, and FIG. 10 shows the operation when it cannot.
 In FIG. 9, the middle B system 3 inputs information including a key and a value to the character string dividing unit 10 of the information identification device 1 (step S61). The character string dividing unit 10 divides the information into words (character strings) (step S62). The distributed expression generation unit 11 generates a distributed representation of each word and outputs it to the combination generation unit 12 (step S63).
 The combination generation unit 12 combines the distributed representations of the key words to generate a plurality of patterns and outputs them to the distributed expression concatenation unit 13 (step S64). For each pattern, the distributed expression concatenation unit 13 combines the distributed representation of the pattern with the distributed representations of the value words to generate a concatenated distributed representation, and outputs it to the type classification unit 15 (step S65).
 For each input pattern, the classifier of the type classification unit 15 outputs a confidence for each type for the concatenated distributed representation (step S66). If the classifier distinguishes K types, it calculates K confidences per pattern.
 The type determination unit 16 calculates the average confidence for each type. If the maximum average confidence is positive, it determines that the information input in S61 is of the type with that maximum value. The type determination unit 16 then transmits an identification result including the information input in S61 and the type with the maximum confidence (type name and/or type number) to the first B system 5 (step S67).
 On the other hand, if the maximum average confidence is negative and synonym conversion by the synonym conversion unit 21 has already been performed, the type determination unit 16 registers the key of the information input in S61 in the type holding unit 19 (step S68). The type holding unit 19 issues a type number for the registered key (step S69) and outputs a completion notification to the type determination unit 16 (step S70). Alternatively, the type determination unit 16 may issue the type number of the key and register the key and the type number in the type holding unit 19 in step S67.
 On receiving the completion notification, the type determination unit 16 outputs the distributed representation of the key and the distributed representation of the value to the similar word extraction unit 17 (step S71). The similar word extraction unit 17 extracts words similar to the distributed representation of the key and words similar to the distributed representation of the value, and outputs them to the distributed expression generation unit 11 (step S72). At this time, the similar word extraction unit 17 also outputs the distributed representations of the key and the value acquired in step S70 to the distributed expression generation unit 11 together with the similar words.
 The distributed expression generation unit 11 generates a distributed representation of each input similar word (step S73). The combination generation unit 12 combines the distributed representations of the words of the key's similar words to generate a plurality of patterns (step S74). For each pattern, the distributed expression concatenation unit 13 concatenates the distributed representations of the key and its similar words with those of the value and its similar words to generate additional concatenated distributed representations, and sends them to the learning data update unit 18 (step S75).
 The learning data update unit 18 generates additional learning data including the additional concatenated distributed representations and the new type number issued in step S69, and outputs it to the type classification unit 15 (step S76). The learning unit of the type classification unit 15 retrains the classifier with the additional learning data to update it, and notifies the distributed expression concatenation unit 13 that retraining is complete (step S77).
 The distributed expression concatenation unit 13 returns to step S65 and inputs the concatenated distributed representations of the information input in step S61 to the retrained classifier of the type classification unit 15 again. The classifier outputs a confidence for each type (step S66). The type determination unit 16 determines the type of the input information using the confidences output from the updated classifier.
 Note that if the maximum average confidence is negative and synonym conversion by the synonym conversion unit 21 has not yet been performed, the type determination unit 16 instructs the synonym conversion unit 21 to perform synonym conversion. The synonym conversion unit 21 converts each word of the key of the input information into a synonym and inputs the converted key and the unconverted value to the character string dividing unit 10, after which the processing from step S62 onward is performed.
 Next, with reference to FIG. 10, the operation when the value of the input information cannot be converted into a distributed representation is described.
 The middle B system 3 inputs information including a key and a value to the character string dividing unit 10 of the information identification device 1 (step S61). The character string dividing unit 10 divides the information into words (step S62). Here, "telephone number: 090-1234-5678" (key: value) is input and divided into "telephone", "number", and "090-1234-5678".
 The distributed expression generation unit 11 attempts to generate a distributed representation of each word, but an error occurs for any word that cannot be converted. Here, the distributed expression generation unit 11 outputs "090-1234-5678" (the value), which cannot be converted into a distributed representation, to the regular expression determination unit 20 (step S81).
 When the value is input, the regular expression determination unit 20 requests and acquires from the type holding unit 19 all pairs of regular expression and type number registered there (steps S82 and S83). The regular expression determination unit 20 then determines which of the acquired regular expression patterns the character string of the value input in step S82 matches.
 If the value matches one of the regular expression patterns, the regular expression determination unit 20 determines that the information input in S61 is of the type associated with the matched pattern. The regular expression determination unit 20 then transmits an identification result including the information input in S61 and the type of the matched pattern to the first B system 5 (step S84).
 On the other hand, if the value matches none of the regular expression patterns, the regular expression determination unit 20 transmits an error indicating that the type cannot be identified to the middle B system 3 (step S85).
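 The handling of values that cannot be embedded (steps S81 to S85) might look like the sketch below. The exception class, function names, the vocabulary set, and the type numbers and patterns in the usage note are hypothetical; only the overall flow (find the unembeddable word, try every registered pattern, report an error when nothing matches) follows the text.

```python
import re


class TypeNotIdentifiedError(Exception):
    """Raised when a value matches none of the registered patterns (step S85)."""


def unembeddable_words(words, vocabulary):
    """Return the words that have no distributed representation (step S81)."""
    return [word for word in words if word not in vocabulary]


def identify_by_regex(value, registered):
    """Return the type number of the first matching pattern (steps S82 to S84).

    registered is a list of (pattern string, type number) pairs.
    """
    for pattern, type_number in registered:
        if re.fullmatch(pattern, value):
            return type_number
    raise TypeNotIdentifiedError(f"no registered pattern matches {value!r}")
```

 For instance, with `registered = [(r"〒\d{3}-\d{4}", 10), (r"\d{2,4}-\d{2,4}-\d{4}", 11)]`, the value "090-1234-5678" would resolve to the hypothetical type number 11.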
 <Confidence of type classification>
 Next, the per-type confidence output by the classifier (SVM) of the type classification unit 15 is described. The confidence is calculated from the boundary surface of each class of the multi-class classifier, the distance between the input information and the boundary surface, and whether the input information lies on the positive or negative side of the boundary surface. That is, the classifier calculates a positive or negative confidence from the distance between the boundary surface and the input information.
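 One way to read this is as a signed distance to the hyperplane w·x + b = 0, which is (w·x + b)/‖w‖. The sketch below computes that quantity; whether the device normalizes the SVM output in exactly this way is not stated in the source, so treat the function as an illustration of the idea only.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0.

    Positive on the side the normal vector w points to, negative on the other.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
```

 With w = [3, 4] and b = -5, the point [3, 4] lies on the positive side and the origin on the negative side.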
 As a premise, for confidence to be usable in deciding whether an assignable type exists, the classifier must be able to identify three or more types. With only one identifiable type, no boundary can be drawn, so confidence cannot be calculated. With two identifiable types, the confidence of one of them is always positive, so the unidentifiable case cannot be detected.
 FIG. 11 is an explanatory diagram of how the distance between the boundary surface (hyperplane) and a point is calculated. Let H(h) be the foot of the perpendicular dropped from the point X(x̄) onto the boundary surface. The following vector is perpendicular to the boundary surface and is therefore parallel to the normal vector w of the boundary surface.
Figure JPOXMLDOC01-appb-M000001
 
 Therefore, using a real number k, it can be expressed as follows.
Figure JPOXMLDOC01-appb-M000002
 
 Here, since H(h) is a point on the boundary surface wᵀx + b = 0, the following equation holds.
Figure JPOXMLDOC01-appb-M000003
 
 Therefore, the distance d between the point and the boundary surface is calculated by the following equation.
Figure JPOXMLDOC01-appb-M000004
 
 Accordingly, the following equation holds. That is, the distance between the boundary surface wᵀx + b = 0 of each class of the multi-class classifier and the input information X(x̄) is expressed by the following equation.
Figure JPOXMLDOC01-appb-M000005
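 The equation images are not reproduced in this text. A standard point-to-hyperplane derivation consistent with the surrounding description, assuming that x̄ and h denote the position vectors of X and H, would read as follows; it is a reconstruction, not a copy of the original formulas.

```latex
\begin{align}
\overrightarrow{XH} &= h - \bar{x} \\
h - \bar{x} &= k\,w \\
w^{T}h + b = w^{T}(\bar{x} + k\,w) + b = 0
  \;&\Longrightarrow\; k = -\frac{w^{T}\bar{x} + b}{\lVert w \rVert^{2}} \\
d = \lVert h - \bar{x} \rVert &= \lvert k \rvert \,\lVert w \rVert \\
d &= \frac{\lvert w^{T}\bar{x} + b \rvert}{\lVert w \rVert}
\end{align}
```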
 
 The information identification device 1 of this embodiment described above includes: a character string dividing unit 10 that divides learning information including a key and a value into words; a distributed expression generation unit 11 that generates a distributed representation of each word; a combination generation unit 12 that combines the distributed representations of the key words to generate a plurality of patterns; a distributed expression concatenation unit 13 that, for each pattern, combines the distributed representation of the key in the pattern with the distributed representations of the value words to generate a concatenated distributed representation; a learning data generation unit 14 that generates learning data including the concatenated distributed representation of each pattern and the type number of the type corresponding to the learning information; and a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
 The information identification device 1 of this embodiment also includes a type determination unit 16. The concatenated distributed representation of each pattern generated from the input information is input to the classifier, and the type determination unit 16 uses the confidence that the classifier outputs for each type and each pattern to determine whether the input information is of one of the existing types or of a new type.
 As a result, as shown in FIG. 12, even when the existing systems of different companies submit the same type of information in different formats, this embodiment can identify it as the same type. This makes it possible to automate conversion from the data format of the distribution source system to that of the distribution destination system. Specifically, when a combination of key and value distributed representations is simply input to a classifier, recurring words in the key (suffixes, prefixes, and the like) strongly affect identification and can make correct identification difficult. In this embodiment, the combination generation unit 12 reduces that effect and improves identifiability.
 In addition, in this embodiment, the combination generation unit, which combines the distributed representations of the words of the key to generate a plurality of distributed representation patterns, improves identification accuracy even when the key of the input information contains a prefix or a suffix.
 In addition, this embodiment includes a synonym conversion unit 21 that converts each word of the key of the input information into a synonym. This improves the accuracy of identifying input information that contains synonyms or polysemous words. Specifically, even when the sum of the distributed representations of the key and the value alone gives a low similarity for synonymous words and makes it difficult to judge them to be of the same type, the gap between the meaning of a word and the similarity of its distributed representation is bridged, so that the type can be determined correctly.
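The fallback order can be sketched as below: classify first, and only when no existing type is found, replace the key words with synonyms and classify again. This is an illustration under assumptions. The dictionary `SYNONYMS`, the helper names, and the toy `toy_classify` callable are all hypothetical; how the synonym conversion unit 21 obtains synonyms is not specified in this passage.

```python
# Hypothetical synonym dictionary; a real system would use its own synonym source.
SYNONYMS = {"client": "customer", "tel": "phone", "addr": "address"}


def to_synonyms(key_words, synonyms=SYNONYMS):
    """Replace each key word that has a registered synonym; keep the others unchanged."""
    return [synonyms.get(word, word) for word in key_words]


def identify_with_fallback(key_words, classify):
    """classify(key_words) returns an existing type number, or None when no type fits."""
    result = classify(key_words)
    if result is not None:
        return result
    converted = to_synonyms(key_words)
    if converted == key_words:
        return None  # nothing to convert, so the input stays a new-type candidate
    return classify(converted)


def toy_classify(words):
    """Stand-in classifier that only recognizes keys containing 'customer' and 'name'."""
    return 1 if {"customer", "name"} <= set(words) else None


print(identify_with_fallback(["client", "name"], toy_classify))  # 1
print(identify_with_fallback(["foo", "bar"], toy_classify))      # None
```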
 In addition, in this embodiment, when unknown information that the classifier cannot identify is input, it is correctly identified as information of a new type, learning data for the new type is generated, and the classifier is retrained. The retrained classifier can then classify the new type, which enables automatic information identification in information distribution.
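The retraining loop might look like the following sketch. It assumes a toy nearest-centroid learner whose per-type confidence is the cosine similarity to the type centroid; this learner is swapped in purely for illustration and is not the classifier of the embodiment. The names `train`, `confidences`, and `add_new_type`, and the rule of assigning the next unused type number, are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity, in [-1, 1]; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(rows):
    """Toy learner: one centroid per type number from (vector, type_number) rows."""
    sums, counts = {}, {}
    for vec, type_number in rows:
        acc = sums.get(type_number, [0.0] * len(vec))
        sums[type_number] = [a + b for a, b in zip(acc, vec)]
        counts[type_number] = counts.get(type_number, 0) + 1
    return {t: [x / counts[t] for x in s] for t, s in sums.items()}


def confidences(centroids, vec):
    """Per-type confidence for one vector."""
    return {t: cosine(c, vec) for t, c in centroids.items()}


def add_new_type(rows, new_vectors):
    """Append rows for a new type number and retrain on the updated learning data."""
    new_type = max((t for _, t in rows), default=0) + 1
    updated = rows + [(vec, new_type) for vec in new_vectors]
    return updated, new_type, train(updated)


rows = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 2)]
unknown = [-1.0, -0.2]

before = confidences(train(rows), unknown)   # every existing type scores below zero
rows, new_type, model = add_new_type(rows, [unknown])
after = confidences(model, unknown)
print(new_type, max(after, key=after.get))
```

After retraining, the same input that previously matched no existing type is assigned to the newly registered type.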
 The information identification device 1 described above can be implemented using, for example, a general-purpose computer system such as the one shown in FIG. 13. The illustrated computer system includes a CPU (Central Processing Unit, processor) 901, a memory 902, a storage 903 (HDD: Hard Disk Drive, SSD: Solid State Drive), a communication device 904, an input device 905, and an output device 906. The memory 902 and the storage 903 are storage devices. In this computer system, the functions of the information identification device 1 are realized by the CPU 901 executing a predetermined program loaded into the memory 902.
 The information identification device 1 may be implemented on a single computer or on a plurality of computers. The information identification device 1 may also be a virtual machine implemented on a computer. The program of the information identification device 1 can be stored on a computer-readable recording medium such as an HDD, an SSD, a USB (Universal Serial Bus) memory, a CD (Compact Disc), or a DVD (Digital Versatile Disc), or can be distributed via a network.
 Note that the present invention is not limited to the above embodiment, and various modifications are possible within the scope of its gist. For example, in this embodiment, the synonym conversion unit 21 is provided in the determination unit B. However, the data generation unit A may instead include the synonym conversion unit 21. In this case, as described in this embodiment, the synonym conversion unit 21 may convert each word of the key into a synonym upon receiving an instruction from the type determination unit 16 when the highest of the average confidence values output by the classifier is negative. Alternatively, in the identification phase, the synonym conversion unit 21 may convert the words of the key of the information input to the information identification device 1 into synonyms from the outset. Specifically, the character string dividing unit 10 may divide the input information into words and pass those words to the synonym conversion unit 21, and the synonym conversion unit 21 may convert each word of the key into a synonym and output the result to the distributed representation generation unit 11. The subsequent processing is the same as in the embodiment.
 1: Information identification device
 10: Character string dividing unit (dividing unit)
 11: Distributed representation generation unit
 12: Combination generation unit
 13: Distributed representation concatenation unit
 14: Learning data generation unit
 15: Type classification unit
 16: Type determination unit (determination unit)
 17: Similar word extraction unit
 18: Learning data update unit
 19: Type storage unit
 20: Regular expression determination unit
 3: Middle B system
 5: First B system
 7: Operator terminal

Claims (8)

  1.  An information identification device comprising:
     a dividing unit that divides learning information including a key and a value into words;
     a distributed representation generation unit that generates a distributed representation of each word;
     a combination generation unit that combines the distributed representations of the words of the key to generate a plurality of distributed representation patterns;
     a distributed representation concatenation unit that, for each of the patterns, combines the distributed representation of the key of the pattern with the distributed representations of the words of the value to generate a concatenated distributed representation;
     a learning data generation unit that generates learning data including the concatenated distributed representation for each pattern and a type number of a type corresponding to the learning information; and
     a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
  2.  The information identification device according to claim 1, wherein
     the combination generation unit generates the patterns by combining the distributed representations of the words of the key from the front and from the back.
  3.  The information identification device according to claim 1, wherein
     the concatenated distributed representation for each pattern, generated from the input information, is input to the classifier, and
     the information identification device further comprises a determination unit that uses the confidence that the classifier outputs for each type and each pattern to determine whether the type of the input information is one of the existing types or a new type.
  4.  The information identification device according to claim 3, further comprising
     a synonym conversion unit that converts each word of the key of the input information into a synonym, wherein
     the determination unit calculates, for each type, the average of the confidences output from the classifier for the respective patterns; when the highest of the average values is positive, determines the type of the input information to be the type having the highest value; and when the highest value is negative, determines whether the type of the input information is one of the existing types or a new type by using converted input information in which each word of the key is replaced with the synonym produced by the synonym conversion unit.
  5.  The information identification device according to claim 1, wherein
     the distributed representation concatenation unit calculates, as the concatenated distributed representation, the sum of the distributed representation of the pattern and the distributed representations of the words of the value.
  6.  An information identification device comprising:
     a dividing unit that divides input information including a key and a value into words;
     a distributed representation generation unit that generates a distributed representation of each word;
     a combination generation unit that combines the distributed representations of the words of the key to generate a plurality of distributed representation patterns;
     a distributed representation concatenation unit that, for each of the patterns, combines the distributed representation of the key of the pattern with the distributed representations of the words of the value to generate a concatenated distributed representation;
     a classifier that outputs a confidence for each type for the concatenated distributed representation; and
     a determination unit that uses the confidence that the classifier outputs for each type and each pattern to determine whether the type of the input information is one of the existing types or a new type.
  7.  An information identification method performed by an information identification device, the method comprising the steps of:
     dividing learning information including a key and a value into words;
     generating a distributed representation of each word;
     combining the distributed representations of the words of the key to generate a plurality of distributed representation patterns;
     for each of the patterns, combining the distributed representation of the key of the pattern with the distributed representations of the words of the value to generate a concatenated distributed representation;
     generating learning data including the concatenated distributed representation for each pattern and a type number of a type corresponding to the learning information; and
     performing machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.
  8.  A program that causes a computer to function as the information identification device according to any one of claims 1 to 6.
PCT/JP2022/021943 2022-05-30 2022-05-30 Information identification device, information identification method, and program WO2023233467A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/021943 WO2023233467A1 (en) 2022-05-30 2022-05-30 Information identification device, information identification method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/021943 WO2023233467A1 (en) 2022-05-30 2022-05-30 Information identification device, information identification method, and program

Publications (1)

Publication Number Publication Date
WO2023233467A1 true WO2023233467A1 (en) 2023-12-07

Family

ID=89025943

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/021943 WO2023233467A1 (en) 2022-05-30 2022-05-30 Information identification device, information identification method, and program

Country Status (1)

Country Link
WO (1) WO2023233467A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHIBAYAMA SHOJIRO, NAKAMOTO CHIHIRO, SHIMIZU TAKESHI, TAKAYANAGI HIROSHI, NISHIDA MAKOTO: "Study of automatic business email classifier using distributed representation.", IPSJ/SIGSE SOFTWARE ENGINEERING SYMPOSIUM 2018, 29 August 2018 (2018-08-29), pages 257 - 258, XP093117162 *
TAKAGI, RYOTA, KAZAMA, KAZUHIRO, SAKAKI, TAKESHI: "Method to Extend Coverage of Domain Dictionary Based on Distributed Representations of Words.", IEICE TECHNICAL REPORT, IEICE, JP, vol. 118, no. 210 (NLC2018-27), 30 August 2018 (2018-08-30), pages 103 - 108, XP009551099, ISSN: 2432-6380 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22944756

Country of ref document: EP

Kind code of ref document: A1