WO2023233467A1

WO2023233467A1 - Information identification device, information identification method, and program

Info

Publication number: WO2023233467A1
Application number: PCT/JP2022/021943
Authority: WO
Inventors: 遥香小山内; 優酒井; 彩鈴木; 翔金丸; 謙輔高橋; 悟近藤
Original assignee: 日本電信電話株式会社
Priority date: 2022-05-30
Filing date: 2022-05-30
Publication date: 2023-12-07

Abstract

The present invention comprises: a character string dividing unit 10 that divides information for learning including keys and values into words; a distributed representation generation unit 11 that generates a distributed representation for each word; a combination generation unit 12 that generates a plurality of patterns of distributed representations by combining the distributed representations of words that are keys; a distributed representation coupling unit 13 that generates a coupled distributed representation for each of the patterns by combining the distributed representations of the keys of the pattern and the distributed representations of words that are the values; a learning data generation unit 14 that generates learning data including the coupled distributed representation of each pattern and the type number of the type corresponding to the information for learning; and a learning unit that machine-learns the learning data, and generates a classifier for identifying input information including keys and values as some type.

Description

Information identification device, information identification method, and program

The present invention relates to an information identification device, an information identification method, and a program.

B2B2X services, in which multiple businesses in different industries collaborate, are increasing. In providing such services, it is necessary to distribute information such as customer information, contract information, and billing information between collaborating businesses.

In the distribution of information, it is necessary to identify information. Techniques related to information identification include morphological analysis (Non-Patent Document 1), distributed representation of characters (Non-Patent Document 2), and confidence calculation technology for classifiers (Non-Patent Document 3).

Currently, in order to distribute information between collaborating businesses, operators look at the content of the input information, determine the type, manually determine the distribution destination, and convert the format of the information.

In order to cope with the increasing number of services in which multiple companies from different industries collaborate, it will be necessary to automate the distribution of information between collaborating businesses. When automating information distribution, it is necessary to identify information of the same type even when it is input in different formats. Specifically, in order to distribute information to the appropriate destination for each type, there is a need to automatically identify information as being of the same type even if the existing systems used by each company are different and the same type of information is input in different formats.

Non-Patent Documents 1-3 do not consider identifying the same type of information when different formats of information are input. Therefore, using these non-patent documents, it is not possible to identify information in different formats as being of the same type.

The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a technology that allows information to be identified as being of the same type when information in different formats is input.

To achieve the above object, one aspect of the present invention includes: a dividing unit that divides learning information including a key and a value into words; a distributed expression generating unit that generates a distributed expression of each word; a combination generation unit that generates a plurality of patterns of distributed expressions by combining the distributed expressions of the key words; a distributed expression concatenation unit that generates, for each pattern, a concatenated distributed expression by combining the distributed expression of the key of the pattern with the distributed expressions of the value words; a learning data generation unit that generates learning data including the concatenated distributed expression for each pattern and a type number of the type corresponding to the learning information; and a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.

One aspect of the present invention includes: a dividing unit that divides input information including a key and a value into words; a distributed expression generating unit that generates a distributed expression of each word; a combination generation unit that generates a plurality of patterns of distributed expressions by combining the distributed expressions of the key words; a distributed expression concatenation unit that generates, for each pattern, a concatenated distributed expression by combining the distributed expression of the key of the pattern with the distributed expressions of the value words; a classifier that outputs, for each type, a confidence level for the concatenated distributed expression; and a determination unit that determines whether the type of the input information is one of the existing types or a new type, using the confidence level that the classifier outputs for each type for each pattern.

One aspect of the present invention is an information identification method performed by an information identification device, comprising the steps of: dividing learning information including a key and a value into words; generating a distributed expression of each word; generating a plurality of patterns of distributed expressions by combining the distributed expressions of the key words; generating, for each pattern, a concatenated distributed expression by combining the distributed expression of the key of the pattern with the distributed expressions of the value words; generating learning data including the concatenated distributed expression for each pattern and a type number of the type corresponding to the learning information; and performing machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.

One aspect of the present invention is a program that causes a computer to function as the information identification device.

According to the present invention, it is possible to provide a technology that allows information to be identified as being of the same type when information in different formats is input.

FIG. 1 shows an example of the configuration of an information identification device according to this embodiment.
FIG. 2 is an explanatory diagram illustrating the operation of the data generation section of the information identification device.
FIG. 3 is an explanatory diagram illustrating another operation of the data generation section of the information identification device.
FIG. 4 is a diagram showing the relationship between the distributed representations of each word and the sum of the distributed representations.
FIG. 5 is an explanatory diagram illustrating a pattern of distributed representation of keys.
FIG. 6 is a sequence diagram showing the operation of the information identification device in the learning phase.
FIG. 7 is an explanatory diagram illustrating the operation of the determination section of the information identification device.
FIG. 8 is an image diagram of the processing of the synonym conversion unit.
FIG. 9 is a sequence diagram showing the operation of the information identification device in the identification phase when distributed representation is possible.
FIG. 10 is a sequence diagram showing the operation of the information identification device in the identification phase when distributed representation is not possible.
FIG. 11 is an explanatory diagram for explaining a method of calculating the distance between a boundary surface and a point.
FIG. 12 is a schematic diagram showing the effects of this embodiment.
FIG. 13 is an example of a hardware configuration.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

<Configuration of information identification device>
FIG. 1 shows a configuration example of an information identification device 1 of this embodiment. The information identification device 1 is a device that identifies the type of information input during information distribution between multiple companies in different industries. The illustrated information identification device 1 includes a data generation section A, a determination section B, a type generation section C, a type holding section 19, and a regular expression determination section 20.

The data generation section A includes a character string division section 10, a distributed expression generation section 11, a combination generation section 12, a distributed expression concatenation section 13, and a learning data generation section 14. The determination section B includes a type classification unit 15, a type determination unit 16, and a synonym conversion unit 21. The type generation section C includes a similar word extraction section 17 and a learning data update section 18.

The character string dividing unit 10 (dividing unit) divides input information into words. Specifically, the character string division unit 10 performs morphological analysis on the character string of information and divides it into minimum unit words that have meaning by themselves (see Non-Patent Document 1). Information in this embodiment includes a key and a value. The character string dividing unit 10 receives learning information including a key and value during learning, and receives input information including a key and value (information to be determined) during determination.

The distributed expression generation unit 11 generates a distributed expression for each word divided by the character string division unit 10. Distributed representation is one type of natural language processing, and is a technique for representing words as high-dimensional real vectors (see Non-Patent Document 2). By expressing the meaning of a word mathematically, it becomes possible to perform calculations using the meaning of the word.

The combination generation unit 12 combines the distributed expressions of the key words to generate a plurality of distributed expression patterns. The combination generation unit 12 may generate a pattern by combining the distributed expressions of each word of the key from the front or the rear.

For each pattern generated by the combination generation unit 12, the distributed expression concatenation unit 13 generates a concatenated distributed expression by combining the key distributed expression of the pattern and the distributed expression of the value word. The distributed expression concatenation unit 13 may calculate the sum of the distributed expression of the key of the pattern and the distributed expression of each word of the value as the concatenated distributed expression.

The learning data generation unit 14 generates learning data including a connected distributed expression for each pattern and a type number of the type corresponding to the learning information.

The type classification unit 15 includes a learning unit and a classifier. The learning unit performs machine learning on the learning data to generate a classifier. The classifier is a trained model for identifying input information including a key and a value as one of the types. The classifier of this embodiment receives as input the connected distributed representation for each pattern generated from the input information. The classifier may output, for each type, a confidence level for the input connected distributed representation.

An SVM (support-vector machine) may be used as the classifier (see Non-Patent Document 3). The method of Non-Patent Document 3 aims to estimate with higher accuracy by focusing on the degree of confidence that a multi-class classifier has in its recognition. When classifying into K classes, the SVM creates K boundary surfaces using the idea of One vs All SVM, and determines the classification type using the confidence calculated from the distance between the classification target and each boundary surface. In this embodiment, the distance from the boundary surface of the SVM is used to calculate the confidence of the classifier. When the classification target lies in the positive direction as viewed from the boundary surface, the farther it is from the boundary surface, the higher the confidence. Calculation of confidence will be described later.
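The relationship between boundary surfaces and confidence can be sketched as follows. This is a minimal illustration, not the embodiment's calculation (which is described later with reference to FIG. 11): it assumes each boundary surface is a hyperplane w·x + b = 0 and takes the signed distance (w·x + b)/||w|| as the confidence. The weights, offsets, and the names `signed_distance` and `confidences` are hypothetical.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0.

    Positive values lie on the positive side of the boundary surface.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def confidences(boundaries, x):
    """Return {type_number: confidence} for one connected representation x.

    One boundary surface per type, in the One vs All manner.
    """
    return {t: signed_distance(w, b, x) for t, (w, b) in boundaries.items()}

# Hypothetical boundary surfaces for K = 2 types in a 2-dimensional space.
boundaries = {
    0: ([1.0, 0.0], -1.0),  # positive side is x[0] > 1
    1: ([0.0, 1.0], -1.0),  # positive side is x[1] > 1
}
conf = confidences(boundaries, [3.0, 0.0])
```

Here the point lies on the positive side of the type 0 boundary and on the negative side of the type 1 boundary, so only type 0 receives a positive confidence.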

The type determination unit 16 (determination unit) determines whether the type of the input information is one of the existing types or a new type, using the confidence level that the classifier outputs for each type for each pattern. Specifically, the type determination unit 16 calculates, for each type, the average of the confidence levels output by the classifier across the patterns. If the highest average is positive, it determines that the type of the input information is the type with the highest average; if the highest average is negative, it determines that the input information is of a new type. Further, if the highest average is negative, the type determination unit 16 may determine whether the type of the input information is one of the existing types or a new type using converted information in which each key word has been replaced with a synonym by the synonym conversion unit 21. When the type determination unit 16 determines that the type of the input information is a new type, it generates a type number for the new type (new type number) and stores the type number in the type holding unit 19 together with the new type.
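The averaging rule can be sketched as below. This is an illustrative sketch only: the function name `determine_type` and the use of `None` to stand for a new type are hypothetical, and the handling of an average of exactly zero (treated here as a new type) is an assumption, since the text only covers the positive and negative cases.

```python
def determine_type(per_pattern_confidences):
    """Decide the type from per-pattern classifier outputs.

    per_pattern_confidences: list of {type_number: confidence}, one per pattern.
    Returns (type_number or None, averages); None stands for "new type".
    """
    n = len(per_pattern_confidences)
    types = per_pattern_confidences[0].keys()
    # Average the confidence of each type across all patterns.
    averages = {t: sum(c[t] for c in per_pattern_confidences) / n for t in types}
    best_type = max(averages, key=averages.get)
    if averages[best_type] > 0:
        return best_type, averages
    # Assumption: a highest average of zero or below means a new type.
    return None, averages

# Two patterns, two existing types (toy confidence values).
existing, _ = determine_type([{0: 1.5, 1: -0.5}, {0: 0.5, 1: -1.5}])
new, _ = determine_type([{0: -0.5, 1: -1.0}, {0: -1.5, 1: -2.0}])
```

In the first call type 0 has a positive average and is selected; in the second call every average is negative, so the input is treated as a new type.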

The synonym conversion unit 21 converts each key word of the input information into a synonym. The synonym conversion unit 21 may convert words into synonyms using, for example, a database such as a classification vocabulary table. The classification vocabulary table is a thesaurus (collection of synonyms) that classifies and organizes words according to their meaning, and the database is published by the Linguistic Resource Development Center of the National Institute for Japanese Language and Linguistics. The synonym conversion unit 21 may convert each word into a synonym using, for example, a "classification item" included in a record of the classification vocabulary table.

When the type determination unit 16 determines that the type is a new type, the similar word extraction unit 17 extracts a similar word for the key of the input information (first similar word) and a similar word for the value of the input information (second similar word). Then, the distributed expression generation unit 11 generates a distributed expression of the similar word of the key and a distributed expression of the similar word of the value. The distributed expression concatenation unit 13 combines the distributed expression of the similar word of the key and the distributed expression of the similar word of the value to generate an additional concatenated distributed expression. Further, the distributed expression concatenation unit 13 may generate an additional concatenated distributed expression by combining the distributed expression of the key and the distributed expression of the similar word of the value. Further, the distributed expression concatenation unit 13 may generate an additional concatenated distributed expression by combining the distributed expression of the similar word of the key and the distributed expression of the value.
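The three kinds of additional concatenated distributed expressions can be sketched as follows. This is a minimal sketch that assumes the combination is the vector sum used elsewhere in this embodiment; the function names and the toy vectors are hypothetical, and how similar words are extracted is outside the sketch.

```python
from itertools import product

def add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

def additional_representations(key_vec, value_vec, key_similar, value_similar):
    """Build additional concatenated representations for a new type.

    key_similar / value_similar: lists of vectors for similar words.
    """
    # similar word of key + similar word of value
    extra = [add(k, v) for k, v in product(key_similar, value_similar)]
    # original key + similar word of value
    extra += [add(key_vec, v) for v in value_similar]
    # similar word of key + original value
    extra += [add(k, value_vec) for k in key_similar]
    return extra

extra = additional_representations(
    key_vec=[1.0, 0.0],
    value_vec=[0.0, 1.0],
    key_similar=[[2.0, 0.0], [3.0, 0.0]],
    value_similar=[[0.0, 2.0]],
)
```

With two similar words for the key and one for the value, this yields 2 + 1 + 2 = 5 additional vectors.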

The learning data updating unit 18 generates learning data (additional learning data) including a new type number. Specifically, the learning data updating unit 18 generates learning data including an additional concatenated distributed representation and a new type number. The learning unit of the type classification unit 15 retrains the classifier using the additional learning data generated by the learning data updating unit 18.

The type holding unit 19 stores types and type numbers in association with each other. The regular expression determining unit 20 determines the type of input information when the value of the input information cannot be expressed in a distributed representation (in the case of a regular expression).

The information identification device 1 of the present embodiment described above can improve identification accuracy by generating a key pattern in consideration of duplicate words included in the key. Further, the information identification device 1 of this embodiment may perform synonym conversion for the key of input information. Identification accuracy can be improved by considering synonyms of Key.

The Middle B system 3 is a system of a service provider. The information identification device 1 of this embodiment is a device operated by Middle B. The operator terminal 7 is a terminal used by a Middle B operator. The First B system 5 is a system of a cooperating business entity related to the services provided by Middle B. In the example shown in FIG. 1, the information identification device 1 identifies the type of information input from the Middle B system 3, and outputs the identification result to the First B system 5.

<Learning phase>
FIG. 2 is an explanatory diagram illustrating the operation of the data generation section A of the information identification device 1. The illustrated operations are performed both in the learning phase and in the identification phase (excluding the operation of the learning data generation unit 14). Here, the case of the learning phase will be explained below as an example.

The operator terminal 7 transmits the learning information (character string) input by the operator to the character string dividing unit 10 of the information identification device 1. The character string dividing unit 10 divides input information into words by morphological analysis.

In the illustrated example, the character string division unit 10 divides the input "applicant name: Taro Yamada" into "application", "person", "name", "Yamada", and "Taro", and outputs them to the distributed expression generation unit 11. "Applicant name" is the key, and "Taro Yamada" is the value.

The distributed expression generation unit 11 generates a distributed expression for each word divided by the character string division unit 10 and outputs it to the combination generation unit 12. That is, the distributed expression generation unit 11 converts each word into a high-dimensional real vector.

The combination generation unit 12 generates a plurality of patterns by combining the distributed expressions of the key words. In the illustrated example, the combination generation unit 12 generates a plurality of patterns by combining the distributed expressions of the divided key words one by one from the front, and outputs the patterns to the distributed expression concatenation unit 13. Here, the following three patterns are generated. This reduces the impact of including duplicate suffixes such as "...name".

(1) Distributed representation of "application"
(2) Distributed representation of "application" + distributed representation of "person"
(3) Distributed representation of "application" + distributed representation of "person" + distributed representation of "name"

For each generated pattern, the distributed expression concatenation unit 13 combines the distributed representation of the pattern with the distributed representations of the value words, generates a connected distributed representation, and outputs it to the learning data generation unit 14.

Here, the following three connected distributed representations are generated.

(1) Distributed representation of "application" + distributed representation of "Yamada" + distributed representation of "Taro"
(2) Distributed representation of "application" + distributed representation of "person" + distributed representation of "Yamada" + distributed representation of "Taro"
(3) Distributed representation of "application" + distributed representation of "person" + distributed representation of "name" + distributed representation of "Yamada" + distributed representation of "Taro"

By using a combination of distributed representations, it is possible to prevent multiple types from being assigned to words with the same meaning and to improve classification accuracy. In this embodiment, the combination generation unit 12 and the distributed expression concatenation unit 13 calculate the sum of the distributed expressions of each word as a combination of distributed expressions.
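The front-combined key patterns and the connected distributed representations of this example can be sketched as follows. The three-dimensional vectors are hypothetical toy values, not output of a real embedding model, and the function names are illustrative; only the use of the vector sum as the combination comes from the embodiment.

```python
def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

def front_patterns(key_vectors):
    """Combine key word vectors one by one from the front."""
    return [vec_sum(key_vectors[:i]) for i in range(1, len(key_vectors) + 1)]

def connect(patterns, value_vectors):
    """Add the sum of the value word vectors to each key pattern."""
    value_sum = vec_sum(value_vectors)
    return [vec_sum([p, value_sum]) for p in patterns]

# Hypothetical toy embeddings for the words of "applicant name: Taro Yamada".
embedding = {
    "application": [1.0, 0.0, 0.0],
    "person": [0.0, 1.0, 0.0],
    "name": [0.0, 0.0, 1.0],
    "Yamada": [0.5, 0.5, 0.0],
    "Taro": [0.0, 0.5, 0.5],
}
key_words = ["application", "person", "name"]
value_words = ["Yamada", "Taro"]

patterns = front_patterns([embedding[w] for w in key_words])
connected = connect(patterns, [embedding[w] for w in value_words])
```

A key of three words gives three patterns, and therefore three connected distributed representations, each with the same dimension as a single word vector.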

The learning data generation unit 14 receives the type number corresponding to the input learning information from the operator terminal 7, generates learning data including the connected distributed representations of the plurality of patterns and the type number, and outputs it to the type classification unit 15. In the illustrated learning data, a type number of "0" is set for each connected distributed representation. The learning data is data for training the classifier of the type classification unit 15. The learning section of the type classification section 15 generates a classifier by machine learning using the learning data.
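As a rough picture of the resulting learning data, each connected distributed representation can be paired with the same type number. The layout below (a list of vector and label pairs) and the name `make_learning_data` are assumptions for illustration; the vectors are toy values.

```python
def make_learning_data(connected_representations, type_number):
    """Pair every connected representation with the type number of its source."""
    return [(vector, type_number) for vector in connected_representations]

# Toy connected representations for the three patterns of one key/value pair.
connected = [[1.5, 1.0, 0.5], [1.5, 2.0, 0.5], [1.5, 2.0, 1.5]]
learning_data = make_learning_data(connected, 0)

# Split into the feature matrix and label list that a classifier trainer expects.
features = [vector for vector, _ in learning_data]
labels = [type_number for _, type_number in learning_data]
```

One piece of learning information therefore contributes as many training rows as it has key patterns, all carrying the same label.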

FIG. 3 is an explanatory diagram illustrating another operation of the data generation section A of the information identification device 1. The illustrated operations are performed both in the learning phase and in the identification phase (excluding the operation of the learning data generation unit 14). Here, the case of the learning phase will be explained below as an example. In FIG. 3, in order to reduce the influence of prefixes such as "o" in the key, the distributed expressions of the key words are combined from the rear to generate multiple patterns.

Similarly to FIG. 2, the character string division unit 10 divides the information transmitted from the operator terminal 7 into words by morphological analysis. In the illustrated example, the character string division unit 10 divides the input "applicant name: Taro Yamada" into "o", "application", "person", "name", "Yamada", and "Taro", and outputs them to the distributed expression generation unit 11. "Applicant name" is the key, and "Taro Yamada" is the value.

Similarly to FIG. 2, the distributed expression generation unit 11 generates a distributed expression for each divided word and outputs it to the combination generation unit 12. The combination generation unit 12 generates a plurality of patterns by combining the distributed representations of the key words. In the illustrated example, the combination generation unit 12 generates a plurality of patterns by combining the distributed expressions of the divided key words one by one from the rear, and outputs the patterns to the distributed expression concatenation unit 13. Here, the following four patterns are generated. This reduces the impact of including prefixes such as "o".

(1) Distributed representation of "name"
(2) Distributed representation of "person" + distributed representation of "name"
(3) Distributed representation of "application" + distributed representation of "person" + distributed representation of "name"
(4) Distributed representation of "o" + distributed representation of "application" + distributed representation of "person" + distributed representation of "name"

As in FIG. 2, for each generated pattern, the distributed expression concatenation unit 13 generates a connected distributed representation by combining the distributed representation of the pattern with the distributed representations of the value words, and outputs it to the learning data generation unit 14.
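The rear-combined patterns can be sketched in the same way. One-hot toy vectors (hypothetical) are used so that each pattern shows directly which key words it contains; `rear_patterns` is an illustrative name.

```python
def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

def rear_patterns(key_vectors):
    """Combine key word vectors one by one from the rear."""
    return [vec_sum(key_vectors[-i:]) for i in range(1, len(key_vectors) + 1)]

key_words = ["o", "application", "person", "name"]
# Hypothetical one-hot vectors: position i is 1.0 for the i-th key word.
key_vectors = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

patterns = rear_patterns(key_vectors)
```

The component for the prefix "o" (the first position) is zero in the first three patterns and appears only in the fourth, which is why combining from the rear limits the influence of the prefix.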

Similarly to FIG. 2, the learning data generation unit 14 receives the type number of the learning information from the operator terminal 7, generates learning data including the plurality of connected distributed representations and the type number, and outputs it to the type classification unit 15. The learning section of the type classification section 15 generates a classifier by machine learning using the learning data.

As explained above, in the learning phase of this embodiment, the information identification device 1 divides information including a key and a value into words, calculates a distributed representation of each word, and combines the distributed representations of the key words to generate patterns. For each pattern, it generates a concatenated distributed expression by combining the distributed expression of the key and the distributed expression of each word of the value, and generates learning data including the concatenated distributed expression and the type number. Thereby, in the identification phase described later, even if words with multiple meanings or synonyms are input, they can be identified as the appropriate types.

Note that in FIGS. 2 and 3, as an example of a combination, a pattern is generated by combining the distributed expressions of each word of the key from the front or the rear, but the pattern is not limited to this. The combination generation unit 12 can generate patterns of distributed expressions by various combinations of distributed expressions of key words.

FIG. 4 shows the distributed representation (vector) of each word ("surname", "name", "address", "Yamaguchi") and the connected distributed representations ("surname: Yamaguchi", "name: Yamaguchi", "address: Yamaguchi"). In the illustrated example, the keys are "surname", "name", and "address", which are words that cannot be divided any further. Therefore, the combination generation unit 12 generates one pattern for each key. As shown in the figure, sums of distributed expressions of the same type are mapped to close positions (that is, clustered). Here, "Yamaguchi" combined with "name" and "Yamaguchi" combined with "surname", which have similar meanings, are converted into similar distributed expressions. In contrast, "Yamaguchi" combined with "name" and "Yamaguchi" combined with "address", which have different meanings, are converted into distributed expressions that reflect the respective meanings.

As can be seen from FIG. 4, by using the sum of distributed representations as a combination of distributed representations, information of the same type is mapped to close positions. Therefore, by generating learning data for multi-class classification using the sum of the distributed representation of the key and the distributed representation of the value, it becomes possible to identify the meaning of words with multiple meanings and to identify whether different words are of the same type.

In other words, summing the distributed representations converts information containing synonyms into similar distributed representations and converts information containing polysemous words into distributed representations for each usage, and these can be used to identify the information.

FIG. 5 is an explanatory diagram illustrating a pattern of distributed expressions of keys generated by the combination generation unit 12 of this embodiment.

If the output of the data generation section A were simply the sum of the distributed expressions of the key and the value, duplicate words (suffixes, prefixes, etc.) included in the key would have a large effect on identification, and correct identification might be difficult. In this embodiment, as described above, multiple patterns of combinations of the distributed expressions of the key words (morphemes) of the input information are generated, and multiple connected distributed expressions (vectors) are generated as input to the type classification unit 15 (classifier). This reduces the influence of duplicate words.

The illustrated example shows a case where input information with "service name" as a key is input to the character string dividing unit 10 in a state where "company name" exists as an existing type.

In the case of 5A in FIG. 5, a distributed representation (sum of distributed representations) that is a combination of the distributed representation of "service" and the distributed representation of "name" of "service name" is input to the classifier. The cosine similarity between "company name" and "service name" is high (0.84), and as shown in the figure, the distributed representation of "company name" and the distributed representation of "service name" are mapped in close positions. Therefore, the classifier incorrectly determines that the type of input information with the key of "service name" is "company name."

On the other hand, the cosine similarity between "company" and "service" is low (0.43), and as shown in the figure, the distributed representation of "company" and the distributed representation of "service" are mapped to distant positions. From this, it can be said that the high cosine similarity between the distributed representation combining "company" and "name" and the distributed representation combining "service" and "name" is due to the influence of the overlapping suffix "name".
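The effect of a shared suffix on cosine similarity can be reproduced with toy vectors. The vectors below are hypothetical and chosen only to make the effect visible; they do not reproduce the 0.84 and 0.43 values of FIG. 5.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical toy vectors: "company" and "service" are orthogonal,
# and the shared suffix "name" has a larger norm than either.
company = [1.0, 0.0, 0.0]
service = [0.0, 1.0, 0.0]
name = [0.0, 0.0, 2.0]

sim_words = cosine(company, service)                      # without the suffix
sim_sums = cosine(add(company, name), add(service, name))  # with the suffix
```

Although the two leading words are dissimilar on their own, adding the same suffix vector to both pulls the sums close together, which is the situation that the single-word pattern in 5B of FIG. 5 is meant to counteract.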

As shown in 5B of FIG. 5, in this embodiment, in order to reduce the influence of such a suffix or prefix, the combination generation unit 12 generates multiple patterns: a distributed representation of one key word, and distributed representations combining at least two key words. Here, two patterns are generated: a distributed representation that combines the distributed representation of "service" and the distributed representation of "name", and a distributed representation of only "service". Then, by inputting into the classifier, for each pattern, the concatenated distributed expression combined with the distributed expression of the value word, it is possible to reduce the influence of the suffix "name" and reduce misjudgment of the type. In the illustrated example, it can be determined that the type of "service name" is not "company name" but a new type.

FIG. 6 is a sequence diagram showing the operation of the information identification device 1 in the learning phase.

The operator terminal 7 receives the operator's instructions and inputs learning information to the character string dividing unit 10 of the information identification device 1 (step S21). The learning information includes at least one piece of learning data including a key and a value. The character string dividing unit 10 divides the input information into words (step S22). The distributed expression generation unit 11 generates a distributed expression of each divided word and outputs it to the combination generation unit 12 (step S23).

The combination generation unit 12 combines the distributed expressions of the key words to generate a distributed expression pattern (step S24). Note that if the key is composed of one word, one pattern is generated.

The distributed expression concatenation unit 13 generates a concatenated distributed expression for each pattern and outputs it to the learning data generation unit 14 (step S25). Specifically, the distributed expression concatenation unit 13 generates a concatenated distributed expression for each pattern by combining the distributed expression of the pattern and the distributed expression of the value word.

The operator terminal 7 receives the operator's instruction and requests the type holding unit 19 for the type number of the information transmitted in S21 (step S26). The type holding unit 19 transmits the type number corresponding to the information transmitted in S21 to the operator terminal 7 (step S27). Upon acquiring the type number, the operator terminal 7 transmits the type number to the learning data generation unit 14 (step S28).

The learning data generation unit 14 generates learning data including the concatenated distributed representation received in step S25 and the type number received in S28, and sends it to the type classification unit 15 (step S29). The learning unit of the type classification unit 15 generates a classifier by performing machine learning on the learning data (step S30).

Note that in FIG. 6, the operator terminal 7 requests the type number from the type holding unit 19 in step S26, but the learning data generation unit 14 may instead request the type number from the type holding unit 19. In this case, the type holding unit 19 sends the type number to the learning data generation unit 14 in step S27, and the learning data generation unit 14 generates the learning data using the concatenated distributed representation and the type number acquired from the type holding unit 19 (step S29).

Next, the operation of registering, in the type holding unit 19, information that cannot be converted into a distributed representation, such as a postal code or telephone number, will be described (steps S31 and S32). For such information, in the identification phase described later, the regular expression determination unit 20 of the information identification device 1 uses regular expressions to determine the type based on patterns of numbers and character strings. Note that steps S31 and S32 are performed asynchronously with steps S21 to S30.

The operator terminal 7 receives the type and the corresponding regular expression pattern from the partner business operator of the First B system 5 (step S31), and transmits the type and the regular expression pattern to the type holding unit 19 of the information identification device 1 (step S32). The type holding unit 19 stores the transmitted type and regular expression pattern in its own storage unit.

FIG. 6 illustrates a regular expression pattern whose type is postal code and a regular expression pattern whose type is telephone number. The regular expression for postal codes indicates that the code starts with the 〒 mark, followed by three digits and a - (hyphen), and ends with four digits.
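The following is a minimal sketch of what such patterns might look like. The exact patterns of FIG. 6 are not reproduced in the text, so both expressions below, the names `POSTAL_CODE` and `TELEPHONE`, and the sample strings are assumptions made only for illustration.

```python
import re

# Assumed pattern: the 〒 mark, three digits, a hyphen, then four digits.
POSTAL_CODE = re.compile(r"〒\d{3}-\d{4}")
# Assumed pattern: three hyphen-separated digit groups starting with 0.
TELEPHONE = re.compile(r"0\d{1,3}-\d{2,4}-\d{4}")

# Sample strings chosen for illustration only.
for text in ["〒100-0001", "100-0001", "090-1234-5678"]:
    is_postal = bool(POSTAL_CODE.fullmatch(text))
    is_phone = bool(TELEPHONE.fullmatch(text))
    print(text, is_postal, is_phone)
```

`fullmatch` is used so that a string only counts as a match when the whole value conforms to the pattern, not just a prefix of it.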

<Identification phase>
In the identification phase, the information identification device 1 uses the confidence calculated by the classifier of the type classification unit 15 to determine whether the type of input information can be identified as one of the types held in the type holding unit 19.

FIG. 7 is a diagram showing an example of the overall flow of the information identification device 1 in the identification phase. In the illustrated identification phase, the concatenated distributed representation output from the data generation unit A is input to the classifier, and the average value of the confidence obtained from the classifier is calculated for each type. If the maximum of the average values is positive, the type corresponding to that maximum is determined to be the type of the input information. If the maximum of the average values is negative, each word of the key is converted into a synonym, a concatenated distributed representation is generated again, and the average value of the confidence is recalculated. If the maximum of the recalculated average values is positive, the input information is determined to be of the type corresponding to that maximum; if it is negative, the input information is judged unidentifiable, that is, a new type not held in the type holding unit 19, and the process moves to the type generation unit C.
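The control flow of this overview can be sketched as follows. This is an illustrative outline only: `identify`, `score`, and `to_synonym_key` are hypothetical names, `score` stands in for the data generation unit A plus the classifier and averaging, and the toy scores and the sample value are invented.

```python
from typing import Callable, Dict, Optional

Info = Dict[str, str]        # {"key": ..., "value": ...}
Scores = Dict[str, float]    # type name -> average confidence


def identify(info: Info,
             score: Callable[[Info], Scores],
             to_synonym_key: Callable[[str], str]) -> Optional[str]:
    """Return the identified type, or None when the input is a new type."""
    converted = False
    while True:
        averages = score(info)
        best_type = max(averages, key=lambda t: averages[t])
        if averages[best_type] > 0:
            return best_type      # an existing type was identified
        if converted:
            return None           # still not positive after synonym conversion
        # First failure: replace the key with its synonym and try once more.
        info = {"key": to_synonym_key(info["key"]), "value": info["value"]}
        converted = True


def toy_score(info: Info) -> Scores:
    # Invented scores: only the key "first name" is recognised.
    if info["key"] == "first name":
        return {"name": 0.4, "address": -0.8, "company name": -0.6}
    return {"name": -0.25, "address": -0.9, "company name": -0.7}


sample = {"key": "surname", "value": "sample value"}
print(identify(sample, toy_score, lambda key: "first name"))
print(identify(sample, toy_score, lambda key: key))
```

The second call never changes the key, so the best average stays negative and the function reports a new type by returning `None`.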

Specifically, the middle B system 3 transmits arbitrary information to the data generation unit A of the information identification device 1. In FIG. 7, it is assumed that "service name: FLET'S Hikari" is input. "Service name" is the key and "FLET'S Hikari" is the value. The character string dividing unit 10 of the data generating unit A divides input information into words. The distributed expression generation unit 11 generates a distributed expression for each divided word.

The combination generation unit 12 combines the distributed expressions of the key words to generate a plurality of distributed expression patterns. Here, the combination generation unit 12 generates two patterns: (1) a distributed representation of "service" and (2) a distributed representation of "service" + a distributed representation of "name". For each pattern, the distributed expression concatenation unit 13 combines the distributed expression of the pattern and the distributed expression of the value words, generates a concatenated distributed expression, and inputs the generated concatenated distributed expression to the determination unit B. Here, the following concatenated distributed representations are generated: (1) distributed representation of "service" + distributed representation of "FLET'S" + distributed representation of "Hikari", and (2) distributed representation of "service" + distributed representation of "name" + distributed representation of "FLET'S" + distributed representation of "Hikari". The distributed expression concatenation unit 13 calculates the sum of the distributed expressions of the words as the concatenated distributed expression.
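The pattern generation and summation for this example can be sketched as below. The sketch assumes that patterns are the leading sub-sequences of the key words, which reproduces the two patterns of the example; how longer keys are combined is not specified here. The function names and the 2-dimensional toy vectors are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def key_patterns(key_words: List[str]) -> List[List[str]]:
    """Leading sub-sequences of the key words: [w1], [w1, w2], ..."""
    return [key_words[:n] for n in range(1, len(key_words) + 1)]


def vector_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def concatenated_representations(key_words: List[str],
                                 value_words: List[str],
                                 embed: Dict[str, Vector]) -> List[Vector]:
    """One summed vector per key pattern, each including all value words."""
    return [vector_sum([embed[word] for word in pattern + value_words])
            for pattern in key_patterns(key_words)]


# Toy 2-dimensional vectors, chosen only for illustration.
toy = {"service": [1.0, 0.0], "name": [0.0, 1.0],
       "FLET'S": [2.0, 2.0], "Hikari": [0.5, 0.5]}

print(key_patterns(["service", "name"]))
print(concatenated_representations(["service", "name"], ["FLET'S", "Hikari"], toy))
```

A key consisting of a single word yields exactly one pattern, which is consistent with the learning-phase note on step S24.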

The classifier of the type classification unit 15 of the determination unit B outputs the confidence for each type with respect to the concatenated distributed representation of each input pattern (step S10). The type determination unit 16 calculates the average value of the confidence for each type (step S11), and determines whether the highest of the average values is positive (step S12). If the highest value is positive (step S12: YES), the type determination unit 16 determines that the input information is of the type with the maximum confidence (step S13). That is, the type determination unit 16 assigns the type with the maximum confidence to the input information.
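Steps S11 to S13 can be sketched as follows. The function names are hypothetical and the per-pattern confidences are made-up numbers, not values taken from the figures.

```python
from typing import Dict, List, Optional, Tuple


def average_by_type(per_pattern: List[Dict[str, float]]) -> Dict[str, float]:
    """Average, for each type, the confidences obtained for every pattern."""
    types = per_pattern[0].keys()
    return {t: sum(conf[t] for conf in per_pattern) / len(per_pattern)
            for t in types}


def determine_type(per_pattern: List[Dict[str, float]]) -> Tuple[Optional[str], float]:
    """Return (type, average) if the best average is positive, else (None, average)."""
    averages = average_by_type(per_pattern)
    best = max(averages, key=lambda t: averages[t])
    return (best if averages[best] > 0 else None, averages[best])


# Made-up confidences for two patterns and three types.
confidences = [
    {"name": -0.5, "address": -1.0, "company name": -0.75},
    {"name": 0.0, "address": -1.5, "company name": -0.25},
]
print(average_by_type(confidences))
print(determine_type(confidences))
```

Returning the best average alongside the decision keeps the value available for the later check on whether synonym conversion should be attempted.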

Then, the type determination unit 16 adds the determined type (or type number) to the input information and outputs it to the first B system 5. It is assumed that each functional unit 10-21 outputs data processed by the functional unit and also outputs data input to the functional unit. Therefore, the type determination unit 16 acquires the information input to the character string division unit 10 via the distributed expression generation unit 11, combination generation unit 12, distributed expression concatenation unit 13, and type classification unit 15.

On the other hand, if the highest value is not positive (step S12: NO), the type determination unit 16 determines whether synonym conversion has already been performed (step S14). In the illustrated example, the classifier outputs confidence levels for three types: "name," "address," and "company name," and the type with the highest average confidence is "name," whose value is negative (-0.25). Having determined that the highest value is negative (step S12: NO), the type determination unit 16 determines whether synonym conversion has been completed using, for example, a synonym conversion flag. The synonym conversion flag is set in the output data by the type determination unit 16 or the synonym conversion unit 21 when the highest confidence level is not positive. Each functional unit outputs the data it has processed together with the data input to it. Therefore, the type determining unit 16 can determine the presence or absence of the conversion flag.

If the synonym conversion has not been completed (step S14: NO), the synonym conversion unit 21 converts each word of the key of the input information into a synonym (step S15). Here, the synonym conversion unit 21 converts the keys "service" and "name" into "price/cost" and "name" using the above-mentioned classification vocabulary table and the like. Then, the converted information (price/cost name: FLET'S Hikari) in which the key of the input information is replaced with the converted word is input to the data generation section A.

Similarly to the above, the data generation unit A outputs the concatenated distributed expression for each pattern of the converted information to the type classification unit 15. Here, the following concatenated distributed representations are generated: (1) distributed representation of "price" + distributed representation of "cost" + distributed representation of "FLET'S" + distributed representation of "Hikari", and (2) distributed representation of "price" + distributed representation of "cost" + distributed representation of "name" + distributed representation of "FLET'S" + distributed representation of "Hikari".

The classifier of the determination unit B outputs the confidence for each type for the input concatenated distributed representation of each pattern (step S10). The type determination unit 16 calculates, for each type, the average value of the confidence of each pattern output from the type classification unit 15 (step S11), and determines whether the highest of the average values is positive (step S12). If the highest value is positive (step S12: YES), the process advances to step S13.

If the highest value is not positive (step S12: NO), the type determination unit 16 determines whether synonym conversion has been completed (step S14). In the illustrated example, the highest average confidence after synonym conversion, that of "name", is negative (-0.35) (step S12: NO), and synonym conversion has already been completed (step S14: YES).

In this case, the type determination unit 16 determines that the type of the input information cannot be identified by the current classifier, and adds the type number of a new type to the type holding unit 19 (step S16). Then, the type determination unit 16 outputs the distributed expression of the key and the distributed expression of the value of the input information to the similar word extraction unit 17 of the type generation unit C (step S17). Each functional unit outputs the data it has processed together with the data input to it. Therefore, the type determining unit 16 can obtain the distributed representations of the key and value generated by the distributed expression generation unit 11.

The similar word extraction unit 17 extracts similar words for the key and similar words for the value using the distributed expression of the key and the distributed expression of the value, and outputs the extracted words to the distributed expression generation unit 11. For example, the similar word extraction unit 17 uses the FastText model of Non-Patent Document 2 and cosine similarity to extract similar words for keys and values.
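A minimal sketch of similar-word extraction by cosine similarity follows. In the described device the vectors would come from the FastText model; here a small hand-written dictionary stands in for it, and `similar_words`, `top_n`, and the vocabulary entries are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_words(word: str, vocab: Dict[str, List[float]], top_n: int = 2) -> List[str]:
    """Words in vocab closest to `word` by cosine similarity, excluding itself."""
    target = vocab[word]
    ranked = sorted((w for w in vocab if w != word),
                    key=lambda w: cosine_similarity(target, vocab[w]),
                    reverse=True)
    return ranked[:top_n]


# Toy vectors standing in for a trained word-embedding model.
toy_vocab = {
    "service": [0.9, 0.1, 0.0],
    "plan": [0.8, 0.2, 0.1],
    "offering": [0.7, 0.3, 0.0],
    "address": [0.0, 0.1, 0.9],
}
print(similar_words("service", toy_vocab))
```

The same function can be applied to each value word to obtain the similar words of the value.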

The distributed expression generation unit 11 generates distributed expressions for each similar word of key and each similar word of value. The combination generation unit 12 combines the distributed expressions of the key words to generate a plurality of distributed expression patterns. The distributed expression concatenation unit 13 generates an additional concatenated distributed expression by combining, for each pattern, the distributed expressions of the key and the similar words of the key, and the distributed expressions of the value and the similar words of the value. For example, the following combinations of distributed expressions are generated for each pattern.

・The sum of the distributed representation of (key) and the distributed representation of (value)
・The sum of the distributed representation of (key) and the distributed representation of (similar word of value)
・The sum of the distributed representation of (similar word of key) and the distributed representation of (similar word of value)
・The sum of the distributed representation of (similar word of key) and the distributed representation of (value)

The learning data update unit 18 generates additional learning data including the concatenated distributed representations generated by the distributed expression concatenation unit 13 and the new type number issued by the type determination unit 16 in step S16, and outputs it to the type classification unit 15. The learning unit of the type classification unit 15 retrains the classifier using the additional learning data.
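The generation of additional learning data can be sketched as below for a single key pattern. The sketch pairs the key or a similar word of the key with the value or a similar word of the value; the function names, the toy vectors, and the type number 4 are assumptions for illustration.

```python
from itertools import product
from typing import List, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def additional_learning_data(key_vec: Vector, value_vec: Vector,
                             key_similar: List[Vector],
                             value_similar: List[Vector],
                             new_type_number: int) -> List[Tuple[Vector, int]]:
    """Pair every (key or similar key) with every (value or similar value)."""
    key_side = [key_vec] + key_similar
    value_side = [value_vec] + value_similar
    return [(add(k, v), new_type_number)
            for k, v in product(key_side, value_side)]


# Toy vectors and a hypothetical new type number.
rows = additional_learning_data([1.0, 0.0], [0.0, 1.0],
                                key_similar=[[0.5, 0.0]],
                                value_similar=[[0.0, 0.5]],
                                new_type_number=4)
for features, label in rows:
    print(features, label)
```

With one similar word on each side this produces four labeled rows, one per combination; more similar words increase the amount of additional learning data accordingly.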

By taking the sum of the distributed expressions as the combination of distributed expressions, information of the same type is mapped to close positions as shown in FIG. Therefore, additional training data for the new type is generated using the sum of the distributed representations of the key and value of the new type and the sums of the distributed representations of their similar words, and the classifier learns from this data, which makes it possible to identify the new type.

That is, when generating a new type, the combination of the distributed representation of the key and the distributed representation of the value of the new type is used as additional learning data. At that time, by extracting similar words for the key and value and also training on the sums of the distributed expressions of the extracted similar words, the classifier forms a class for the new type, and the identification accuracy for the new type can be improved.

FIG. 8 is an image diagram of the processing of the synonym conversion unit 21. The illustrated example shows a case where input information using "surname" as a key is input to the information identification device 1 in a state where "name", "address", and "company name" exist as existing types. The distributed representation of "surname" and the distributed representation of "full name" have a low cosine similarity and are mapped to relatively distant positions. When the synonym conversion unit 21 converts "surname" into the synonym "first name", the distributed representation of "first name" and the distributed representation of "full name" have a high cosine similarity and are mapped to nearby positions. The classifier can thereby appropriately assign input information having the key "surname" to the existing type "full name".

As described above, in this embodiment, the key of input information is converted into a synonym, and the converted key is used to determine whether the input information corresponds to any of the existing types. As a result, even when synonyms yield a low degree of similarity in the combined key and value distributed expressions and are therefore difficult to determine as the same type, the discrepancy between word meaning and the similarity of the distributed expressions is bridged and the type can be determined correctly.

FIGS. 9 and 10 are sequence diagrams showing the operation of the information identification device 1 in the identification phase. FIG. 9 shows the operation when the value of input information can be expressed in a distributed manner, and FIG. 10 shows the operation when the value of the input information cannot be expressed in a distributed manner.

In FIG. 9, the middle B system 3 inputs information including a key and a value to the character string dividing unit 10 of the information identification device 1 (step S61). The character string dividing unit 10 divides the information into words (character strings) (step S62). The distributed expression generation unit 11 generates a distributed expression for each divided word and outputs it to the combination generation unit 12 (step S63).

The combination generation unit 12 combines the distributed expressions of the key words, generates a plurality of distributed expression patterns, and outputs them to the distributed expression concatenation unit 13 (step S64). For each pattern, the distributed expression concatenation unit 13 combines the distributed expression of the pattern and the distributed expression of the value word, generates a concatenated distributed expression, and outputs it to the type classification unit 15 (step S65).

The classifier of the type classification unit 15 outputs the confidence for each type with respect to the concatenated distributed expression of each input pattern (step S66). When the classifier classifies K types, it calculates K confidence values for each pattern.

The type determination unit 16 calculates the average value of the reliability of each type. If the maximum value of the average reliability is a positive value, it is determined that the type of information input in S61 is the type with the maximum value. Then, the type determination unit 16 transmits the identification result including the information input in S61 and the type (type name and/or type number) with the highest confidence to the first B system 5 (step S67).

On the other hand, if the maximum value of the average confidence is negative even after the synonym conversion unit 21 has converted the key into synonyms, the type determination unit 16 registers the key of the information input in S61 in the type holding unit 19 (step S68). The type holding unit 19 issues the type number of the registered key (step S69) and outputs a completion notification to the type determining unit 16 (step S70). Note that the type determination unit 16 may issue the type number of the key, and the key and type number may be registered in the type holding unit 19 in step S67.

When the type determination unit 16 receives the completion notification, it outputs the distributed expression of the key and the distributed expression of the value to the similar word extraction unit 17 (step S71). The similar word extraction unit 17 extracts similar words similar to the distributed expression of key and similar words similar to the distributed expression of value, and outputs them to the distributed expression generation unit 11 (step S72). At this time, the similar word extracting unit 17 also outputs the key distributed expression and the value distributed expression obtained in step S70 to the distributed expression generating unit 11 together with the similar word.

The distributed expression generation unit 11 generates a distributed expression for each input similar word (step S73). The combination generation unit 12 combines the distributed expressions of the key words of the similar words to generate a plurality of distributed expression patterns (step S74). For each pattern, the distributed expression concatenation unit 13 concatenates the distributed expressions of the key and the similar words of the key with the distributed expressions of the value and the similar words of the value, respectively, generates an additional concatenated distributed expression, and outputs it to the learning data update unit 18 (step S75).

The learning data updating unit 18 generates additional learning data including the additional connected distributed expression and the new type number issued in step S69, and outputs it to the type classification unit 15 (step S76). The learning unit of the type classification unit 15 performs re-learning using additional learning data to update the classifier, and notifies the distributed expression concatenation unit 13 of the completion of re-learning (step S77).

The distributed expression concatenation unit 13 returns to step S65 and inputs the concatenated distributed representation of the information input in step S61 again to the retrained classifier of the type classification unit 15. The classifier outputs the confidence level for each type (step S66). The type determining unit 16 determines the type of the input information using the confidence output from the updated classifier.

Note that if the maximum value of the average certainty is a negative value and the synonym conversion unit 21 has not performed synonym transformation, the type determination unit 16 instructs the synonym conversion unit 21 to perform the synonym conversion. The synonym conversion unit 21 converts each word of the key in the input information into a synonym, and inputs the converted key and the unconverted value to the character string division unit 10, thereby performing the processing from step S62 onwards.

Next, with reference to FIG. 10, the operation when the value of input information cannot be expressed in a distributed manner will be described.

The middle B system 3 inputs information including a key and a value to the character string dividing unit 10 of the information identification device 1 (step S61). The character string dividing unit 10 divides the information into words (step S62). Here, "telephone number: 090-1234-5678" (key:value) is input and is divided into "telephone", "number", and "090-1234-5678".

The distributed expression generation unit 11 attempts to generate a distributed expression for each divided word, and an error occurs for any word that cannot be converted into a distributed expression. Here, the distributed expression generation unit 11 outputs "090-1234-5678" (value), which cannot be converted into a distributed expression, to the regular expression determination unit 20 (step S81).

When the value is input, the regular expression determining unit 20 requests the type holding unit 19 to provide all pairs of regular expressions and type numbers registered in the type holding unit 19, and obtains them (steps S82 and S83). Then, the regular expression determining unit 20 determines which of the obtained regular expression patterns matches the character string of the value input in step S82.

In the case of a match with any regular expression pattern, the regular expression determination unit 20 determines that the type of information input in S61 is the type of the regular expression of the matched pattern. Then, the regular expression determining unit 20 transmits the identification result including the information input in S61 and the type of the matched pattern to the First B system 5 (Step S84).

On the other hand, if it does not match any regular expression pattern, the regular expression determining unit 20 transmits an error indicating that the type cannot be specified to the middle B system 3 (step S85).
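The regular expression determination can be sketched as follows. The type numbers, the patterns, and the names `REGISTERED`, `TypeNotIdentified`, and `determine_by_regex` are all hypothetical; the exception stands in for the error sent in step S85.

```python
import re

# Assumed (type number -> pattern) pairs; numbers and patterns are illustrative.
REGISTERED = {
    101: re.compile(r"〒\d{3}-\d{4}"),           # postal code
    102: re.compile(r"0\d{1,3}-\d{2,4}-\d{4}"),  # telephone number
}


class TypeNotIdentified(Exception):
    """Raised when no registered pattern matches (cf. step S85)."""


def determine_by_regex(value: str, registered=REGISTERED) -> int:
    """Return the type number of the first pattern matching the whole value."""
    for type_number, pattern in registered.items():
        if pattern.fullmatch(value):
            return type_number
    raise TypeNotIdentified(value)


print(determine_by_regex("090-1234-5678"))
try:
    determine_by_regex("not-a-known-format")
except TypeNotIdentified as error:
    print("type cannot be specified:", error)
```

If several registered patterns could match the same value, this sketch simply returns the first one in registration order; the text does not say how such ties are resolved.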

<Classifier confidence>
Next, the confidence for each type output by the classifier (SVM) of the type classification unit 15 will be explained. The confidence is calculated using the boundary surface for each class of a classifier that performs multiclass classification, the distance between the input information and the boundary surface, and whether the input information is on the positive or negative side when viewed from the boundary surface. That is, the classifier calculates a positive or negative confidence from the distance between the boundary surface and the input information.

As a premise, in order to determine whether a type to be assigned using confidence exists, it is necessary that there are three or more types that can be identified by the classifier. If there is only one type that can be identified, the reliability cannot be calculated because boundaries cannot be drawn. Furthermore, if there are two types that can be identified, it cannot be determined that the type cannot be identified because the certainty factor for one of the types is always positive.

FIG. 11 is an explanatory diagram for explaining a method of calculating the distance between a boundary surface (hyperplane) and a point. If the foot of the perpendicular drawn from the point X($\bar{x}$) to the boundary surface is H($h$), the following vector is perpendicular to the boundary surface, so it is parallel to the normal vector $w$ of the boundary surface.

$$\overrightarrow{XH} = h - \bar{x}$$

Therefore, it can be expressed as follows using a real number $k$.

$$h - \bar{x} = k\,w$$

Here, since H($h$) is a point on the boundary surface $w^{T}x + b = 0$, the following equation holds true.

$$w^{T}(\bar{x} + k\,w) + b = 0 \quad\Longrightarrow\quad k = -\frac{w^{T}\bar{x} + b}{\lVert w \rVert^{2}}$$

Therefore, the distance $d$ between the desired point and the boundary surface is calculated using the following formula.

$$d = \lVert h - \bar{x} \rVert = \lvert k \rvert\,\lVert w \rVert$$

Therefore, the following formula holds true. That is, the distance between the boundary surface $w^{T}x + b = 0$ in each class of the classifier that performs multi-class classification and the input information X($\bar{x}$) is expressed by the following equation.

$$d = \frac{\lvert w^{T}\bar{x} + b \rvert}{\lVert w \rVert}$$
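As a small numerical illustration of the distance between a point and a hyperplane, the sketch below computes the signed quantity (w·x + b) / ‖w‖, whose absolute value is the distance and whose sign indicates the side of the boundary. Treating this signed value directly as the confidence is an assumption of the sketch, and the boundary and points are toy values.

```python
import math
from typing import List


def signed_distance(w: List[float], b: float, x: List[float]) -> float:
    """(w.x + b) / ||w||: positive on the side the normal vector w points to."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Toy boundary 3*x1 + 4*x2 - 5 = 0, for which ||w|| = 5.
w, b = [3.0, 4.0], -5.0
print(signed_distance(w, b, [3.0, 4.0]))  # (9 + 16 - 5) / 5
print(signed_distance(w, b, [0.0, 0.0]))  # (0 + 0 - 5) / 5
```

A point on the positive side yields a positive value and a point on the negative side a negative one, which is the property the type determination relies on.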

The information identification device 1 of this embodiment described above includes: a character string dividing unit 10 that divides learning information including a key and a value into words; a distributed expression generation unit 11 that generates a distributed expression of each word; a combination generation unit 12 that generates a plurality of distributed expression patterns by combining the distributed expressions of the words of the key; a distributed expression concatenation unit 13 that generates, for each pattern, a concatenated distributed expression by combining the distributed expressions of the key words of the pattern and the distributed expressions of the value words; a learning data generation unit 14 that generates learning data including the concatenated distributed expression for each pattern and the type number of the type corresponding to the learning information; and a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value as one of the types.

In addition, the information identification device 1 of the present embodiment includes a type determination unit 16 that inputs the concatenated distributed representation for each pattern generated from the input information to the classifier and, using the confidence that the classifier outputs for each type for each pattern, determines whether the type of the input information is one of the existing types or a new type.

As a result, as shown in FIG. 12, in this embodiment, even if the existing systems of each company input the same type of information in different formats, it can be identified as the same type of information. This makes it possible to automate the conversion from the data format of the distribution source system to the data format of the distribution destination system. Specifically, when simply inputting a combination of distributed expressions of key and value to a classifier, the duplicate words (suffixes, prefixes, etc.) included in the key have a large impact on identification, making correct identification difficult. However, in this embodiment, the combination generation unit 12 can reduce the influence and improve the identifiability.

In addition, in this embodiment, even if the key of the input information includes a prefix or suffix, identification accuracy can be improved because the combination generation unit 12 combines the distributed expressions of the words of the key to generate a plurality of distributed expression patterns.

Additionally, this embodiment includes a synonym conversion unit 21 that converts each word of the key of input information into a synonym. As a result, in this embodiment, it is possible to improve the accuracy of identifying input information that includes synonyms and polysemous words. Specifically, even if synonyms have a low degree of similarity and are difficult to determine as the same type using only the sum of the distributed representations of key and value, the type can be determined correctly by filling the gap between the meaning of the word and the similarity of the distributed representations.

Furthermore, in this embodiment, when unknown information that cannot be identified by the classifier is input, it is correctly identified as information of a new type, training data for the new type is generated, and the classifier is retrained. This allows the classifier after retraining to classify new types, thereby enabling automatic information identification during information distribution.

For the information identification device 1 described above, a general-purpose computer system as shown in FIG. 13 can be used, for example. The illustrated computer system includes a CPU (Central Processing Unit) 901, a memory 902, a storage 903 (HDD: Hard Disk Drive, SSD: Solid State Drive), a communication device 904, an input device 905, and an output device 906. Memory 902 and storage 903 are storage devices. In this computer system, the functions of the information identification device 1 are realized by the CPU 901 executing a predetermined program loaded onto the memory 902.

The information identification device 1 may be implemented by one computer or by multiple computers. Further, the information identification device 1 may be a virtual machine implemented in a computer. The program of the information identification device 1 can be stored in a computer-readable recording medium such as an HDD, SSD, USB (Universal Serial Bus) memory, CD (Compact Disc), or DVD (Digital Versatile Disc), or can be distributed via a network.

Note that the present invention is not limited to the above-described embodiments, and many modifications can be made within the scope of the invention. For example, in this embodiment, the determination unit B includes the synonym conversion unit 21. However, the data generation unit A may include the synonym conversion unit 21. In this case, as described in the present embodiment, when the highest average value of the confidence output by the classifier is negative, the synonym conversion unit 21 may receive an instruction from the type determination unit 16 and convert each word of the key into a synonym. Alternatively, the synonym conversion unit 21 may convert the key words of the information input to the information identification device 1 into synonyms from the beginning of the identification phase. Specifically, the character string dividing unit 10 may divide the input information into words and input the words to the synonym conversion unit 21, which converts each word of the key into a synonym and outputs the result to the distributed expression generation unit 11. The subsequent processing is similar to the embodiment.

1: Information identification device
10: Character string dividing unit (dividing unit)
11: Distributed expression generation unit
12: Combination generation unit
13: Distributed expression concatenation unit
14: Learning data generation unit
15: Type classification unit
16: Type determination unit (determination unit)
17: Similar word extraction unit
18: Learning data update unit
19: Type holding unit
20: Regular expression determination unit
3: Middle B system
5: First B system
7: Operator terminal

Claims

a dividing unit that divides learning information including a key and a value into words;
a distributed expression generation unit that generates a distributed expression of each word;
a combination generation unit that generates a plurality of distributed expression patterns by combining the distributed expressions of the key words;
a distributed expression concatenation unit that generates a concatenated distributed expression by combining, for each of the patterns, the distributed expression of the key of the pattern and the distributed expression of the value word;
a learning data generation unit that generates learning data including a connected distributed representation for each pattern and a type number of a type corresponding to the learning information;
An information identification device, comprising: a learning unit that performs machine learning on the learning data to generate a classifier for identifying input information including a key and a value into any type.
The information identification device according to claim 1, wherein the combination generation unit generates the pattern by combining distributed representations of each word of the key from the front and back.
The information identification device according to claim 1, wherein a concatenated distributed representation for each pattern generated from the input information is input to the classifier, and
the information identification device further comprises a determination unit that determines whether the type of the input information is one of the existing types or a new type, using the confidence that the classifier outputs for each type for each pattern.
The information identification device according to claim 3, further comprising a synonym conversion unit that converts each word of the key of the input information into a synonym,
wherein the determination unit calculates, for each type, the average value of the confidence output from the classifier for each pattern; when the highest of the average values is positive, determines the type of the input information to be the type with the highest value; and when the highest value is negative, determines whether the type of the input information is one of the existing types or a new type, using converted input information in which each word of the key is replaced with the synonym produced by the synonym conversion unit.
The information identification device according to claim 1, wherein the distributed representation concatenation unit calculates, as the concatenated distributed representation, the sum of the distributed representation of a pattern and the distributed representation of each word of the value.
An information identification device comprising:
a dividing unit that divides input information including a key and a value into words;
a distributed representation generation unit that generates a distributed representation of each word;
a combination generation unit that generates a plurality of patterns of distributed representations by combining the distributed representations of the words of the key;
a distributed representation concatenation unit that generates, for each of the patterns, a concatenated distributed representation by combining the distributed representation of the key of the pattern and the distributed representations of the words of the value;
a classifier that outputs, for each concatenated distributed representation, a confidence for each type; and
a determination unit that determines whether the type of the input information is one of the existing types or a new type, using the confidence that the classifier outputs for each type for each pattern.
An information identification method performed by an information identification device, the method comprising:
dividing learning information including a key and a value into words;
generating a distributed representation of each word;
generating a plurality of patterns of distributed representations by combining the distributed representations of the words of the key;
generating, for each of the patterns, a concatenated distributed representation by combining the distributed representation of the key of the pattern and the distributed representations of the words of the value;
generating learning data including the concatenated distributed representation of each pattern and a type number of the type corresponding to the learning information; and
performing machine learning on the learning data to generate a classifier that identifies input information including a key and a value as one of the types.
A program that causes a computer to function as the information identification device according to any one of claims 1 to 6.
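
As an illustration only, the pattern generation from the front and back of the key and the summation used for the concatenated distributed representation, both recited in the claims above, might be sketched in Python as follows. The claims do not fix these details, so the sketch rests on several assumptions: a pattern is taken to be either the first k or the last k words of the key, the words inside a pattern are combined by element-wise summation, all vectors have the same dimension, and the value contains at least one word. All function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def vec_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def front_back_patterns(key_vectors: List[Vector]) -> List[List[Vector]]:
    """Assumed reading of 'from the front and back': every prefix of the key
    words, followed by every proper suffix (the full key is listed once)."""
    n = len(key_vectors)
    patterns = [key_vectors[:k] for k in range(1, n + 1)]   # first k words
    patterns += [key_vectors[n - k:] for k in range(1, n)]  # last k words
    return patterns


def concatenated_representations(key_vectors: List[Vector],
                                 value_vectors: List[Vector]) -> List[Vector]:
    """For each pattern, sum the pattern's key vectors and all value vectors."""
    value_sum = vec_sum(value_vectors)
    return [vec_sum([vec_sum(pattern), value_sum])
            for pattern in front_back_patterns(key_vectors)]


def learning_data(key_vectors: List[Vector], value_vectors: List[Vector],
                  type_number: int) -> List[Tuple[Vector, int]]:
    """Pair each concatenated representation with the type number."""
    return [(rep, type_number)
            for rep in concatenated_representations(key_vectors, value_vectors)]
```

Under these assumptions a key of three words yields five patterns, so a single item of learning information contributes five rows of learning data that share the same type number.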