WO2015170370A1

WO2015170370A1 - Labeling device and labeling method

Info

Publication number: WO2015170370A1
Application number: PCT/JP2014/062216
Authority: WO
Inventors: 真岩山
Original assignee: 株式会社日立製作所
Priority date: 2014-05-07
Filing date: 2014-05-07
Publication date: 2015-11-12

Abstract

The purpose of the present invention is to provide a labeling device whereby it is possible to effectively utilize training data used to learn how to label a character string. A labeling device according to the present invention is provided with first and second discriminators which store respective learning results obtained using the same training data, wherein if these discriminators give different labels, then one of the labels is preferentially selected according to the type of labels (see Fig. 7).

Description

Labeling apparatus and labeling method

The present invention relates to a technique for assigning a label to character string data.

There are attempts to extract and utilize various kinds of information, such as date, place, part, and state, from text. For example, if the part “Right rear air cylinder” and the state “damaged” can be extracted from the maintenance work log “Right rear air cylinder was damaged at the time of work”, it becomes possible to analyze the fragility of a specific part from a huge volume of maintenance work logs.

In recent years, machine learning is often used to extract information from sentences. In machine learning, a sentence in which information has already been extracted is given as training data, and a model (discriminator) is generated by learning the characteristics of the information to be extracted from the training data. Desired information is extracted from a new sentence using this model.

In machine learning, if a sufficient amount of training data is prepared, a highly accurate model can be learned. However, enormous human costs are required to create training data. Therefore, it is required to use a small amount of training data efficiently.

In the following Patent Document 1, a plurality of models are learned from the same training data using a plurality of machine learning methods having different features, and the same training data is effectively utilized by integrating the identification results of the models. Each machine learning method has its own strengths and weaknesses, but it is thought that integrating their results allows the methods to complement each other. Further, by applying a plurality of machine learning methods to the same training data, a single set of training data can be reused.

Patent Document 1: JP 2006-330935 A

From the viewpoint of efficiently using training data, it is desirable to learn multiple models using the same training data as much as possible. Moreover, even when integrating the results of labeling using a plurality of models, it is desirable to achieve this without using new training data.

In Patent Document 1, even though the same training data can be reused to obtain a plurality of models, a new model must be learned in order to integrate the identification results of those models, and new training data is required at that time.

The present invention has been made in view of the problems described above, and an object of the present invention is to provide a labeling apparatus that can efficiently use the training data used for learning how to label a character string.

The labeling apparatus according to the present invention includes first and second discriminators that store the results of learning from the same training data, and when the discriminators assign different labels, it preferentially selects one of them according to the type of the label.

According to the labeling apparatus of the present invention, the same training data can be used both to learn a plurality of models and to integrate the labeling results of those models. Therefore, training data can be used efficiently.

FIG. 1 is a block diagram of the labeling apparatus 100 according to Embodiment 1.
FIG. 2 is a diagram showing an example of a screen 20 for operating the labeling apparatus 100.
FIG. 3 is a diagram showing an example of the discriminator selection screen 30 that the labeling apparatus 100 displays when the discriminator selection button 204 is pressed.
FIG. 4 is a diagram explaining the IO format.
FIG. 5 is a diagram explaining the outline of sequence labeling.
FIG. 6 is a diagram explaining the outline of classification labeling.
FIG. 7 is a diagram explaining a method of mutually complementing the labeling result by classification and the labeling result by sequence labeling.
FIG. 8 is a process flowchart of the labeling apparatus 100.
FIG. 9 is a diagram explaining the IPO format.
FIG. 10 is a diagram showing examples 1004 to 1007 of label conversion functions.
FIG. 11 is a diagram showing an example in which labels are assigned by linking the IPO format and the IO format.
FIG. 12 is a process flowchart of the labeling apparatus 100 according to Embodiment 3.

<Embodiment 1>
FIG. 1 is a configuration diagram of the labeling apparatus 100 according to Embodiment 1 of the present invention. The labeling apparatus 100 includes a CPU (Central Processing Unit) 101, a memory 102, a keyboard / mouse 103, a display 104, a secondary storage device 105, a control unit 109, a token dividing unit 110, a discriminator learning unit 111, a label conversion discriminator learning unit 112, an identification unit 113, a score integration unit 114, a label conversion unit 115, and a data communication unit 116.

The CPU 101 performs various processes by executing programs. The memory 102 temporarily stores a program executed by the CPU 101 and the data necessary for executing it. The keyboard / mouse 103 receives input from the user. The display 104 displays an input / output screen. The secondary storage device 105 is a storage device such as a hard disk, and persistently stores the training data 106, the discriminator 107, and the label conversion discriminator 108.

The control unit 109 controls each functional unit. The token dividing unit 110 divides text into tokens. The discriminator learning unit 111 learns the discriminator 107 using the training data 106. The label conversion discriminator learning unit 112 learns the label conversion discriminator 108 using the training data 106. The identification unit 113 gives a label to each token in the character string data using the discriminator 107. The score integration unit 114 integrates the certainty of label assignment and the certainty of label conversion. The label conversion unit 115 converts labels using the label conversion discriminator 108. Label conversion and scores will be described later. The data communication unit 116 is an interface that performs data communication via the network 117, and is, for example, a LAN card capable of communicating using the TCP/IP protocol.

FIG. 2 is a diagram illustrating an example of a screen 20 for operating the labeling apparatus 100. The discriminator learning unit 111 and the label conversion discriminator learning unit 112 learn the discriminator 107 and the label conversion discriminator 108 from the training data 106 in advance. The identification unit 113 extracts information from an arbitrary sentence using these learning results. In the example shown in FIG. 2, the part “right rear air cylinder” and the state “damaged” are extracted from the sentence “the right rear air cylinder was damaged during work”, and the tags <part> and <state> are embedded. In the present invention, such extraction processing is called labeling. A label <part> corresponding to the part is assigned to the character string “right rear air cylinder”, and a label <state> corresponding to the state is assigned to the character string “damaged”. A discriminator may be provided individually for each label, or a single discriminator that has comprehensively learned all labels may be provided.

The user of the labeling apparatus 100 enters a sentence to be labeled in the character string input field 201. It is also possible to select a text file by pressing the input button 202 and display the contents of the text file in the character string input field 201.

Next, the user presses the discriminator selection button 204 to select the discriminator used for labeling. When the discriminator has been selected and the discriminator button 205 is then pressed, the identification unit 113 gives labels to the text entered in the character string input field 201 using the selected discriminator. The assigned labels are embedded as tags in the text of the character string input field 201. The user can save the labeling result by pressing the output button 203.

FIG. 3 is a diagram showing an example of the discriminator selection screen 30 displayed by the labeling apparatus 100 when the discriminator selection button 204 is pressed. The label formats 301, 302, and 303 are variations of the label format, and three types are provided in FIG. 3. Details of the label format will be described later. The discriminator selection screen 30 further provides two machine learning methods, classification 304 and sequence labeling 305, for each label format. The user selects any of the six combinations of label format and learning method by checking the corresponding check box 306. A discriminator 107 can be provided for each of these combinations, or a discriminator 107 that has comprehensively learned several of them can be provided. In the example shown in FIG. 3, it is assumed that two discriminators are selected, both with the IO label format, one using classification and the other using sequence labeling as the machine learning method.

Hereinafter, a method for integrating machine learning results based on sequence labeling and classification will be described by taking the case where the label format is IO as an example.

FIG. 4 is a diagram for explaining the IO format. In FIG. 4, the text “Right rear air cylinder was damaged during work” has been input, and the character string corresponding to the part has been labeled in the IO format. Enclosing a character string with the tag <PART> indicates that the character string is a part (403). In the IO format, the label I or O indicates whether each token is inside a part (I) or outside it (O). The token dividing unit 110 divides an input character string into tokens using a known morphological analysis technique, and outputs a token string 401. A token is a word. The identification unit 113 gives a label to each token of the token string 401 and outputs a label string 402.
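The relationship between the label string 402 and the tagged text 403 can be illustrated with a minimal Python sketch. The function name `embed_tags`, the English tokens, and the use of a space as the token separator are assumptions made for illustration only; the specification does not define an implementation.

```python
from itertools import groupby


def embed_tags(tokens, labels, tag="PART", sep=" "):
    """Wrap each run of consecutive I-labeled tokens in <tag>...</tag>.

    tokens and labels are parallel lists; labels contain "I" or "O".
    """
    pieces = []
    # groupby collects consecutive tokens that share the same label.
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        text = sep.join(token for token, _ in group)
        pieces.append(f"<{tag}>{text}</{tag}>" if label == "I" else text)
    return sep.join(pieces)


tokens = ["right", "rear", "air", "cylinder", "was", "damaged"]
labels = ["I", "I", "I", "I", "O", "O"]
tagged = embed_tags(tokens, labels)
# tagged == "<PART>right rear air cylinder</PART> was damaged"
```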

The task that extracts information from the input character string can be regarded as a task that assigns a label (I or O in the case of IO format) for each token. In order to assign a label to a token, the first embodiment considers two methods, classification and sequence labeling. Since each of them has advantages and disadvantages, in the first embodiment, they are complemented with each other to perform labeling with high accuracy.

FIG. 5 is a diagram for explaining the outline of sequence labeling. In sequence labeling, a label is assigned to each token while scanning the token string in word order (from left to right in FIG. 5). FIG. 5 shows a situation in which a label is assigned to the token 503 “rear part”.

In sequence labeling, the label of the target token is determined using information on the target token and on the two tokens before and after it. Various types of information are used for labeling. First, information on the token itself (the character string itself and its part of speech) can be used. In addition, the labels of tokens that have already been scanned can be used as information unique to sequence labeling. In the example shown in FIG. 5, by the time “rear part” is scanned, the tokens “work”, “hour”, “in”, “,” and “right” have already been scanned and their labels determined. Therefore, for the two preceding tokens “,” and “right”, the already determined labels are also used as information for label assignment.

The information collected as described above can be expressed as a multidimensional vector 506. A point 508 is obtained by plotting the vector 506 in the multidimensional vector space 507 (schematically described as a two-dimensional plane in FIG. 5). By determining whether the point 508 belongs to the I region 510 or the O region 511 in the space 507, it is determined whether to give I or O to the “rear part”. The hyperplane 509 that divides the two regions is learned by the discriminator learning unit 111 using the training data 106, and the learning result is stored in the discriminator 107. As a learning method, a known technique such as a support vector machine can be used. Specifically, each token in the training data 106 is expressed in a vector format in the same manner, and the hyperplane 509 is determined so that the correct label attached to the token can be discriminated with the highest accuracy.
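As a rough illustration of the vector 506 and the hyperplane 509, the following Python sketch builds a sparse feature vector from a window of two tokens on each side, optionally adds the labels of the two preceding tokens, and decides between I and O from the sign of a linear function. The feature names, the hand-set weights, and the function names are hypothetical; in the apparatus the boundary would be learned from the training data 106, for example by a support vector machine.

```python
def features(tokens, labels, n, use_prev_labels=True):
    """Collect sparse features for token n.

    tokens: list of (surface, part_of_speech) pairs.
    labels: labels already decided for tokens 0..n-1.
    """
    feats = {}
    for off in range(-2, 3):                 # target token plus two on each side
        i = n + off
        if 0 <= i < len(tokens):
            surface, pos = tokens[i]
            feats[f"w[{off}]={surface}"] = 1.0
            feats[f"pos[{off}]={pos}"] = 1.0
    if use_prev_labels:                      # information unique to sequence labeling
        for off in (-2, -1):
            i = n + off
            if 0 <= i < len(labels):
                feats[f"label[{off}]={labels[i]}"] = 1.0
    return feats


def decide(weights, bias, feats):
    """Return (label, margin); the sign of the margin gives the side of the hyperplane."""
    margin = bias + sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return ("I" if margin >= 0 else "O"), margin


# Hypothetical hand-set weights, for illustration only.
weights = {"label[-1]=I": 1.0, "w[0]=rear": 0.5}
tokens = [("right", "noun"), ("rear", "noun")]
label, margin = decide(weights, -1.0, features(tokens, ["I"], 1))
```

Calling `features` with `use_prev_labels=False` gives the kind of vector used for classification labeling, which omits the previously assigned labels.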

The advantage of sequence labeling is that the label determined immediately before is used when determining the next label. In a sentence in which several consecutive tokens are all labeled I, such as a noun phrase representing a part, if the immediately preceding token is I, the next token is also I with relatively high probability. This tendency is also pronounced for O. Sequence labeling is effective for such text.

On the other hand, in sequence labeling, information on the token itself is relatively undervalued. For example, the word “rear part” itself tends to be part of a part name, but in sequence labeling the fact that I was assigned to the immediately preceding “right” becomes the major factor in giving I to “rear part”. The same applies when learning the discriminator 107: for a token such as “rear part” in FIG. 5, the feature that the word itself tends to be part of a part name is buried and is not learned.

FIG. 6 is a diagram for explaining the outline of classification labeling. Labeling can also be solved as a classification problem. The difference between classification and sequence labeling is that, when a label is assigned to a token (for example, “rear part”), the already determined labels of the preceding tokens are not used. Therefore, the vector shown in FIG. 6 does not include earlier labeling results. Because of this, in classification labeling the feature that the token itself (“rear part”) tends to be part of a part name is learned relatively prominently.

The disadvantage of classification is that, unlike sequence labeling, it does not use the already determined token labels. As already described, the token immediately after an I is I with high probability, and the token immediately after an O is O with high probability. Since such information is not used in classification, the overall extraction accuracy may be lowered.

As explained above, the advantage of sequence labeling is the disadvantage of classification and vice versa, so the overall labeling accuracy can be expected to improve if each method compensates for the weakness of the other. The specific method is described below.

FIG. 7 is a diagram for explaining a method of mutually complementing the labeling result by classification and the labeling result by sequence labeling. The label string 701 was labeled by classification, and the label string 702 was labeled by sequence labeling.

For the token “right”, label I is assigned by classification (703) and label O is assigned by sequence labeling (704). Classification gives I to “right” (label 703) because it has learned the feature that “right” tends to be part of a part name. Sequence labeling gives O (label 704) because it has not learned this feature well. Since classification readily extracts a part using information on the token itself, the identification unit 113 trusts the classification result and preferentially selects the label 703 for “right”.

Next, the process of assigning a label to “rear part” will be described. In sequence labeling, since label I has been given to the preceding token “right”, label I is also given to the following “rear part” (706). In classification, the feature that “rear part” tends to be part of a part name is weak, so label O is given to “rear part” (705). Since sequence labeling readily extracts a part using information on the preceding token, the identification unit 113 trusts the sequence labeling result and preferentially selects the label 706.

As described above, when classification and sequence labeling produce different labeling results, each method's weakness can be compensated for by giving priority to the result that assigned I and making the other result match it. In an actual experiment on an information extraction task, for input character strings on which classification alone achieved a correct answer rate of 54% and sequence labeling alone also achieved 54%, the above method achieved a correct answer rate of 62%.

The above description states that the identification unit 113 preferentially selects either the labeling result by classification or the labeling result by sequence labeling. Before subsequent tokens are labeled, however, the labeling result that was not selected must be converted into the one that was selected. This conversion may be performed by the identification unit 113 or by the label conversion unit 115.

FIG. 8 is a process flowchart of the labeling apparatus 100. Hereinafter, each step of FIG. 8 will be described.

(FIG. 8: Step S801)
The user of the labeling apparatus 100 designates the information to be input to the labeling apparatus 100. T is the token string of the input character string, that is, an array storing the result of the token dividing unit 110 dividing the text entered in the character string input field 201 into tokens. Each token is internally a pair of “character string” and “part of speech”. m_classification is the identification function for classification and m_sequence is the identification function for sequence labeling; each is obtained by the discriminator learning unit 111 learning from the training data 106 and storing the result in the discriminator 107. Each identification function takes as input the token string (T), the label string assigned so far (L_classification or L_sequence), and the token to be identified (tn), and constructs a vector for the identification target token. It then outputs the corresponding label (I or O in the example described with reference to FIGS. 4 to 7) according to the positional relationship between that vector and the boundary surface.

(FIG. 8: Steps S802 to S803)
The identification unit 113 initializes the label string L_classification[] for classification and the label string L_sequence[] for sequence labeling (S802). The identification unit 113 then performs steps S804 to S808 on each token tn in word order (S803).

(FIG. 8: Steps S804, S805, S806)
The identification unit 113 assigns a label to the target token tn using each identification function (S804). When label I is given by classification and label O is given by sequence labeling (S805), the label conversion unit 115 converts the label O given by sequence labeling to label I (S806).

(FIG. 8: Steps S805, S807, S808)
When, in step S804, label O is given by classification and label I is given by sequence labeling (S807), the label conversion unit 115 converts the label O given by classification to label I (S808).

(FIG. 8: Step S809)
The identification unit 113 outputs the label string L_classification[] for classification and the label string L_sequence[] for sequence labeling. At this point the two label strings are identical. The identification unit 113 embeds tags in the sentence in the character string input field 201 according to the label strings.
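The flow of FIG. 8 can be sketched in Python as follows. This is a minimal sketch assuming the binary IO format, so that any disagreement means exactly one side assigned I. The two toy identification functions and their word lists are hypothetical stand-ins for the learned functions m_classification and m_sequence, chosen only to reproduce the situation of FIG. 7.

```python
def label_with_priority(tokens, m_classification, m_sequence, preferred="I"):
    """Label tokens with two identification functions, resolving conflicts.

    Each function receives (tokens, labels_so_far, n) and returns "I" or "O".
    """
    l_classification, l_sequence = [], []        # S802
    for n in range(len(tokens)):                 # S803
        a = m_classification(tokens, l_classification, n)   # S804
        b = m_sequence(tokens, l_sequence, n)
        if a != b:                               # S805 / S807
            a = b = preferred                    # S806 / S808: align both on I
        l_classification.append(a)
        l_sequence.append(b)
    return l_classification, l_sequence          # S809


# Toy stand-ins (assumptions, not learned models).
def toy_classification(tokens, labels, n):
    # Looks only at the token itself.
    return "I" if tokens[n] in {"right", "air", "cylinder"} else "O"


def toy_sequence(tokens, labels, n):
    # Continues a part only if the previous label was I.
    follows_part = bool(labels) and labels[-1] == "I"
    return "I" if follows_part and tokens[n] in {"rear", "air", "cylinder"} else "O"


tokens = ["during", "work", ",", "right", "rear", "air", "cylinder", "was", "damaged"]
l_cls, l_seq = label_with_priority(tokens, toy_classification, toy_sequence)
```

With these toy functions, classification alone would miss “rear” and sequence labeling alone would never start a part, whereas the combined procedure labels all four tokens of the part as I.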

<Embodiment 1: Summary>
As described above, when the labeling result by classification labeling and the labeling result by sequence labeling differ from each other, the labeling apparatus 100 according to the first embodiment preferentially selects the result in which the label I representing the part was assigned. As a result, even if the discriminator learned for classification labeling and the discriminator learned for sequence labeling from the same training data have different characteristics, they complement each other's advantages and accuracy can be improved. Moreover, the same training data can be utilized efficiently.

<Embodiment 2>
In the first embodiment, the case where the learning method is different in the same label format (IO format in the example described in the first embodiment) has been described. In the first embodiment, when converting the labeling result by one discriminator to the other, the label is simply copied as it is. In Embodiment 2 of the present invention, an example in which one label assignment result is converted to the other when the label formats are different will be described. Since the configuration of the label applying apparatus 100 is the same as that of the first embodiment, label conversion will be mainly described below.

In order to convert one label assignment result to the other when the label format is different, it is necessary to know which label the other corresponds to. In addition, since a certain label cannot always be converted to a certain other label, it is necessary to consider the cost for label conversion. Furthermore, it is necessary to consider the case where the correspondence is one-to-many.

FIG. 9 is a diagram for explaining the IPO format. In the second embodiment, a label format called IPO format is considered in addition to the IO format already described. Reference numeral 901 is an example of the IPO format, and reference numeral 902 is an example of the IO format shown for comparison. In the IPO format, a label P indicates the main part of the part. That is, the “right rear air cylinder” is divided into a “right rear part” indicating the location of the part and an “air cylinder” which is the part itself, and the label P is given to the latter. Thus, by giving the label P to a specific part, the part can be clearly learned and identified.

In order to associate one label assignment result with the other when the label formats are different, the label assigned on the one side must be converted and assigned to the other. For example, the token to which I is assigned in the IO format represents only that it is a part of the part, and it is not known whether or not the token is a main part. Therefore, when converting I in the IO format to the IPO format, there are two conversion candidates I or P. In the second embodiment, such a relationship is expressed by a label conversion function.

FIG. 10 is a diagram showing examples 1004 to 1007 of label conversion functions. The first argument of the label conversion function is the label format of the conversion source, the second argument is the label format of the conversion destination, the third argument is the label of the conversion source, and the fourth argument is the label of the conversion destination. The return value of the label conversion function is the certainty of the conversion. The label conversion function 1004 indicates that the certainty that the IPO format label I can be converted to the IO format I is 1.0.

The label conversion discriminator 108 is a set of label conversion functions. The label conversion discriminator learning unit 112 learns the label conversion functions using the training data 106 and stores them in the label conversion discriminator 108. The label conversion discriminator 108 can be constructed based on the overlap of character strings in the training data 106. Hereinafter, an outline of a method for learning the label conversion discriminator 108 will be described.

Reference numeral 1001 in FIG. 10 shows an example of the correspondence between the IO format and the IPO format in the training data 106. When the same sentence is labeled in a plurality of label formats in this way, label conversion functions can be created by aggregating the portions that correspond to each other. Even when the sentences do not correspond completely, label conversion functions can be created by automatically extracting the likely corresponding portions with a known technique based on dynamic programming and counting those partial correspondences. Hereinafter, a method for creating label conversion functions will be described using the example 1001, in which the correspondence is complete.

The label conversion discriminator 108 creates a table 1002 by summing up the correspondence between the IO format and the IPO format. A row 1003 is a result of tabulating corresponding IPO format labels for the four locations to which I is assigned in the IO format. The example indicated by the row 1003 indicates that I is given at two places on the “right” and “rear part” and P is given at two places on the “air” and “cylinder”.

The label conversion functions 1004 to 1007 represent the table 1002 as functions. From the row 1003, it can be seen that two of the four Is in the IO format correspond to I in the IPO format. This relationship can be expressed by the function 1006. In the second embodiment, the return value of the label conversion function is a simple relative frequency (2/4 in this example).
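The tabulation into relative frequencies can be sketched in Python as follows, assuming the two label strings are already aligned token by token (the fully corresponding case). The names `learn_conversion` and `make_conversion_function` and the dictionary layout are assumptions for illustration; only the four-argument form of the label conversion function and the use of relative frequency come from the description.

```python
from collections import Counter, defaultdict


def learn_conversion(src_labels, dst_labels):
    """Count aligned label pairs and turn each row into relative frequencies."""
    counts = defaultdict(Counter)
    for src, dst in zip(src_labels, dst_labels):
        counts[src][dst] += 1
    return {
        src: {dst: c / sum(row.values()) for dst, c in row.items()}
        for src, row in counts.items()
    }


def make_conversion_function(tables):
    """Build f(src_format, dst_format, src_label, dst_label) -> certainty."""
    def f(src_format, dst_format, src_label, dst_label):
        row = tables.get((src_format, dst_format), {}).get(src_label, {})
        return row.get(dst_label, 0.0)
    return f


# "right rear air cylinder was damaged" labeled in both formats.
io_labels = ["I", "I", "I", "I", "O", "O"]
ipo_labels = ["I", "I", "P", "P", "O", "O"]

convert = make_conversion_function({
    ("IO", "IPO"): learn_conversion(io_labels, ipo_labels),
    ("IPO", "IO"): learn_conversion(ipo_labels, io_labels),
})
certainty = convert("IO", "IPO", "I", "P")   # 2 of the 4 IO-format Is map to P
```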

FIG. 11 is a diagram illustrating an example in which a label is assigned by linking the IPO format and the IO format. Here, the learning method may be either sequence labeling or classification. The labeling process will be described below using the same sentence example as in the first embodiment.

Step S1101 shows a state in which the identification unit 113 gives a label to the token “air”. A label 1106 is a label given in the IPO format, and a label 1104 is a label given in the IO format.

When assigning labels in the second embodiment, two types of certainty factor are considered. The first certainty factor concerns the label given immediately before the target token (in this example, the label 1105 given immediately before “air”). It corresponds to the conversion certainty when that preceding label was obtained by conversion from another label format, and takes the maximum value (for example, 1.0) when the preceding label was not converted. In the example shown in FIG. 11, it is assumed that the label 1105 has not been converted, so the first certainty factor of the label 1106 is 1.0. The first certainty factor of the label 1104 is assumed to be 0.8, obtained by the same method.

The second certainty factor is a certainty factor of label assignment itself by the identification unit 113. The certainty of label assignment can be calculated based on the distance from the boundary plane between labels (for example, the hyperplane 509 in FIG. 5) to the identification target vector (point 508 in FIG. 5). The farther the identification target vector is from the boundary plane, the more confident the label can be assigned, so the second certainty factor increases. Since this method is publicly known, details are omitted. It is assumed that the second certainty factor of label 1106 is 0.5 and the second certainty factor of label 1104 is 1.2.
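One common way to obtain such a distance for a linear boundary is the geometric distance |w·x + b| / ‖w‖. The Python sketch below assumes a dense weight vector and uses that formula; the function name and the sample numbers are hypothetical, and the specification does not state which distance measure is actually used.

```python
import math


def margin_certainty(weights, bias, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(w * v for w, v in zip(weights, x))
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(dot + bias) / norm


near = margin_certainty([3.0, 4.0], 0.0, [1.0, 1.0])
far = margin_certainty([3.0, 4.0], 0.0, [2.0, 2.0])
# far > near: the point farther from the boundary gets the larger certainty.
```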

The score integrating unit 114 calculates the final score of this label by integrating the above two types of certainty. There are several possible integration methods. In the second embodiment, the product of both (1.0 * 0.5 = 0.5 for the label 1106) is used as the score 1107 of the label 1106. Similarly, the score 1108 of the label 1104 is obtained. The label conversion unit 115 compares the score 1107 of the label 1106 in the IPO format with the score 1108 of the label 1104 in the IO format, and preferentially selects the larger one. In the case of the example shown in FIG. 11, I given in the IO format is selected.
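Using the certainty factors assumed in the text (1.0 and 0.5 for the label 1106, 0.8 and 1.2 for the label 1104), the product-based integration can be checked with a short Python sketch. The dictionary layout and function name are assumptions for illustration.

```python
def integrate(first_certainty, second_certainty):
    """Score integration by product, as used in the second embodiment."""
    return first_certainty * second_certainty


# (first certainty, second certainty) per label format for the token "air".
certainties = {
    "IPO": (1.0, 0.5),   # label 1106
    "IO": (0.8, 1.2),    # label 1104
}
scores = {fmt: integrate(s, c) for fmt, (s, c) in certainties.items()}
selected = max(scores, key=scores.get)
```

The IPO score is 0.5 and the IO score is approximately 0.96, so the IO format is selected, which agrees with the result stated for FIG. 11.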

Step S1102 is a step of converting the label 1104. When the label conversion unit 115 assigns a label to the next token “cylinder”, the label conversion unit 115 compares the label assignment result in the IPO format with the label assignment result in the IO format. Therefore, even if it is already decided to use the IO format for the label 1104, it is necessary to convert the label 1104 to the IPO format before labeling the token “cylinder”. This step is for that purpose. The label conversion unit 115 performs this step using the label conversion function described above. According to the label conversion functions 1006 and 1007 described with reference to FIG. 10, it can be understood that I in the IO format can be converted into I and P in the IPO format, each with a certainty factor of 0.5. Therefore, the label conversion unit 115 converts the label 1104 into labels 1109 and 1110, respectively. Since there are two conversion results, both labels 1109 and 1110 are held in this step.

Step S1103 is a step of labeling the token “cylinder”. Using the label 1104 and the two conversion results generated in step S1102, the label conversion unit 115 assigns labels 1111 to 1113 to “cylinder” in the two label formats in the same manner as in step S1101, and calculates their respective scores. In the example shown in FIG. 11, the label 1113 is finally selected because it has the maximum score.

The label conversion unit 115 may leave only one conversion candidate in the same label format and discard the other candidates for efficiency of calculation. In the case of the example shown in FIG. 11, the IPO format has labels 1113 and 1112 as two conversion candidates in step S1102. The label conversion unit 115 compares the scores of the labels 1113 and 1112 and discards the lower one (label 1112) while leaving the higher one (label 1113). Similarly, when there are two or more conversion candidates, only the conversion candidate with the maximum score is left, and the others are discarded. The identification unit 113 and the label conversion unit 115 assign labels by repeating the above steps.
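Holding several conversion results and then keeping only the best one can be sketched in Python as follows. The conversion certainties of 0.5 come from the description; the second certainty factors (0.6 and 1.4) and the function names are hypothetical values chosen only to show the pruning.

```python
def expand(conversion_row, label):
    """Return every (converted_label, conversion_certainty) with non-zero certainty."""
    return [(dst, p) for dst, p in conversion_row.get(label, {}).items() if p > 0.0]


def keep_best(scored_candidates):
    """Keep only the candidate with the maximum integrated score."""
    return max(scored_candidates, key=lambda candidate: candidate[1])


io_to_ipo = {"I": {"I": 0.5, "P": 0.5}, "O": {"O": 1.0}}

# Step S1102: IO-format I becomes two IPO-format candidates.
candidates = expand(io_to_ipo, "I")

# Step S1103: label the next token once per candidate and integrate the scores.
# Hypothetical second certainty of the next label, given each previous label.
second_certainty = {"I": 0.6, "P": 1.4}
scored = [(prev, first * second_certainty[prev]) for prev, first in candidates]
best_previous_label, best_score = keep_best(scored)
```

Pruning to one candidate per label format keeps the number of held label strings from growing with each converted token.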

<Embodiment 2: Summary>
As described above, the labeling apparatus 100 according to the second embodiment preferentially selects one of a plurality of labeling results having different label formats based on their certainty factors. Thereby, even when the labeling result in one label format is converted into the other, the accuracy of the label obtained by the conversion can be improved.

In addition, the labeling apparatus 100 according to the second embodiment learns the label conversion discriminator 108, which is used for mutually converting labeling results, from the training data 106. That is, since the training data 106 is used not only for learning the discriminator 107 but also for learning the label conversion discriminator 108, the training data 106 can be used efficiently.

<Embodiment 3>
In the second embodiment, conversion between two different label formats was described. In the third embodiment of the present invention, an operation example in which the method described in the second embodiment is extended to an arbitrary number of label formats will be described. Since the configuration of the labeling apparatus 100 is the same as in the first and second embodiments, the processing flow will be described below.

FIG. 12 is a process flowchart of the labeling apparatus 100 according to the third embodiment. Hereinafter, each step of FIG. 12 will be described.

(FIG. 12: Step S1201)
The user of the labeling apparatus 100 inputs the token string T and the identification function set M. This step corresponds to step S801 in FIG. 8. The identification function for the label format k is denoted by mk.

(FIG. 12: Step S1202)
For every label format k, the identification unit 113 initializes a label string Lk[], a first certainty factor string Sk[] (an array of first certainty factors representing the certainty of label conversion), and a second certainty factor string Ck[] (an array of second certainty factors representing the certainty of the label assignment itself). The indexes of these arrays correspond to the indexes of the token string. Each element of the first certainty factor string Sk[] is initialized to 1.0 (no conversion).

(FIG. 12: Step S1203)
The identification unit 113 performs the following steps S1204 to S1209 for each token tn (n = 1 to N).

(FIG. 12: Step S1204)
The identification unit 113 performs the following steps S1205 to S1206 for each label format k.

(FIG. 12: Step S1205)
The identification unit 113 assigns a label of label format k to the token tn using the identification function mk. The identification unit 113 records the assigned label (the label with the highest certainty of assignment) and that certainty. When label conversion to label format k was performed in the previous iteration (that is, for the previous token tn−1) and there are a plurality of conversion results, the same processing is performed for each conversion result, as described in the second embodiment.

(FIG. 12: Step S1206)
The score integration unit 114 integrates the scores calculated in step S1205. If there are a plurality of conversion results for the previous token tn−1, only the one with the maximum integrated score is left and the others are discarded.

(FIG. 12: Step S1207)
The label conversion unit 115 selects a label format having the maximum integrated score. Here, it is assumed that the label format p is selected.

(FIG. 12: Step S1208)
The label conversion unit 115 performs step S1209 for each label format k.

(FIG. 12: Step S1209)
The label conversion unit 115 converts the label from the label format p to another label format k. When there are a plurality of conversion results, each conversion result relating to the label format k is copied and stored. The label conversion unit 115 obtains the label of each conversion result and the conversion certainty factor using a corresponding label conversion function, and stores them in the label string Lk [] and the first certainty string Sk [], respectively.

(FIG. 12: Step S1210)
When the identification unit 113 (or label conversion unit 115) finishes assigning labels to all tokens, it outputs the label string Lk [].
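
The flow of steps S1201 to S1210 can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the implementation of the embodiment: the function and variable names are hypothetical, the signatures of the identification functions and label conversion functions are invented for the example, the integrated score is assumed to be the product of the assignment certainty for the current token and the conversion certainty of the previous token, and each conversion is assumed to yield a single result (the handling of multiple conversion candidates is omitted).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures (not defined in the specification):
# an identification function m_k takes (tokens, n, previous label) and returns
# (label, certainty of assignment); a label conversion function takes a label
# in format p and returns (label in format k, conversion certainty).
IdentifyFn = Callable[[List[str], int, str], Tuple[str, float]]
ConvertFn = Callable[[str], Tuple[str, float]]


def assign_labels(
    tokens: List[str],
    identify: Dict[str, IdentifyFn],
    convert: Dict[Tuple[str, str], ConvertFn],
) -> Dict[str, List[str]]:
    formats = list(identify)
    n_tokens = len(tokens)

    # S1202: initialize Lk[], Sk[] (1.0 means "no conversion") and Ck[].
    labels = {k: [""] * n_tokens for k in formats}
    conv_cert = {k: [1.0] * n_tokens for k in formats}
    assign_cert = {k: [0.0] * n_tokens for k in formats}

    for n in range(n_tokens):  # S1203: loop over tokens
        integrated: Dict[str, float] = {}
        for k in formats:  # S1204: loop over label formats
            prev_label = labels[k][n - 1] if n > 0 else ""
            prev_conv = conv_cert[k][n - 1] if n > 0 else 1.0
            # S1205: assign a label in format k and record its certainty.
            labels[k][n], assign_cert[k][n] = identify[k](tokens, n, prev_label)
            # S1206: integrate the scores (a product is assumed here).
            integrated[k] = assign_cert[k][n] * prev_conv

        # S1207: select the label format p with the maximum integrated score.
        best = max(formats, key=lambda k: integrated[k])

        for k in formats:  # S1208: loop over the other label formats
            if k != best:
                # S1209: convert the label from format p to format k.
                labels[k][n], conv_cert[k][n] = convert[(best, k)](labels[best][n])

    return labels  # S1210: output the label strings Lk[]
```

In this sketch the selected format keeps its conversion certainty of 1.0, while every other format receives the converted label together with the conversion certainty, which then lowers its integrated score for the next token.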

The present invention is not limited to the embodiments described above, and includes various modifications. The above embodiments have been described in detail for easy understanding of the present invention, and the invention is not necessarily limited to embodiments having all of the described configurations. A part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of a certain embodiment. Further, for a part of the configuration of each embodiment, another configuration can be added, deleted, or substituted.

For example, when labels are assigned to text in a language whose character strings are already divided into words, the token dividing unit 110 can be omitted. The above embodiments use the IO format and the IPO format as examples of label formats, and sequence labeling and classification as examples of learning methods. However, the method of the present invention can also be applied to discriminators trained with other label formats or learning methods.

The above components, functions, processing units, processing means, and the like may be realized in hardware by designing some or all of them as, for example, an integrated circuit. Each of the above configurations, functions, and the like may also be realized in software by having a processor interpret and execute a program that realizes each function. Information such as programs, tables, and files for realizing each function can be stored in a recording device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

DESCRIPTION OF SYMBOLS
101 CPU
102 Memory
103 Keyboard / mouse
104 Display
105 Secondary storage device
106 Training data
107 Classifier
108 Label conversion classifier
109 Control part
110 Token division part
111 Classifier learning part
112 Label conversion classifier learning part
113 Identification part
114 Score integration part
115 Label converter
116 Data communication unit
117 Network

Claims

1. A labeling apparatus for assigning labels to character string data, comprising:
first and second discriminators that store results of learning, using training data, how to assign labels; and
an identification unit that assigns labels to the character string data using the first and second discriminators,
wherein, when a first label is assigned to a token in the character string data using the first discriminator and a second label is assigned to the token using the second discriminator, the identification unit preferentially selects one of the first and second labels according to the type of each of the first and second labels.
2. The labeling apparatus according to claim 1, further comprising:
a label conversion discriminator that stores a result of learning, using the training data, the correspondence between the result of assigning labels to the training data using the first discriminator and the result of assigning labels to the training data using the second discriminator; and
a label conversion unit that uses the label conversion discriminator to mutually convert the result of labeling the character string data using the first discriminator and the result of labeling the character string data using the second discriminator.
3. The labeling apparatus according to claim 2, wherein
the identification unit obtains a first certainty factor representing the certainty of a label assigned to a first token in the character string data, and
the label conversion unit obtains, for each of the result of labeling using the first discriminator and the result of labeling using the second discriminator, an integrated certainty factor calculated from the first certainty factor of the first token and a second certainty factor representing the certainty of a label assigned to a second token immediately preceding the first token, and preferentially selects the labeling result having the larger integrated certainty factor.
4. The labeling apparatus according to claim 3, wherein
the label conversion discriminator holds a conversion certainty factor representing the certainty of the conversion, and
the label conversion unit
adopts the conversion certainty factor held by the label conversion discriminator as the second certainty factor when the label of the second token was assigned through conversion by the label conversion discriminator, and
adopts the maximum certainty factor as the second certainty factor when the label of the second token was not assigned through conversion by the label conversion discriminator.
5. The labeling apparatus according to claim 4, wherein
the label conversion discriminator holds one or more conversion candidates,
the label conversion unit, after adopting the labeling result having the larger integrated certainty factor, converts the label of the first token into each of the candidates and acquires the conversion certainty factor of each candidate,
the identification unit obtains, for each of the candidates and for the first token, a third certainty factor representing the certainty of a label assigned to a third token immediately following the first token, and
the label conversion unit obtains the integrated certainty factor calculated from the third certainty factor obtained for the first token and the certainty factor of the label assigned to the first token, obtains the integrated certainty factor calculated from the third certainty factor obtained for each candidate and the conversion certainty factor of that candidate, and adopts the labeling result for which the largest integrated certainty factor is obtained.
6. The labeling apparatus according to claim 5, wherein, after determining the label to be assigned to the third token, the label conversion unit retains, among the integrated certainty factors calculated from the third certainty factor obtained for each candidate and the conversion certainty factor of that candidate, only the candidate for which the largest one is obtained, and deletes the other candidates from the candidates for the labeling result.
7. The labeling apparatus according to claim 1, wherein the identification unit gives priority to either the part of speech of the token to which the first label is assigned or the part of speech of the token to which the second label is assigned, and preferentially selects, from the first and second labels, the label corresponding to the prioritized part of speech.
8. The labeling apparatus according to claim 1, wherein the first discriminator stores a result of performing the learning using one of sequence labeling and classification, and the second discriminator stores a result of performing the learning using the other.
9. The labeling apparatus according to claim 1, wherein the first discriminator stores a result of performing the learning using one of the IO format and the IPO format, and the second discriminator stores a result of performing the learning using the other.
10. A labeling method for assigning labels to character string data, comprising:
a step of assigning labels to the character string data using first and second discriminators that store results of learning, using training data, how to assign labels; and
a step of, when a first label is assigned to a token in the character string data using the first discriminator and a second label is assigned to the token using the second discriminator, preferentially selecting one of the first and second labels according to the type of each of the first and second labels.