WO2022044243A1 - Training device, inference device, methods therefor, and program - Google Patents

Training device, inference device, methods therefor, and program

Info

Publication number
WO2022044243A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain
sequence
series
information
term
Prior art date
Application number
PCT/JP2020/032505
Other languages
French (fr)
Japanese (ja)
Inventor
翔太 折橋
亮 増村
雅人 澤田
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/032505 (WO2022044243A1)
Priority to JP2022545187A (JP7517435B2)
Publication of WO2022044243A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Definitions

  • The present invention relates to labeling technology.
  • A technique of utterance sequence labeling has been proposed in which an utterance sequence is input and a label representing the response scene of a conversation or discourse is estimated for each utterance (for example, Non-Patent Document 1).
  • Non-Patent Document 1 discloses a deep neural network (hereinafter, a labeling network) that realizes utterance sequence labeling: the utterance text sequence obtained by speech recognition of a conversation between an operator and a customer in a contact center is input, and a label representing the corresponding scene (opening, grasping the matter, identity verification, response, or closing) is estimated for each utterance.
  • In Non-Patent Document 2, a method has been proposed for realizing unsupervised domain adaptation of labeling to a new domain by using labeled data (hereinafter, labeled teacher data) of a domain to which labeling has been applied in the past (hereinafter, the source domain) and unlabeled data (hereinafter, unlabeled teacher data) of a domain to which labeling is newly applied (hereinafter, the target domain).
  • The method of Non-Patent Document 2 uses the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to train a labeling model that estimates the label corresponding to a single image belonging to the target domain.
  • However, the method of Non-Patent Document 2 performs unsupervised domain adaptation for the simple classification problem of a single image; an unsupervised domain adaptation method has not been established for the more complex classification problem of estimating a label sequence corresponding to a sequence of information in consideration of the logical relationships within a sequence of a plurality of pieces of information.
  • The present invention has been made in view of this point, and its purpose is to perform unsupervised domain adaptation of a labeling model that estimates a label sequence corresponding to a sequence of information in consideration of the logical relationships within a sequence of a plurality of pieces of information.
  • The learning device uses a labeling model and a domain identification model. The labeling model includes a logical relationship understanding means that receives an input information sequence, which is a sequence of a plurality of pieces of information having logical relationships, obtains an intermediate feature sequence in consideration of the logical relationships of the input information sequence, and outputs the intermediate feature sequence, and a labeling means that receives a first sequence based on the intermediate feature sequence, obtains an estimated label sequence, which is an estimate of the label sequence corresponding to the input information sequence, and outputs the estimated label sequence. The domain identification model receives a second sequence based on the intermediate feature sequence, obtains estimated domain information, which is an estimate of the domain identification information indicating whether each piece of partial information included in the input information sequence belongs to the source domain or the target domain, and outputs a sequence of the estimated domain information.
  • Using labeled teacher data, which is a labeled learning information sequence belonging to the source domain, and unlabeled teacher data, which is an unlabeled learning information sequence belonging to the target domain, the learning device performs adversarial learning in which the labeling model is trained so that the estimation accuracy of the estimated label sequence becomes high and the estimation accuracy of the sequence of estimated domain information becomes low, and the domain identification model is trained so that the estimation accuracy of the sequence of estimated domain information becomes high; at least the parameters of the labeling model are thereby obtained and output.
  • FIG. 1 is a block diagram for illustrating the learning device of the first embodiment.
  • FIG. 2 is a block diagram for exemplifying the detailed configuration of the learning unit of the first embodiment.
  • FIG. 3 is a conceptual diagram for illustrating a network used for the learning process of the first embodiment.
  • FIG. 4 is a block diagram for exemplifying the inference device of the first embodiment.
  • FIG. 5 is a conceptual diagram for exemplifying the learned labeling network of the first embodiment.
  • FIG. 6 is a block diagram for illustrating the learning device of the second embodiment.
  • FIG. 7 is a conceptual diagram for exemplifying a network used for the learning process of the second embodiment.
  • FIG. 8 is a graph for exemplifying the experimental results.
  • FIG. 9 is a block diagram illustrating a hardware configuration of the learning device and the inference device of the embodiment.
  • In the embodiments, an example is described of unsupervised domain adaptation of a labeling model based on a deep neural network that receives an utterance text sequence (a sequence of a plurality of pieces of information having logical relationships) as input and outputs a sequence of labels representing the corresponding scene of each utterance (for example, opening, grasping the matter, identity verification, response, closing), that is, sequence labeling.
  • However, the present invention can also be used for unsupervised domain adaptation of a labeling model that estimates an arbitrary label sequence corresponding to a sequence of information in consideration of the logical relationships of any sequence of a plurality of pieces of information.
  • The logical relationships within the sequence of a plurality of pieces of information are not limited, and any relationship may exist between the pieces of information. Examples of logical relationships include contexts, word dependency relationships, grammatical relationships of a language, and relationships between frames of audio or video, but these do not limit the present invention.
  • The labeling model is not limited to a model based on a deep neural network; any model, such as a probabilistic model or a classifier, may be used as long as it estimates and outputs a label sequence corresponding to a sequence of input information.
  • The learning device 11 of the first embodiment has a learning unit 11a and storage units 11b and 11c, receives labeled teacher data of the source domain and unlabeled teacher data of the target domain as input, and obtains and outputs the parameters (model parameters) of the labeling network of the target domain by learning.
  • The source domain is referred to as "SD", the target domain is referred to as "TD", and the network is referred to as "NW".
  • The labeled teacher data of the source domain exemplified here is a labeled learning information sequence belonging to the source domain, and includes an utterance text sequence of the source domain (an input information sequence that is a sequence of a plurality of pieces of information having logical relationships) and the correct label sequence corresponding to that utterance text sequence.
  • The unlabeled teacher data of the target domain exemplified here is an unlabeled learning information sequence belonging to the target domain, and includes an utterance text sequence of the target domain but does not include a correct label sequence.
  • A learning schedule, which specifies, for example, the loss combination ratio described later, may be input to the learning device 11, and the learning device 11 may perform the learning process according to the learning schedule.
  • The learning device 11 may also output the parameters of the domain identification network used to realize the unsupervised domain adaptation.
  • The learning unit 11a has, for example, a control unit 11aa, a loss function calculation unit 11ab, a gradient inversion unit 11ac, and a parameter update unit 11ad. The learning unit 11a stores each piece of data obtained during processing in the storage units 11b and 11c or in a temporary memory (not shown), and reads the data as necessary for each process.
  • FIG. 3 shows a configuration example of the network 100 used by the learning device 11 in the learning process.
  • The network 100 exemplified in FIG. 3 has a labeling network 150 (labeling model) and a domain identification model 130.
  • The labeling network 150 illustrated in FIG. 3 receives the utterance text sequence T 1 , ..., T N (an input information sequence that is a sequence of a plurality of pieces of information having logical relationships), obtains the estimated label sequence L 1 , ..., L N , which is an estimate of the label sequence corresponding to T 1 , ..., T N , and outputs the estimated label sequence.
  • The utterance text sequence T 1 , ..., T N is a sequence of N utterance texts T n , where n = 1, ..., N is an index corresponding to time and N is an integer of 1 or more (generally an integer of 2 or more).
  • The utterance text T n may be a single word such as "I'm sorry" or "Yes", or may be a sentence containing M(n) words T n,1 , ..., T n,M(n) , such as "I'm in trouble because the reply speed is slow", where M(n) is an integer of 2 or more.
  • The estimated label L n in this example corresponds to the utterance text T n and represents, for example, the corresponding scene of the utterance text T n (for example, opening, grasping the matter, identity verification, response, closing).
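As a purely illustrative aid (not part of the patent text), the teacher data described above could be represented in Python as follows; the variable names, the example words, and the example labels are hypothetical.

```python
# Hypothetical representation of the teacher data described above.
# Each utterance text T_n is a list of words; each labeled example pairs the
# utterance text sequence T_1, ..., T_N with a correct label sequence L_1, ..., L_N.
labeled_teacher_data_source_domain = [
    {
        "utterance_texts": [
            ["thank", "you", "for", "calling"],          # T_1
            ["my", "reply", "speed", "is", "slow"],      # T_2
            ["let", "me", "check", "your", "account"],   # T_3
        ],
        "labels": ["opening", "grasping the matter", "identity verification"],  # L_1..L_3
    },
]

# Unlabeled teacher data of the target domain: utterance text sequences only.
unlabeled_teacher_data_target_domain = [
    {
        "utterance_texts": [
            ["hello", "this", "is", "support"],
            ["the", "app", "keeps", "crashing"],
        ],
    },
]
```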
  • The labeling network 150 exemplified here has a logical relationship understanding layer 110 (logical relationship understanding means) and a labeling layer 120 (labeling means).
  • The logical relationship understanding layer 110 receives the utterance text sequence T 1 , ..., T N , obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context (logical relationships) of the utterance text sequence T 1 , ..., T N , and outputs the intermediate feature sequence LF 1 , ..., LF N . Each intermediate feature LF n corresponds to the utterance text T n .
  • The logical relationship understanding layer 110 exemplified in FIG. 3 includes short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means) and a long-term context understanding network 112 (long-term logical relationship understanding means).
  • The short-term context understanding networks 111-1, ..., 111-N exemplified here receive the utterance text sequence T 1 , ..., T N and output a sequence SF 1 , ..., SF N of short-term intermediate features, where each short-term intermediate feature SF n corresponds to the utterance text T n .
  • Each short-term context understanding network 111-n may be divided into a plurality of short-term context understanding networks 111-n1, ..., 111-nK' (where K' is an integer of 2 or more). Each short-term context understanding network 111-nk' represents the network from the input layer to the k'-th layer of the short-term context understanding network, and the short-term intermediate feature SF nk' of the k'-th layer is output from each short-term context understanding network 111-nk'.
  • The long-term context understanding network 112 exemplified here receives the sequence SF 1 , ..., SF N of short-term intermediate features (the short-term intermediate feature sequence), obtains the intermediate feature sequence LF 1 , ..., LF N (a long-term intermediate feature sequence considering the logical relationships between the plurality of pieces of partial information included in the input information sequence) in consideration of the context between the plurality of utterance texts T n included in the utterance text sequence T 1 , ..., T N (for example, the long-term context of a sentence unit or a long-term context spanning a plurality of sentences), and outputs the intermediate feature sequence LF 1 , ..., LF N .
  • However, this does not limit the present invention.
  • When the short-term intermediate feature SF nk' of the k'-th layer is output from each short-term context understanding network 111-nk' as described above, SF nK' is input to the long-term context understanding network 112 as SF n ; alternatively, a plurality of the features SF n1 , ..., SF nK' may be input.
  • Similarly, the long-term context understanding network 112 may be divided into a plurality of long-term context understanding networks 112-1, ..., 112-K (where K is an integer of 2 or more and k = 1, ..., K is an index representing the layers of the long-term context understanding network). Each long-term context understanding network 112-k represents the network from the input layer to the k-th layer of the long-term context understanding network.
  • Each long-term context understanding network 112-k may receive the sequence of the plurality of short-term intermediate features SF n , obtain intermediate features considering the context between the plurality of utterance texts T n corresponding to the received sequence, and output them. In this case, the long-term context understanding networks 112-1, ..., 112-K output K intermediate feature sequences {LF 11 , ..., LF N1 }, ..., {LF 1K , ..., LF NK }.
  • The short-term context understanding network 111-n can be configured by, for example, a combination of an embedding layer that converts words into numerical values, a unidirectional LSTM (long short-term memory) or bidirectional LSTM, and an attention mechanism (see, for example, Non-Patent Document 1).
  • The long-term context understanding network 112 can be configured by, for example, a unidirectional LSTM or a bidirectional LSTM.
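A minimal PyTorch sketch of such a logical relationship understanding layer is given below, assuming a word-embedding plus bidirectional-LSTM short-term context understanding network and a unidirectional-LSTM long-term context understanding network; the layer sizes, the parameter sharing across n, and the mean pooling used in place of an attention mechanism are illustrative assumptions, not the configuration prescribed by the patent.

```python
import torch
import torch.nn as nn

class ShortTermContextNetwork(nn.Module):
    """Short-term context understanding network 111-n (sketch): encodes one utterance text
    T_n, given as a sequence of word indices, into a short-term intermediate feature SF_n."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # embedding layer converting words into numerical values
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, word_ids):                  # word_ids: (batch, M(n))
        emb = self.embed(word_ids)                # (batch, M(n), embed_dim)
        out, _ = self.lstm(emb)                   # (batch, M(n), 2 * hidden_dim)
        return out.mean(dim=1)                    # mean pooling stands in for the attention mechanism

class LongTermContextNetwork(nn.Module):
    """Long-term context understanding network 112 (sketch): a unidirectional LSTM over the
    sequence SF_1..SF_N, producing the intermediate feature sequence LF_1..LF_N."""
    def __init__(self, input_dim=256, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, sf_sequence):               # sf_sequence: (batch, N, input_dim)
        lf_sequence, _ = self.lstm(sf_sequence)   # (batch, N, hidden_dim)
        return lf_sequence

class LogicalRelationshipUnderstandingLayer(nn.Module):
    """Logical relationship understanding layer 110 (sketch): a short-term network applied to
    each utterance, followed by the long-term network over the whole conversation."""
    def __init__(self, vocab_size):
        super().__init__()
        self.short_term = ShortTermContextNetwork(vocab_size)  # shared across n (simplifying assumption)
        self.long_term = LongTermContextNetwork()

    def forward(self, utterances):                # utterances: list of N tensors, each (batch, M(n))
        sf = torch.stack([self.short_term(t) for t in utterances], dim=1)  # (batch, N, 256)
        return self.long_term(sf), sf             # LF_1..LF_N and SF_1..SF_N
```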
  • The labeling layer 120 receives the intermediate feature sequence LF 1 , ..., LF N (the first sequence based on the intermediate feature sequence), obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs the estimated label sequence L 1 , ..., L N .
  • The labeling layer 120 illustrated in FIG. 3 includes label prediction networks 120-1, ..., 120-N. Each label prediction network 120-n receives the intermediate feature LF n , obtains the estimated label L n corresponding to the utterance text T n , and outputs the estimated label L n .
  • The label prediction network 120-n can be configured by, for example, a fully connected neural network having a softmax function as an activation function.
  • If K intermediate feature sequences {LF 11 , ..., LF N1 }, ..., {LF 1K , ..., LF NK } are output from the long-term context understanding networks 112-1, ..., 112-K, the label prediction network 120-n may receive LF nK as the intermediate feature LF n , obtain the estimated label L n corresponding to the utterance text T n , and output the estimated label L n . Alternatively, the label prediction network 120-n may receive a plurality of intermediate features LF nk from among the intermediate feature sequences {LF 11 , ..., LF N1 }, ..., {LF 1K , ..., LF NK }, obtain the estimated label L n corresponding to the utterance text T n , and output the estimated label L n .
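Continuing the same sketch, the labeling layer 120 can be written as one fully connected layer with a softmax activation applied at each position; sharing the layer across the label prediction networks 120-1, ..., 120-N and the output size are assumptions.

```python
import torch.nn as nn

class LabelingLayer(nn.Module):
    """Labeling layer 120 (sketch): a fully connected layer with a softmax activation applied
    at every position n of the intermediate feature sequence."""
    def __init__(self, feature_dim=256, num_labels=5):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_labels)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, lf_sequence):                # lf_sequence: (batch, N, feature_dim)
        return self.softmax(self.fc(lf_sequence))  # label posteriors for L_1..L_N: (batch, N, num_labels)
```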
  • The domain identification model 130 illustrated in FIG. 3 receives the intermediate feature sequence LF 1 , ..., LF N (the second sequence based on the intermediate feature sequence), obtains the estimated domain information D n of the domain identification information indicating whether each utterance text T n included in the utterance text sequence T 1 , ..., T N (each piece of partial information included in the input information sequence) belongs to the source domain or the target domain, and outputs the sequence D 1 , ..., D N of the estimated domain information.
  • The domain identification model 130 exemplified here includes N domain identification networks 130-1, ..., 130-N. However, this does not limit the present invention.
  • A plurality of domain identification networks 130-nk may exist in place of each domain identification network 130-n, where each domain identification network 130-nk represents a network corresponding to each n (for example, each time). Each domain identification network 130-nk may receive the intermediate feature LF nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and use it to obtain and output the estimated domain information D nk .
  • The domain identification network 130-n (or domain identification network 130-nk) can be configured by, for example, a fully connected neural network having a softmax function as an activation function.
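Under the same illustrative assumptions, the domain identification model 130 can be sketched as a per-position two-class softmax classifier over the intermediate features; parameter sharing across the networks 130-1, ..., 130-N is again an assumption.

```python
import torch.nn as nn

class DomainIdentificationModel(nn.Module):
    """Domain identification model 130 (sketch): for each position n, a fully connected layer
    with a softmax activation estimates whether the utterance text T_n belongs to the source
    domain or the target domain."""
    def __init__(self, feature_dim=256, num_domains=2):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_domains)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, lf_sequence):                # lf_sequence: (batch, N, feature_dim)
        return self.softmax(self.fc(lf_sequence))  # domain posteriors D_1..D_N: (batch, N, num_domains)
```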
  • The labeled teacher data of the source domain (the labeled learning information sequence belonging to the source domain) and the unlabeled teacher data of the target domain (the unlabeled learning information sequence belonging to the target domain) are input to the learning unit 11a of the learning device 11.
  • The learning unit 11a applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to the above-mentioned network 100 as the utterance text sequence T 1 , ..., T N , and performs adversarial learning in which the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the estimated domain information D 1 , ..., D N becomes low, and the domain identification model 130 is trained so that the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes high.
  • That is, adversarial learning is performed between the labeling network 150 and the domain identification model 130 based on a loss function (hereinafter, the label prediction loss) representing the error between the estimated label sequence L 1 , ..., L N output from the labeling network 150 when the above-mentioned teacher data is input to the network 100 as the utterance text sequence T 1 , ..., T N and the corresponding correct label sequence of the labeled teacher data of the source domain, and a loss function (hereinafter, the domain identification loss) representing the error between the sequence D 1 , ..., D N of estimated domain information output from the domain identification model 130 and the correct labels of the domain identification information, which are determined by whether each input came from the labeled teacher data of the source domain or the unlabeled teacher data of the target domain.
  • The estimated label sequence L 1 , ..., L N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 100 is not used for calculating the label prediction loss.
  • The learning unit 11a performs this adversarial learning by using, for example, the error backpropagation method.
  • Learning is performed so that the label prediction loss becomes small, so that the estimation accuracy of the estimated label sequence L 1 , ..., L N in the labeling network 150 becomes high. The gradient inversion layer 141-n inverts the gradient only when the error is backpropagated, and learning is performed so that the domain identification loss becomes small; as a result, the domain identification model 130 is trained so that the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes high, while the logical relationship understanding layer 110 is trained to obtain an intermediate feature sequence LF 1 , ..., LF N that lowers the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information. In this way, adversarial learning can be performed.
  • As a result, the labeling network 150 can acquire an intermediate feature sequence LF 1 , ..., LF N that is effective for label prediction while suppressing dependence on the domain, and unsupervised domain adaptation can be realized.
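The gradient inversion layers can be implemented in PyTorch with a custom autograd function, as is commonly done in adversarial domain adaptation; the following is a standard sketch rather than the patent's specific implementation, and the scaling coefficient `coeff` is an assumed extra knob.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Gradient inversion (reversal): identity in the forward pass, multiplies the gradient
    by -coeff in the backward pass, as done by the gradient inversion layers 141-n."""
    @staticmethod
    def forward(ctx, x, coeff=1.0):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient flows back negated: minimizing the domain identification loss trains the
        # domain identifier while pushing the preceding layers to make the domains indistinguishable.
        return -ctx.coeff * grad_output, None

def grad_reverse(x, coeff=1.0):
    return GradientReversal.apply(x, coeff)
```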
  • This learning process can be realized by optimizing (minimizing) the loss function that linearly combines the label prediction loss and the domain identification loss.
  • The combination ratio of the linear combination of the label prediction loss and the domain identification loss may be predetermined, or may be specified by the learning schedule input to the learning unit 11a.
  • The learning unit 11a may perform the above-mentioned learning while changing the combination ratio of the label prediction loss and the domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 11a may learn using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the ratio of the domain identification loss in the loss function as the number of learning steps increases. Alternatively, while changing the combination ratio based on the learning schedule, the learning unit 11a may repeat a procedure in which learning is performed at a constant combination ratio until the learning converges, the combination ratio is then changed, and learning is performed again until it converges.
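One possible realization of the linearly combined loss and of a schedule for its combination ratio is sketched below; the linear ramp is only an assumed example of such a schedule.

```python
def combined_loss(label_prediction_loss, domain_identification_loss, step, total_steps):
    """Linearly combines the two losses; the weight of the domain identification loss starts at
    zero early in learning and grows with the number of learning steps (assumed linear ramp)."""
    progress = min(step / float(total_steps), 1.0)
    domain_weight = progress
    return label_prediction_loss + domain_weight * domain_identification_loss
```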
  • The learning unit 11a may prepare a plurality of various domain identification models 130 and/or labeling networks 150 as exemplified above, perform learning with each, and afterwards select, from among the labeling networks 150 obtained by the respective learnings, the labeling network 150 that gives the best estimation accuracy of the label sequence in the target domain.
  • The learning process may be batch learning, mini-batch learning, or online learning.
  • The learning unit 11a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, and stores the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N) in the storage unit 11c.
  • The learning device 11 outputs the parameters of the labeling network 150 stored in the storage unit 11b. The parameters of the labeling network 150 are used in the inference process described later. Normally, the parameters of the domain identification model 130 are not used for the inference process and therefore do not have to be output from the learning device 11. However, the learning device 11 may output at least some of the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N).
  • Step S11: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain, and initializes the parameters of the network 100.
  • Step S12: The loss function calculation unit 11ab applies the teacher data to the network 100 as the utterance text sequence and calculates the loss function that linearly combines the label prediction loss and the domain identification loss.
  • Step S13: The parameter update unit 11ad backpropagates the information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 130 and the labeling layer 120.
  • Step S14: The gradient inversion unit 11ac inverts the gradient of the information based on the loss function backpropagated from the domain identification model 130 and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
  • Step S15: The parameter update unit 11ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information according to the error backpropagation method.
  • Step S16: The control unit 11aa determines whether or not an end condition (for example, a condition that the number of parameter updates has reached a predetermined number) is satisfied. If the end condition is not satisfied, the control unit 11aa returns the process to step S12. If the end condition is satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least some of the parameters of the domain identification networks 130-1, ..., 130-N.
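Putting the sketches above together, one learning step corresponding roughly to steps S12 through S15 might look as follows; the optimizer, the way the batch mixes source-domain and target-domain sequences, and the reuse of the `grad_reverse` helper from the earlier sketch are assumptions, not the patent's prescribed procedure.

```python
import torch
import torch.nn.functional as F

def training_step(logical_layer, labeling_layer, domain_model, optimizer,
                  utterances, label_targets, domain_targets, source_mask, domain_weight):
    """One update of network 100. The batch mixes source-domain and target-domain sequences;
    `source_mask` marks the source-domain ones, since only they have correct labels and
    contribute to the label prediction loss."""
    lf_sequence, _ = logical_layer(utterances)                # intermediate features LF_1..LF_N

    # Label prediction loss on the source-domain part of the batch only.
    label_probs = labeling_layer(lf_sequence)                 # softmax outputs, (batch, N, num_labels)
    label_loss = F.nll_loss(
        torch.log(label_probs[source_mask] + 1e-12).flatten(0, 1),
        label_targets[source_mask].flatten())

    # Domain identification loss on every sequence; the gradient reaching the logical
    # relationship understanding layer is inverted (gradient inversion layers 141-n).
    domain_probs = domain_model(grad_reverse(lf_sequence))    # (batch, N, num_domains)
    domain_loss = F.nll_loss(torch.log(domain_probs + 1e-12).flatten(0, 1),
                             domain_targets.flatten())

    loss = label_loss + domain_weight * domain_loss           # linearly combined loss function
    optimizer.zero_grad()
    loss.backward()                                           # error backpropagation (steps S13 to S15)
    optimizer.step()
    return loss.item()
```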
  • The inference device 13 of the first embodiment has an inference unit 13a and a storage unit 13b. The parameters of the labeling network 150 obtained as described above are stored in the storage unit 13b.
  • An utterance text sequence (input information sequence) for inference is input to the inference unit 13a. The inference unit 13a applies the utterance text sequence for inference to the labeling network 150 (labeling model) specified by the parameters stored in the storage unit 13b, obtains the estimated label sequence corresponding to the utterance text sequence for inference, and outputs the estimated label sequence.
  • That is, the inference unit 13a inputs the utterance text sequence T 1 , ..., T N for inference to the logical relationship understanding layer 110 and obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context of the utterance text sequence T 1 , ..., T N for inference. Specifically, the inference unit 13a inputs the utterance texts T 1 , ..., T N for inference to the short-term context understanding networks 111-1, ..., 111-N, respectively, obtains the sequence SF 1 , ..., SF N of short-term intermediate features, inputs the short-term intermediate feature sequence SF 1 , ..., SF N to the long-term context understanding network 112, and obtains the intermediate feature sequence LF 1 , ..., LF N .
  • Furthermore, the inference unit 13a inputs the intermediate feature sequence LF 1 , ..., LF N to the labeling layer 120, obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs it.
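An inference pass with the trained labeling network, under the same illustrative assumptions, might look like this; the domain identification model is not used.

```python
import torch

@torch.no_grad()
def infer_labels(logical_layer, labeling_layer, utterances):
    """Applies the learned labeling network 150 to an utterance text sequence for inference
    and returns label indices for the estimated label sequence L_1..L_N."""
    lf_sequence, _ = logical_layer(utterances)    # intermediate feature sequence LF_1..LF_N
    label_probs = labeling_layer(lf_sequence)     # (batch, N, num_labels)
    return label_probs.argmax(dim=-1)             # estimated label index for each utterance
```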
  • As described above, the first embodiment uses the labeling network 150, which includes the logical relationship understanding layer 110 that receives the utterance text sequence T 1 , ..., T N , obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context of the utterance text sequence T 1 , ..., T N , and outputs the intermediate feature sequence LF 1 , ..., LF N , and the labeling layer 120 that receives the intermediate feature sequence LF 1 , ..., LF N , obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs the estimated label sequence L 1 , ..., L N ; and the domain identification model 130 that receives the intermediate feature sequence LF 1 , ..., LF N , obtains the estimated domain information D n of the domain identification information indicating whether each utterance text T n included in the utterance text sequence T 1 , ..., T N belongs to the source domain or the target domain, and outputs the sequence D 1 , ..., D N of the estimated domain information.
  • The learning device 11 applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to these models as the utterance text sequence T 1 , ..., T N , and performs adversarial learning in which the labeling network 150 is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes low, and the domain identification model 130 is trained so that the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes high.
  • As a result, the domain dependence of the labeling network 150 is reduced, and unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N in consideration of the context of the utterance text sequence T 1 , ..., T N , becomes possible.
  • In the second embodiment, the domain is identified both from the context between a plurality of utterance texts T n (the long-term intermediate feature sequence LF 1 , ..., LF N considering the logical relationships between the plurality of pieces of partial information included in the input information sequence) and from each utterance text T n alone (the short-term intermediate feature SF n considering the logical relationships of the information within each piece of partial information).
  • The learning device 21 of the second embodiment has a learning unit 21a and storage units 11b, 21c, and 21d, receives the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as input, and obtains and outputs the parameters (model parameters) of the labeling network of the target domain by learning. The learning schedule may be input to the learning device 21, and the learning device 21 may perform the learning process according to the learning schedule. The learning device 21 may also output the parameters of the domain identification networks used to realize the unsupervised domain adaptation.
  • The learning unit 21a has, for example, a control unit 11aa, a loss function calculation unit 21ab, a gradient inversion unit 11ac, and a parameter update unit 21ad. The learning unit 21a stores each piece of data obtained during processing in the storage units 11b, 21c, and 21d or in a temporary memory (not shown), and reads the data as necessary for each process.
  • FIG. 7 shows a configuration example of the network 200 used by the learning device 21 in the learning process.
  • The network 200 illustrated in FIG. 7 has a labeling network 150 (labeling model) and a domain identification model 230. Since the labeling network 150 is the same as that of the first embodiment, its description is omitted, and the domain identification model 230 is described below.
  • The domain identification model 230 exemplified in FIG. 7 includes short-term context domain identification networks 231-1, ..., 231-N (short-term logical relationship domain identification means) and a long-term context domain identification network 232 (long-term logical relationship domain identification means).
  • The long-term context domain identification network 232 receives the intermediate feature sequence LF 1 , ..., LF N (the long-term intermediate feature sequence) output from the long-term context understanding network 112, and obtains and outputs the sequence LD 1 , ..., LD N of estimated domain information.
  • The long-term context domain identification network 232 exemplified in FIG. 7 captures the input short-term intermediate feature sequence SF 1 , ..., SF N continuously in the time direction, and is intended to remove, from the labeling network 150, the domain dependence of the context (logical relationships) spanning a plurality of utterance texts T n , which are words and sentences. However, this does not limit the present invention.
  • A plurality of long-term context domain identification networks 232-1, ..., 232-K (where K is an integer of 2 or more and k = 1, ..., K is an index representing the layers of the long-term context understanding network) may exist.
  • The long-term context domain identification network 232 can be configured by, for example, a combination of a unidirectional LSTM or a bidirectional LSTM and a fully connected neural network having a softmax function as an activation function.
  • The short-term context domain identification networks 231-1, ..., 231-N receive the sequence SF 1 , ..., SF N of short-term intermediate features (a second sequence based on the intermediate feature sequence; the short-term intermediate feature sequence) output from the short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means), obtain the estimated domain information SD n of the domain identification information indicating whether each utterance text T n belongs to the source domain or the target domain, and output the estimated domain information SD n .
  • By estimating, for each short-term intermediate feature SF n , whether the utterance text T n belongs to the source domain or the target domain, the short-term context domain identification network 231-n is intended to efficiently remove the domain dependence of the utterance text T n alone, such as a specific word or document, from the labeling network 150. However, this does not limit the present invention.
  • In place of each short-term context domain identification network 231-n, a plurality of short-term context domain identification networks 231-n1, ..., 231-nK' (where K' is an integer of 2 or more) may exist, and each short-term context domain identification network 231-nk' represents the network corresponding to each n (for example, each time).
  • The short-term context domain identification network 231-n can be configured by, for example, a fully connected neural network having a softmax function as an activation function.
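A sketch of the domain identification model 230 under the same illustrative assumptions follows: a per-utterance softmax classifier over each SF n and an LSTM followed by a fully connected softmax layer over LF 1 , ..., LF N ; the parameter sharing across n and the layer sizes are again assumptions.

```python
import torch.nn as nn

class ShortTermContextDomainIdentifier(nn.Module):
    """Short-term context domain identification networks 231-1..231-N (sketch): a fully
    connected softmax classifier applied to each short-term intermediate feature SF_n."""
    def __init__(self, feature_dim=256, num_domains=2):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_domains)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, sf_sequence):                # sf_sequence: (batch, N, feature_dim)
        return self.softmax(self.fc(sf_sequence))  # estimated domain information SD_1..SD_N

class LongTermContextDomainIdentifier(nn.Module):
    """Long-term context domain identification network 232 (sketch): a unidirectional LSTM over
    LF_1..LF_N followed by a fully connected softmax layer, capturing the domain dependence of
    the context spanning a plurality of utterance texts."""
    def __init__(self, feature_dim=256, hidden_dim=128, num_domains=2):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_domains)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, lf_sequence):                # lf_sequence: (batch, N, feature_dim)
        out, _ = self.lstm(lf_sequence)
        return self.softmax(self.fc(out))          # estimated domain information LD_1..LD_N
```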
  • The labeled teacher data of the source domain (the labeled learning information sequence belonging to the source domain) and the unlabeled teacher data of the target domain (the unlabeled learning information sequence belonging to the target domain) are input to the learning unit 21a of the learning device 21.
  • The learning unit 21a applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to the above-mentioned network 200 as the utterance text sequence T 1 , ..., T N , and performs adversarial learning in which the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the estimated domain information sequences LD 1 , ..., LD N and SD 1 , ..., SD N becomes low, and the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences LD 1 , ..., LD N and SD 1 , ..., SD N becomes high.
  • That is, adversarial learning is performed between the labeling network 150 and the domain identification model 230 based on a loss function (hereinafter, the label prediction loss) representing the error between the estimated label sequence L 1 , ..., L N output from the labeling network 150 when the above-mentioned teacher data is input to the network 200 as the utterance text sequence T 1 , ..., T N and the corresponding correct label sequence of the labeled teacher data of the source domain, a loss function (hereinafter, the long-term context domain identification loss) representing the error between the sequence LD 1 , ..., LD N of estimated domain information output from the long-term context domain identification network 232 and the correct labels of the domain identification information, which are determined by whether each input came from the labeled teacher data of the source domain or the unlabeled teacher data of the target domain, and a loss function (hereinafter, the short-term context domain identification loss) representing the error between the estimated domain information SD 1 , ..., SD N output from the short-term context domain identification networks 231-1, ..., 231-N and the correct labels of the domain identification information.
  • The estimated label sequence L 1 , ..., L N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 200 is not used for calculating the label prediction loss.
  • The learning unit 21a performs this adversarial learning by using, for example, the error backpropagation method.
  • The gradient inversion layer 242-n inverts the gradient only when the error is backpropagated, and learning is performed so that the long-term context domain identification loss becomes small; as a result, the long-term context domain identification network 232 is trained so that the estimation accuracy of its estimated domain information becomes high, while the logical relationship understanding layer 110 is trained to obtain an intermediate feature sequence LF 1 , ..., LF N that lowers the estimation accuracy of that estimated domain information. Adversarial learning between the logical relationship understanding layer 110 and the long-term context domain identification network 232 can thereby be performed.
  • Similarly, the gradient inversion layer 241-n inverts the gradient only when the error is backpropagated, and learning is performed so that the short-term context domain identification loss becomes small; as a result, the short-term context domain identification networks 231-1, ..., 231-N are trained so that the estimation accuracy of their estimated domain information becomes high, while the short-term context understanding networks 111-1, ..., 111-N are trained to obtain a short-term intermediate feature sequence SF 1 , ..., SF N that lowers that estimation accuracy. Adversarial learning between the short-term context understanding networks 111-1, ..., 111-N and the short-term context domain identification networks 231-1, ..., 231-N can thereby be performed.
  • Through these adversarial learnings, the estimated label sequence L 1 , ..., L N can be estimated accurately while the domain cannot be estimated by the domain identification model 230. As a result, the labeling network 150 can acquire an intermediate feature sequence LF 1 , ..., LF N that is effective for label prediction while suppressing the domain dependence of the context spanning a plurality of utterance texts T n , and can acquire a short-term intermediate feature sequence SF 1 , ..., SF N that is effective for label prediction while suppressing domain dependence in units of individual utterance texts T n , realizing unsupervised domain adaptation with higher accuracy.
  • This learning process can be realized by optimizing (minimizing) the loss function that linearly combines the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss.
  • The combination ratios of the linear combination of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss may be predetermined, or may be specified by the learning schedule input to the learning unit 21a.
  • The learning unit 21a may perform the above-mentioned learning while changing the combination ratios of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 21a may learn using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the ratios of the long-term context domain identification loss and the short-term context domain identification loss in the loss function as the number of learning steps increases.
  • Alternatively, while changing the combination ratios based on the learning schedule, the learning unit 21a may repeat a procedure in which learning is performed at constant combination ratios until the learning converges, the combination ratios are then changed, and learning is performed again until it converges.
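The three-way linear combination and its schedule might be sketched as follows; as before, the ramp-up shape and the use of a single shared ramp for both domain-loss weights are assumptions.

```python
def combined_loss_network_200(label_loss, long_term_domain_loss, short_term_domain_loss,
                              step, total_steps):
    """Linear combination of the label prediction loss, the long-term context domain
    identification loss, and the short-term context domain identification loss; both
    domain-loss weights grow with the number of learning steps (assumed linear ramp)."""
    progress = min(step / float(total_steps), 1.0)
    return label_loss + progress * long_term_domain_loss + progress * short_term_domain_loss
```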
  • The domain identification model 230 may have only one of the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N.
  • When the domain identification model 230 has only the long-term context domain identification network 232, the gradient inversion layers 241-1, ..., 241-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the long-term context domain identification loss. The combination ratio of this linear combination may be predetermined, or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may perform the above-mentioned learning while changing the combination ratio of the label prediction loss and the long-term context domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 21a may learn using only the label prediction loss as the loss function in the early stage of learning and gradually increase the ratio of the long-term context domain identification loss in the loss function as the number of learning steps increases, or may repeat a procedure in which learning is performed at a constant combination ratio until the learning converges, the combination ratio is then changed, and learning is performed again until it converges.
  • When the domain identification model 230 has only the short-term context domain identification networks 231-1, ..., 231-N, the gradient inversion layers 242-1, ..., 242-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the short-term context domain identification loss. The combination ratio of this linear combination may be predetermined, or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may perform the above-mentioned learning while changing the combination ratio of the label prediction loss and the short-term context domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 21a may learn using only the label prediction loss as the loss function in the early stage of learning and gradually increase the ratio of the short-term context domain identification loss in the loss function as the number of learning steps increases, or may repeat a procedure in which learning is performed at a constant combination ratio until the learning converges, the combination ratio is then changed, and learning is performed again until it converges.
  • The domain identification model 230 may also have the long-term context domain identification network 232 and only a part of the short-term context domain identification networks 231-1, ..., 231-N. That is, some of the short-term context domain identification networks 231-1, ..., 231-N may be omitted. In this case, the estimated domain information SD n corresponding to each omitted short-term context domain identification network 231-n and the corresponding correct labels of the domain identification information are not used in the calculation of the short-term context domain identification loss.
  • The learning unit 21a may prepare a plurality of various domain identification models 230 and/or labeling networks 150 as exemplified above, perform learning with each, and afterwards select, from among the labeling networks 150 obtained by the respective learnings, the labeling network 150 that gives the best estimation accuracy of the label sequence in the target domain.
  • The plurality of prepared domain identification models 230 include, for example, at least one of a domain identification model 230 including the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N, a domain identification model 230 having only the long-term context domain identification network 232, a domain identification model 230 having only the short-term context domain identification networks 231-1, ..., 231-N, and the domain identification model 130 of the first embodiment.
  • The learning process may be batch learning, mini-batch learning, or online learning.
  • The learning unit 21a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, stores the parameters of the short-term context domain identification networks 231-1, ..., 231-N in the storage unit 21c, and stores the parameters of the long-term context domain identification network 232 in the storage unit 21d.
  • The learning device 21 outputs the parameters of the labeling network 150 stored in the storage unit 11b. The parameters of the labeling network 150 are used for the inference process. The parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232 are normally not used for the inference process, and therefore need not be output from the learning device 21. However, the learning device 21 may output at least some of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
  • Step S21: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain, and initializes the parameters of the network 200.
  • Step S22: The loss function calculation unit 21ab applies the teacher data to the network 200 as the utterance text sequence and calculates the loss function that linearly combines the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss.
  • Step S23: The parameter update unit 21ad backpropagates the information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 230 and the labeling layer 120.
  • Step S24: The gradient inversion unit 11ac inverts the gradient of the information based on the loss function backpropagated from the domain identification model 230 and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
  • Step S25: The parameter update unit 21ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information according to the error backpropagation method.
  • Step S26: The control unit 11aa determines whether or not the end condition is satisfied. If the end condition is not satisfied, the control unit 11aa returns the process to step S22. If the end condition is satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least some of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
  • As described above, the second embodiment uses the labeling network 150, which includes the logical relationship understanding layer 110 that receives the utterance text sequence T 1 , ..., T N , obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context of the utterance text sequence T 1 , ..., T N , and outputs the intermediate feature sequence LF 1 , ..., LF N , and the labeling layer 120 that receives the intermediate feature sequence LF 1 , ..., LF N , obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs the estimated label sequence L 1 , ..., L N ; and the domain identification model 230 that obtains the estimated domain information LD n and SD n of the domain identification information indicating whether each utterance text T n included in the utterance text sequence T 1 , ..., T N belongs to the source domain or the target domain, and outputs the sequences LD 1 , ..., LD N and SD 1 , ..., SD N of the estimated domain information.
  • The learning device 21 applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to these models as the utterance text sequence, and performs adversarial learning in which the labeling network 150 is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the estimated domain information sequences LD 1 , ..., LD N and SD 1 , ..., SD N becomes low, and the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences becomes high.
  • This enables unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L 1 , ..., L N corresponding to the utterance text sequence in consideration of the context of the utterance text sequence T 1 , ..., T N .
  • The domain identification model 230 includes at least one of the short-term context domain identification networks 231-1, ..., 231-N, which receive the sequence SF 1 , ..., SF N of short-term intermediate features output from the short-term context understanding networks 111-1, ..., 111-N and obtain and output the sequence SD 1 , ..., SD N of estimated domain information, and the long-term context domain identification network 232, which receives the intermediate feature sequence LF 1 , ..., LF N output from the long-term context understanding network 112 and obtains and outputs the sequence LD 1 , ..., LD N of estimated domain information.
  • When the domain identification model 230 includes at least the long-term context domain identification network 232, the domain dependence of the context spanning a plurality of utterance texts T n can be efficiently removed from the labeling network 150. Therefore, the unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L 1 , ..., L N corresponding to the utterance text sequence in consideration of the context of the utterance text sequence T 1 , ..., T N , can be performed accurately.
  • When the domain identification model 230 includes both the short-term context domain identification networks 231-1, ..., 231-N and the long-term context domain identification network 232, the domain dependence of each utterance text T n and the domain dependence of the context spanning a plurality of utterance texts T n can both be efficiently removed from the labeling network 150. Therefore, the unsupervised domain adaptation of the labeling network 150 can be performed with even higher accuracy.
  • FIG. 8 illustrates the experimental results. It can be seen that unsupervised domain adaptation to the target domain is possible with high accuracy using the labeled data of the existing source domain, without preparing labeled data in the target domain.
  • In particular, the method of the second embodiment enables unsupervised domain adaptation with higher accuracy, and its identification accuracy is improved by an average of 3.4% as compared with the method of learning only with the data of the source domain.
  • The learning devices 11 and 21 and the inference device 13 in each embodiment are devices configured by a general-purpose or dedicated computer that includes, for example, a processor (hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory) or a ROM (read-only memory), and that executes a predetermined program.
  • This computer may have one processor and one memory, or may have a plurality of processors and memories. This program may be installed in the computer, or may be recorded in a ROM or the like in advance.
  • Some or all of the processing units may be configured using an electronic circuit that realizes a processing function by itself, instead of an electronic circuit (circuitry) that realizes a functional configuration by reading a program, like a CPU. An electronic circuit constituting one device may include a plurality of CPUs.
  • FIG. 9 is a block diagram illustrating the hardware configurations of the learning devices 11 and 21 and the inference device 13 in each embodiment.
  • The learning devices 11 and 21 and the inference device 13 of this example have a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • The CPU 10a of this example has a control unit 10aa, a calculation unit 10ab, and a register 10ac, and executes various arithmetic processes according to various programs read into the register 10ac.
  • The input unit 10b is an input terminal, a keyboard, a mouse, a touch panel, or the like into which data is input.
  • The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a into which a predetermined program has been read, or the like.
  • The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored.
  • The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • The CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program. Similarly, the CPU 10a writes various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d where this program and data are written are stored in the register 10ac of the CPU 10a.
  • The control unit 10aa of the CPU 10a sequentially reads out the addresses stored in the register 10ac, reads the program or data from the area on the RAM 10d indicated by each read address, causes the calculation unit 10ab to sequentially execute the operations indicated by the program, and stores the calculation results in the register 10ac.
  • The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium are a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and the like.
  • The distribution of this program is carried out, for example, by selling, transferring, or renting a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When the process is executed, the computer reads the program stored in its own storage device and executes the process according to the read program.
  • Alternatively, the computer may read the program directly from the portable recording medium and execute the process according to the program, or may execute the process according to the received program each time a program is transferred from the server computer to this computer.
  • The above-mentioned process may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing function only by an execution instruction and result acquisition, without transferring the program from the server computer to this computer.
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (for example, data that is not a direct command to the computer but has a property that defines the processing of the computer).
  • This device may be configured using not only a CPU 10a but also a GPU (Graphics Processing Unit). Further, although in each embodiment the present device is configured by executing a predetermined program on a computer, at least a part of these processing contents may be realized by hardware.
  • the unlabeled teacher data of the target domain includes the spoken text series of the target domain but does not include the correct label series.
  • however, at least part of the unlabeled teacher data of the target domain may contain a correct label series.
  • the correct label series of unlabeled teacher data of the target domain may or may not be used for learning the networks 100 and 200.
  • in each of the above embodiments, an example was given in which the series of a plurality of pieces of information having a logical relationship is an utterance text series and the label series is a series of labels representing the corresponding scene of each utterance (for example, opening, grasping the matter, identity verification, correspondence, closing).
  • this is only an example, and other information sequences such as a sentence sequence, a programming language sequence, an audio signal sequence, and a moving image signal sequence may be used as a sequence of a plurality of information having a logical relationship.
  • each model, such as the labeling model, need not be a model based on a deep neural network, and may be another model based on a probabilistic model, a classifier, or the like.
  • the logical relationship understanding layer 110 of each embodiment receives the utterance text series T_1, ..., T_N and obtains and outputs the intermediate feature series LF_1, ..., LF_N in consideration of the context (logical relationship) of the utterance text series T_1, ..., T_N.
  • however, the logical relationship understanding layer 110 may receive the utterance text series T_1, ..., T_N and obtain and output a series consisting of fewer than N or more than N intermediate features.
  • the labeling layer 120 of each embodiment receives the intermediate feature series LF_1, ..., LF_N (the first series based on the intermediate feature series) and obtains and outputs the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N.
  • however, the labeling layer 120 may receive the intermediate feature series LF_1, ..., LF_N (the first series based on the intermediate feature series) and obtain and output a series of fewer than N or more than N estimated labels.
  • the domain identification models 130 and 230 of each embodiment receive the intermediate feature series LF_1, ..., LF_N (the second series based on the intermediate feature series) and obtain and output N pieces of estimated domain information.
  • however, the domain identification models 130 and 230 may receive the intermediate feature series LF_1, ..., LF_N (the second series based on the intermediate feature series) and obtain and output fewer than N or more than N pieces of estimated domain information.
  • according to the present invention, for example, it is possible to efficiently remove the domain dependence of intermediate features for a series labeling problem that considers a complicated context.
  • in particular, by removing the domain dependence of intermediate features for each of the short-term and long-term logical relationships (contexts), a labeling network with higher fitness to the target domain can be learned, and the labeling accuracy in the target domain can be improved.
  • unsupervised domain adaptation technology has been studied for image recognition, but the present invention is the first to apply it to the problem of estimating a label series corresponding to a series of information while considering the logical relationship of a series of a plurality of pieces of information, as in language processing.
  • this unsupervised domain adaptation technology can, for example, significantly reduce the cost of labeling, which has been a barrier to expanding the contact center business across industries.
  • a domain identification network in utterance text units can be designed as a mechanism that straddles sentence boundaries by means of a unidirectional or bidirectional LSTM. This makes it possible to capture domain dependencies such as specific story flows that depend on the contact center industry and to efficiently remove them from the labeling network; as a result, the label estimation accuracy in the target domain can be improved.
  • the domain identification network of the utterance text unit (for example, the call unit) can also be designed as a mechanism that does not straddle utterance text boundaries. This makes it possible, for example, to capture domain dependencies caused by specific words that depend on the contact center industry and to efficiently remove them from the labeling network; as a result, the label estimation accuracy in the target domain can be improved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This training device performs adversarial training on a labeling model and a domain identification model. The labeling model includes a logical relationship understanding means for receiving an input information sequence, which is a sequence of pieces of information having a logical relationship, and for obtaining and outputting an intermediate feature sequence in which the logical relationship of the input information sequence is considered, and a labeling means for receiving a first sequence based on the intermediate feature sequence and for obtaining and outputting an inference label sequence of the label sequence corresponding to the input information sequence. The domain identification model receives a second sequence based on the intermediate feature sequence and obtains and outputs inference domain information of domain identification information for each piece of information included in the input information sequence. Using teaching data including labeled teaching data of a source domain and unlabeled teaching data of a target domain, the training device performs the adversarial training so as to train the labeling model such that the inference accuracy of the inference label sequence becomes high and the inference accuracy of the sequence of inference domain information becomes low, and to train the domain identification model such that the inference accuracy of the sequence of inference domain information becomes high, and it obtains and outputs at least a parameter of the labeling model.

Description

Learning devices, inference devices, their methods, and programs
 本発明はラベリング技術に関する。 The present invention relates to labeling technology.
 近年、会話や談話の理解を目的に、発話系列を入力として、発話毎に会話や談話の応対シーンを表すラベルを推定する、発話系列ラベリングの技術が提案されている(例えば、非特許文献1)。 In recent years, for the purpose of understanding conversations and discourses, a technique of utterance series labeling has been proposed in which a utterance series is input and a label representing a conversation or discourse response scene is estimated for each utterance (for example, Non-Patent Document 1). ).
 例えば非特許文献1では、コンタクトセンタにおけるオペレータとカスタマとの間の会話を音声認識して得られた発話テキスト系列を入力として、発話毎に対応シーン(オープニング、用件把握、本人確認、対応、クロージングのいずれか)のラベルを推定する、発話系列ラベリングを実現する深層ニューラルネットワーク(以下、ラベリングネットワーク)を開示している。 For example, in Non-Patent Document 1, the utterance text series obtained by voice-recognizing the conversation between the operator and the customer in the contact center is input, and the corresponding scene (opening, grasping the matter, identity verification, correspondence, correspondence, for each utterance, A deep neural network (hereinafter referred to as a labeling network) that realizes speech sequence labeling that estimates the label of any of the closings) is disclosed.
 非特許文献1のようなラベリングネットワークの学習には、多量のラベル付き教師データが必要である。しかし、新たなドメインでのラベリングを行うたびに、そのドメインにおける多量のラベル付き教師データを収集することは、ラベル付与のコストが膨大にかかることから、困難である。ここで非特許文献2には、過去に適用済みのドメイン(以下、ソースドメイン)のラベル付きデータ(以下、ラベル付き教師データ)と、新規に適用したいドメイン(以下、ターゲットドメイン)のラベルなしデータ(以下、ラベルなし教師データ)とから、新たなドメインでのラベリングを行う教師なしドメイン適応を実現する方法が提案されている。 A large amount of labeled teacher data is required for learning a labeling network as in Non-Patent Document 1. However, it is difficult to collect a large amount of labeled teacher data in a new domain each time it is labeled, because the cost of labeling is enormous. Here, in Non-Patent Document 2, labeled data (hereinafter, labeled teacher data) of a domain that has been applied in the past (hereinafter, source domain) and unlabeled data of a domain that is newly applied (hereinafter, target domain) are described. (Hereinafter, unlabeled teacher data), a method for realizing unsupervised domain adaptation by labeling in a new domain has been proposed.
 しかし、非特許文献2の方法は、ソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとを用い、ターゲットドメインに属する単一画像に対応するラベルを推定するラベリングモデルを学習するものである。すなわち、非特許文献2の方法は単一画像の単純な分類問題の教師なしドメイン適応を行うものであり、複数の情報の系列の論理的関係を考慮して当該情報の系列に対応するラベル系列を推定する(例えば、発話テキスト系列に対して対応シーン毎のラベルの系列を推定する)複雑な分類問題の教師なしドメイン適応方法は確立されていない。 However, the method of Non-Patent Document 2 uses the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to learn a labeling model for estimating the label corresponding to a single image belonging to the target domain. be. That is, the method of Non-Patent Document 2 is to perform unsupervised domain adaptation of a simple classification problem of a single image, and a label sequence corresponding to the sequence of the information in consideration of the logical relationship of the sequence of a plurality of information. An unsupervised domain adaptation method for complex classification problems has not been established.
 本発明はこのような点に鑑みてなされたものであり、複数の情報の系列の論理的関係を考慮して当該情報の系列に対応するラベル系列を推定するラベリングモデルの教師なしドメイン適応を行うことを目的とする。 The present invention has been made in view of such a point, and performs unsupervised domain adaptation of a labeling model that estimates a label sequence corresponding to a sequence of information in consideration of the logical relationship of the sequence of a plurality of information. The purpose is.
 論理的関係を持つ複数の情報の系列である入力情報系列を受け取り、前記入力情報系列の論理的関係を考慮した中間特徴系列を得、前記中間特徴系列を出力する論理的関係理解手段と、前記中間特徴系列に基づく第1系列を受け取り、前記入力情報系列に対応するラベル系列の推定ラベル系列を得、前記推定ラベル系列を出力するラベリング手段と、を含むラベリングモデルと、前記中間特徴系列に基づく第2系列を受け取り、前記入力情報系列に含まれる各部分情報がソースドメインに属するか、ターゲットドメインに属するか、を表すドメイン識別情報の推定ドメイン情報を得、前記推定ドメイン情報の系列を出力するドメイン識別モデルと、に対し、学習装置が、ソースドメインに属するラベル付きの学習用情報系列であるラベル付き教師データとターゲットドメインに属するラベルなしの学習用情報系列であるラベルなし教師データとを含む教師データを前記入力情報系列として用い、前記推定ラベル系列の推定精度が高く、前記推定ドメイン情報の系列の推定精度が低くなるように前記ラベリングモデルを学習し、前記推定ドメイン情報の系列の推定精度が高くなるように前記ドメイン識別モデルを学習する敵対的学習を行い、少なくとも前記ラベリングモデルのパラメータを得て出力する。 A logical relationship understanding means for receiving an input information sequence which is a sequence of a plurality of information having a logical relationship, obtaining an intermediate feature sequence considering the logical relationship of the input information sequence, and outputting the intermediate feature sequence, and the above-mentioned Based on a labeling model including a labeling means that receives a first sequence based on an intermediate feature sequence, obtains an estimated label sequence of a label sequence corresponding to the input information sequence, and outputs the estimated label sequence, and the intermediate feature sequence. The second series is received, the estimated domain information of the domain identification information indicating whether each partial information included in the input information series belongs to the source domain or the target domain is obtained, and the series of the estimated domain information is output. For the domain identification model, the learning device includes labeled teacher data, which is a labeled learning information sequence belonging to the source domain, and unlabeled teacher data, which is an unlabeled learning information sequence belonging to the target domain. Using the teacher data as the input information series, the labeling model is trained so that the estimation accuracy of the estimated label series is high and the estimation accuracy of the estimated domain information series is low, and the estimation accuracy of the estimated domain information series is low. Hostile learning is performed to learn the domain identification model so that the information becomes high, and at least the parameters of the labeling model are obtained and output.
 これにより、複数の情報の系列の論理的関係を考慮して当該情報の系列に対応するラベル系列を推定するラベリングモデルの教師なしドメイン適応を行うことができる。 This makes it possible to perform unsupervised domain adaptation of the labeling model that estimates the label sequence corresponding to the sequence of information in consideration of the logical relationship of the sequence of multiple information.
FIG. 1 is a block diagram illustrating the learning device of the first embodiment. FIG. 2 is a block diagram illustrating the detailed configuration of the learning unit of the first embodiment. FIG. 3 is a conceptual diagram illustrating the network used for the learning process of the first embodiment. FIG. 4 is a block diagram illustrating the inference device of the first embodiment. FIG. 5 is a conceptual diagram illustrating the learned labeling network of the first embodiment. FIG. 6 is a block diagram illustrating the learning device of the second embodiment. FIG. 7 is a conceptual diagram illustrating the network used for the learning process of the second embodiment. FIG. 8 is a graph illustrating experimental results. FIG. 9 is a block diagram illustrating the hardware configuration of the learning device and the inference device of the embodiments.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. Each embodiment shows an example of unsupervised domain adaptation of a labeling model based on a deep neural network that receives an utterance text series (a series of a plurality of pieces of information having a logical relationship) as input and outputs (series labeling) a series of labels (a label series) corresponding to the corresponding scenes (for example, opening, grasping the matter, identity verification, correspondence, closing). However, these are only examples and do not limit the present invention. That is, the present invention can be used for unsupervised domain adaptation of a labeling model that estimates an arbitrary label series corresponding to a series of any plurality of pieces of information in consideration of the logical relationship of that series. The logical relationship among a series of a plurality of pieces of information is not limited either; any relationship may exist among the plurality of pieces of information. Examples of logical relationships are contexts, word dependency relationships, grammatical relationships of a language, and frame-to-frame relationships of audio or video, but these do not limit the present invention. In addition, the labeling model is not limited to a model based on a deep neural network, and may be any model, such as a probabilistic model or a classifier, as long as it estimates and outputs a label series corresponding to an input series of information.
[First Embodiment]
The first embodiment of the present invention will be described.
<Functional configuration and learning process of learning device 11>
As illustrated in FIG. 1, the learning device 11 of the first embodiment has a learning unit 11a and storage units 11b and 11c, receives labeled teacher data of a source domain and unlabeled teacher data of a target domain as inputs, and obtains and outputs parameters (model parameters) of a labeling network for the target domain by learning. In the drawings, for simplicity, the source domain is written as "SD", the target domain as "TD", and a network as "NW". The labeled teacher data of the source domain exemplified here is a labeled learning information series belonging to the source domain, and includes an utterance text series of the source domain (an input information series, that is, a series of a plurality of pieces of information having a logical relationship) and the correct label series corresponding to that utterance text series. The unlabeled teacher data of the target domain exemplified here is an unlabeled learning information series belonging to the target domain, and includes an utterance text series of the target domain but does not include a correct label series. Further, a learning schedule, which is a schedule of, for example, the combination ratio of losses described later, may be input to the learning device 11, and the learning device 11 may perform the learning process according to that learning schedule. The learning device 11 may also output parameters of a domain identification network for realizing unsupervised domain adaptation. As illustrated in FIG. 2, the learning unit 11a has, for example, a control unit 11aa, a loss function calculation unit 11ab, a gradient reversal unit 11ac, and a parameter update unit 11ad. The learning unit 11a stores each piece of data obtained in the course of processing in the storage units 11b and 11c or in a temporary memory (not shown), reads the data as necessary, and uses it for each process.
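To make the two kinds of teacher data concrete, the following is a minimal Python sketch of how they might be represented (the class and field names are illustrative assumptions, not terms from this description):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class DialogueExample:
        """One call: an utterance text series T_1, ..., T_N with optional correct labels."""
        utterances: List[str]              # T_1, ..., T_N
        scene_labels: Optional[List[str]]  # correct label series, or None if unlabeled
        domain: str                        # "source" or "target"

    # Labeled teacher data of the source domain.
    source_example = DialogueExample(
        utterances=["Hello, thank you for calling.", "I would like to check my contract."],
        scene_labels=["opening", "grasping the matter"],
        domain="source",
    )

    # Unlabeled teacher data of the target domain (no correct label series).
    target_example = DialogueExample(
        utterances=["Hi, my internet connection keeps dropping."],
        scene_labels=None,
        domain="target",
    )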
≪Network 100≫
FIG. 3 shows a configuration example of the network 100 used by the learning device 11 in the learning process. The network 100 exemplified in FIG. 3 has a labeling network 150 (labeling model) and a domain identification model 130.
≪Labeling Network 150≫
The labeling network 150 illustrated in FIG. 3 receives (takes as input) an utterance text series T_1, ..., T_N (an input information series, that is, a series of a plurality of pieces of information having a logical relationship), and obtains and outputs an estimated label series L_1, ..., L_N, which is an estimate of the label series corresponding to the utterance text series T_1, ..., T_N. Here, the utterance text series T_1, ..., T_N is a series of N utterance texts T_n, where n is, for example, an index corresponding to time, n = 1, ..., N, N is an integer of 1 or more, and in general N is an integer of 2 or more. The utterance text T_n may be a single word such as "I'm sorry" or "Yes", or may be a sentence containing M(n) words T_{n,1}, ..., T_{n,M(n)}, such as "I'm in trouble because the reply is slow", where M(n) is an integer of 2 or more. The estimated label series L_1, ..., L_N exemplified here is a series of N estimated labels L_n (n = 1, ..., N). The estimated label L_n in this example corresponds to the utterance text T_n and represents, for example, the corresponding scene of the utterance text T_n (for example, opening, grasping the matter, identity verification, correspondence, closing). The labeling network 150 exemplified here has a logical relationship understanding layer 110 (logical relationship understanding means) and a labeling layer 120 (labeling means).
≪Logical relationship understanding layer 110≫
The logical relationship understanding layer 110 receives the utterance text series T_1, ..., T_N, obtains an intermediate feature series LF_1, ..., LF_N that takes into account the context (logical relationship) of the utterance text series T_1, ..., T_N, and outputs the intermediate feature series LF_1, ..., LF_N. The intermediate feature series LF_1, ..., LF_N is a series of N intermediate features LF_n (n = 1, ..., N), and the intermediate feature LF_n corresponds to the utterance text T_n. The logical relationship understanding layer 110 illustrated in FIG. 3 includes short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means) and a long-term context understanding network 112 (long-term logical relationship understanding means). For example, the short-term context understanding networks 111-1, ..., 111-N are networks identical to one another (for example, networks whose parameters are identical), and each short-term context understanding network 111-n represents the state corresponding to each n = 1, ..., N (for example, each time). The short-term context understanding networks 111-1, ..., 111-N exemplified here receive the utterance text series T_1, ..., T_N. Each short-term context understanding network 111-n (n = 1, ..., N) that has received an utterance text T_n included in the utterance text series T_1, ..., T_N (each piece of partial information included in the input information series) obtains a short-term intermediate feature SF_n that takes into account the context of the words within the received utterance text T_n (for example, the word-level short-term context), that is, a short-term intermediate feature that takes into account the logical relationship of information within the partial information, and outputs the short-term intermediate feature SF_n. As a result, the short-term context understanding networks 111-1, ..., 111-N output a series SF_1, ..., SF_N of short-term intermediate features. When the utterance text T_n contains only one word, the context of the words within that utterance text T_n depends only on that one word, but the SF_n obtained in this case is still a short-term intermediate feature that takes the word context into account. However, this does not limit the present invention. For example, each short-term context understanding network 111-n may be divided into a plurality of short-term context understanding networks 111-n1, ..., 111-nK' (where K' is an integer of 2 or more). For example, k' = 1, ..., K' is an index representing a layer of the short-term context understanding network, and each short-term context understanding network 111-nk' represents the network from the input layer to the k'-th layer of the short-term context understanding network. In this case, each short-term context understanding network 111-nk' outputs the short-term intermediate feature SF_nk' of the k'-th layer. The long-term context understanding network 112 exemplified here receives the series SF_1, ..., SF_N of short-term intermediate features (the short-term intermediate feature series), obtains the intermediate feature series LF_1, ..., LF_N that takes into account the context among the plurality of utterance texts T_n included in the utterance text series T_1, ..., T_N (for example, the sentence-level long-term context or a long-term context spanning a plurality of sentences), that is, a long-term intermediate feature series that takes into account the logical relationship among the plurality of pieces of partial information included in the input information series, and outputs the intermediate feature series LF_1, ..., LF_N. However, this does not limit the present invention. For example, when each short-term context understanding network 111-nk' outputs the short-term intermediate feature SF_nk' of the k'-th layer as described above, SF_nK' may be input to the long-term context understanding network 112 as SF_n, or a plurality of SF_nk' among SF_n1, ..., SF_nK' may be input. Further, the long-term context understanding network 112 may be divided into a plurality of long-term context understanding networks 112-1, ..., 112-K (where K is an integer of 2 or more). For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network, and each long-term context understanding network 112-k represents the network from the input layer to the k-th layer of the long-term context understanding network. In this case, each long-term context understanding network 112-k (k = 1, ..., K) may receive the series SF_n of any plurality of short-term intermediate features and obtain and output intermediate features that take into account the context among the plurality of utterance texts T_n corresponding to the received series SF_n. In this case, the long-term context understanding networks 112-1, ..., 112-K output K intermediate feature series {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK}. LF_nk (n = 1, ..., N, k = 1, ..., K) represents the intermediate feature corresponding to each n (for example, each time) of the k-th layer output from each long-term context understanding network 112-k.
Here, the short-term context understanding network 111-n can be configured, for example, by a combination of an embedding layer that converts words into numerical values using a dictionary, a unidirectional LSTM (long short-term memory), a bidirectional LSTM, an attention mechanism, and the like (see, for example, Non-Patent Document 1). The long-term context understanding network 112 can be configured, for example, by a combination of a unidirectional LSTM, a bidirectional LSTM, and the like.
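As a concrete illustration of such a configuration, the following is a minimal sketch in Python using PyTorch (an assumption; the patent does not prescribe an implementation). A word embedding layer plus a bidirectional LSTM stands in for the short-term context understanding network 111-n, mean pooling stands in for the attention mechanism, and a unidirectional LSTM stands in for the long-term context understanding network 112; all dimensions and names are illustrative:

    import torch
    import torch.nn as nn

    class LogicalRelationUnderstanding(nn.Module):
        """Short-term (within-utterance) and long-term (across-utterance) context encoder."""
        def __init__(self, vocab_size, emb_dim=128, short_dim=128, long_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            # short-term context understanding network 111-n (parameters shared across n)
            self.short_lstm = nn.LSTM(emb_dim, short_dim, batch_first=True, bidirectional=True)
            # long-term context understanding network 112
            self.long_lstm = nn.LSTM(2 * short_dim, long_dim, batch_first=True)

        def forward(self, token_ids):
            # token_ids: (batch, N utterances, M words per utterance)
            b, n, m = token_ids.shape
            emb = self.embed(token_ids.reshape(b * n, m))      # (b*n, m, emb_dim)
            word_states, _ = self.short_lstm(emb)              # word-level context within T_n
            sf = word_states.mean(dim=1).reshape(b, n, -1)     # short-term features SF_1, ..., SF_N
            lf, _ = self.long_lstm(sf)                         # long-term features LF_1, ..., LF_N
            return sf, lf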
≪Labeling layer 120≫
The labeling layer 120 receives the intermediate feature series LF_1, ..., LF_N (a first series based on the intermediate feature series), obtains the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N, and outputs the estimated label series L_1, ..., L_N. The labeling layer 120 illustrated in FIG. 3 includes label prediction networks 120-1, ..., 120-N. For example, the label prediction networks 120-1, ..., 120-N are networks identical to one another (for example, networks whose parameters are identical), and each label prediction network 120-n represents the state corresponding to each n = 1, ..., N (for example, each time). The label prediction network 120-n exemplified here receives the intermediate feature LF_n, obtains the estimated label L_n corresponding to the utterance text T_n, and outputs the estimated label L_n. The label prediction network 120-n can be configured, for example, by a fully connected neural network whose activation function is a softmax function. When K intermediate feature series {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK} are output from the long-term context understanding networks 112-1, ..., 112-K, the label prediction network 120-n receives, for example, LF_nK as the intermediate feature LF_n, obtains the estimated label L_n corresponding to the utterance text T_n, and outputs the estimated label L_n. Alternatively, the label prediction network 120-n may receive a plurality of intermediate features LF_nk among the intermediate feature series {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK}, obtain the estimated label L_n corresponding to the utterance text T_n, and output the estimated label L_n.
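Under the same assumptions as the previous sketch, the labeling layer can be illustrated as a single fully connected layer applied at every position n (the softmax is folded into the cross-entropy loss used during learning):

    import torch.nn as nn

    class LabelingLayer(nn.Module):
        """Label prediction network 120-n, with parameters shared across positions n = 1, ..., N."""
        def __init__(self, long_dim=128, num_labels=5):
            super().__init__()
            self.classifier = nn.Linear(long_dim, num_labels)

        def forward(self, lf):
            # lf: (batch, N, long_dim) -> per-utterance label logits (batch, N, num_labels)
            return self.classifier(lf)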
Domain Discriminative Model 130≫
The domain identification model 130 illustrated in FIG. 3 receives the intermediate feature series LF_1, ..., LF_N (a second series based on the intermediate feature series), obtains estimated domain information D_n (n = 1, ..., N) of domain identification information indicating whether each utterance text T_n included in the utterance text series T_1, ..., T_N (each piece of partial information included in the input information series) belongs to the source domain or to the target domain (that is, whether each utterance text T_n is of the source domain or of the target domain), and outputs the series D_1, ..., D_N of the estimated domain information. The domain identification model 130 exemplified here includes N domain identification networks 130-1, ..., 130-N. For example, the domain identification networks 130-1, ..., 130-N are networks identical to one another (for example, networks whose parameters are identical), and each domain identification network 130-n represents the state corresponding to each n = 1, ..., N (for example, each time). For example, each domain identification network 130-n (n = 1, ..., N) receives the intermediate feature LF_n, obtains the estimated domain information D_n, and outputs it. However, this does not limit the present invention. For example, a plurality of domain identification networks 130-nk may exist in place of each domain identification network 130-n. For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network, and each domain identification network 130-nk represents a network corresponding to each n (for example, each time). In this case, each domain identification network 130-nk may receive the intermediate feature LF_nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and use it to obtain and output estimated domain information D_nk. D_nk (n = 1, ..., N, k = 1, ..., K) represents the estimated domain information corresponding to each n (for example, each time) output from each domain identification network 130-nk. The domain identification network 130-n (or the domain identification network 130-nk) can be configured, for example, by a fully connected neural network whose activation function is a softmax function.
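A corresponding sketch of the domain identification model 130, again under the assumption of a small fully connected network applied at every position n (the softmax over the two domains is folded into the loss):

    import torch.nn as nn

    class DomainDiscriminator(nn.Module):
        """Domain identification network 130-n, with parameters shared across positions n = 1, ..., N."""
        def __init__(self, feat_dim=128, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 2),   # two classes: source domain / target domain
            )

        def forward(self, features):
            # features: (batch, N, feat_dim) -> per-utterance domain logits (batch, N, 2)
            return self.net(features)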
≪Learning process≫
In the learning process, labeled teacher data of the source domain (a labeled learning information series belonging to the source domain) and unlabeled teacher data of the target domain (an unlabeled learning information series belonging to the target domain) are input to the learning unit 11a of the learning device 11. For the network 100 described above, the learning unit 11a uses teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as the utterance text series T_1, ..., T_N, and performs adversarial learning in which the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label series L_1, ..., L_N becomes high and the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes low, while the domain identification model 130 is trained so that the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes high. That is, the learning unit 11a performs adversarial learning of the labeling network 150 and the domain identification model 130 on the basis of a loss function (hereinafter, label prediction loss) representing the error between the estimated label series L_1, ..., L_N output from the labeling network 150 when the above teacher data is input to the network 100 as the utterance text series T_1, ..., T_N and the corresponding correct label series of the labeled teacher data of the source domain, and a loss function (hereinafter, domain identification loss) representing the error between the series D_1, ..., D_N of estimated domain information output from the domain identification model 130 and the correct label series of the estimated domain information specified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain. Note that the estimated label series L_1, ..., L_N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 100 is not used for calculating the label prediction loss.
The learning unit 11a performs this adversarial learning by using, for example, the error backpropagation method. In this case, a gradient reversal layer 141-n (n = 1, ..., N) is provided between the logical relationship understanding layer 110 and the domain identification model 130 (for example, between the long-term context understanding network 112-n and the domain identification network 130-n), and the gradient is reversed at the gradient reversal layer 141-n only during error backpropagation. Here, by performing learning so that the label prediction loss becomes small, the estimation accuracy of the estimated label series L_1, ..., L_N in the labeling network 150 becomes high. In addition, by reversing the gradient at the gradient reversal layer 141-n only during error backpropagation and performing learning so that the domain identification loss becomes small, it is possible to carry out adversarial learning in which the domain identification model 130 is trained so that the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes high, while the logical relationship understanding layer 110 is trained so as to obtain an intermediate feature series LF_1, ..., LF_N that lowers the estimation accuracy of the series D_1, ..., D_N of estimated domain information. Through this adversarial learning, it is possible to train a labeling network 150 that can generate an intermediate feature series LF_1, ..., LF_N from which the estimated label series L_1, ..., L_N can be estimated accurately but whose domain cannot be estimated by the domain identification model 130. As a result, the labeling network 150 can acquire an intermediate feature series LF_1, ..., LF_N that is effective for label prediction while suppressing dependence on the domain, and unsupervised domain adaptation can be realized.
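The gradient reversal layer 141-n is commonly realized as a function that is the identity in the forward pass and flips (and optionally scales) the gradient in the backward pass; a widely used PyTorch-style sketch of this technique (an assumption, not code from the patent) is:

    import torch

    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; multiplies the gradient by -lam on the backward pass."""
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # gradient w.r.t. x is reversed; lam itself receives no gradient
            return -ctx.lam * grad_output, None

    def grad_reverse(x, lam=1.0):
        return GradReverse.apply(x, lam)

Inserting grad_reverse between the logical relationship understanding layer and the domain discriminator leaves the forward computation unchanged while making the encoder receive the sign-flipped domain identification loss gradient during error backpropagation.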
This learning process can be realized by optimizing (minimizing) a loss function in which the label prediction loss and the domain identification loss are linearly combined. The combination ratio of this linear combination may be determined in advance, or may be specified by the learning schedule input to the learning unit 11a.
The learning unit 11a may perform the above learning while changing the combination ratio of the label prediction loss and the domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 11a may perform learning using only the label prediction loss as the loss function in the early stage of learning and gradually increase the proportion of the domain identification loss in the loss function as the number of learning steps increases. Further, the learning unit 11a may repeatedly perform learning in which learning is carried out at a constant combination ratio until it converges, the combination ratio is then changed, and learning is carried out again until convergence, while changing the combination ratio based on the learning schedule.
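Such a schedule might be written, for example, as follows (a sketch under the assumption of a smoothly increasing weight; the description above only requires that the combination ratio can change with the number of learning steps):

    import math

    def domain_loss_weight(step, total_steps, max_weight=1.0):
        """0 at the start of learning, smoothly approaching max_weight (illustrative schedule)."""
        p = step / max(total_steps, 1)
        return max_weight * (2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0)

    def combined_loss(label_loss, domain_loss, step, total_steps):
        # linear combination of the label prediction loss and the domain identification loss
        return label_loss + domain_loss_weight(step, total_steps) * domain_loss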
The learning unit 11a may also prepare a plurality of the various domain identification models 130 and/or labeling networks 150 exemplified above, perform learning with each of them, and later select, from among the labeling networks 150 obtained by the respective learning runs, the labeling network 150 whose estimation accuracy of the label series in the target domain is best.
The learning process may be batch learning, mini-batch learning, or online learning.
The learning unit 11a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, and stores the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N) in the storage unit 11c. The learning device 11 outputs the parameters of the labeling network 150 stored in the storage unit 11b. The parameters of the labeling network 150 are used in the inference processing described later. Normally, the parameters of the domain identification model 130 are not used for the inference processing and therefore do not have to be output from the learning device 11. However, the learning device 11 may output at least some of the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N).
The above learning process will now be functionally illustrated with reference to FIG. 2.
Step S11: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including labeled teacher data of the source domain and unlabeled teacher data of the target domain. Further, the control unit 11aa initializes the parameters of the network 100.
Step S12: The loss function calculation unit 11ab inputs the teacher data into the network 100 as the utterance text series T_1, ..., T_N, obtains the label prediction loss and the domain identification loss, and obtains a loss function in which they are linearly combined.
Step S13: The parameter update unit 11ad back-propagates information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 130 and the labeling layer 120.
Step S14: The gradient reversal unit 11ac reverses the gradient of the information based on the loss function back-propagated from the domain identification model 130 and back-propagates it to the logical relationship understanding layer 110. The information based on the loss function back-propagated from the labeling layer 120 is back-propagated to the logical relationship understanding layer 110 as it is.
Step S15: The parameter update unit 11ad updates the parameters of the logical relationship understanding layer 110 using the back-propagated information according to the error back-propagation method.
Step S16: The control unit 11aa determines whether or not the end condition (for example, a condition that the number of parameter updates reaches a predetermined number) is satisfied. If the end condition is not satisfied here, the control unit 11aa returns the process to step S12. On the other hand, when the end condition is satisfied, the control unit 11aa outputs the parameter of the labeling network 150. If necessary, the control unit 11aa may output at least one of the parameters of the domain identification networks 130-1, ..., 130-N.
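Putting the pieces together, one iteration of steps S12 to S16 might look like the following sketch, assuming the encoder, labeling layer, domain discriminator, grad_reverse, and combined_loss sketched earlier in this description (batching, padding, and the termination condition are simplified):

    import torch
    import torch.nn.functional as F

    def train_step(encoder, labeler, discriminator, optimizer, batch, step, total_steps):
        """One adversarial learning step (simplified)."""
        # batch: token_ids (b, N, M); scene_labels (b, N), where -100 marks target-domain
        # utterances excluded from the label prediction loss; domain_labels (b, N).
        token_ids, scene_labels, domain_labels = batch

        sf, lf = encoder(token_ids)                            # layer 110 (step S12)
        label_logits = labeler(lf)                             # labeling layer 120
        label_loss = F.cross_entropy(label_logits.flatten(0, 1),
                                     scene_labels.flatten(), ignore_index=-100)

        domain_logits = discriminator(grad_reverse(lf))        # gradient reversal layer (step S14)
        domain_loss = F.cross_entropy(domain_logits.flatten(0, 1), domain_labels.flatten())

        loss = combined_loss(label_loss, domain_loss, step, total_steps)
        optimizer.zero_grad()
        loss.backward()                                        # error backpropagation (steps S13-S15)
        optimizer.step()                                       # parameter update
        return loss.item()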
<Functional configuration of inference device 13 and inference processing>
As illustrated in FIG. 4, the inference device 13 of the first embodiment has an inference unit 13a and a storage unit 13b. The parameters of the labeling network 150 obtained as described above are stored in the storage unit 13b.
≪Inference processing≫
In the inference processing, an utterance text series for inference (an input information series) is input to the inference unit 13a. The inference unit 13a applies the utterance text series for inference to the labeling network 150 (labeling model) specified by the parameters stored in the storage unit 13b, obtains an estimated label series of the label series corresponding to the utterance text series for inference, and outputs the estimated label series. For example, in the case of the labeling network 150 illustrated in FIG. 5, the inference unit 13a inputs the utterance text series T_1, ..., T_N for inference to the logical relationship understanding layer 110 and obtains the intermediate feature series LF_1, ..., LF_N corresponding to the utterance text series T_1, ..., T_N for inference. For example, the inference unit 13a inputs the utterance text series T_1, ..., T_N for inference to the short-term context understanding networks 111-1, ..., 111-N, respectively, to obtain the series SF_1, ..., SF_N of short-term intermediate features, and inputs the series SF_1, ..., SF_N of short-term intermediate features to the long-term context understanding network 112 to obtain the intermediate feature series LF_1, ..., LF_N. Further, the inference unit 13a inputs the intermediate feature series LF_1, ..., LF_N to the labeling layer 120 to obtain and output the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N.
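In code, this inference processing reduces to a forward pass through the learned labeling network only; the domain identification model is not used. A minimal sketch under the same assumptions as the earlier blocks:

    import torch

    @torch.no_grad()
    def infer_labels(encoder, labeler, token_ids, label_names):
        # token_ids: (1, N, M) tokenized utterance text series of one conversation
        _, lf = encoder(token_ids)               # intermediate feature series LF_1, ..., LF_N
        logits = labeler(lf)                     # (1, N, num_labels)
        ids = logits.argmax(dim=-1).squeeze(0)   # estimated label series L_1, ..., L_N
        return [label_names[i] for i in ids.tolist()]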
<Characteristics of the first embodiment>
In this embodiment, the labeling network 150 includes the logical relationship understanding layer 110, which receives the utterance text series T_1, ..., T_N, obtains the intermediate feature series LF_1, ..., LF_N in consideration of the context of the utterance text series T_1, ..., T_N, and outputs the intermediate feature series LF_1, ..., LF_N, and the labeling layer 120, which receives the intermediate feature series LF_1, ..., LF_N, obtains the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N, and outputs the estimated label series L_1, ..., L_N. The domain identification model 130 receives the intermediate feature series LF_1, ..., LF_N, obtains the estimated domain information D_n of domain identification information indicating whether each utterance text T_n included in the utterance text series T_1, ..., T_N belongs to the source domain or to the target domain, and outputs the series D_1, ..., D_N of the estimated domain information. For these, the learning device 11 uses teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as the utterance text series T_1, ..., T_N, and performs adversarial learning in which the labeling network 150 is trained so that the estimation accuracy of the estimated label series L_1, ..., L_N becomes high and the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes low, while the domain identification model 130 is trained so that the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes high. This reduces the domain dependence of the labeling network 150, and as a result, unsupervised domain adaptation of the labeling network 150, which estimates the label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N in consideration of the context of that utterance text series, becomes possible.
[Second Embodiment]
In the second embodiment, a network that identifies the domain from the context among a plurality of utterance texts T_n (a long-term intermediate feature series that takes into account the logical relationship among a plurality of pieces of partial information included in the input information series) and a network that identifies the domain from the short-term intermediate feature SF_n that takes into account the context of the words within an utterance text T_n (a short-term intermediate feature that takes into account the logical relationship of information within the partial information) are used simultaneously for adversarial learning. This makes it possible to remove the dependence on the domain even more efficiently and to learn a labeling network for the target domain with higher accuracy. In the following, the differences from the first embodiment will be mainly described, and for matters common to the first embodiment, the same reference symbols will be used to simplify the explanation.
<Functional configuration and learning process of learning device 21>
As illustrated in FIG. 6, the learning device 21 of the second embodiment has a learning unit 21a and storage units 11b, 21c, and 21d, receives labeled teacher data of the source domain and unlabeled teacher data of the target domain as inputs, and obtains and outputs parameters (model parameters) of a labeling network for the target domain by learning. A learning schedule may further be input to the learning device 21, and the learning device 21 may perform the learning process according to the learning schedule. The learning device 21 may also output parameters of a domain identification network for realizing unsupervised domain adaptation. As illustrated in FIG. 2, the learning unit 21a has, for example, a control unit 11aa, a loss function calculation unit 21ab, a gradient reversal unit 11ac, and a parameter update unit 21ad. The learning unit 21a stores each piece of data obtained in the course of processing in the storage units 11b, 21c, and 21d or in a temporary memory (not shown), reads the data as necessary, and uses it for each process.
≪Network 200≫
FIG. 7 shows a configuration example of the network 200 used by the learning device 21 in the learning process. The network 200 illustrated in FIG. 7 has a labeling network 150 (labeling model) and a domain identification model 230. Since the labeling network 150 is the same as that of the first embodiment, the description thereof will be omitted, and the domain identification model 230 will be described below.
Domain Discriminative Model 230≫
The domain identification model 230 illustrated in FIG. 7 includes short-term context domain identification networks 231-1, ..., 231-N (short-term logical relationship domain identification means) and a long-term context domain identification network 232 (long-term logical relationship domain identification means). For example, the short-term context domain identification networks 231-1, ..., 231-N are networks identical to one another (for example, networks whose parameters are identical), and each short-term context domain identification network 231-n represents the state corresponding to each n = 1, ..., N (for example, each time).
The long-term context domain identification network 232 receives the intermediate feature series LF_1, ..., LF_N (the long-term intermediate feature series) output from the long-term context understanding network 112, and obtains and outputs a series LD_1, ..., LD_N of estimated domain information. Each piece of estimated domain information LD_n (n = 1, ..., N) is estimated information of domain identification information indicating whether each utterance text T_n belongs to the source domain or to the target domain. Unlike the domain identification network 130-n of the first embodiment, the long-term context domain identification network 232 illustrated in FIG. 7 is intended to remove from the labeling network 150 the domain dependence of the context (logical relationship) spanning a plurality of utterance texts T_n, which are words or sentences, by continuously capturing the input series SF_1, ..., SF_N of short-term intermediate features (for example, by capturing the series SF_1, ..., SF_N of short-term intermediate features continuously in the time direction). However, this does not limit the present invention. For example, instead of the long-term context domain identification network 232, a plurality of long-term context domain identification networks 232-1, ..., 232-K (where K is an integer of 2 or more) may exist. For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network. In this case, each long-term context domain identification network 232-k (k = 1, ..., K) may receive the long-term intermediate feature series LF_nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and obtain and output estimated domain information LD_nk indicating whether the utterance text T_n corresponding to the received long-term intermediate feature series LF_nk belongs to the source domain or to the target domain. LF_nk (n = 1, ..., N, k = 1, ..., K) represents the intermediate feature corresponding to each n (for example, each time) output from each long-term context understanding network 112-k exemplified in the first embodiment. Even in this case, the domain dependence of the context spanning a plurality of utterance texts T_n can be removed from the labeling network 150. Here, the long-term context domain identification network 232 can be configured, for example, by a combination of a unidirectional LSTM or a bidirectional LSTM and a fully connected neural network whose activation function is a softmax function, or the like.
The short-term context domain identification networks 231-1, ..., 231-N receive the sequence of short-term intermediate features SF1, ..., SFN (a second sequence based on the intermediate feature sequence; the short-term intermediate feature sequence) output from the short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means), and obtain and output a sequence SD1, ..., SDN of estimated domain information. That is, each short-term context domain identification network 231-n (where n = 1, ..., N) receives the short-term intermediate feature SFn output from the short-term context understanding network 111-n, obtains estimated domain information SDn of the domain identification information indicating whether each utterance text Tn belongs to the source domain or to the target domain, and outputs the estimated domain information SDn. Unlike the long-term context domain identification network 232, the short-term context domain identification network 231-n estimates, for each short-term intermediate feature SFn, whether the utterance text Tn belongs to the source domain or to the target domain, and thereby aims to efficiently remove the domain dependency of a single utterance text Tn, such as a domain-dependent specific word or document. However, this does not limit the present invention. For example, instead of each short-term context domain identification network 231-n, a plurality of short-term context domain identification networks 231-n1, ..., 231-nK' (where K' is an integer of 2 or more) may be provided. For example, k' = 1, ..., K' is an index representing a layer of the short-term context domain identification network, and each short-term context domain identification network 231-nk' represents a network corresponding to each n (for example, each time). In this case, each short-term context domain identification network 231-nk' (where k' = 1, ..., K') may receive a short-term intermediate feature SFnk' (n ∈ {1, ..., N}, k' ∈ {1, ..., K'}) and obtain and output estimated domain information SDnk' indicating whether the utterance text Tn corresponding to the received short-term intermediate feature SFnk' belongs to the source domain or to the target domain. Here, SFnk' is the short-term intermediate feature corresponding to each n (for example, each time) output from each short-term context understanding network 111-nk' exemplified in the first embodiment. The short-term context domain identification network 231-n can be configured by, for example, a fully connected neural network with a softmax activation function, or a combination of such components.
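For contrast with the long-term discriminator sketched above, the following is a minimal sketch of a per-utterance (short-term) domain discriminator that looks at each SF_n independently; the module name, hidden size, and the use of logits with an external cross-entropy loss (softmax applied implicitly) are illustrative assumptions.

```python
import torch.nn as nn

class ShortTermContextDomainDiscriminator(nn.Module):
    """Sketch of a short-term context domain identification network:
    a small feed-forward classifier applied independently to each
    short-term intermediate feature SF_n, producing SD_n."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128, num_domains: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_domains),
        )

    def forward(self, sf_sequence):
        # sf_sequence: (batch, N, feature_dim) short-term intermediate features
        return self.net(sf_sequence)   # (batch, N, num_domains) per-utterance domain logits
```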
≪Learning process≫
In the learning process, labeled teacher data of the source domain (a labeled learning information sequence belonging to the source domain) and unlabeled teacher data of the target domain (an unlabeled learning information sequence belonging to the target domain) are input to the learning unit 21a of the learning device 21. The learning unit 21a performs adversarial learning on the network 200 described above, using teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as utterance text sequences T1, ..., TN: the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label sequence L1, ..., LN becomes high and the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes low, while the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes high. That is, the learning unit 21a performs the adversarial learning between the labeling network 150 and the domain identification model 230 based on: a loss function (hereinafter, label prediction loss) representing the error between the estimated label sequences L1, ..., LN output from the labeling network 150 when the above teacher data are input to the network 200 as utterance text sequences T1, ..., TN and the corresponding correct label sequences of the labeled teacher data of the source domain; a loss function (hereinafter, long-term context domain identification loss) representing the error between the estimated domain information sequence LD1, ..., LDN output from the long-term context domain identification network 232 and the correct domain-label sequence identified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain; and a loss function (hereinafter, short-term context domain identification loss) representing the error between the estimated domain information sequence SD1, ..., SDN output from the short-term context domain identification networks 231-1, ..., 231-N and the correct domain-label sequence identified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain. Note that the estimated label sequences L1, ..., LN output from the labeling network 150 when the unlabeled teacher data of the target domain are input to the network 200 are not used for calculating the label prediction loss.
The learning unit 21a performs this adversarial learning using, for example, the error backpropagation method. In this case, gradient reversal layers 242-n (where n = 1, ..., N) are provided between the long-term context understanding network 112 and the long-term context domain identification network 232, gradient reversal layers 241-n (where n = 1, ..., N) are provided between the short-term context understanding networks 111-n and the short-term context domain identification networks 231-n, and the gradient is reversed at the gradient reversal layers 242-n and 241-n only during error backpropagation. By training so that the label prediction loss becomes small, the estimation accuracy of the estimated label sequence L1, ..., LN in the labeling network 150 becomes high. Also, by reversing the gradient at the gradient reversal layers 242-n only during error backpropagation and training so that the long-term context domain identification loss becomes small, adversarial learning can be performed in which the long-term context domain identification network 232 is trained so that the estimation accuracy of the estimated domain information sequence LD1, ..., LDN becomes high, while the long-term context understanding network 112 is trained to produce an intermediate feature sequence LF1, ..., LFN that lowers the estimation accuracy of the estimated domain information sequence LD1, ..., LDN. Furthermore, by reversing the gradient at the gradient reversal layers 241-n only during error backpropagation and training so that the short-term context domain identification loss becomes small, adversarial learning can be performed in which the short-term context domain identification networks 231-1, ..., 231-N are trained so that the estimation accuracy of the estimated domain information sequence SD1, ..., SDN becomes high, while the short-term context understanding networks 111-1, ..., 111-N are trained to produce a sequence of short-term intermediate features SF1, ..., SFN that lowers the estimation accuracy of the estimated domain information sequence SD1, ..., SDN. Through this adversarial learning, a labeling network 150 can be learned that estimates the estimated label sequence L1, ..., LN accurately while generating an intermediate feature sequence LF1, ..., LFN and a sequence of short-term intermediate features SF1, ..., SFN from which the domain identification model 230 cannot estimate the domain. As a result, the labeling network 150 can acquire an intermediate feature sequence LF1, ..., LFN that is effective for label prediction while suppressing the dependency on the domain of the context spanning the plurality of utterance texts Tn, and can acquire a sequence of short-term intermediate features SF1, ..., SFN that is effective for label prediction while suppressing the dependency on the domain per utterance text Tn, so that unsupervised domain adaptation can be realized with higher accuracy.
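The gradient reversal layers 241-n and 242-n described above can be realized, in a framework such as PyTorch, as an operation that is the identity in the forward pass and flips the sign of the gradient in the backward pass. The sketch below shows one common way to implement such a layer; the names GradientReversal and grad_reverse, and the optional scaling factor lambd, are illustrative choices and not taken from the original disclosure.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient
    by -lambd in the backward pass (gradient reversal layer)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back into the
        # context understanding networks; no gradient is needed for lambd.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradientReversal.apply(x, lambd)
```

In such a setup, the features LF1, ..., LFN and SF1, ..., SFN would be passed through grad_reverse before entering the respective domain discriminators, so that the discriminators learn to identify the domain while the context understanding networks receive the reversed gradient and move toward domain-invariant features.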
This learning process can be realized by optimizing (minimizing) a loss function that linearly combines the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss. The combination ratio of this linear combination may be predetermined, or may be specified by a learning schedule input to the learning unit 21a.
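One way to write this combined objective down, assuming cross-entropy losses and denoting the (schedule-dependent) combination weights by α and β, which are notational choices rather than symbols from the original text, is

$$\mathcal{L} = \mathcal{L}_{\mathrm{label}} + \alpha\,\mathcal{L}_{\mathrm{LD}} + \beta\,\mathcal{L}_{\mathrm{SD}},$$

$$\mathcal{L}_{\mathrm{label}} = -\sum_{n=1}^{N} \log p(L_n^{*} \mid T_1,\dots,T_N), \qquad \mathcal{L}_{\mathrm{LD}} = -\sum_{n=1}^{N} \log p(d_n^{*} \mid \mathrm{LF}_n), \qquad \mathcal{L}_{\mathrm{SD}} = -\sum_{n=1}^{N} \log p(d_n^{*} \mid \mathrm{SF}_n),$$

where $L_n^{*}$ is the correct label of the n-th utterance text (available only for the labeled teacher data of the source domain) and $d_n^{*}$ is the correct domain label (source or target). Because of the gradient reversal layers, the domain identification networks are updated so as to decrease $\mathcal{L}_{\mathrm{LD}}$ and $\mathcal{L}_{\mathrm{SD}}$, while the context understanding networks are updated in the direction that increases them, which is the adversarial behavior described above.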
The learning unit 21a may perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may train using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the long-term context domain identification loss and the short-term context domain identification loss in the loss function as the number of learning steps increases. Furthermore, the learning unit 21a may repeatedly carry out, while changing the combination ratio based on the learning schedule, a procedure of training with a fixed combination ratio until the learning converges, then changing the combination ratio and training again until convergence.
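As a sketch of such a schedule, the helper below ramps the weight of the domain identification losses from 0 toward a maximum value as training progresses; the specific sigmoid-shaped ramp (often used in gradient-reversal-based adversarial training) and the function name are illustrative assumptions, not the schedule prescribed by this disclosure.

```python
import math

def domain_loss_weight(step: int, total_steps: int, max_weight: float = 1.0) -> float:
    """Weight of the domain identification losses at a given training step.

    Starts near 0 (so the label prediction loss dominates early on) and
    increases smoothly toward max_weight as training progresses."""
    progress = min(step / max(total_steps, 1), 1.0)
    return max_weight * (2.0 / (1.0 + math.exp(-10.0 * progress)) - 1.0)

# Example: weights at the start, middle, and end of training.
print(domain_loss_weight(0, 10000), domain_loss_weight(5000, 10000), domain_loss_weight(10000, 10000))
```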
The domain identification model 230 may include only one of the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N.
If the domain identification model 230 includes only the long-term context domain identification network 232, the gradient reversal layers 241-1, ..., 241-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the long-term context domain identification loss. In this case as well, the combination ratio of the linear combination may be predetermined or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may also perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss and the long-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may train using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the long-term context domain identification loss in the loss function as the number of learning steps increases. Furthermore, the learning unit 21a may repeatedly carry out, while changing the combination ratio based on the learning schedule, a procedure of training with a fixed combination ratio until the learning converges, then changing the combination ratio and training again until convergence.
If the domain identification model 230 includes only the short-term context domain identification networks 231-1, ..., 231-N, the gradient reversal layers 242-1, ..., 242-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the short-term context domain identification loss. In this case as well, the combination ratio of the linear combination may be predetermined or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may also perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss and the short-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may train using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the short-term context domain identification loss in the loss function as the number of learning steps increases. Furthermore, the learning unit 21a may repeatedly carry out, while changing the combination ratio based on the learning schedule, a procedure of training with a fixed combination ratio until the learning converges, then changing the combination ratio and training again until convergence.
The domain identification model 230 may also include the long-term context domain identification network 232 and only some of the short-term context domain identification networks 231-1, ..., 231-N. That is, some of the short-term context domain identification networks 231-1, ..., 231-N may be omitted. In this case, the estimated domain information SDn corresponding to an omitted short-term context domain identification network 231-n and the corresponding correct domain label are not used in the calculation of the short-term context domain identification loss.
The learning unit 21a may also prepare and train a plurality of the various domain identification models 230 and/or labeling networks 150 exemplified above, and later select, from among the labeling networks 150 obtained by the respective trainings, the labeling network 150 with the best label-sequence estimation accuracy in the target domain. The plurality of prepared domain identification models 230 include, for example, at least one of the domain identification model 230 including both the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N as described above, the domain identification model 230 including only the long-term context domain identification network 232, the domain identification model 230 including only the short-term context domain identification networks 231-1, ..., 231-N, and the domain identification model 230 of the first embodiment.
The learning process may be batch learning, mini-batch learning, or online learning.
The learning unit 21a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, stores the parameters of the short-term context domain identification networks 231-1, ..., 231-N in the storage unit 21c, and stores the parameters of the long-term context domain identification network 232 in the storage unit 21d. The learning device 21 outputs the parameters of the labeling network 150 stored in the storage unit 11b, and these parameters are used in the inference process. Normally, the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232 are not used in the inference process and therefore need not be output from the learning device 21. However, the learning device 21 may output at least one of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
The above learning process is functionally illustrated with reference to FIG. 2.
Step S21: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain, and initializes the parameters of the network 200.
Step S22: The loss function calculation unit 21ab inputs the teacher data into the network 200 as utterance text sequences T1, ..., TN, and obtains the loss function as described above.
Step S23: The parameter update unit 21ad backpropagates information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 230 and the labeling layer 120.
Step S24: The gradient reversal unit 11ac reverses the gradient of the information based on the loss function backpropagated from the domain identification model 230 and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
Step S25: The parameter update unit 21ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information, according to the error backpropagation method.
Step S26: The control unit 11aa determines whether or not the end condition is satisfied. If the end condition is not satisfied, the control unit 11aa returns the process to step S22. If the end condition is satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least one of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
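The following is a minimal sketch of one adversarial update corresponding to steps S22 to S25, reusing the grad_reverse helper and the discriminator modules sketched earlier; the labeling_network interface, the tensor shapes, and the use of cross-entropy losses are illustrative assumptions rather than the exact processing of the loss function calculation unit 21ab and the parameter update unit 21ad.

```python
import torch.nn.functional as F

def training_step(batch, labeling_network, long_disc, short_disc, optimizer, alpha, beta):
    """One adversarial update (hypothetical interfaces and shapes).

    batch: texts (batch, N, ...), labels (batch, N), per-utterance domain ids
    domains (batch, N) with 0 = source / 1 = target, and a boolean mask
    is_source (batch,) marking conversations that have correct labels."""
    texts, labels, domains, is_source = batch

    # Forward pass: short-term features SF, long-term features LF, label logits.
    sf, lf, label_logits = labeling_network(texts)

    # Label prediction loss: only labeled source-domain conversations contribute.
    label_loss = F.cross_entropy(label_logits[is_source].flatten(0, 1),
                                 labels[is_source].flatten())

    # Domain identification losses; gradients are reversed before reaching
    # the context understanding networks (steps S23 and S24).
    long_logits = long_disc(grad_reverse(lf))
    short_logits = short_disc(grad_reverse(sf))
    long_loss = F.cross_entropy(long_logits.flatten(0, 1), domains.flatten())
    short_loss = F.cross_entropy(short_logits.flatten(0, 1), domains.flatten())

    # Linear combination of the three losses (step S25 updates all parameters).
    loss = label_loss + alpha * long_loss + beta * short_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```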
Since the functional configuration and the inference process of the inference device 13 of the second embodiment are the same as those of the first embodiment, their description is omitted.
<Characteristics of the second embodiment>
In the present embodiment, a labeling network 150 includes a logical relationship understanding layer 110 that receives an utterance text sequence T1, ..., TN, obtains an intermediate feature sequence LF1, ..., LFN that takes the context of the utterance text sequence T1, ..., TN into consideration, and outputs the intermediate feature sequence LF1, ..., LFN, and a labeling layer 120 that receives the intermediate feature sequence LF1, ..., LFN, obtains an estimated label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN, and outputs the estimated label sequence L1, ..., LN. A domain identification model 230 receives the intermediate feature sequence LF1, ..., LFN, obtains estimated domain information LDn and SDn of domain identification information indicating whether each utterance text Tn included in the utterance text sequence T1, ..., TN belongs to the source domain or to the target domain, and outputs the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN. For these, the learning device 21 performs adversarial learning in which teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain are used as utterance text sequences T1, ..., TN, the labeling network 150 is trained so that the estimation accuracy of the estimated label sequence L1, ..., LN becomes high and the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes low, and the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes high. This enables unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN while taking the context of that sequence into consideration.
In particular, in the present embodiment, the domain identification model 230 includes at least one of the short-term context domain identification networks 231-1, ..., 231-N, which receive the sequence of short-term intermediate features SF1, ..., SFN output from the short-term context understanding networks 111-1, ..., 111-N and obtain and output the estimated domain information sequence SD1, ..., SDN, and the long-term context domain identification network 232, which receives the intermediate feature sequence LF1, ..., LFN output from the long-term context understanding network 112 and obtains and outputs the estimated domain information sequence LD1, ..., LDN. This makes it possible to efficiently remove from the labeling network 150 at least one of the domain dependency of a single utterance text Tn and the domain dependency of the context spanning the utterance texts Tn. As a result, unsupervised domain adaptation of the labeling network 150 can be performed with higher accuracy.
In addition, because the domain identification model 230 includes at least the long-term context domain identification network 232, the domain dependency of the context spanning the plurality of utterance texts Tn can be efficiently removed from the labeling network 150. As a result, unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN while taking the context of that sequence into consideration, can be performed with high accuracy.
Furthermore, when the domain identification model 230 includes both the short-term context domain identification networks 231-1, ..., 231-N and the long-term context domain identification network 232, the domain dependency of a single utterance text Tn and the domain dependency of the context spanning the plurality of utterance texts Tn can both be efficiently removed from the labeling network 150. In this case, unsupervised domain adaptation of the labeling network 150 can be performed with even higher accuracy.
<Experimental results>
The results of unsupervised domain adaptation experiments performed according to the embodiments described above are illustrated below. The experimental conditions were as follows.
(1) Each utterance text of the simulated utterance-text-sequence data is classified into one of five response-scene classes, and the label representing the corresponding response scene is estimated.
(2) The five domains other than the target domain (the new domain) are treated as source domains (already-applied domains). A labeling network trained using only source-domain data was compared with labeling networks trained according to the first embodiment and the second embodiment, and the classification performance (labeling accuracy) on utterance text sequences of the target domain was verified for each.
(3) The classification performance was verified on utterance text sequences of 60 calls for each of six target domains (online mail order, ISP, securities, local government, mobile phone, PC support), that is, 60 calls × 6 domains = 360 calls of simulated data.
(4) Each utterance text contains about 100 sentences.
FIG. 8 illustrates the experimental results. As illustrated in FIG. 8, with either the method of the first embodiment or that of the second embodiment, unsupervised domain adaptation to the target domain is possible with high accuracy by using the already existing labeled data of the source domains, without preparing labeled data for the target domain. In particular, the method of the second embodiment enables unsupervised domain adaptation with even higher accuracy, improving the classification accuracy by 3.4% on average compared with the method of training only on source-domain data.
[Hardware configuration]
The learning devices 11 and 21 and the inference device 13 in each embodiment are, for example, devices configured by a general-purpose or dedicated computer, which includes a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory), executing a predetermined program. The computer may include a single processor and memory, or a plurality of processors and memories. The program may be installed in the computer, or may be recorded in a ROM or the like in advance. Some or all of the processing units may be configured not with electronic circuitry that realizes its functional configuration by reading a program, as a CPU does, but with electronic circuitry that realizes the processing functions by itself. An electronic circuit constituting one device may include a plurality of CPUs.
FIG. 9 is a block diagram illustrating the hardware configuration of the learning devices 11 and 21 and the inference device 13 in each embodiment. As illustrated in FIG. 9, the learning devices 11 and 21 and the inference device 13 of this example include a CPU (central processing unit) 10a, an input unit 10b, an output unit 10c, a RAM (random-access memory) 10d, a ROM (read-only memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a of this example includes a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various kinds of arithmetic processing according to various programs read into the register 10ac. The input unit 10b is an input terminal to which data are input, a keyboard, a mouse, a touch panel, or the like. The output unit 10c is an output terminal from which data are output, a display, a LAN card controlled by the CPU 10a that has read a predetermined program, or the like. The RAM 10d is an SRAM (static random access memory), a DRAM (dynamic random access memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (magneto-optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that they can exchange information. According to the read OS (operating system) program, the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f into the program area 10da of the RAM 10d, and similarly writes the various data stored in the data area 10fb of the auxiliary storage device 10f into the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data have been written are stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads out the addresses stored in the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the read addresses, causes the arithmetic unit 10ab to sequentially execute the operations indicated by the program, and stores the results of the operations in the register 10ac. With such a configuration, the functional configurations of the learning devices 11 and 21 and the inference device 13 are realized.
The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media are magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.
This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network. As described above, a computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As other forms of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to the program, or may execute processing according to a received program each time a program is transferred to the computer from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in the present embodiment includes information that is used for processing by an electronic computer and that is equivalent to a program (such as data that are not direct commands to the computer but have properties that define the processing of the computer).
The present device may be configured using not only the CPU 10a but also a GPU (graphics processing unit). In each embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing may be realized by hardware.
The present invention is not limited to the above-described embodiments. For example, in each embodiment, the unlabeled teacher data of the target domain include the utterance text sequences of the target domain but do not include correct label sequences. However, at least part of the unlabeled teacher data of the target domain may include correct label sequences. In this case, the correct label sequences of the unlabeled teacher data of the target domain may or may not be used for training the networks 100 and 200.
As described above, in the embodiments above, for clarity of explanation, the sequence of a plurality of pieces of information having logical relationships is an utterance text sequence, and the label sequence is a sequence of labels representing the response scene of each utterance (for example, opening, grasping the matter, identity verification, correspondence, closing). However, this is merely an example, and other information sequences such as a sentence sequence, a programming language sequence, an audio signal sequence, or a video signal sequence may be used as a sequence of a plurality of pieces of information having logical relationships. As the label sequence, other label sequences may also be used, such as a label sequence representing situations or actions, a label sequence representing places or times, a label sequence representing parts of speech, or a label sequence representing program content. Each model such as the labeling model need not be a model based on a deep neural network, and may be another model based on, for example, a probabilistic model or a classifier. The logical relationship understanding layer 110 of each embodiment receives the utterance text sequence T1, ..., TN and obtains and outputs an intermediate feature sequence LF1, ..., LFN that takes the context (logical relationships) of the utterance text sequence T1, ..., TN into consideration; however, the logical relationship understanding layer 110 may receive the utterance text sequence T1, ..., TN and obtain and output a sequence of fewer than N or more than N intermediate features. The labeling layer 120 of each embodiment receives the intermediate feature sequence LF1, ..., LFN (the first sequence based on the intermediate feature sequence) and obtains and outputs the estimated label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN; however, the labeling layer 120 may receive the intermediate feature sequence LF1, ..., LFN (the first sequence based on the intermediate feature sequence) and obtain and output a sequence of fewer than N or more than N estimated labels. The domain identification models 130 and 230 of each embodiment receive the intermediate feature sequence LF1, ..., LFN (the second sequence based on the intermediate feature sequence) and obtain and output N pieces of estimated domain information; however, the domain identification models 130 and 230 may receive the intermediate feature sequence LF1, ..., LFN (the second sequence based on the intermediate feature sequence) and obtain and output fewer than N or more than N pieces of estimated domain information.
The various kinds of processing described above need not be executed only in chronological order according to the description, and may be executed in parallel or individually according to the processing capability of the device that executes the processing or as necessary. It goes without saying that other changes can be made as appropriate without departing from the spirit of the present invention.
According to the present invention, for example, the domain dependency of intermediate features can be efficiently removed for a sequence labeling problem that considers complex context. In particular, as exemplified in the second embodiment, by efficiently removing the domain dependency of intermediate features for each of the short-term and long-term logical relationships (contexts), a labeling network that is better adapted to the target domain can be learned, and the labeling accuracy in the target domain can be improved.
Unsupervised domain adaptation techniques have conventionally been studied for image recognition, but the present invention applies such a technique for the first time to the problem of estimating a label sequence corresponding to a sequence of a plurality of pieces of information while considering the logical relationships of that sequence, as in language processing. This unsupervised domain adaptation technique can, for example, significantly reduce the cost of labeling, which has been a barrier to expanding contact-center business across industries.
In particular, in the method exemplified in the second embodiment, a domain identification network per utterance text unit (for example, per call) can be designed as a mechanism that spans sentence boundaries by means of a unidirectional or bidirectional LSTM. This makes it possible to capture domain dependencies such as a particular flow of conversation specific to the industry of a contact center and to efficiently remove them from the labeling network, and as a result the label estimation accuracy in the target domain can be improved.
In the method exemplified in the second embodiment, a domain identification network per utterance text unit (for example, per call) can also be designed as a mechanism that does not span the boundaries of utterance texts. This makes it possible to capture domain dependencies caused by, for example, specific words that depend on the industry of a contact center and to efficiently remove them from the labeling network, and as a result the label estimation accuracy in the target domain can be improved.
11, 21 Learning device
11a, 21a Learning unit
13 Inference device
13a Inference unit
100, 200 Network
110 Logical relationship understanding layer
111-1, ..., 111-N Short-term context understanding network
112 Long-term context understanding network
120 Labeling layer
120-1, ..., 120-N Label prediction network
130, 230 Domain identification model
130-1, ..., 130-N Domain identification network
231-1, ..., 231-N Short-term context domain identification network
232 Long-term context domain identification network

Claims (8)

1. A learning device comprising a learning unit that performs adversarial learning on:
    a labeling model including
        logical relationship understanding means that receives an input information sequence, which is a sequence of a plurality of pieces of information having logical relationships, obtains an intermediate feature sequence that takes the logical relationships of the input information sequence into consideration, and outputs the intermediate feature sequence, and
        labeling means that receives a first sequence based on the intermediate feature sequence, obtains an estimated label sequence of a label sequence corresponding to the input information sequence, and outputs the estimated label sequence; and
    a domain identification model that receives a second sequence based on the intermediate feature sequence, obtains estimated domain information of domain identification information indicating whether each piece of partial information included in the input information sequence belongs to a source domain or to a target domain, and outputs a sequence of the estimated domain information,
    wherein the learning unit uses, as the input information sequence, teacher data including labeled teacher data, which is a labeled learning information sequence belonging to the source domain, and unlabeled teacher data, which is an unlabeled learning information sequence belonging to the target domain, trains the labeling model so that the estimation accuracy of the estimated label sequence becomes high and the estimation accuracy of the sequence of the estimated domain information becomes low, trains the domain identification model so that the estimation accuracy of the sequence of the estimated domain information becomes high, and obtains and outputs at least parameters of the labeling model.
2. The learning device according to claim 1, wherein
    the logical relationship understanding means includes
        a plurality of short-term logical relationship understanding means each of which receives a piece of partial information included in the input information sequence, obtains a short-term intermediate feature that takes the logical relationships of information within the received partial information into consideration, and outputs the short-term intermediate feature, and
        long-term logical relationship understanding means that receives a sequence of the plurality of short-term intermediate features, obtains the intermediate feature sequence, which is a long-term intermediate feature sequence that takes the logical relationships among the plurality of pieces of partial information included in the input information sequence into consideration, and outputs the intermediate feature sequence,
    the first sequence is the intermediate feature sequence, and
    the second sequence is the intermediate feature sequence.
3. The learning device according to claim 1, wherein
    the logical relationship understanding means includes
        a plurality of short-term logical relationship understanding means each of which receives a piece of partial information included in the input information sequence, obtains a short-term intermediate feature that takes the logical relationships of information within the received partial information into consideration, and outputs the short-term intermediate feature, and
        long-term logical relationship understanding means that receives a sequence of the plurality of short-term intermediate features, obtains a long-term intermediate feature sequence that takes the logical relationships among the plurality of pieces of partial information included in the input information sequence into consideration, and outputs the long-term intermediate feature sequence,
    the labeling means receives the long-term intermediate feature sequence as the first sequence, obtains an estimated label sequence of a label sequence corresponding to the information sequence, and outputs the estimated label sequence, and
    the domain identification model includes at least one of short-term logical relationship domain identification means that receives the sequence of the short-term intermediate features as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information, and long-term logical relationship domain identification means that receives the long-term intermediate feature sequence as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information.
4. The learning device according to claim 3, wherein
    the domain identification model includes long-term logical relationship domain identification means that receives the long-term intermediate feature sequence as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information.
5. The learning device according to claim 3, wherein
    the domain identification model includes short-term logical relationship domain identification means that receives the sequence of the short-term intermediate features as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information, and long-term logical relationship domain identification means that receives the long-term intermediate feature sequence as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information.
6. An inference device comprising an inference unit that applies an input information sequence for inference to the labeling model specified by the parameters obtained by the learning device according to any one of claims 1 to 5, obtains an estimated label sequence of a label sequence corresponding to the input information sequence for inference, and outputs the estimated label sequence.
7. A learning method of the learning device according to any one of claims 1 to 5, or an inference method of the inference device according to claim 6.
8. A program for causing a computer to function as the learning device according to any one of claims 1 to 5 or the inference device according to claim 6.
PCT/JP2020/032505 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program WO2022044243A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/032505 WO2022044243A1 (en) 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program
JP2022545187A JP7517435B2 (en) 2020-08-28 Learning device, inference device, their methods, and programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/032505 WO2022044243A1 (en) 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program

Publications (1)

Publication Number Publication Date
WO2022044243A1 (en) 2022-03-03

Family

ID=80352930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/032505 WO2022044243A1 (en) 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program

Country Status (1)

Country Link
WO (1) WO2022044243A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024023946A1 (en) * 2022-07-26 2024-02-01 日本電信電話株式会社 Speech processing device, speech processing method, and speech processing program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PURUSHOTHAM SANJAY, CARVALHO WILKA, NILANON TANACHAT, LIU YAN: "VARIATIONAL RECURRENT ADVERSARIAL DEEP DOMAIN ADAPTATION", 5TH INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 1 January 2017 (2017-01-01), pages 1 - 15, XP055912530, Retrieved from the Internet <URL:https://openreview.net/pdf?id=rk9eAFcxg> [retrieved on 20220413] *
RYO MASUMURA, TOMOHIRO TANAKA, ATSUSHI ANDO, HOSANA KAMIYAMA, TAKANOBU OBA, YUSHI AONO: "Call scene segmentation based on neural networks with conversational contexts", IEICE TECHNICAL REPORT, NLC; IPSJ INFORMATION FUNDAMENTALS AND ACCESS TECHNOLOGIES (IFAT), vol. 2019-IFAT-133, no. 5 (NLC2018-39), 31 January 2019 (2019-01-31), JP, pages 1 - 6, XP009537301 *

Also Published As

Publication number Publication date
JPWO2022044243A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
JP6615736B2 (en) Spoken language identification apparatus, method thereof, and program
US20220180198A1 (en) Training method, storage medium, and training device
KR102181901B1 (en) Method to create animation
CN115408525B (en) Letters and interviews text classification method, device, equipment and medium based on multi-level label
CN111783873A (en) Incremental naive Bayes model-based user portrait method and device
CN112101526A (en) Knowledge distillation-based model training method and device
WO2022043782A1 (en) Automatic knowledge graph construction
US20210073645A1 (en) Learning apparatus and method, and program
WO2022044243A1 (en) Training device, inference device, methods therefor, and program
JP6230987B2 (en) Language model creation device, language model creation method, program, and recording medium
JP7517435B2 (en) Learning device, inference device, their methods, and programs
CN117476035A (en) Voice activity detection integration to improve automatic speech detection
US11816422B1 (en) System for suggesting words, phrases, or entities to complete sequences in risk control documents
WO2023017568A1 (en) Learning device, inference device, learning method, and program
WO2023224862A1 (en) Hybrid model and system for predicting quality and identifying features and entities of risk controls
JP6633556B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
US11887620B2 (en) Language model score calculation apparatus, language model generation apparatus, methods therefor, program, and recording medium
WO2021117089A1 (en) Model learning device, voice recognition device, method for same, and program
CN115700555A (en) Model training method, prediction method, device and electronic equipment
JP6988756B2 (en) Tag estimation device, tag estimation method, program
WO2020044755A1 (en) Speech recognition device, speech recognition method, and program
WO2021044606A1 (en) Learning device, estimation device, methods therefor, and program
KR102583799B1 (en) Method for detect voice activity in audio data based on anomaly detection
KR102497436B1 (en) Method for acquiring information related to a target word based on content including a voice signal
KR102509007B1 (en) Method for training speech recognition model based on the importance of tokens in sentences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20951495

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022545187

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20951495

Country of ref document: EP

Kind code of ref document: A1