WO2022044243A1 - Training device, inference device, methods therefor, and program - Google Patents

Training device, inference device, methods therefor, and program

Info

Publication number
WO2022044243A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain
sequence
series
information
term
Prior art date
Application number
PCT/JP2020/032505
Other languages
French (fr)
Japanese (ja)
Inventor
翔太 折橋
亮 増村
雅人 澤田
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/032505 (WO2022044243A1)
Priority to JP2022545187A (JP7517435B2)
Publication of WO2022044243A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Definitions

  • The present invention relates to labeling technology.
  • A technique of utterance sequence labeling has been proposed in which an utterance sequence is input and a label representing the response scene of a conversation or discourse is estimated for each utterance (for example, Non-Patent Document 1).
  • Non-Patent Document 1 discloses a deep neural network (hereinafter, a labeling network) that realizes utterance sequence labeling: the utterance text sequence obtained by speech recognition of a conversation between an operator and a customer in a contact center is input, and a label representing the corresponding scene (opening, grasping the matter, identity verification, response, or closing) is estimated for each utterance.
  • In Non-Patent Document 2, a method has been proposed for realizing unsupervised domain adaptation of labeling to a new domain by using labeled data (hereinafter, labeled teacher data) of a domain to which labeling has been applied in the past (hereinafter, the source domain) and unlabeled data (hereinafter, unlabeled teacher data) of a domain to which labeling is newly applied (hereinafter, the target domain).
  • The method of Non-Patent Document 2 uses the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to train a labeling model that estimates the label corresponding to a single image belonging to the target domain.
  • However, the method of Non-Patent Document 2 performs unsupervised domain adaptation for the simple classification problem of a single image; an unsupervised domain adaptation method has not been established for the more complex classification problem of estimating a label sequence corresponding to a sequence of information in consideration of the logical relationships within a sequence of a plurality of pieces of information.
  • The present invention has been made in view of this point, and its purpose is to perform unsupervised domain adaptation of a labeling model that estimates a label sequence corresponding to a sequence of information in consideration of the logical relationships within a sequence of a plurality of pieces of information.
  • The learning device uses a labeling model and a domain identification model. The labeling model includes a logical relationship understanding means that receives an input information sequence, which is a sequence of a plurality of pieces of information having logical relationships, obtains an intermediate feature sequence in consideration of the logical relationships of the input information sequence, and outputs the intermediate feature sequence, and a labeling means that receives a first sequence based on the intermediate feature sequence, obtains an estimated label sequence, which is an estimate of the label sequence corresponding to the input information sequence, and outputs the estimated label sequence. The domain identification model receives a second sequence based on the intermediate feature sequence, obtains estimated domain information, which is an estimate of the domain identification information indicating whether each piece of partial information included in the input information sequence belongs to the source domain or the target domain, and outputs a sequence of the estimated domain information.
  • Using labeled teacher data, which is a labeled learning information sequence belonging to the source domain, and unlabeled teacher data, which is an unlabeled learning information sequence belonging to the target domain, the learning device performs adversarial learning in which the labeling model is trained so that the estimation accuracy of the estimated label sequence becomes high and the estimation accuracy of the sequence of estimated domain information becomes low, and the domain identification model is trained so that the estimation accuracy of the sequence of estimated domain information becomes high; at least the parameters of the labeling model are thereby obtained and output.
  • FIG. 1 is a block diagram for illustrating the learning device of the first embodiment.
  • FIG. 2 is a block diagram for exemplifying the detailed configuration of the learning unit of the first embodiment.
  • FIG. 3 is a conceptual diagram for illustrating a network used for the learning process of the first embodiment.
  • FIG. 4 is a block diagram for exemplifying the inference device of the first embodiment.
  • FIG. 5 is a conceptual diagram for exemplifying the learned labeling network of the first embodiment.
  • FIG. 6 is a block diagram for illustrating the learning device of the second embodiment.
  • FIG. 7 is a conceptual diagram for exemplifying a network used for the learning process of the second embodiment.
  • FIG. 8 is a graph for exemplifying the experimental results.
  • FIG. 9 is a block diagram illustrating a hardware configuration of the learning device and the inference device of the embodiment.
  • In the embodiments, an example is described of unsupervised domain adaptation of a labeling model based on a deep neural network that receives an utterance text sequence (a sequence of a plurality of pieces of information having logical relationships) as input and outputs a sequence of labels representing the corresponding scene of each utterance (for example, opening, grasping the matter, identity verification, response, closing), that is, sequence labeling.
  • However, the present invention can also be used for unsupervised domain adaptation of a labeling model that estimates an arbitrary label sequence corresponding to a sequence of information in consideration of the logical relationships of any sequence of a plurality of pieces of information.
  • The logical relationships within the sequence of a plurality of pieces of information are not limited, and any relationship may exist between the pieces of information. Examples of logical relationships include contexts, word dependency relationships, grammatical relationships of a language, and relationships between frames of audio or video, but these do not limit the present invention.
  • The labeling model is not limited to a model based on a deep neural network; any model, such as a probabilistic model or a classifier, may be used as long as it estimates and outputs a label sequence corresponding to a sequence of input information.
  • The learning device 11 of the first embodiment has a learning unit 11a and storage units 11b and 11c, receives labeled teacher data of the source domain and unlabeled teacher data of the target domain as input, and obtains and outputs the parameters (model parameters) of the labeling network of the target domain by learning.
  • The source domain is referred to as "SD", the target domain is referred to as "TD", and the network is referred to as "NW".
  • The labeled teacher data of the source domain exemplified here is a labeled learning information sequence belonging to the source domain, and includes an utterance text sequence of the source domain (an input information sequence that is a sequence of a plurality of pieces of information having logical relationships) and the correct label sequence corresponding to that utterance text sequence.
  • The unlabeled teacher data of the target domain exemplified here is an unlabeled learning information sequence belonging to the target domain, and includes an utterance text sequence of the target domain but does not include a correct label sequence.
  • A learning schedule, which specifies, for example, the loss combination ratio described later, may be input to the learning device 11, and the learning device 11 may perform the learning process according to the learning schedule.
  • The learning device 11 may also output the parameters of the domain identification network used to realize the unsupervised domain adaptation.
  • The learning unit 11a has, for example, a control unit 11aa, a loss function calculation unit 11ab, a gradient inversion unit 11ac, and a parameter update unit 11ad. The learning unit 11a stores each piece of data obtained during processing in the storage units 11b and 11c or in a temporary memory (not shown), and reads the data as necessary for each process.
  • FIG. 3 shows a configuration example of the network 100 used by the learning device 11 in the learning process.
  • The network 100 exemplified in FIG. 3 has a labeling network 150 (labeling model) and a domain identification model 130.
  • The labeling network 150 illustrated in FIG. 3 receives the utterance text sequence T 1 , ..., T N (an input information sequence that is a sequence of a plurality of pieces of information having logical relationships), obtains the estimated label sequence L 1 , ..., L N , which is an estimate of the label sequence corresponding to T 1 , ..., T N , and outputs the estimated label sequence.
  • The utterance text sequence T 1 , ..., T N is a sequence of N utterance texts T n , where n = 1, ..., N is an index corresponding to time and N is an integer of 1 or more (generally an integer of 2 or more).
  • The utterance text T n may be a single word such as "I'm sorry" or "Yes", or may be a sentence containing M(n) words T n,1 , ..., T n,M(n) , such as "I'm in trouble because the reply speed is slow", where M(n) is an integer of 2 or more.
  • The estimated label L n in this example corresponds to the utterance text T n and represents, for example, the corresponding scene of the utterance text T n (for example, opening, grasping the matter, identity verification, response, closing).
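As a purely illustrative aid (not part of the patent text), the teacher data described above could be represented in Python as follows; the variable names, the example words, and the example labels are hypothetical.

```python
# Hypothetical representation of the teacher data described above.
# Each utterance text T_n is a list of words; each labeled example pairs the
# utterance text sequence T_1, ..., T_N with a correct label sequence L_1, ..., L_N.
labeled_teacher_data_source_domain = [
    {
        "utterance_texts": [
            ["thank", "you", "for", "calling"],          # T_1
            ["my", "reply", "speed", "is", "slow"],      # T_2
            ["let", "me", "check", "your", "account"],   # T_3
        ],
        "labels": ["opening", "grasping the matter", "identity verification"],  # L_1..L_3
    },
]

# Unlabeled teacher data of the target domain: utterance text sequences only.
unlabeled_teacher_data_target_domain = [
    {
        "utterance_texts": [
            ["hello", "this", "is", "support"],
            ["the", "app", "keeps", "crashing"],
        ],
    },
]
```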
  • The labeling network 150 exemplified here has a logical relationship understanding layer 110 (logical relationship understanding means) and a labeling layer 120 (labeling means).
  • The logical relationship understanding layer 110 receives the utterance text sequence T 1 , ..., T N , obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context (logical relationships) of the utterance text sequence T 1 , ..., T N , and outputs the intermediate feature sequence LF 1 , ..., LF N . Each intermediate feature LF n corresponds to the utterance text T n .
  • The logical relationship understanding layer 110 exemplified in FIG. 3 includes short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means) and a long-term context understanding network 112 (long-term logical relationship understanding means).
  • The short-term context understanding networks 111-1, ..., 111-N exemplified here receive the utterance text sequence T 1 , ..., T N and output a sequence SF 1 , ..., SF N of short-term intermediate features, where each short-term intermediate feature SF n corresponds to the utterance text T n .
  • Each short-term context understanding network 111-n may be divided into a plurality of short-term context understanding networks 111-n1, ..., 111-nK' (where K' is an integer of 2 or more). Each short-term context understanding network 111-nk' represents the network from the input layer to the k'-th layer of the short-term context understanding network, and the short-term intermediate feature SF nk' of the k'-th layer is output from each short-term context understanding network 111-nk'.
  • The long-term context understanding network 112 exemplified here receives the sequence SF 1 , ..., SF N of short-term intermediate features (the short-term intermediate feature sequence), obtains the intermediate feature sequence LF 1 , ..., LF N (a long-term intermediate feature sequence considering the logical relationships between the plurality of pieces of partial information included in the input information sequence) in consideration of the context between the plurality of utterance texts T n included in the utterance text sequence T 1 , ..., T N (for example, the long-term context of a sentence unit or a long-term context spanning a plurality of sentences), and outputs the intermediate feature sequence LF 1 , ..., LF N .
  • However, this does not limit the present invention.
  • When the short-term intermediate feature SF nk' of the k'-th layer is output from each short-term context understanding network 111-nk' as described above, SF nK' is input to the long-term context understanding network 112 as SF n ; alternatively, a plurality of the features SF n1 , ..., SF nK' may be input.
  • Similarly, the long-term context understanding network 112 may be divided into a plurality of long-term context understanding networks 112-1, ..., 112-K (where K is an integer of 2 or more and k = 1, ..., K is an index representing the layers of the long-term context understanding network). Each long-term context understanding network 112-k represents the network from the input layer to the k-th layer of the long-term context understanding network.
  • Each long-term context understanding network 112-k may receive the sequence of the plurality of short-term intermediate features SF n , obtain intermediate features considering the context between the plurality of utterance texts T n corresponding to the received sequence, and output them. In this case, the long-term context understanding networks 112-1, ..., 112-K output K intermediate feature sequences {LF 11 , ..., LF N1 }, ..., {LF 1K , ..., LF NK }.
  • The short-term context understanding network 111-n can be configured by, for example, a combination of an embedding layer that converts words into numerical values, a unidirectional LSTM (long short-term memory) or bidirectional LSTM, and an attention mechanism (see, for example, Non-Patent Document 1).
  • The long-term context understanding network 112 can be configured by, for example, a unidirectional LSTM or a bidirectional LSTM.
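A minimal PyTorch sketch of such a logical relationship understanding layer is given below, assuming a word-embedding plus bidirectional-LSTM short-term context understanding network and a unidirectional-LSTM long-term context understanding network; the layer sizes, the parameter sharing across n, and the mean pooling used in place of an attention mechanism are illustrative assumptions, not the configuration prescribed by the patent.

```python
import torch
import torch.nn as nn

class ShortTermContextNetwork(nn.Module):
    """Short-term context understanding network 111-n (sketch): encodes one utterance text
    T_n, given as a sequence of word indices, into a short-term intermediate feature SF_n."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # embedding layer converting words into numerical values
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, word_ids):                  # word_ids: (batch, M(n))
        emb = self.embed(word_ids)                # (batch, M(n), embed_dim)
        out, _ = self.lstm(emb)                   # (batch, M(n), 2 * hidden_dim)
        return out.mean(dim=1)                    # mean pooling stands in for the attention mechanism

class LongTermContextNetwork(nn.Module):
    """Long-term context understanding network 112 (sketch): a unidirectional LSTM over the
    sequence SF_1..SF_N, producing the intermediate feature sequence LF_1..LF_N."""
    def __init__(self, input_dim=256, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, sf_sequence):               # sf_sequence: (batch, N, input_dim)
        lf_sequence, _ = self.lstm(sf_sequence)   # (batch, N, hidden_dim)
        return lf_sequence

class LogicalRelationshipUnderstandingLayer(nn.Module):
    """Logical relationship understanding layer 110 (sketch): a short-term network applied to
    each utterance, followed by the long-term network over the whole conversation."""
    def __init__(self, vocab_size):
        super().__init__()
        self.short_term = ShortTermContextNetwork(vocab_size)  # shared across n (simplifying assumption)
        self.long_term = LongTermContextNetwork()

    def forward(self, utterances):                # utterances: list of N tensors, each (batch, M(n))
        sf = torch.stack([self.short_term(t) for t in utterances], dim=1)  # (batch, N, 256)
        return self.long_term(sf), sf             # LF_1..LF_N and SF_1..SF_N
```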
  • The labeling layer 120 receives the intermediate feature sequence LF 1 , ..., LF N (the first sequence based on the intermediate feature sequence), obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs the estimated label sequence L 1 , ..., L N .
  • The labeling layer 120 illustrated in FIG. 3 includes label prediction networks 120-1, ..., 120-N. Each label prediction network 120-n receives the intermediate feature LF n , obtains the estimated label L n corresponding to the utterance text T n , and outputs the estimated label L n .
  • The label prediction network 120-n can be configured by, for example, a fully connected neural network having a softmax function as an activation function.
  • If K intermediate feature sequences {LF 11 , ..., LF N1 }, ..., {LF 1K , ..., LF NK } are output from the long-term context understanding networks 112-1, ..., 112-K, the label prediction network 120-n may receive LF nK as the intermediate feature LF n , obtain the estimated label L n corresponding to the utterance text T n , and output the estimated label L n . Alternatively, the label prediction network 120-n may receive a plurality of intermediate features LF nk from among the intermediate feature sequences {LF 11 , ..., LF N1 }, ..., {LF 1K , ..., LF NK }, obtain the estimated label L n corresponding to the utterance text T n , and output the estimated label L n .
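Continuing the same sketch, the labeling layer 120 can be written as one fully connected layer with a softmax activation applied at each position; sharing the layer across the label prediction networks 120-1, ..., 120-N and the output size are assumptions.

```python
import torch.nn as nn

class LabelingLayer(nn.Module):
    """Labeling layer 120 (sketch): a fully connected layer with a softmax activation applied
    at every position n of the intermediate feature sequence."""
    def __init__(self, feature_dim=256, num_labels=5):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_labels)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, lf_sequence):                # lf_sequence: (batch, N, feature_dim)
        return self.softmax(self.fc(lf_sequence))  # label posteriors for L_1..L_N: (batch, N, num_labels)
```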
  • The domain identification model 130 illustrated in FIG. 3 receives the intermediate feature sequence LF 1 , ..., LF N (the second sequence based on the intermediate feature sequence), obtains the estimated domain information D n of the domain identification information indicating whether each utterance text T n included in the utterance text sequence T 1 , ..., T N (each piece of partial information included in the input information sequence) belongs to the source domain or the target domain, and outputs the sequence D 1 , ..., D N of the estimated domain information.
  • The domain identification model 130 exemplified here includes N domain identification networks 130-1, ..., 130-N. However, this does not limit the present invention.
  • A plurality of domain identification networks 130-nk may exist in place of each domain identification network 130-n, where each domain identification network 130-nk represents a network corresponding to each n (for example, each time). Each domain identification network 130-nk may receive the intermediate feature LF nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and use it to obtain and output the estimated domain information D nk .
  • The domain identification network 130-n (or domain identification network 130-nk) can be configured by, for example, a fully connected neural network having a softmax function as an activation function.
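Under the same illustrative assumptions, the domain identification model 130 can be sketched as a per-position two-class softmax classifier over the intermediate features; parameter sharing across the networks 130-1, ..., 130-N is again an assumption.

```python
import torch.nn as nn

class DomainIdentificationModel(nn.Module):
    """Domain identification model 130 (sketch): for each position n, a fully connected layer
    with a softmax activation estimates whether the utterance text T_n belongs to the source
    domain or the target domain."""
    def __init__(self, feature_dim=256, num_domains=2):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_domains)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, lf_sequence):                # lf_sequence: (batch, N, feature_dim)
        return self.softmax(self.fc(lf_sequence))  # domain posteriors D_1..D_N: (batch, N, num_domains)
```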
  • The labeled teacher data of the source domain (the labeled learning information sequence belonging to the source domain) and the unlabeled teacher data of the target domain (the unlabeled learning information sequence belonging to the target domain) are input to the learning unit 11a of the learning device 11.
  • The learning unit 11a applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to the above-mentioned network 100 as the utterance text sequence T 1 , ..., T N , and performs adversarial learning in which the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the estimated domain information D 1 , ..., D N becomes low, and the domain identification model 130 is trained so that the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes high.
  • That is, adversarial learning is performed between the labeling network 150 and the domain identification model 130 based on a loss function (hereinafter, the label prediction loss) representing the error between the estimated label sequence L 1 , ..., L N output from the labeling network 150 when the above-mentioned teacher data is input to the network 100 as the utterance text sequence T 1 , ..., T N and the corresponding correct label sequence of the labeled teacher data of the source domain, and a loss function (hereinafter, the domain identification loss) representing the error between the sequence D 1 , ..., D N of estimated domain information output from the domain identification model 130 and the correct labels of the domain identification information, which are determined by whether each input came from the labeled teacher data of the source domain or the unlabeled teacher data of the target domain.
  • The estimated label sequence L 1 , ..., L N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 100 is not used for calculating the label prediction loss.
  • The learning unit 11a performs this adversarial learning by using, for example, the error backpropagation method.
  • Learning is performed so that the label prediction loss becomes small, so that the estimation accuracy of the estimated label sequence L 1 , ..., L N in the labeling network 150 becomes high. The gradient inversion layer 141-n inverts the gradient only when the error is backpropagated, and learning is performed so that the domain identification loss becomes small; as a result, the domain identification model 130 is trained so that the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes high, while the logical relationship understanding layer 110 is trained to obtain an intermediate feature sequence LF 1 , ..., LF N that lowers the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information. In this way, adversarial learning can be performed.
  • As a result, the labeling network 150 can acquire an intermediate feature sequence LF 1 , ..., LF N that is effective for label prediction while suppressing dependence on the domain, and unsupervised domain adaptation can be realized.
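The gradient inversion layers can be implemented in PyTorch with a custom autograd function, as is commonly done in adversarial domain adaptation; the following is a standard sketch rather than the patent's specific implementation, and the scaling coefficient `coeff` is an assumed extra knob.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Gradient inversion (reversal): identity in the forward pass, multiplies the gradient
    by -coeff in the backward pass, as done by the gradient inversion layers 141-n."""
    @staticmethod
    def forward(ctx, x, coeff=1.0):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient flows back negated: minimizing the domain identification loss trains the
        # domain identifier while pushing the preceding layers to make the domains indistinguishable.
        return -ctx.coeff * grad_output, None

def grad_reverse(x, coeff=1.0):
    return GradientReversal.apply(x, coeff)
```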
  • This learning process can be realized by optimizing (minimizing) the loss function that linearly combines the label prediction loss and the domain identification loss.
  • The combination ratio of the linear combination of the label prediction loss and the domain identification loss may be predetermined, or may be specified by the learning schedule input to the learning unit 11a.
  • The learning unit 11a may perform the above-mentioned learning while changing the combination ratio of the label prediction loss and the domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 11a may learn using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the ratio of the domain identification loss in the loss function as the number of learning steps increases. Alternatively, while changing the combination ratio based on the learning schedule, the learning unit 11a may repeat a procedure in which learning is performed at a constant combination ratio until the learning converges, the combination ratio is then changed, and learning is performed again until it converges.
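One possible realization of the linearly combined loss and of a schedule for its combination ratio is sketched below; the linear ramp is only an assumed example of such a schedule.

```python
def combined_loss(label_prediction_loss, domain_identification_loss, step, total_steps):
    """Linearly combines the two losses; the weight of the domain identification loss starts at
    zero early in learning and grows with the number of learning steps (assumed linear ramp)."""
    progress = min(step / float(total_steps), 1.0)
    domain_weight = progress
    return label_prediction_loss + domain_weight * domain_identification_loss
```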
  • The learning unit 11a may prepare a plurality of various domain identification models 130 and/or labeling networks 150 as exemplified above, perform learning with each, and afterwards select, from among the labeling networks 150 obtained by the respective learnings, the labeling network 150 that gives the best estimation accuracy of the label sequence in the target domain.
  • The learning process may be batch learning, mini-batch learning, or online learning.
  • The learning unit 11a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, and stores the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N) in the storage unit 11c.
  • The learning device 11 outputs the parameters of the labeling network 150 stored in the storage unit 11b. The parameters of the labeling network 150 are used in the inference process described later. Normally, the parameters of the domain identification model 130 are not used for the inference process and therefore do not have to be output from the learning device 11. However, the learning device 11 may output at least some of the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N).
  • Step S11: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain, and initializes the parameters of the network 100.
  • Step S12: The loss function calculation unit 11ab applies the teacher data to the network 100 as the utterance text sequence and calculates the loss function that linearly combines the label prediction loss and the domain identification loss.
  • Step S13: The parameter update unit 11ad backpropagates the information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 130 and the labeling layer 120.
  • Step S14: The gradient inversion unit 11ac inverts the gradient of the information based on the loss function backpropagated from the domain identification model 130 and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
  • Step S15: The parameter update unit 11ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information according to the error backpropagation method.
  • Step S16: The control unit 11aa determines whether or not an end condition (for example, a condition that the number of parameter updates has reached a predetermined number) is satisfied. If the end condition is not satisfied, the control unit 11aa returns the process to step S12. If the end condition is satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least some of the parameters of the domain identification networks 130-1, ..., 130-N.
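Putting the sketches above together, one learning step corresponding roughly to steps S12 through S15 might look as follows; the optimizer, the way the batch mixes source-domain and target-domain sequences, and the reuse of the `grad_reverse` helper from the earlier sketch are assumptions, not the patent's prescribed procedure.

```python
import torch
import torch.nn.functional as F

def training_step(logical_layer, labeling_layer, domain_model, optimizer,
                  utterances, label_targets, domain_targets, source_mask, domain_weight):
    """One update of network 100. The batch mixes source-domain and target-domain sequences;
    `source_mask` marks the source-domain ones, since only they have correct labels and
    contribute to the label prediction loss."""
    lf_sequence, _ = logical_layer(utterances)                # intermediate features LF_1..LF_N

    # Label prediction loss on the source-domain part of the batch only.
    label_probs = labeling_layer(lf_sequence)                 # softmax outputs, (batch, N, num_labels)
    label_loss = F.nll_loss(
        torch.log(label_probs[source_mask] + 1e-12).flatten(0, 1),
        label_targets[source_mask].flatten())

    # Domain identification loss on every sequence; the gradient reaching the logical
    # relationship understanding layer is inverted (gradient inversion layers 141-n).
    domain_probs = domain_model(grad_reverse(lf_sequence))    # (batch, N, num_domains)
    domain_loss = F.nll_loss(torch.log(domain_probs + 1e-12).flatten(0, 1),
                             domain_targets.flatten())

    loss = label_loss + domain_weight * domain_loss           # linearly combined loss function
    optimizer.zero_grad()
    loss.backward()                                           # error backpropagation (steps S13 to S15)
    optimizer.step()
    return loss.item()
```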
  • The inference device 13 of the first embodiment has an inference unit 13a and a storage unit 13b. The parameters of the labeling network 150 obtained as described above are stored in the storage unit 13b.
  • An utterance text sequence (input information sequence) for inference is input to the inference unit 13a. The inference unit 13a applies the utterance text sequence for inference to the labeling network 150 (labeling model) specified by the parameters stored in the storage unit 13b, obtains the estimated label sequence corresponding to the utterance text sequence for inference, and outputs the estimated label sequence.
  • That is, the inference unit 13a inputs the utterance text sequence T 1 , ..., T N for inference to the logical relationship understanding layer 110 and obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context of the utterance text sequence T 1 , ..., T N for inference. Specifically, the inference unit 13a inputs the utterance texts T 1 , ..., T N for inference to the short-term context understanding networks 111-1, ..., 111-N, respectively, obtains the sequence SF 1 , ..., SF N of short-term intermediate features, inputs the short-term intermediate feature sequence SF 1 , ..., SF N to the long-term context understanding network 112, and obtains the intermediate feature sequence LF 1 , ..., LF N .
  • Furthermore, the inference unit 13a inputs the intermediate feature sequence LF 1 , ..., LF N to the labeling layer 120, obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs it.
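An inference pass with the trained labeling network, under the same illustrative assumptions, might look like this; the domain identification model is not used.

```python
import torch

@torch.no_grad()
def infer_labels(logical_layer, labeling_layer, utterances):
    """Applies the learned labeling network 150 to an utterance text sequence for inference
    and returns label indices for the estimated label sequence L_1..L_N."""
    lf_sequence, _ = logical_layer(utterances)    # intermediate feature sequence LF_1..LF_N
    label_probs = labeling_layer(lf_sequence)     # (batch, N, num_labels)
    return label_probs.argmax(dim=-1)             # estimated label index for each utterance
```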
  • As described above, the first embodiment uses the labeling network 150, which includes the logical relationship understanding layer 110 that receives the utterance text sequence T 1 , ..., T N , obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context of the utterance text sequence T 1 , ..., T N , and outputs the intermediate feature sequence LF 1 , ..., LF N , and the labeling layer 120 that receives the intermediate feature sequence LF 1 , ..., LF N , obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs the estimated label sequence L 1 , ..., L N ; and the domain identification model 130 that receives the intermediate feature sequence LF 1 , ..., LF N , obtains the estimated domain information D n of the domain identification information indicating whether each utterance text T n included in the utterance text sequence T 1 , ..., T N belongs to the source domain or the target domain, and outputs the sequence D 1 , ..., D N of the estimated domain information.
  • The learning device 11 applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to these models as the utterance text sequence T 1 , ..., T N , and performs adversarial learning in which the labeling network 150 is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes low, and the domain identification model 130 is trained so that the estimation accuracy of the sequence D 1 , ..., D N of estimated domain information becomes high.
  • As a result, the domain dependence of the labeling network 150 is reduced, and unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N in consideration of the context of the utterance text sequence T 1 , ..., T N , becomes possible.
  • In the second embodiment, the domain is identified both from the context between a plurality of utterance texts T n (the long-term intermediate feature sequence LF 1 , ..., LF N considering the logical relationships between the plurality of pieces of partial information included in the input information sequence) and from each utterance text T n alone (the short-term intermediate feature SF n considering the logical relationships of the information within each piece of partial information).
  • The learning device 21 of the second embodiment has a learning unit 21a and storage units 11b, 21c, and 21d, receives the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as input, and obtains and outputs the parameters (model parameters) of the labeling network of the target domain by learning. The learning schedule may be input to the learning device 21, and the learning device 21 may perform the learning process according to the learning schedule. The learning device 21 may also output the parameters of the domain identification networks used to realize the unsupervised domain adaptation.
  • The learning unit 21a has, for example, a control unit 11aa, a loss function calculation unit 21ab, a gradient inversion unit 11ac, and a parameter update unit 21ad. The learning unit 21a stores each piece of data obtained during processing in the storage units 11b, 21c, and 21d or in a temporary memory (not shown), and reads the data as necessary for each process.
  • FIG. 7 shows a configuration example of the network 200 used by the learning device 21 in the learning process.
  • The network 200 illustrated in FIG. 7 has a labeling network 150 (labeling model) and a domain identification model 230. Since the labeling network 150 is the same as that of the first embodiment, its description is omitted, and the domain identification model 230 is described below.
  • The domain identification model 230 exemplified in FIG. 7 includes short-term context domain identification networks 231-1, ..., 231-N (short-term logical relationship domain identification means) and a long-term context domain identification network 232 (long-term logical relationship domain identification means).
  • The long-term context domain identification network 232 receives the intermediate feature sequence LF 1 , ..., LF N (the long-term intermediate feature sequence) output from the long-term context understanding network 112, and obtains and outputs the sequence LD 1 , ..., LD N of estimated domain information.
  • The long-term context domain identification network 232 exemplified in FIG. 7 captures the input short-term intermediate feature sequence SF 1 , ..., SF N continuously in the time direction, and is intended to remove, from the labeling network 150, the domain dependence of the context (logical relationships) spanning a plurality of utterance texts T n , which are words and sentences. However, this does not limit the present invention.
  • A plurality of long-term context domain identification networks 232-1, ..., 232-K (where K is an integer of 2 or more and k = 1, ..., K is an index representing the layers of the long-term context understanding network) may exist.
  • The long-term context domain identification network 232 can be configured by, for example, a combination of a unidirectional LSTM or a bidirectional LSTM and a fully connected neural network having a softmax function as an activation function.
  • The short-term context domain identification networks 231-1, ..., 231-N receive the sequence SF 1 , ..., SF N of short-term intermediate features (a second sequence based on the intermediate feature sequence; the short-term intermediate feature sequence) output from the short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means), obtain the estimated domain information SD n of the domain identification information indicating whether each utterance text T n belongs to the source domain or the target domain, and output the estimated domain information SD n .
  • By estimating, for each short-term intermediate feature SF n , whether the utterance text T n belongs to the source domain or the target domain, the short-term context domain identification network 231-n is intended to efficiently remove the domain dependence of the utterance text T n alone, such as a specific word or document, from the labeling network 150. However, this does not limit the present invention.
  • In place of each short-term context domain identification network 231-n, a plurality of short-term context domain identification networks 231-n1, ..., 231-nK' (where K' is an integer of 2 or more) may exist, and each short-term context domain identification network 231-nk' represents the network corresponding to each n (for example, each time).
  • The short-term context domain identification network 231-n can be configured by, for example, a fully connected neural network having a softmax function as an activation function.
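A sketch of the domain identification model 230 under the same illustrative assumptions follows: a per-utterance softmax classifier over each SF n and an LSTM followed by a fully connected softmax layer over LF 1 , ..., LF N ; the parameter sharing across n and the layer sizes are again assumptions.

```python
import torch.nn as nn

class ShortTermContextDomainIdentifier(nn.Module):
    """Short-term context domain identification networks 231-1..231-N (sketch): a fully
    connected softmax classifier applied to each short-term intermediate feature SF_n."""
    def __init__(self, feature_dim=256, num_domains=2):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_domains)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, sf_sequence):                # sf_sequence: (batch, N, feature_dim)
        return self.softmax(self.fc(sf_sequence))  # estimated domain information SD_1..SD_N

class LongTermContextDomainIdentifier(nn.Module):
    """Long-term context domain identification network 232 (sketch): a unidirectional LSTM over
    LF_1..LF_N followed by a fully connected softmax layer, capturing the domain dependence of
    the context spanning a plurality of utterance texts."""
    def __init__(self, feature_dim=256, hidden_dim=128, num_domains=2):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_domains)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, lf_sequence):                # lf_sequence: (batch, N, feature_dim)
        out, _ = self.lstm(lf_sequence)
        return self.softmax(self.fc(out))          # estimated domain information LD_1..LD_N
```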
  • The labeled teacher data of the source domain (the labeled learning information sequence belonging to the source domain) and the unlabeled teacher data of the target domain (the unlabeled learning information sequence belonging to the target domain) are input to the learning unit 21a of the learning device 21.
  • The learning unit 21a applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to the above-mentioned network 200 as the utterance text sequence T 1 , ..., T N , and performs adversarial learning in which the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the estimated domain information sequences LD 1 , ..., LD N and SD 1 , ..., SD N becomes low, and the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences LD 1 , ..., LD N and SD 1 , ..., SD N becomes high.
  • That is, adversarial learning is performed between the labeling network 150 and the domain identification model 230 based on a loss function (hereinafter, the label prediction loss) representing the error between the estimated label sequence L 1 , ..., L N output from the labeling network 150 when the above-mentioned teacher data is input to the network 200 as the utterance text sequence T 1 , ..., T N and the corresponding correct label sequence of the labeled teacher data of the source domain, a loss function (hereinafter, the long-term context domain identification loss) representing the error between the sequence LD 1 , ..., LD N of estimated domain information output from the long-term context domain identification network 232 and the correct labels of the domain identification information, which are determined by whether each input came from the labeled teacher data of the source domain or the unlabeled teacher data of the target domain, and a loss function (hereinafter, the short-term context domain identification loss) representing the error between the estimated domain information SD 1 , ..., SD N output from the short-term context domain identification networks 231-1, ..., 231-N and the correct labels of the domain identification information.
  • The estimated label sequence L 1 , ..., L N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 200 is not used for calculating the label prediction loss.
  • The learning unit 21a performs this adversarial learning by using, for example, the error backpropagation method.
  • The gradient inversion layer 242-n inverts the gradient only when the error is backpropagated, and learning is performed so that the long-term context domain identification loss becomes small; as a result, the long-term context domain identification network 232 is trained so that the estimation accuracy of its estimated domain information becomes high, while the logical relationship understanding layer 110 is trained to obtain an intermediate feature sequence LF 1 , ..., LF N that lowers the estimation accuracy of that estimated domain information. Adversarial learning between the logical relationship understanding layer 110 and the long-term context domain identification network 232 can thereby be performed.
  • Similarly, the gradient inversion layer 241-n inverts the gradient only when the error is backpropagated, and learning is performed so that the short-term context domain identification loss becomes small; as a result, the short-term context domain identification networks 231-1, ..., 231-N are trained so that the estimation accuracy of their estimated domain information becomes high, while the short-term context understanding networks 111-1, ..., 111-N are trained to obtain a short-term intermediate feature sequence SF 1 , ..., SF N that lowers that estimation accuracy. Adversarial learning between the short-term context understanding networks 111-1, ..., 111-N and the short-term context domain identification networks 231-1, ..., 231-N can thereby be performed.
  • Through these adversarial learnings, the estimated label sequence L 1 , ..., L N can be estimated accurately while the domain cannot be estimated by the domain identification model 230. As a result, the labeling network 150 can acquire an intermediate feature sequence LF 1 , ..., LF N that is effective for label prediction while suppressing the domain dependence of the context spanning a plurality of utterance texts T n , and can acquire a short-term intermediate feature sequence SF 1 , ..., SF N that is effective for label prediction while suppressing domain dependence in units of individual utterance texts T n , realizing unsupervised domain adaptation with higher accuracy.
  • This learning process can be realized by optimizing (minimizing) the loss function that linearly combines the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss.
  • The combination ratios of the linear combination of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss may be predetermined, or may be specified by the learning schedule input to the learning unit 21a.
  • The learning unit 21a may perform the above-mentioned learning while changing the combination ratios of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 21a may learn using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the ratios of the long-term context domain identification loss and the short-term context domain identification loss in the loss function as the number of learning steps increases.
  • Alternatively, while changing the combination ratios based on the learning schedule, the learning unit 21a may repeat a procedure in which learning is performed at constant combination ratios until the learning converges, the combination ratios are then changed, and learning is performed again until it converges.
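The three-way linear combination and its schedule might be sketched as follows; as before, the ramp-up shape and the use of a single shared ramp for both domain-loss weights are assumptions.

```python
def combined_loss_network_200(label_loss, long_term_domain_loss, short_term_domain_loss,
                              step, total_steps):
    """Linear combination of the label prediction loss, the long-term context domain
    identification loss, and the short-term context domain identification loss; both
    domain-loss weights grow with the number of learning steps (assumed linear ramp)."""
    progress = min(step / float(total_steps), 1.0)
    return label_loss + progress * long_term_domain_loss + progress * short_term_domain_loss
```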
  • The domain identification model 230 may have only one of the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N.
  • When the domain identification model 230 has only the long-term context domain identification network 232, the gradient inversion layers 241-1, ..., 241-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the long-term context domain identification loss. The combination ratio of this linear combination may be predetermined, or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may perform the above-mentioned learning while changing the combination ratio of the label prediction loss and the long-term context domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 21a may learn using only the label prediction loss as the loss function in the early stage of learning and gradually increase the ratio of the long-term context domain identification loss in the loss function as the number of learning steps increases, or may repeat a procedure in which learning is performed at a constant combination ratio until the learning converges, the combination ratio is then changed, and learning is performed again until it converges.
  • When the domain identification model 230 has only the short-term context domain identification networks 231-1, ..., 231-N, the gradient inversion layers 242-1, ..., 242-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the short-term context domain identification loss. The combination ratio of this linear combination may be predetermined, or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may perform the above-mentioned learning while changing the combination ratio of the label prediction loss and the short-term context domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 21a may learn using only the label prediction loss as the loss function in the early stage of learning and gradually increase the ratio of the short-term context domain identification loss in the loss function as the number of learning steps increases, or may repeat a procedure in which learning is performed at a constant combination ratio until the learning converges, the combination ratio is then changed, and learning is performed again until it converges.
  • The domain identification model 230 may also have the long-term context domain identification network 232 and only a part of the short-term context domain identification networks 231-1, ..., 231-N. That is, some of the short-term context domain identification networks 231-1, ..., 231-N may be omitted. In this case, the estimated domain information SD n corresponding to each omitted short-term context domain identification network 231-n and the corresponding correct labels of the domain identification information are not used in the calculation of the short-term context domain identification loss.
  • The learning unit 21a may prepare a plurality of various domain identification models 230 and/or labeling networks 150 as exemplified above, perform learning with each, and afterwards select, from among the labeling networks 150 obtained by the respective learnings, the labeling network 150 that gives the best estimation accuracy of the label sequence in the target domain.
  • The plurality of prepared domain identification models 230 include, for example, at least one of a domain identification model 230 including the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N, a domain identification model 230 having only the long-term context domain identification network 232, a domain identification model 230 having only the short-term context domain identification networks 231-1, ..., 231-N, and the domain identification model 130 of the first embodiment.
  • The learning process may be batch learning, mini-batch learning, or online learning.
  • The learning unit 21a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, stores the parameters of the short-term context domain identification networks 231-1, ..., 231-N in the storage unit 21c, and stores the parameters of the long-term context domain identification network 232 in the storage unit 21d.
  • The learning device 21 outputs the parameters of the labeling network 150 stored in the storage unit 11b. The parameters of the labeling network 150 are used for the inference process. The parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232 are normally not used for the inference process, and therefore need not be output from the learning device 21. However, the learning device 21 may output at least some of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
  • Step S21: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain, and initializes the parameters of the network 200.
  • Step S22: The loss function calculation unit 21ab applies the teacher data to the network 200 as the utterance text sequence and calculates the loss function that linearly combines the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss.
  • Step S23: The parameter update unit 21ad backpropagates the information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 230 and the labeling layer 120.
  • Step S24: The gradient inversion unit 11ac inverts the gradient of the information based on the loss function backpropagated from the domain identification model 230 and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
  • Step S25: The parameter update unit 21ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information according to the error backpropagation method.
  • Step S26: The control unit 11aa determines whether or not the end condition is satisfied. If the end condition is not satisfied, the control unit 11aa returns the process to step S22. If the end condition is satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least some of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
  • As described above, the second embodiment uses the labeling network 150, which includes the logical relationship understanding layer 110 that receives the utterance text sequence T 1 , ..., T N , obtains the intermediate feature sequence LF 1 , ..., LF N in consideration of the context of the utterance text sequence T 1 , ..., T N , and outputs the intermediate feature sequence LF 1 , ..., LF N , and the labeling layer 120 that receives the intermediate feature sequence LF 1 , ..., LF N , obtains the estimated label sequence L 1 , ..., L N corresponding to the utterance text sequence T 1 , ..., T N , and outputs the estimated label sequence L 1 , ..., L N ; and the domain identification model 230 that obtains the estimated domain information LD n and SD n of the domain identification information indicating whether each utterance text T n included in the utterance text sequence T 1 , ..., T N belongs to the source domain or the target domain, and outputs the sequences LD 1 , ..., LD N and SD 1 , ..., SD N of the estimated domain information.
  • The learning device 21 applies teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to these models as the utterance text sequence, and performs adversarial learning in which the labeling network 150 is trained so that the estimation accuracy of the estimated label sequence L 1 , ..., L N becomes high and the estimation accuracy of the estimated domain information sequences LD 1 , ..., LD N and SD 1 , ..., SD N becomes low, and the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences becomes high.
  • This enables unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L 1 , ..., L N corresponding to the utterance text sequence in consideration of the context of the utterance text sequence T 1 , ..., T N .
  • The domain identification model 230 includes at least one of the short-term context domain identification networks 231-1, ..., 231-N, which receive the sequence SF 1 , ..., SF N of short-term intermediate features output from the short-term context understanding networks 111-1, ..., 111-N and obtain and output the sequence SD 1 , ..., SD N of estimated domain information, and the long-term context domain identification network 232, which receives the intermediate feature sequence LF 1 , ..., LF N output from the long-term context understanding network 112 and obtains and outputs the sequence LD 1 , ..., LD N of estimated domain information.
  • When the domain identification model 230 includes at least the long-term context domain identification network 232, the domain dependence of the context spanning a plurality of utterance texts T n can be efficiently removed from the labeling network 150. Therefore, the unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L 1 , ..., L N corresponding to the utterance text sequence in consideration of the context of the utterance text sequence T 1 , ..., T N , can be performed accurately.
  • When the domain identification model 230 includes both the short-term context domain identification networks 231-1, ..., 231-N and the long-term context domain identification network 232, the domain dependence of each utterance text T n and the domain dependence of the context spanning a plurality of utterance texts T n can both be efficiently removed from the labeling network 150. Therefore, the unsupervised domain adaptation of the labeling network 150 can be performed with even higher accuracy.
  • FIG. 8 illustrates the experimental results. It can be seen that unsupervised domain adaptation to the target domain is possible with high accuracy using the labeled data of the existing source domain, without preparing labeled data in the target domain.
  • In particular, the method of the second embodiment enables unsupervised domain adaptation with higher accuracy, and its identification accuracy is improved by an average of 3.4% as compared with the method of learning only with the data of the source domain.
  • The learning devices 11 and 21 and the inference device 13 in each embodiment are devices configured by a general-purpose or dedicated computer that includes, for example, a processor (hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory) or a ROM (read-only memory), and that executes a predetermined program.
  • This computer may have one processor and one memory, or may have a plurality of processors and memories. This program may be installed in the computer, or may be recorded in a ROM or the like in advance.
  • Some or all of the processing units may be configured using an electronic circuit that realizes a processing function by itself, instead of an electronic circuit (circuitry) that realizes a functional configuration by reading a program, like a CPU. An electronic circuit constituting one device may include a plurality of CPUs.
  • FIG. 9 is a block diagram illustrating the hardware configurations of the learning devices 11 and 21 and the inference device 13 in each embodiment.
  • The learning devices 11 and 21 and the inference device 13 of this example have a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • The CPU 10a of this example has a control unit 10aa, a calculation unit 10ab, and a register 10ac, and executes various arithmetic processes according to various programs read into the register 10ac.
  • The input unit 10b is an input terminal, a keyboard, a mouse, a touch panel, or the like into which data is input.
  • The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a into which a predetermined program has been read, or the like.
  • The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored.
  • The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • The CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program. Similarly, the CPU 10a writes various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d where this program and data are written are stored in the register 10ac of the CPU 10a.
  • The control unit 10aa of the CPU 10a sequentially reads out the addresses stored in the register 10ac, reads the program or data from the area on the RAM 10d indicated by each read address, causes the calculation unit 10ab to sequentially execute the operations indicated by the program, and stores the calculation results in the register 10ac.
  • The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium are a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and the like.
  • The distribution of this program is carried out, for example, by selling, transferring, or renting a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When the process is executed, the computer reads the program stored in its own storage device and executes the process according to the read program.
  • Alternatively, the computer may read the program directly from the portable recording medium and execute the process according to the program, or may execute the process according to the received program each time a program is transferred from the server computer to this computer.
  • The above-mentioned process may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing function only by an execution instruction and result acquisition, without transferring the program from the server computer to this computer.
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (for example, data that is not a direct command to the computer but has a property that defines the processing of the computer).
  • This device may be configured using not only a CPU 10a but also a GPU (Graphics Processing Unit). Further, although in each embodiment the present device is configured by executing a predetermined program on a computer, at least a part of these processing contents may be realized by hardware.
  • the unlabeled teacher data of the target domain includes the spoken text series of the target domain but does not include the correct label series.
  • however, at least part of the unlabeled teacher data of the target domain may contain a correct label series.
  • the correct label series of unlabeled teacher data of the target domain may or may not be used for learning the networks 100 and 200.
  • in each of the above embodiments, an example was given in which the series of a plurality of pieces of information having a logical relationship is an utterance text series and the label series is a series of labels representing the corresponding scene of each utterance (for example, opening, grasping the matter, identity verification, correspondence, closing).
  • this is only an example, and other information sequences such as a sentence sequence, a programming language sequence, an audio signal sequence, and a moving image signal sequence may be used as a sequence of a plurality of information having a logical relationship.
  • each model, such as the labeling model, need not be a model based on a deep neural network, and may be another model based on a probabilistic model, a classifier, or the like.
  • the logical relationship understanding layer 110 of each embodiment receives the utterance text series T_1, ..., T_N and obtains and outputs the intermediate feature series LF_1, ..., LF_N in consideration of the context (logical relationship) of the utterance text series T_1, ..., T_N.
  • however, the logical relationship understanding layer 110 may receive the utterance text series T_1, ..., T_N and obtain and output a series consisting of fewer than N or more than N intermediate features.
  • the labeling layer 120 of each embodiment receives the intermediate feature series LF_1, ..., LF_N (the first series based on the intermediate feature series) and obtains and outputs the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N.
  • however, the labeling layer 120 may receive the intermediate feature series LF_1, ..., LF_N (the first series based on the intermediate feature series) and obtain and output a series of fewer than N or more than N estimated labels.
  • the domain identification models 130 and 230 of each embodiment receive the intermediate feature series LF_1, ..., LF_N (the second series based on the intermediate feature series) and obtain and output N pieces of estimated domain information.
  • however, the domain identification models 130 and 230 may receive the intermediate feature series LF_1, ..., LF_N (the second series based on the intermediate feature series) and obtain and output fewer than N or more than N pieces of estimated domain information.
  • according to the present invention, for example, it is possible to efficiently remove the domain dependence of intermediate features for a series labeling problem that considers a complicated context.
  • in particular, by removing the domain dependence of intermediate features for each of the short-term and long-term logical relationships (contexts), a labeling network with higher fitness to the target domain can be learned, and the labeling accuracy in the target domain can be improved.
  • unsupervised domain adaptation technology has been studied for image recognition, but the present invention is the first to apply it to the problem of estimating a label series corresponding to a series of information while considering the logical relationship of a series of a plurality of pieces of information, as in language processing.
  • this unsupervised domain adaptation technology can, for example, significantly reduce the cost of labeling, which has been a barrier to expanding the contact center business across industries.
  • a domain identification network in utterance text units can be designed as a mechanism that straddles sentence boundaries by means of a unidirectional or bidirectional LSTM. This makes it possible to capture domain dependencies such as specific story flows that depend on the contact center industry and to efficiently remove them from the labeling network; as a result, the label estimation accuracy in the target domain can be improved.
  • the domain identification network of the utterance text unit (for example, the call unit) can also be designed as a mechanism that does not straddle utterance text boundaries. This makes it possible, for example, to capture domain dependencies caused by specific words that depend on the contact center industry and to efficiently remove them from the labeling network; as a result, the label estimation accuracy in the target domain can be improved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This training device performs adversarial training on a labeling model and a domain identification model. The labeling model includes a logical relationship understanding means for receiving an input information sequence, which is a sequence of pieces of information having a logical relationship, and for obtaining and outputting an intermediate feature sequence in which the logical relationship of the input information sequence is considered, and a labeling means for receiving a first sequence based on the intermediate feature sequence and for obtaining and outputting an inference label sequence of the label sequence corresponding to the input information sequence. The domain identification model receives a second sequence based on the intermediate feature sequence and obtains and outputs inference domain information of domain identification information for each piece of information included in the input information sequence. Using teaching data including labeled teaching data of a source domain and unlabeled teaching data of a target domain, the training device performs the adversarial training so as to train the labeling model such that the inference accuracy of the inference label sequence becomes high and the inference accuracy of the sequence of inference domain information becomes low, and to train the domain identification model such that the inference accuracy of the sequence of inference domain information becomes high, and it obtains and outputs at least a parameter of the labeling model.

Description

Learning devices, inference devices, their methods, and programs
 本発明はラベリング技術に関する。 The present invention relates to labeling technology.
 近年、会話や談話の理解を目的に、発話系列を入力として、発話毎に会話や談話の応対シーンを表すラベルを推定する、発話系列ラベリングの技術が提案されている(例えば、非特許文献1)。 In recent years, for the purpose of understanding conversations and discourses, a technique of utterance series labeling has been proposed in which a utterance series is input and a label representing a conversation or discourse response scene is estimated for each utterance (for example, Non-Patent Document 1). ).
 例えば非特許文献1では、コンタクトセンタにおけるオペレータとカスタマとの間の会話を音声認識して得られた発話テキスト系列を入力として、発話毎に対応シーン(オープニング、用件把握、本人確認、対応、クロージングのいずれか)のラベルを推定する、発話系列ラベリングを実現する深層ニューラルネットワーク(以下、ラベリングネットワーク)を開示している。 For example, in Non-Patent Document 1, the utterance text series obtained by voice-recognizing the conversation between the operator and the customer in the contact center is input, and the corresponding scene (opening, grasping the matter, identity verification, correspondence, correspondence, for each utterance, A deep neural network (hereinafter referred to as a labeling network) that realizes speech sequence labeling that estimates the label of any of the closings) is disclosed.
 非特許文献1のようなラベリングネットワークの学習には、多量のラベル付き教師データが必要である。しかし、新たなドメインでのラベリングを行うたびに、そのドメインにおける多量のラベル付き教師データを収集することは、ラベル付与のコストが膨大にかかることから、困難である。ここで非特許文献2には、過去に適用済みのドメイン(以下、ソースドメイン)のラベル付きデータ(以下、ラベル付き教師データ)と、新規に適用したいドメイン(以下、ターゲットドメイン)のラベルなしデータ(以下、ラベルなし教師データ)とから、新たなドメインでのラベリングを行う教師なしドメイン適応を実現する方法が提案されている。 A large amount of labeled teacher data is required for learning a labeling network as in Non-Patent Document 1. However, it is difficult to collect a large amount of labeled teacher data in a new domain each time it is labeled, because the cost of labeling is enormous. Here, in Non-Patent Document 2, labeled data (hereinafter, labeled teacher data) of a domain that has been applied in the past (hereinafter, source domain) and unlabeled data of a domain that is newly applied (hereinafter, target domain) are described. (Hereinafter, unlabeled teacher data), a method for realizing unsupervised domain adaptation by labeling in a new domain has been proposed.
 しかし、非特許文献2の方法は、ソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとを用い、ターゲットドメインに属する単一画像に対応するラベルを推定するラベリングモデルを学習するものである。すなわち、非特許文献2の方法は単一画像の単純な分類問題の教師なしドメイン適応を行うものであり、複数の情報の系列の論理的関係を考慮して当該情報の系列に対応するラベル系列を推定する(例えば、発話テキスト系列に対して対応シーン毎のラベルの系列を推定する)複雑な分類問題の教師なしドメイン適応方法は確立されていない。 However, the method of Non-Patent Document 2 uses the labeled teacher data of the source domain and the unlabeled teacher data of the target domain to learn a labeling model for estimating the label corresponding to a single image belonging to the target domain. be. That is, the method of Non-Patent Document 2 is to perform unsupervised domain adaptation of a simple classification problem of a single image, and a label sequence corresponding to the sequence of the information in consideration of the logical relationship of the sequence of a plurality of information. An unsupervised domain adaptation method for complex classification problems has not been established.
 本発明はこのような点に鑑みてなされたものであり、複数の情報の系列の論理的関係を考慮して当該情報の系列に対応するラベル系列を推定するラベリングモデルの教師なしドメイン適応を行うことを目的とする。 The present invention has been made in view of such a point, and performs unsupervised domain adaptation of a labeling model that estimates a label sequence corresponding to a sequence of information in consideration of the logical relationship of the sequence of a plurality of information. The purpose is.
 論理的関係を持つ複数の情報の系列である入力情報系列を受け取り、前記入力情報系列の論理的関係を考慮した中間特徴系列を得、前記中間特徴系列を出力する論理的関係理解手段と、前記中間特徴系列に基づく第1系列を受け取り、前記入力情報系列に対応するラベル系列の推定ラベル系列を得、前記推定ラベル系列を出力するラベリング手段と、を含むラベリングモデルと、前記中間特徴系列に基づく第2系列を受け取り、前記入力情報系列に含まれる各部分情報がソースドメインに属するか、ターゲットドメインに属するか、を表すドメイン識別情報の推定ドメイン情報を得、前記推定ドメイン情報の系列を出力するドメイン識別モデルと、に対し、学習装置が、ソースドメインに属するラベル付きの学習用情報系列であるラベル付き教師データとターゲットドメインに属するラベルなしの学習用情報系列であるラベルなし教師データとを含む教師データを前記入力情報系列として用い、前記推定ラベル系列の推定精度が高く、前記推定ドメイン情報の系列の推定精度が低くなるように前記ラベリングモデルを学習し、前記推定ドメイン情報の系列の推定精度が高くなるように前記ドメイン識別モデルを学習する敵対的学習を行い、少なくとも前記ラベリングモデルのパラメータを得て出力する。 A logical relationship understanding means for receiving an input information sequence which is a sequence of a plurality of information having a logical relationship, obtaining an intermediate feature sequence considering the logical relationship of the input information sequence, and outputting the intermediate feature sequence, and the above-mentioned Based on a labeling model including a labeling means that receives a first sequence based on an intermediate feature sequence, obtains an estimated label sequence of a label sequence corresponding to the input information sequence, and outputs the estimated label sequence, and the intermediate feature sequence. The second series is received, the estimated domain information of the domain identification information indicating whether each partial information included in the input information series belongs to the source domain or the target domain is obtained, and the series of the estimated domain information is output. For the domain identification model, the learning device includes labeled teacher data, which is a labeled learning information sequence belonging to the source domain, and unlabeled teacher data, which is an unlabeled learning information sequence belonging to the target domain. Using the teacher data as the input information series, the labeling model is trained so that the estimation accuracy of the estimated label series is high and the estimation accuracy of the estimated domain information series is low, and the estimation accuracy of the estimated domain information series is low. Hostile learning is performed to learn the domain identification model so that the information becomes high, and at least the parameters of the labeling model are obtained and output.
 これにより、複数の情報の系列の論理的関係を考慮して当該情報の系列に対応するラベル系列を推定するラベリングモデルの教師なしドメイン適応を行うことができる。 This makes it possible to perform unsupervised domain adaptation of the labeling model that estimates the label sequence corresponding to the sequence of information in consideration of the logical relationship of the sequence of multiple information.
FIG. 1 is a block diagram illustrating the learning device of the first embodiment. FIG. 2 is a block diagram illustrating the detailed configuration of the learning unit of the first embodiment. FIG. 3 is a conceptual diagram illustrating the network used for the learning process of the first embodiment. FIG. 4 is a block diagram illustrating the inference device of the first embodiment. FIG. 5 is a conceptual diagram illustrating the learned labeling network of the first embodiment. FIG. 6 is a block diagram illustrating the learning device of the second embodiment. FIG. 7 is a conceptual diagram illustrating the network used for the learning process of the second embodiment. FIG. 8 is a graph illustrating experimental results. FIG. 9 is a block diagram illustrating the hardware configuration of the learning device and the inference device of the embodiments.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. Each embodiment shows an example of unsupervised domain adaptation of a labeling model based on a deep neural network that receives an utterance text series (a series of a plurality of pieces of information having a logical relationship) as input and outputs (series labeling) a series of labels (a label series) corresponding to the corresponding scenes (for example, opening, grasping the matter, identity verification, correspondence, closing). However, these are only examples and do not limit the present invention. That is, the present invention can be used for unsupervised domain adaptation of a labeling model that estimates an arbitrary label series corresponding to a series of any plurality of pieces of information in consideration of the logical relationship of that series. The logical relationship among a series of a plurality of pieces of information is not limited either; any relationship may exist among the plurality of pieces of information. Examples of logical relationships are contexts, word dependency relationships, grammatical relationships of a language, and frame-to-frame relationships of audio or video, but these do not limit the present invention. In addition, the labeling model is not limited to a model based on a deep neural network, and may be any model, such as a probabilistic model or a classifier, as long as it estimates and outputs a label series corresponding to an input series of information.
[First Embodiment]
The first embodiment of the present invention will be described.
<Functional configuration and learning process of learning device 11>
As illustrated in FIG. 1, the learning device 11 of the first embodiment has a learning unit 11a and storage units 11b and 11c, receives labeled teacher data of a source domain and unlabeled teacher data of a target domain as inputs, and obtains and outputs parameters (model parameters) of a labeling network for the target domain by learning. In the drawings, for simplicity, the source domain is written as "SD", the target domain as "TD", and a network as "NW". The labeled teacher data of the source domain exemplified here is a labeled learning information series belonging to the source domain, and includes an utterance text series of the source domain (an input information series, that is, a series of a plurality of pieces of information having a logical relationship) and the correct label series corresponding to that utterance text series. The unlabeled teacher data of the target domain exemplified here is an unlabeled learning information series belonging to the target domain, and includes an utterance text series of the target domain but does not include a correct label series. Further, a learning schedule, which is a schedule of, for example, the combination ratio of losses described later, may be input to the learning device 11, and the learning device 11 may perform the learning process according to that learning schedule. The learning device 11 may also output parameters of a domain identification network for realizing unsupervised domain adaptation. As illustrated in FIG. 2, the learning unit 11a has, for example, a control unit 11aa, a loss function calculation unit 11ab, a gradient reversal unit 11ac, and a parameter update unit 11ad. The learning unit 11a stores each piece of data obtained in the course of processing in the storage units 11b and 11c or in a temporary memory (not shown), reads the data as necessary, and uses it for each process.
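To make the two kinds of teacher data concrete, the following is a minimal Python sketch of how they might be represented (the class and field names are illustrative assumptions, not terms from this description):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class DialogueExample:
        """One call: an utterance text series T_1, ..., T_N with optional correct labels."""
        utterances: List[str]              # T_1, ..., T_N
        scene_labels: Optional[List[str]]  # correct label series, or None if unlabeled
        domain: str                        # "source" or "target"

    # Labeled teacher data of the source domain.
    source_example = DialogueExample(
        utterances=["Hello, thank you for calling.", "I would like to check my contract."],
        scene_labels=["opening", "grasping the matter"],
        domain="source",
    )

    # Unlabeled teacher data of the target domain (no correct label series).
    target_example = DialogueExample(
        utterances=["Hi, my internet connection keeps dropping."],
        scene_labels=None,
        domain="target",
    )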
≪Network 100≫
FIG. 3 shows a configuration example of the network 100 used by the learning device 11 in the learning process. The network 100 exemplified in FIG. 3 has a labeling network 150 (labeling model) and a domain identification model 130.
≪Labeling Network 150≫
The labeling network 150 illustrated in FIG. 3 receives (takes as input) an utterance text series T_1, ..., T_N (an input information series, that is, a series of a plurality of pieces of information having a logical relationship), and obtains and outputs an estimated label series L_1, ..., L_N, which is an estimate of the label series corresponding to the utterance text series T_1, ..., T_N. Here, the utterance text series T_1, ..., T_N is a series of N utterance texts T_n, where n is, for example, an index corresponding to time, n = 1, ..., N, N is an integer of 1 or more, and in general N is an integer of 2 or more. The utterance text T_n may be a single word such as "I'm sorry" or "Yes", or may be a sentence containing M(n) words T_{n,1}, ..., T_{n,M(n)}, such as "I'm in trouble because the reply is slow", where M(n) is an integer of 2 or more. The estimated label series L_1, ..., L_N exemplified here is a series of N estimated labels L_n (n = 1, ..., N). The estimated label L_n in this example corresponds to the utterance text T_n and represents, for example, the corresponding scene of the utterance text T_n (for example, opening, grasping the matter, identity verification, correspondence, closing). The labeling network 150 exemplified here has a logical relationship understanding layer 110 (logical relationship understanding means) and a labeling layer 120 (labeling means).
≪Logical relationship understanding layer 110≫
The logical relationship understanding layer 110 receives the utterance text series T_1, ..., T_N, obtains an intermediate feature series LF_1, ..., LF_N that takes into account the context (logical relationship) of the utterance text series T_1, ..., T_N, and outputs the intermediate feature series LF_1, ..., LF_N. The intermediate feature series LF_1, ..., LF_N is a series of N intermediate features LF_n (n = 1, ..., N), and the intermediate feature LF_n corresponds to the utterance text T_n. The logical relationship understanding layer 110 illustrated in FIG. 3 includes short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means) and a long-term context understanding network 112 (long-term logical relationship understanding means). For example, the short-term context understanding networks 111-1, ..., 111-N are networks identical to one another (for example, networks whose parameters are identical), and each short-term context understanding network 111-n represents the state corresponding to each n = 1, ..., N (for example, each time). The short-term context understanding networks 111-1, ..., 111-N exemplified here receive the utterance text series T_1, ..., T_N. Each short-term context understanding network 111-n (n = 1, ..., N) that has received an utterance text T_n included in the utterance text series T_1, ..., T_N (each piece of partial information included in the input information series) obtains a short-term intermediate feature SF_n that takes into account the context of the words within the received utterance text T_n (for example, the word-level short-term context), that is, a short-term intermediate feature that takes into account the logical relationship of information within the partial information, and outputs the short-term intermediate feature SF_n. As a result, the short-term context understanding networks 111-1, ..., 111-N output a series SF_1, ..., SF_N of short-term intermediate features. When the utterance text T_n contains only one word, the context of the words within that utterance text T_n depends only on that one word, but the SF_n obtained in this case is still a short-term intermediate feature that takes the word context into account. However, this does not limit the present invention. For example, each short-term context understanding network 111-n may be divided into a plurality of short-term context understanding networks 111-n1, ..., 111-nK' (where K' is an integer of 2 or more). For example, k' = 1, ..., K' is an index representing a layer of the short-term context understanding network, and each short-term context understanding network 111-nk' represents the network from the input layer to the k'-th layer of the short-term context understanding network. In this case, each short-term context understanding network 111-nk' outputs the short-term intermediate feature SF_nk' of the k'-th layer. The long-term context understanding network 112 exemplified here receives the series SF_1, ..., SF_N of short-term intermediate features (the short-term intermediate feature series), obtains the intermediate feature series LF_1, ..., LF_N that takes into account the context among the plurality of utterance texts T_n included in the utterance text series T_1, ..., T_N (for example, the sentence-level long-term context or a long-term context spanning a plurality of sentences), that is, a long-term intermediate feature series that takes into account the logical relationship among the plurality of pieces of partial information included in the input information series, and outputs the intermediate feature series LF_1, ..., LF_N. However, this does not limit the present invention. For example, when each short-term context understanding network 111-nk' outputs the short-term intermediate feature SF_nk' of the k'-th layer as described above, SF_nK' may be input to the long-term context understanding network 112 as SF_n, or a plurality of SF_nk' among SF_n1, ..., SF_nK' may be input. Further, the long-term context understanding network 112 may be divided into a plurality of long-term context understanding networks 112-1, ..., 112-K (where K is an integer of 2 or more). For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network, and each long-term context understanding network 112-k represents the network from the input layer to the k-th layer of the long-term context understanding network. In this case, each long-term context understanding network 112-k (k = 1, ..., K) may receive the series SF_n of any plurality of short-term intermediate features and obtain and output intermediate features that take into account the context among the plurality of utterance texts T_n corresponding to the received series SF_n. In this case, the long-term context understanding networks 112-1, ..., 112-K output K intermediate feature series {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK}. LF_nk (n = 1, ..., N, k = 1, ..., K) represents the intermediate feature corresponding to each n (for example, each time) of the k-th layer output from each long-term context understanding network 112-k.
Here, the short-term context understanding network 111-n can be configured, for example, by a combination of an embedding layer that converts words into numerical values using a dictionary, a unidirectional LSTM (long short-term memory), a bidirectional LSTM, an attention mechanism, and the like (see, for example, Non-Patent Document 1). The long-term context understanding network 112 can be configured, for example, by a combination of a unidirectional LSTM, a bidirectional LSTM, and the like.
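As a concrete illustration of such a configuration, the following is a minimal sketch in Python using PyTorch (an assumption; the patent does not prescribe an implementation). A word embedding layer plus a bidirectional LSTM stands in for the short-term context understanding network 111-n, mean pooling stands in for the attention mechanism, and a unidirectional LSTM stands in for the long-term context understanding network 112; all dimensions and names are illustrative:

    import torch
    import torch.nn as nn

    class LogicalRelationUnderstanding(nn.Module):
        """Short-term (within-utterance) and long-term (across-utterance) context encoder."""
        def __init__(self, vocab_size, emb_dim=128, short_dim=128, long_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            # short-term context understanding network 111-n (parameters shared across n)
            self.short_lstm = nn.LSTM(emb_dim, short_dim, batch_first=True, bidirectional=True)
            # long-term context understanding network 112
            self.long_lstm = nn.LSTM(2 * short_dim, long_dim, batch_first=True)

        def forward(self, token_ids):
            # token_ids: (batch, N utterances, M words per utterance)
            b, n, m = token_ids.shape
            emb = self.embed(token_ids.reshape(b * n, m))      # (b*n, m, emb_dim)
            word_states, _ = self.short_lstm(emb)              # word-level context within T_n
            sf = word_states.mean(dim=1).reshape(b, n, -1)     # short-term features SF_1, ..., SF_N
            lf, _ = self.long_lstm(sf)                         # long-term features LF_1, ..., LF_N
            return sf, lf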
≪Labeling layer 120≫
The labeling layer 120 receives the intermediate feature series LF_1, ..., LF_N (a first series based on the intermediate feature series), obtains the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N, and outputs the estimated label series L_1, ..., L_N. The labeling layer 120 illustrated in FIG. 3 includes label prediction networks 120-1, ..., 120-N. For example, the label prediction networks 120-1, ..., 120-N are networks identical to one another (for example, networks whose parameters are identical), and each label prediction network 120-n represents the state corresponding to each n = 1, ..., N (for example, each time). The label prediction network 120-n exemplified here receives the intermediate feature LF_n, obtains the estimated label L_n corresponding to the utterance text T_n, and outputs the estimated label L_n. The label prediction network 120-n can be configured, for example, by a fully connected neural network whose activation function is a softmax function. When K intermediate feature series {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK} are output from the long-term context understanding networks 112-1, ..., 112-K, the label prediction network 120-n receives, for example, LF_nK as the intermediate feature LF_n, obtains the estimated label L_n corresponding to the utterance text T_n, and outputs the estimated label L_n. Alternatively, the label prediction network 120-n may receive a plurality of intermediate features LF_nk among the intermediate feature series {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK}, obtain the estimated label L_n corresponding to the utterance text T_n, and output the estimated label L_n.
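Under the same assumptions as the previous sketch, the labeling layer can be illustrated as a single fully connected layer applied at every position n (the softmax is folded into the cross-entropy loss used during learning):

    import torch.nn as nn

    class LabelingLayer(nn.Module):
        """Label prediction network 120-n, with parameters shared across positions n = 1, ..., N."""
        def __init__(self, long_dim=128, num_labels=5):
            super().__init__()
            self.classifier = nn.Linear(long_dim, num_labels)

        def forward(self, lf):
            # lf: (batch, N, long_dim) -> per-utterance label logits (batch, N, num_labels)
            return self.classifier(lf)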
Domain Discriminative Model 130≫
The domain identification model 130 illustrated in FIG. 3 receives the intermediate feature series LF_1, ..., LF_N (a second series based on the intermediate feature series), obtains estimated domain information D_n (n = 1, ..., N) of domain identification information indicating whether each utterance text T_n included in the utterance text series T_1, ..., T_N (each piece of partial information included in the input information series) belongs to the source domain or to the target domain (that is, whether each utterance text T_n is of the source domain or of the target domain), and outputs the series D_1, ..., D_N of the estimated domain information. The domain identification model 130 exemplified here includes N domain identification networks 130-1, ..., 130-N. For example, the domain identification networks 130-1, ..., 130-N are networks identical to one another (for example, networks whose parameters are identical), and each domain identification network 130-n represents the state corresponding to each n = 1, ..., N (for example, each time). For example, each domain identification network 130-n (n = 1, ..., N) receives the intermediate feature LF_n, obtains the estimated domain information D_n, and outputs it. However, this does not limit the present invention. For example, a plurality of domain identification networks 130-nk may exist in place of each domain identification network 130-n. For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network, and each domain identification network 130-nk represents a network corresponding to each n (for example, each time). In this case, each domain identification network 130-nk may receive the intermediate feature LF_nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and use it to obtain and output estimated domain information D_nk. D_nk (n = 1, ..., N, k = 1, ..., K) represents the estimated domain information corresponding to each n (for example, each time) output from each domain identification network 130-nk. The domain identification network 130-n (or the domain identification network 130-nk) can be configured, for example, by a fully connected neural network whose activation function is a softmax function.
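A corresponding sketch of the domain identification model 130, again under the assumption of a small fully connected network applied at every position n (the softmax over the two domains is folded into the loss):

    import torch.nn as nn

    class DomainDiscriminator(nn.Module):
        """Domain identification network 130-n, with parameters shared across positions n = 1, ..., N."""
        def __init__(self, feat_dim=128, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 2),   # two classes: source domain / target domain
            )

        def forward(self, features):
            # features: (batch, N, feat_dim) -> per-utterance domain logits (batch, N, 2)
            return self.net(features)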
≪Learning process≫
In the learning process, labeled teacher data of the source domain (a labeled learning information series belonging to the source domain) and unlabeled teacher data of the target domain (an unlabeled learning information series belonging to the target domain) are input to the learning unit 11a of the learning device 11. For the network 100 described above, the learning unit 11a uses teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as the utterance text series T_1, ..., T_N, and performs adversarial learning in which the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label series L_1, ..., L_N becomes high and the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes low, while the domain identification model 130 is trained so that the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes high. That is, the learning unit 11a performs adversarial learning of the labeling network 150 and the domain identification model 130 on the basis of a loss function (hereinafter, label prediction loss) representing the error between the estimated label series L_1, ..., L_N output from the labeling network 150 when the above teacher data is input to the network 100 as the utterance text series T_1, ..., T_N and the corresponding correct label series of the labeled teacher data of the source domain, and a loss function (hereinafter, domain identification loss) representing the error between the series D_1, ..., D_N of estimated domain information output from the domain identification model 130 and the correct label series of the estimated domain information specified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain. Note that the estimated label series L_1, ..., L_N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 100 is not used for calculating the label prediction loss.
The learning unit 11a performs this adversarial learning by using, for example, the error backpropagation method. In this case, a gradient reversal layer 141-n (n = 1, ..., N) is provided between the logical relationship understanding layer 110 and the domain identification model 130 (for example, between the long-term context understanding network 112-n and the domain identification network 130-n), and the gradient is reversed at the gradient reversal layer 141-n only during error backpropagation. Here, by performing learning so that the label prediction loss becomes small, the estimation accuracy of the estimated label series L_1, ..., L_N in the labeling network 150 becomes high. In addition, by reversing the gradient at the gradient reversal layer 141-n only during error backpropagation and performing learning so that the domain identification loss becomes small, it is possible to carry out adversarial learning in which the domain identification model 130 is trained so that the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes high, while the logical relationship understanding layer 110 is trained so as to obtain an intermediate feature series LF_1, ..., LF_N that lowers the estimation accuracy of the series D_1, ..., D_N of estimated domain information. Through this adversarial learning, it is possible to train a labeling network 150 that can generate an intermediate feature series LF_1, ..., LF_N from which the estimated label series L_1, ..., L_N can be estimated accurately but whose domain cannot be estimated by the domain identification model 130. As a result, the labeling network 150 can acquire an intermediate feature series LF_1, ..., LF_N that is effective for label prediction while suppressing dependence on the domain, and unsupervised domain adaptation can be realized.
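The gradient reversal layer 141-n is commonly realized as a function that is the identity in the forward pass and flips (and optionally scales) the gradient in the backward pass; a widely used PyTorch-style sketch of this technique (an assumption, not code from the patent) is:

    import torch

    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; multiplies the gradient by -lam on the backward pass."""
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # gradient w.r.t. x is reversed; lam itself receives no gradient
            return -ctx.lam * grad_output, None

    def grad_reverse(x, lam=1.0):
        return GradReverse.apply(x, lam)

Inserting grad_reverse between the logical relationship understanding layer and the domain discriminator leaves the forward computation unchanged while making the encoder receive the sign-flipped domain identification loss gradient during error backpropagation.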
This learning process can be realized by optimizing (minimizing) a loss function in which the label prediction loss and the domain identification loss are linearly combined. The combination ratio of this linear combination may be determined in advance, or may be specified by the learning schedule input to the learning unit 11a.
The learning unit 11a may perform the above learning while changing the combination ratio of the label prediction loss and the domain identification loss according to the number of learning steps, based on the learning schedule. For example, the learning unit 11a may perform learning using only the label prediction loss as the loss function in the early stage of learning and gradually increase the proportion of the domain identification loss in the loss function as the number of learning steps increases. Further, the learning unit 11a may repeatedly perform learning in which learning is carried out at a constant combination ratio until it converges, the combination ratio is then changed, and learning is carried out again until convergence, while changing the combination ratio based on the learning schedule.
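Such a schedule might be written, for example, as follows (a sketch under the assumption of a smoothly increasing weight; the description above only requires that the combination ratio can change with the number of learning steps):

    import math

    def domain_loss_weight(step, total_steps, max_weight=1.0):
        """0 at the start of learning, smoothly approaching max_weight (illustrative schedule)."""
        p = step / max(total_steps, 1)
        return max_weight * (2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0)

    def combined_loss(label_loss, domain_loss, step, total_steps):
        # linear combination of the label prediction loss and the domain identification loss
        return label_loss + domain_loss_weight(step, total_steps) * domain_loss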
The learning unit 11a may also prepare a plurality of the various domain identification models 130 and/or labeling networks 150 exemplified above, perform learning with each of them, and later select, from among the labeling networks 150 obtained by the respective learning runs, the labeling network 150 whose estimation accuracy of the label series in the target domain is best.
The learning process may be batch learning, mini-batch learning, or online learning.
The learning unit 11a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, and stores the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N) in the storage unit 11c. The learning device 11 outputs the parameters of the labeling network 150 stored in the storage unit 11b. The parameters of the labeling network 150 are used in the inference processing described later. Normally, the parameters of the domain identification model 130 are not used for the inference processing and therefore do not have to be output from the learning device 11. However, the learning device 11 may output at least some of the parameters of the domain identification model 130 (the parameters of the domain identification networks 130-1, ..., 130-N).
The above learning process will now be functionally illustrated with reference to FIG. 2.
Step S11: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including labeled teacher data of the source domain and unlabeled teacher data of the target domain. Further, the control unit 11aa initializes the parameters of the network 100.
Step S12: The loss function calculation unit 11ab inputs the teacher data into the network 100 as the utterance text series T_1, ..., T_N, obtains the label prediction loss and the domain identification loss, and obtains a loss function in which they are linearly combined.
Step S13: The parameter update unit 11ad back-propagates information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 130 and the labeling layer 120.
Step S14: The gradient reversal unit 11ac reverses the gradient of the information based on the loss function back-propagated from the domain identification model 130 and back-propagates it to the logical relationship understanding layer 110. The information based on the loss function back-propagated from the labeling layer 120 is back-propagated to the logical relationship understanding layer 110 as it is.
Step S15: The parameter update unit 11ad updates the parameters of the logical relationship understanding layer 110 using the back-propagated information according to the error back-propagation method.
Step S16: The control unit 11aa determines whether or not the end condition (for example, a condition that the number of parameter updates reaches a predetermined number) is satisfied. If the end condition is not satisfied here, the control unit 11aa returns the process to step S12. On the other hand, when the end condition is satisfied, the control unit 11aa outputs the parameter of the labeling network 150. If necessary, the control unit 11aa may output at least one of the parameters of the domain identification networks 130-1, ..., 130-N.
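Putting the pieces together, one iteration of steps S12 to S16 might look like the following sketch, assuming the encoder, labeling layer, domain discriminator, grad_reverse, and combined_loss sketched earlier in this description (batching, padding, and the termination condition are simplified):

    import torch
    import torch.nn.functional as F

    def train_step(encoder, labeler, discriminator, optimizer, batch, step, total_steps):
        """One adversarial learning step (simplified)."""
        # batch: token_ids (b, N, M); scene_labels (b, N), where -100 marks target-domain
        # utterances excluded from the label prediction loss; domain_labels (b, N).
        token_ids, scene_labels, domain_labels = batch

        sf, lf = encoder(token_ids)                            # layer 110 (step S12)
        label_logits = labeler(lf)                             # labeling layer 120
        label_loss = F.cross_entropy(label_logits.flatten(0, 1),
                                     scene_labels.flatten(), ignore_index=-100)

        domain_logits = discriminator(grad_reverse(lf))        # gradient reversal layer (step S14)
        domain_loss = F.cross_entropy(domain_logits.flatten(0, 1), domain_labels.flatten())

        loss = combined_loss(label_loss, domain_loss, step, total_steps)
        optimizer.zero_grad()
        loss.backward()                                        # error backpropagation (steps S13-S15)
        optimizer.step()                                       # parameter update
        return loss.item()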
<Functional configuration of inference device 13 and inference processing>
As illustrated in FIG. 4, the inference device 13 of the first embodiment has an inference unit 13a and a storage unit 13b. The parameters of the labeling network 150 obtained as described above are stored in the storage unit 13b.
≪Inference processing≫
In the inference processing, an utterance text series for inference (an input information series) is input to the inference unit 13a. The inference unit 13a applies the utterance text series for inference to the labeling network 150 (labeling model) specified by the parameters stored in the storage unit 13b, obtains an estimated label series of the label series corresponding to the utterance text series for inference, and outputs the estimated label series. For example, in the case of the labeling network 150 illustrated in FIG. 5, the inference unit 13a inputs the utterance text series T_1, ..., T_N for inference to the logical relationship understanding layer 110 and obtains the intermediate feature series LF_1, ..., LF_N corresponding to the utterance text series T_1, ..., T_N for inference. For example, the inference unit 13a inputs the utterance text series T_1, ..., T_N for inference to the short-term context understanding networks 111-1, ..., 111-N, respectively, to obtain the series SF_1, ..., SF_N of short-term intermediate features, and inputs the series SF_1, ..., SF_N of short-term intermediate features to the long-term context understanding network 112 to obtain the intermediate feature series LF_1, ..., LF_N. Further, the inference unit 13a inputs the intermediate feature series LF_1, ..., LF_N to the labeling layer 120 to obtain and output the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N.
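In code, this inference processing reduces to a forward pass through the learned labeling network only; the domain identification model is not used. A minimal sketch under the same assumptions as the earlier blocks:

    import torch

    @torch.no_grad()
    def infer_labels(encoder, labeler, token_ids, label_names):
        # token_ids: (1, N, M) tokenized utterance text series of one conversation
        _, lf = encoder(token_ids)               # intermediate feature series LF_1, ..., LF_N
        logits = labeler(lf)                     # (1, N, num_labels)
        ids = logits.argmax(dim=-1).squeeze(0)   # estimated label series L_1, ..., L_N
        return [label_names[i] for i in ids.tolist()]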
<Characteristics of the first embodiment>
In this embodiment, the labeling network 150 includes the logical relationship understanding layer 110, which receives the utterance text series T_1, ..., T_N, obtains the intermediate feature series LF_1, ..., LF_N in consideration of the context of the utterance text series T_1, ..., T_N, and outputs the intermediate feature series LF_1, ..., LF_N, and the labeling layer 120, which receives the intermediate feature series LF_1, ..., LF_N, obtains the estimated label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N, and outputs the estimated label series L_1, ..., L_N. The domain identification model 130 receives the intermediate feature series LF_1, ..., LF_N, obtains the estimated domain information D_n of domain identification information indicating whether each utterance text T_n included in the utterance text series T_1, ..., T_N belongs to the source domain or to the target domain, and outputs the series D_1, ..., D_N of the estimated domain information. For these, the learning device 11 uses teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as the utterance text series T_1, ..., T_N, and performs adversarial learning in which the labeling network 150 is trained so that the estimation accuracy of the estimated label series L_1, ..., L_N becomes high and the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes low, while the domain identification model 130 is trained so that the estimation accuracy of the series D_1, ..., D_N of estimated domain information becomes high. This reduces the domain dependence of the labeling network 150, and as a result, unsupervised domain adaptation of the labeling network 150, which estimates the label series L_1, ..., L_N corresponding to the utterance text series T_1, ..., T_N in consideration of the context of that utterance text series, becomes possible.
[Second Embodiment]
In the second embodiment, a network that identifies the domain from the context among a plurality of utterance texts T_n (a long-term intermediate feature series that takes into account the logical relationship among a plurality of pieces of partial information included in the input information series) and a network that identifies the domain from the short-term intermediate feature SF_n that takes into account the context of the words within an utterance text T_n (a short-term intermediate feature that takes into account the logical relationship of information within the partial information) are used simultaneously for adversarial learning. This makes it possible to remove the dependence on the domain even more efficiently and to learn a labeling network for the target domain with higher accuracy. In the following, the differences from the first embodiment will be mainly described, and for matters common to the first embodiment, the same reference symbols will be used to simplify the explanation.
<Functional configuration and learning process of learning device 21>
As illustrated in FIG. 6, the learning device 21 of the second embodiment has a learning unit 21a and storage units 11b, 21c, and 21d, receives labeled teacher data of the source domain and unlabeled teacher data of the target domain as inputs, and obtains and outputs parameters (model parameters) of a labeling network for the target domain by learning. A learning schedule may further be input to the learning device 21, and the learning device 21 may perform the learning process according to the learning schedule. The learning device 21 may also output parameters of a domain identification network for realizing unsupervised domain adaptation. As illustrated in FIG. 2, the learning unit 21a has, for example, a control unit 11aa, a loss function calculation unit 21ab, a gradient reversal unit 11ac, and a parameter update unit 21ad. The learning unit 21a stores each piece of data obtained in the course of processing in the storage units 11b, 21c, and 21d or in a temporary memory (not shown), reads the data as necessary, and uses it for each process.
≪Network 200≫
FIG. 7 shows a configuration example of the network 200 used by the learning device 21 in the learning process. The network 200 illustrated in FIG. 7 has a labeling network 150 (labeling model) and a domain identification model 230. Since the labeling network 150 is the same as that of the first embodiment, the description thereof will be omitted, and the domain identification model 230 will be described below.
Domain Discriminative Model 230≫
The domain identification model 230 illustrated in FIG. 7 includes short-term context domain identification networks 231-1, ..., 231-N (short-term logical relationship domain identification means) and a long-term context domain identification network 232 (long-term logical relationship domain identification means). For example, the short-term context domain identification networks 231-1, ..., 231-N are networks identical to one another (for example, networks whose parameters are identical), and each short-term context domain identification network 231-n represents the state corresponding to each n = 1, ..., N (for example, each time).
The long-term context domain identification network 232 receives the intermediate feature series LF_1, ..., LF_N (the long-term intermediate feature series) output from the long-term context understanding network 112, and obtains and outputs a series LD_1, ..., LD_N of estimated domain information. Each piece of estimated domain information LD_n (n = 1, ..., N) is estimated information of domain identification information indicating whether each utterance text T_n belongs to the source domain or to the target domain. Unlike the domain identification network 130-n of the first embodiment, the long-term context domain identification network 232 illustrated in FIG. 7 is intended to remove from the labeling network 150 the domain dependence of the context (logical relationship) spanning a plurality of utterance texts T_n, which are words or sentences, by continuously capturing the input series SF_1, ..., SF_N of short-term intermediate features (for example, by capturing the series SF_1, ..., SF_N of short-term intermediate features continuously in the time direction). However, this does not limit the present invention. For example, instead of the long-term context domain identification network 232, a plurality of long-term context domain identification networks 232-1, ..., 232-K (where K is an integer of 2 or more) may exist. For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network. In this case, each long-term context domain identification network 232-k (k = 1, ..., K) may receive the long-term intermediate feature series LF_nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and obtain and output estimated domain information LD_nk indicating whether the utterance text T_n corresponding to the received long-term intermediate feature series LF_nk belongs to the source domain or to the target domain. LF_nk (n = 1, ..., N, k = 1, ..., K) represents the intermediate feature corresponding to each n (for example, each time) output from each long-term context understanding network 112-k exemplified in the first embodiment. Even in this case, the domain dependence of the context spanning a plurality of utterance texts T_n can be removed from the labeling network 150. Here, the long-term context domain identification network 232 can be configured, for example, by a combination of a unidirectional LSTM or a bidirectional LSTM and a fully connected neural network whose activation function is a softmax function, or the like.
The short-term context domain identification networks 231-1, ..., 231-N receive the sequence of short-term intermediate features SF1, ..., SFN (a second sequence based on the intermediate feature sequence; the short-term intermediate feature sequence) output from the short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means), and obtain and output a sequence SD1, ..., SDN of estimated domain information. That is, each short-term context domain identification network 231-n (where n = 1, ..., N) receives the short-term intermediate feature SFn output from the short-term context understanding network 111-n, obtains estimated domain information SDn of the domain identification information indicating whether each utterance text Tn belongs to the source domain or to the target domain, and outputs the estimated domain information SDn. Unlike the long-term context domain identification network 232, the short-term context domain identification network 231-n estimates, for each short-term intermediate feature SFn, whether the utterance text Tn belongs to the source domain or to the target domain, and thereby aims to efficiently remove the domain dependency of a single utterance text Tn, such as a domain-dependent specific word or document. However, this does not limit the present invention. For example, instead of each short-term context domain identification network 231-n, a plurality of short-term context domain identification networks 231-n1, ..., 231-nK' (where K' is an integer of 2 or more) may be provided. For example, k' = 1, ..., K' is an index representing a layer of the short-term context domain identification network, and each short-term context domain identification network 231-nk' represents a network corresponding to each n (for example, each time). In this case, each short-term context domain identification network 231-nk' (where k' = 1, ..., K') may receive a short-term intermediate feature SFnk' (n ∈ {1, ..., N}, k' ∈ {1, ..., K'}) and obtain and output estimated domain information SDnk' indicating whether the utterance text Tn corresponding to the received short-term intermediate feature SFnk' belongs to the source domain or to the target domain. Here, SFnk' is the short-term intermediate feature corresponding to each n (for example, each time) output from each short-term context understanding network 111-nk' exemplified in the first embodiment. The short-term context domain identification network 231-n can be configured by, for example, a fully connected neural network with a softmax activation function, or a combination of such components.
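For contrast with the long-term discriminator sketched above, the following is a minimal sketch of a per-utterance (short-term) domain discriminator that looks at each SF_n independently; the module name, hidden size, and the use of logits with an external cross-entropy loss (softmax applied implicitly) are illustrative assumptions.

```python
import torch.nn as nn

class ShortTermContextDomainDiscriminator(nn.Module):
    """Sketch of a short-term context domain identification network:
    a small feed-forward classifier applied independently to each
    short-term intermediate feature SF_n, producing SD_n."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128, num_domains: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_domains),
        )

    def forward(self, sf_sequence):
        # sf_sequence: (batch, N, feature_dim) short-term intermediate features
        return self.net(sf_sequence)   # (batch, N, num_domains) per-utterance domain logits
```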
≪Learning process≫
In the learning process, labeled teacher data of the source domain (a labeled learning information sequence belonging to the source domain) and unlabeled teacher data of the target domain (an unlabeled learning information sequence belonging to the target domain) are input to the learning unit 21a of the learning device 21. The learning unit 21a performs adversarial learning on the network 200 described above, using teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain as utterance text sequences T1, ..., TN: the labeling network 150 (labeling model) is trained so that the estimation accuracy of the estimated label sequence L1, ..., LN becomes high and the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes low, while the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes high. That is, the learning unit 21a performs the adversarial learning between the labeling network 150 and the domain identification model 230 based on: a loss function (hereinafter, label prediction loss) representing the error between the estimated label sequences L1, ..., LN output from the labeling network 150 when the above teacher data are input to the network 200 as utterance text sequences T1, ..., TN and the corresponding correct label sequences of the labeled teacher data of the source domain; a loss function (hereinafter, long-term context domain identification loss) representing the error between the estimated domain information sequence LD1, ..., LDN output from the long-term context domain identification network 232 and the correct domain-label sequence identified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain; and a loss function (hereinafter, short-term context domain identification loss) representing the error between the estimated domain information sequence SD1, ..., SDN output from the short-term context domain identification networks 231-1, ..., 231-N and the correct domain-label sequence identified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain. Note that the estimated label sequences L1, ..., LN output from the labeling network 150 when the unlabeled teacher data of the target domain are input to the network 200 are not used for calculating the label prediction loss.
The learning unit 21a performs this adversarial learning using, for example, the error backpropagation method. In this case, gradient reversal layers 242-n (where n = 1, ..., N) are provided between the long-term context understanding network 112 and the long-term context domain identification network 232, gradient reversal layers 241-n (where n = 1, ..., N) are provided between the short-term context understanding networks 111-n and the short-term context domain identification networks 231-n, and the gradient is reversed at the gradient reversal layers 242-n and 241-n only during error backpropagation. By training so that the label prediction loss becomes small, the estimation accuracy of the estimated label sequence L1, ..., LN in the labeling network 150 becomes high. Also, by reversing the gradient at the gradient reversal layers 242-n only during error backpropagation and training so that the long-term context domain identification loss becomes small, adversarial learning can be performed in which the long-term context domain identification network 232 is trained so that the estimation accuracy of the estimated domain information sequence LD1, ..., LDN becomes high, while the long-term context understanding network 112 is trained to produce an intermediate feature sequence LF1, ..., LFN that lowers the estimation accuracy of the estimated domain information sequence LD1, ..., LDN. Furthermore, by reversing the gradient at the gradient reversal layers 241-n only during error backpropagation and training so that the short-term context domain identification loss becomes small, adversarial learning can be performed in which the short-term context domain identification networks 231-1, ..., 231-N are trained so that the estimation accuracy of the estimated domain information sequence SD1, ..., SDN becomes high, while the short-term context understanding networks 111-1, ..., 111-N are trained to produce a sequence of short-term intermediate features SF1, ..., SFN that lowers the estimation accuracy of the estimated domain information sequence SD1, ..., SDN. Through this adversarial learning, a labeling network 150 can be learned that estimates the estimated label sequence L1, ..., LN accurately while generating an intermediate feature sequence LF1, ..., LFN and a sequence of short-term intermediate features SF1, ..., SFN from which the domain identification model 230 cannot estimate the domain. As a result, the labeling network 150 can acquire an intermediate feature sequence LF1, ..., LFN that is effective for label prediction while suppressing the dependency on the domain of the context spanning the plurality of utterance texts Tn, and can acquire a sequence of short-term intermediate features SF1, ..., SFN that is effective for label prediction while suppressing the dependency on the domain per utterance text Tn, so that unsupervised domain adaptation can be realized with higher accuracy.
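The gradient reversal layers 241-n and 242-n described above can be realized, in a framework such as PyTorch, as an operation that is the identity in the forward pass and flips the sign of the gradient in the backward pass. The sketch below shows one common way to implement such a layer; the names GradientReversal and grad_reverse, and the optional scaling factor lambd, are illustrative choices and not taken from the original disclosure.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient
    by -lambd in the backward pass (gradient reversal layer)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back into the
        # context understanding networks; no gradient is needed for lambd.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradientReversal.apply(x, lambd)
```

In such a setup, the features LF1, ..., LFN and SF1, ..., SFN would be passed through grad_reverse before entering the respective domain discriminators, so that the discriminators learn to identify the domain while the context understanding networks receive the reversed gradient and move toward domain-invariant features.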
This learning process can be realized by optimizing (minimizing) a loss function that linearly combines the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss. The combination ratio of this linear combination may be predetermined, or may be specified by a learning schedule input to the learning unit 21a.
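One way to write this combined objective down, assuming cross-entropy losses and denoting the (schedule-dependent) combination weights by α and β, which are notational choices rather than symbols from the original text, is

$$\mathcal{L} = \mathcal{L}_{\mathrm{label}} + \alpha\,\mathcal{L}_{\mathrm{LD}} + \beta\,\mathcal{L}_{\mathrm{SD}},$$

$$\mathcal{L}_{\mathrm{label}} = -\sum_{n=1}^{N} \log p(L_n^{*} \mid T_1,\dots,T_N), \qquad \mathcal{L}_{\mathrm{LD}} = -\sum_{n=1}^{N} \log p(d_n^{*} \mid \mathrm{LF}_n), \qquad \mathcal{L}_{\mathrm{SD}} = -\sum_{n=1}^{N} \log p(d_n^{*} \mid \mathrm{SF}_n),$$

where $L_n^{*}$ is the correct label of the n-th utterance text (available only for the labeled teacher data of the source domain) and $d_n^{*}$ is the correct domain label (source or target). Because of the gradient reversal layers, the domain identification networks are updated so as to decrease $\mathcal{L}_{\mathrm{LD}}$ and $\mathcal{L}_{\mathrm{SD}}$, while the context understanding networks are updated in the direction that increases them, which is the adversarial behavior described above.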
The learning unit 21a may perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may train using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the long-term context domain identification loss and the short-term context domain identification loss in the loss function as the number of learning steps increases. Furthermore, the learning unit 21a may repeatedly carry out, while changing the combination ratio based on the learning schedule, a procedure of training with a fixed combination ratio until the learning converges, then changing the combination ratio and training again until convergence.
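As a sketch of such a schedule, the helper below ramps the weight of the domain identification losses from 0 toward a maximum value as training progresses; the specific sigmoid-shaped ramp (often used in gradient-reversal-based adversarial training) and the function name are illustrative assumptions, not the schedule prescribed by this disclosure.

```python
import math

def domain_loss_weight(step: int, total_steps: int, max_weight: float = 1.0) -> float:
    """Weight of the domain identification losses at a given training step.

    Starts near 0 (so the label prediction loss dominates early on) and
    increases smoothly toward max_weight as training progresses."""
    progress = min(step / max(total_steps, 1), 1.0)
    return max_weight * (2.0 / (1.0 + math.exp(-10.0 * progress)) - 1.0)

# Example: weights at the start, middle, and end of training.
print(domain_loss_weight(0, 10000), domain_loss_weight(5000, 10000), domain_loss_weight(10000, 10000))
```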
The domain identification model 230 may include only one of the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N.
If the domain identification model 230 includes only the long-term context domain identification network 232, the gradient reversal layers 241-1, ..., 241-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the long-term context domain identification loss. In this case as well, the combination ratio of the linear combination may be predetermined or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may also perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss and the long-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may train using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the long-term context domain identification loss in the loss function as the number of learning steps increases. Furthermore, the learning unit 21a may repeatedly carry out, while changing the combination ratio based on the learning schedule, a procedure of training with a fixed combination ratio until the learning converges, then changing the combination ratio and training again until convergence.
If the domain identification model 230 includes only the short-term context domain identification networks 231-1, ..., 231-N, the gradient reversal layers 242-1, ..., 242-N are omitted, and the learning process is performed based on a loss function that linearly combines the label prediction loss and the short-term context domain identification loss. In this case as well, the combination ratio of the linear combination may be predetermined or may be specified by the learning schedule input to the learning unit 21a. The learning unit 21a may also perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss and the short-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may train using only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the short-term context domain identification loss in the loss function as the number of learning steps increases. Furthermore, the learning unit 21a may repeatedly carry out, while changing the combination ratio based on the learning schedule, a procedure of training with a fixed combination ratio until the learning converges, then changing the combination ratio and training again until convergence.
The domain identification model 230 may also include the long-term context domain identification network 232 and only some of the short-term context domain identification networks 231-1, ..., 231-N. That is, some of the short-term context domain identification networks 231-1, ..., 231-N may be omitted. In this case, the estimated domain information SDn corresponding to an omitted short-term context domain identification network 231-n and the corresponding correct domain label are not used in the calculation of the short-term context domain identification loss.
The learning unit 21a may also prepare and train a plurality of the various domain identification models 230 and/or labeling networks 150 exemplified above, and later select, from among the labeling networks 150 obtained by the respective trainings, the labeling network 150 with the best label-sequence estimation accuracy in the target domain. The plurality of prepared domain identification models 230 include, for example, at least one of the domain identification model 230 including both the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N as described above, the domain identification model 230 including only the long-term context domain identification network 232, the domain identification model 230 including only the short-term context domain identification networks 231-1, ..., 231-N, and the domain identification model 230 of the first embodiment.
The learning process may be batch learning, mini-batch learning, or online learning.
The learning unit 21a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, stores the parameters of the short-term context domain identification networks 231-1, ..., 231-N in the storage unit 21c, and stores the parameters of the long-term context domain identification network 232 in the storage unit 21d. The learning device 21 outputs the parameters of the labeling network 150 stored in the storage unit 11b, and these parameters are used in the inference process. Normally, the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232 are not used in the inference process and therefore need not be output from the learning device 21. However, the learning device 21 may output at least one of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
The above learning process is functionally illustrated with reference to FIG. 2.
Step S21: The labeled teacher data of the source domain and the unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain, and initializes the parameters of the network 200.
Step S22: The loss function calculation unit 21ab inputs the teacher data into the network 200 as utterance text sequences T1, ..., TN, and obtains the loss function as described above.
Step S23: The parameter update unit 21ad backpropagates information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 230 and the labeling layer 120.
Step S24: The gradient reversal unit 11ac reverses the gradient of the information based on the loss function backpropagated from the domain identification model 230 and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
Step S25: The parameter update unit 21ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information, according to the error backpropagation method.
Step S26: The control unit 11aa determines whether or not the end condition is satisfied. If the end condition is not satisfied, the control unit 11aa returns the process to step S22. If the end condition is satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least one of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
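The following is a minimal sketch of one adversarial update corresponding to steps S22 to S25, reusing the grad_reverse helper and the discriminator modules sketched earlier; the labeling_network interface, the tensor shapes, and the use of cross-entropy losses are illustrative assumptions rather than the exact processing of the loss function calculation unit 21ab and the parameter update unit 21ad.

```python
import torch.nn.functional as F

def training_step(batch, labeling_network, long_disc, short_disc, optimizer, alpha, beta):
    """One adversarial update (hypothetical interfaces and shapes).

    batch: texts (batch, N, ...), labels (batch, N), per-utterance domain ids
    domains (batch, N) with 0 = source / 1 = target, and a boolean mask
    is_source (batch,) marking conversations that have correct labels."""
    texts, labels, domains, is_source = batch

    # Forward pass: short-term features SF, long-term features LF, label logits.
    sf, lf, label_logits = labeling_network(texts)

    # Label prediction loss: only labeled source-domain conversations contribute.
    label_loss = F.cross_entropy(label_logits[is_source].flatten(0, 1),
                                 labels[is_source].flatten())

    # Domain identification losses; gradients are reversed before reaching
    # the context understanding networks (steps S23 and S24).
    long_logits = long_disc(grad_reverse(lf))
    short_logits = short_disc(grad_reverse(sf))
    long_loss = F.cross_entropy(long_logits.flatten(0, 1), domains.flatten())
    short_loss = F.cross_entropy(short_logits.flatten(0, 1), domains.flatten())

    # Linear combination of the three losses (step S25 updates all parameters).
    loss = label_loss + alpha * long_loss + beta * short_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```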
Since the functional configuration and the inference process of the inference device 13 of the second embodiment are the same as those of the first embodiment, their description is omitted.
<Characteristics of the second embodiment>
In the present embodiment, a labeling network 150 includes a logical relationship understanding layer 110 that receives an utterance text sequence T1, ..., TN, obtains an intermediate feature sequence LF1, ..., LFN that takes the context of the utterance text sequence T1, ..., TN into consideration, and outputs the intermediate feature sequence LF1, ..., LFN, and a labeling layer 120 that receives the intermediate feature sequence LF1, ..., LFN, obtains an estimated label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN, and outputs the estimated label sequence L1, ..., LN. A domain identification model 230 receives the intermediate feature sequence LF1, ..., LFN, obtains estimated domain information LDn and SDn of domain identification information indicating whether each utterance text Tn included in the utterance text sequence T1, ..., TN belongs to the source domain or to the target domain, and outputs the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN. For these, the learning device 21 performs adversarial learning in which teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain are used as utterance text sequences T1, ..., TN, the labeling network 150 is trained so that the estimation accuracy of the estimated label sequence L1, ..., LN becomes high and the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes low, and the domain identification model 230 is trained so that the estimation accuracy of the estimated domain information sequences LD1, ..., LDN and SD1, ..., SDN becomes high. This enables unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN while taking the context of that sequence into consideration.
In particular, in the present embodiment, the domain identification model 230 includes at least one of the short-term context domain identification networks 231-1, ..., 231-N, which receive the sequence of short-term intermediate features SF1, ..., SFN output from the short-term context understanding networks 111-1, ..., 111-N and obtain and output the estimated domain information sequence SD1, ..., SDN, and the long-term context domain identification network 232, which receives the intermediate feature sequence LF1, ..., LFN output from the long-term context understanding network 112 and obtains and outputs the estimated domain information sequence LD1, ..., LDN. This makes it possible to efficiently remove from the labeling network 150 at least one of the domain dependency of a single utterance text Tn and the domain dependency of the context spanning the utterance texts Tn. As a result, unsupervised domain adaptation of the labeling network 150 can be performed with higher accuracy.
In addition, because the domain identification model 230 includes at least the long-term context domain identification network 232, the domain dependency of the context spanning the plurality of utterance texts Tn can be efficiently removed from the labeling network 150. As a result, unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN while taking the context of that sequence into consideration, can be performed with high accuracy.
Furthermore, when the domain identification model 230 includes both the short-term context domain identification networks 231-1, ..., 231-N and the long-term context domain identification network 232, the domain dependency of a single utterance text Tn and the domain dependency of the context spanning the plurality of utterance texts Tn can both be efficiently removed from the labeling network 150. In this case, unsupervised domain adaptation of the labeling network 150 can be performed with even higher accuracy.
<Experimental results>
The results of unsupervised domain adaptation experiments performed according to the embodiments described above are illustrated below. The experimental conditions were as follows.
(1) Each utterance text of the simulated utterance-text-sequence data is classified into one of five response-scene classes, and the label representing the corresponding response scene is estimated.
(2) The five domains other than the target domain (the new domain) are treated as source domains (already-applied domains). A labeling network trained using only source-domain data was compared with labeling networks trained according to the first embodiment and the second embodiment, and the classification performance (labeling accuracy) on utterance text sequences of the target domain was verified for each.
(3) The classification performance was verified on utterance text sequences of 60 calls for each of six target domains (online mail order, ISP, securities, local government, mobile phone, PC support), that is, 60 calls × 6 domains = 360 calls of simulated data.
(4) Each utterance text contains about 100 sentences.
FIG. 8 illustrates the experimental results. As illustrated in FIG. 8, with either the method of the first embodiment or that of the second embodiment, unsupervised domain adaptation to the target domain is possible with high accuracy by using the already existing labeled data of the source domains, without preparing labeled data for the target domain. In particular, the method of the second embodiment enables unsupervised domain adaptation with even higher accuracy, improving the classification accuracy by 3.4% on average compared with the method of training only on source-domain data.
[Hardware configuration]
The learning devices 11 and 21 and the inference device 13 in each embodiment are, for example, devices configured by a general-purpose or dedicated computer, which includes a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory), executing a predetermined program. The computer may include a single processor and memory, or a plurality of processors and memories. The program may be installed in the computer, or may be recorded in a ROM or the like in advance. Some or all of the processing units may be configured not with electronic circuitry that realizes its functional configuration by reading a program, as a CPU does, but with electronic circuitry that realizes the processing functions by itself. An electronic circuit constituting one device may include a plurality of CPUs.
FIG. 9 is a block diagram illustrating the hardware configuration of the learning devices 11 and 21 and the inference device 13 in each embodiment. As illustrated in FIG. 9, the learning devices 11 and 21 and the inference device 13 of this example include a CPU (central processing unit) 10a, an input unit 10b, an output unit 10c, a RAM (random-access memory) 10d, a ROM (read-only memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a of this example includes a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various kinds of arithmetic processing according to various programs read into the register 10ac. The input unit 10b is an input terminal to which data are input, a keyboard, a mouse, a touch panel, or the like. The output unit 10c is an output terminal from which data are output, a display, a LAN card controlled by the CPU 10a that has read a predetermined program, or the like. The RAM 10d is an SRAM (static random access memory), a DRAM (dynamic random access memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (magneto-optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that they can exchange information. According to the read OS (operating system) program, the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f into the program area 10da of the RAM 10d, and similarly writes the various data stored in the data area 10fb of the auxiliary storage device 10f into the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data have been written are stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads out the addresses stored in the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the read addresses, causes the arithmetic unit 10ab to sequentially execute the operations indicated by the program, and stores the results of the operations in the register 10ac. With such a configuration, the functional configurations of the learning devices 11 and 21 and the inference device 13 are realized.
The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media are magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.
This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network. As described above, a computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As other forms of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to the program, or may execute processing according to a received program each time a program is transferred to the computer from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in the present embodiment includes information that is used for processing by an electronic computer and that is equivalent to a program (such as data that are not direct commands to the computer but have properties that define the processing of the computer).
The present device may be configured using not only the CPU 10a but also a GPU (graphics processing unit). In each embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing may be realized by hardware.
The present invention is not limited to the above-described embodiments. For example, in each embodiment, the unlabeled teacher data of the target domain include the utterance text sequences of the target domain but do not include correct label sequences. However, at least part of the unlabeled teacher data of the target domain may include correct label sequences. In this case, the correct label sequences of the unlabeled teacher data of the target domain may or may not be used for training the networks 100 and 200.
As described above, in the embodiments above, for clarity of explanation, the sequence of a plurality of pieces of information having logical relationships is an utterance text sequence, and the label sequence is a sequence of labels representing the response scene of each utterance (for example, opening, grasping the matter, identity verification, correspondence, closing). However, this is merely an example, and other information sequences such as a sentence sequence, a programming language sequence, an audio signal sequence, or a video signal sequence may be used as a sequence of a plurality of pieces of information having logical relationships. As the label sequence, other label sequences may also be used, such as a label sequence representing situations or actions, a label sequence representing places or times, a label sequence representing parts of speech, or a label sequence representing program content. Each model such as the labeling model need not be a model based on a deep neural network, and may be another model based on, for example, a probabilistic model or a classifier. The logical relationship understanding layer 110 of each embodiment receives the utterance text sequence T1, ..., TN and obtains and outputs an intermediate feature sequence LF1, ..., LFN that takes the context (logical relationships) of the utterance text sequence T1, ..., TN into consideration; however, the logical relationship understanding layer 110 may receive the utterance text sequence T1, ..., TN and obtain and output a sequence of fewer than N or more than N intermediate features. The labeling layer 120 of each embodiment receives the intermediate feature sequence LF1, ..., LFN (the first sequence based on the intermediate feature sequence) and obtains and outputs the estimated label sequence L1, ..., LN corresponding to the utterance text sequence T1, ..., TN; however, the labeling layer 120 may receive the intermediate feature sequence LF1, ..., LFN (the first sequence based on the intermediate feature sequence) and obtain and output a sequence of fewer than N or more than N estimated labels. The domain identification models 130 and 230 of each embodiment receive the intermediate feature sequence LF1, ..., LFN (the second sequence based on the intermediate feature sequence) and obtain and output N pieces of estimated domain information; however, the domain identification models 130 and 230 may receive the intermediate feature sequence LF1, ..., LFN (the second sequence based on the intermediate feature sequence) and obtain and output fewer than N or more than N pieces of estimated domain information.
The various kinds of processing described above need not be executed only in chronological order according to the description, and may be executed in parallel or individually according to the processing capability of the device that executes the processing or as necessary. It goes without saying that other changes can be made as appropriate without departing from the spirit of the present invention.
According to the present invention, for example, the domain dependency of intermediate features can be efficiently removed for a sequence labeling problem that considers complex context. In particular, as exemplified in the second embodiment, by efficiently removing the domain dependency of intermediate features for each of the short-term and long-term logical relationships (contexts), a labeling network that is better adapted to the target domain can be learned, and the labeling accuracy in the target domain can be improved.
Unsupervised domain adaptation techniques have conventionally been studied for image recognition, but the present invention applies such a technique for the first time to the problem of estimating a label sequence corresponding to a sequence of a plurality of pieces of information while considering the logical relationships of that sequence, as in language processing. This unsupervised domain adaptation technique can, for example, significantly reduce the cost of labeling, which has been a barrier to expanding contact-center business across industries.
In particular, in the method exemplified in the second embodiment, a domain identification network per utterance text unit (for example, per call) can be designed as a mechanism that spans sentence boundaries by means of a unidirectional or bidirectional LSTM. This makes it possible to capture domain dependencies such as a particular flow of conversation specific to the industry of a contact center and to efficiently remove them from the labeling network, and as a result the label estimation accuracy in the target domain can be improved.
In the method exemplified in the second embodiment, a domain identification network per utterance text unit (for example, per call) can also be designed as a mechanism that does not span the boundaries of utterance texts. This makes it possible to capture domain dependencies caused by, for example, specific words that depend on the industry of a contact center and to efficiently remove them from the labeling network, and as a result the label estimation accuracy in the target domain can be improved.
11, 21 Learning device
11a, 21a Learning unit
13 Inference device
13a Inference unit
100, 200 Network
110 Logical relationship understanding layer
111-1, ..., 111-N Short-term context understanding network
112 Long-term context understanding network
120 Labeling layer
120-1, ..., 120-N Label prediction network
130, 230 Domain identification model
130-1, ..., 130-N Domain identification network
231-1, ..., 231-N Short-term context domain identification network
232 Long-term context domain identification network

Claims (8)

1. A learning device comprising a learning unit that performs adversarial learning on:
    a labeling model including
        logical relationship understanding means that receives an input information sequence, which is a sequence of a plurality of pieces of information having logical relationships, obtains an intermediate feature sequence that takes the logical relationships of the input information sequence into consideration, and outputs the intermediate feature sequence, and
        labeling means that receives a first sequence based on the intermediate feature sequence, obtains an estimated label sequence of a label sequence corresponding to the input information sequence, and outputs the estimated label sequence; and
    a domain identification model that receives a second sequence based on the intermediate feature sequence, obtains estimated domain information of domain identification information indicating whether each piece of partial information included in the input information sequence belongs to a source domain or to a target domain, and outputs a sequence of the estimated domain information,
    wherein the learning unit uses, as the input information sequence, teacher data including labeled teacher data, which is a labeled learning information sequence belonging to the source domain, and unlabeled teacher data, which is an unlabeled learning information sequence belonging to the target domain, trains the labeling model so that the estimation accuracy of the estimated label sequence becomes high and the estimation accuracy of the sequence of the estimated domain information becomes low, trains the domain identification model so that the estimation accuracy of the sequence of the estimated domain information becomes high, and obtains and outputs at least parameters of the labeling model.
2. The learning device according to claim 1, wherein
    the logical relationship understanding means includes
        a plurality of short-term logical relationship understanding means each of which receives a piece of partial information included in the input information sequence, obtains a short-term intermediate feature that takes the logical relationships of information within the received partial information into consideration, and outputs the short-term intermediate feature, and
        long-term logical relationship understanding means that receives a sequence of the plurality of short-term intermediate features, obtains the intermediate feature sequence, which is a long-term intermediate feature sequence that takes the logical relationships among the plurality of pieces of partial information included in the input information sequence into consideration, and outputs the intermediate feature sequence,
    the first sequence is the intermediate feature sequence, and
    the second sequence is the intermediate feature sequence.
3. The learning device according to claim 1, wherein
    the logical relationship understanding means includes
        a plurality of short-term logical relationship understanding means each of which receives a piece of partial information included in the input information sequence, obtains a short-term intermediate feature that takes the logical relationships of information within the received partial information into consideration, and outputs the short-term intermediate feature, and
        long-term logical relationship understanding means that receives a sequence of the plurality of short-term intermediate features, obtains a long-term intermediate feature sequence that takes the logical relationships among the plurality of pieces of partial information included in the input information sequence into consideration, and outputs the long-term intermediate feature sequence,
    the labeling means receives the long-term intermediate feature sequence as the first sequence, obtains an estimated label sequence of a label sequence corresponding to the information sequence, and outputs the estimated label sequence, and
    the domain identification model includes at least one of short-term logical relationship domain identification means that receives the sequence of the short-term intermediate features as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information, and long-term logical relationship domain identification means that receives the long-term intermediate feature sequence as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information.
4. The learning device according to claim 3, wherein
    the domain identification model includes long-term logical relationship domain identification means that receives the long-term intermediate feature sequence as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information.
5. The learning device according to claim 3, wherein
    the domain identification model includes short-term logical relationship domain identification means that receives the sequence of the short-term intermediate features as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information, and long-term logical relationship domain identification means that receives the long-term intermediate feature sequence as the second sequence, obtains the estimated domain information, and outputs the sequence of the estimated domain information.
6. An inference device comprising an inference unit that applies an input information sequence for inference to the labeling model specified by the parameters obtained by the learning device according to any one of claims 1 to 5, obtains an estimated label sequence of a label sequence corresponding to the input information sequence for inference, and outputs the estimated label sequence.
7. A learning method of the learning device according to any one of claims 1 to 5, or an inference method of the inference device according to claim 6.
8. A program for causing a computer to function as the learning device according to any one of claims 1 to 5 or the inference device according to claim 6.
PCT/JP2020/032505 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program WO2022044243A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/032505 WO2022044243A1 (en) 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program
JP2022545187A JP7517435B2 (en) 2020-08-28 Learning device, inference device, their methods, and programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/032505 WO2022044243A1 (en) 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program

Publications (1)

Publication Number Publication Date
WO2022044243A1 (en) 2022-03-03

Family

ID=80352930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/032505 WO2022044243A1 (en) 2020-08-28 2020-08-28 Training device, inference device, methods therefor, and program

Country Status (1)

Country Link
WO (1) WO2022044243A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024023946A1 (en) * 2022-07-26 2024-02-01 日本電信電話株式会社 Speech processing device, speech processing method, and speech processing program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PURUSHOTHAM SANJAY, CARVALHO WILKA, NILANON TANACHAT, LIU YAN: "VARIATIONAL RECURRENT ADVERSARIAL DEEP DOMAIN ADAPTATION", 5TH INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 1 January 2017 (2017-01-01), pages 1 - 15, XP055912530, Retrieved from the Internet <URL:https://openreview.net/pdf?id=rk9eAFcxg> [retrieved on 20220413] *
RYO MASUMURA, TOMOHIRO TANAKA, ATSUSHI ANDO, HOSANA KAMIYAMA, TAKANOBU OBA, YUSHI AONO: "Call scene segmentation based on neural networks with conversational contexts", IEICE TECHNICAL REPORT, NLC; IPSJ INFORMATION FUNDAMENTALS AND ACCESS TECHNOLOGIES (IFAT), vol. 2019-IFAT-133, no. 5 (NLC2018-39), 31 January 2019 (2019-01-31), JP, pages 1 - 6, XP009537301 *

Also Published As

Publication number Publication date
JPWO2022044243A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
JP6615736B2 (en) Spoken language identification apparatus, method thereof, and program
US20220180198A1 (en) Training method, storage medium, and training device
KR102181901B1 (en) Method to create animation
CN115408525B (en) Letters and interviews text classification method, device, equipment and medium based on multi-level label
CN111783873A (en) Incremental naive Bayes model-based user portrait method and device
CN112101526A (en) Knowledge distillation-based model training method and device
WO2022043782A1 (en) Automatic knowledge graph construction
US20210073645A1 (en) Learning apparatus and method, and program
WO2022044243A1 (en) Training device, inference device, methods therefor, and program
JP6230987B2 (en) Language model creation device, language model creation method, program, and recording medium
JP7517435B2 (en) Learning device, inference device, their methods, and programs
CN117476035A (en) Voice activity detection integration to improve automatic speech detection
US11816422B1 (en) System for suggesting words, phrases, or entities to complete sequences in risk control documents
WO2023017568A1 (en) Learning device, inference device, learning method, and program
WO2023224862A1 (en) Hybrid model and system for predicting quality and identifying features and entities of risk controls
JP6633556B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
US11887620B2 (en) Language model score calculation apparatus, language model generation apparatus, methods therefor, program, and recording medium
WO2021117089A1 (en) Model learning device, voice recognition device, method for same, and program
CN115700555A (en) Model training method, prediction method, device and electronic equipment
JP6988756B2 (en) Tag estimation device, tag estimation method, program
WO2020044755A1 (en) Speech recognition device, speech recognition method, and program
WO2021044606A1 (en) Learning device, estimation device, methods therefor, and program
KR102583799B1 (en) Method for detect voice activity in audio data based on anomaly detection
KR102497436B1 (en) Method for acquiring information related to a target word based on content including a voice signal
KR102509007B1 (en) Method for training speech recognition model based on the importance of tokens in sentences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20951495

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022545187

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20951495

Country of ref document: EP

Kind code of ref document: A1