JP7517435B2 - Learning device, inference device, their methods, and programs

Info

Publication number: JP7517435B2
Application number: JP2022545187A
Authority: JP
Inventors: 翔太折橋; 亮増村; 雅人澤田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Filing date: 2020-08-28
Publication date: 2024-07-17
Anticipated expiration: 2040-08-28

Description

The present invention relates to labeling technology.

In recent years, utterance sequence labeling has been proposed for the purpose of understanding conversations and discourse. It takes an utterance sequence as input and estimates, for each utterance, a label representing the response scene of the conversation or discourse (see, for example, Non-Patent Document 1).

For example, Non-Patent Document 1 discloses a deep neural network (hereinafter, a labeling network) that performs utterance sequence labeling. It takes as input an utterance text sequence obtained by speech recognition of a conversation between an operator and a customer at a contact center, and estimates for each utterance a response-scene label (one of opening, understanding the inquiry, identity verification, response, or closing).

Training a labeling network such as that of Non-Patent Document 1 requires a large amount of labeled teacher data. However, collecting a large amount of labeled teacher data for every new domain in which labeling is to be performed is difficult, because assigning labels is enormously costly. Non-Patent Document 2 proposes a method for realizing unsupervised domain adaptation, which performs labeling in a new domain using labeled data (hereinafter, labeled teacher data) of a domain to which labeling has already been applied (hereinafter, the source domain) and unlabeled data (hereinafter, unlabeled teacher data) of a domain to which labeling is newly to be applied (hereinafter, the target domain).

Non-Patent Document 1: R. Masumura, S. Yamada, T. Tanaka, A. Ando, H. Kamiyama, and Y. Aono, "Online call scene segmentation of contact center dialogues based on role aware hierarchical LSTM-RNNs," Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 811-815, 2018.
Non-Patent Document 2: Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," Proceedings of the 32nd International Conference on Machine Learning (ICML), vol. 37, pp. 1180-1189, 2015.

However, the method of Non-Patent Document 2 uses labeled teacher data of the source domain and unlabeled teacher data of the target domain to learn a labeling model that estimates the label corresponding to a single image belonging to the target domain. In other words, it performs unsupervised domain adaptation for a simple classification problem on single images. No unsupervised domain adaptation method has been established for the more complex classification problem of estimating the label sequence corresponding to a sequence of multiple pieces of information while taking into account the logical relationships within that sequence (for example, estimating a sequence of response-scene labels for an utterance text sequence).

The present invention has been made in view of the above, and aims to perform unsupervised domain adaptation of a labeling model that estimates the label sequence corresponding to a sequence of multiple pieces of information while taking into account the logical relationships within that sequence.

The invention operates on a labeling model and a domain identification model. The labeling model includes a logical relationship understanding means and a labeling means. The logical relationship understanding means receives an input information sequence, which is a sequence of multiple pieces of information having a logical relationship, obtains an intermediate feature sequence that takes into account the logical relationship of the input information sequence, and outputs the intermediate feature sequence. The labeling means receives a first sequence based on the intermediate feature sequence, obtains an estimated label sequence, which is an estimate of the label sequence corresponding to the input information sequence, and outputs the estimated label sequence. The domain identification model receives a second sequence based on the intermediate feature sequence, obtains estimated domain information, which is an estimate of domain identification information indicating whether each piece of partial information included in the input information sequence belongs to the source domain or the target domain, and outputs the sequence of estimated domain information.

For these models, a learning device uses, as the input information sequence, teacher data that includes labeled teacher data, which is a labeled learning information sequence belonging to the source domain, and unlabeled teacher data, which is an unlabeled learning information sequence belonging to the target domain. The learning device performs adversarial learning in which the labeling model is trained so that the estimation accuracy of the estimated label sequence is high and the estimation accuracy of the estimated domain information sequence is low, while the domain identification model is trained so that the estimation accuracy of the estimated domain information sequence is high. The learning device then obtains and outputs at least the parameters of the labeling model.
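The adversarial learning described above can be written, as a sketch only, in the saddle-point form familiar from Non-Patent Document 2. The symbols below are notation introduced here for illustration and do not appear in the source: \theta_u, \theta_l, and \theta_d are assumed to denote the parameters of the logical relationship understanding means, the labeling means, and the domain identification model, \mathcal{L}_{\mathrm{label}} and \mathcal{L}_{\mathrm{domain}} denote the errors of the estimated label sequence and of the estimated domain information sequence, and \lambda \ge 0 is an assumed weight balancing the two.

```latex
\begin{aligned}
(\hat{\theta}_u, \hat{\theta}_l)
  &= \operatorname*{arg\,min}_{\theta_u,\,\theta_l}\;
     \mathcal{L}_{\mathrm{label}}(\theta_u, \theta_l)
     \;-\; \lambda\, \mathcal{L}_{\mathrm{domain}}(\theta_u, \hat{\theta}_d), \\
\hat{\theta}_d
  &= \operatorname*{arg\,min}_{\theta_d}\;
     \mathcal{L}_{\mathrm{domain}}(\hat{\theta}_u, \theta_d).
\end{aligned}
```

Under this reading, the labeling model lowers its own label error while raising the domain error, and the domain identification model lowers the domain error, which matches the "high label accuracy, low domain accuracy" versus "high domain accuracy" roles stated above.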

This enables unsupervised domain adaptation of a labeling model that estimates the label sequence corresponding to a sequence of multiple pieces of information while taking into account the logical relationships within that sequence.

FIG. 1 is a block diagram illustrating the learning device of the first embodiment.
FIG. 2 is a block diagram illustrating the detailed configuration of the learning unit of the first embodiment.
FIG. 3 is a conceptual diagram illustrating the network used in the learning process of the first embodiment.
FIG. 4 is a block diagram illustrating the inference device of the first embodiment.
FIG. 5 is a conceptual diagram illustrating the trained labeling network of the first embodiment.
FIG. 6 is a block diagram illustrating the learning device of the second embodiment.
FIG. 7 is a conceptual diagram illustrating the network used in the learning process of the second embodiment.
FIG. 8 is a graph illustrating experimental results.
FIG. 9 is a block diagram illustrating the hardware configuration of the learning device and the inference device of the embodiments.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. Each embodiment shows an example of unsupervised domain adaptation of a labeling model based on a deep neural network that takes an utterance text sequence (a sequence of multiple pieces of information having a logical relationship) as input and outputs a sequence of labels (a label sequence) corresponding to response scenes (for example, opening, understanding the inquiry, identity verification, response, and closing), that is, performs sequence labeling. However, these are only examples and do not limit the present invention. The present invention can be used for unsupervised domain adaptation of any labeling model that estimates an arbitrary label sequence corresponding to an arbitrary sequence of multiple pieces of information while taking into account the logical relationships within that sequence. There is also no limitation on the logical relationship itself; it is sufficient that some relationship exists among the multiple pieces of information. Examples of logical relationships include context, dependency relationships between words, grammatical relationships in a language, and inter-frame relationships in audio or video, but these do not limit the present invention. In addition, the labeling model is not limited to a model based on a deep neural network. Any model, such as a probabilistic model or a classifier, may be used as long as it estimates and outputs a label sequence corresponding to an input sequence of information.

[First Embodiment]
A first embodiment of the present invention will be described.
<Functional configuration of the learning device 11 and learning process>
As illustrated in FIG. 1, the learning device 11 of the first embodiment has a learning unit 11a and storage units 11b and 11c. It takes labeled teacher data of the source domain and unlabeled teacher data of the target domain as input, and obtains and outputs, by learning, the parameters (model parameters) of a labeling network for the target domain. In the drawings, for brevity, the source domain is written as "SD", the target domain as "TD", and a network as "NW".

The labeled teacher data of the source domain illustrated here is a labeled learning information sequence belonging to the source domain. It includes an utterance text sequence of the source domain (an input information sequence, which is a sequence of multiple pieces of information having a logical relationship) and the correct label sequence corresponding to that utterance text sequence. The unlabeled teacher data of the target domain illustrated here is an unlabeled learning information sequence belonging to the target domain. It includes an utterance text sequence of the target domain but no correct label sequence. In addition, a learning schedule, which specifies items such as the combination ratio of the losses described later, may be input to the learning device 11, and the learning device 11 may perform the learning process according to that schedule. The learning device 11 may also output the parameters of the domain identification network used to realize unsupervised domain adaptation.

As illustrated in FIG. 2, the learning unit 11a has, for example, a control unit 11aa, a loss function calculation unit 11ab, a gradient inversion unit 11ac, and a parameter update unit 11ad. The learning unit 11a stores each piece of data obtained during processing in the storage units 11b and 11c or in a temporary memory (not shown), and reads that data as necessary for use in each process.
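The two kinds of teacher data described above can be pictured with a small data-structure sketch. This is only an illustration under assumptions: the type `TeacherSequence`, the helpers `validate` and `domain_targets`, the constants `SOURCE` and `TARGET`, and the sample utterances are all hypothetical names and values, not part of the patent.

```python
from typing import List, NamedTuple, Optional

SOURCE, TARGET = 0, 1  # hypothetical domain codes (not from the source text)


class TeacherSequence(NamedTuple):
    """One learning information sequence (all field names are hypothetical)."""
    utterances: List[str]        # utterance texts T_1, ..., T_N
    labels: Optional[List[str]]  # correct label sequence, or None if unlabeled
    domain: int                  # SOURCE or TARGET


def validate(seq: TeacherSequence) -> TeacherSequence:
    """Check that labeled data carries exactly one correct label per utterance."""
    if seq.labels is not None and len(seq.labels) != len(seq.utterances):
        raise ValueError("expected one correct label per utterance")
    return seq


def domain_targets(seq: TeacherSequence) -> List[int]:
    """Correct domain identification information, one entry per utterance."""
    return [seq.domain] * len(seq.utterances)


# Labeled teacher data of the source domain: utterances plus correct labels.
labeled_sd = validate(TeacherSequence(
    utterances=["Thank you for calling.", "How may I help you?"],
    labels=["opening", "understanding the inquiry"],
    domain=SOURCE,
))

# Unlabeled teacher data of the target domain: utterances only.
unlabeled_td = validate(TeacherSequence(
    utterances=["Hello.", "The reply speed is slow."],
    labels=None,
    domain=TARGET,
))

teacher_data = [labeled_sd, unlabeled_td]
```

Note that the domain of every utterance is known even when its scene label is not, which is what makes the domain identification side of the learning possible without any target-domain labels.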

<<Network 100>>
FIG. 3 shows an example of the configuration of the network 100 that the learning device 11 uses in the learning process. The network 100 illustrated in FIG. 3 has a labeling network 150 (labeling model) and a domain identification model 130.

<<Labeling Network 150>>
The labeling network 150 illustrated in FIG. 3 receives (takes as input) an utterance text sequence T_1, ..., T_N (an input information sequence, which is a sequence of multiple pieces of information having a logical relationship), and obtains and outputs an estimated label sequence L_1, ..., L_N, which is an estimate of the sequence of labels corresponding to the utterance text sequence T_1, ..., T_N. Here, the utterance text sequence T_1, ..., T_N is a sequence of N utterance texts T_n, where, for example, n is an index corresponding to time, n = 1, ..., N, and N is an integer of 1 or more (in general, 2 or more). An utterance text T_n may be a single word such as "Excuse me" or "Yes", or a sentence containing M(n) words T_{n,1}, ..., T_{n,M(n)}, such as "I'm having trouble because the reply speed is slow", where M(n) is an integer of 2 or more. The estimated label sequence L_1, ..., L_N illustrated here is a sequence of N estimated labels L_n (n = 1, ..., N). In this example, the estimated label L_n corresponds to the utterance text T_n and represents, for example, the response scene of T_n (for example, opening, understanding the inquiry, identity verification, response, or closing). The labeling network 150 illustrated here has a logical relationship understanding layer 110 (logical relationship understanding means) and a labeling layer 120 (labeling means).

<<Logical Relationship Understanding Layer 110>>
The logical relationship understanding layer 110 receives the utterance text sequence T_1, ..., T_N, obtains an intermediate feature sequence LF_1, ..., LF_N that takes into account the context (logical relationship) of the utterance text sequence, and outputs the intermediate feature sequence LF_1, ..., LF_N. The intermediate feature sequence is a sequence of N intermediate features LF_n (n = 1, ..., N), and each intermediate feature LF_n corresponds to the utterance text T_n. The logical relationship understanding layer 110 illustrated in FIG. 3 includes short-term context understanding networks 111-1, ..., 111-N (short-term logical relationship understanding means) and a long-term context understanding network 112 (long-term logical relationship understanding means).

For example, the short-term context understanding networks 111-1, ..., 111-N are mutually identical networks (for example, networks with identical parameters), and each short-term context understanding network 111-n represents the state corresponding to each n = 1, ..., N (for example, each time). The short-term context understanding networks 111-1, ..., 111-N illustrated here receive the utterance text sequence T_1, ..., T_N. Each short-term context understanding network 111-n (n = 1, ..., N) receives the utterance text T_n (a piece of partial information included in the input information sequence), obtains a short-term intermediate feature SF_n that takes into account the context of the words within T_n (for example, word-level short-term context), that is, a short-term intermediate feature that takes into account the logical relationship of information within the partial information, and outputs SF_n. As a result, the short-term context understanding networks 111-1, ..., 111-N output a sequence of short-term intermediate features SF_1, ..., SF_N. When an utterance text T_n contains only one word, the context of the words within T_n depends only on that one word, but the SF_n obtained in this case is still a short-term intermediate feature that takes word context into account.

However, this does not limit the present invention. For example, each short-term context understanding network 111-n may be divided into a plurality of short-term context understanding networks 111-n1, ..., 111-nK' (where K' is an integer of 2 or more). For example, k' = 1, ..., K' is an index representing a layer of the short-term context understanding network, and each short-term context understanding network 111-nk' represents the network from the input layer up to the k'-th layer. In this case, each short-term context understanding network 111-nk' outputs the short-term intermediate feature SF_nk' of the k'-th layer.

The long-term context understanding network 112 illustrated here receives the sequence of short-term intermediate features SF_1, ..., SF_N (the short-term intermediate feature sequence), obtains the intermediate feature sequence LF_1, ..., LF_N that takes into account the context among the utterance texts T_n included in the utterance text sequence T_1, ..., T_N (for example, sentence-level long-term context or long-term context spanning multiple sentences), that is, a long-term intermediate feature sequence that takes into account the logical relationships among the pieces of partial information included in the input information sequence, and outputs the intermediate feature sequence LF_1, ..., LF_N.

However, this too does not limit the present invention. For example, when each short-term context understanding network 111-nk' outputs the short-term intermediate feature SF_nk' of the k'-th layer as described above, SF_nK' may be input to the long-term context understanding network 112 as SF_n, or a plurality of the short-term intermediate features SF_nk' among SF_n1, ..., SF_nK' may be input. In addition, the long-term context understanding network 112 may be divided into a plurality of long-term context understanding networks 112-1, ..., 112-K (where K is an integer of 2 or more). For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network, and each long-term context understanding network 112-k represents the network from the input layer up to the k-th layer. In this case, each long-term context understanding network 112-k (k = 1, ..., K) may receive a sequence of any plurality of the short-term intermediate features SF_n, and obtain and output intermediate features that take into account the context among the utterance texts T_n corresponding to the received sequence. The long-term context understanding networks 112-1, ..., 112-K then output K intermediate feature sequences {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK}. Here, LF_nk (n = 1, ..., N, k = 1, ..., K) represents the intermediate feature of the k-th layer, output from the long-term context understanding network 112-k, that corresponds to each n (for example, each time).
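The two-stage data flow of the logical relationship understanding layer 110 (words within one utterance first, then utterances across the sequence) can be sketched as follows. This is a toy illustration only: the averaging functions are deliberately simple stand-ins chosen so that the example runs without any learned parameters, and they are not the networks of the patent. The names `embed_word`, `short_term_encode`, `long_term_encode`, `understand`, and `DIM` are hypothetical.

```python
from typing import List

Vector = List[float]
DIM = 4  # hypothetical feature size


def embed_word(word: str) -> Vector:
    """Toy deterministic embedding (stand-in for a learned embedding layer)."""
    code = sum(ord(ch) for ch in word)
    return [((code * (i + 1)) % 17) / 17.0 for i in range(DIM)]


def short_term_encode(utterance: List[str]) -> Vector:
    """Stand-in for network 111-n: words of one non-empty utterance T_n -> SF_n."""
    vectors = [embed_word(w) for w in utterance]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def long_term_encode(sf_seq: List[Vector]) -> List[Vector]:
    """Stand-in for network 112: SF_1..SF_N -> LF_1..LF_N.

    LF_n is the running mean of SF_1..SF_n, so each output depends on the
    current utterance and on all preceding ones (a crude "long-term context").
    """
    lf_seq: List[Vector] = []
    running = [0.0] * DIM
    for n, sf in enumerate(sf_seq, start=1):
        running = [r + s for r, s in zip(running, sf)]
        lf_seq.append([r / n for r in running])
    return lf_seq


def understand(utterances: List[List[str]]) -> List[Vector]:
    """Full layer 110 stand-in: T_1..T_N -> LF_1..LF_N."""
    sf_seq = [short_term_encode(u) for u in utterances]  # same encoder for every n
    return long_term_encode(sf_seq)


dialogue = [["hello"], ["my", "reply", "speed", "is", "slow"], ["thank", "you"]]
lf = understand(dialogue)
```

The same `short_term_encode` is applied at every position n, mirroring the statement that the networks 111-1, ..., 111-N share identical parameters, and the output has one intermediate feature per utterance.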

Here, the short-term context understanding network 111-n can be configured, for example, by combining an embedding layer that converts words into numerical values using a dictionary with a unidirectional LSTM (long short-term memory), a bidirectional LSTM, an attention mechanism, and the like (see, for example, Non-Patent Document 1). The long-term context understanding network 112 can be configured, for example, by combining a unidirectional LSTM, a bidirectional LSTM, and the like.

<<Labeling Layer 120>>
The labeling layer 120 receives the intermediate feature sequence LF_1, ..., LF_N (the first sequence based on the intermediate feature sequence), obtains the estimated label sequence L_1, ..., L_N corresponding to the utterance text sequence T_1, ..., T_N, and outputs the estimated label sequence L_1, ..., L_N. The labeling layer 120 illustrated in FIG. 3 includes label prediction networks 120-1, ..., 120-N. For example, the label prediction networks 120-1, ..., 120-N are mutually identical networks (for example, networks with identical parameters), and each label prediction network 120-n represents the state corresponding to each n = 1, ..., N (for example, each time). The label prediction network 120-n illustrated here receives the intermediate feature LF_n, obtains the estimated label L_n corresponding to the utterance text T_n, and outputs L_n. The label prediction network 120-n can be configured, for example, as a fully connected neural network with a softmax activation function.

When the long-term context understanding networks 112-1, ..., 112-K output K intermediate feature sequences {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK}, the label prediction network 120-n receives, for example, LF_nK as the intermediate feature LF_n, obtains the estimated label L_n corresponding to the utterance text T_n, and outputs L_n. Alternatively, the label prediction network 120-n may receive a plurality of intermediate features LF_nk from the intermediate feature sequences {LF_11, ..., LF_N1}, ..., {LF_1K, ..., LF_NK}, obtain the estimated label L_n corresponding to the utterance text T_n, and output L_n.
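As a minimal sketch of a label prediction network of the kind described above (a fully connected layer followed by softmax, with the same parameters applied at every position n), the following pure-Python example may help. The weight values, the two-dimensional features, the scene names in `SCENES`, and the function names are illustrative assumptions, not values or names from the patent.

```python
import math
from typing import List, Tuple

# Hypothetical English names for the five response scenes.
SCENES = ["opening", "understanding the inquiry", "identity verification",
          "response", "closing"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_predict(lf_n: List[float], weights: List[List[float]],
                  bias: List[float]) -> List[float]:
    """Stand-in for network 120-n: LF_n -> posterior probabilities over scenes."""
    logits = [sum(w * x for w, x in zip(row, lf_n)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def labeling_layer(lf_seq: List[List[float]], weights: List[List[float]],
                   bias: List[float]) -> Tuple[List[List[float]], List[str]]:
    """Apply the same (shared) parameters to every LF_n and pick the best label."""
    posteriors = [label_predict(lf_n, weights, bias) for lf_n in lf_seq]
    labels = [SCENES[max(range(len(p)), key=p.__getitem__)] for p in posteriors]
    return posteriors, labels


# Toy parameters: 5 scenes x 2 feature dimensions (assumed values).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0, 0.0]
lf_seq = [[2.0, 0.0], [0.0, 3.0], [-1.5, 0.0]]

posteriors, labels = labeling_layer(lf_seq, weights, bias)
```

Returning the posterior probabilities alongside the chosen labels is a design choice of this sketch; the posteriors are what a loss function would typically consume during learning, while the labels are what inference would output.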

<<Domain Identification Model 130>>
The domain identification model 130 illustrated in FIG. 3 receives the intermediate feature sequence LF_1, ..., LF_N (the second sequence based on the intermediate feature sequence). It obtains estimated domain information D_n (n = 1, ..., N), which is an estimate of domain identification information indicating whether each utterance text T_n (each piece of partial information included in the input information sequence) in the utterance text sequence T_1, ..., T_N belongs to the source domain or the target domain, and outputs the sequence D_1, ..., D_N of estimated domain information. The domain identification model 130 illustrated here includes N domain identification networks 130-1, ..., 130-N. For example, the domain identification networks 130-1, ..., 130-N are mutually identical networks (for example, networks with identical parameters), and each domain identification network 130-n represents the state corresponding to each n = 1, ..., N (for example, each time). For example, each domain identification network 130-n (n = 1, ..., N) receives the intermediate feature LF_n, and obtains and outputs the estimated domain information D_n.

However, this does not limit the present invention. For example, a plurality of domain identification networks 130-nk may be provided in place of each domain identification network 130-n. For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network, and each domain identification network 130-nk represents the network corresponding to each n (for example, each time). In this case, each domain identification network 130-nk may receive the intermediate feature LF_nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and use it to obtain and output estimated domain information D_nk. Here, D_nk (n = 1, ..., N, k = 1, ..., K) represents the estimated domain information corresponding to each n (for example, each time) output from the domain identification network 130-nk. The domain identification network 130-n (or the domain identification network 130-nk) can be configured, for example, as a fully connected neural network with a softmax activation function.

≪学習処理≫
学習処理では、学習装置１１の学習部１１ａに、ソースドメインのラベル付き教師データ（ソースドメインに属するラベル付きの学習用情報系列）と、ターゲットドメインのラベルなし教師データ（ターゲットドメインに属するラベルなしの学習用情報系列）とが入力される。学習部１１ａは、上述のネットワーク１００に対し、ソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとを含む教師データを発話テキスト系列Ｔ_１，…，Ｔ_Ｎとして用い、推定ラベル系列Ｌ_１，…，Ｌ_Ｎの推定精度が高く、推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎの推定精度が低くなるようにラベリングネットワーク１５０（ラベリングモデル）を学習し、推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎの推定精度が高くなるようにドメイン識別モデル１３０を学習する敵対的学習を行う。すなわち、学習部１１ａは、上述の教師データが発話テキスト系列Ｔ_１，…，Ｔ_Ｎとしてネットワーク１００に入力された際にラベリングネットワーク１５０から出力される推定ラベル系列Ｌ_１，…，Ｌ_Ｎとそれらに対応するソースドメインのラベル付き教師データの正解ラベル系列との誤差を表す損失関数（以下、ラベル予測損失）と、ドメイン識別モデル１３０から出力される推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎとソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとから特定される推定ドメイン情報の正解ラベル系列との誤差を表す損失関数（以下、ドメイン識別損失）とに基づき、ラベリングネットワーク１５０とドメイン識別モデル１３０との敵対的学習を行う。なお、ターゲットドメインのラベルなし教師データがネットワーク１００に入力された際にラベリングネットワーク１５０から出力される推定ラベル系列Ｌ_１，…，Ｌ_Ｎはラベル予測損失の算出に用いられない。 <Learning process>
In the learning process, labeled teacher data of the source domain (a labeled training information sequence belonging to the source domain) and unlabeled teacher data of the target domain (an unlabeled training information sequence belonging to the target domain) are input to the learning unit 11a of the learning device 11. Using teacher data that includes both as the spoken text sequence T_1, ..., T_N for the above-mentioned network 100, the learning unit 11a performs adversarial learning: it trains the labeling network 150 (labeling model) so that the estimation accuracy of the estimated label sequence L_1, ..., L_N is high and the estimation accuracy of the estimated domain information sequence D_1, ..., D_N is low, and it trains the domain discrimination model 130 so that the estimation accuracy of the estimated domain information sequence D_1, ..., D_N is high. That is, the learning unit 11a performs adversarial learning between the labeling network 150 and the domain discrimination model 130 based on two loss functions. The first (hereinafter, label prediction loss) represents the error between the estimated label sequence L_1, ..., L_N, which the labeling network 150 outputs when the above-mentioned teacher data is input to the network 100 as the spoken text sequence T_1, ..., T_N, and the corresponding correct label sequence of the labeled teacher data of the source domain. The second (hereinafter, domain discrimination loss) represents the error between the estimated domain information sequence D_1, ..., D_N output from the domain discrimination model 130 and the correct label sequence of the domain information, which is identified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain.
Note that the estimated label sequence L_1, ..., L_N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 100 is not used to calculate the label prediction loss.
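The two losses described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the description only requires loss functions that represent an error, so the use of cross-entropy, the 0 = source / 1 = target encoding, and all function names here are assumptions.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class (assumed error measure)."""
    return -math.log(probs[target_index])

def label_prediction_loss(label_probs, gold_labels, is_source):
    """Average error over source-domain utterances only.

    Target-domain utterances have no correct label, so their estimated
    labels are skipped when the label prediction loss is computed.
    """
    losses = [
        cross_entropy(p, y)
        for p, y, src in zip(label_probs, gold_labels, is_source)
        if src
    ]
    return sum(losses) / len(losses) if losses else 0.0

def domain_discrimination_loss(domain_probs, is_source):
    """Average error over all utterances.

    The correct domain (0 = source, 1 = target in this sketch) is known
    from which data set each utterance came from, so no manual labels are needed.
    """
    losses = [
        cross_entropy(p, 0 if src else 1)
        for p, src in zip(domain_probs, is_source)
    ]
    return sum(losses) / len(losses)
```

Here `label_probs[n]` and `domain_probs[n]` stand for the probability vectors behind L_n and D_n, and `gold_labels[n]` may be `None` for target-domain utterances.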

学習部１１ａは、例えば誤差逆伝播法を用いてこの敵対的学習を行う。この場合、論理的関係理解層１１０とドメイン識別モデル１３０との間（例えば、長期文脈理解ネットワーク１１２－ｎとドメイン識別ネットワーク１３０－ｎとの間）に勾配反転層１４１－ｎ（ただし、ｎ＝１，…，Ｎ）を設け、誤差逆伝播時にのみ勾配反転層１４１－ｎで勾配を反転させる。ここで、ラベル予測損失が小さくなるように学習を行うことで、ラベリングネットワーク１５０での推定ラベル系列Ｌ_１，…，Ｌ_Ｎの推定精度が高くなる。また、勾配反転層１４１－ｎで誤差逆伝播時にのみ勾配を反転させ、ドメイン識別損失が小さくなるように学習を行うことで、推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎの推定精度が高くなるようにドメイン識別モデル１３０を学習し、推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎの推定精度を低くする中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを得る論理的関係理解層１１０を学習する敵対的学習を行うことができる。この敵対的学習により、推定ラベル系列Ｌ_１，…，Ｌ_Ｎを正確に推定できるがドメイン識別モデル１３０にドメインを推定されない中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを生成できるラベリングネットワーク１５０を学習できる。これにより、ラベリングネットワーク１５０でドメインへの依存性を抑制しつつラベルの予測に有効な中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを獲得でき、教師なしドメイン適応を実現できる。 The learning unit 11a performs this adversarial learning using, for example, the backpropagation method. In this case, a gradient inversion layer 141-n (where n=1, ..., N) is provided between the logical relationship understanding layer 110 and the domain identification model 130 (for example, between the long-term context understanding network 112-n and the domain identification network 130-n), and the gradient is inverted in the gradient inversion layer 141-n only during backpropagation. Here, by performing learning so as to reduce the label prediction loss, the estimation accuracy of the estimated label sequence L ₁ , ..., L _N in the labeling network 150 is improved. In addition, the gradient inversion layer 141-n inverts the gradient only during backpropagation of errors, and learning is performed to reduce domain discrimination loss, thereby learning the domain discrimination model 130 so as to increase the estimation accuracy of the estimated domain information sequence D ₁ , ..., D _N , and learning the logical relationship understanding layer 110 to obtain the intermediate feature sequence LF ₁ , ..., LF _N that reduces the estimation accuracy of the estimated domain information sequence D ₁ , _... , D N. 
This adversarial learning makes it possible to train a labeling network 150 that estimates the estimated label sequence L_1, ..., L_N accurately while generating an intermediate feature sequence LF_1, ..., LF_N from which the domain discrimination model 130 cannot estimate the domain. As a result, the labeling network 150 can acquire an intermediate feature sequence LF_1, ..., LF_N that is effective for label prediction while suppressing dependency on the domain, thereby realizing unsupervised domain adaptation.
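The behavior of a gradient inversion layer can be sketched in a few lines of plain Python. The class name, the `scale` parameter, and the list-of-floats representation of features and gradients are hypothetical choices for illustration; the description only states that the gradient is inverted during error backpropagation.

```python
class GradientReversal:
    """Identity in the forward pass; flips the sign of the gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # scale is an assumed extra knob; scale=1.0 is a plain sign flip.
        self.scale = scale

    def forward(self, features):
        # LF_n passes through to the domain identification network unchanged.
        return list(features)

    def backward(self, upstream_grad):
        # Gradient arriving from the domain identification network is inverted.
        return [-self.scale * g for g in upstream_grad]


def gradient_into_feature_layer(grad_from_labeling, grad_from_domain, grl):
    """Gradient received by the logical relationship understanding layer.

    The labeling-layer gradient is passed through as it is; the
    domain-model gradient is inverted first.
    """
    reversed_grad = grl.backward(grad_from_domain)
    return [gl + gd for gl, gd in zip(grad_from_labeling, reversed_grad)]


grl = GradientReversal()
total = gradient_into_feature_layer([0.25, 0.5], [0.5, -0.25], grl)
# total == [-0.25, 0.75]
```

Because the domain model itself is updated with the non-inverted gradient, it keeps improving at domain identification while the feature layer is pushed in the opposite direction.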

この学習処理は、ラベル予測損失とドメイン識別損失とを線形結合した損失関数を最適化（最小化）することで実現できる。ラベル予測損失とドメイン識別損失との線形結合の結合比率は予め定められていてもよいし、学習部１１ａに入力される学習スケジュールで指定されてもよい。 This learning process can be realized by optimizing (minimizing) a loss function that is a linear combination of the label prediction loss and the domain discrimination loss. The combination ratio of the linear combination of the label prediction loss and the domain discrimination loss may be predetermined or may be specified by a learning schedule input to the learning unit 11a.
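As an illustration, the combined loss and the parameter updates it induces can be written as below. The symbols are assumptions introduced here, not notation from the description: \theta_f, \theta_y, \theta_d denote the parameters of the logical relationship understanding layer 110, the labeling layer 120, and the domain discrimination model 130; \alpha and \beta are the combination ratios; and plain gradient descent with learning rate \eta is assumed.

```latex
\begin{align}
\mathcal{L}(\theta_f, \theta_y, \theta_d)
  &= \alpha\, \mathcal{L}_{\mathrm{label}}(\theta_f, \theta_y)
   + \beta\, \mathcal{L}_{\mathrm{domain}}(\theta_f, \theta_d) \\
\theta_y &\leftarrow \theta_y - \eta\, \alpha\,
   \frac{\partial \mathcal{L}_{\mathrm{label}}}{\partial \theta_y} \\
\theta_d &\leftarrow \theta_d - \eta\, \beta\,
   \frac{\partial \mathcal{L}_{\mathrm{domain}}}{\partial \theta_d} \\
\theta_f &\leftarrow \theta_f - \eta \left(
   \alpha\, \frac{\partial \mathcal{L}_{\mathrm{label}}}{\partial \theta_f}
   - \beta\, \frac{\partial \mathcal{L}_{\mathrm{domain}}}{\partial \theta_f}
   \right)
\end{align}
```

The minus sign in the last line is the effect of the gradient inversion layer: the same combined loss is minimized by backpropagation, yet the logical relationship understanding layer is moved in the direction that increases the domain discrimination loss.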

学習部１１ａが学習スケジュールに基づき、学習のステップ数に応じてラベル予測損失とドメイン識別損失の結合比率を変更しながら上述の学習を行ってもよい。例えば学習部１１ａは、学習の序盤ではラベル予測損失のみを損失関数として学習を行い、学習のステップ数が増えるにつれて徐々に損失関数に占めるドメイン識別損失の割合が大きくなるようにして学習してもよい。さらに、学習部１１ａは、一定の結合比率で学習が収束するまで行い、結合比率を変更してまた収束するまで学習を行うような学習を、結合比率を学習スケジュールに基づき変更しながら繰り返し実施してもよい。The learning unit 11a may perform the above-mentioned learning while changing the combination ratio of the label prediction loss and the domain discrimination loss according to the number of learning steps based on a learning schedule. For example, the learning unit 11a may perform learning using only the label prediction loss as a loss function in the early stages of learning, and as the number of learning steps increases, the proportion of the domain discrimination loss in the loss function gradually increases. Furthermore, the learning unit 11a may perform learning until the learning converges with a constant combination ratio, change the combination ratio, and perform learning until it converges again, repeatedly performing learning while changing the combination ratio based on the learning schedule.
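One possible learning schedule of the kind described above is sketched here: label prediction loss only during an initial warm-up, then a gradually increasing weight on the domain discrimination loss. The linear ramp, the function names, and the default step counts are all assumptions; the description does not prescribe a particular schedule.

```python
def domain_loss_weight(step, warmup_steps, ramp_steps, max_weight=1.0):
    """Weight of the domain discrimination loss at a given training step."""
    if step < warmup_steps:
        return 0.0  # early training: label prediction loss only
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_weight * progress  # grows linearly, then stays at max_weight


def combined_loss(label_loss, domain_loss, step, warmup_steps=1000, ramp_steps=4000):
    """Linear combination whose ratio depends on the number of learning steps."""
    weight = domain_loss_weight(step, warmup_steps, ramp_steps)
    return label_loss + weight * domain_loss
```

With these defaults the domain term is absent for the first 1000 steps and reaches full weight at step 5000.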

また学習部１１ａが、先に例示したような様々なドメイン識別モデル１３０および／またはラベリングネットワーク１５０を複数用意して学習を行い、それぞれの学習で得られたラベリングネットワーク１５０のうち、ターゲットドメインでのラベリングネットワーク１５０によるラベル系列の推定精度が最善となるラベリングネットワーク１５０を後で選択してもよい。 The learning unit 11a may also prepare and learn a variety of domain identification models 130 and/or labeling networks 150 such as those exemplified above, and later select, from among the labeling networks 150 obtained by each learning, the labeling network 150 that provides the best estimation accuracy of the label sequence in the target domain.
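Selecting among several trained labeling networks could look like the sketch below. It assumes that a small labeled evaluation set from the target domain is available for measuring accuracy, which the description does not state; `predict`, `candidates`, and `eval_sequences` are hypothetical names.

```python
def label_accuracy(predicted, gold):
    """Fraction of utterances whose estimated label matches the correct label."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def select_best(candidates, eval_sequences, predict):
    """Return the name of the candidate with the best target-domain accuracy.

    candidates:     dict mapping a name to trained labeling-network parameters
    eval_sequences: list of (utterance_sequence, gold_label_sequence) pairs
    predict:        callable (params, utterance_sequence) -> label sequence
    """
    def score(params):
        accuracies = [
            label_accuracy(predict(params, utterances), gold)
            for utterances, gold in eval_sequences
        ]
        return sum(accuracies) / len(accuracies)

    return max(candidates, key=lambda name: score(candidates[name]))
```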

学習処理はバッチ学習であってもよいし、ミニバッチ学習であってもよいし、オンライン学習であってもよい。The training process may be batch training, mini-batch training, or online training.

学習部１１ａは、上述の学習によって得たラベリングネットワーク１５０のパラメータを記憶部１１ｂに格納し、ドメイン識別モデル１３０のパラメータ（ドメイン識別ネットワーク１３０－１，…，１３０－Ｎのパラメータ）を記憶部１１ｃに格納する。学習装置１１は、記憶部１１ｂに格納されたラベリングネットワーク１５０のパラメータを出力する。ラベリングネットワーク１５０のパラメータは後述の推論処理に用いられる。通常、ドメイン識別モデル１３０のパラメータは推論処理には用いられないため、学習装置１１から出力されなくてもよい。しかし、学習装置１１がドメイン識別モデル１３０のパラメータ（ドメイン識別ネットワーク１３０－１，…，１３０－Ｎのパラメータ）の少なくとも何れかを出力してもよい。 The learning unit 11a stores the parameters of the labeling network 150 obtained by the above-mentioned learning in the memory unit 11b, and stores the parameters of the domain identification model 130 (parameters of the domain identification networks 130-1, ..., 130-N) in the memory unit 11c. The learning device 11 outputs the parameters of the labeling network 150 stored in the memory unit 11b. The parameters of the labeling network 150 are used in the inference process described below. Normally, the parameters of the domain identification model 130 are not used in the inference process, and therefore do not need to be output from the learning device 11. However, the learning device 11 may output at least one of the parameters of the domain identification model 130 (parameters of the domain identification networks 130-1, ..., 130-N).

図２を用いて上述の学習処理を機能的に例示する。
ステップＳ１１：ソースドメインのラベル付き教師データと、ターゲットドメインのラベルなし教師データとが制御部１１ａａに入力される。制御部１１ａａは、ソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとを含む教師データを生成する。また制御部１１ａａは、ネットワーク１００のパラメータを初期化する。
ステップＳ１２：損失関数計算部１１ａｂは、教師データを発話テキスト系列Ｔ_１，…，Ｔ_Ｎとしてネットワーク１００に入力し、ラベル予測損失とドメイン識別損失を得、それらを線形結合した損失関数を得る。
ステップＳ１３：パラメータ更新部１１ａｄは、誤差逆伝播法に従い、損失関数に基づく情報を逆伝搬し、ドメイン識別モデル１３０およびラベリング層１２０のパラメータを更新する。
ステップＳ１４：勾配反転部１１ａｃは、ドメイン識別モデル１３０から逆伝搬された損失関数に基づく情報の勾配を反転させて論理的関係理解層１１０に逆伝搬させる。ラベリング層１２０から逆伝搬された損失関数に基づく情報は、そのまま論理的関係理解層１１０に逆伝搬される。
ステップＳ１５：パラメータ更新部１１ａｄは、誤差逆伝播法に従い、逆伝搬された情報を用いて論理的関係理解層１１０のパラメータを更新する。
ステップＳ１６：制御部１１ａａは、終了条件（例えば、パラメータの更新回数が所定数に達したなどの条件）を満たしたか否かを判定する。ここで終了条件を満たしていない場合、制御部１１ａａは処理をステップＳ１２に戻す。一方、終了条件を満たしている場合、制御部１１ａａはラベリングネットワーク１５０のパラメータを出力する。必要に応じて制御部１１ａａがドメイン識別ネットワーク１３０－１，…，１３０－Ｎのパラメータの少なくとも何れかも出力してもよい。 The above-mentioned learning process will be functionally illustrated with reference to FIG. 2.
Step S11: Labeled teacher data of the source domain and unlabeled teacher data of the target domain are input to the control unit 11aa. The control unit 11aa generates teacher data including the labeled teacher data of the source domain and the unlabeled teacher data of the target domain. The control unit 11aa also initializes parameters of the network 100.
Step S12: The loss function calculation unit 11ab inputs the teacher data as the spoken text sequence T_1, ..., T_N to the network 100, obtains the label prediction loss and the domain discrimination loss, and obtains a loss function by linearly combining them.
Step S13: The parameter update unit 11ad backpropagates information based on the loss function according to the backpropagation method, and updates the parameters of the domain identification model 130 and the labeling layer 120.
Step S14: The gradient inversion unit 11ac inverts the gradient of the information based on the loss function backpropagated from the domain discrimination model 130 and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
Step S15: The parameter update unit 11ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information according to the backpropagation method.
Step S16: The control unit 11aa judges whether or not a termination condition (for example, a condition that the number of parameter updates has reached a predetermined number) has been satisfied. If the termination condition has not been satisfied, the control unit 11aa returns the process to step S12. On the other hand, if the termination condition has been satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least one of the parameters of the domain identification networks 130-1, ..., 130-N.
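Steps S11 to S16 can be traced on a deliberately tiny model. In this sketch each "layer" is a single scalar weight (`w` for the logical relationship understanding layer, `a` for the labeling layer, `c` for the domain discrimination model), labels and domains are binary, the loss is sigmoid cross-entropy, and gradients are written out by hand. All of these are simplifying assumptions made only so that the control flow is runnable; none of them comes from the description.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, num_updates=200, lr=0.1, beta=1.0):
    """data: list of (x, label_or_None, domain); domain 0 = source, 1 = target."""
    # S11: the mixed source/target data is given; initialize the parameters.
    w, a, c = 0.5, 0.5, 0.5
    for _ in range(num_updates):  # S16: stop after a fixed number of updates
        grad_w = grad_a = grad_c = 0.0
        for x, label, domain in data:
            # S12: forward pass; sigmoid cross-entropy has logit gradient (p - y).
            f = w * x                      # intermediate feature
            p_label = sigmoid(a * f)
            p_domain = sigmoid(c * f)
            # Unlabeled target data contributes nothing to the label loss.
            g_label = (p_label - label) if label is not None else 0.0
            g_domain = beta * (p_domain - domain)
            # S13: gradients for the labeling layer and the domain model.
            grad_a += g_label * f
            grad_c += g_domain * f
            # S14: the domain gradient is inverted; the label gradient is not.
            grad_f = g_label * a + (-1.0) * g_domain * c
            grad_w += grad_f * x
        # S13 and S15: gradient-descent updates of all parameters.
        a -= lr * grad_a
        c -= lr * grad_c
        w -= lr * grad_w
    return w, a, c

toy_data = [(1.0, 1, 0), (-1.0, 0, 0), (0.8, None, 1), (-0.7, None, 1)]
w, a, c = train(toy_data)
```

In a real system the three scalars would be the parameter sets of the corresponding networks and the hand-written gradients would come from automatic differentiation.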

＜推論装置１３の機能構成および推論処理＞
図４に例示するように、第１実施形態の推論装置１３は、推論部１３ａおよび記憶部１３ｂを有する。記憶部１３ｂには上述のように得られたラベリングネットワーク１５０のパラメータが格納される。 <Functional configuration and inference processing of the inference device 13>
As illustrated in FIG. 4, the inference device 13 of the first embodiment includes an inference unit 13a and a storage unit 13b. The storage unit 13b stores the parameters of the labeling network 150 obtained as described above.

≪推論処理≫
推論処理では、推論部１３ａに推論用の発話テキスト系列（入力情報系列）が入力される。推論部１３ａは、記憶部１３ｂに格納されたパラメータで特定されるラベリングネットワーク１５０（ラベリングモデル）に対し、推論用の発話テキスト系列を適用し、推論用の発話テキスト系列に対応するラベル系列の推定ラベル系列を得、推定ラベル系列を出力する。例えば、図５に例示するラベリングネットワーク１５０の場合、推論部１３ａは、推論用の発話テキスト系列Ｔ_１，…，Ｔ_Ｎを論理的関係理解層１１０に入力して推論用の発話テキスト系列Ｔ_１，…，Ｔ_Ｎに対応する中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを得る。例えば、推論部１３ａは、推論用の発話テキスト系列Ｔ_１，…，Ｔ_Ｎを短期文脈理解ネットワーク１１１－１，…，１１１－Ｎにそれぞれ入力し、短期中間特徴の系列ＳＦ_１，…，ＳＦ_Ｎを得、短期中間特徴の系列ＳＦ_１，…，ＳＦ_Ｎを長期文脈理解ネットワーク１１２に入力し、中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを得る。さらに推論部１３ａは、中間特徴系列ＬＦ_１，…，ＬＦ_Ｎをラベリング層１２０に入力して発話テキスト系列Ｔ_１，…，Ｔ_Ｎに対応する推定ラベル系列Ｌ_１，…，Ｌ_Ｎを得て出力する。 Inference processing
In the inference process, an utterance text sequence for inference (input information sequence) is input to the inference unit 13a. The inference unit 13a applies the utterance text sequence for inference to the labeling network 150 (labeling model) specified by the parameters stored in the storage unit 13b, obtains an estimated label sequence, that is, an estimate of the label sequence corresponding to the utterance text sequence for inference, and outputs the estimated label sequence. For example, in the case of the labeling network 150 illustrated in FIG. 5, the inference unit 13a inputs the utterance text sequence for inference T_1, ..., T_N to the logical relationship understanding layer 110 to obtain the corresponding intermediate feature sequence LF_1, ..., LF_N. For example, the inference unit 13a inputs the utterance texts T_1, ..., T_N to the short-term context understanding networks 111-1, ..., 111-N, respectively, to obtain the sequence of short-term intermediate features SF_1, ..., SF_N, and inputs that sequence to the long-term context understanding network 112 to obtain the intermediate feature sequence LF_1, ..., LF_N. The inference unit 13a then inputs the intermediate feature sequence LF_1, ..., LF_N to the labeling layer 120 to obtain and output the estimated label sequence L_1, ..., L_N corresponding to the utterance text sequence T_1, ..., T_N.
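The data flow of the inference process can be sketched as a three-stage pipeline. The three callables and the toy stand-ins below are hypothetical placeholders for the trained networks; only the order of the stages follows the description. The domain discrimination model does not appear because it is not used at inference time.

```python
def infer_labels(utterances, short_term_net, long_term_net, labeling_layer):
    """T_1..T_N -> SF_1..SF_N -> LF_1..LF_N -> L_1..L_N."""
    # The same short-term network is applied to each utterance on its own.
    sf = [short_term_net(t) for t in utterances]
    # The long-term network sees the whole sequence, so LF_n can depend on context.
    lf = long_term_net(sf)
    # The labeling layer maps each intermediate feature to a label.
    return labeling_layer(lf)


# Toy stand-ins (not real networks) so the sketch runs end to end.
def toy_short_term(utterance):
    return len(utterance.split())          # word count as a 1-D "feature"

def toy_long_term(sf):
    total, out = 0, []
    for value in sf:                       # running sum as a crude "context"
        total += value
        out.append(total)
    return out

def toy_labeling(lf):
    return ["opening" if value <= 2 else "closing" for value in lf]


labels = infer_labels(
    ["hello there", "thank you bye now"],
    toy_short_term, toy_long_term, toy_labeling,
)
# labels == ["opening", "closing"]
```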

＜第１実施形態の特徴＞
本実施形態では、発話テキスト系列Ｔ_１，…，Ｔ_Ｎを受け取り、発話テキスト系列Ｔ_１，…，Ｔ_Ｎの文脈を考慮した中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを得、当該中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを出力する論理的関係理解層１１０と、中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを受け取り、発話テキスト系列Ｔ_１，…，Ｔ_Ｎに対応する推定ラベル系列Ｌ_１，…，Ｌ_Ｎを得、当該推定ラベル系列Ｌ_１，…，Ｌ_Ｎを出力するラベリング層１２０とを含むラベリングネットワーク１５０と、中間特徴系列ＬＦ_１，…，ＬＦ_Ｎを受け取り、発話テキスト系列Ｔ_１，…，Ｔ_Ｎに含まれる各発話テキストＴ_ｎがソースドメインに属するかターゲットドメインに属するかを表すドメイン識別情報の推定ドメイン情報Ｄ_ｎを得、当該推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎを出力するドメイン識別モデル１３０とに対し、学習装置１１が、ソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとを含む教師データを発話テキスト系列Ｔ_１，…，Ｔ_Ｎとして用い、推定ラベル系列Ｌ_１，…，Ｌ_Ｎの推定精度が高く、推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎの推定精度が低くなるようにラベリングネットワーク１５０を学習し、推定ドメイン情報の系列Ｄ_１，…，Ｄ_Ｎの推定精度が高くなるようにドメイン識別モデル１３０を学習する敵対的学習を行った。これにより、ラベリングネットワーク１５０のドメイン依存性を低減させ、結果として、発話テキスト系列Ｔ_１，…，Ｔ_Ｎの文脈を考慮して当該発話テキスト系列に対応するラベル系列Ｌ_１，…，Ｌ_Ｎを推定するラベリングネットワーク１５０の教師なしドメイン適応が可能になる。 <Features of the First Embodiment>
In this embodiment, the learning device 11 performed adversarial learning on a labeling network 150 and a domain discrimination model 130. The labeling network 150 includes a logical relationship understanding layer 110, which receives a spoken text sequence T_1, ..., T_N, obtains an intermediate feature sequence LF_1, ..., LF_N that takes the context of the spoken text sequence T_1, ..., T_N into account, and outputs that intermediate feature sequence, and a labeling layer 120, which receives the intermediate feature sequence LF_1, ..., LF_N, obtains an estimated label sequence L_1, ..., L_N corresponding to the spoken text sequence T_1, ..., T_N, and outputs that estimated label sequence. The domain discrimination model 130 receives the intermediate feature sequence LF_1, ..., LF_N, obtains estimated domain information D_n, which is an estimate of domain identification information indicating whether each spoken text T_n included in the spoken text sequence T_1, ..., T_N belongs to the source domain or the target domain, and outputs the estimated domain information sequence D_1, ..., D_N. In the adversarial learning, teacher data including labeled teacher data of the source domain and unlabeled teacher data of the target domain is used as the spoken text sequence T_1, ..., T_N; the labeling network 150 is trained so that the estimation accuracy of the estimated label sequence L_1, ..., L_N is high and the estimation accuracy of the estimated domain information sequence D_1, ..., D_N is low, and the domain discrimination model 130 is trained so that the estimation accuracy of the estimated domain information sequence D_1, ..., D_N is high.
This reduces the domain dependency of the labeling network 150 and, as a result, enables unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L_1, ..., L_N corresponding to the spoken text sequence T_1, ..., T_N by taking the context of that spoken text sequence into account.

［第２実施形態］
第２実施形態では、複数の発話テキストＴ_ｎ間の文脈（入力情報系列に含まれる複数の部分情報間での論理的関係を考慮した長期中間特徴系列）からドメインを識別するネットワークと、発話テキストＴ_ｎ内での単語の文脈を考慮した短期中間特徴ＳＦ_ｎ（部分情報内での情報の論理的関係を考慮した短期中間特徴）からドメインを識別するネットワークと、を同時に用いて敵対的に学習させる。これにより、ドメインへの依存性をさらに効率的に除去し、より高い精度でターゲットドメインのラベリングネットワークを学習できる。以下では、第１実施形態との相違点を中心に説明し、第１実施形態と共通する事項については、同じ参照記号を引用して説明を簡略化する。 [Second embodiment]
In the second embodiment, adversarial learning simultaneously uses two networks: a network that identifies the domain from the context across a plurality of spoken texts T_n (a long-term intermediate feature sequence that takes into account the logical relationship between a plurality of pieces of partial information included in an input information sequence), and a network that identifies the domain from the short-term intermediate feature SF_n, which takes into account the context of words within a spoken text T_n (a short-term intermediate feature that takes into account the logical relationship of information within a piece of partial information). This removes domain dependency more efficiently and makes it possible to train a labeling network for the target domain with higher accuracy. The following description focuses on the differences from the first embodiment; for matters common to the first embodiment, the same reference symbols are used and the description is simplified.

＜学習装置２１の機能構成および学習処理＞
図６に例示するように、第２実施形態の学習装置２１は、学習部２１ａ、および記憶部１１ｂ，２１ｃ，２１ｄを有し、ソースドメインのラベル付き教師データと、ターゲットドメインのラベルなし教師データを入力とし、学習によってターゲットドメインのラベリングネットワークのパラメータ（モデルパラメータ）を得て出力する。さらに学習スケジュールが学習装置２１に入力され、学習装置２１が当該学習スケジュールに従って学習処理を行ってもよい。また学習装置２１が教師なしドメイン適応を実現するためのドメイン識別ネットワークのパラメータを出力してもよい。また図２に例示するように、学習部２１ａは、例えば、制御部１１ａａ、損失関数計算部２１ａｂ、勾配反転部１１ａｃ、およびパラメータ更新部２１ａｄを有する。また、学習部２１ａは処理過程で得られた各データを逐一、記憶部１１ｂ，２１ｃ，２１ｄまたは図示していない一時メモリに格納する。学習部２１ａは必要に応じて当該データを読み込み、各処理に利用する。 <Functional configuration of the learning device 21 and learning process>
As illustrated in FIG. 6, the learning device 21 of the second embodiment has a learning unit 21a and storage units 11b, 21c, and 21d, and receives labeled teacher data of the source domain and unlabeled teacher data of the target domain as input, and obtains and outputs parameters (model parameters) of the labeling network of the target domain by learning. Furthermore, a learning schedule may be input to the learning device 21, and the learning device 21 may perform a learning process according to the learning schedule. The learning device 21 may also output parameters of a domain identification network for realizing unsupervised domain adaptation. As illustrated in FIG. 2, the learning unit 21a has, for example, a control unit 11aa, a loss function calculation unit 21ab, a gradient inversion unit 11ac, and a parameter update unit 21ad. The learning unit 21a stores each piece of data obtained during the processing in the storage units 11b, 21c, and 21d or a temporary memory not illustrated. The learning unit 21a reads the data as necessary and uses it for each process.

≪ネットワーク２００≫
図７に学習装置２１が学習処理で用いるネットワーク２００の構成例を示す。図７に例示するネットワーク２００は、ラベリングネットワーク１５０（ラベリングモデル）およびドメイン識別モデル２３０を有する。ラベリングネットワーク１５０は第１実施形態と同一であるため説明を省略し、以下ではドメイン識別モデル２３０の説明を行う。 <Network 200>
Fig. 7 shows an example of the configuration of a network 200 used in the learning process by the learning device 21. The network 200 shown in Fig. 7 has a labeling network 150 (labeling model) and a domain discrimination model 230. The labeling network 150 is the same as that in the first embodiment, so a description thereof will be omitted, and the domain discrimination model 230 will be described below.

≪ドメイン識別モデル２３０≫
図７に例示するドメイン識別モデル２３０は、短期文脈ドメイン識別ネットワーク２３１－１，…，２３１－Ｎ（短期論理的関係ドメイン識別手段）、および長期文脈ドメイン識別ネットワーク２３２（長期論理的関係ドメイン識別手段）を含む。例えば、短期文脈ドメイン識別ネットワーク２３１－１，…，２３１－Ｎは互いに同一なネットワーク（例えば、パラメータが互いに同一なネットワーク）であり、各短期文脈ドメイン識別ネットワーク２３１－ｎは各ｎ＝１，…，Ｎ（例えば、各時間）に対応する状態を表す。 <<Domain Identification Model 230>>
The domain discrimination model 230 illustrated in FIG. 7 includes short-term context domain identification networks 231-1, ..., 231-N (short-term logical relation domain identification means) and a long-term context domain identification network 232 (long-term logical relation domain identification means). For example, the short-term context domain identification networks 231-1, ..., 231-N are identical networks (e.g., networks with identical parameters), and each short-term context domain identification network 231-n represents a state corresponding to each n=1, ..., N (e.g., each time).

長期文脈ドメイン識別ネットワーク２３２は、長期文脈理解ネットワーク１１２から出力された中間特徴系列ＬＦ_１，…，ＬＦ_Ｎ（長期中間特徴系列）を受け取り、推定ドメイン情報の系列ＬＤ_１，…，ＬＤ_Ｎを得て出力する。ただし、各推定ドメイン情報ＬＤ_ｎ（ただし、ｎ＝１，…，Ｎ）は、各発話テキストＴ_ｎがソースドメインに属するかターゲットドメインに属するかを表すドメイン識別情報の推定情報である。図７に例示する長期文脈ドメイン識別ネットワーク２３２は、第１実施形態のドメイン識別ネットワーク１３０－ｎと異なり、入力された短期中間特徴の系列ＳＦ_１，…，ＳＦ_Ｎを連続的に捉えることで（例えば、短期中間特徴の系列ＳＦ_１，…，ＳＦ_Ｎを時間方向に連続的に捉えることで）、単語や文章である複数の発話テキストＴ_ｎを跨いだ文脈（論理的関係）のドメイン依存性をラベリングネットワーク１５０から取り除くことを目的とする。しかし、これは本発明を限定するものではない。例えば、長期文脈ドメイン識別ネットワーク２３２に代えて、複数の長期文脈ドメイン識別ネットワーク２３２－１，…，２３２－Ｋ（ただし、Ｋは２以上の整数）が存在してもよい。例えば、ｋ＝１，…，Ｋは長期文脈理解ネットワークの層を表すインデックスである。この場合、各長期文脈ドメイン識別ネットワーク２３２－ｋ（ただし、ｋ＝１，…，Ｋ）は、長期中間特徴系列ＬＦ_ｎｋ（ｎ∈｛１，…，Ｎ｝，ｋ∈｛１，…，Ｋ｝）を受け取り、受け取った長期中間特徴系列ＬＦ_ｎｋに対応する発話テキストＴ_ｎがソースドメインに属するかターゲットドメインに属するかを表す推定ドメイン情報ＬＤ_ｎｋを得て出力してもよい。ＬＦ_ｎｋ（ただし、ｎ＝１，…，Ｎ，ｋ＝１，…，Ｋ）は、第１実施形態で例示した各長期文脈理解ネットワーク１１２－ｋから出力される各ｎ（例えば、各時間）に対応する中間特徴を表す。この場合であっても、複数の発話テキストＴ_ｎを跨いだ文脈のドメイン依存性をラベリングネットワーク１５０から取り除くことができる。ここで長期文脈ドメイン識別ネットワーク２３２は、例えば、単方向LSTMや双方向LSTMと、ソフトマックス関数を活性化関数とする全結合ニューラルネットワーク等の組み合わせによって構成できる。 The long-term context domain identification network 232 receives the intermediate feature sequence LF ₁ , ..., LF _N (long-term intermediate feature sequence) output from the long-term context understanding network 112, obtains and outputs a sequence of estimated domain information LD ₁ , ..., LD _N. Here, each piece of estimated domain information LD _n (where n=1, ..., N) is estimated information of domain identification information indicating whether each spoken text T _n belongs to the source domain or the target _domain . Unlike the domain identification network 130-n of the first embodiment, the long-term context domain identification network 232 illustrated in FIG. 
7 aims to remove from the labeling network 150 the domain dependency of the context (logical relationship) that spans multiple spoken texts T_n, each of which is a word or a sentence, by capturing the input sequence of short-term intermediate features SF_1, ..., SF_N continuously (for example, continuously in the time direction). However, this does not limit the present invention. For example, instead of the long-term context domain identification network 232, a plurality of long-term context domain identification networks 232-1, ..., 232-K (where K is an integer equal to or greater than 2) may be present. For example, k = 1, ..., K is an index representing a layer of the long-term context understanding network. In this case, each long-term context domain identification network 232-k (where k = 1, ..., K) may receive a long-term intermediate feature sequence LF_nk (n ∈ {1, ..., N}, k ∈ {1, ..., K}) and obtain and output estimated domain information LD_nk indicating whether the spoken text T_n corresponding to the received LF_nk belongs to the source domain or the target domain. Here, LF_nk (where n = 1, ..., N, k = 1, ..., K) represents the intermediate feature corresponding to each n (for example, each time) output from each long-term context understanding network 112-k exemplified in the first embodiment. Even in this case, the domain dependency of the context across multiple spoken texts T_n can be removed from the labeling network 150. The long-term context domain identification network 232 can be configured by, for example, a combination of a unidirectional LSTM or bidirectional LSTM and a fully connected neural network using a softmax function as an activation function.

短期文脈ドメイン識別ネットワーク２３１－１，…，２３１－Ｎは、短期文脈理解ネットワーク１１１－１，…，１１１－Ｎ（短期論理的関係理解手段）から出力された短期中間特徴の系列ＳＦ_１，…，ＳＦ_Ｎ（中間特徴系列に基づく第２系列、短期中間特徴系列）を受け取り、推定ドメイン情報の系列ＳＤ_１，…，ＳＤ_Ｎを得て出力する。すなわち、各短期文脈ドメイン識別ネットワーク２３１－ｎ（ただし、ｎ＝１，…，Ｎ）は、短期文脈理解ネットワーク１１１－ｎから出力された短期中間特徴ＳＦ_ｎを受け取り、各発話テキストＴ_ｎがソースドメインに属するかターゲットドメインに属するかを表すドメイン識別情報の推定ドメイン情報ＳＤ_ｎを得、当該推定ドメイン情報ＳＤ_ｎを出力する。短期文脈ドメイン識別ネットワーク２３１－ｎは、長期文脈ドメイン識別ネットワーク２３２と異なり、短期中間特徴ＳＦ_ｎごとに発話テキストＴ_ｎがソースドメインに属するかターゲットドメインに属するかを推定することで、ドメイン依存性のある特定の単語や文書などの発話テキストＴ_ｎ単体のドメイン依存性を効率的に取り除くことを目的とする。ただし、これは本発明を限定するものではない。例えば、各短期文脈ドメイン識別ネットワーク２３１－ｎに代えて、複数の短期文脈ドメイン識別ネットワーク２３１－ｎ１，…，２３２－ｎＫ’（ただし、Ｋ’は２以上の整数）が存在してもよい。例えば、ｋ’＝１，…，Ｋ’は短期文脈ドメイン識別ネットワークの層を表すインデックスであり、各短期文脈ドメイン識別ネットワーク２３１－ｎｋ’は各ｎ（例えば、各時間）に対応するネットワークを表す。この場合、各短期文脈ドメイン識別ネットワーク２３１－ｎｋ’（ただし、ｋ’＝１，…，Ｋ’）は、短期中間特徴ＳＦ_ｎｋ’（ｎ∈｛１，…，Ｎ｝，ｋ’∈｛１，…，Ｋ’｝）を受け取り、受け取った短期中間特徴ＳＦ_ｎｋ’に対応する発話テキストＴ_ｎがソースドメインに属するかターゲットドメインに属するかを表す推定ドメイン情報ＳＤ_ｎｋ’を得て出力してもよい。ただし、ＳＦ_ｎｋ’は第１実施形態で例示した各短期文脈理解ネットワーク１１１－ｎｋ’から出力される各ｎ（例えば、各時間）に対応する短期中間特徴である。ここで、短期文脈ドメイン識別ネットワーク２３１－ｎは、例えば、ソフトマックス関数を活性化関数とする全結合ニューラルネットワーク等の組み合わせによって構成できる。 The short-term context domain identification networks 231-1, ..., 231-N receive the series of short-term intermediate features SF ₁ , ..., SF _N (a second series based on the intermediate feature series, a short-term intermediate feature series) output from the short-term context understanding networks 111-1, ..., 111-N (short-term logical relation understanding means), and obtain and output a series of estimated domain information SD ₁ , ..., SD _N. That is, each short-term context domain identification network 231-n (where n=1, ..., N) receives the short-term intermediate features SF _n output from the short-term context understanding network 111-n, obtains estimated domain information SD _n of domain identification information indicating whether each spoken text T _n belongs to the source domain or the target domain, and outputs the estimated domain information SD _n . 
Unlike the long-term context domain identification network 232, the short-term context domain identification network 231-n aims to efficiently remove the domain dependency of an individual spoken text T_n, such as a specific domain-dependent word or document, by estimating for each short-term intermediate feature SF_n whether the spoken text T_n belongs to the source domain or the target domain. However, this does not limit the present invention. For example, instead of each short-term context domain identification network 231-n, there may be a plurality of short-term context domain identification networks 231-n1, ..., 232-nK' (where K' is an integer of 2 or more). For example, k' = 1, ..., K' is an index representing a layer of the short-term context domain identification network, and each short-term context domain identification network 231-nk' represents a network corresponding to each n (e.g., each time). In this case, each short-term context domain identification network 231-nk' (where k' = 1, ..., K') may receive a short-term intermediate feature SF_nk' (n ∈ {1, ..., N}, k' ∈ {1, ..., K'}) and obtain and output estimated domain information SD_nk' indicating whether the spoken text T_n corresponding to the received SF_nk' belongs to the source domain or the target domain. Here, SF_nk' is the short-term intermediate feature corresponding to each n (e.g., each time) output from each short-term context understanding network 111-nk' exemplified in the first embodiment. The short-term context domain identification network 231-n can be configured, for example, by a combination of components such as a fully connected neural network that uses a softmax function as its activation function.
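A short-term context domain identification network of the simplest form mentioned above, one fully connected layer followed by a softmax, can be sketched in plain Python. The function names, the list-based weight layout, and the two-class output order [P(source), P(target)] are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def short_term_domain_head(sf_n, weights, bias):
    """Map one short-term intermediate feature SF_n to [P(source), P(target)].

    weights has one row per output class; each row has len(sf_n) entries.
    """
    logits = [
        sum(w * x for w, x in zip(row, sf_n)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def short_term_domain_sequence(sf_sequence, weights, bias):
    """Apply the same head (shared parameters) to every SF_n to get SD_1..SD_N."""
    return [short_term_domain_head(sf_n, weights, bias) for sf_n in sf_sequence]
```

Because each SF_n is classified on its own, this head can only pick up domain cues present inside a single utterance, which is the contrast with the sequence-level long-term network.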

≪学習処理≫
学習処理では、学習装置２１の学習部２１ａに、ソースドメインのラベル付き教師データ（ソースドメインに属するラベル付きの学習用情報系列）と、ターゲットドメインのラベルなし教師データ（ターゲットドメインに属するラベルなしの学習用情報系列）とが入力される。学習部２１ａは、上述のネットワーク２００に対し、ソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとを含む教師データを発話テキスト系列Ｔ_１，…，Ｔ_Ｎとして用い、推定ラベル系列Ｌ_１，…，Ｌ_Ｎの推定精度が高く、推定ドメイン情報の系列ＬＤ_１，…，ＬＤ_ＮおよびＳＤ_１，…，ＳＤ_Ｎの推定精度が低くなるようにラベリングネットワーク１５０（ラベリングモデル）を学習し、推定ドメイン情報の系列ＬＤ_１，…，ＬＤ_ＮおよびＳＤ_１，…，ＳＤ_Ｎの推定精度が高くなるようにドメイン識別モデル２３０を学習する敵対的学習を行う。すなわち、学習部２１ａは、上述の教師データが発話テキスト系列Ｔ_１，…，Ｔ_Ｎとしてネットワーク２００に入力された際にラベリングネットワーク１５０から出力される推定ラベル系列Ｌ_１，…，Ｌ_Ｎとそれらに対応するソースドメインのラベル付き教師データの正解ラベル系列との誤差を表す損失関数（以下、ラベル予測損失）と、長期文脈ドメイン識別ネットワーク２３２から出力される推定ドメイン情報の系列ＬＤ_１，…，ＬＤ_Ｎとソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとから特定される推定ドメイン情報の正解ラベル系列との誤差を表す損失関数（以下、長期文脈ドメイン識別損失）と、短期文脈ドメイン識別ネットワーク２３１－１，…，２３１－Ｎから出力される推定ドメイン情報の系列ＳＤ_１，…，ＳＤ_Ｎとソースドメインのラベル付き教師データとターゲットドメインのラベルなし教師データとから特定される推定ドメイン情報の正解ラベル系列との誤差を表す損失関数（以下、短期文脈ドメイン識別損失）とに基づき、ラベリングネットワーク１５０とドメイン識別モデル２３０との敵対的学習を行う。なお、ターゲットドメインのラベルなし教師データがネットワーク２００に入力された際にラベリングネットワーク１５０から出力される推定ラベル系列Ｌ_１，…，Ｌ_Ｎはラベル予測損失の算出に用いられない。 <Learning process>
In the learning process, labeled teacher data of the source domain (a labeled training information sequence belonging to the source domain) and unlabeled teacher data of the target domain (an unlabeled training information sequence belonging to the target domain) are input to the learning unit 21a of the learning device 21. Using teacher data that includes both as the spoken text sequence T_1, ..., T_N for the above-mentioned network 200, the learning unit 21a performs adversarial learning: it trains the labeling network 150 (labeling model) so that the estimation accuracy of the estimated label sequence L_1, ..., L_N is high and the estimation accuracy of the estimated domain information sequences LD_1, ..., LD_N and SD_1, ..., SD_N is low, and it trains the domain discrimination model 230 so that the estimation accuracy of the estimated domain information sequences LD_1, ..., LD_N and SD_1, ..., SD_N is high.
That is, the learning unit 21a performs adversarial learning between the labeling network 150 and the domain discrimination model 230 based on three loss functions. The first (hereinafter, label prediction loss) represents the error between the estimated label sequence L_1, ..., L_N, which the labeling network 150 outputs when the above-mentioned teacher data is input to the network 200 as the spoken text sequence T_1, ..., T_N, and the corresponding correct label sequence of the labeled teacher data of the source domain. The second (hereinafter, long-term context domain identification loss) represents the error between the estimated domain information sequence LD_1, ..., LD_N output from the long-term context domain identification network 232 and the correct label sequence of the domain information identified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain. The third (hereinafter, short-term context domain identification loss) represents the error between the estimated domain information sequence SD_1, ..., SD_N output from the short-term context domain identification networks 231-1, ..., 231-N and the correct label sequence of the domain information identified from the labeled teacher data of the source domain and the unlabeled teacher data of the target domain. Note that the estimated label sequence L_1, ..., L_N output from the labeling network 150 when the unlabeled teacher data of the target domain is input to the network 200 is not used to calculate the label prediction loss.

The learning unit 21a performs this adversarial learning using, for example, error backpropagation. In this case, gradient inversion layers 242-n (n = 1, ..., N) are provided between the long-term context understanding network 112 and the long-term context domain identification network 232, gradient inversion layers 241-n (n = 1, ..., N) are provided between the short-term context understanding networks 111-n and the short-term context domain identification networks 231-n, and the gradient is inverted in the gradient inversion layers 242-n and 241-n only during error backpropagation. Learning that reduces the label prediction loss raises the estimation accuracy of the estimated label sequence L_1, ..., L_N in the labeling network 150. In addition, by inverting the gradient in the gradient inversion layers 242-n only during error backpropagation and learning so that the long-term context domain identification loss becomes small, adversarial learning can be performed that trains the long-term context domain identification network 232 to raise the estimation accuracy of the estimated domain information sequence LD_1, ..., LD_N, while training the long-term context understanding network 112 to produce an intermediate feature sequence LF_1, ..., LF_N that lowers that estimation accuracy. Likewise, by inverting the gradient in the gradient inversion layers 241-n only during error backpropagation and learning so that the short-term context domain identification loss becomes small, adversarial learning can be performed that trains the short-term context domain identification networks 231-1, ..., 231-N to raise the estimation accuracy of the estimated domain information sequence SD_1, ..., SD_N, while training the short-term context understanding networks 111-1, ..., 111-N to produce a short-term intermediate feature sequence SF_1, ..., SF_N that lowers that estimation accuracy. These adversarial learning processes yield a labeling network 150 that estimates the label sequence L_1, ..., L_N accurately while generating an intermediate feature sequence LF_1, ..., LF_N and a short-term intermediate feature sequence SF_1, ..., SF_N from which the domain identification model 230 cannot estimate the domain. As a result, the labeling network 150 acquires an intermediate feature sequence LF_1, ..., LF_N that is effective for label prediction while suppressing the domain dependence of context spanning multiple utterance texts T_n, and acquires a short-term intermediate feature sequence SF_1, ..., SF_N that is effective for label prediction while suppressing domain dependence at the level of individual utterance texts T_n, so that unsupervised domain adaptation is achieved with higher accuracy.
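The behavior of a gradient inversion layer (often called a gradient reversal layer in the literature) can be sketched in a few lines. The following is a minimal illustration in plain Python, not the implementation of this embodiment: the class name `GradientInversionLayer` and its `forward`/`backward` methods are hypothetical, and features and gradients are represented as plain lists. It shows only the defining property stated above: the layer passes features through unchanged in the forward direction and flips the sign of the gradient during backpropagation.

```python
class GradientInversionLayer:
    """Toy stand-in for layers 241-n / 242-n (hypothetical class name)."""

    def forward(self, features):
        # Forward pass: identity. The discriminator sees the features unchanged.
        return list(features)

    def backward(self, grad_from_discriminator):
        # Backward pass: the sign of every gradient component is flipped, so the
        # upstream network is pushed to increase the discriminator's loss.
        return [-g for g in grad_from_discriminator]


grl = GradientInversionLayer()
features = grl.forward([0.5, -1.2, 3.0])        # unchanged features
upstream_grad = grl.backward([0.1, -0.4, 0.0])  # sign-flipped gradient
```

Because the discriminator itself is updated with the non-inverted gradient, a single minimization of the domain identification loss trains the discriminator to identify the domain and the upstream network to hide it.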

This learning process can be realized by optimizing (minimizing) a loss function that is a linear combination of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss. The combination ratio of these three losses may be predetermined, or may be specified by a learning schedule input to the learning unit 21a.
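As a small worked example of this linear combination, the sketch below computes the combined loss from the three component losses. The function name `combined_loss` and the weight parameters `w_label`, `w_long`, and `w_short` are assumptions introduced for illustration; the text only states that a combination ratio exists, not how it is parameterized.

```python
def combined_loss(label_loss, long_domain_loss, short_domain_loss,
                  w_label=1.0, w_long=1.0, w_short=1.0):
    """Linear combination of the three losses (weights are hypothetical names)."""
    return (w_label * label_loss
            + w_long * long_domain_loss
            + w_short * short_domain_loss)


# Example: label loss 2.0, long-term loss 1.0, short-term loss 0.5,
# with both domain identification losses weighted by 0.5.
total = combined_loss(2.0, 1.0, 0.5, w_label=1.0, w_long=0.5, w_short=0.5)
```

With these inputs the result is 2.0 + 0.5 + 0.25 = 2.75.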

The learning unit 21a may perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss, the long-term context domain identification loss, and the short-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may use only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the long-term and short-term context domain identification losses in the loss function as the number of learning steps increases. Alternatively, the learning unit 21a may learn until convergence with a fixed combination ratio, then change the combination ratio and learn until convergence again, repeating this while changing the combination ratio based on the learning schedule.
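One possible shape for such a learning schedule is sketched below. It is an assumption for illustration only: the function name `combination_ratio`, the linear ramp, and the parameters `warmup_steps`, `ramp_steps`, and `max_weight` do not appear in the text, which leaves the form of the schedule open. The sketch uses only the label prediction loss at first and then raises the weight of the domain identification losses step by step.

```python
def combination_ratio(step, warmup_steps=1000, ramp_steps=4000, max_weight=1.0):
    """Return (label_weight, domain_weight) for a training step.

    All parameter names and the linear ramp are hypothetical choices.
    """
    if step < warmup_steps:
        # Early stage: only the label prediction loss contributes.
        return 1.0, 0.0
    # Afterwards the domain identification weight grows linearly up to max_weight.
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return 1.0, max_weight * progress


early = combination_ratio(0)       # (1.0, 0.0)
middle = combination_ratio(3000)   # (1.0, 0.5)
late = combination_ratio(10000)    # (1.0, 1.0)
```

The same domain weight could be applied to both the long-term and short-term losses, or two separate ramps could be used; the text permits either.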

The domain identification model 230 may have only one of the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N.

When the domain identification model 230 has only the long-term context domain identification network 232, the gradient inversion layers 241-1, ..., 241-N are omitted, and the learning process is performed based on a loss function that is a linear combination of the label prediction loss and the long-term context domain identification loss. In this case as well, the combination ratio may be predetermined or may be specified by a learning schedule input to the learning unit 21a. The learning unit 21a may also perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss and the long-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may use only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the long-term context domain identification loss in the loss function as the number of learning steps increases. Alternatively, the learning unit 21a may learn until convergence with a fixed combination ratio, then change the combination ratio and learn until convergence again, repeating this while changing the combination ratio based on the learning schedule.

When the domain identification model 230 has only the short-term context domain identification networks 231-1, ..., 231-N, the gradient inversion layers 242-1, ..., 242-N are omitted, and the learning process is performed based on a loss function that is a linear combination of the label prediction loss and the short-term context domain identification loss. In this case as well, the combination ratio may be predetermined or may be specified by a learning schedule input to the learning unit 21a. The learning unit 21a may also perform the above learning while changing, based on the learning schedule, the combination ratio of the label prediction loss and the short-term context domain identification loss according to the number of learning steps. For example, the learning unit 21a may use only the label prediction loss as the loss function in the early stage of learning, and gradually increase the proportion of the short-term context domain identification loss in the loss function as the number of learning steps increases. Alternatively, the learning unit 21a may learn until convergence with a fixed combination ratio, then change the combination ratio and learn until convergence again, repeating this while changing the combination ratio based on the learning schedule.

The domain identification model 230 may also have the long-term context domain identification network 232 and only some of the short-term context domain identification networks 231-1, ..., 231-N. That is, some of the short-term context domain identification networks 231-1, ..., 231-N may be omitted. In this case, the estimated domain information SD_n corresponding to an omitted short-term context domain identification network 231-n, and the correct domain label corresponding to it, are not used in calculating the short-term context domain identification loss.
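The exclusion of omitted positions from the loss can be illustrated as follows. This is a sketch under stated assumptions: the text does not specify the form of the short-term context domain identification loss, so binary cross-entropy averaged over the remaining positions is assumed here, and the function name `short_term_domain_loss` and the `active` mask are hypothetical.

```python
import math


def short_term_domain_loss(sd_probs, domain_labels, active):
    """Mean binary cross-entropy over positions whose network 231-n exists.

    sd_probs[n]:      estimated probability that utterance n is from the target domain
    domain_labels[n]: correct domain label (1 = target, 0 = source)
    active[n]:        False where network 231-n is omitted (hypothetical mask)
    """
    terms = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y, keep in zip(sd_probs, domain_labels, active)
        if keep  # omitted positions contribute nothing to the loss
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Position 1 is omitted, so its estimate (0.9) and label are ignored.
loss = short_term_domain_loss([0.5, 0.9, 0.5], [1, 1, 0], [True, False, True])
```

In this example both remaining positions have an estimate of 0.5, so the loss equals log 2 regardless of the value at the omitted position.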

The learning unit 21a may also prepare a plurality of the various domain identification models 230 and/or labeling networks 150 exemplified above, perform learning with each, and afterwards select, from among the labeling networks 150 obtained by the respective learning runs, the one whose label sequence estimation accuracy in the target domain is best. The plurality of domain identification models 230 prepared includes, for example, at least one of the following: a domain identification model 230 including both the long-term context domain identification network 232 and the short-term context domain identification networks 231-1, ..., 231-N, as described above; a domain identification model 230 having only the long-term context domain identification network 232; a domain identification model 230 having only the short-term context domain identification networks 231-1, ..., 231-N; and the domain identification model 230 of the first embodiment.
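The selection step can be sketched as a simple argmax over the trained candidates. The helper name `select_best_labeling_network` is hypothetical, and the sketch assumes that some way of measuring label sequence estimation accuracy on target-domain data is available; the accuracy values below are placeholders for illustration, not results from this document.

```python
def select_best_labeling_network(candidates, target_accuracy):
    """Pick the candidate with the highest target-domain accuracy.

    candidates:      iterable of trained labeling networks (any objects)
    target_accuracy: callable mapping a candidate to its accuracy (assumed to exist)
    """
    return max(candidates, key=target_accuracy)


# Placeholder accuracies keyed by the discriminator variant used in training.
placeholder_accuracy = {
    "long_and_short": 0.5,
    "long_only": 0.7,
    "short_only": 0.6,
    "first_embodiment": 0.4,
}
best = select_best_labeling_network(placeholder_accuracy, placeholder_accuracy.get)
```

Here `best` is simply the key with the largest placeholder value; in practice the candidates would be trained parameter sets and the callable would run an evaluation.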

The learning unit 21a stores the parameters of the labeling network 150 obtained by the above learning in the storage unit 11b, stores the parameters of the short-term context domain identification networks 231-1, ..., 231-N in the storage unit 21c, and stores the parameters of the long-term context domain identification network 232 in the storage unit 21d. The learning device 21 outputs the parameters of the labeling network 150 stored in the storage unit 11b. The parameters of the labeling network 150 are used in the inference process. The parameters of the short-term context domain identification networks 231-1, ..., 231-N and of the long-term context domain identification network 232 are normally not used in the inference process, so they need not be output from the learning device 21. However, the learning device 21 may output at least one of these sets of parameters.

The learning process described above is illustrated functionally with reference to FIG. 2.
Step S21: Labeled training data of the source domain and unlabeled training data of the target domain are input to the control unit 11aa. The control unit 11aa generates training data including the labeled training data of the source domain and the unlabeled training data of the target domain. The control unit 11aa also initializes the parameters of the network 200.
Step S22: The loss function calculation unit 21ab inputs the training data to the network 200 as the utterance text sequence T_1, ..., T_N and obtains the loss function as described above.
Step S23: The parameter update unit 21ad backpropagates information based on the loss function according to the error backpropagation method, and updates the parameters of the domain identification model 230 and the labeling layer 120.
Step S24: The gradient inversion unit 11ac inverts the gradient of the information based on the loss function that is backpropagated from the domain identification model 230, and backpropagates it to the logical relationship understanding layer 110. The information based on the loss function that is backpropagated from the labeling layer 120 is backpropagated to the logical relationship understanding layer 110 as it is.
Step S25: The parameter update unit 21ad updates the parameters of the logical relationship understanding layer 110 using the backpropagated information according to the error backpropagation method.
Step S26: The control unit 11aa determines whether the termination condition is satisfied. If the termination condition is not satisfied, the control unit 11aa returns the process to step S22. If the termination condition is satisfied, the control unit 11aa outputs the parameters of the labeling network 150. If necessary, the control unit 11aa may also output at least one of the parameters of the short-term context domain identification networks 231-1, ..., 231-N and the parameters of the long-term context domain identification network 232.
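Steps S21 to S26 can be traced on a deliberately tiny model. The sketch below is a toy under stated assumptions, not the network 200: the logical relationship understanding layer, the labeling layer, and the domain identification model are each reduced to a single scalar parameter (`w`, `a`, `b`), the label loss is assumed to be a squared error, the domain loss is assumed to be binary cross-entropy on a sigmoid output, and the termination condition is assumed to be a fixed step count. All function and variable names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(params, x, y, d, lr=0.1):
    """One pass of steps S22 to S25 on a single scalar sample.

    x: input, y: correct label (None for unlabeled target data),
    d: correct domain (0 = source, 1 = target).
    """
    w, a, b = params["w"], params["a"], params["b"]

    # S22: forward pass.
    f = w * x                    # intermediate feature
    p = sigmoid(b * f)           # estimated probability of the target domain
    label_err = (a * f - y) if y is not None else 0.0

    # S23: gradients for the labeling layer and the domain discriminator.
    grad_a = label_err * f
    grad_b = (p - d) * f

    # S24: the label gradient reaches the feature as it is; the
    # discriminator gradient (p - d) * b is sign-inverted on the way.
    grad_f = label_err * a - (p - d) * b

    # S25: gradient for the understanding layer.
    grad_w = grad_f * x

    return {"w": w - lr * grad_w, "a": a - lr * grad_a, "b": b - lr * grad_b}


def train(samples, max_steps=100, lr=0.1):
    params = {"w": 1.0, "a": 1.0, "b": 0.0}       # S21: initialize parameters
    for step in range(max_steps):                  # S26: fixed step budget (assumption)
        x, y, d = samples[step % len(samples)]
        params = train_step(params, x, y, d, lr)
    # Only the labeling network parameters are needed for inference.
    return {"w": params["w"], "a": params["a"]}


labeled_source = (2.0, 1.0, 0)      # (input, label, domain)
unlabeled_target = (1.0, None, 1)   # no label available
labeling_params = train([labeled_source, unlabeled_target], max_steps=10)
```

For an unlabeled target sample the labeling parameter `a` is left unchanged, the discriminator parameter `b` moves to identify the domain better, and `w` moves in the opposite direction, which is the adversarial effect the gradient inversion is meant to produce.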

The functional configuration and inference processing of the inference device 13 in the second embodiment are the same as those in the first embodiment, so their description is omitted.

<Features of the second embodiment>
This embodiment involves two components. The first is the labeling network 150, which includes the logical relationship understanding layer 110 and the labeling layer 120. The logical relationship understanding layer 110 receives the utterance text sequence T_1, ..., T_N, obtains an intermediate feature sequence LF_1, ..., LF_N that takes the context of the utterance text sequence T_1, ..., T_N into account, and outputs that intermediate feature sequence. The labeling layer 120 receives the intermediate feature sequence LF_1, ..., LF_N, obtains the estimated label sequence L_1, ..., L_N corresponding to the utterance text sequence T_1, ..., T_N, and outputs that estimated label sequence. The second component is the domain identification model 230, which receives the intermediate feature sequence LF_1, ..., LF_N, obtains estimated domain information LD_n and SD_n, which are estimates of the domain identification information indicating whether each utterance text T_n included in the utterance text sequence T_1, ..., T_N belongs to the source domain or the target domain, and outputs the estimated domain information sequences LD_1, ..., LD_N and SD_1, ..., SD_N. For these components, the learning device 21 performed adversarial learning using, as the utterance text sequence T_1, ..., T_N, training data including labeled training data of the source domain and unlabeled training data of the target domain. The adversarial learning trains the labeling network 150 so that the estimation accuracy of the estimated label sequence L_1, ..., L_N is high and the estimation accuracy of the estimated domain information sequences LD_1, ..., LD_N and SD_1, ..., SD_N is low, and trains the domain identification model 230 so that the estimation accuracy of the estimated domain information sequences LD_1, ..., LD_N and SD_1, ..., SD_N is high. This enables unsupervised domain adaptation of the labeling network 150, which estimates the label sequence L_1, ..., L_N corresponding to the utterance text sequence T_1, ..., T_N in consideration of the context of that sequence.

In particular, in this embodiment, the domain identification model 230 includes at least one of the following: the short-term context domain identification networks 231-1, ..., 231-N, which receive the short-term intermediate feature sequence SF_1, ..., SF_N output from the short-term context understanding networks 111-1, ..., 111-N and obtain and output the estimated domain information sequence SD_1, ..., SD_N; and the long-term context domain identification network 232, which receives the intermediate feature sequence LF_1, ..., LF_N output from the long-term context understanding network 112 and obtains and outputs the estimated domain information sequence LD_1, ..., LD_N. This makes it possible to efficiently remove from the labeling network 150 at least one of the domain dependence of individual utterance texts T_n and the domain dependence of context spanning multiple utterance texts T_n. As a result, unsupervised domain adaptation of the labeling network 150 can be performed with higher accuracy.

Furthermore, when the domain identification model 230 includes at least the long-term context domain identification network 232, the domain dependence of context spanning multiple utterance texts T_n can be efficiently removed from the labeling network 150. As a result, unsupervised domain adaptation can be performed accurately for the labeling network 150, which estimates the label sequence L_1, ..., L_N corresponding to the utterance text sequence T_1, ..., T_N in consideration of the context of that sequence.

Moreover, when the domain identification model 230 includes both the short-term context domain identification networks 231-1, ..., 231-N and the long-term context domain identification network 232, both the domain dependence of individual utterance texts T_n and the domain dependence of context spanning multiple utterance texts T_n can be efficiently removed from the labeling network 150. In this case, unsupervised domain adaptation of the labeling network 150 can be performed with even higher accuracy.

<Experimental Results>
Experimental results of unsupervised domain adaptation performed according to the above embodiments are shown below. The experimental conditions were as follows.
(1) Each utterance text in the simulated utterance text sequences is classified into one of five response-scene classes, and a label representing the response scene is estimated.
(2) The five domains other than the target domain (new domain) were treated as source domains (previously applied domains). A labeling network was trained using only source-domain data, and labeling networks were trained according to the first embodiment and the second embodiment. Each resulting labeling network was then used to verify the identification performance (labeling accuracy rate) on utterance text sequences of the target domain.
(3) Identification performance was verified on utterance text sequences for 60 calls in each of six target domains (online shopping, ISP, securities, local government, mobile phone, and PC support), that is, 60 calls x 6 domains = 360 calls of simulated data.
(4) Each utterance text contains approximately 100 sentences.
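Conditions (2) and (3) describe a leave-one-domain-out protocol: each of the six domains serves as the target once, with the remaining five as sources. The sketch below enumerates those splits and the total call count. The function name `leave_one_domain_out` and the constant names are hypothetical; the domain names and counts are taken from the conditions above.

```python
DOMAINS = [
    "online shopping", "ISP", "securities",
    "local government", "mobile phone", "PC support",
]
CALLS_PER_DOMAIN = 60


def leave_one_domain_out(domains):
    """Yield (target, sources) pairs, one for each domain used as the target."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield target, sources


splits = list(leave_one_domain_out(DOMAINS))
total_calls = CALLS_PER_DOMAIN * len(DOMAINS)   # 60 calls x 6 domains
```

Each split pairs one target domain with five source domains, and the evaluation set covers 360 calls in total.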

FIG. 8 illustrates the experimental results. As shown in FIG. 8, with the methods of both the first embodiment and the second embodiment, unsupervised domain adaptation to the target domain is possible with high accuracy using existing labeled data of the source domains, without preparing labeled data in the target domain. In particular, the method of the second embodiment enables unsupervised domain adaptation with higher accuracy, improving identification accuracy by 3.4% on average compared with training on source-domain data alone.

[Hardware configuration]
The learning devices 11 and 21 and the inference device 13 in each embodiment are, for example, devices configured by a general-purpose or dedicated computer that executes a predetermined program, the computer having a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory). The computer may have a single processor and memory, or multiple processors and memories. The program may be installed on the computer or may be recorded in advance in a ROM or the like. Some or all of the processing units may also be configured using electronic circuitry that realizes processing functions on its own, rather than electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program. The electronic circuitry constituting one device may include multiple CPUs.

FIG. 9 is a block diagram illustrating the hardware configuration of the learning devices 11 and 21 and the inference device 13 in each embodiment. As illustrated in FIG. 9, the learning devices 11 and 21 and the inference device 13 in this example have a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a in this example has a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various arithmetic processes according to the programs loaded into the register 10ac. The input unit 10b is an input terminal to which data is input, a keyboard, a mouse, a touch panel, or the like. The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a that has loaded a predetermined program, or the like. The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that they can exchange information. In accordance with the loaded OS (Operating System) program, the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d. Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data were written are then stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads these addresses from the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the addresses, causes the arithmetic unit 10ab to sequentially execute the operations indicated by the program, and stores the results in the register 10ac. This configuration realizes the functional configurations of the learning devices 11 and 21 and the inference device 13.

The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network. As described above, a computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to it, or may execute the process according to the received program each time a program is transferred to it from the server computer. The above processes may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct instruction to a computer but has the property of defining the computer's processing).

The device may be configured using a GPU (Graphics Processing Unit) in addition to the CPU 10a. In each embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

The present invention is not limited to the embodiments described above. For example, in each embodiment, the unlabeled training data of the target domain includes utterance text sequences of the target domain but does not include correct label sequences. However, at least part of the unlabeled training data of the target domain may include correct label sequences. In that case, those correct label sequences may or may not be used to train the networks 100 and 200.

As noted earlier, for clarity of explanation the above embodiments illustrate the case in which the sequence of multiple pieces of logically related information is an utterance text sequence and the label sequence is a sequence of labels representing the response scene of each utterance (for example, opening, understanding the inquiry, identity verification, response, closing). This is only an example. Other information sequences, such as a sentence sequence, a programming language sequence, an audio signal sequence, or a video signal sequence, may be used as the sequence of multiple pieces of logically related information. Likewise, other label sequences may be used, such as a label sequence representing situations or actions, a label sequence representing places or times, a label sequence representing parts of speech, or a label sequence representing program content. Each model, such as the labeling model, need not be based on a deep neural network and may instead be based on a probabilistic model, a classifier, or the like. In each embodiment, the logical relationship understanding layer 110 receives the utterance text sequence T_1, ..., T_N and obtains and outputs the intermediate feature sequence LF_1, ..., LF_N that takes the context (logical relationship) of the utterance text sequence T_1, ..., T_N into account. However, the logical relationship understanding layer 110 may receive the utterance text sequence T_1, ..., T_N and obtain and output a sequence of fewer than N or more than N intermediate features. In each embodiment, the labeling layer 120 receives the intermediate feature sequence LF_1, ..., LF_N (a first sequence based on the intermediate feature sequence) and obtains and outputs the estimated label sequence L_1, ..., L_N corresponding to the utterance text sequence T_1, ..., T_N. However, the labeling layer 120 may receive the intermediate feature sequence LF_1, ..., LF_N (the first sequence based on the intermediate feature sequence) and obtain and output a sequence of fewer than N or more than N estimated labels. In each embodiment, the domain identification models 130 and 230 receive the intermediate feature sequence LF_1, ..., LF_N (a second sequence based on the intermediate feature sequence) and obtain and output N pieces of estimated domain information. However, the domain identification models 130 and 230 may receive the intermediate feature sequence LF_1, ..., LF_N (the second sequence based on the intermediate feature sequence) and obtain and output fewer than N or more than N pieces of estimated domain information.

In addition, the various processes described above need not be executed only in chronological order as described; they may also be executed in parallel or individually, depending on the processing capacity of the device executing them or as necessary. Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

The present invention makes it possible, for example, to efficiently remove the domain dependency of intermediate features in sequence labeling problems that take complex contexts into account. In particular, by efficiently removing the domain dependency of the intermediate features for both short-term and long-term logical relationships (contexts), as exemplified in the second embodiment, a labeling network that is better adapted to the target domain can be learned, which improves labeling accuracy in the target domain.
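The opposing training goals behind this removal of domain dependency can be sketched as two objectives computed from the same domain estimates: the domain identification model is trained to identify the domain well, while the labeling model is trained to label well and to make domain identification hard. This is a minimal sketch under assumptions: cross-entropy losses, the single weight `trade_off`, and all function names are hypothetical, and this section states only the direction of each goal, not the exact loss. One common way to realize such training is a gradient reversal layer; this section does not say which realization the embodiments use.

```python
import math
from typing import Sequence, Tuple


def cross_entropy(probs: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the target class."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def adversarial_objectives(label_probs: Sequence[Sequence[float]],
                           labels: Sequence[int],
                           domain_probs: Sequence[Sequence[float]],
                           domains: Sequence[int],
                           trade_off: float = 1.0) -> Tuple[float, float]:
    """Return (labeling_objective, domain_objective); both are minimized.

    label_probs / labels: labeled source-domain utterances only.
    domain_probs / domains: source (0) and target (1) utterances.
    """
    l_label = cross_entropy(label_probs, labels)
    l_domain = cross_entropy(domain_probs, domains)
    # Labeling model: accurate labels, inaccurate domain estimates.
    labeling_objective = l_label - trade_off * l_domain
    # Domain identification model: accurate domain estimates.
    domain_objective = l_domain
    return labeling_objective, domain_objective
```

In this sketch the labeling objective is lower when the domain estimates are close to chance, which is the sense in which the intermediate features carry less domain-dependent information.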

Unsupervised domain adaptation has previously been studied for image recognition, but the present invention is the first to apply it to problems, such as language processing, in which a label sequence corresponding to a sequence of multiple pieces of information is estimated by taking into account the logical relationships within that sequence. This unsupervised domain adaptation technology can, for example, significantly reduce the cost of labeling, which has been a barrier to expanding contact center businesses into new industries.

In particular, in the method illustrated in the second embodiment, a domain identification network for a spoken text unit (for example, a call unit) can be designed, using a unidirectional or bidirectional LSTM, as a mechanism that spans sentence boundaries. This makes it possible to capture domain dependency such as a characteristic conversation flow that depends on the industry of the contact center, and to efficiently remove it from the labeling network, thereby improving the estimation accuracy of labels in the target domain.

In addition, in the method illustrated in the second embodiment, a domain identification network for a spoken text unit (for example, a call unit) can also be designed as a mechanism that does not cross the boundaries of spoken text. This makes it possible to capture domain dependency caused by, for example, specific words that depend on the industry of the contact center, and to efficiently remove it from the labeling network, thereby improving the estimation accuracy of labels in the target domain.
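The difference between the two designs described above, one that spans boundaries and one that does not, can be illustrated with a toy example. This is a minimal sketch under assumptions: a decayed running sum stands in for the unidirectional LSTM named in the text, the linear scorer, the `decay` parameter, and all function names are hypothetical, and each output is read as an estimated probability that the utterance belongs to the target domain.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score(feature: Sequence[float], weights: Sequence[float]) -> float:
    """Linear score of one utterance-level feature vector."""
    return sum(f * w for f, w in zip(feature, weights))


def short_term_domain_estimates(features: Sequence[Sequence[float]],
                                weights: Sequence[float]) -> List[float]:
    """Score each utterance on its own; nothing crosses a boundary."""
    return [sigmoid(score(f, weights)) for f in features]


def long_term_domain_estimates(features: Sequence[Sequence[float]],
                               weights: Sequence[float],
                               decay: float = 0.5) -> List[float]:
    """Carry a running state across utterances (toy stand-in for an LSTM)."""
    state = 0.0
    estimates = []
    for f in features:
        state = decay * state + score(f, weights)  # earlier utterances persist
        estimates.append(sigmoid(state))
    return estimates
```

In the boundary-crossing variant, changing the first utterance changes the estimate for the second, which is how a flow of conversation can be captured; in the other variant each estimate depends only on its own utterance, which suits word-level cues.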

11, 21 Learning device
11a, 21a Learning unit
13 Inference device
13a Inference unit
100, 200 Network
110 Logical relationship understanding layer
111-1, ..., 111-N Short-term context understanding network
112 Long-term context understanding network
120 Labeling layer
120-1, ..., 120-N Label prediction network
130, 230 Domain identification model
130-1, ..., 130-N Domain identification network
231-1, ..., 231-N Short-term context domain identification network
232 Long-term context domain identification network

Claims

A learning device comprising a learning unit, wherein,
for a labeling model including
a logical relationship understanding means for receiving an input information sequence, which is a sequence of a plurality of pieces of information having a logical relationship, and outputting an intermediate feature sequence that takes into account the logical relationship of the input information sequence, and
a labeling means for receiving a first sequence based on the intermediate feature sequence and outputting an estimated label sequence for a label sequence corresponding to the input information sequence,
and for a domain identification model that receives a second sequence based on the intermediate feature sequence and outputs a sequence of estimated domain information indicating whether each piece of information included in the input information sequence belongs to a source domain or a target domain,
the learning unit uses, as the input information sequence, training data including labeled training data, which is a labeled learning information sequence belonging to the source domain, and unlabeled training data, which is an unlabeled learning information sequence belonging to the target domain, performs adversarial learning that trains the labeling model so that the estimation accuracy of the estimated label sequence becomes high and the estimation accuracy of the sequence of estimated domain information becomes low, and trains the domain identification model so that the estimation accuracy of the sequence of estimated domain information becomes high, and obtains and outputs parameters of the labeling model,
the logical relationship understanding means includes:
a plurality of short-term logical relationship understanding means, each for receiving one piece of information included in the input information sequence and outputting a short-term intermediate feature that takes into account logical relationships within the received information; and
a long-term logical relationship understanding means for receiving a short-term intermediate feature sequence consisting of a plurality of the short-term intermediate features and outputting a long-term intermediate feature sequence that takes into account logical relationships among the plurality of pieces of information included in the input information sequence,
the labeling means receives the long-term intermediate feature sequence as the first sequence and outputs the estimated label sequence for the label sequence corresponding to the input information sequence, and
the domain identification model includes a short-term logical relationship domain identification means for receiving the short-term intermediate feature sequence as the second sequence and outputting the sequence of estimated domain information.

A learning device comprising a learning unit, wherein,
for a labeling model including
a logical relationship understanding means for receiving an input information sequence, which is a sequence of a plurality of pieces of information having a logical relationship, and outputting an intermediate feature sequence that takes into account the logical relationship of the input information sequence, and
a labeling means for receiving a first sequence based on the intermediate feature sequence and outputting an estimated label sequence for a label sequence corresponding to the input information sequence,
and for a domain identification model that receives a second sequence based on the intermediate feature sequence and outputs a sequence of estimated domain information indicating whether each piece of information included in the input information sequence belongs to a source domain or a target domain,
the learning unit uses, as the input information sequence, training data including labeled training data, which is a labeled learning information sequence belonging to the source domain, and unlabeled training data, which is an unlabeled learning information sequence belonging to the target domain, performs adversarial learning that trains the labeling model so that the estimation accuracy of the estimated label sequence becomes high and the estimation accuracy of the sequence of estimated domain information becomes low, and trains the domain identification model so that the estimation accuracy of the sequence of estimated domain information becomes high, and obtains and outputs parameters of the labeling model,
the logical relationship understanding means includes:
a plurality of short-term logical relationship understanding means, each for receiving one piece of information included in the input information sequence and outputting a short-term intermediate feature that takes into account logical relationships within the received information; and
a long-term logical relationship understanding means for receiving a short-term intermediate feature sequence consisting of a plurality of the short-term intermediate features and outputting a long-term intermediate feature sequence that takes into account logical relationships among the plurality of pieces of information included in the input information sequence,
the labeling means receives the long-term intermediate feature sequence as the first sequence and outputs the estimated label sequence for the label sequence corresponding to the input information sequence, and
the domain identification model includes a short-term logical relationship domain identification means for receiving the short-term intermediate feature sequence as the second sequence and outputting the sequence of estimated domain information, and a long-term logical relationship domain identification means for receiving the long-term intermediate feature sequence as the second sequence and outputting the sequence of estimated domain information.

An inference device having an inference unit that applies an input information sequence for inference to the labeling model specified by the parameters obtained by the learning device of claim 1 or 2, obtains an estimated label sequence of a label sequence corresponding to the input information sequence for inference, and outputs the estimated label sequence.

A learning method for the learning device of claim 1 or 2, or an inference method for the inference device of claim 3.

A program for causing a computer to function as the learning device of claim 1 or 2 or the inference device of claim 3.