JP2023048466A

JP2023048466A - Conversation management system and conversation management method

Info

Publication number: JP2023048466A
Application number: JP2021157794A
Authority: JP
Inventors: アマリアアディバ; Adiba Amalia; 健本間; Takeshi Honma
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2023-04-07

Abstract

To determine a speech production destination in conversation among a plurality of participants. SOLUTION: A conversation management system manages conversation in which a plurality of participants are present. The conversation management system stores past speech production history information of each of the participants in the conversation. The conversation management system acquires a text feature amount of first current speech production of a speaker. The conversation management system acquires a text feature amount of past speech production of each of a plurality of listeners. The conversation management system calculates a speech production destination probability of each listener on the basis of the text feature amount of the first current speech production and the text feature amount of the past speech production of each listener. The conversation management system determines a speech production destination of the first current speech production on the basis of the speech production destination probabilities of the listeners. SELECTED DRAWING: Figure 7

Description

The present disclosure relates to techniques for managing conversations among multiple participants.

Technologies exist for agents that support conversations in which multiple people participate. In daily life, conversations commonly take place among several people. Here, a person who participates in a conversation is called a user. In a conversation that uses a system with a conversation-supporting agent, the users converse not only with one another but also with the agent of the system. That is, each user converses with one participant or with multiple participants. A participant here is either a human user or an agent. The users coordinate the conversation among themselves so that its purpose can be achieved efficiently and effectively.

The agent may respond to the immediately preceding user utterance, may spontaneously offer a new idea, or may interrupt a user when necessary. To achieve the purpose of the conversation, it is important that the conversation between users continue seamlessly. One conceivable way to realize this is for the agent to say appropriate content at an appropriate time, thereby helping the users' conversation continue seamlessly. For the agent to support the conversation efficiently, a function that identifies which participant the speaking user is addressing is essential. However, in a conversation involving multiple users, how to detect whom the speaker is addressing is not obvious.

For example, U.S. Patent No. 9,761,247 discloses a system that supports conversations involving multiple users. The users converse with one another or with an agent of the system. In a multi-user conversation, the system uses prosodic and lexical features to identify whether a given speech segment was addressed to the agent.

U.S. Patent No. 9,761,247

Conventional utterance destination detection techniques focus on choosing between two options: whether a user's utterance is directed to an agent or to a human user. However, the conventional techniques alone are insufficient for genuine communication. For example, the agent of the system needs to infer which user in the conversation is most likely to speak next and, depending on the result, decide whether the agent should speak next or had better wait. For this purpose, it is important to appropriately estimate which of the users the speaker is addressing.

Utterance destination detection distinguishes agent-directed utterances from human-directed utterances. Agent-directed means that the utterance destination is the agent of the system. However, estimating agent-directed utterances alone is not sufficient. In a conversation within a group, a speaker can address any of the other participants (including human users and agents) or can speak to the group as a whole. Therefore, when detecting whom the speaker is addressing, distinguishing between speech addressed to a specific user and speech addressed to the group is important for continuing the conversation seamlessly.

One aspect of the present disclosure is a conversation management system including one or more storage devices and one or more arithmetic devices. The one or more storage devices store past utterance history information of each of a plurality of participants in a conversation. The one or more arithmetic devices acquire a text feature of a first current utterance of a speaker. The one or more arithmetic devices acquire, from the past utterance history information, a text feature of a past utterance of each of a plurality of listeners. The one or more arithmetic devices calculate an utterance destination probability for each listener based on the text feature of the first current utterance and the text feature of the past utterance of that listener. The one or more arithmetic devices determine the utterance destination of the first current utterance based on the utterance destination probabilities of the plurality of listeners.

According to one aspect of the present disclosure, the utterance destination of a speaker can be estimated in a conversation in which a plurality of participants are present.

FIG. 1A shows an example of a usage form of a conversation management system according to an embodiment of the present specification.
FIG. 1B shows another usage form of the conversation management system of the present specification.
FIG. 2 shows a hardware configuration example of the conversation management system.
FIG. 3 shows a hardware configuration example of a user terminal.
FIG. 4 shows a functional configuration example of the conversation management system according to an embodiment of the present specification.
FIG. 5 shows a configuration example of participant-specific feature management information.
FIG. 6 shows a configuration example of utterance history management information.
FIG. 7 is a flowchart of an operation example of the conversation management system.
The remaining drawings show, in order:
A sample of utterances collected in a conversation.
An example of the input to and output from a recurrent neural network that the utterance destination determination unit can use to determine the utterance destination.
A schematic example of the semantic relatedness of words.
An example of utterance destination determination processing by multitask learning.
One of a series of utterances that make up a conversation (eight drawings, one utterance each).
An RNN that identifies whether the speaker's utterance destination is a group or an individual.
Details of the processing in the LSTM.
Details of the part that inputs words into the word embedding layer and the LSTM.
An RNN that determines which listener is the utterance destination when the utterance destination is an individual.
The configuration of the MaLSTM.

In the following, the description is divided into multiple sections or embodiments where necessary for convenience. Unless otherwise specified, these are not unrelated to one another; one is a modification, a detail, a supplementary explanation, or the like of part or all of another. In addition, when the following refers to the number of elements or the like (including counts, numerical values, amounts, and ranges), the description is not limited to that specific number, which may be exceeded or undershot, unless otherwise specified or unless it is clearly limited to the specific number in principle.

The present system may be a physical computer system (one or more physical computers) or a system built on a group of computing resources (a plurality of computing resources) such as a cloud platform. The computer system or group of computing resources includes one or more interface devices (including, for example, communication devices and input/output devices), one or more storage devices (including, for example, memory (main storage) and auxiliary storage devices), and one or more arithmetic devices.

When a function is realized by an arithmetic device executing a program, the prescribed processing is performed while using a storage device and/or an interface device as appropriate, so the function may be regarded as at least part of the arithmetic device. Processing described with a function as the subject may be processing performed by a system having an arithmetic device. A program may be installed from a program source. The program source may be, for example, a program distribution computer or a computer-readable storage medium (for example, a computer-readable non-transitory storage medium). The description of each function is an example; multiple functions may be combined into one function, and one function may be divided into multiple functions.

One embodiment of the present specification provides a mechanism for estimating whom the speaker is addressing (the utterance destination) when multiple users are having a conversation. One embodiment of the present specification is a system that supports a conversation in which multiple participants take part. A "participant" means a participant in the conversation and can be a "user" or an "agent". A "user" is a human being. An "agent" is, for example, a robot equipped with a speech recognition function, a computer system that executes an application program, or that program itself. The agent may further have a speech synthesis function and be able to respond by voice.

In one embodiment of the present specification, the group participating in a conversation consists of three or more participants and may include one or more agents. However, a configuration is also possible in which no agent is present and the conversation management system described in this embodiment simply estimates the next speaker in a conversation between human users. In a conversation with multiple participants, the conversation management system listens to the participants' utterances and estimates which participant is the utterance destination.

In one embodiment of the present specification, the conversation management system performs N-free classification. Here, N denotes the number of participants, and "N-free" means that the system that detects the utterance destination can operate independently of the number of participants. The conversation management system can therefore operate in a conversation with any number of participants. In one embodiment of the present specification, the conversation management system calculates an utterance destination probability score for each participant, based on a calculation that uses the input utterance, each listener's utterance history, and other information. The participant with the highest probability is estimated to be the utterance destination of the current utterance.
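The per-listener scoring described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the bag-of-words feature, the cosine similarity score, the softmax normalization, and all function names are hypothetical stand-ins chosen for brevity, not the scoring model of the disclosure (which refers to recurrent neural networks in its drawings). The point of the sketch is that each listener is scored by the same function, so the number of listeners N need not be fixed in advance.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy text feature: lower-cased word counts (an assumption for illustration)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def destination_probabilities(current: str, history: dict[str, list[str]]) -> dict[str, float]:
    """Score every listener with the same function, then normalize with a softmax.

    `history` maps a listener ID to that listener's past utterance texts.
    """
    cur = bag_of_words(current)
    scores = {
        listener: max((cosine(cur, bag_of_words(u)) for u in utterances), default=0.0)
        for listener, utterances in history.items()
    }
    total = sum(math.exp(s) for s in scores.values())
    return {listener: math.exp(s) / total for listener, s in scores.items()}


def estimate_destination(current: str, history: dict[str, list[str]]) -> str:
    """Return the listener with the highest probability (needs at least one listener)."""
    probs = destination_probabilities(current, history)
    return max(probs, key=probs.get)
```

Because the listeners are held in a dictionary and scored one by one, adding or removing a listener changes only the dictionary contents, not the code.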

A conversation management system according to an embodiment of the present specification estimates the utterance destination by executing the processing described below. FIG. 1A shows an example of a usage form of the conversation management system according to an embodiment of the present specification. The overall system includes a conversation management system 101 and a plurality of user terminals 104. In the example of FIG. 1A, the conversation management system 101 consists of one server, and four users are participating in the conversation.

Each user terminal 104 accesses an access point 103 either wirelessly, as shown in FIG. 1A, or via a cable. The conversation management system 101 and the user terminals 104 communicate with each other via a network 102 and the access point 103. The configuration of the communication network between the conversation management system 101 and the user terminals 104 is arbitrary.

The conversation management system 101 provides the user terminals 104 with a service for conversations over the network, such as conferences and meetings. In addition, the conversation management system 101 executes an agent program to support the conversation between users. That is, the conversation management system 101 provides a conversation service in which multiple users can participate and also supports that conversation.

The conversation management system 101 can be composed of one or more computers, or of computer resources on the cloud. The conversation management system 101 can thus include one or more arithmetic devices and one or more storage devices.

Multiple users can each use a user terminal 104 to log in to the conversation management system 101 and hold conversations, including meetings of various sizes, on the platform that the conversation management system 101 provides. A user can send his or her own video to the other participants through a camera mounted on the user terminal 104 and can send his or her speech to the other participants through a microphone. A user can view images of the other users on a display device mounted on the user terminal 104 and can hear the other participants' speech through a speaker.

FIG. 1B shows another usage form of the conversation management system 101 of the present specification. The conversation management system 101 can be implemented in, for example, a desktop or laptop computer system. Multiple users converse directly face to face, and the conversation management system 101 present at the place of the conversation supports the conversation between the users. In the example of FIG. 1B, four users 105 are participating in the conversation.

Each participant can converse with the other participants without using a device. The conversation management system 101 can acquire audio data of the utterances of the users 105 through its built-in microphone and can speak to the participants through its speaker. The conversation management system 101 may also be equipped with a camera so that it can acquire an image of each user 105.

FIG. 2 shows a hardware configuration example of the conversation management system 101. The conversation management system 101 includes a main storage device 202 composed of volatile storage elements such as RAM, and an auxiliary storage device 203 composed of suitable non-volatile storage elements such as an SSD (Solid State Drive) or a hard disk drive.

The conversation management system 101 further includes an arithmetic device 201 such as a CPU. The arithmetic device 201 executes a program 206 stored in the auxiliary storage device 203 or the like, for example by reading it into the main storage device 202, performs overall control of the system itself, and performs various determination, calculation, and control processes. The conversation management system 101 also includes a communication interface 204 for connecting to a network and exchanging data.

The conversation management system 101 includes an input device 208, such as a keyboard, mouse, touch panel, microphone, or camera, that accepts input operations, and an output device 209, such as a display that shows processing results or a speaker that outputs sound. The components of the conversation management system 101 can communicate with one another via an internal bus 205. Some of the components of the conversation management system 101 may be omitted; for example, the output device 209 and the input device 208 can be omitted in the configuration example shown in FIG. 1A.

The auxiliary storage device 203 stores the program 206 for implementing the functions required of the conversation management system 101 of this embodiment, as well as an information database (DB) 207 that holds the data required for the various processes. FIG. 2 illustrates the program 206 after it has been loaded from the auxiliary storage device 203 into the main storage device 202.

FIG. 3 shows a hardware configuration example of the user terminal 104. The user terminal 104 includes a main storage device 302 composed of volatile storage elements such as RAM, and an auxiliary storage device 303 composed of suitable non-volatile storage elements such as an SSD or a hard disk drive. The user terminal 104 further includes an arithmetic device 301 such as a CPU, which executes a program 308 stored in the auxiliary storage device 303 or the like, for example by reading it into the main storage device 302, performs overall control of the terminal itself, and performs various determination, calculation, and control processes. The program 308 includes a program that enables the user terminal 104 to participate in a conversation over the network.

The user terminal 104 includes a communication interface 306 for connecting to a network and exchanging data, an input device 304, such as a keyboard, mouse, touch panel, microphone, or camera, that accepts input operations from the user, and an output device 305, such as a display that shows processing results to the user or a speaker that outputs sound. These components of the user terminal 104 can communicate with one another via an internal bus 307. The auxiliary storage device 303 stores the program 308 for implementing the required functions, as well as the required information (data).

Next, the functions of the conversation management system 101 are described. FIG. 4 shows a functional configuration example of the conversation management system 101 according to an embodiment of the present specification. The functions described below are implemented, for example, by the arithmetic device 201 of the conversation management system 101 executing the program 206.

The program 206 of the conversation management system 101 includes a speech recognition unit 401, a participant identification unit 403, an audio feature extraction unit 405, a video feature extraction unit 407, an utterance destination determination unit 411, and an agent 413. The database 207 includes participant-specific feature management information 421 and utterance history management information 423. Details of the participant-specific feature management information 421 and the utterance history management information 423 are described later with reference to FIGS. 5 and 6, respectively.

The speech recognition unit 401 recognizes utterances by users. Speech is received through the microphone of the conversation management system 101 (the example of FIG. 1B) or from a user terminal 104 over the network (the example of FIG. 1A). For example, the speech recognition unit 401 can extract a stretch of speech as one utterance when it is followed by a predefined period of silence, for example 200 milliseconds or longer.
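The silence-based extraction rule can be sketched as below. This is a hedged illustration, not the disclosed implementation: it assumes that a voice-activity decision (speech or silence) is already available for each fixed-length frame, and the function name, the 10 ms frame length, and the handling of speech that runs to the end of the input are all assumptions. Only the 200 ms threshold comes from the text.

```python
from typing import Iterable


def segment_utterances(
    frames: Iterable[bool],
    frame_ms: int = 10,          # assumed frame length
    min_silence_ms: int = 200,   # silence threshold taken from the description
) -> list[tuple[int, int]]:
    """Return (start_ms, end_ms) spans of utterances.

    `frames` holds per-frame voice-activity flags (True = speech). A span is
    closed once speech has been followed by at least `min_silence_ms` of silence.
    """
    utterances: list[tuple[int, int]] = []
    start = None            # start time of the utterance in progress
    last_voiced_end = 0     # end time of the most recent speech frame
    silence_ms = 0          # silence accumulated since the last speech frame
    t = 0
    for voiced in frames:
        if voiced:
            if start is None:
                start = t
            last_voiced_end = t + frame_ms
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                utterances.append((start, last_voiced_end))
                start = None
                silence_ms = 0
        t += frame_ms
    if start is not None:
        # Assumption: speech still open when the input ends is emitted as well.
        utterances.append((start, last_voiced_end))
    return utterances
```

Pauses shorter than the threshold stay inside one utterance, so a brief hesitation does not split a sentence in two.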

The speech recognition unit 401 further converts the user's audio data into text. The result of converting an utterance is the text feature of that utterance. Given speech as input, the speech recognition unit 401 converts it into vocabulary items or words and outputs them. The text of the utterance is stored in the utterance history management information 423.

In a situation such as that shown in FIG. 1B, the participant identification unit 403 determines the position from which a participant spoke and associates the determined position with a participant ID. In the case of a conversation over a network using the user terminals 104, as shown in FIG. 1A, the participant identification unit 403 can determine the participant ID from the participant's login ID without determining the participant's position.

The audio feature extraction unit 405 extracts audio features from the audio data of an utterance by a predetermined processing method. Examples of audio features include MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction), and pitch. The audio feature extraction unit 405 stores the extracted audio features in the utterance history management information 423.

The video feature extraction unit 407 extracts video features from the video data of the speaker by a predetermined processing method. The video features are generated from the pixel values of the video data using, for example, a convolutional neural network. The video data is received from the camera of the conversation management system 101 (the example of FIG. 1B) or from a user terminal 104 over the network (the example of FIG. 1A). The video feature extraction unit 407 stores the extracted video features in the utterance history management information 423. In addition to the speaker's video features, the video feature extraction unit 407 may extract the video features of other users and store them in the utterance history management information 423.

The utterance destination determination unit 411 estimates the utterance destination of the current utterance based on the speaker's current utterance and the listeners' past utterance histories. As described later, the estimation may be based on the listeners' participant-specific features and on the features of the audio data, text data, and video data at the time of the utterance.

The agent 413 addresses the conversation participants by voice, or by voice and video, to encourage the conversation to progress. Based on the recognized utterance destination, the agent 413 can determine whether it may speak; based on the recognized utterance and utterance destination, it can determine the content and the destination of the utterance it outputs.

FIG. 5 shows a configuration example of the participant-specific feature management information 421. The participant-specific feature management information 421 manages the features specific to each participant in the conversation. It is set and registered in advance, before the conversation starts. In the configuration example shown in FIG. 5, the participant-specific feature management information 421 includes a participant ID column 451, a name column 452, a role column 453, a soft skill column 454, and a hard skill column 455.

The participant ID column 451 indicates the ID of each participant in the conversation. The name column 452 indicates the name of each participant. The role column 453 indicates the role assigned to each participant, which can be the participant's role in an organization or in the conversation. In the example of FIG. 5, the role column 453 indicates a position within a company. The soft skill column 454 indicates general (non-specialized) skills that are not based on the participant's specialized knowledge. The hard skill column 455 indicates skills based on the participant's specialized knowledge (specialized skills). The participant-specific features shown in FIG. 5 are merely examples, and features different from these may be registered.
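The table layout described above can be pictured as one record per participant. The sketch below is illustrative only: the class and field names are hypothetical mappings of columns 451 to 455, the string and tuple types are assumptions, and the sample values in the usage note are invented for the example rather than taken from FIG. 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParticipantFeatures:
    """One row of the participant-specific feature table (field names are assumptions)."""
    participant_id: str                   # participant ID column 451
    name: str                             # name column 452
    role: str                             # role column 453
    soft_skills: tuple[str, ...] = ()     # soft skill column 454
    hard_skills: tuple[str, ...] = ()     # hard skill column 455


class ParticipantRegistry:
    """Holds the rows registered before the conversation starts."""

    def __init__(self) -> None:
        self._rows: dict[str, ParticipantFeatures] = {}

    def register(self, row: ParticipantFeatures) -> None:
        """Add a row; a duplicate participant ID is rejected."""
        if row.participant_id in self._rows:
            raise ValueError(f"duplicate participant ID: {row.participant_id}")
        self._rows[row.participant_id] = row

    def get(self, participant_id: str) -> ParticipantFeatures:
        """Look up a row by participant ID (raises KeyError if it is missing)."""
        return self._rows[participant_id]
```

For example, `ParticipantRegistry().register(ParticipantFeatures("P1", "Alice", "manager", ("facilitation",), ("accounting",)))` registers one participant with made-up values; the frozen dataclass reflects that the rows are fixed before the conversation begins.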

FIG. 6 shows a configuration example of the utterance history management information 423. The utterance history management information 423 manages the history of utterances in the conversation and is updated as the conversation progresses. In the configuration example shown in FIG. 6, the utterance history management information 423 includes an utterance ID column 471, a speaker ID column 472, an utterance text column 473, an utterance time column 474, an audio feature column 475, and a video feature column 476.

The utterance ID column 471 indicates the ID of an utterance. The speaker ID column 472 indicates the ID of the participant who made the utterance (the speaker). The utterance text column 473 indicates the text of the utterance (its text feature). As described above, the utterance text is generated by the speech recognition unit 401 and stored in the utterance history management information 423. The utterance time column 474 indicates the time of the utterance, for example its start time.

音声特徴量欄４７５は、発話の音声特徴量を示す。上述のように、音声特徴量は音声特徴量抽出部４０５によって抽出され、発話履歴管理情報４２３に格納される。映像特徴量欄４７６は、発話を行った話し手の映像特徴量を示す。上述のように、映像特徴量は映像特徴量抽出部４０７によって抽出され、発話履歴管理情報４２３に格納される。話し手と異なる他のユーザの映像特徴量も、合わせて格納されてもよい。 The voice feature quantity column 475 indicates the voice feature quantity of the utterance. As described above, the speech feature quantity is extracted by the speech feature quantity extraction unit 405 and stored in the utterance history management information 423 . The video feature quantity column 476 indicates the video feature quantity of the speaker who made the utterance. As described above, the video feature quantity is extracted by the video feature quantity extraction unit 407 and stored in the speech history management information 423 . Video feature amounts of other users different from the speaker may also be stored together.

発話履歴管理情報４２３は、会話におけるすべての発話の情報を格納してもよく、発話者ＩＤ毎に規定数以下の発話の情報のみを格納してもよい。発話履歴管理情報４２３は、発話テキストや特徴量自体を格納するのではなく、これらのファイルを示すパスを管理してもよい。なお、発話履歴管理情報４２３は、図６に示す情報の一部のみを管理していてもよい。 The utterance history management information 423 may store information on all utterances in a conversation, or may store only information on utterances of a specified number or less for each speaker ID. The utterance history management information 423 may manage paths indicating these files instead of storing the utterance texts and the feature values themselves. Note that the utterance history management information 423 may manage only part of the information shown in FIG.
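
The record layout of FIG. 6, together with the option of keeping only a bounded number of utterances per speaker ID, could be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the names `UtteranceRecord`, `UtteranceHistory`, and `max_per_speaker` are hypothetical, the time unit is assumed, and the features are held as plain lists.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class UtteranceRecord:
    """One row of the utterance history (columns 471-476 in FIG. 6)."""
    utterance_id: int
    speaker_id: str
    text: str
    start_time: float  # seconds from the start of the conversation (assumed unit)
    audio_features: List[float] = field(default_factory=list)
    video_features: List[float] = field(default_factory=list)


class UtteranceHistory:
    """Keeps at most `max_per_speaker` most recent utterances per speaker ID."""

    def __init__(self, max_per_speaker: int) -> None:
        self._by_speaker: Dict[str, Deque[UtteranceRecord]] = defaultdict(
            lambda: deque(maxlen=max_per_speaker)
        )

    def add(self, record: UtteranceRecord) -> None:
        # deque(maxlen=...) silently drops the oldest record when full.
        self._by_speaker[record.speaker_id].append(record)

    def recent(self, speaker_id: str) -> List[UtteranceRecord]:
        # Oldest first; empty list for a speaker with no utterances yet.
        return list(self._by_speaker.get(speaker_id, ()))
```

Storing file paths instead of the features themselves, as the text also allows, would only change the field types.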

以下において、会話管理システム１０１の動作を説明する。図７は、会話管理システム１０１による動作例のフローチャートである。会話管理システム１０１は、３人以上の参加者が存在する会話において、話し手の発話先を推定する。参加者は、人のユーザに加えて、会話管理システム１０１のエージェント４１３が加わることができる。 The operation of the conversation management system 101 will be described below. FIG. 7 is a flowchart of an operation example of the conversation management system 101. The conversation management system 101 estimates the speech destination of a speaker in a conversation involving three or more participants. In addition to human users, the agent 413 of the conversation management system 101 can join the conversation as a participant.

まず、会話管理システム１０１は、会話の音声データ及び映像データを受信する（Ｓ１１）。音声データは会話管理システム１０１のマイク又はユーザ端末１０４からネットワークを介して受信され得、映像データは会話管理システム１０１のカメラ又はユーザ端末１０４からネットワークを介して受信され得る。 First, the conversation management system 101 receives conversation audio data and video data (S11). Audio data can be received from the microphone of conversation management system 101 or user terminal 104 over the network, and video data can be received from the camera of conversation management system 101 or user terminal 104 over the network.

次に、音声認識部４０１は、音声データを解析して、一つの発話を認識して、その音声データを抽出する（Ｓ１２）。具体的には、音声をテキストに変換し、テキスト特徴量を抽出する。音声認識部４０１は、音声の後にあらかじめ定義された無音時間以上の無音が続く場合に、その音声を発話として抽出し、テキストに変換して出力することができる。 Next, the speech recognition unit 401 analyzes the speech data, recognizes one utterance, and extracts the speech data (S12). Specifically, speech is converted into text, and text features are extracted. The speech recognition unit 401 can extract the speech as an utterance, convert it into text, and output it when silence continues for a predetermined silence time or longer after the speech.
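
The silence-based segmentation in step S12 could be sketched as follows. This is an illustration under assumptions, not the method of the speech recognition unit 401: it assumes a per-frame speech/non-speech flag is already available from some voice activity detector, and the names `segment_utterances`, `frame_sec`, and `min_silence_sec` are hypothetical.

```python
from typing import List, Tuple


def segment_utterances(
    is_speech: List[bool], frame_sec: float, min_silence_sec: float
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) for each detected utterance.

    An utterance is closed once speech is followed by at least
    `min_silence_sec` of silence. Speech still open at the end of the
    input is not emitted, because its trailing silence has not been seen.
    """
    segments: List[Tuple[int, int]] = []
    start = None        # first frame of the utterance being built
    last_speech = None  # most recent speech frame
    for i, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None:
            silence = (i - last_speech) * frame_sec
            if silence >= min_silence_sec:
                segments.append((start, last_speech + 1))
                start = None
    return segments
```

Short pauses inside an utterance do not split it; only a pause at least as long as the predefined silence time closes the segment.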

次に、参加者特定部４０３は、発話の音声データを解析して、対応する話し手を判定する（Ｓ１３）。参加者特定部４０３は、発話の音声データから、機械学習（例えばダイアライゼーション技術）を利用して、話し手のＩＤを決定することができる。ダイアライゼーションは、音声を入力とし、話し手が誰であるかを出力する。 Next, the participant identification unit 403 analyzes the voice data of the utterance and determines the corresponding speaker (S13). The participant identification unit 403 can determine the speaker's ID from the voice data of the utterance using machine learning (for example, diarization technology). Diarization takes speech as input and outputs who the speaker is.

また、機械学習を用いない方法として、参加者特定部４０３は、複数チャネルのマイクアレイを含む特定のマイクデバイスを用いて、話し手とマイクとの相対的な位置関係を計算して取得することも可能である。そして、話し手の位置情報に関連付けられた参加者のＩＤによって、発話者を特定することができる。映像情報も利用できる場合、顔画像に基づく人物同定を用いてもよい。さらに顔の口唇の動きに基づく発話者推定、または画像情報と音声情報の両方を用いた発話者の推定を行ってもよい。 As a method that does not use machine learning, the participant identification unit 403 may also use a specific microphone device including a multi-channel microphone array to calculate the relative positional relationship between the speaker and the microphone. The speaker can then be identified by the participant ID associated with the speaker's position information. If video information is also available, person identification based on face images may be used. Furthermore, speaker estimation based on lip movements, or speaker estimation using both image information and audio information, may be performed.

次に、発話の音声データ及び発話時間における話し手の映像データから特徴量が抽出される（Ｓ１４）。具体的には、音声認識部４０１はテキスト特徴量を抽出し、音声特徴量抽出部４０５は音声データの特徴量を抽出し、映像特徴量抽出部４０７は話し手の映像データから映像特徴量を抽出する。音声認識部４０１は、既知の自動音声認識技術を使用することができる。音声認識部４０１、音声特徴量抽出部４０５及び映像特徴量抽出部４０７は、例えば、ニューラルネットワークを使用して構成することができる。音声認識部４０１、音声特徴量抽出部４０５及び映像特徴量抽出部４０７は、それぞれ、抽出した特徴量を、発話履歴管理情報４２３のレコードに格納する（Ｓ１５）。 Next, feature amounts are extracted from the audio data of the utterance and from the video data of the speaker during the utterance time (S14). Specifically, the speech recognition unit 401 extracts the text feature amount, the audio feature extraction unit 405 extracts the feature amount of the audio data, and the video feature extraction unit 407 extracts the video feature amount from the video data of the speaker. The speech recognition unit 401 can use known automatic speech recognition technology. The speech recognition unit 401, the audio feature extraction unit 405, and the video feature extraction unit 407 can each be configured using, for example, a neural network. Each of these units stores the feature amount it extracted in a record of the utterance history management information 423 (S15).

次に、発話先判定部４１１は、対象となる現在の発話の発話先を判定する（Ｓ１６）。発話先の判定方法の詳細は後述する。本明細書の一実施形態において、発話先判定部４１１は、発話履歴管理情報４２３から、現在の発話の情報及び過去の発話の情報を取得し、さらに、参加者固有特徴量管理情報４２１から、参加者それぞれの固有特徴量を取得する。発話先判定部４１１は、取得した情報に基づいて、発話先を判定する。 Next, the speech destination determination unit 411 determines the speech destination of the current target utterance (S16). The details of the determination method will be described later. In one embodiment of the present specification, the speech destination determination unit 411 acquires information on the current utterance and on past utterances from the utterance history management information 423, and further acquires the unique feature amounts of each participant from the participant-specific feature amount management information 421. The speech destination determination unit 411 determines the speech destination based on the acquired information.

本明細書の一実施形態において、発話先判定部４１１は、発話先の種類を判定する（Ｓ１７）。具体的には、発話先が、参加者の全ての聞き手（グループ全体）、特定の一人の聞き手ユーザ、又は会話管理システム１０１が実行しているエージェント、のいずれであるかが判定される。特定の一人のユーザが発話先である場合、発話先判定部４１１は、そのユーザのユーザＩＤを決定する。 In one embodiment of the present specification, the speech destination determination unit 411 determines the type of the speech destination (S17). Specifically, it is determined whether the speech destination is all listeners of the participants (the whole group), a specific single listener user, or an agent that the conversation management system 101 is executing. When a specific user is the speech destination, the speech destination determination unit 411 determines the user ID of the user.

発話先判定部４１１の出力は、以下のいずれかである。「ユーザＩＤ」、「エージェント」、又は「グループ」である。発話先が「エージェント」である場合、エージェント４１３は、応答を生成して（Ｓ１８）、出力する（Ｓ１９）。発話先が特定の「ユーザＩＤ」又は「グループ」である場合、エージェント４１３は、フィードバックが許されるか判定する（Ｓ２０）。許される場合（Ｓ２０：ＹＥＳ）、エージェント４１３は、応答を生成して（Ｓ１８）、出力する（Ｓ１９）。許されない場合、フローはステップＳ１１に戻る。 The output of the speech destination determination unit 411 is one of the following. It can be "user ID", "agent" or "group". If the utterance destination is "agent", the agent 413 generates a response (S18) and outputs it (S19). If the speech destination is a specific "user ID" or "group", the agent 413 determines whether feedback is permitted (S20). If permitted (S20: YES), the agent 413 generates a response (S18) and outputs it (S19). If not, flow returns to step S11.
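
The branching in steps S17 to S20 could be sketched as follows. This is only an illustration: `handle_destination` is a hypothetical name, and the feedback check and the response generator are passed in as callables because the text does not fix how they are implemented.

```python
from typing import Callable, Optional


def handle_destination(
    destination: str,
    feedback_allowed: Callable[[str], bool],
    generate_response: Callable[[str], str],
) -> Optional[str]:
    """Mirror steps S17-S20: return the agent's response, or None.

    `destination` is "agent", "group", or a user ID.
    """
    if destination == "agent":
        return generate_response(destination)      # S18, S19
    if feedback_allowed(destination):              # S20: YES
        return generate_response(destination)      # S18, S19
    return None                                    # S20: NO -> back to S11
```

Returning `None` stands for the flow going back to step S11 to wait for the next audio and video data.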

会話管理システム１０１のエージェント４１３は、発話先判定部４１１の出力がエージェント４１３に向けた発話でなくても、フィードバックを与えられる場合がある。以下に、その条件の例を示す。 The agent 413 of the conversation management system 101 may give feedback even when the output of the speech destination determination unit 411 indicates that the utterance was not directed to the agent 413. Examples of such conditions are shown below.

発話先判定部４１１が、発話先の「ユーザＩＤ」を返し、エージェント４１３が、特定の参加者のフィードバックを所定の時間検出できない場合がある。この場合、エージェント４１３は、例えば、対応するユーザＩＤを音声で呼び、その人の回答を促してもよい。 In some cases, the speech destination determination unit 411 returns the "user ID" of the speech destination, and the agent 413 cannot detect the feedback of a specific participant for a predetermined time. In this case, agent 413 may, for example, call the corresponding user ID by voice to prompt that person to answer.

発話先判定部４１１が、発話先が「グループ」であることを返す場合、このグループは、会話管理システム１０１のエージェント４１３を含む。したがって、エージェント４１３は、それが持つ知識に基づいてフィードバックを与えてもよい。 When speech destination determination unit 411 returns that the speech destination is “group”, this group includes agent 413 of conversation management system 101 . Agent 413 may therefore provide feedback based on the knowledge it possesses.

なお、会話管理システム１０１は、映像データの特徴量（視覚的特徴量）、音声特徴量またはテキスト特徴量を利用して、追加情報として参加者の感情を検出してもよい。この感情推定は、ニューラルネットワークを利用して、聞き手が話を理解しているかどうかなど、参加者の感情を推定することができる。エージェント４１３は、参加者の感情に応じてフィードバックを返してもよい。例えば、聞き手が話を理解していないと判定された場合、話し手に説明を繰り返すことを促してもよい。 Note that the conversation management system 101 may detect the emotions of the participants as additional information using the feature amount (visual feature amount) of the video data, the audio feature amount, or the text feature amount. This emotion estimation can use neural networks to estimate the emotions of the participants, such as whether the listener understands the speech. Agent 413 may give feedback according to the participant's emotions. For example, if it is determined that the listener does not understand the story, the speaker may be prompted to repeat the explanation.

以下において、発話先判定部４１１による処理の詳細を説明する。発話先判定部４１１は、参加者の過去の発話履歴に基づいて、現在の発話の発話先を判定する。これにより、より適切な発話先の判定が可能となる。上述のように、発話履歴管理情報４２３は、各話し手の過去の発話の特徴量情報を格納する。発話履歴管理情報４２３は、直前の発話だけでなく、複数の過去の発話の履歴を蓄積する。 Details of the processing by the speech destination determination unit 411 will be described below. The utterance destination determination unit 411 determines the utterance destination of the current utterance based on the past utterance history of the participant. This makes it possible to determine the utterance destination more appropriately. As described above, the utterance history management information 423 stores the feature amount information of each speaker's past utterances. The utterance history management information 423 accumulates not only the immediately preceding utterance, but also the history of a plurality of past utterances.

各参加者の発話履歴には、例えば、最大でｋ個の最新の入力特徴量が保存される。ｋは自然数であり、例えば、２以上の自然数である。過去の特徴量情報は、ｋ個の発話履歴、つまり、各参加者から会話中に収集されたｋ個の過去の入力特徴量を意味する。抽出された発話履歴は、過去の特徴量情報とみなされる。 Each participant's utterance history stores, for example, up to k latest input feature amounts. k is a natural number, for example, a natural number of 2 or more. The past feature amount information means k utterance histories, that is, k past input feature amounts collected during conversation from each participant. The extracted speech history is regarded as past feature amount information.

例えば、テキスト特徴量のみを考慮し、会話の参加者が４人である場合、収集された発話は、例えば、図８に示すサンプルのようになる。現在の発話は、ＪＡＭＥＳによる発話である。図８は、ＪＡＭＥＳの現在の発話テキストを示している。過去の発話テキストの例として、ＥＬＥＮＡ、ＲＯＢＹ、エージェント４１３の発話テキストが示されている。発話テキストは、発話のテキスト特徴量である。発話先判定部４１１は、現在の発話者の発話のテキスト特徴量と他の参加者の過去発話のテキスト特徴量との間の関係に基づき、発話先を判定する。 For example, if only text features are considered and there are four participants in the conversation, the collected utterances will look like the sample shown in FIG. 8, for example. The current utterance is the utterance by JAMES. FIG. 8 shows the current spoken text of JAMES. As examples of past spoken texts, the spoken texts of ELENA, ROBY, and agent 413 are shown. Spoken text is a text feature of an utterance. The speech destination determination unit 411 determines the speech destination based on the relationship between the text feature amount of the current speaker's utterance and the text feature amount of the past utterances of other participants.

発話先判定部４１１は、例えば、各聞き手が発話先である確率を，ニューラルネットワークベースのアーキテクチャで生成することができる。図９は、発話先判定部４１１が発話先の判定に使用することができるニューラルネットワークへの入力及び出力の例を示す。 The speech destination determination unit 411 can generate, for example, the probability that each listener is the speech destination using a neural network-based architecture. FIG. 9 shows an example of inputs and outputs to a neural network that can be used by the speech destination determination unit 411 to determine the speech destination.

ニューラルネットワーク５０１は、一人の聞き手が発話先でありうる確率５２１を出力する。発話先判定部４１１は、各聞き手について、ニューラルネットワークの推論計算を実行して、各聞き手が発話先である確率を算出できる。このニューラルネットワークの構成は限定されるものではないが、たとえばＴｒａｎｓｆｏｒｍｅｒ、リカレントニューラルネットワーク、畳み込みニューラルネットワークを利用することができる。 The neural network 501 outputs a probability 521 that one listener is the speech destination. The speech destination determination unit 411 can calculate the probability that each listener is the speech destination by executing the neural network inference calculation once per listener. The configuration of this neural network is not limited; for example, a Transformer, a recurrent neural network, or a convolutional neural network can be used.

図９に示すように、ニューラルネットワーク５０１の入力は、発話者の発話情報である入力特徴量５１１、聞き手の過去発話のテキスト特徴量（ＬｉｓｔｅｎｅｒＵｔｔｅｒａｎｃｅ）５１２、そして聞き手の固有特徴量５１３が入力される。図９に示す構成例において、発話先判定部４１１は、聞き手の過去の発話履歴に加えて、聞き手の固有特徴量と話し手の現在の発話のテキスト特徴量との関係に基づき、発話先を判定する。発話者の発話情報である入力特徴量５１１には、発話者が発言したテキストから得たテキスト特徴量、発話者の音声から得た音声特徴量、発話者の固有特徴量を含んでよい。 As shown in FIG. 9, the inputs of the neural network 501 are an input feature amount 511, which is the utterance information of the speaker, a text feature amount (Listener Utterance) 512 of the listener's past utterances, and a unique feature amount 513 of the listener. In the configuration example shown in FIG. 9, the speech destination determination unit 411 determines the speech destination based on the relationship between the listener's unique feature amount and the text feature amount of the speaker's current utterance, in addition to the listener's past utterance history. The input feature amount 511 may include a text feature amount obtained from the text uttered by the speaker, an audio feature amount obtained from the speaker's voice, and the unique feature amount of the speaker.

本実施を行う別形態について説明する。図１３から図１６には、本実施形態をリカレントニューラルネットワーク（ＲＮＮ）にて実現する際のニューラルネットワークの構成を示す。 Another form for carrying out the present embodiment will be described. FIGS. 13 to 16 show the configuration of a neural network when implementing this embodiment with a recurrent neural network (RNN).

図１３は、話し手の発話先がグループか個人か識別するＲＮＮである。一般素性（Ｇｅｎｅｒｉｃｆｅａｔｕｒｅ）１３１０は、発話先を識別するために役立つ情報で構成される。たとえば、一個前の発話において発話先が誰であったかを示す情報、発話されたテキストが接続詞で始まっていたかを示す情報などを使うことができる。これらの情報をベクトルの形として構成し、一般素性として使用する。 FIG. 13 shows an RNN that identifies whether the speech destination of the speaker is a group or an individual. The generic features 1310 consist of information that helps identify the speech destination. For example, information indicating who the speech destination was in the immediately preceding utterance, or information indicating whether the uttered text started with a conjunction, can be used. These pieces of information are arranged as a vector and used as the generic features.

ＲＮＮ１３２０は、音声特徴量抽出部４０５から得た話し手の音声特徴量を受理する。ここではＲＮＮセルとしてＬＳＴＭ（Ｌｏｎｇｓｈｏｒｔ－ｔｅｒｍｍｅｍｏｒｙ）を用いる。 The RNN 1320 receives the speaker's speech feature obtained from the speech feature extraction unit 405 . Here, an LSTM (Long short-term memory) is used as the RNN cell.

ＬＳＴＭ１３２０における処理の詳細を記述したものを図１４Ａに示す。時刻ごとの話し手の音声の音響特徴量１４１０は、ＬＳＴＭの前向きネットワークのセル１４３０、及び逆向きネットワークのセル１４４０に入力される。前向き計算の最終セル、及び逆向き計算の最終セルのそれぞれの隠れ層の値のベクトルを連結し、最終的なＬＳＴＭの出力ベクトル１４５０を得る。 A detailed description of the processing in LSTM 1320 is shown in FIG. 14A. Acoustic features 1410 of the speaker's speech for each time are input to forward network cell 1430 and backward network cell 1440 of the LSTM. Concatenate the vector of hidden layer values of the final cell of the forward computation and the final cell of the backward computation to obtain the final LSTM output vector 1450 .

図１３に戻って、ロジスティック回帰（ＬｏｇｉｓｔｉｃＲｅｇｒｅｓｓｉｏｎ）計算を行う層１３３０は、ＬＳＴＭ１３２０の出力ベクトルの次元数を減らす。このロジスティック回帰１３３０は、省略してもよい。 Returning to FIG. 13, a layer 1330 performing Logistic Regression calculations reduces the dimensionality of the LSTM 1320 output vector. This logistic regression 1330 may be omitted.

単語埋め込み層１３４０は、自然言語で表される特徴に含まれる単語をベクトルに変換する。単語埋め込み層で１３４０の出力は、ＬＳＴＭ１３５０に入力される。自然言語で表される特徴には、第１に、話し手が発言したテキストがある。さらに、話し手のソフトスキル、ハードスキル、役割、名称もある。これらは、すなわち話し手の固有特徴量である。なお、自然言語で表される特徴であれば、これら以外の特徴を入力しても構わない。 The word embedding layer 1340 converts words included in features expressed in natural language into vectors. The output of word embedding layer 1340 is input to LSTM 1350 . Features expressed in natural language include, first, the text uttered by the speaker. In addition, there are speaker soft skills, hard skills, roles, and names. These are speaker's characteristic features. Note that features other than these may be input as long as they are features expressed in natural language.

単語を単語埋め込み層１３４０とＬＳＴＭ１３５０に入力する部分の詳細を、図１４Ｂを参照し説明する。自然言語で表される特徴のそれぞれの単語１４６１は、単語埋め込み層１４６２に入力され、固定長のベクトルに変換される。単語埋め込み層１４６２の出力は、ＬＳＴＭの前向きネットワークのセル１４６３、及び逆向きネットワークのセル１４６４に入力される。前向き計算の最終セル、および逆向き計算の最終セルのそれぞれの隠れ層の値のベクトルを連結し、最終的なＬＳＴＭの出力ベクトル１４６５とする。 The details of inputting words into word embedding layer 1340 and LSTM 1350 are described with reference to FIG. 14B. Each word 1461 of the natural language features is input to the word embedding layer 1462 and converted to a fixed-length vector. The output of word embedding layer 1462 is input to forward network cell 1463 and backward network cell 1464 of the LSTM. Concatenate the vectors of the hidden layer values of the last cell of the forward calculation and the last cell of the backward calculation to obtain the final LSTM output vector 1465 .

図１３に戻って、層１３６０は、入力されるベクトルを連結することで、１個の多次元ベクトルを構成する。連結されて生成された多次元ベクトルは、多層パーセプトロン１３７０に入力され、出力として発話先がグループである確率１３８０を出力する。 Returning to FIG. 13, layer 1360 constructs one multidimensional vector by concatenating the input vectors. The concatenated and generated multidimensional vectors are input to a multi-layer perceptron 1370, which outputs a probability 1380 that the utterance destination is a group.

図１５は、発話先が個人である場合、どの聞き手が発話先であるかを判断するＲＮＮである。本実施形態では、それぞれの聞き手に関して、当該聞き手が発話先である確率を計算するニューラルネットワークを構成する。ニューラルネットワーク１５７０、１５７１は、それぞれ第１番目、第ｉ番目の聞き手が発話先である確率を計算する。ニューラルネットワーク１５７０、１５７１の内部にあるネットワークの構成およびパラメータは同一であるため、ここではニューラルネットワーク１５７０のみを説明する。 FIG. 15 shows an RNN that determines which listener is the speech destination when the speech destination is an individual. In this embodiment, a neural network is configured for each listener to calculate the probability that the listener is the speech destination. Neural networks 1570 and 1571 calculate the probability that the first and the i-th listener, respectively, is the speech destination. Since the internal configuration and parameters of neural networks 1570 and 1571 are identical, only neural network 1570 is described here.

単語埋め込み層１５１０は、自然言語で表現される特徴に含まれる各単語を有限長のベクトルに変換する。この単語埋め込み層には、話し手が発言したテキスト、聞き手が過去に発言したテキスト、聞き手の固有特徴量を表すテキストが入力される。それぞれのテキストに含まれる単語を単語埋め込み層に入力することで、単語ごとに固定長のベクトルを得る。 The word embedding layer 1510 transforms each word in the natural language features into a finite length vector. In this word embedding layer, the text uttered by the speaker, the text uttered by the listener in the past, and the text representing the unique feature amount of the listener are input. By inputting the words contained in each text into the word embedding layer, we obtain a fixed-length vector for each word.

ＭａＬＳＴＭ１５２０は、ＬＳＴＭを使い２個のテキストの関係性の大きさをマンハッタン距離に基づいて計算するニューラルネットワークである。１個のＭａＬＳＴＭは、話し手が発言したテキストと、聞き手に関するテキストで表現できる特徴のうち１個を入力とする。 MaLSTM 1520 is a neural network that uses LSTMs to calculate the magnitude of the relationship between two texts based on the Manhattan distance. Each MaLSTM takes two inputs: the text uttered by the speaker, and one of the listener's features that can be expressed as text.

ＭａＬＳＴＭ１５２０の構成を図１６に示す。ＭａＬＳＴＭは、話し手に関するテキストで表される情報１６１０と、聞き手に関するテキストで表される情報１６２０を受け付ける。ここでいうテキストで表される情報とは、発言したテキストであったり、聞き手および話し手の固有特徴量のテキストであったりする。 FIG. 16 shows the configuration of MaLSTM1520. MaLSTM accepts textual information 1610 about the speaker and textual information 1620 about the listener. The information represented by the text here may be the text of the utterance, or the text of the unique feature amount of the listener and the speaker.

話し手の情報は、単語列１６１１で表される。これを単語埋め込み層に入力することにより単語それぞれのベクトル１６１２に変換する。さらに単語のベクトルは、ＬＳＴＭの前向きネットワーク１６１３、および逆向きネットワーク１６１４に入力される。そして、前向きネットワークおよび逆向きネットワークの最終セルの隠れ層の値のベクトルを連結することで、話し手に関する情報のベクトル１６１５を得る。 Information on the speaker is represented by a word string 1611 . By inputting this into the word embedding layer, it is converted into a vector 1612 for each word. The vector of words is then input to forward network 1613 and backward network 1614 of the LSTM. The vector of information about the speaker 1615 is then obtained by concatenating the vectors of hidden layer values of the final cells of the forward and backward networks.

聞き手の情報についても、話し手の情報と同様の処理を行う。すなわち、単語列１６２１、単語埋め込み表現である単語のベクトル１６２２、ＬＳＴＭの前向きネットワーク１６２３、ＬＳＴＭの逆向きネットワーク１６２４、最終セルの隠れ層の値の連結を経て、聞き手に関する情報のベクトル１６２５が得られる。 Listener information is also processed in the same manner as the speaker information. That is, through the concatenation of the word string 1621, the vector 1622 of words that are word embedding expressions, the LSTM forward network 1623, the LSTM backward network 1624, and the value of the hidden layer of the final cell, a vector 1625 of information about the listener is obtained. .

話し手に関する情報、および聞き手に関する情報を使い、両者の関係性を計算する。ここでは、層１６３０において次式を用いマンハッタン距離に基づいた関係性の大きさを計算する。
$$D_{s\text{-}l} = \exp\left(-\lVert h_s - h_l \rVert_1\right)$$
Ｄs-ｌは、話し手ｓの情報１６１５と、聞き手ｌの情報１６２５との間の関係性の大きさを表す。この結果１６４０が、出力される。
Information about the speaker and information about the listener are used to calculate the relationship between the two. Here, layer 1630 uses the following equation to calculate the magnitude of the relationship based on the Manhattan distance.
$$D_{s\text{-}l} = \exp\left(-\lVert h_s - h_l \rVert_1\right)$$
$D_{s\text{-}l}$ represents the magnitude of the relationship between the information 1615 of speaker s and the information 1625 of listener l, where $h_s$ and $h_l$ denote the corresponding vectors. The result 1640 is output.
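
The relationship score computed in layer 1630 can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name `manhattan_similarity` is hypothetical, and the two vectors are passed in directly rather than produced by the bidirectional LSTMs of FIG. 16.

```python
import math
from typing import Sequence


def manhattan_similarity(h_s: Sequence[float], h_l: Sequence[float]) -> float:
    """exp(-||h_s - h_l||_1): 1.0 for identical vectors, toward 0 as they diverge."""
    if len(h_s) != len(h_l):
        raise ValueError("vectors must have the same length")
    l1 = sum(abs(a - b) for a, b in zip(h_s, h_l))
    return math.exp(-l1)
```

Because the L1 distance is non-negative, the score always lies in the interval (0, 1], so it can be concatenated with other features without further scaling.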

図１５に戻って、ＭａＬＳＴＭ１５２０は複数存在する。すべては話し手の発言テキストを受け付けるが、もう片方の入力がそれぞれのＭａＬＳＴＭにおいて異なっている。このもう片方の入力は、たとえば、該当する聞き手の過去の発言のテキストであったり、聞き手の固有特徴量のテキストであったりする。層１５４０は、これらのＭａＬＳＴＭ１５２０の出力を連結し、１個のベクトルを形成する。 Returning to FIG. 15, there are multiple MaLSTM 1520 . All accept the speaker's uttered text, but the other input is different in each MaLSTM. This other input is, for example, the text of the listener's past utterances or the text of the listener's unique feature amount. Layer 1540 concatenates these MaLSTM 1520 outputs to form a single vector.

なお、他の特徴量を使ってもよく、たとえば話し手が該当する聞き手を見ていたかを示すフォーカス情報１５３０を連結してもよい。このフォーカス情報は、話し手の顔画像や目の画像から検出することができる。層１５４０が出力するベクトルは、多層パーセプトロン１５５０に入力され、当該の聞き手が発話先である確率１５６０を出力する。 Other feature amounts may also be used; for example, focus information 1530 indicating whether the speaker was looking at the listener in question may be concatenated. This focus information can be detected from an image of the speaker's face or eyes. The vector output by layer 1540 is input to the multi-layer perceptron 1550, which outputs the probability 1560 that the listener in question is the speech destination.

以上のニューラルネットワークの処理により、発話先がグループか否かの確率、および発話先がそれぞれの聞き手である確率を計算することができる。 Through the processing of the neural network described above, it is possible to calculate the probability of whether or not the utterance destination is a group, and the probability that the utterance destination is each listener.

これにより、より正確に発話先の聞き手を推定することができる。なお、図９に示す聞き手の情報の入力の一部が省略されてよく、例えば、聞き手の固有特徴量や、話し手の音声特徴量及び映像特徴量が省略されてもよい。 This makes it possible to more accurately estimate the listener of the utterance destination. Note that part of the input of the listener's information shown in FIG. 9 may be omitted. For example, the listener's unique feature amount and the speaker's audio feature amount and video feature amount may be omitted.

発話者の入力特徴量は、例えば、発話者が発言した現在のテキスト特徴量、音声特徴量及び映像特徴量を含むことができる。発話者の入力特徴量は、発話者の固有特徴量を含んでもよい。テキスト特徴量、音声特徴量及び映像特徴量は、それぞれ、発話履歴管理情報４２３の現在の発話のレコードにおいて、発話テキスト欄４７３、音声特徴量欄４７５及び映像特徴量欄４７６から取得できる。 The input feature amounts of the speaker can include, for example, the text feature amount, the audio feature amount, and the video feature amount of the speaker's current utterance. The input feature amounts of the speaker may also include the unique feature amount of the speaker. The text feature amount, the audio feature amount, and the video feature amount can be obtained from the utterance text column 473, the audio feature column 475, and the video feature column 476, respectively, in the record of the current utterance in the utterance history management information 423.

聞き手の過去発話のテキスト特徴量５１２は、発話履歴管理情報４２３の発話テキスト欄４７３から、取得できる。聞き手の予め設定された数、例えばｋ個の過去の発話のテキスト特徴量が選択される。例えば、複数の過去発話のテキスト特徴量が選択される。過去のｋ個の発話テキストを用いる場合、それぞれの発話テキストの間に特殊な単語（＜ＳＥＰ＞など）を挟んだうえで、単語列を連結したものを入力として用いることができる。これにより、より適切に発話先確率を算出できる。 The text feature amount 512 of the listener's past utterances can be acquired from the utterance text column 473 of the utterance history management information 423. The text feature amounts of a preset number of the listener's past utterances, for example k utterances, are selected. For example, the text feature amounts of a plurality of past utterances are selected. When k past utterance texts are used, a special word (such as <SEP>) can be placed between consecutive utterance texts, and the concatenated word string can be used as the input. This makes it possible to calculate the speech destination probability more appropriately.
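
The concatenation of k past utterance texts with a separator word could look like the following sketch. The function name `build_listener_input` and the whitespace tokenization are assumptions made for illustration; the text only states that a special word such as <SEP> is placed between utterance texts.

```python
from typing import List

SEP = "<SEP>"  # separator word named in the text as an example


def build_listener_input(past_utterances: List[str], k: int) -> List[str]:
    """Tokenize the k most recent utterances and join them with SEP tokens."""
    if k <= 0:
        return []
    tokens: List[str] = []
    for i, text in enumerate(past_utterances[-k:]):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(text.split())  # whitespace tokenization is an assumption
    return tokens
```

For example, `build_listener_input(["a b", "c", "d e"], 2)` keeps only the last two utterances and yields `["c", "<SEP>", "d", "e"]`.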

聞き手の固有特徴量５１３は、参加者固有特徴量管理情報４２１の聞き手のレコードから取得される。本明細書の一実施形態において、聞き手の発話先確率の算出のために参照される聞き手の固有特徴量５１３は、βsoft skill、βhard skill、βrole、βnameを含む。 The listener's unique feature amount 513 is acquired from the listener's record in the participant's unique feature amount management information 421 . In one embodiment of the present specification, the unique feature amount 513 of the listener referred to for calculating the listener's speech destination probability includes βsoft skill, βhard skill, βrole, and βname.

本明細書の一実施形態において、βsoft skill、βhard skill、βrole、βnameは、それぞれ、聞き手の固有特徴量の各項目において、ニューラルネットワーク５０１、１３４０、１５１０に入力される発話者の特徴量の語、例えば、入力された発話テキスト特徴量又は発話者固有特徴量の語である。 In one embodiment of the present specification, βsoft skill, βhard skill, βrole, and βname are the words of the speaker's feature quantity input to the neural networks 501, 1340, and 1510, respectively, in each item of the listener's unique feature quantity. , for example, input speech text features or speaker-specific features words.

さらに、語をそのまま使うのではなく、語に最も近い関係性を有する語を用いることもできる。具体的には、βsoft skill、βhard skill、βrole、βnameは、聞き手の、「ソフトスキル」、「ハードスキル」、「役割」、「名称」それぞれの語において、入力語に最も近い関係性を有する語である。 Furthermore, instead of using a word as it is, the word with the closest relationship to it can be used. Specifically, βsoft skill, βhard skill, βrole, and βname are the words, among the listener's "soft skill", "hard skill", "role", and "name" words respectively, that have the closest relationship to the input word.

本明細書の一実施形態において、発話者についての入力語にセマンティック（意味的）関係性が高い固有特徴量が選択される。セマンティック関係性はセマンティック空間における距離（セマンティック距離）で表わすことができる。セマンティック距離を計算するいくつかの方法が考えられる。 In one embodiment of the present specification, unique features are selected that have a high semantic relationship to the input word for the speaker. A semantic relationship can be represented by a distance in the semantic space (semantic distance). There are several possible ways to compute the semantic distance.

例えば、マンハッタン距離を使ってセマンティック関係性を計算すると、次のような式になる。
$$D_{s\text{-}t} = \exp\left(-\lVert h_s - h_t \rVert_1\right)$$
Ｄs-tは、話し手ｓ（例：話し手のテキスト特徴量の単語「ｍａｃｈｉｎｅ」）と、聞き手（発話先）ｔ（例：聞き手のハードスキル特徴量における語「ｍａｃｈｉｎｅｌｅａｒｎｉｎｇ」）との間のセマンティック距離（距離スコア）である。話し手と聞き手の特徴量ベクトルは、例えばＬＳＴＭネットワークに渡され、それぞれ隠れ状態（コンテクスト特徴量）ｈsとｈtを更新する。ＬＳＴＭネットワークの出力が、入力された特徴量間のセマンティック関係性の値を示す。
For example, if the Manhattan distance is used to calculate the semantic relationship, the following equation is obtained.
$$D_{s\text{-}t} = \exp\left(-\lVert h_s - h_t \rVert_1\right)$$
$D_{s\text{-}t}$ is the semantic distance (distance score) between the speaker s (e.g., the word "machine" in the speaker's text feature amount) and the listener (speech destination) t (e.g., the term "machine learning" in the listener's hard skill feature amount). The feature vectors of the speaker and the listener are passed to, for example, an LSTM network, which updates the hidden states (context features) $h_s$ and $h_t$, respectively. The output of the LSTM network indicates the value of the semantic relationship between the input features.

出力される関連性は、図１０のように模式的に示すことができる。例えば、「ｍａｃｈｉｎｅ」という単語は、＜ｎａｍｅ＞における「ＡＩ」、＜ハードスキル＞における「ｍａｃｈｉｎｅｌｅａｒｎｉｎｇ」、＜ソフトスキル＞における「ｔｅａｍｗｏｒｋ」、＜役割＞における「ｄｅｓｉｇｎｅｒ」と関係している。それぞれの参加者の固有な特徴量の種類の中で、「ｍａｃｈｉｎｅ」の単語に最も近い単語が、βsoft skill、βhard skill、βrole、βnameとされる。 The output relationships can be schematically shown as in FIG. For example, the word "machine" is related to "AI" in <name>, "machine learning" in <hard skills>, "teamwork" in <soft skills>, and "designer" in <role>. Among the types of features unique to each participant, words closest to the word "machine" are βsoft skill, βhard skill, βrole, and βname.
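
Selecting, for each category of the listener's unique features, the word closest to an input word could be sketched as follows. This is a toy illustration and not the embodiment: `select_beta` and `similarity` are hypothetical names, and the word vectors are supplied as a plain dictionary standing in for the LSTM hidden states described above.

```python
import math
from typing import Dict, List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """exp(-L1 distance), the Manhattan-based score used in the text."""
    return math.exp(-sum(abs(x - y) for x, y in zip(a, b)))


def select_beta(
    input_vec: Sequence[float],
    listener_features: Dict[str, List[str]],
    embed: Dict[str, Sequence[float]],
) -> Dict[str, str]:
    """For each category, pick the listener's word closest to the input word."""
    return {
        category: max(words, key=lambda w: similarity(input_vec, embed[w]))
        for category, words in listener_features.items()
        if words
    }
```

Categories with no registered words are skipped, so a listener with a partially filled record still yields a result for the remaining categories.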

上述のように、発話先判定部４１１は、聞き手それぞれの発話先確率を算出する。このように、聞き手それぞれの確率を互いに独立に計算するので、任意数の聞き手の確率を計算し、発話先を判定することができる。発話先判定部４１１は、最も確率が高い聞き手を発話先と判定することができる。 As described above, the speech destination determination unit 411 calculates the speech destination probability of each listener. In this way, since the probability of each listener is calculated independently of each other, it is possible to calculate the probability of any number of listeners and determine the utterance destination. The speech destination determination unit 411 can determine the listener with the highest probability as the speech destination.

発話先判定部４１１は、全ての聞き手の確率を比較し、所定条件を満たす場合、発話先がグループ（聞き手全員）であると判定する。例えば、発話先判定部４１１は、発話先確率の最小値と最大値との差が予め設定された閾値より小さい場合に、発話先がグループであると判定してもよく、分散が閾値より小さい場合に発話先がグループであると判定してもよい。 The speech destination determination unit 411 compares the probabilities of all listeners and, when a predetermined condition is satisfied, determines that the speech destination is the group (all listeners). For example, the speech destination determination unit 411 may determine that the speech destination is the group when the difference between the minimum and maximum speech destination probabilities is smaller than a preset threshold, or when the variance of the probabilities is smaller than a threshold.
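
The two example conditions for a group decision could be sketched as follows. The function name `decide_destination` and both threshold parameters are hypothetical, and the use of the population variance is an assumption, since the text does not say which variance is meant.

```python
from statistics import pvariance
from typing import Dict, Optional


def decide_destination(
    probs: Dict[str, float],
    spread_threshold: Optional[float] = None,
    variance_threshold: Optional[float] = None,
) -> str:
    """Return "group" if the probabilities are nearly uniform, else the top listener."""
    if not probs:
        raise ValueError("at least one listener is required")
    values = list(probs.values())
    if spread_threshold is not None and max(values) - min(values) < spread_threshold:
        return "group"
    if variance_threshold is not None and pvariance(values) < variance_threshold:
        return "group"
    return max(probs, key=probs.get)
```

When neither threshold is given, the function reduces to picking the listener with the highest probability.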

発話者により発話の特徴量と聞き手の特徴量との間のセマンティック関係性スコアを計算し、そのスコアに基づいて発話先を決定することができる。セマンティック関係性スコアは、例えば、マンハッタン距離で表すことができる。上述のように、３以上の参加者が存在する会話において、発話はグループに向けられていることがある。 It is possible to calculate a semantic relationship score between the speaker's utterance feature amount and the listener's feature amount, and determine the utterance destination based on the score. A semantic relationship score can be represented, for example, by the Manhattan distance. As mentioned above, in conversations with three or more participants, speech may be directed to groups.

発話先判定部４１１は、シングルタスクまたはマルチタスクのニューラルネットワークを使用して、発話が「グループに向けられた」ものか、または「個人に向けられた」ものかを推定できる。図１１は、マルチタスクによる発話先判定処理例を示す。 The speech destination determination unit 411 can use a single-task or multi-task neural network to estimate whether the speech is "directed to a group" or "directed to an individual." FIG. 11 shows an example of speech destination determination processing by multitasking.

図１１において、最初に話し手の入力特徴量（発話テキスト及び固有の特徴量）を計算し、さらにそれぞれの聞き手の特徴量（過去発話テキスト及び固有の特徴量）を計算した後、話し手の発話がグループか個人向けかを判定し（第１のタスク）、もし個人向けであった場合には、どの聞き手に向けた発話であったかを判定する（第２のタスク）。 In FIG. 11, first, the speaker's input feature amount (uttered text and unique feature amount) is calculated, and after further calculating the feature amount of each listener (past utterance text and unique feature amount), the speaker's utterance is Determine whether the utterance is for a group or for an individual (first task), and if it is for an individual, determine which listener the utterance was directed to (second task).

ステップ６０１において、今回の話し手の特徴量を、発話テキスト及び固有の特徴量に基づいて計算する。ステップ６０５では、話し手の特徴量をニューラルネットワークに入力することにより、話し手がグループに話しているのか個人に話しているのかを表す確率を計算する。ステップ６０６において、発話先がグループである確率が０．５より大きかった場合は、ステップ６１０に進み、発話先はグループであるという情報を保存し、処理を終了する。 In step 601, features of the current speaker are calculated based on the spoken text and unique features. In step 605, by inputting the speaker's features into the neural network, the probability of whether the speaker is speaking to a group or to an individual is calculated. In step 606, if the probability that the addressee is a group is greater than 0.5, proceed to step 610, save the information that the addressee is a group, and terminate the process.

If the utterance destination is not the group, the process proceeds to step 611 and enters a loop. Here, the index of the listener under consideration is denoted by i. In step 604, the feature amount of the i-th listener is calculated using, for example, the text that listener i uttered in the past and that listener's unique feature amount.

In step 607, the utterance destination determination unit 411 inputs the feature amount of the speaker and the feature amount of listener i to the neural network trained for the second task, and outputs the probability that the speaker addressed listener i. This value normally lies in the range 0 to 1.

Steps 612 and 613 repeat the loop, so that the probability that the speaker addressed the listener is calculated for every listener. Finally, in step 608, the listener for whom the probability output in step 607 is highest is identified, that listener is saved as the utterance destination, and the process ends.
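The flow of steps 601 to 613 can be sketched as follows. This is a minimal illustration, not the implementation in the specification: the two neural networks are replaced by caller-supplied functions (`group_prob_fn` and `listener_prob_fn`, both hypothetical names), and the `GROUP` label and the listener ID strings are assumptions made for the example.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]
GROUP = "group"  # hypothetical label meaning "addressed to all listeners"


def determine_destination(
    speaker_features: Features,
    listener_features: Dict[str, Features],
    group_prob_fn: Callable[[Features], float],
    listener_prob_fn: Callable[[Features, Features], float],
    threshold: float = 0.5,
) -> str:
    """Return GROUP or the ID of the most probable listener."""
    if not listener_features:
        raise ValueError("at least one listener is required")

    # Steps 605/606: first task, group versus individual.
    if group_prob_fn(speaker_features) > threshold:
        return GROUP  # step 610

    # Steps 611 to 613: loop over listeners; steps 604/607 score each one.
    probabilities = {
        listener_id: listener_prob_fn(speaker_features, features)
        for listener_id, features in listener_features.items()
    }

    # Step 608: the listener with the highest probability is the destination.
    return max(probabilities, key=probabilities.get)
```

Because the comparison in step 606 is strict, a group probability of exactly 0.5 falls through to the per-listener loop, matching the "0.5 or less" branch of the flow.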

In the multi-task neural network, the utterance destination determination unit 411 starts with the "group" or "individual" determination as the first task (605). The neural network estimates, from the vectorized input feature amount of the speaker, whether the utterance destination is the group or an individual. For example, if the probability output by the neural network (the group probability) is higher than 0.5 (606: YES), the utterance destination is determined to be the group.

If the group probability is 0.5 or less (606: NO), the utterance destination is determined to be an individual. The utterance destination determination unit 411 performs additional estimation with the neural network to obtain a probability score for each listener (607, 608). FIG. 11 shows the score calculation for only two listeners as an example.

The vectorized input feature amount of the speaker that was obtained for the group/individual determination is reused here. That is, it is shared between the group/individual determination task and the task of calculating each listener's utterance destination probability.

The utterance destination determination unit 411 can calculate the probability based on the semantic relationship between the vectorized feature amount of the speaker and the feature amount of the listener. The semantic relationship can be calculated using, for example, the Manhattan distance, and is output as a value between 0 and 1.
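One common way to turn a Manhattan distance d into a value between 0 and 1 is exp(-d), which equals 1 for identical vectors and approaches 0 as they diverge. The sketch below uses that mapping as an assumption; the specification does not state which mapping is used, and the function names are hypothetical.

```python
import math
from typing import Sequence


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def semantic_relationship(speaker_vec: Sequence[float],
                          listener_vec: Sequence[float]) -> float:
    """Map the L1 distance into (0, 1]; exp(-d) is an assumed mapping."""
    return math.exp(-manhattan_distance(speaker_vec, listener_vec))
```

With this mapping, a listener whose vectorized feature amount is closer to the speaker's receives a higher score, so the scores can be compared directly across listeners.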

The utterance destination determination unit 411 uses the "shared information" of the two tasks. Unlike a single-task network, a multi-task neural network can simultaneously optimize both the first task (determining whether the utterance is directed to the group or to an individual) and one or more second tasks that may be related to the first task (calculating the probability score of each listener).

A concrete example illustrates how the two tasks are "shared." In one embodiment of this specification, a word embedding layer that converts words into word vectors is used in several places (1340 in FIG. 13, 1510 in FIG. 15). Although these layers are used in different tasks, they can all be trained as a word embedding layer with common parameters. This is the first example of sharing.
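The idea of one embedding layer with common parameters serving two tasks can be illustrated with a toy example. Everything here is an assumption made for illustration: the class names, the random initialization, and the mean pooling of word vectors are not taken from the specification, whose actual encoders are shown in FIGS. 13 and 15.

```python
import random
from typing import Dict, List


class WordEmbedding:
    """Toy embedding table: one vector per vocabulary word."""

    def __init__(self, vocab: List[str], dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        self.table: Dict[str, List[float]] = {
            word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab
        }

    def embed(self, words: List[str]) -> List[float]:
        """Mean of the known word vectors (an assumed pooling choice)."""
        vectors = [self.table[w] for w in words if w in self.table]
        if not vectors:
            return [0.0] * self.dim
        return [sum(column) / len(vectors) for column in zip(*vectors)]


class SpeakerEncoder:
    """Encodes the current utterance for the group/individual task."""

    def __init__(self, embedding: WordEmbedding):
        self.embedding = embedding

    def encode(self, utterance: List[str]) -> List[float]:
        return self.embedding.embed(utterance)


class ListenerEncoder:
    """Encodes a listener's past utterances for the per-listener task."""

    def __init__(self, embedding: WordEmbedding):
        self.embedding = embedding

    def encode(self, past_utterances: List[List[str]]) -> List[float]:
        words = [w for utterance in past_utterances for w in utterance]
        return self.embedding.embed(words)


# Both encoders hold the same object, so an update to the embedding
# parameters made while training one task is visible to the other.
shared = WordEmbedding(["hello", "agent", "lunch"], dim=4)
speaker_encoder = SpeakerEncoder(shared)
listener_encoder = ListenerEncoder(shared)
```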

The second example of sharing concerns how the loss function used in training the neural network is defined. In one embodiment of this specification, the loss of the neural network is calculated by the following equation.
Llot = λ1 * LG + λ2 * LID
LG is the loss function of the group/individual determination task, LID is the loss function of the task of calculating each listener's probability score, and Llot is the overall loss function. That is, the neural network is trained while taking the properties of both the group determination and the individual determination into account at the same time. λ1 and λ2 are weights chosen to control the influence of each loss function on the neural network.
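The weighted sum of the two task losses can be sketched numerically as below. The specification does not state the form of LG or LID; binary cross-entropy is assumed here for both, LID is averaged over listeners, and all function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Cross-entropy of predicted probability p against a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def total_loss(
    group_prob: float,
    group_label: int,
    listener_probs: Sequence[float],
    listener_labels: Sequence[int],
    lambda1: float = 1.0,
    lambda2: float = 1.0,
) -> float:
    """Weighted sum lambda1 * LG + lambda2 * LID of the two task losses."""
    if len(listener_probs) != len(listener_labels):
        raise ValueError("one label is required per listener probability")
    loss_group = binary_cross_entropy(group_prob, group_label)
    if listener_probs:
        loss_listener = sum(
            binary_cross_entropy(p, y)
            for p, y in zip(listener_probs, listener_labels)
        ) / len(listener_probs)
    else:
        loss_listener = 0.0  # no per-listener term for this sample
    return lambda1 * loss_group + lambda2 * loss_listener
```

Setting `lambda2` to 0 reduces training to the group/individual task alone, which shows how the two weights trade the tasks off against each other.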

The utterance destination determination unit 411 compares the "probability of being the utterance destination" of all listeners and regards the candidate with the highest probability as the true utterance destination. It then determines the participant ID of that listener to be the utterance destination ID.

Examples of conversations managed by the conversation management system 101 are described below. FIGS. 12A to 12H show an example of a series of utterances that make up a conversation. The conversation management system 101 executes the agent 413 and controls the agent 413 according to the determined utterance destination. In the example of FIGS. 12A to 12H, three users 105A to 105C participate in the conversation together with the agent 413.

Referring to FIG. 12A, the user 105A makes an utterance 651. The utterance destination determination unit 411 determines that the utterance destination is "group" and returns the result to the agent 413. This means that anyone, including the agent 413, can respond. The agent 413 may wait for a user's utterance for a predetermined length of time before starting to respond.

Next, referring to FIG. 12B, the user 105B utters a response 652 to the immediately preceding utterance 651 of the user 105A. The utterance destination determination unit 411 determines that the user 105A is the utterance destination and returns the determination result to the agent 413.

Next, referring to FIG. 12C, the user 105A utters a response 653 to the immediately preceding utterance 652 of the user 105B. The utterance destination determination unit 411 determines that the utterance destination is "group" and returns the result to the agent 413.

Next, referring to FIG. 12D, the user 105B utters a response 654 to the immediately preceding utterance 653 of the user 105A. The utterance destination determination unit 411 determines that the utterance destination is "agent" and returns the result to the agent 413. The agent 413 generates a response based on the content of the utterance and its own knowledge.

Next, referring to FIG. 12E, the agent 413 utters a response 655 to the immediately preceding utterance 654 of the user 105B. The utterance destination determination unit 411 determines that the utterance destination is "group" and returns the result to the agent 413.

Next, referring to FIG. 12F, the user 105C utters a response 656 to the immediately preceding utterance 655 of the agent 413. The utterance destination determination unit 411 determines that the utterance destination is "group" and returns the result to the agent 413.

Next, referring to FIG. 12G, the user 105A utters a response 657 to the immediately preceding utterance 656 of the user 105C. The utterance destination determination unit 411 determines that the utterance destination is "group" and returns the result to the agent 413.

Next, referring to FIG. 12H, the agent 413 utters a response 658 to the immediately preceding utterance 657 of the user 105A. Because the destination of the immediately preceding utterance is the group, the agent 413 is able to respond. The utterance destination determination unit 411 determines that the utterance destination is "group" and returns the result to the agent 413.

As described above, estimating the utterance destination in a multi-user conversation and controlling the agent based on the estimation result makes it possible to support the progress of the multi-user conversation more appropriately. The utterance destination determination result of the present disclosure can also be applied to uses other than agent control. For example, the utterance destination may be included in the conversation history, or information indicating it may be transmitted to the user terminal 104 of the utterance destination.
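The agent control seen in FIGS. 12A to 12H can be summarized as a small decision rule: respond when addressed directly, optionally wait for a user before responding when the group is addressed, and otherwise stay silent. The sketch below is one possible reading of that behavior; the action names, the `GROUP` label, and the agent ID string are hypothetical.

```python
GROUP = "group"         # hypothetical label for "addressed to all listeners"
AGENT_ID = "agent_413"  # hypothetical participant ID of the agent


def agent_action(destination: str, agent_id: str = AGENT_ID) -> str:
    """Choose the agent's behavior from the determined utterance destination."""
    if destination == agent_id:
        # The agent itself was addressed (as in FIG. 12D): respond.
        return "respond"
    if destination == GROUP:
        # Anyone may respond; the agent may first wait a predetermined
        # time to give a user the chance to speak (as in FIG. 12A).
        return "wait_then_respond"
    # Another participant was addressed (as in FIG. 12B).
    return "stay_silent"
```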

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above have been described in detail to explain the present invention in an easy-to-understand manner, and the invention is not necessarily limited to embodiments having all of the described configurations. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can be added, deleted, or substituted.

Each of the configurations, functions, processing units, and the like described above may be realized partly or wholly in hardware, for example by designing them as an integrated circuit. Each of the configurations, functions, and the like may also be realized in software by a processor interpreting and executing programs that realize the respective functions. Information such as programs, tables, and files that realize each function can be stored in a memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card or SD card.

The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. In practice, almost all configurations may be considered to be interconnected.

101 Conversation management system
104 User terminal
105 User
401 Voice recognition unit
403 Participant identification unit
405 Voice feature amount extraction unit
407 Video feature amount extraction unit
411 Utterance destination determination unit
413 Agent
421 Participant-specific feature amount management information
423 Utterance history management information

Claims

A conversation management system comprising:
one or more storage devices; and
one or more computing devices, wherein
the one or more storage devices store past utterance history information of each of a plurality of participants in a conversation, and
the one or more computing devices:
acquire a text feature amount of a first current utterance of a speaker;
acquire, from the past utterance history information, a text feature amount of a past utterance of each of a plurality of listeners;
calculate an utterance destination probability of each listener based on the text feature amount of the first current utterance and the text feature amount of the past utterance of each listener; and
determine an utterance destination of the first current utterance based on the utterance destination probabilities of the plurality of listeners.

The conversation management system according to claim 1, wherein
the one or more storage devices store a feature amount associated with each of the plurality of participants, and
the one or more computing devices:
acquire, from the one or more storage devices, the feature amount associated with each of the plurality of listeners; and
calculate the utterance destination probability of each listener based on the feature amount associated with that listener.

The conversation management system according to claim 1, wherein
the one or more computing devices calculate the utterance destination probability of each listener based on text feature amounts of a plurality of past utterances of that listener.

The conversation management system according to claim 1, wherein
the one or more computing devices determine, based on the text feature amount of the first current utterance, whether the utterance destination is all of the listeners.

The conversation management system according to claim 1, wherein
the one or more computing devices:
calculate, based on a text feature amount of a second current utterance, a group probability indicating the probability that the utterance destination of the second current utterance is all of the listeners; and
determine, if the group probability exceeds a threshold, that the utterance destination of the second current utterance is all of the listeners.

The conversation management system according to claim 5, wherein
the one or more computing devices
execute, with a neural network, a first task and a second task that follows the first task,
a loss function of the neural network includes a loss in the first task and a loss in the second task,
the first task calculates the group probability based on the text feature amount of the second current utterance, and
the second task calculates the utterance destination probability of each listener based on the text feature amount of the second current utterance used in the first task and the text feature amount of the past utterance of each listener.

The conversation management system according to claim 1, wherein
the one or more computing devices:
extract a voice feature amount of the first current utterance; and
calculate the utterance destination probability of each listener based on the voice feature amount.

The conversation management system according to claim 1, wherein
the one or more computing devices:
execute a program of an agent that is a participant in the conversation; and
control the agent based on the determined utterance destination.

The conversation management system according to claim 4, wherein
the one or more computing devices:
execute a program of an agent that is a participant in the conversation; and
permit the agent to respond to a current utterance when the utterance destination is determined to be all of the plurality of listeners or the agent.

The conversation management system according to claim 1, wherein
the one or more computing devices
determine the speaker based on an utterance position identified from the audio of the first current utterance.

A method by which a conversation management system manages a conversation in which a plurality of participants are present, wherein
the conversation management system stores past utterance history information of each of the plurality of participants in the conversation, and
the method comprises:
acquiring, by the conversation management system, a text feature amount of a first current utterance of a speaker;
acquiring, by the conversation management system, a text feature amount of a past utterance of each of a plurality of listeners from the past utterance history information;
calculating, by the conversation management system, an utterance destination probability of each listener based on the text feature amount of the first current utterance and the text feature amount of the past utterance of each listener; and
determining, by the conversation management system, an utterance destination of the first current utterance based on the utterance destination probabilities of the plurality of listeners.