JP2021124530A

JP2021124530A - Information processor, information processing method and program

Info

Publication number: JP2021124530A
Application number: JP2020015544A
Authority: JP
Inventors: 龍太若山; Ryuta Wakayama; ウェイ・チー; Qi Wei
Original assignee: Hmcomm; Hmcomm Inc
Current assignee: Hmcomm; Hmcomm Inc
Priority date: 2020-01-31
Filing date: 2020-01-31
Publication date: 2021-08-30

Abstract

PROBLEM TO BE SOLVED: To make both the content of an utterance and the speaker's emotion at that time understandable when voice uttered by a speaker is voice-recognized. SOLUTION: A text data acquisition unit 33 of a computer 3 performs voice recognition on voice data of a speaker's utterance and acquires text data indicating the content of the utterance based on the result of the voice recognition. An acoustic data acquisition unit 34 acquires the voice data of the speaker's utterance as acoustic data. An emotion recognition unit 36 recognizes the speaker's emotion based on the acoustic data and the text data, and acquires emotion information indicating the recognition result. SELECTED DRAWING: Figure 3

Description

The present invention relates to an information processing device, an information processing method, and a program.

In recent years, in telephone reception work at customer centers and the like, voice recognition technology has been used to separate the voice data of a call between a customer and a telephone operator by speaker, recognize each part, and store the result as character-format data with a compact amount of information, for example text data. The stored text data is used later, for example, as a response history.

However, when a customer's response history is checked using the stored text data, it is often difficult to tell from the text data alone what emotion the customer was speaking with during the call. For example, if the customer was angry but used polite language, the anger does not appear in the text data.

Japanese Unexamined Patent Publication No. 2019-153961

Thus, when voice uttered by a person is voice-recognized and converted into text data, the content of the utterance itself can be confirmed from the text data, but the emotion of the person who spoke may not be conveyed.

The present invention has been made in view of this situation, and an object of the present invention is to make both the content of an utterance and the emotion at that time understandable when voice uttered by a person is voice-recognized.

In order to achieve the above object, an information processing device according to one aspect of the present invention comprises:
a text acquisition means that performs voice recognition on voice data of a speaker's utterance and acquires text data indicating the content of the utterance based on the result of the voice recognition;
an acoustic acquisition means that acquires the voice data of the utterance of the speaker as acoustic data; and
an emotion recognition means that recognizes the emotion of the speaker based on the acoustic data and the text data and obtains emotion information indicating the recognition result.

An information processing method and a program corresponding to the above information processing device according to one aspect of the present invention are also provided as an information processing method and a program according to one aspect of the present invention.

According to the present invention, when voice uttered by a speaker is voice-recognized, both the content of the utterance and the emotion of the speaker at that time can be understood.

FIG. 1 is a diagram showing an example of the configuration of an information processing system including a computer according to an embodiment of the information processing device of the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of the computer according to the information processing device of the present invention in the information processing system of FIG. 1.
FIG. 3 is a functional block diagram showing an example of the functional configuration of the computer of FIG. 2 in the information processing system of FIG. 1 of the embodiment.
FIG. 4 is a diagram showing an example of the features used to construct the emotion recognition model.
FIG. 5 is a diagram showing an example of the emotion recognition means.
FIG. 6 is a diagram showing the detailed configuration (an example of the hidden layer 52a) of the emotion recognition model of FIG. 5.
FIG. 7 is a diagram showing the detailed configuration (an example of the hidden layer 52b) of the emotion recognition model of FIG. 5.
FIG. 8 is a flowchart showing the operation of the computer of FIG. 1 and FIG. 3.
FIG. 9 is a diagram for explaining a method of evaluating emotion recognition accuracy.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing the configuration of an information processing system including a computer according to an embodiment of the information processing device of the present invention.

The information processing system shown in FIG. 1 is configured by connecting a telephone 1, a headset 2, a computer 3, and a learning device 4 to each other via a network N. This information processing system can provide an emotion recognition service based on voice uttered by a speaker.

The telephone 1 is a call device that has a call function including a microphone and a speaker. A user U making an inquiry uses it to call an inquiry center and to communicate by voice, that is, to hold a call, with an operator OP wearing the headset 2, which is connected via the exchange of the inquiry center. The telephone 1 receives the voice (analog signal) uttered by the user U.

The headset 2 has a call function including a microphone and a speaker as well as a recording function, and is a call device with which the operator OP holds a call with the user U from whom the inquiry was received. The headset 2 receives the voice data (analog signal) uttered by the operator OP, records the respective voice data, and sends it to the computer 3 through the network N.

The computer 3 functions as a voice recognition device that performs voice recognition on the voice of at least one of the user U and the operator OP received from the headset 2 and outputs emotion information for the text data of the voice recognition result. As the emotion information, an emotion label is output that classifies the emotion into categories including at least emotion types such as "joy", "anger", "sorrow", and "normal". The functional configuration and processing of the computer 3 are described in detail later with reference to FIG. 3 and subsequent drawings.

The learning device 4 performs machine learning on a plurality of voice data items prepared in advance and classified by emotion type, and builds a model from them. By machine learning one or more (many) voice data items for training, the learning device 4 generates a learning model and stores it in the storage unit 18 of the computer 3.

A neural network, for example, can be applied as the learning model. The neural network is only an example, and other machine learning methods may be applied. Furthermore, the learning model is not limited to a machine learning model, and a determiner that makes a determination by a predetermined algorithm may be adopted.

FIG. 2 is a block diagram showing an example of the hardware configuration of the computer according to the first embodiment of the information processing device of the present invention in the information processing system of FIG. 1.

The computer 3 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a bus 14, an input/output interface 15, an output unit 16, an input unit 17, a storage unit 18, a communication unit 19, and a drive 20.

The CPU 11 executes various processes according to a program recorded in the ROM 12 or a program loaded from the storage unit 18 into the RAM 13.
The RAM 13 also stores, as appropriate, data and the like that the CPU 11 needs in order to execute the various processes.

The CPU 11, the ROM 12, and the RAM 13 are connected to each other via the bus 14. The input/output interface 15 is also connected to the bus 14. The output unit 16, the input unit 17, the storage unit 18, the communication unit 19, and the drive 20 are connected to the input/output interface 15.
The output unit 16 is composed of a display, a speaker, a printer, and the like, and outputs output information such as voice data and text data. If the output unit 16 is, for example, a printer, the output information can also be printed.
The input unit 17 is composed of a keyboard, a mouse, and the like, and inputs various information in response to the user's instruction operations. The storage unit 18 is composed of a hard disk device or the like and stores data of various kinds of information.

The communication unit 19 communicates with other devices (for example, the headset 2 and the learning device 4 of FIG. 1) via the network N.
A removable medium 21 composed of a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 20 as appropriate. A program read from the removable medium 21 by the drive 20 is installed in the storage unit 18 as needed. The removable medium 21 can also store the various data stored in the storage unit 18, in the same manner as the storage unit 18.

Although not shown, the learning device 4 of the information processing system of FIG. 1 has basically the same hardware configuration as the computer 3 shown in FIG. 2. A description of the hardware configuration of the learning device 4 is therefore omitted.
For convenience of explanation, the computer 3 is provided separately from the learning device 4, but the present invention is not limited to this, and the functions of the learning device 4 and the computer 3 may be integrated into a single information processing device.

FIG. 3 is a functional block diagram showing an example of the functional configuration of the computer 3 of FIG. 2 in the information processing system of FIG. 1 of the embodiment.

As shown in FIG. 3, the CPU 11 of the computer 3 functions as a conversion unit 31, a voice recognition model 32, a text data acquisition unit 33, an acoustic data acquisition unit 34, an emotion recognition model 35, an emotion recognition unit 36, and the like. The voice recognition model 32, the emotion recognition model 35, and the like are stored in the storage unit 18 of the computer 3.

The conversion unit 31 converts the speaker's uttered voice input to the computer 3, that is, analog voice data, into digital voice data. When the voice received from the headset 2 is voice data in which the voices of the operator OP and the user U are mixed within a call, the conversion unit 31 separates the voice data uttered by each speaker and outputs either the one that is the target of emotion recognition or each of them separately.
By setting the conversion unit 31 in advance to output only one voice (for example, only the voice of the user U), only the voice data of the user U can be output to the subsequent stage.
In FIG. 3, the two voice data items output from the conversion unit 31 are voice data of the same speaker input in the same time series.

The voice recognition model 32 is a trained model that the learning device 4 has built by learning voice data in word units in advance, so that text data corresponding to input voice data is output as the recognition result. In other words, the voice recognition model 32 may be any model that has been trained and generated so as to output, when voice data is input, the text data corresponding to that voice data.

The text data acquisition unit 33 performs voice recognition on the voice data of the speaker's utterance and acquires text data indicating the content of the utterance based on the result of the voice recognition. Specifically, the text data acquisition unit 33 inputs the voice data of the speaker's utterance into the voice recognition model 32 in the storage unit 18 and concatenates the text data, consisting of words and phrases, that the voice recognition model 32 outputs as the recognition result, thereby acquiring text data indicating the content of the utterance.

The acoustic data acquisition unit 34 acquires the voice data of the speaker's utterance as acoustic data. Specifically, it removes noise and the like that are unnecessary for voice recognition from the received voice data of the speaker's utterance and acquires the result as acoustic data.

The emotion recognition model 35 is an emotion-trained model built by learning a plurality of voice data items that were classified in advance by emotion type and given correct emotion labels (hereinafter referred to as "learning data"). FIG. 4 shows the features used to construct the emotion recognition model 35. The emotion recognition model 35 was trained using the mean and standard deviation of the features shown in FIG. 4.

The emotion recognition unit 36 recognizes the emotion of the speaker based on the input acoustic data and text data, and obtains emotion information indicating the recognition result.

More specifically, the emotion recognition unit 36 uses the emotion recognition model 35 to recognize the emotion of the speaker based on the acoustic data acquired by the acoustic data acquisition unit 34 and the text data acquired by the text data acquisition unit 33, and outputs emotion information indicating the recognition result.

Here, the emotion recognition unit 36 decomposes the text data into morphemes by morphological analysis, converts the decomposed morphemes into hiragana based on preset word determination conditions, and converts the hiragana into a feature vector, which is the second feature value, using a text-trained model that has learned the relationship between hiragana and feature vectors in advance.

The emotion recognition unit 36 also extracts a feature vector, which is the first feature value, from the acoustic data. The emotion recognition unit 36 then recognizes the emotion of the speaker based on the feature vector extracted from the text data and the feature vector extracted from the acoustic data.

The emotion recognition unit 36 stores the emotion information in the storage unit 18 in association with the text data of the voice recognition result. The emotion information may be attached to the text data as a whole, to each sentence break (phrase), or to each word or fixed-length character string.

As the emotion information, the emotion recognition unit 36 classifies the emotion of the speaker into four types, for example "joy" indicating a person's joy, "anger" indicating anger, "sorrow" indicating sadness, and "normal" indicating a calm state, and outputs an emotion label classified so as to include at least these four emotion types. In addition to the emotion label, a probability value indicating the confidence of the classified emotion label, for example, may be output in association with the emotion label as part of the emotion information.

Here, the emotion recognition means composed of the emotion recognition model and the emotion recognition unit will be described with reference to FIGS. 5 to 7. FIG. 5 is a diagram showing the emotion recognition means, FIG. 6 is a diagram showing an example of the hidden layer 52a of the emotion recognition model of FIG. 5, and FIG. 7 is a diagram showing an example of the hidden layer 52b of the emotion recognition model of FIG. 5.

The emotion recognition means is composed of the emotion recognition model 35 and the emotion recognition unit 36 shown in FIG. 3. The emotion recognition model 35 has a hidden layer 52a, a hidden layer 52b, a weighting means 53 (composed of a data coupling circuit, a dimension adjustment circuit, and the like), and an output layer 54 to which the Softmax function is applied. The data coupling circuit concatenates, in order, the data resulting from the calculations of the hidden layer 52a and the hidden layer 52b. The dimension adjustment circuit applies the Softmax function to the data concatenated by the data coupling circuit to generate 8-dimensional vector data.
As shown in FIG. 5, the emotion recognition unit 36 includes a feature vector extraction unit 51a that processes acoustic data and a feature vector extraction unit 51b that processes text data.
The feature vector extraction unit 51a extracts a feature vector, which is the first feature value, from the input acoustic data.
At this time, the feature vector extraction unit 51a divides the input acoustic data into small voice segments, which are the units on which voice activity detection (VAD) is performed, for example voice 1, voice 2, ..., voice N, and extracts feature values for each voice segment using Mel Frequency Cepstrum Coefficients (MFCC) and the like, resulting in a 21-dimensional feature vector.
The feature vector extraction unit 51a further applies 15 types of functions (mean, maximum, minimum, and the like) to each dimension of the 21-dimensional feature vector and outputs the result to the hidden layer 52a of the emotion recognition model 35. Each function value is taken over the outputs of voice 1 to voice N, for example as their mean, maximum, or minimum. As a result, a 21 × 15 dimensional feature vector is obtained.
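As a non-limiting illustration of the functional step described above, the following Python sketch collapses per-segment feature vectors into a single utterance-level vector. The function name `utterance_vector` and the set `FUNCTIONALS` are hypothetical; only four functionals are shown because the description names "mean, maximum, minimum, and the like" without listing all 15, and the per-segment extraction (MFCC and so on) is assumed to have been done elsewhere.

```python
import statistics

# Hypothetical subset of the statistical functionals; the description does
# not enumerate all 15, so these four are illustrative assumptions.
FUNCTIONALS = {
    "mean": statistics.fmean,
    "max": max,
    "min": min,
    "std": statistics.pstdev,
}


def utterance_vector(segment_features):
    """Collapse per-segment feature vectors into one utterance-level vector.

    segment_features: list of N equal-length lists, one per voice segment.
    Returns a flat list of length (feature dimensions) x (number of functionals).
    """
    if not segment_features:
        raise ValueError("at least one segment is required")
    dims = len(segment_features[0])
    vector = []
    for d in range(dims):
        # Values of dimension d across voice 1 ... voice N.
        column = [segment[d] for segment in segment_features]
        for fn in FUNCTIONALS.values():
            vector.append(float(fn(column)))
    return vector
```

With 21 feature dimensions and 15 functionals, the same loop would yield a vector of 21 × 15 components.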

The feature vector extraction unit 51b extracts a feature vector, which is the second feature value, from the input text data.
More specifically, the feature vector extraction unit 51b decomposes the text data into morphemes by morphological analysis, refers to a Japanese evaluation polarity dictionary or a Japanese word emotion polarity correspondence table for the decomposed morphemes, converts the polarity of the morphemes, and decomposes the morphemes.
The feature vector extraction unit 51b then converts the decomposed morphemes into hiragana (or leaves them as kanji) based on preset word determination conditions, and converts the resulting hiragana (character data) into a feature vector using a model that has been text-trained in advance.
Further, the feature vector extraction unit 51b adds the polarity of the morpheme to the converted feature vector to generate the second feature value, and inputs the generated second feature value to the hidden layer 52b of the emotion recognition model 35.
The text-trained model is obtained by having the learning device 4 learn, in advance, correct-answer data that associates hiragana with feature vectors.
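The generation of the second feature value described above may be pictured with the following Python sketch. It is a sketch under stated assumptions only: the input is assumed to be already split into morphemes and converted to hiragana, `POLARITY` is a hypothetical two-entry stand-in for the polarity dictionary, and `embed` is a toy deterministic function standing in for the text-trained model, whose actual form is not given in the description.

```python
# Hypothetical polarity lexicon standing in for the polarity dictionary
# (+1.0 positive, -1.0 negative, 0.0 when the morpheme is not listed).
POLARITY = {"ありがとう": 1.0, "こまる": -1.0}


def embed(reading, dim=4):
    """Toy stand-in for the text-trained model: a deterministic pseudo-embedding."""
    total = sum(ord(ch) for ch in reading)
    return [((total * (i + 1)) % 97) / 97.0 for i in range(dim)]


def second_feature_values(morphemes, dim=4):
    """Return one vector per morpheme: embedding followed by its polarity."""
    features = []
    for morpheme in morphemes:
        vector = embed(morpheme, dim)
        # The polarity is appended as the last component of the vector.
        vector.append(POLARITY.get(morpheme, 0.0))
        features.append(vector)
    return features
```

Appending the polarity as one extra component is only one way of "adding" it to the feature vector; the description does not state how the two are combined.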

As shown in FIG. 5, the hidden layer 52a of the emotion recognition model 35 is composed of four LSTM (Long Short-Term Memory) layers. The numbers of nodes in the layers are 256-512-512-256, the number of epochs is 5000, the batch size is 500, and the dropout is 0.7. The input is, for example, 252 features (a feature vector). That is, a 252-dimensional feature vector extracted from the acoustic data is input to the hidden layer 52a.
The hidden layer 52a performs the four-layer LSTM calculation on the input 252-dimensional feature vector and inputs the result to the data coupling circuit of the weighting means 53.
The hidden layer 52b of the emotion recognition model 35 is composed of two BiLSTM layers. The numbers of nodes in the layers are 1024-256, the number of epochs is 5000, and the maximum patch is 250. The input is, for example, 770 features. That is, 770 features (a feature vector) extracted from the text data are input to the hidden layer 52b.
The hidden layer 52b performs a bidirectional RNN calculation on the input 770 features and inputs the result to the data coupling circuit of the weighting means 53.
The weighting means 53 weights the outputs of the hidden layer 52a and the hidden layer 52b and outputs the result to the output layer 54 to which the Softmax function is applied.

The output layer 54 inputs the feature vector weighted by the weighting means 53 into the Softmax function and, based on the values obtained from the Softmax function, outputs an emotion label Y classified into one of four emotions: "joy" indicating a person's joy, "anger" indicating anger, "sorrow" indicating sadness, and "normal" indicating a calm state. The values obtained from the Softmax function each lie in the range from 0 to 1, and the components sum to 1. A probability value indicating its confidence may be attached to the emotion label Y and output together with or separately from the emotion label Y.
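The behavior of the Softmax function in the output layer 54 can be illustrated by the following Python sketch. Selecting the label with the highest probability is an assumption made here for illustration, since the description says only that the label is based on the Softmax values; the names `LABELS`, `softmax`, and `emotion_label` are likewise hypothetical.

```python
import math

LABELS = ["joy", "anger", "sorrow", "normal"]


def softmax(scores):
    """Numerically stable softmax: each output lies in [0, 1] and they sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_label(scores):
    """Return (label, probability) for the highest-probability class (assumed rule)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```

For example, `emotion_label([0.1, 2.0, 0.3, 0.5])` selects "anger" together with its probability, which corresponds to the emotion label Y and its optional probability value.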

As shown in FIG. 6, the hidden layer 52a is configured to process the frame-based 21-dimensional feature vectors X1 to X30 in order, so that an emotion label Y classified into one of the above four emotions is output from the output layer 54. Among the marks shown in FIG. 6, squares indicate calculations, white circles indicate input gates, and circles containing a broken line indicate the tanh (hyperbolic tangent) function.
As shown in FIG. 7, the hidden layer 52b has a bidirectional RNN, in which a plurality of recurrent neural networks (hereinafter referred to as "RNNs") are interconnected, and an integration unit Σ. The hidden layer 52b is configured so that the bidirectional RNN learns information from the future in addition to information from the past, and the respective learning results are integrated by the integration unit Σ.
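The idea of combining past and future context can be sketched in Python as follows. This is a deliberately simplified illustration, not the configuration of FIG. 7: it uses a scalar-state tanh RNN rather than LSTM cells, shares the hypothetical weights `w_in` and `w_rec` between the two directions, and assumes that the integration unit Σ sums the two results.

```python
import math


def rnn_pass(inputs, w_in, w_rec):
    """Run a scalar-state tanh RNN over a sequence and return every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(inputs, w_in=0.5, w_rec=0.3):
    """Combine a forward pass (past context) with a backward pass (future context)."""
    forward = rnn_pass(inputs, w_in, w_rec)
    # Process the reversed sequence, then realign the states to input order.
    backward = rnn_pass(inputs[::-1], w_in, w_rec)[::-1]
    # Integration by summation (an assumption about the role of the unit Σ).
    return [f + b for f, b in zip(forward, backward)]
```

Each output position therefore depends on both the elements that precede it and the elements that follow it in the sequence.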

Next, the voice recognition process executed by the computer 3 will be described with reference to FIG. 8. FIG. 8 is a flowchart illustrating an example of the flow of the emotion recognition process executed by the computer 3 having the functional configuration of FIG. 3.
In this information processing system, the computer 3 executes the emotion recognition process, which recognizes emotions from the voices of speakers such as the operator OP and the user U, as follows.

When the speaker's voice data recorded by the headset 2 is received by the computer 3 through the network N, the text data acquisition unit 33 of the computer 3, in step S11, performs voice recognition on the voice data of the speaker's utterance using the voice recognition model 32 and acquires text data indicating the content of the utterance based on the result of the voice recognition.

In step S12, the acoustic data acquisition unit 34 acquires the voice data of the speaker's utterance as acoustic data.

Then, in step S13, the emotion recognition unit 36 uses the emotion recognition model 35 to recognize the emotion of the speaker based on the acoustic data acquired by the acoustic data acquisition unit 34 and the text data acquired by the text data acquisition unit 33, and obtains emotion information indicating the recognition result.
As the emotion information, an emotion label classifying the speaker's emotion as, for example, one of "joy", "anger", "sorrow", and "normal" is obtained together with a probability value indicating its confidence, and the emotion recognition unit 36 stores the obtained emotion information in the storage unit 18 in association with the text data of the voice recognition result.
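The flow of steps S11 to S13 can be summarized by the following Python sketch. The names `EmotionRecord` and `process_utterance` are hypothetical, and `recognize`, `denoise`, and `classify` are caller-supplied stand-ins for the voice recognition model 32, the acoustic data acquisition unit 34, and the emotion recognition model 35; a plain list plays the role of the storage unit 18.

```python
from dataclasses import dataclass


@dataclass
class EmotionRecord:
    text: str           # voice recognition result (step S11)
    label: str          # emotion label (step S13)
    probability: float  # confidence of the label


def process_utterance(audio, recognize, denoise, classify, store):
    """Run steps S11 to S13 on one utterance and store the paired result."""
    text = recognize(audio)                         # S11: text data
    acoustic = denoise(audio)                       # S12: acoustic data
    label, probability = classify(acoustic, text)   # S13: emotion information
    record = EmotionRecord(text, label, probability)
    store.append(record)  # emotion information kept together with its text
    return record
```

Keeping the label and the text in one record is what later allows the two to be read out and displayed as a pair.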

As needed, the pair of emotion information and text data stored in the storage unit 18 is read out and the two are displayed side by side on the screen of the computer 3. Because an emotion label is attached to the text data at the timing at which the speaker spoke, it becomes possible to tell, for example, that the speaker was angry when making an inquiry in the words of this text.

Here, a method of evaluating emotion recognition accuracy will be described with reference to FIG. 9. FIG. 9 is a diagram for explaining the method of evaluating emotion recognition accuracy.

As shown in FIG. 9, 10-fold cross-validation (10-fold CV) is used to evaluate the emotion recognition accuracy.

Cross-validation is a technique for verifying how well a data analysis actually generalizes to the underlying population. In statistics, the sample data is divided, one part is analyzed first, and the remaining part is used to test that analysis, thereby verifying the validity of the analysis itself.

Particularly when it is difficult to collect more samples, results estimated from the data need to be carefully corroborated, for example by cross-validation.
In k-fold cross-validation, for example, the sample set is divided into k subsets. One of them is generally used as the test set and the remaining k-1 as the training set. The verification is performed k times, with each of the k subsets serving as the test set once, and the k results obtained are averaged to give a single estimate. Here, k = 10 is used, so the verification proceeds by 10-fold cross-validation.
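The k-fold procedure described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the embodiment: the function names `k_fold_indices` and `cross_validate` and the `evaluate` callback are hypothetical, and the sketch splits the samples in their given order, whereas a practical evaluation would usually shuffle them first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate):
    """Run evaluate(train, test) on every fold and average the k scores."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

With k = 10, `evaluate` would train on nine tenths of the samples and score the remaining tenth, and `cross_validate` would return the mean of the ten scores.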

For the verification, the correlation table 81 shown in FIG. 9 is used to obtain the precision P, the recall R, and the F value (F-measure), from which the emotion recognition accuracy can be evaluated. The correlation table 81 shows the relationship between the predicted results and the true results.
In the correlation table 81 shown in FIG. 9, a case in which the predicted result is "positive" and the true result is "positive" is counted as TP; a predicted "positive" with a true "negative" as FP; a predicted "negative" with a true "positive" as FN; and a predicted "negative" with a true "negative" as TN.

Here, the precision P is the proportion of the data predicted to be "positive" that is actually "positive", and can be expressed as P = TP / (TP + FP).
The recall R is the proportion of the data that is actually "positive" that was predicted to be "positive", and can be expressed as R = TP / (TP + FN).
The F value is the harmonic mean of the precision P and the recall R, and can be expressed as F = (2 * R * P) / (R + P).
The precision P and the recall R are in a trade-off relationship, so a high F value indicates that both precision and recall are high in a well-balanced manner.
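The three formulas can be computed directly from the counts in the correlation table. The sketch below is illustrative only: the function name `precision_recall_f` is hypothetical, returning 0.0 when a denominator is zero is an added assumption, and the counts in the usage line are made-up numbers, not results from the embodiment.

```python
def precision_recall_f(tp, fp, fn):
    """Return (P, R, F) from true positives, false positives, false negatives."""
    # P = TP / (TP + FP); guard against an empty "predicted positive" set.
    p = tp / (tp + fp) if tp + fp else 0.0
    # R = TP / (TP + FN); guard against an empty "actually positive" set.
    r = tp / (tp + fn) if tp + fn else 0.0
    # F = (2 * R * P) / (R + P), the harmonic mean of P and R.
    f = (2 * r * p) / (r + p) if r + p else 0.0
    return p, r, f


# Illustrative counts only.
p, r, f = precision_recall_f(tp=40, fp=10, fn=20)
```

For a multi-class task such as the four emotion labels, these values would typically be computed per label by treating that label as "positive" and all others as "negative".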

As described above, according to the present embodiment, the text data acquisition unit 33 performs voice recognition on the voice data of the speaker's utterance and acquires text data indicating the content of the utterance based on the result of the voice recognition; the acoustic data acquisition unit 34 acquires the voice data of the speaker's utterance as acoustic data; and the emotion recognition unit 36 recognizes the speaker's emotion based on the acoustic data and the text data and obtains emotion information indicating the recognition result (for example, an emotion label classified as "joy", "anger", "sorrow", "normal", or the like). Consequently, from the text data of the recognition result and the classified emotion information, both the content of what a person said and the speaker's emotion at that time can be understood.

In the above embodiment, the learning model is stored in the computer 3, but it may instead be stored in the learning device 4 or in another computer (such as a server) connected to the network N.

The above embodiment takes as its example a conversation between the operator OP and the user U making an inquiry, but the invention is not limited to this. For example, between a caregiver working at a nursing care facility and a person receiving care from that caregiver (such as an elderly person), communication problems also arise, such as the caregiver being unable to understand the meaning of what the elderly person says. The present invention can therefore be applied to understand the speaker's emotion rather than the words themselves.

In other words, the information processing device to which the present invention is applied can take various embodiments having the following configuration.
That is, the information processing device to which the present invention is applied (for example, the computer 3 in FIG. 3) comprises:
text data acquisition means (for example, the voice recognition model 32 and the text data acquisition unit 33 in FIG. 3) for performing voice recognition on voice data of a speaker's utterance and acquiring text data indicating the content of the utterance based on the result of the voice recognition;
acoustic data acquisition means (for example, the acoustic data acquisition unit 34 in FIG. 3) for acquiring the voice data of the utterance of the speaker as acoustic data; and
emotion recognition means (for example, the emotion recognition model 35 and the emotion recognition unit 36 in FIG. 3) for recognizing the emotion of the speaker based on the acoustic data and the text data and obtaining emotion information indicating the recognition result.
As a result, from the text data of the recognition result and the emotion information, both the content of the speaker's utterance and the speaker's emotion at that time can be understood.
The emotion recognition means (for example, the emotion recognition model 35 and the emotion recognition unit 36 in FIG. 3) outputs, as the emotion information, at least one of an emotion label that classifies the speaker's emotion into categories including at least "joy", "anger", "sorrow", and "normal", and a probability value indicating the confidence of the classified emotion label.
As a result, it can be seen when the speaker's feelings are at odds with the words used, for example when the speaker uses polite expressions in the voice data but is actually angry.
The emotion recognition means recognizes the emotion of the speaker using a first feature value (feature vector) extracted from the acoustic data and a second feature value (feature vector) extracted from the text data.
As a result, the speaker's emotion is recognized from both the acoustic data and the text data, so the emotion recognition accuracy can be improved.
The emotion recognition means further comprises feature value extraction means (for example, the feature vector extraction unit 51b in FIG. 5) for decomposing the text data into morphemes by morphological analysis, converting the decomposed morphemes into hiragana based on preset word determination conditions, and converting the resulting hiragana into the second feature value using a model trained on text in advance.
As a result, the emotional expression obtained from the text data can be taken into account alongside the emotion obtained from the acoustic data, so the emotion recognition accuracy can be improved.
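The morpheme-to-hiragana step of the feature value extraction means can be sketched as below. This is only an illustration under several assumptions: the morpheme list is hand-written to stand in for the output of a morphological analyzer (surface form, part of speech, katakana reading), the part-of-speech filter `KEEP_POS` is a hypothetical stand-in for the "preset word determination conditions", which this section does not spell out, and the final conversion of hiragana into the second feature value by a pre-trained text model is not shown.

```python
def katakana_to_hiragana(text):
    """Shift katakana U+30A1..U+30F6 down by 0x60 to the matching hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


# Hypothetical analyzer output: (surface form, part of speech, katakana reading).
morphemes = [
    ("申し訳", "noun", "モウシワケ"),
    ("ござい", "verb", "ゴザイ"),
    ("ませ", "auxiliary", "マセ"),
    ("ん", "auxiliary", "ン"),
]

# Assumed word determination condition: keep only these parts of speech.
KEEP_POS = {"noun", "verb", "auxiliary"}


def to_hiragana_tokens(morphemes):
    """Filter morphemes by the assumed condition and return hiragana readings."""
    return [
        katakana_to_hiragana(reading)
        for _surface, pos, reading in morphemes
        if pos in KEEP_POS
    ]


tokens = to_hiragana_tokens(morphemes)
```

Normalizing every morpheme to hiragana removes spelling variation between kanji, katakana, and hiragana forms of the same word before the tokens are passed to the text model.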

The series of processes described above can be executed by hardware or by software.
In other words, the functional configuration of FIG. 3 is merely an example and is not particularly limiting.
That is, it suffices for the information processing system or information processing device to have functions capable of executing the above series of processes as a whole; which functional blocks are used to realize these functions is not limited to the example of FIG. 3. The location of the functional blocks is likewise not limited to FIG. 3 and may be arbitrary. For example, the functional blocks of the learning device 4 may be transferred to the computer 3 or the like, and the functional blocks of the computer 3 may be transferred to the learning device 4 or the like. Furthermore, the computer 3 and the learning device 4 may be the same hardware.

When the series of processes is executed by software, for example, the program constituting the software is installed on a computer or the like from a network or a recording medium.
The computer may be a computer embedded in dedicated hardware. The computer may also be a computer capable of executing various functions by installing various programs, for example a server, a general-purpose smartphone, or a personal computer.

A recording medium containing such a program may be constituted not only by removable media (not shown) distributed separately from the device body in order to provide the program to the user, but also by a recording medium or the like provided to the user in a state of being incorporated in the device body in advance.

In this specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the stated order, but also processing executed in parallel or individually without necessarily being processed in time series.
Also, in this specification, the term "system" means an overall apparatus composed of a plurality of devices, a plurality of means, and the like.

1 ... telephone, 2 ... headset, 3 ... computer, 4 ... learning device, 11 ... CPU, 18 ... storage unit, 31 ... conversion unit, 32 ... voice recognition model, 33 ... text data acquisition unit, 34 ... acoustic data acquisition unit, 35 ... emotion recognition model, 36 ... emotion recognition unit

Claims

An information processing device comprising:
text data acquisition means for performing voice recognition on voice data of a speaker's utterance and acquiring text data indicating the content of the utterance based on the result of the voice recognition;
acoustic data acquisition means for acquiring the voice data of the utterance of the speaker as acoustic data; and
emotion recognition means for recognizing the emotion of the speaker based on the acoustic data and the text data and obtaining emotion information indicating the recognition result.

The information processing device according to claim 1, wherein
the emotion recognition means outputs, as the emotion information, at least one of an emotion label that classifies the speaker's emotion into categories including at least "joy", "anger", "sorrow", and "normal", and a probability value indicating the confidence of the classified emotion label.

The information processing device according to claim 1 or 2, wherein
the emotion recognition means recognizes the emotion of the speaker using a first feature value extracted from the acoustic data and a second feature value extracted from the text data.

The information processing device according to claim 3, wherein
the emotion recognition means further comprises feature value extraction means for
decomposing the text data into morphemes by morphological analysis,
converting the decomposed morphemes into hiragana based on preset word determination conditions, and
converting the resulting hiragana into the second feature value using a model trained on text in advance.

An information processing method executed by an information processing device, the method comprising:
a step of performing voice recognition on voice data of a speaker's utterance and acquiring text data indicating the content of the utterance based on the result of the voice recognition;
a step of acquiring the voice data of the utterance of the speaker as acoustic data; and
a step of recognizing the emotion of the speaker based on the acoustic data and the text data and obtaining emotion information indicating the recognition result.

A program for causing a computer to execute processing, the program causing the computer to function as:
text acquisition means for performing voice recognition on voice data of a speaker's utterance and acquiring text data indicating the content of the utterance based on the result of the voice recognition;
acoustic acquisition means for acquiring the voice data of the utterance of the speaker as acoustic data; and
emotion recognition means for recognizing the emotion of the speaker based on the acoustic data and the text data and obtaining emotion information indicating the recognition result.