JP6819426B2

JP6819426B2 - Speech processing program, speech processing method and speech processor

Info

Publication number: JP6819426B2
Application number: JP2017074704A
Authority: JP
Inventors: 昭二早川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-04-04
Filing date: 2017-04-04
Publication date: 2021-01-27
Anticipated expiration: 2037-04-04
Also published as: JP2018180061A

Description

The present invention relates to a voice processing program and the like.

In recent years, company employees have been making telephone calls, holding conference calls, and the like using application software on the PC (personal computer) at their desk together with a headset. In the following description, company employees and other users are collectively referred to as users.

When a user is not used to handling a headset, the distance between the user's mouth and the headset microphone is often inappropriate. For example, if the mouth is close to the microphone, the volume tends to exceed an appropriate level, which may be unpleasant for the other party. Conversely, if the mouth is far from the microphone, the volume falls short of an appropriate level and the other party has difficulty hearing the voice.

One technique for evaluating sound quality and notifying the user is, for example, prior art 1. Prior art 1 evaluates sound quality based on external factors such as the sound of paper rubbing against the microphone, echo, ambient noise, and residual noise, and displays the evaluation result to the user.

Japanese Unexamined Patent Publication No. 1-155430
Japanese Unexamined Patent Publication No. 2010-259691

However, the prior art described above has the problem that the utterance state of the input voice cannot be estimated appropriately.

For example, the volume of the voice a user inputs to the microphone is not always constant. It fluctuates under the influence of the user's psychological state, such as stress, so the appropriate distance between the user's mouth and the microphone is not constant either. It is therefore desirable to estimate the utterance state of the input voice appropriately and to notify the user so that the distance between the mouth and the microphone becomes appropriate.

In contrast, the sound quality evaluation of prior art 1 merely evaluates sound quality in view of external factors such as noise; it does not evaluate the utterance state of the input voice. There is also a technique that reports whether the distance between the mouth and the microphone is appropriate based on the volume at the start of a conversation. However, as noted above, the volume of the input voice fluctuates with the user's psychological state, so a mouth-to-microphone distance chosen from the volume at the start does not necessarily remain optimal as the conversation continues.

In one aspect, an object of the present invention is to provide a voice processing program, a voice processing method, and a voice processing device capable of appropriately estimating the utterance state of an input voice.

In a first proposal, a computer is caused to execute the following processing. The computer extracts a pitch frequency and frequency power from an input voice. The computer outputs a determination result indicating whether a value based on the pitch frequency and the frequency power satisfies the condition of being equal to or greater than a predetermined threshold. The computer estimates the utterance state of the input voice based on the relationship between the determination result and the average power of the frequency power.

The utterance state of the input voice can be estimated appropriately.

FIG. 1 is a functional block diagram showing the configuration of the voice processing device according to the first embodiment.
FIG. 2 is a flowchart showing the processing procedure of the voice processing device according to the first embodiment.
FIG. 3 is a functional block diagram showing the configuration of the voice processing device according to the second embodiment.
FIG. 4 is a diagram showing an example of the data structure of the estimation result.
FIG. 5 is a flowchart (1) showing an example of the update process of the update unit according to the second embodiment.
FIG. 6 is a flowchart (2) showing an example of the update process of the update unit according to the second embodiment.
FIG. 7 is a flowchart (3) showing an example of the update process of the update unit according to the second embodiment.
FIG. 8 is a flowchart showing the processing procedure of the voice processing device according to the second embodiment.
FIG. 9 is a diagram showing an example of the system according to the third embodiment.
FIG. 10 is a functional block diagram showing the configuration of the voice processing device according to the third embodiment.
FIG. 11A is a functional block diagram showing the configuration of the server according to the third embodiment.
FIG. 11B is a diagram showing an example of the data structure of the threshold table according to the third embodiment.
FIG. 12 is a flowchart (1) showing the processing procedure of the voice processing device according to the third embodiment.
FIG. 13 is a flowchart (2) showing the processing procedure of the voice processing device according to the third embodiment.
FIG. 14 is a flowchart (3) showing the processing procedure of the voice processing device according to the third embodiment.
FIG. 15 is a diagram showing an example of the system according to the fourth embodiment.
FIG. 16 is a functional block diagram showing the configuration of the voice processing device according to the fourth embodiment.
FIG. 17 is a functional block diagram showing the configuration of the server according to the fourth embodiment.
FIG. 18 is a diagram showing an example of the data structure of the classification table according to the fourth embodiment.
FIG. 19 is a diagram showing an example of the data structure of the statistics.
FIG. 20 is a flowchart showing the processing procedure of the voice processing device according to the fourth embodiment.
FIG. 21 is a flowchart showing the processing procedure of the server according to the fourth embodiment.
FIG. 22 is a functional block diagram showing the configuration of the voice processing device according to the fifth embodiment.
FIG. 23 is a flowchart showing the processing procedure of the voice processing device according to the fifth embodiment.
FIG. 24 is a diagram showing an example of the hardware configuration of a computer that implements the same functions as the voice processing device.

Embodiments of the voice processing program, voice processing method, and voice processing device disclosed in the present application are described in detail below with reference to the drawings. The present invention is not limited by these embodiments.

FIG. 1 is a functional block diagram showing the configuration of the voice processing device according to the first embodiment. As shown in FIG. 1, the voice processing device 100 is connected to a microphone 10. The voice processing device 100 includes an AD (analog/digital) conversion unit 110, a pitch extraction unit 120a, a power extraction unit 120b, a stress detection unit 130, a storage unit 140, an estimation unit 150, and an information presentation unit 160. The pitch extraction unit 120a and the power extraction unit 120b are an example of an extraction unit.

The microphone 10 is a microphone attached to a headset (not shown) worn by the user, and it picks up the user's voice. The microphone 10 outputs the collected voice data of the user to the AD conversion unit 110 of the voice processing device 100. In the following description, the voice data that the microphone 10 outputs to the AD conversion unit 110 is referred to as the input voice.

The AD conversion unit 110 receives the input voice from the microphone 10 and performs AD conversion on it. The AD conversion unit 110 outputs the AD-converted input voice to the pitch extraction unit 120a and the power extraction unit 120b. AD conversion is the process of converting an analog signal into a digital signal. That is, the AD conversion unit 110 converts the analog input voice into a digital input voice. In the following description, the digital input voice produced by the AD conversion unit 110 is simply referred to as the "input voice".

The pitch extraction unit 120a is a processing unit that extracts, from the input voice, the pitch, that is, the fundamental frequency of the input voice. The pitch extraction unit 120a outputs the extracted pitch information to the stress detection unit 130.

The pitch extraction unit 120a executes frame processing and pitch calculation processing. Frame processing is described first. The pitch extraction unit 120a takes out the signal sequence of the input voice as a "frame" for every predetermined number of samples and multiplies the frame by an analysis window such as a Hanning window. This suppresses distortion caused by high-frequency components when the time-frequency conversion described later is performed.

For example, the pitch extraction unit 120a takes out the N samples in a 32 ms section at a sampling frequency of 8 kHz as a frame; for example, N = 256. Let the samples in the frame be "s(0), s(1), s(2), ..., s(N-1)". The pitch extraction unit 120a multiplies each of these samples by a Hamming window. For example, the Hamming window is given by equation (1).

Let the samples obtained by multiplying each sample by the Hamming window be "x(0), x(1), x(2), ..., x(N-1)". In the following description, these windowed samples "x(0), x(1), x(2), ..., x(N-1)" are referred to as sample values.
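The frame processing above can be sketched as follows. This is a minimal illustration, not the patent's implementation: equation (1) is not reproduced in this text, so the standard Hamming coefficients (0.54 and 0.46) are assumed, and the function names and the non-overlapping hop size are hypothetical.

```python
import math

SAMPLE_RATE = 8000   # 8 kHz, as in the example above
FRAME_LEN = 256      # N = 256 samples, i.e. 32 ms at 8 kHz


def hamming_window(n_samples):
    """Standard Hamming window (assumed form of equation (1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def split_into_frames(signal, frame_len=FRAME_LEN, hop=FRAME_LEN):
    """Cut the signal into frames of frame_len samples.

    hop is an assumption: the text does not say whether frames overlap.
    """
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def apply_window(frame, window):
    """Sample values x(n) = s(n) * w(n)."""
    return [s * w for s, w in zip(frame, window)]


# Illustrative input: a 200 Hz tone standing in for the digitized input voice.
signal = [math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(1024)]
window = hamming_window(FRAME_LEN)
frames = [apply_window(f, window) for f in split_into_frames(signal)]
```

Each element of `frames` then holds the sample values x(0) to x(N-1) of one frame.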

The pitch calculation processing is described next. The pitch extraction unit 120a calculates an autocorrelation function using the sample values in the frame. For example, the pitch extraction unit 120a calculates the autocorrelation function φ(m) based on equation (2), where m denotes the delay time.

For equation (2), the pitch extraction unit 120a identifies the delay time m, other than m = 0, at which the autocorrelation function takes a local maximum. This delay time is written as "delay time m'". After calculating the delay time m', the pitch extraction unit 120a calculates the pitch based on equation (3).

Pitch = 1 / (delay time m') ... (3)

By repeatedly executing the frame processing on the input voice, the pitch extraction unit 120a extracts a plurality of frames from the input voice and calculates the pitch of each frame. The pitch extraction unit 120a outputs the pitch information for each frame to the stress detection unit 130.

The pitch extraction unit 120a also determines whether a frame is a voiced section based on the local maximum φ(m') of the autocorrelation function, and outputs the determination result to the stress detection unit 130. For example, when the local maximum φ(m') of the autocorrelation function of a frame is equal to or greater than a predetermined value, the pitch extraction unit 120a determines that the frame is a voiced section.
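The pitch calculation and the voiced-section check can be sketched as below. Several points are assumptions, because equation (2) is not reproduced in this text: the autocorrelation is taken as the usual sum of x(n)·x(n+m) within one frame, the peak is searched only in a lag range corresponding to 60 to 400 Hz (a simplification of "local maximum other than m = 0"), and the peak is normalized by φ(0) before being compared with a hypothetical threshold of 0.3.

```python
import math


def autocorrelation(x, m):
    """phi(m) = sum of x(n) * x(n + m) over one frame (assumed form of eq. (2))."""
    return sum(x[n] * x[n + m] for n in range(len(x) - m))


def estimate_pitch(x, sample_rate=8000, f_min=60.0, f_max=400.0):
    """Return (pitch_hz, peak_ratio) for one frame of sample values.

    The lag range derived from f_min and f_max is an assumption used to
    exclude m = 0 and its neighbourhood.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(x) - 1)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda m: autocorrelation(x, m))
    phi0 = autocorrelation(x, 0)
    peak = autocorrelation(x, best_lag)
    # Equation (3): pitch = 1 / delay time, with the delay expressed in seconds.
    pitch_hz = sample_rate / best_lag
    peak_ratio = peak / phi0 if phi0 > 0 else 0.0
    return pitch_hz, peak_ratio


def is_voiced(peak_ratio, threshold=0.3):
    """Voiced when the (normalized) autocorrelation peak reaches the threshold."""
    return peak_ratio >= threshold
```

Normalizing by φ(0) keeps the hypothetical threshold independent of the input level; the source compares φ(m') itself with a predetermined value.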

The power extraction unit 120b is a processing unit that extracts the power of the input voice from the input voice. The power extraction unit 120b outputs the extracted power information to the stress detection unit 130.

The power extraction unit 120b extracts frames from the input voice by executing frame processing in the same manner as the pitch extraction unit 120a. Using a time-frequency conversion, the power extraction unit 120b converts the sample values "x(0), x(1), x(2), ..., x(N-1)" of the frame from the time domain into a spectral signal in the frequency domain. For example, a fast Fourier transform (FFT) can be used as the time-frequency conversion. The power extraction unit 120b then squares the spectral signal P(n) of each frequency band to obtain the power of each band, sums the power over all frequency bands, and takes the logarithm of the sum. This value is hereinafter referred to as the "power". For example, the power extraction unit 120b calculates the power from the spectral signal of the frame based on equation (4).

By repeatedly executing the frame processing on the input voice, the power extraction unit 120b extracts a plurality of frames from the input voice and calculates the power of each frame. The power extraction unit 120b outputs the power information for each frame to the stress detection unit 130.
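The power calculation described above (transform, square, sum over all bands, logarithm) might look like the following sketch. Equation (4) is not reproduced in this text, so the decibel scaling `10 * log10(...)` and the small floor that avoids `log(0)` are assumptions. A naive DFT is used only to keep the example dependency-free; an FFT yields the same spectrum.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of one frame."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]


def frame_power_db(x, floor=1e-12):
    """Log of the summed squared spectrum of one frame.

    The factor 10 and base-10 logarithm are assumptions; the text only says
    that the summed power is converted to a logarithmic value.
    """
    spectrum = dft(x)
    total = sum(abs(p) ** 2 for p in spectrum)
    return 10.0 * math.log10(max(total, floor))
```

With this scaling, doubling the amplitude of a frame raises its power by about 6 dB, which is why the mouth-to-microphone distance shows up directly in this value.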

The stress detection unit 130 is a processing unit that detects the stress value of the user based on the pitch and power of the input voice. For example, the stress detection unit 130 compares the current pitch and power statistics with the user's pitch and power statistics in normal times, increasing the stress value as the current statistics move further from the normal ones and decreasing it as they move closer. The stress detection unit 130 outputs the detected stress value information and the power information to the estimation unit 150.

An example of the processing of the stress detection unit 130 is described here. The stress detection unit 130 calculates and holds in advance the standard deviation of the pitch and the standard deviation of the power in normal times, from the pitch and power of the user's input voice in normal times. For example, the standard deviation of the pitch in normal times is written as "standard deviation σA1", and the standard deviation of the power in normal times as "standard deviation σB1".

In the stress detection phase, the stress detection unit 130 calculates the "standard deviation σA2" of the pitch over the frames and the "standard deviation σB2" of the power over the frames. For example, the stress detection unit 130 calculates the stress value based on equation (5). In equation (5), α and β are coefficients set in advance for the user.

Stress value = α × |standard deviation σA1 − standard deviation σA2| + β × |standard deviation σB1 − standard deviation σB2| ... (5)

Each time the stress detection unit 130 acquires the pitch and power information of a frame from the pitch extraction unit 120a and the power extraction unit 120b, it repeats the above processing to calculate the stress value for each frame. The stress detection unit 130 associates the stress value of each frame with the power and outputs them to the estimation unit 150.
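Equation (5) translates almost directly into code. The sketch below assumes the population standard deviation (`statistics.pstdev`); the text does not say whether the population or sample form is used, and the function and parameter names are hypothetical.

```python
from statistics import pstdev


def stress_value(pitches, powers, sigma_a1, sigma_b1, alpha=1.0, beta=1.0):
    """Stress value of equation (5).

    pitches, powers: per-frame values collected in the stress detection phase.
    sigma_a1, sigma_b1: standard deviations of pitch and power in normal times.
    alpha, beta: preset weighting coefficients (default values are illustrative).
    """
    sigma_a2 = pstdev(pitches)   # current standard deviation of the pitch
    sigma_b2 = pstdev(powers)    # current standard deviation of the power
    return alpha * abs(sigma_a1 - sigma_a2) + beta * abs(sigma_b1 - sigma_b2)
```

For instance, with normal-time deviations σA1 = 10 and σB1 = 2, frames with pitches `[100, 120]` and powers `[50, 56]` have current deviations of 10 and 3, so the stress value with α = β = 1 is 0 + 1 = 1.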

The storage unit 140 holds determination reference data 140a. The storage unit 140 corresponds to a semiconductor memory element such as a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory, or to a storage device such as an HDD (Hard Disk Drive).

The determination reference data 140a contains the threshold data that the estimation unit 150, described later, uses to estimate the utterance state of the input voice. Specifically, the determination reference data 140a contains a first threshold, a second threshold, and a third threshold. The first and second thresholds are compared with the power, and the first threshold is greater than the second threshold. The third threshold is compared with the stress value. For example, when the power is equal to or greater than the second threshold and less than the first threshold, the conversation audio can be said to be good.

The estimation unit 150 is a processing unit that estimates the utterance state of the input voice based on the stress value of the input voice, the power, and the determination reference data 140a. After estimating the utterance state, the estimation unit 150 generates a message corresponding to that state and outputs it to the information presentation unit 160, which displays it. As described below, the estimation unit 150 uses the stress value to estimate whether the user's stress is large or small, and uses the power to estimate whether the user's mouth is close to the microphone 10.

The estimation unit 150 estimates that the user's stress is "large" when the stress value is equal to or greater than the third threshold, and that it is "small" when the stress value is less than the third threshold.

The estimation unit 150 calculates the average of the power of the frames in the voiced section. In the following description, this average is referred to as the "average power". When the average power is equal to or greater than the first threshold, the estimation unit 150 estimates that "the distance between the user's mouth and the microphone 10 is short". When the average power is less than the second threshold, the estimation unit 150 estimates that "the distance between the user's mouth and the microphone 10 is long".

As a conversation continues, when a user shifts from low stress to high stress, the power of the input voice tends to become greater than its current level. Consequently, if the current stress is "small" and "the distance between the user's mouth and the microphone 10 is short", the power of the input voice may exceed the appropriate level once the stress shifts to "large". The estimation unit 150 therefore generates the first message, "Please move the microphone a little away from your mouth", when "the stress is small" and "the distance between the user's mouth and the microphone 10 is short".

Conversely, when a user shifts from high stress to low stress as the conversation continues, the power of the input voice tends to become smaller than its current level. Consequently, if the current stress is "large" and "the distance between the user's mouth and the microphone 10 is long", the power of the input voice may fall below the appropriate level once the stress shifts to "small". The estimation unit 150 therefore generates the second message, "Please bring the microphone a little closer to your mouth", when "the stress is large" and "the distance between the user's mouth and the microphone 10 is long".
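The two message rules combine the stress estimate and the distance estimate as in the sketch below. The function name and the threshold values used in the usage note are hypothetical; only the comparisons themselves come from the description above.

```python
from typing import Optional

FIRST_MESSAGE = "Please move the microphone a little away from your mouth."
SECOND_MESSAGE = "Please bring the microphone a little closer to your mouth."


def estimate_message(stress: float, average_power: float,
                     first_threshold: float, second_threshold: float,
                     third_threshold: float) -> Optional[str]:
    """Return the message to present, or None when no message is needed."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must exceed the second threshold")
    stress_is_large = stress >= third_threshold
    mouth_is_close = average_power >= first_threshold
    mouth_is_far = average_power < second_threshold
    if mouth_is_close and not stress_is_large:
        # Power is already high and is likely to rise further under stress.
        return FIRST_MESSAGE
    if mouth_is_far and stress_is_large:
        # Power is already low and is likely to drop further once stress eases.
        return SECOND_MESSAGE
    return None
```

With illustrative thresholds of 70 (first), 50 (second), and 1.0 (third), a stress value of 0.5 with an average power of 75 yields the first message, while a stress value of 1.5 with an average power of 40 yields the second.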

The information presentation unit 160 is a processing unit that presents the message generated by the estimation unit 150 to the user. For example, the information presentation unit 160 is connected to a display device such as a liquid crystal display or to an output device such as a speaker. Here, as an example, the information presentation unit 160 is connected to a liquid crystal display and displays the message generated by the estimation unit 150.

The processing of the AD conversion unit 110, the pitch extraction unit 120a, the power extraction unit 120b, the stress detection unit 130, the estimation unit 150, and the information presentation unit 160 shown in FIG. 1 may be executed by a predetermined control unit (not shown). This control unit can be implemented by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like. The control unit can also be implemented by hard-wired logic such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

Next, the processing procedure of the voice processing device 100 according to the first embodiment is described. FIG. 2 is a flowchart showing the processing procedure of the voice processing device according to the first embodiment. As shown in FIG. 2, the AD conversion unit 110 of the voice processing device 100 starts accepting the input voice (step S101). The AD conversion unit 110 performs AD conversion (step S102). The pitch extraction unit 120a of the voice processing device 100 extracts the pitch, and the power extraction unit 120b of the voice processing device 100 extracts the power (step S103).

The pitch extraction unit 120a detects voiced sections (step S104). The stress detection unit 130 of the voice processing device 100 accumulates the pitch and power (step S105). When the pitch and power for the specified number of frames have been accumulated (step S106, Yes), the stress detection unit 130 proceeds to step S107. When they have not yet been accumulated (step S106, No), the stress detection unit 130 returns to step S101.

The stress detection unit 130 calculates the stress value (step S107). The estimation unit 150 of the voice processing device 100 calculates the average power of the voiced section (step S108). When the average power is equal to or greater than the first threshold (step S109, Yes), the estimation unit 150 proceeds to step S110. When the average power is less than the first threshold (step S109, No), the estimation unit 150 proceeds to step S112.

The estimation unit 150 determines whether the stress value is equal to or greater than the third threshold (step S110). When the stress value is equal to or greater than the third threshold (step S110, Yes), the estimation unit 150 returns to step S101. When the stress value is less than the third threshold (step S110, No), the estimation unit 150 causes the information presentation unit 160 to display the first message (step S111) and returns to step S101. For example, the first message is "Please move the microphone a little away from your mouth."

Step S112 is described next. The estimation unit 150 determines whether the average power is less than the second threshold (step S112). When the average power is not less than the second threshold (step S112, No), the estimation unit 150 returns to step S101. When the average power is less than the second threshold (step S112, Yes), the estimation unit 150 proceeds to step S113.

The estimation unit 150 determines whether the stress value is equal to or greater than the third threshold (step S113). When the stress value is not equal to or greater than the third threshold (step S113, No), the estimation unit 150 returns to step S101. When the stress value is equal to or greater than the third threshold (step S113, Yes), the estimation unit 150 causes the information presentation unit 160 to display the second message (step S114) and returns to step S101. For example, the second message is "Please bring the microphone a little closer to your mouth."
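The accumulation in steps S105 and S106 and the voiced-section average in step S108 can be sketched with a small buffer. The class and method names are hypothetical, and two details are assumptions the flowchart leaves open: every frame is stored together with its voiced flag, and the buffer is cleared by the caller after each evaluation.

```python
class FrameBuffer:
    """Collects per-frame pitch and power until enough frames are available."""

    def __init__(self, frames_needed):
        self.frames_needed = frames_needed
        self.records = []  # one (pitch, power, voiced) tuple per frame

    def add(self, pitch, power, voiced):
        """Step S105: store one frame. Returns True when step S106 is 'Yes'."""
        self.records.append((pitch, power, voiced))
        return len(self.records) >= self.frames_needed

    def average_voiced_power(self):
        """Step S108: mean power over voiced frames, or None if there are none."""
        voiced = [power for _, power, is_voiced in self.records if is_voiced]
        return sum(voiced) / len(voiced) if voiced else None

    def clear(self):
        """Start a new accumulation period (assumed behaviour)."""
        self.records.clear()
```

A caller would feed one frame at a time through `add`, and once it returns `True`, compute the stress value and the average power, apply the threshold checks of steps S109 to S114, and then call `clear`.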

Next, the effects of the voice processing device 100 according to the first embodiment are described. The voice processing device 100 calculates the user's stress value based on the pitch and power of the input voice, and estimates the utterance state based on both the magnitude of the calculated stress value and the magnitude of the power. This makes it possible, for example, to estimate the future utterance state in view of the user's degree of stress as well as the power of the input voice. Because the device can appropriately estimate how the utterance state will change in the future, and not only its current state, generating a message according to the estimation result and presenting it to the user keeps the distance between the user's mouth and the microphone appropriate and keeps calls comfortable for each user.

図３は、本実施例２に係る音声処理装置の構成を示す機能ブロック図である。この音声処理装置２００は、図３に示すように、マイク１０に接続される。音声処理装置２００は、ＡＤ変換部１１０、ピッチ抽出部１２０ａ、パワー抽出部１２０ｂ、ストレス検出部１３０、記憶部１４０、情報提示部１６０、推定部２１０、更新部２２０を有する。このうち、マイク１０、ＡＤ変換部１１０、ピッチ抽出部１２０ａ、パワー抽出部１２０ｂ、ストレス検出部１３０、記憶部１４０、情報提示部１６０に関する説明は、実施例１で説明したものと同様であるため、説明を省略する。 FIG. 3 is a functional block diagram showing the configuration of the voice processing device according to the second embodiment. As shown in FIG. 3, the voice processing device 200 is connected to the microphone 10. The voice processing device 200 includes an AD conversion unit 110, a pitch extraction unit 120a, a power extraction unit 120b, a stress detection unit 130, a storage unit 140, an information presentation unit 160, an estimation unit 210, and an update unit 220. Of these, the microphone 10, the AD conversion unit 110, the pitch extraction unit 120a, the power extraction unit 120b, the stress detection unit 130, the storage unit 140, and the information presentation unit 160 are the same as those described in the first embodiment, so their description is omitted.

推定部２１０は、入力音声のストレス値と、平均パワーと、判定基準データ１４０ａとを基にして、入力音声の発声状態を推定する処理部である。推定部２１０は、実施例１で説明した推定部１５０の処理に加えて、推定結果を更新部２２０に出力する。 The estimation unit 210 is a processing unit that estimates the utterance state of the input voice based on the stress value of the input voice, the average power, and the determination reference data 140a. The estimation unit 210 outputs the estimation result to the update unit 220 in addition to the processing of the estimation unit 150 described in the first embodiment.

図４は、推定結果のデータ構造の一例を示す図である。図４に示すように、推定結果には、提示フラグ、メッセージ種別、ストレス値Ｓｎ、平均パワーＰｎ、ストレス値Ｓｐ１、平均パワーＰｐ１、ストレス値Ｓｐ２、平均パワーＰｐ２を含む。 FIG. 4 is a diagram showing an example of the data structure of the estimation result. As shown in FIG. 4, the estimation result includes a presentation flag, a message type, a stress value Sn, an average power Pn, a stress value Sp1, an average power Pp1, a stress value Sp2, and an average power Pp2.

提示フラグは、前回の推定時にメッセージを表示したか否かを示す情報である。前回メッセージを提示した場合には、提示フラグは「オン」となり、前回メッセージを提示していない場合には、提示フラグは「オフ」となる。メッセージ種別は、前回提示したメッセージが、第１メッセージであるか、第２メッセージであるかを示す情報である。なお、提示フラグが「オフ」である場合には、メッセージ種別には情報が格納されない。 The presentation flag is information indicating whether or not a message was displayed at the time of the previous estimation. If the previous message was presented, the presentation flag is "on", and if the previous message was not presented, the presentation flag is "off". The message type is information indicating whether the previously presented message is the first message or the second message. If the presentation flag is "off", no information is stored in the message type.

ストレス値Ｓｎは、現在の入力音声のストレス値が「大」であるか「小」であるかを示す。平均パワーＰｎは、現在の入力音声の平均パワーを示す。ストレス値Ｓｐ１は、前回メッセージを提示した際の入力音声のストレス値が「大」であるか「小」であるかを示す。平均パワーＰｐ１は、前回メッセージを提示した際の入力音声の平均パワーを示す。 The stress value Sn indicates whether the stress value of the current input voice is “large” or “small”. The average power Pn indicates the average power of the current input voice. The stress value Sp1 indicates whether the stress value of the input voice when the previous message is presented is “large” or “small”. The average power Pp1 indicates the average power of the input voice when the previous message is presented.

ストレス値Ｓｐ２は、一定時間前の入力音声のストレス値が「大」であるか「小」であるかを示す。平均パワーＰｐ２は、一定時間前の入力音声の平均パワーを示す。 The stress value Sp2 indicates whether the stress value of the input voice before a certain period of time is “large” or “small”. The average power Pp2 indicates the average power of the input voice before a certain time.
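The estimation result described above can be pictured as a small record type. The following is a minimal sketch only: the class name, the field names, the integer encoding of the message type, and the choice to leave Sp1 and Pp1 empty when no message has been presented are all assumptions made for illustration and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EstimationResult:
    """Hypothetical container for one estimation result (see FIG. 4)."""
    presented: bool              # presentation flag: was a message shown last time?
    message_type: Optional[int]  # 1 = first message, 2 = second message, None if flag is off
    sn_large: bool               # Sn: current stress is "large" (True) or "small" (False)
    pn: float                    # Pn: current average power
    sp1_large: Optional[bool]    # Sp1: stress when the previous message was presented
    pp1: Optional[float]         # Pp1: average power when the previous message was presented
    sp2_large: bool              # Sp2: stress a fixed time earlier
    pp2: float                   # Pp2: average power a fixed time earlier

    def __post_init__(self) -> None:
        # The text states that no message type is stored while the flag is off.
        if not self.presented and self.message_type is not None:
            raise ValueError("message_type must be None when presented is False")
        # Assumption: when the flag is on, the type is one of the two messages.
        if self.presented and self.message_type not in (1, 2):
            raise ValueError("message_type must be 1 or 2 when presented is True")
```

Validating the flag and the message type together is a design choice of this sketch; it keeps the later update logic from having to handle inconsistent records.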

更新部２２０は、推定部２１０から取得する推定結果を基にして、判定基準データ１４０ａの第１閾値、第２閾値、第３閾値を更新する処理部である。判定基準データ１４０ａは学習データの一例である。なお、上記のように、平均パワーが第２閾値以上、第１閾値未満であれば、会話は良好である。また、第３閾値は、ストレスの大小を区別する際に用いる閾値である。 The update unit 220 is a processing unit that updates the first threshold value, the second threshold value, and the third threshold value of the determination reference data 140a based on the estimation result acquired from the estimation unit 210. The determination reference data 140a is an example of learning data. As described above, if the average power is equal to or greater than the second threshold value and less than the first threshold value, the conversation is in a good state. The third threshold value is the threshold used to distinguish between large and small stress.

第１メッセージ「マイクを口から少し離してください」を表示した前後で、ストレス値に変化がなく、平均パワーＰｎが、第１閾値未満となった場合には、パワーが良好な状態まで減少しており、発声状態が改善されていると言える。この場合には、更新部２２０は、第１閾値、第２閾値、第３閾値が正しい値であるとして更新処理をスキップする。 If the stress value does not change before and after the first message "Please move the microphone a little away from your mouth" is displayed and the average power Pn falls below the first threshold value, the power has decreased to a good level and the utterance state can be said to have improved. In this case, the update unit 220 treats the first, second, and third threshold values as correct and skips the update process.

第１メッセージ「マイクを口から少し離してください」を表示した前後で、ストレス値に変化がなく、平均パワー（Ｐｎ、Ｐｐ１との比較）に変化が見られない場合には、発声状態が改善されていない。この場合には、更新部２２０は、第１閾値が不正解の値であるとして、第１閾値を上げる。 If neither the stress value nor the average power (comparing Pn with Pp1) changes before and after the first message "Please move the microphone a little away from your mouth" is displayed, the utterance state has not improved. In this case, the update unit 220 treats the first threshold value as incorrect and raises it.

第１メッセージ「マイクを口から少し離してください」を表示した前後で、ストレス値が小から大に変化し、平均パワー（Ｐｎ、Ｐｐ１との比較）が増加した場合には、ストレスの大小を適切に判断できておらず、適切なメッセージを提示できていない。この場合には、更新部２２０は、第３閾値が不正解の値であるとして、第３閾値を下げる。 If the stress value changes from small to large and the average power (comparing Pn with Pp1) increases before and after the first message "Please move the microphone a little away from your mouth" is displayed, the magnitude of the stress has not been judged properly and an appropriate message has not been presented. In this case, the update unit 220 treats the third threshold value as incorrect and lowers it.

第２メッセージ「マイクを口に少し近づけてください」を表示した前後で、ストレス値に変化がなく、平均パワーＰｎが、第２閾値以上となった場合には、パワーが良好な状態まで増加していると言える。この場合には、更新部２２０は、第１閾値、第２閾値、第３閾値が正しい値であるとして更新処理をスキップする。 Before and after displaying the second message "Please bring the microphone a little closer to your mouth", if there is no change in the stress value and the average power Pn becomes equal to or higher than the second threshold value, the power increases to a good state. It can be said that it is. In this case, the update unit 220 skips the update process assuming that the first threshold value, the second threshold value, and the third threshold value are correct values.

第２メッセージ「マイクを口に少し近づけてください」を表示した前後で、ストレス値に変化がなく、平均パワー（Ｐｎ、Ｐｐ１との比較）に変化が見られない場合には、発声状態が改善されていない。この場合には、更新部２２０は、第２閾値が不正解の値であるとして、第２閾値を下げる。 If neither the stress value nor the average power (comparing Pn with Pp1) changes before and after the second message "Please bring the microphone a little closer to your mouth" is displayed, the utterance state has not improved. In this case, the update unit 220 treats the second threshold value as incorrect and lowers it.

第２メッセージ「マイクを口に少し近づけてください」を表示した前後で、ストレス値が大から小に変化し、平均パワー（Ｐｎ、Ｐｐ１との比較）が減少した場合には、ストレスの大小を適切に判断できておらず、適切なメッセージを提示できていない。この場合には、更新部２２０は、第３閾値が不正解の値であるとして、第３閾値を上げる。 If the stress value changes from large to small and the average power (comparing Pn with Pp1) decreases before and after the second message "Please bring the microphone a little closer to your mouth" is displayed, the magnitude of the stress has not been judged properly and an appropriate message has not been presented. In this case, the update unit 220 treats the third threshold value as incorrect and raises it.

更新部２２０は、前回メッセージを提示しておらず、ストレス値に変化が無く、平均パワーに変化が見られない場合には、第１閾値、第２閾値、第３閾値が正しい値であるとして更新処理をスキップする。 If the update unit 220 has not presented the previous message, there is no change in the stress value, and there is no change in the average power, it is assumed that the first threshold value, the second threshold value, and the third threshold value are correct values. Skip the update process.

前回メッセージを提示しておらず、ストレス値に変化が無く、平均パワー（Ｐｎ、Ｐｐ２との比較）が増加した場合には、第２メッセージ「マイクを口に少し近づけてください」の提示もれであり、第２閾値が不正解であるとして、第２閾値を上げる。 If the previous message was not presented, the stress value did not change, and the average power (comparison with Pn and Pp2) increased, the second message "Please bring the microphone closer to your mouth" is not presented. Therefore, assuming that the second threshold value is incorrect, the second threshold value is raised.

前回メッセージを提示しておらず、ストレス値に変化が無く、平均パワー（Ｐｎ、Ｐｐ２との比較）が減少した場合には、第１メッセージ「マイクを口から少し離してください」の提示もれであり、第１閾値が不正解であるとして、第１閾値を下げる。 If the previous message was not presented, the stress value did not change, and the average power (comparison with Pn and Pp2) decreased, the first message "Please move the microphone away from your mouth" is not presented. Therefore, assuming that the first threshold value is incorrect, the first threshold value is lowered.

更新部２２０は、上記処理を繰り返し実行することで、第１閾値、第２閾値、第３閾値が正しい値となるように、第１閾値、第２閾値、第３閾値を更新していく。 By repeatedly executing the above processing, the update unit 220 updates the first threshold value, the second threshold value, and the third threshold value so that the first threshold value, the second threshold value, and the third threshold value become correct values.

図５、図６、図７は、本実施例２に係る更新部の更新処理の一例を示すフローチャートである。図５に示すように、更新部２２０は、推定結果を取得し（ステップＳ１０）、前回メッセージを提示したか否かを判定する（ステップＳ１１）。更新部２２０は、前回メッセージを提示していない場合には（ステップＳ１１，Ｎｏ）、図７のステップＳ２１に移行する。一方、更新部２２０は、前回メッセージを提示している場合には（ステップＳ１１，Ｙｅｓ）、ステップＳ１２に移行する。 FIGS. 5, 6, and 7 are flowcharts showing an example of the update process performed by the update unit according to the second embodiment. As shown in FIG. 5, the update unit 220 acquires the estimation result (step S10) and determines whether a message was presented last time (step S11). If no message was presented last time (step S11, No), the update unit 220 proceeds to step S21 in FIG. 7. On the other hand, if a message was presented last time (step S11, Yes), the update unit 220 proceeds to step S12.

更新部２２０は、提示したメッセージが「第１メッセージ」であるか否かを判定する（ステップＳ１２）。更新部２２０は、提示したメッセージが「第１メッセージ」でない場合には（ステップＳ１２，Ｎｏ）、図６のステップＳ１７に移行する。更新部２２０は、提示したメッセージが「第１メッセージ」である場合には（ステップＳ１２，Ｙｅｓ）、ステップＳ１３に移行する。 The update unit 220 determines whether or not the presented message is the "first message" (step S12). If the presented message is not the "first message" (steps S12, No), the update unit 220 proceeds to step S17 of FIG. If the presented message is the "first message" (steps S12, Yes), the update unit 220 proceeds to step S13.

更新部２２０は、ストレス値ＳｎおよびＳｐ１がストレス小であり、かつ、平均パワーＰｎとＰｐ１とが変化なしである場合には（ステップＳ１３，Ｙｅｓ）、第１閾値を上げる（ステップＳ１４）。例えば、ステップＳ１４において、更新部２２０は、式（６）に基づいて、第１閾値を更新する。 When the stress values Sn and Sp1 both indicate low stress and the average powers Pn and Pp1 show no change (step S13, Yes), the update unit 220 raises the first threshold value (step S14). For example, in step S14, the update unit 220 updates the first threshold value based on equation (6).

第１閾値＝１．０５×第１閾値・・・（６） First threshold = 1.05 x first threshold ... (6)

一方、更新部２２０は、ストレス値ＳｎおよびＳｐ１がストレス小でない、または、平均パワーＰｎとＰｐ１とが変化ありの場合には（ステップＳ１３，Ｎｏ）、ステップＳ１５に移行する。 On the other hand, when the stress values Sn and Sp1 are not small stress or the average power Pn and Pp1 are changed (steps S13 and No), the update unit 220 shifts to step S15.

更新部２２０は、ストレス値Ｓｎがストレス大、かつ、ストレス値Ｓｐ１がストレス小である場合には（ステップＳ１５，Ｙｅｓ）、第３閾値を下げる（ステップＳ１６）。例えば、ステップＳ１６において、更新部２２０は、式（７）に基づいて、第３閾値を更新する。 When the stress value Sn is high stress and the stress value Sp1 is low stress (steps S15, Yes), the update unit 220 lowers the third threshold value (step S16). For example, in step S16, the update unit 220 updates the third threshold value based on the equation (7).

第３閾値＝０．９×第３閾値＋０．１×（Ｓｐ１−Ｓｎ）・・・（７） Third threshold = 0.9 x third threshold + 0.1 x (Sp1-Sn) ... (7)

一方、更新部２２０は、ストレス値Ｓｎがストレス大、かつ、ストレス値Ｓｐ１がストレス小でない場合には（ステップＳ１５，Ｎｏ）、処理を終了する。 On the other hand, when the stress value Sn is high stress and the stress value Sp1 is not low stress (steps S15, No), the update unit 220 ends the process.

図６の説明に移行する。更新部２２０は、ストレス値ＳｎおよびＳｐ１がストレス大であり、かつ、平均パワーＰｎとＰｐ１とが変化なしである場合には（ステップＳ１７，Ｙｅｓ）、第２閾値を下げる（ステップＳ１８）。例えば、ステップＳ１８において、更新部２２０は、式（８）に基づいて、第２閾値を更新する。 The description now moves to FIG. 6. When the stress values Sn and Sp1 both indicate high stress and the average powers Pn and Pp1 show no change (step S17, Yes), the update unit 220 lowers the second threshold value (step S18). For example, in step S18, the update unit 220 updates the second threshold value based on equation (8).

第２閾値＝０．９５×第２閾値・・・（８） Second threshold = 0.95 x second threshold ... (8)

一方、更新部２２０は、ストレス値ＳｎおよびＳｐ１がストレス大でない、または、平均パワーＰｎとＰｐ１とが変化ありである場合には（ステップＳ１７，Ｎｏ）、ステップＳ１９に移行する。 On the other hand, when the stress values Sn and Sp1 are not stressful or the average power Pn and Pp1 are changed (steps S17 and No), the update unit 220 proceeds to step S19.

更新部２２０は、ストレス値Ｓｎがストレス小かつストレス値Ｓｐ１がストレス大である場合には（ステップＳ１９，Ｙｅｓ）、第３閾値を上げる（ステップＳ２０）。例えば、ステップＳ２０において、更新部２２０は、式（７）に基づいて、第３閾値を更新する。なお、ステップＳ１９において、更新部２２０は、ストレス値Ｓｎがストレス小かつストレス値Ｓｐ１がストレス大でない場合には（ステップＳ１９，Ｎｏ）、処理を終了する。 When the stress value Sn is low stress and the stress value Sp1 is high stress (step S19, Yes), the update unit 220 raises the third threshold value (step S20). For example, in step S20, the update unit 220 updates the third threshold value based on the equation (7). In step S19, when the stress value Sn is low stress and the stress value Sp1 is not high stress (steps S19, No), the update unit 220 ends the process.

図７の説明に移行する。更新部２２０は、ストレス値ＳｎおよびＳｐ２がストレス大、かつ、平均パワーＰｎがＰｐ２と比較して増加した場合には（ステップＳ２１，Ｙｅｓ）、第２閾値を上げる（ステップＳ２２）。例えば、更新部２２０は、ステップＳ２２において、式（９）に基づいて、第２閾値を更新する。 The description shifts to FIG. When the stress values Sn and Sp2 are high in stress and the average power Pn is increased as compared with Pp2 (steps S21, Yes), the update unit 220 raises the second threshold value (step S22). For example, the update unit 220 updates the second threshold value in step S22 based on the equation (9).

第２閾値＝０．９×第２閾値＋０．１×（Ｐｎ−Ｐｐ２）・・・（９） Second threshold = 0.9 × second threshold + 0.1 × (Pn-Pp2) ... (9)

一方、更新部２２０は、ストレス値ＳｎおよびＳｐ２がストレス大ではない、または、平均パワーＰｎがＰｐ２と比較して増加していない場合には（ステップＳ２１，Ｎｏ）、ステップＳ２３に移行する。 On the other hand, when the stress values Sn and Sp2 are not stressful or the average power Pn is not increased as compared with Pp2 (steps S21, No), the update unit 220 shifts to step S23.

更新部２２０は、ストレス値ＳｎおよびＳｐ２がストレス小かつ平均パワーＰｎがＰｐ２と比較して減少している場合には（ステップＳ２３，Ｙｅｓ）、第１閾値を下げる（ステップＳ２４）。例えば、ステップＳ２４において、更新部２２０は、式（１０）に基づいて、第１閾値を更新する。更新部２２０は、ストレス値ＳｎおよびＳｐ２がストレス小でない、または、平均パワーＰｎがＰｐ２と比較して減少していない場合には（ステップＳ２３，Ｎｏ）、処理を終了する。 When the stress values Sn and Sp2 both indicate low stress and the average power Pn has decreased compared with Pp2 (step S23, Yes), the update unit 220 lowers the first threshold value (step S24). For example, in step S24, the update unit 220 updates the first threshold value based on equation (10). When the stress values Sn and Sp2 do not both indicate low stress, or the average power Pn has not decreased compared with Pp2 (step S23, No), the update unit 220 ends the process.

第１閾値＝０．９５×第１閾値・・・（１０） First threshold = 0.95 x first threshold ... (10)
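The branching of steps S11 to S24 and equations (6) to (10) can be sketched as a single function. This is an illustrative sketch, not the patented implementation. The names `Thresholds`, `Observation`, and `update_thresholds` are hypothetical. The tolerance `tol` used to decide that two average powers are "unchanged" is an assumption, since the text does not say how "no change" is measured. Because equation (7) subtracts Sn from Sp1, the sketch also assumes that a numeric stress value is kept alongside the large/small label. The equations are applied exactly as written, without checking the direction in which each one moves its threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    first: float   # upper bound of "good" average power
    second: float  # lower bound of "good" average power
    third: float   # boundary between small and large stress


@dataclass
class Observation:
    stress: float  # numeric stress value (assumed available for equation (7))
    large: bool    # True if the stress was judged "large"
    power: float   # average power


def update_thresholds(th: Thresholds, now: Observation,
                      at_message: Optional[Observation], earlier: Observation,
                      message_type: Optional[int], tol: float = 1.0) -> Thresholds:
    """Return new thresholds; message_type is 1, 2, or None (no message last time)."""
    first, second, third = th.first, th.second, th.third
    if message_type is not None and at_message is not None:      # S11: Yes
        p1 = at_message
        unchanged = abs(now.power - p1.power) <= tol              # assumed test for "no change"
        eq7 = 0.9 * third + 0.1 * (p1.stress - now.stress)        # equation (7)
        if message_type == 1:                                     # S12: Yes
            if not now.large and not p1.large and unchanged:      # S13
                first = 1.05 * first                              # S14, equation (6)
            elif now.large and not p1.large:                      # S15
                third = eq7                                       # S16
        else:                                                     # S12: No
            if now.large and p1.large and unchanged:              # S17
                second = 0.95 * second                            # S18, equation (8)
            elif not now.large and p1.large:                      # S19
                third = eq7                                       # S20
    else:                                                         # S11: No
        if now.large and earlier.large and now.power > earlier.power:            # S21
            second = 0.9 * second + 0.1 * (now.power - earlier.power)            # S22, equation (9)
        elif not now.large and not earlier.large and now.power < earlier.power:  # S23
            first = 0.95 * first                                  # S24, equation (10)
    return Thresholds(first, second, third)
```

Returning a new `Thresholds` object instead of mutating the argument is a choice of this sketch; it makes each update step easy to compare against the previous values.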

次に、本実施例２に係る音声処理装置２００の処理手順について説明する。図８は、本実施例２に係る音声処理装置の処理手順を示すフローチャートである。図８に示すように、音声処理装置２００のＡＤ変換部１１０は、入力音声の受け付けを開始する（ステップＳ２０１）。ＡＤ変換部１１０は、ＡＤ変換を行う（ステップＳ２０２）。音声処理装置２００のピッチ抽出部１２０ａは、ピッチを抽出し、音声処理装置２００のパワー抽出部１２０ｂは、パワーを抽出する（ステップＳ２０３）。 Next, the processing procedure of the voice processing device 200 according to the second embodiment will be described. FIG. 8 is a flowchart showing a processing procedure of the voice processing device according to the second embodiment. As shown in FIG. 8, the AD conversion unit 110 of the voice processing device 200 starts accepting the input voice (step S201). The AD conversion unit 110 performs AD conversion (step S202). The pitch extraction unit 120a of the voice processing device 200 extracts the pitch, and the power extraction unit 120b of the voice processing device 200 extracts the power (step S203).

ピッチ抽出部１２０ａは、有音区間を検出する（ステップＳ２０４）。音声処理装置２００のストレス検出部１３０は、ピッチ・パワーを蓄積する（ステップＳ２０５）。ストレス検出部１３０は、指定されたフレーム数に対応するピッチ・パワーが蓄積された場合には（ステップＳ２０６，Ｙｅｓ）、ステップＳ２０７に移行する。一方、ストレス検出部１３０は、指定されたフレーム数に対応するピッチ・パワーが蓄積されていない場合には（ステップＳ２０６，Ｎｏ）、ステップＳ２０１に移行する。 The pitch extraction unit 120a detects a sound section (step S204). The stress detection unit 130 of the voice processing device 200 accumulates pitch power (step S205). When the pitch power corresponding to the specified number of frames is accumulated (step S206, Yes), the stress detection unit 130 proceeds to step S207. On the other hand, the stress detection unit 130 proceeds to step S201 when the pitch power corresponding to the specified number of frames is not accumulated (steps S206, No).

ストレス検出部１３０は、ストレス値を算出する（ステップＳ２０７）。音声処理装置２００の推定部２１０は、有音区間の平均パワーを算出する（ステップＳ２０８）。推定部２１０は、平均パワーが第１閾値以上である場合には（ステップＳ２０９，Ｙｅｓ）、ステップＳ２１０に移行する。一方、推定部２１０は、平均パワーが第１閾値未満である場合には（ステップＳ２０９，Ｎｏ）、ステップＳ２１２に移行する。 The stress detection unit 130 calculates the stress value (step S207). The estimation unit 210 of the voice processing device 200 calculates the average power of the sounded section (step S208). When the average power is equal to or higher than the first threshold value (step S209, Yes), the estimation unit 210 shifts to step S210. On the other hand, when the average power is less than the first threshold value (step S209, No), the estimation unit 210 shifts to step S212.

推定部２１０は、ストレス値が第３閾値以上であるか否かを判定する（ステップＳ２１０）。推定部２１０は、ストレス値が第３閾値以上である場合には（ステップＳ２１０，Ｙｅｓ）、ステップＳ２１５に移行する。推定部２１０は、ストレス値が第３閾値未満である場合には（ステップＳ２１０，Ｎｏ）、情報提示部１６０に第１メッセージを表示させ（ステップＳ２１１）、ステップＳ２１５に移行する。例えば、第１メッセージは、「マイクを口から少し離してください」である。 The estimation unit 210 determines whether or not the stress value is equal to or higher than the third threshold value (step S210). When the stress value is equal to or higher than the third threshold value (step S210, Yes), the estimation unit 210 shifts to step S215. When the stress value is less than the third threshold value (step S210, No), the estimation unit 210 causes the information presentation unit 160 to display the first message (step S211), and proceeds to step S215. For example, the first message is "Please move the microphone away from your mouth."

ステップＳ２１２の説明に移行する。推定部２１０は、平均パワーが第２閾値未満であるか否かを判定する（ステップＳ２１２）。推定部２１０は、平均パワーが第２閾値未満でない場合には（ステップＳ２１２，Ｎｏ）、ステップＳ２１５に移行する。一方、推定部２１０は、平均パワーが第２閾値未満である場合には（ステップＳ２１２，Ｙｅｓ）、ステップＳ２１３に移行する。 The process proceeds to the description of step S212. The estimation unit 210 determines whether or not the average power is less than the second threshold value (step S212). When the average power is not less than the second threshold value (step S212, No), the estimation unit 210 shifts to step S215. On the other hand, when the average power is less than the second threshold value (step S212, Yes), the estimation unit 210 shifts to step S213.

推定部２１０は、ストレス値が第３閾値以上であるか否かを判定する（ステップＳ２１３）。推定部２１０は、ストレス値が第３閾値以上でない場合には（ステップＳ２１３，Ｎｏ）、ステップＳ２１５に移行する。一方、推定部２１０は、ストレス値が第３閾値以上である場合には（ステップＳ２１３，Ｙｅｓ）、情報提示部１６０に第２メッセージを表示させ（ステップＳ２１４）、ステップＳ２１５に移行する。例えば、第２メッセージは、「マイクを口に少し近づけてください」である。 The estimation unit 210 determines whether or not the stress value is equal to or higher than the third threshold value (step S213). When the stress value is not equal to or higher than the third threshold value (steps S213 and No), the estimation unit 210 proceeds to step S215. On the other hand, when the stress value is equal to or higher than the third threshold value (step S213, Yes), the estimation unit 210 causes the information presentation unit 160 to display the second message (step S214), and proceeds to step S215. For example, the second message is "Please bring the microphone a little closer to your mouth."

音声処理装置２００の更新部２２０は、更新処理を実行する（ステップＳ２１５）。ステップＳ２１５に示す更新処理は、図５、図６、図７に示した処理に対応する。推定部２１０は、ストレス値および平均パワーを記憶部１４０に記憶し（ステップＳ２１６）、ステップＳ２０１に移行する。 The update unit 220 of the voice processing device 200 executes the update process (step S215). The update process in step S215 corresponds to the process shown in FIGS. 5, 6, and 7. The estimation unit 210 stores the stress value and the average power in the storage unit 140 (step S216), and proceeds to step S201.
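The message selection in steps S209 to S214 reduces to two nested comparisons. The following is a minimal sketch of that decision only. The function name `select_message`, the parameter names, and the use of `None` to mean "no message" are assumptions for illustration.

```python
from typing import Optional

FIRST_MESSAGE = "Please move the microphone a little away from your mouth."
SECOND_MESSAGE = "Please bring the microphone a little closer to your mouth."


def select_message(avg_power: float, stress: float, first_th: float,
                   second_th: float, third_th: float) -> Optional[str]:
    """Return the message to present, or None when no message is presented."""
    if avg_power >= first_th:          # S209: Yes, power is too high
        if stress >= third_th:         # S210: Yes, stress is large, so no message
            return None
        return FIRST_MESSAGE           # S211
    if avg_power < second_th:          # S212: Yes, power is too low
        if stress >= third_th:         # S213: Yes, stress is large
            return SECOND_MESSAGE      # S214
        return None                    # S213: No
    return None                        # S212: No, power is in the good range
```

In every branch the flow then continues to the update process of step S215, which this sketch leaves out.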

次に、本実施例２に係る音声処理装置２００の効果について説明する。音声処理装置２００は、更新処理を繰り返し実行することで、第１閾値、第２閾値、第３閾値が正しい値となるように、第１閾値、第２閾値、第３閾値を更新していく。これにより、ストレス値の大小、パワーが良好であるか否かを適切に判定でき、現在の発声状態に対する適切なメッセージを表示することができる。 Next, the effects of the voice processing device 200 according to the second embodiment will be described. By repeatedly executing the update process, the voice processing device 200 updates the first, second, and third threshold values so that they converge to correct values. As a result, the device can appropriately determine whether the stress value is large or small and whether the power is good, and can display an appropriate message for the current utterance state.

図９は、本実施例３に係るシステムの一例を示す図である。図９に示すように、このシステムは、音声処理装置３００ａ、３００ｂ、３００ｃと、サーバ４００とを有する。音声処理装置３００ａ〜３００ｃと、サーバ４００とはネットワーク５０を介して相互に接続される。ここでは一例として、音声処理装置３００ａ〜３００ｃを示すが、その他の音声処理装置を含んでいても良い。以下の説明では、音声処理装置３００ａ〜３００ｃをまとめて、音声処理装置３００と表記する。 FIG. 9 is a diagram showing an example of the system according to the third embodiment. As shown in FIG. 9, this system has voice processing devices 300a, 300b, 300c and a server 400. The voice processing devices 300a to 300c and the server 400 are connected to each other via the network 50. Here, as an example, the voice processing devices 300a to 300c are shown, but other voice processing devices may be included. In the following description, the voice processing devices 300a to 300c are collectively referred to as the voice processing device 300.

音声処理装置３００は、実施例２で説明した音声処理装置２００と同様にして、更新処理を繰り返し実行し、更新後の判定基準データ１４０ａを、サーバ４００に送信する。実施例１、２で説明したように、判定基準データは、第１閾値、第２閾値、第３閾値を有する。 The voice processing device 300 repeatedly executes the update process in the same manner as the voice processing device 200 described in the second embodiment, and transmits the updated determination reference data 140a to the server 400. As described in Examples 1 and 2, the determination criterion data has a first threshold value, a second threshold value, and a third threshold value.

サーバ４００は、音声処理装置３００から判定基準データ１４０ａを取得し、取得した判定基準データを基にして、第１閾値、第２閾値、第３閾値の初期値を算出する。サーバ４００は、算出した第１閾値、第２閾値、第３閾値の初期値のデータを、音声処理装置３００に送信する。以下の説明では、サーバ４００が算出した第１閾値、第２閾値、第３閾値の初期値のデータを、「初期値データ」と表記する。 The server 400 acquires the determination reference data 140a from the voice processing device 300, and calculates the initial values of the first threshold value, the second threshold value, and the third threshold value based on the acquired determination reference data. The server 400 transmits the calculated initial value data of the first threshold value, the second threshold value, and the third threshold value to the voice processing device 300. In the following description, the data of the initial values of the first threshold value, the second threshold value, and the third threshold value calculated by the server 400 is referred to as "initial value data".

音声処理装置３００は、サーバ４００から初期値データを受信すると、受信した初期値データにより、判定基準データを更新する。 When the voice processing device 300 receives the initial value data from the server 400, it updates the determination reference data with the received initial value data.

図１０は、本実施例３に係る音声処理装置の構成を示す機能ブロック図である。音声処理装置３００ａは、図１０に示すように、マイク１０に接続される。音声処理装置３００ａは、ＡＤ変換部１１０、ピッチ抽出部１２０ａ、パワー抽出部１２０ｂ、ストレス検出部１３０、記憶部１４０、情報提示部１６０、推定部２１０、更新部２２０を有する。また、音声処理装置３００ａは、アップロード部３１０およびダウンロード部３２０を有する。ここでは一例として、音声処理装置３００ａの構成について説明するが、音声処理装置３００ｂ、３００ｃの構成は、音声処理装置３００ａの構成と同様である。 FIG. 10 is a functional block diagram showing the configuration of the voice processing device according to the third embodiment. The voice processing device 300a is connected to the microphone 10 as shown in FIG. The voice processing device 300a includes an AD conversion unit 110, a pitch extraction unit 120a, a power extraction unit 120b, a stress detection unit 130, a storage unit 140, an information presentation unit 160, an estimation unit 210, and an update unit 220. Further, the voice processing device 300a has an upload unit 310 and a download unit 320. Here, the configuration of the voice processing device 300a will be described as an example, but the configurations of the voice processing devices 300b and 300c are the same as the configurations of the voice processing device 300a.

図１０において、マイク１０、ＡＤ変換部１１０、ピッチ抽出部１２０ａ、パワー抽出部１２０ｂ、ストレス検出部１３０、記憶部１４０、情報提示部１６０に関する説明は、実施例１で説明したものと同様であるため、説明を省略する。推定部２１０および更新部２２０に関する説明は、実施例２で説明したものと同様であるため、説明を省略する。 In FIG. 10, the description of the microphone 10, the AD conversion unit 110, the pitch extraction unit 120a, the power extraction unit 120b, the stress detection unit 130, the storage unit 140, and the information presentation unit 160 is the same as that described in the first embodiment. Therefore, the description thereof will be omitted. Since the description of the estimation unit 210 and the update unit 220 is the same as that described in the second embodiment, the description thereof will be omitted.

アップロード部３１０は、更新部２２０により更新された判定基準データ１４０ａを、サーバ４００に送信（アップロード）する処理部である。例えば、アップロード部３１０は、音声処理装置３００ａと他の音声処理装置との間の通話回数Ｎをカウントし、通話回数Ｎが、第４閾値を超えた場合に、判定基準データ１４０ａを、サーバ４００に送信する。 The upload unit 310 is a processing unit that transmits (uploads) the determination reference data 140a updated by the update unit 220 to the server 400. For example, the upload unit 310 counts the number of calls N between the voice processing device 300a and other voice processing devices, and transmits the determination reference data 140a to the server 400 when the number of calls N exceeds the fourth threshold value.

ダウンロード部３２０は、サーバ４００から初期値データを受信（ダウンロード）する処理部である。ダウンロード部３２０は、受信した初期値データにより、判定基準データ１４０ａを更新する。推定部２１０は、初期値データにより更新された判定基準データ１４０ａを初期値として、処理を行う。 The download unit 320 is a processing unit that receives (downloads) initial value data from the server 400. The download unit 320 updates the determination reference data 140a with the received initial value data. The estimation unit 210 performs processing using the determination reference data 140a updated by the initial value data as the initial value.

上記のアップロード部３１０およびダウンロード部３２０は、図示しない通信装置を用いて、ネットワーク５０を介して、サーバ４００とデータ通信を実行するものとする。 It is assumed that the upload unit 310 and the download unit 320 execute data communication with the server 400 via the network 50 by using a communication device (not shown).

図１１Ａは、本実施例３に係るサーバの構成を示す機能ブロック図である。図１１Ａに示すように、サーバ４００は、通信部４１０と、記憶部４２０と、制御部４３０とを有する。 FIG. 11A is a functional block diagram showing a server configuration according to the third embodiment. As shown in FIG. 11A, the server 400 has a communication unit 410, a storage unit 420, and a control unit 430.

通信部４１０は、ネットワーク５０を介して、音声処理装置３００とデータ通信を実行する処理部である。後述する制御部４３０は、通信部４１０を介して、音声処理装置３００とデータをやり取りする。通信部４１０は、通信装置に対応する。 The communication unit 410 is a processing unit that executes data communication with the voice processing device 300 via the network 50. The control unit 430, which will be described later, exchanges data with the voice processing device 300 via the communication unit 410. The communication unit 410 corresponds to the communication device.

記憶部４２０は、閾値テーブル４２０ａを有する。記憶部４２０は、ＲＡＭ、ＲＯＭ、フラッシュメモリなどの半導体メモリ素子や、ＨＤＤなどの記憶装置に対応する。 The storage unit 420 has a threshold table 420a. The storage unit 420 corresponds to semiconductor memory elements such as RAM, ROM, and flash memory, and storage devices such as HDD.

閾値テーブル４２０ａは、音声処理装置３００から送信される判定基準データ１４０ａを保持するテーブルである。図１１Ｂは、本実施例３に係る閾値テーブルのデータ構造の一例を示す図である。図１１Ｂに示すように、この閾値テーブル４２０ａは、識別情報と、判定基準データとを対応付ける。識別情報は、音声処理装置３００を一意に識別する情報である。判定基準データは、音声処理装置から受信する判定基準データである。実施例１、２で説明したように、判定基準データ１４０ａには、第１閾値、第２閾値、第３閾値が含まれる。 The threshold value table 420a is a table that holds the determination reference data 140a transmitted from the voice processing device 300. FIG. 11B is a diagram showing an example of the data structure of the threshold table according to the third embodiment. As shown in FIG. 11B, the threshold table 420a associates the identification information with the determination reference data. The identification information is information that uniquely identifies the voice processing device 300. The judgment standard data is the judgment standard data received from the voice processing device. As described in Examples 1 and 2, the determination reference data 140a includes a first threshold value, a second threshold value, and a third threshold value.

制御部４３０は、受信部４３０ａ、算出部４３０ｂ、配信部４３０ｃを有する。制御部４３０は、ＣＰＵやＭＰＵなどによって実現できる。また、制御部４３０は、ＡＳＩＣやＦＰＧＡなどのハードワイヤードロジックによっても実現できる。 The control unit 430 includes a reception unit 430a, a calculation unit 430b, and a distribution unit 430c. The control unit 430 can be realized by a CPU, an MPU, or the like. The control unit 430 can also be realized by hard-wired logic such as ASIC or FPGA.

受信部４３０ａは、音声処理装置３００から判定基準データ１４０ａを受信する処理部である。例えば、判定基準データ１４０ａには、この判定基準データ１４０ａの送信元となる音声処理装置３００を識別する識別情報が付与されているものとする。受信部４３０ａは、判定基準データ１４０ａを、識別情報と対応付けて、閾値テーブル４２０ａに登録する。 The receiving unit 430a is a processing unit that receives the determination reference data 140a from the voice processing device 300. For example, it is assumed that the determination reference data 140a carries identification information that identifies the voice processing device 300 that sent it. The receiving unit 430a registers the determination reference data 140a in the threshold table 420a in association with the identification information.

算出部４３０ｂは、閾値テーブル４２０ａを基にして、初期値データを算出する処理部である。算出部４３０ｂは、算出した初期値データを、配信部４３０ｃに出力する。以下において、算出部４３０ｂの処理の一例について説明する。 The calculation unit 430b is a processing unit that calculates initial value data based on the threshold table 420a. The calculation unit 430b outputs the calculated initial value data to the distribution unit 430c. An example of the processing of the calculation unit 430b will be described below.

算出部４３０ｂは、閾値テーブル４２０ａを参照し、閾値テーブル４２０ａに登録されたレコードの数が第５閾値以上である場合に、算出処理を開始する。例えば、第５閾値を「３」とする。図１１Ｂに示す閾値テーブル４２０ａでは、音声処理装置３００ａ〜３００ｃから受信した判定基準データ１４０ａを有する（レコードの数が３以上である）ので、算出部４３０ｂは、算出処理を実行する。 The calculation unit 430b refers to the threshold table 420a and starts the calculation process when the number of records registered in the threshold table 420a is equal to or greater than the fifth threshold value. For example, the fifth threshold value is set to "3". Since the threshold table 420a shown in FIG. 11B holds the determination reference data 140a received from the voice processing devices 300a to 300c (the number of records is 3 or more), the calculation unit 430b executes the calculation process.

算出部４３０ｂが実行する算出処理の一例について説明する。算出部４３０ｂは、各判定基準データ１４０ａの第１閾値の平均値を算出することで、第１閾値の初期値μ１を算出する。算出部４３０ｂは、各判定基準データ１４０ａの第２閾値の平均値を算出することで、第２閾値の初期値μ２を算出する。算出部４３０ｂは、各判定基準データ１４０ａの第３閾値の平均値を算出することで、第３閾値の初期値μ３を算出する。 An example of the calculation process executed by the calculation unit 430b will be described. The calculation unit 430b calculates the initial value μ1 of the first threshold value by calculating the average value of the first threshold values of each determination reference data 140a. The calculation unit 430b calculates the initial value μ2 of the second threshold value by calculating the average value of the second threshold value of each determination reference data 140a. The calculation unit 430b calculates the initial value μ3 of the third threshold value by calculating the average value of the third threshold value of each determination reference data 140a.

算出部４３０ｂは、上記の初期値μ１〜μ３を初期値データとして、配信部４３０ｃに出力する。 The calculation unit 430b outputs the above initial values μ1 to μ3 as initial value data to the distribution unit 430c.
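The averaging performed by the calculation unit 430b can be illustrated as follows. This is a sketch under assumptions: the threshold table is modelled as a dictionary from a device identifier to a `(first, second, third)` tuple, and the function name `compute_initial_values` and the return value `None` for "too few records" are hypothetical.

```python
from statistics import fmean
from typing import Dict, Optional, Tuple

ThresholdRecord = Tuple[float, float, float]  # (first, second, third) thresholds


def compute_initial_values(table: Dict[str, ThresholdRecord],
                           fifth_threshold: int = 3) -> Optional[ThresholdRecord]:
    """Return (mu1, mu2, mu3), or None while fewer than fifth_threshold records exist."""
    if len(table) < fifth_threshold:
        return None  # the calculation process does not start yet
    # Transpose the records into one sequence per threshold, then average each.
    firsts, seconds, thirds = zip(*table.values())
    return fmean(firsts), fmean(seconds), fmean(thirds)
```

For example, with records `(60, 20, 1)`, `(70, 30, 2)`, and `(80, 40, 3)` from three devices, the sketch yields the per-threshold means as the initial value data.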

配信部４３０ｃは、初期値データを算出部４３０ｂから取得した場合に、取得した初期値データを、音声処理装置３００に送信する処理部である。 The distribution unit 430c is a processing unit that transmits the acquired initial value data to the voice processing device 300 when the initial value data is acquired from the calculation unit 430b.

次に、本実施例３に係る音声処理装置３００の処理手順について説明する。図１２及び図１３は、本実施例３に係る音声処理装置の処理手順を示すフローチャートである。図１２に示すように、音声処理装置３００のＡＤ変換部１１０は、入力音声の受け付けを開始する（ステップＳ３０１）。ＡＤ変換部１１０は、ＡＤ変換を行う（ステップＳ３０２）。音声処理装置３００のピッチ抽出部１２０ａは、ピッチを抽出し、音声処理装置３００のパワー抽出部１２０ｂは、パワーを抽出する（ステップＳ３０３）。 Next, the processing procedure of the voice processing device 300 according to the third embodiment will be described. FIGS. 12 and 13 are flowcharts showing the processing procedure of the voice processing device according to the third embodiment. As shown in FIG. 12, the AD conversion unit 110 of the voice processing device 300 starts accepting the input voice (step S301). The AD conversion unit 110 performs AD conversion (step S302). The pitch extraction unit 120a of the voice processing device 300 extracts the pitch, and the power extraction unit 120b of the voice processing device 300 extracts the power (step S303).

ピッチ抽出部１２０ａは、有音区間を検出する（ステップＳ３０４）。音声処理装置３００のストレス検出部１３０は、ピッチ・パワーを蓄積する（ステップＳ３０５）。ストレス検出部１３０は、指定されたフレーム数に対応するピッチ・パワーが蓄積された場合には（ステップＳ３０６，Ｙｅｓ）、ステップＳ３０７に移行する。一方、ストレス検出部１３０は、指定されたフレーム数に対応するピッチ・パワーが蓄積されていない場合には（ステップＳ３０６，Ｎｏ）、ステップＳ３０１に移行する。 The pitch extraction unit 120a detects a sounded section (step S304). The stress detection unit 130 of the voice processing device 300 accumulates the pitch and power (step S305). When the pitch and power corresponding to the specified number of frames have been accumulated (step S306, Yes), the stress detection unit 130 proceeds to step S307. On the other hand, when the pitch and power corresponding to the specified number of frames have not been accumulated (step S306, No), the stress detection unit 130 returns to step S301.

ストレス検出部１３０は、ストレス値を算出する（ステップＳ３０７）。音声処理装置３００の推定部２１０は、有音区間の平均パワーを算出する（ステップＳ３０８）。推定部２１０は、平均パワーが第１閾値以上である場合には（ステップＳ３０９，Ｙｅｓ）、ステップＳ３１０に移行する。一方、推定部２１０は、平均パワーが第１閾値未満である場合には（ステップＳ３０９，Ｎｏ）、ステップＳ３１２に移行する。 The stress detection unit 130 calculates the stress value (step S307). The estimation unit 210 of the voice processing device 300 calculates the average power of the sounded section (step S308). When the average power is equal to or higher than the first threshold value (step S309, Yes), the estimation unit 210 shifts to step S310. On the other hand, when the average power is less than the first threshold value (step S309, No), the estimation unit 210 shifts to step S312.

推定部２１０は、ストレス値が第３閾値以上であるか否かを判定する（ステップＳ３１０）。推定部２１０は、ストレス値が第３閾値以上である場合には（ステップＳ３１０，Ｙｅｓ）、ステップＳ３１５に移行する。推定部２１０は、ストレス値が第３閾値未満である場合には（ステップＳ３１０，Ｎｏ）、情報提示部１６０に第１メッセージを表示させ（ステップＳ３１１）、ステップＳ３１５に移行する。例えば、第１メッセージは、「マイクを口から少し離してください」である。 The estimation unit 210 determines whether or not the stress value is equal to or higher than the third threshold value (step S310). When the stress value is equal to or higher than the third threshold value (step S310, Yes), the estimation unit 210 shifts to step S315. When the stress value is less than the third threshold value (step S310, No), the estimation unit 210 causes the information presentation unit 160 to display the first message (step S311), and proceeds to step S315. For example, the first message is "Please move the microphone away from your mouth."

ステップＳ３１２の説明に移行する。推定部２１０は、平均パワーが第２閾値未満であるか否かを判定する（ステップＳ３１２）。推定部２１０は、平均パワーが第２閾値未満でない場合には（ステップＳ３１２，Ｎｏ）、ステップＳ３１５に移行する。一方、推定部２１０は、平均パワーが第２閾値未満である場合には（ステップＳ３１２，Ｙｅｓ）、ステップＳ３１３に移行する。 The process proceeds to the description of step S312. The estimation unit 210 determines whether or not the average power is less than the second threshold value (step S312). When the average power is not less than the second threshold value (step S312, No), the estimation unit 210 shifts to step S315. On the other hand, when the average power is less than the second threshold value (step S312, Yes), the estimation unit 210 shifts to step S313.

推定部２１０は、ストレス値が第３閾値以上であるか否かを判定する（ステップＳ３１３）。推定部２１０は、ストレス値が第３閾値以上でない場合には（ステップＳ３１３，Ｎｏ）、ステップＳ３１５に移行する。一方、推定部２１０は、ストレス値が第３閾値以上である場合には（ステップＳ３１３，Ｙｅｓ）、情報提示部１６０に第２メッセージを表示させ（ステップＳ３１４）、ステップＳ３１５に移行する。例えば、第２メッセージは、「マイクを口に少し近づけてください」である。 The estimation unit 210 determines whether or not the stress value is equal to or higher than the third threshold value (step S313). When the stress value is not equal to or higher than the third threshold value (step S313, No), the estimation unit 210 proceeds to step S315. On the other hand, when the stress value is equal to or higher than the third threshold value (step S313, Yes), the estimation unit 210 causes the information presentation unit 160 to display the second message (step S314), and proceeds to step S315. For example, the second message is "Please bring the microphone a little closer to your mouth."
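
The branching in steps S309 to S314 can be summarized in a short sketch. The function name `select_message` and its argument names are assumptions for illustration, and the message strings are taken from the English text above. The comparisons follow the flowchart description: a message is shown only when loud speech is not accompanied by a high stress value, or when quiet speech is accompanied by one.

```python
FIRST_MESSAGE = "Please move the microphone away from your mouth."
SECOND_MESSAGE = "Please bring the microphone a little closer to your mouth."


def select_message(avg_power, stress, th1, th2, th3):
    """Return the message to present, or None when no message is shown."""
    if avg_power >= th1:              # step S309, Yes
        if stress < th3:              # step S310, No
            return FIRST_MESSAGE      # step S311
        return None                   # step S310, Yes: go straight to S315
    if avg_power < th2:               # step S312, Yes
        if stress >= th3:             # step S313, Yes
            return SECOND_MESSAGE     # step S314
    return None                       # all remaining paths lead to S315
```

In every branch the flow then continues to the update process of step S315, so the function only decides what, if anything, is displayed.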

音声処理装置３００の更新部２２０は、更新処理を実行する（ステップＳ３１５）。ステップＳ３１５に示す更新処理は、実施例２の図５、図６、図７に示した処理に対応する。推定部２１０は、ストレス値および平均パワーを記憶部１４０に記憶し（ステップＳ３１６）、図１３のステップＳ３１７に移行する。 The update unit 220 of the voice processing device 300 executes the update process (step S315). The update process shown in step S315 corresponds to the process shown in FIGS. 5, 6, and 7 of the second embodiment. The estimation unit 210 stores the stress value and the average power in the storage unit 140 (step S316), and proceeds to step S317 of FIG. 13.

図１３について説明する。音声処理装置３００は、通話が終了したか否かを判定する（ステップＳ３１７）。音声処理装置３００は、通話が終了していない場合には（ステップＳ３１７，Ｎｏ）、図１２のステップＳ３０１に移行する。 FIG. 13 will now be described. The voice processing device 300 determines whether or not the call has ended (step S317). If the call has not ended (step S317, No), the voice processing device 300 returns to step S301 of FIG. 12.

音声処理装置３００のアップロード部３１０は、通話が終了した場合には（ステップＳ３１７，Ｙｅｓ）、通話回数Ｎに１を加算する（ステップＳ３１８）。アップロード部３１０は、通話回数Ｎが第４閾値以上でない場合には（ステップＳ３１９，Ｎｏ）、処理を終了する。 When the call is completed (step S317, Yes), the upload unit 310 of the voice processing device 300 adds 1 to the number of calls N (step S318). When the number of calls N is not equal to or greater than the fourth threshold value (step S319, No), the upload unit 310 ends the process.

一方、アップロード部３１０は、通話回数Ｎが第４閾値以上である場合には（ステップＳ３１９，Ｙｅｓ）、判定基準データ１４０ａを、サーバ４００に送信する（ステップＳ３２０）。 On the other hand, when the number of calls N is equal to or greater than the fourth threshold value (step S319, Yes), the upload unit 310 transmits the determination reference data 140a to the server 400 (step S320).
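
Steps S318 to S320 amount to a simple counter check at the end of each call. The sketch below is illustrative only: the class name `UploadUnit`, the `send` callback standing in for transmission to the server 400, and the example threshold are assumptions. The text does not say whether the call count N is reset after an upload, so this sketch leaves it unchanged.

```python
class UploadUnit:
    """Counts finished calls and uploads the criteria once enough have occurred."""

    def __init__(self, fourth_threshold, send):
        self.fourth_threshold = fourth_threshold
        self.send = send          # hypothetical callback: transmits data to the server
        self.call_count = 0       # number of calls N

    def on_call_ended(self, criteria):
        """Handle the end of one call; return True if the criteria were uploaded."""
        self.call_count += 1                           # step S318
        if self.call_count >= self.fourth_threshold:   # step S319
            self.send(criteria)                        # step S320
            return True
        return False
```

For example, with `fourth_threshold=2` the first finished call uploads nothing and the second one passes the determination reference data to `send`.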

次に、実施例３に係るサーバ４００の処理手順について説明する。図１４は、本実施例３に係るサーバの処理手順を示すフローチャートである。図１４に示すように、サーバ４００の受信部４３０ａは、判定基準データ１４０ａを受信する（ステップＳ４０１）。受信部４３０ａは、判定基準データ１４０ａを閾値テーブル４２０ａに登録する（ステップＳ４０２）。 Next, the processing procedure of the server 400 according to the third embodiment will be described. FIG. 14 is a flowchart showing a processing procedure of the server according to the third embodiment. As shown in FIG. 14, the receiving unit 430a of the server 400 receives the determination reference data 140a (step S401). The receiving unit 430a registers the determination reference data 140a in the threshold table 420a (step S402).

サーバ４００の算出部４３０ｂは、閾値テーブル４２０ａに基づいて、判定基準データのレコード数が第５閾値以上であるか否かを判定する（ステップＳ４０３）。算出部４３０ｂは、判定基準データのレコード数が第５閾値以上でない場合には（ステップＳ４０３，Ｎｏ）、処理を終了する。 The calculation unit 430b of the server 400 determines, based on the threshold table 420a, whether or not the number of records of the determination reference data is equal to or greater than the fifth threshold value (step S403). If the number of records of the determination reference data is not equal to or greater than the fifth threshold value (step S403, No), the calculation unit 430b ends the process.

一方、算出部４３０ｂは、判定基準データのレコード数が第５閾値以上である場合には（ステップＳ４０３，Ｙｅｓ）、第１閾値〜第３閾値について、それぞれ平均値を算出し、初期値μ１〜μ３を特定する（ステップＳ４０４）。 On the other hand, when the number of records of the determination reference data is equal to or greater than the fifth threshold value (step S403, Yes), the calculation unit 430b calculates the average value of each of the first to third threshold values and thereby specifies the initial values μ1 to μ3 (step S404).

算出部４３０ｂは、初期値データを生成する（ステップＳ４０５）。サーバ４００の配信部４３０ｃは、音声処理装置３００に初期値データを送信する（ステップＳ４０６）。ここで、初期値データを送信する音声処理装置３００は、一回も使われていない新規に導入した音声処理装置であってもよい。 The calculation unit 430b generates initial value data (step S405). The distribution unit 430c of the server 400 transmits the initial value data to the voice processing device 300 (step S406). Here, the voice processing device 300 to which the initial value data is transmitted may be a newly introduced voice processing device that has never been used.

次に、本実施例３に係るシステムの効果について説明する。音声処理装置３００は、更新処理を繰り返し実行した後に、判定基準データ１４０ａをサーバ４００に通知し、サーバ４００は、各判定基準データ１４０ａを基にして、初期値データを生成し、音声処理装置３００に通知する。音声処理装置３００は、係る初期値データを利用することで、より正しい第１閾値、第２閾値、第３閾値を初期値の判定基準データ１４０ａとして用いることができる。 Next, the effect of the system according to the third embodiment will be described. The voice processing device 300 notifies the server 400 of the determination reference data 140a after repeatedly executing the update process, and the server 400 generates initial value data based on each set of determination reference data 140a and notifies the voice processing device 300 of the initial value data. By using this initial value data, the voice processing device 300 can use more accurate first, second, and third threshold values as the initial determination reference data 140a.

なお、本実施例３では、次の処理も可能である。例えば、サーバ４００は、音声処理装置３００ｂ、３００ｃ、その他の音声処理装置の判定基準データ１４０ａを基にして、初期値データを生成しておき、音声処理装置３００ａの起動時に、生成しておいた初期値データを音声処理装置３００ａに送信する。音声処理装置３００ａは、サーバ４００から受信した初期値データを起動時から用いることで、上記の更新処理を繰り返し実行しなくても、より正しいメッセージを利用者に通知することができる。また、一回も使われていない新規に導入した音声処理装置に初期値をダウンロードすることで、１回目の使用時から既に更新された判定基準データを用いることができるので、初回からより正しいメッセージを利用者に通知することができる。 In the third embodiment, the following processing is also possible. For example, the server 400 generates initial value data in advance based on the determination reference data 140a of the voice processing devices 300b and 300c and other voice processing devices, and transmits the generated initial value data to the voice processing device 300a when the voice processing device 300a starts up. By using the initial value data received from the server 400 from startup, the voice processing device 300a can notify the user of a more accurate message without repeatedly executing the above update process. In addition, by downloading the initial values to a newly introduced voice processing device that has never been used, determination reference data that has already been updated can be used from the first use, so a more accurate message can be presented to the user from the very first use.

図１５は、本実施例４に係るシステムの一例を示す図である。図１５に示すように、このシステムは、音声処理装置５００ａ〜５００ｌと、サーバ６００とを有する。音声処理装置５００ａ〜５００ｌと、サーバ６００とはネットワーク５０を介して相互に接続される。ここでは一例として、音声処理装置５００ａ〜５００ｌを示すが、その他の音声処理装置を含んでいても良い。以下の説明では、音声処理装置５００ａ〜５００ｌをまとめて、適宜、音声処理装置５００と表記する。 FIG. 15 is a diagram showing an example of the system according to the fourth embodiment. As shown in FIG. 15, this system has voice processing devices 500a to 500l and a server 600. The voice processing devices 500a to 500l and the server 600 are connected to each other via the network 50. Here, as an example, the voice processing devices 500a to 500l are shown, but other voice processing devices may be included. In the following description, the voice processing devices 500a to 500l are collectively referred to as the voice processing device 500 as appropriate.

なお、本実施例４では一例として、音声処理装置５００ａ〜５００ｃは、部屋１０Ａに配置される。このため、音声処理装置５００ａ〜５００ｃは、使用環境が類似する。音声処理装置５００ｄ〜５００ｆは、部屋１０Ｂに配置される。このため、音声処理装置５００ｄ〜５００ｆは、使用環境が類似する。音声処理装置５００ｇ〜５００ｉは、部屋１０Ｃに配置される。このため、音声処理装置５００ｇ〜５００ｉは、使用環境が類似する。音声処理装置５００ｊ〜５００ｌは、部屋１０Ｄに配置される。このため、音声処理装置５００ｊ〜５００ｌは、使用環境が類似する。 As an example in the fourth embodiment, the voice processing devices 500a to 500c are arranged in the room 10A. Therefore, the voice processing devices 500a to 500c have similar usage environments. The voice processing devices 500d to 500f are arranged in the room 10B. Therefore, the voice processing devices 500d to 500f have similar usage environments. The voice processing devices 500g to 500i are arranged in the room 10C. Therefore, the voice processing devices 500g to 500i have similar usage environments. The voice processing devices 500j to 500l are arranged in the room 10D. Therefore, the voice processing devices 500j to 500l have similar usage environments.

音声処理装置５００は、実施例２で説明した音声処理装置２００と同様にして、更新処理を繰り返し実行し、更新後の判定基準データ１４０ａを、サーバ６００に送信する。実施例１〜３で説明したように、判定基準データ１４０ａは、第１閾値、第２閾値、第３閾値を有する。 The voice processing device 500 repeatedly executes the update process in the same manner as the voice processing device 200 described in the second embodiment, and transmits the updated determination reference data 140a to the server 600. As described in Examples 1 to 3, the determination reference data 140a has a first threshold value, a second threshold value, and a third threshold value.

サーバ６００は、音声処理装置５００から判定基準データ１４０ａを取得し、取得した判定基準データの各第１閾値を基にして、他の音声処理装置５００と比較して、声の大きい利用者が使用する音声処理装置５００を特定する。サーバ６００は、特定した音声処理装置５００に第３メッセージ「少し声を小さくしてください」を送信する。係る第３メッセージを受信した音声処理装置５００は、第３メッセージを利用者に提示する。 The server 600 acquires the determination reference data 140a from the voice processing devices 500 and, based on the first threshold values of the acquired determination reference data, identifies a voice processing device 500 whose user speaks loudly compared with the users of the other voice processing devices 500. The server 600 transmits a third message, "Please make your voice a little quieter," to the identified voice processing device 500. The voice processing device 500 that has received the third message presents it to the user.

サーバ６００は、音声処理装置５００から判定基準データ１４０ａを取得し、取得した判定基準データの各第２閾値を基にして、他の音声処理装置５００と比較して、声の小さい利用者が使用する音声処理装置５００を特定する。サーバ６００は、特定した音声処理装置５００に第４メッセージ「少し声を大きくしてください」を送信する。係る第４メッセージを受信した音声処理装置５００は、第４メッセージを利用者に提示する。 The server 600 acquires the determination reference data 140a from the voice processing devices 500 and, based on the second threshold values of the acquired determination reference data, identifies a voice processing device 500 whose user speaks quietly compared with the users of the other voice processing devices 500. The server 600 transmits a fourth message, "Please make your voice a little louder," to the identified voice processing device 500. The voice processing device 500 that has received the fourth message presents it to the user.

図１６は、本実施例４に係る音声処理装置の構成を示す機能ブロック図である。図１６に示すように、音声処理装置５００ａは、マイク１０に接続される。音声処理装置５００ａは、ＡＤ変換部１１０、ピッチ抽出部１２０ａ、パワー抽出部１２０ｂ、ストレス検出部１３０、記憶部１４０、情報提示部１６０、推定部２１０、更新部２２０を有する。また、音声処理装置５００ａは、メッセージ受信部５１０を有する。ここでは一例として、音声処理装置５００ａの構成について説明するが、音声処理装置５００ｂ〜５００ｌの構成は、音声処理装置５００ａの構成と同様である。 FIG. 16 is a functional block diagram showing the configuration of the voice processing device according to the fourth embodiment. As shown in FIG. 16, the voice processing device 500a is connected to the microphone 10. The voice processing device 500a includes an AD conversion unit 110, a pitch extraction unit 120a, a power extraction unit 120b, a stress detection unit 130, a storage unit 140, an information presentation unit 160, an estimation unit 210, and an update unit 220. The voice processing device 500a also has a message receiving unit 510. The configuration of the voice processing device 500a is described here as an example; the voice processing devices 500b to 500l have the same configuration.

図１６において、マイク１０、ＡＤ変換部１１０、ピッチ抽出部１２０ａ、パワー抽出部１２０ｂ、ストレス検出部１３０、記憶部１４０、情報提示部１６０に関する説明は、実施例１で説明したものと同様であるため、説明を省略する。推定部２１０および更新部２２０に関する説明は、実施例２で説明したものと同様であるため、説明を省略する。 In FIG. 16, the description of the microphone 10, the AD conversion unit 110, the pitch extraction unit 120a, the power extraction unit 120b, the stress detection unit 130, the storage unit 140, and the information presentation unit 160 is the same as that described in the first embodiment. Therefore, the description thereof will be omitted. Since the description of the estimation unit 210 and the update unit 220 is the same as that described in the second embodiment, the description thereof will be omitted.

メッセージ受信部５１０は、通信装置を介して、サーバ６００からメッセージを受信した場合に、受信したメッセージを情報提示部１６０に提示させる。例えば、サーバ６００から受信するメッセージは、上記のように、第３メッセージまたは第４メッセージとなる。 When the message receiving unit 510 receives a message from the server 600 via the communication device, the message receiving unit 510 causes the information presenting unit 160 to present the received message. For example, the message received from the server 600 is the third message or the fourth message as described above.

なお、更新部２２０は、判定基準データ１４０ａの更新を行う度に、更新回数をカウントする。更新部２２０は、判定基準データ１４０ａの更新回数が所定回数以上となった場合に、通信装置を用いて、判定基準データ１４０ａをサーバ６００に送信する。 The update unit 220 counts the number of updates each time the determination reference data 140a is updated. When the number of updates of the determination reference data 140a exceeds a predetermined number of times, the update unit 220 transmits the determination reference data 140a to the server 600 by using the communication device.

図１７は、本実施例４に係るサーバの構成を示す機能ブロック図である。図１７に示すように、このサーバ６００は、通信部６１０と、記憶部６２０と、制御部６３０とを有する。 FIG. 17 is a functional block diagram showing a server configuration according to the fourth embodiment. As shown in FIG. 17, the server 600 has a communication unit 610, a storage unit 620, and a control unit 630.

通信部６１０は、ネットワーク５０を介して、音声処理装置５００とデータ通信を実行する処理部である。後述する制御部６３０は、通信部６１０を介して、音声処理装置５００とデータをやり取りする。通信部６１０は、通信装置に対応する。 The communication unit 610 is a processing unit that executes data communication with the voice processing device 500 via the network 50. The control unit 630, which will be described later, exchanges data with the voice processing device 500 via the communication unit 610. The communication unit 610 corresponds to the communication device.

記憶部６２０は、閾値テーブル６２０ａを有する。記憶部６２０は、ＲＡＭ、ＲＯＭ、フラッシュメモリなどの半導体メモリ素子や、ＨＤＤなどの記憶装置に対応する。 The storage unit 620 has a threshold table 620a. The storage unit 620 corresponds to semiconductor memory elements such as RAM, ROM, and flash memory, and storage devices such as HDD.

閾値テーブル６２０ａは、音声処理装置５００から送信される判定基準データ１４０ａを保持するテーブルである。閾値テーブル６２０ａのデータ構造は、図１１Ｂで説明した閾値テーブル４２０ａに対応するため説明を省略する。 The threshold table 620a is a table that holds the determination reference data 140a transmitted from the voice processing device 500. Since the data structure of the threshold table 620a corresponds to the threshold table 420a described with reference to FIG. 11B, the description thereof will be omitted.

分類テーブル６２０ｂは、音声処理装置５００が属するグループのデータを保持するテーブルである。図１８は、本実施例４に係る分類テーブルのデータ構造の一例を示す図である。図１８に示すように、この分類テーブル６２０ｂは、グループ識別情報と、識別情報とを対応づける。グループ識別情報は、グループを一意に識別する情報である。識別情報は、音声処理装置５００を一意に識別する情報である。 The classification table 620b is a table that holds data of the group to which the voice processing device 500 belongs. FIG. 18 is a diagram showing an example of the data structure of the classification table according to the fourth embodiment. As shown in FIG. 18, the classification table 620b associates the group identification information with the identification information. The group identification information is information that uniquely identifies a group. The identification information is information that uniquely identifies the voice processing device 500.

同一のグループに分類される音声処理装置５００は、使用環境が類似する。例えば、音声処理装置５００ａ〜５００ｃは、同一のグループに分類される。音声処理装置５００ｄ〜５００ｆは、同一のグループに分類される。音声処理装置５００ｇ〜５００ｉは、同一のグループに分類される。音声処理装置５００ｊ〜５００ｌは、同一のグループに分類される。 The voice processing devices 500 classified into the same group have similar usage environments. For example, the voice processing devices 500a to 500c are classified into the same group. The voice processing devices 500d to 500f are classified into the same group. The voice processing devices 500g to 500i are classified into the same group. The voice processing devices 500j to 500l are classified into the same group.

制御部６３０は、受信部６３０ａ、統計量算出部６３０ｂ、外れ値抽出部６３０ｃ、メッセージ送信部６３０ｄを有する。制御部６３０は、ＣＰＵやＭＰＵなどによって実現できる。また、制御部６３０は、ＡＳＩＣやＦＰＧＡなどのハードワイヤードロジックによっても実現できる。 The control unit 630 includes a reception unit 630a, a statistic calculation unit 630b, an outlier extraction unit 630c, and a message transmission unit 630d. The control unit 630 can be realized by a CPU, an MPU, or the like. The control unit 630 can also be realized by hard-wired logic such as ASIC or FPGA.

受信部６３０ａは、音声処理装置５００から判定基準データ１４０ａを受信する処理部である。例えば、判定基準データ１４０ａには、この判定基準データ１４０ａの送信元となる音声処理装置５００を識別する識別情報が付与されているものとする。受信部６３０ａは、判定基準データ１４０ａを、識別情報と対応付けて、閾値テーブル６２０ａに登録する。 The receiving unit 630a is a processing unit that receives the determination reference data 140a from the voice processing device 500. For example, it is assumed that the determination reference data 140a is provided with identification information that identifies the voice processing device 500 that is the source of the determination reference data 140a. The receiving unit 630a registers the determination reference data 140a in the threshold table 620a in association with the identification information.

統計量算出部６３０ｂは、閾値テーブル６２０ａを基にして、同一のグループ毎に、統計量を算出する処理部である。統計量算出部６３０ｂは、統計量として、第１閾値の平均値μ１と、第１閾値の標準偏差σ１を算出する。また、統計量算出部６３０ｂは、第２閾値の平均値μ２と、第２閾値の標準偏差σ２を算出する。統計量算出部６３０ｂは、グループ毎の統計量の情報を、外れ値抽出部６３０ｃに出力する。 The statistic calculation unit 630b is a processing unit that calculates statistics for the same group based on the threshold table 620a. The statistic calculation unit 630b calculates the average value μ1 of the first threshold value and the standard deviation σ1 of the first threshold value as statistics. Further, the statistic calculation unit 630b calculates the average value μ2 of the second threshold value and the standard deviation σ2 of the second threshold value. The statistic calculation unit 630b outputs the statistic information for each group to the outlier extraction unit 630c.

統計量算出部６３０ｂは、分類テーブル６２０ｂを参照することで、同一のグループに属する音声処理装置５００の識別情報を特定する。統計量算出部６３０ｂは、特定した識別情報と、閾値テーブル６２０ａとを比較することで、同一のグループに属する音声処理装置５００の判定基準データ１４０ａ（第１閾値、第２閾値）を取得する。統計量算出部６３０ｂは、同一のグループに属する音声処理装置５００の各第１閾値、第２閾値を用いて、上記の統計量を算出する。 The statistic calculation unit 630b identifies the identification information of the voice processing devices 500 belonging to the same group by referring to the classification table 620b. The statistic calculation unit 630b acquires the determination reference data 140a (first threshold value, second threshold value) of the voice processing devices 500 belonging to the same group by comparing the identified identification information with the threshold table 620a. The statistic calculation unit 630b calculates the above statistics using the first and second threshold values of the voice processing devices 500 belonging to the same group.

図１９は、統計量のデータ構造の一例を示す図である。図１９に示すように、この統計量は、グループ識別情報と、第１平均値と、第１標準偏差と、第２平均値と、第２標準偏差とを対応づける。グループ識別情報は、グループを一意に識別する情報である。第１平均値は、同一のグループの各第１閾値の平均値を示す。第１標準偏差は、同一のグループの各第１閾値の標準偏差を示す。第２平均値は、同一のグループの各第２閾値の平均値を示す。第２標準偏差は、同一のグループの各第２閾値の標準偏差を示す。 FIG. 19 is a diagram showing an example of a statistical data structure. As shown in FIG. 19, this statistic associates the group identification information with the first mean, the first standard deviation, the second mean, and the second standard deviation. The group identification information is information that uniquely identifies a group. The first mean value indicates the mean value of each first threshold value of the same group. The first standard deviation indicates the standard deviation of each first threshold in the same group. The second mean value indicates the mean value of each second threshold value of the same group. The second standard deviation indicates the standard deviation of each second threshold in the same group.
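
The per-group statistics described above can be sketched as follows. The dictionary layouts standing in for the classification table 620b and the threshold table 620a, the function name `group_statistics`, and the sample values are assumptions for illustration. The text does not state whether the population or the sample standard deviation is intended; this sketch uses the population form (`statistics.pstdev`).

```python
from statistics import mean, pstdev


def group_statistics(classification, thresholds):
    """Compute mean and standard deviation of both thresholds for each group.

    classification: {group_id: [device_id, ...]}       (stand-in for table 620b)
    thresholds:     {device_id: (first_th, second_th)}  (stand-in for table 620a)
    """
    stats = {}
    for group_id, device_ids in classification.items():
        # Keep only devices that have already uploaded their criteria.
        rows = [thresholds[d] for d in device_ids if d in thresholds]
        if not rows:
            continue
        first = [r[0] for r in rows]
        second = [r[1] for r in rows]
        stats[group_id] = {
            "mu1": mean(first), "sigma1": pstdev(first),
            "mu2": mean(second), "sigma2": pstdev(second),
        }
    return stats


# Hypothetical data for one group of three devices (values are invented).
classification = {"G1": ["500a", "500b", "500c"]}
thresholds = {"500a": (60, 30), "500b": (62, 32), "500c": (64, 34)}
stats = group_statistics(classification, thresholds)
```

Each entry of the result corresponds to one row of the statistics structure: group identification information, first mean, first standard deviation, second mean, and second standard deviation.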

外れ値抽出部６３０ｃは、統計量と、閾値テーブル６２０ａと、分類テーブル６２０ｂとを基にして、外れ値の第１閾値または第２閾値（判定基準データ１４０ａ）を送信した音声処理装置５００を、グループ毎に抽出する処理部である。 The outlier extraction unit 630c is a processing unit that, based on the statistics, the threshold table 620a, and the classification table 620b, extracts for each group the voice processing devices 500 that transmitted an outlying first threshold value or second threshold value (determination reference data 140a).

外れ値抽出部６３０ｃは、同一のグループに含まれる音声処理装置５００の第１閾値と、該当するグループの「第１平均値μ１＋３×第１標準偏差σ１」とを比較する。外れ値抽出部６３０ｃは、第１閾値が「第１平均値μ１＋３×第１標準偏差σ１」を超える音声処理装置５００を、「第１外れ装置」として抽出する。本実施の形態では平均値から標準偏差の３倍離れた値を閾値としたが、３倍に限定されず、２倍や１倍に設定してもよい。 The outlier extraction unit 630c compares the first threshold value of each voice processing device 500 included in the same group with "first mean value μ1 + 3 × first standard deviation σ1" of that group. The outlier extraction unit 630c extracts a voice processing device 500 whose first threshold value exceeds "first mean value μ1 + 3 × first standard deviation σ1" as a "first outlier device". In the present embodiment, a value three standard deviations away from the mean is used as the cutoff, but the multiplier is not limited to 3 and may be set to 2 or 1.

外れ値抽出部６３０ｃは、同一のグループに含まれる音声処理装置５００の第２閾値と、該当するグループの「第２平均値μ２−３×第２標準偏差σ２」とを比較する。外れ値抽出部６３０ｃは、第２閾値が「第２平均値μ２−３×第２標準偏差σ２」を下回る音声処理装置５００を、「第２外れ装置」として抽出する。外れ値抽出部６３０ｃは、第１外れ装置の識別情報および第２外れ装置の識別情報を、メッセージ送信部６３０ｄに出力する。本実施の形態では平均値から標準偏差の３倍離れた値を閾値としたが、３倍に限定されず、２倍や１倍に設定してもよい。 The outlier extraction unit 630c compares the second threshold value of each voice processing device 500 included in the same group with "second mean value μ2 − 3 × second standard deviation σ2" of that group. The outlier extraction unit 630c extracts a voice processing device 500 whose second threshold value falls below "second mean value μ2 − 3 × second standard deviation σ2" as a "second outlier device". The outlier extraction unit 630c outputs the identification information of the first outlier device and the identification information of the second outlier device to the message transmission unit 630d. In the present embodiment, a value three standard deviations away from the mean is used as the cutoff, but the multiplier is not limited to 3 and may be set to 2 or 1.

外れ値抽出部６３０ｃは、上記処理を、グループ毎に繰り返し実行することで、グループ毎の第１外れ装置の識別情報および第２外れ装置の識別情報を、メッセージ送信部６３０ｄに出力する。 The outlier extraction unit 630c repeats the above processing for each group, and thereby outputs the identification information of the first outlier devices and the second outlier devices of each group to the message transmission unit 630d.
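
The two comparisons performed for one group can be sketched as below. The function name `extract_outliers`, the argument layout, and the sample statistics are assumptions for illustration. The multiplier `k` defaults to 3, as in the embodiment, and can be lowered to 2 or 1 as the text allows.

```python
def extract_outliers(device_ids, thresholds, mu1, sigma1, mu2, sigma2, k=3.0):
    """Return (first_outlier_devices, second_outlier_devices) for one group.

    thresholds: {device_id: (first_th, second_th)}
    A device is a first outlier device when its first threshold exceeds
    mu1 + k * sigma1, and a second outlier device when its second threshold
    falls below mu2 - k * sigma2.
    """
    upper = mu1 + k * sigma1
    lower = mu2 - k * sigma2
    first = [d for d in device_ids if thresholds[d][0] > upper]
    second = [d for d in device_ids if thresholds[d][1] < lower]
    return first, second


# Hypothetical group statistics and device thresholds (values are invented).
thresholds = {"500a": (61, 31), "500b": (67, 29), "500c": (59, 23)}
first, second = extract_outliers(
    ["500a", "500b", "500c"], thresholds,
    mu1=60, sigma1=2, mu2=30, sigma2=2,
)
```

One practical point when choosing `k`: if the mean and the population standard deviation are computed from the same small group, no member can lie far from its own mean. For a group of three devices the largest possible deviation is about 1.41 standard deviations, so a multiplier of 3 would never flag a device in such a group.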

第１外れ装置の識別情報に対応する音声処理装置５００を用いて通話している利用者は、使用環境が類似する他の利用者と比較して、「声が大きい」と言える。第２外れ装置の識別情報に対応する音声処理装置５００を用いて通話している利用者は、使用環境が類似する他の利用者と比較して、「声が小さい」と言える。 A user who is making a call with the voice processing device 500 corresponding to the identification information of a first outlier device can be said to have a "loud voice" compared with other users in a similar usage environment. A user who is making a call with the voice processing device 500 corresponding to the identification information of a second outlier device can be said to have a "quiet voice" compared with other users in a similar usage environment.

メッセージ送信部６３０ｄは、外れ値抽出部６３０ｃから取得する情報を基にして、メッセージを音声処理装置５００に送信する処理部である。例えば、メッセージ送信部６３０ｄは、第１外れ装置の識別情報に対応する音声処理装置５００に、第３メッセージ「少し声を小さくしてください」を送信する。例えば、メッセージ送信部６３０ｄは、第２外れ装置の識別情報に対応する音声処理装置５００に、第４メッセージ「少し声を大きくしてください」を送信する。 The message transmission unit 630d is a processing unit that transmits a message to the voice processing device 500 based on the information acquired from the outlier extraction unit 630c. For example, the message transmission unit 630d transmits the third message, "Please make your voice a little quieter," to the voice processing device 500 corresponding to the identification information of a first outlier device. Likewise, the message transmission unit 630d transmits the fourth message, "Please make your voice a little louder," to the voice processing device 500 corresponding to the identification information of a second outlier device.

次に、本実施例４に係る音声処理装置５００の処理手順について説明する。図２０は、本実施例４に係る音声処理装置の処理手順を示すフローチャートである。図２０に示すように、音声処理装置５００のＡＤ変換部１１０は、入力音声の受け付けを開始する（ステップＳ５０１）。ＡＤ変換部１１０は、ＡＤ変換を行う（ステップＳ５０２）。音声処理装置５００のピッチ抽出部１２０ａは、ピッチを抽出し、音声処理装置５００のパワー抽出部１２０ｂは、パワーを抽出する（ステップＳ５０３）。 Next, the processing procedure of the voice processing device 500 according to the fourth embodiment will be described. FIG. 20 is a flowchart showing a processing procedure of the voice processing device according to the fourth embodiment. As shown in FIG. 20, the AD conversion unit 110 of the voice processing device 500 starts accepting the input voice (step S501). The AD conversion unit 110 performs AD conversion (step S502). The pitch extraction unit 120a of the voice processing device 500 extracts the pitch, and the power extraction unit 120b of the voice processing device 500 extracts the power (step S503).

ピッチ抽出部１２０ａは、有音区間を検出する（ステップＳ５０４）。音声処理装置５００のストレス検出部１３０は、ピッチ・パワーを蓄積する（ステップＳ５０５）。ストレス検出部１３０は、指定されたフレーム数に対応するピッチ・パワーが蓄積された場合には（ステップＳ５０６，Ｙｅｓ）、ステップＳ５０７に移行する。一方、ストレス検出部１３０は、指定されたフレーム数に対応するピッチ・パワーが蓄積されていない場合には（ステップＳ５０６，Ｎｏ）、ステップＳ５０１に移行する。 The pitch extraction unit 120a detects a sounded section (step S504). The stress detection unit 130 of the voice processing device 500 accumulates the pitch and power (step S505). When the pitch and power corresponding to the specified number of frames have been accumulated (step S506, Yes), the stress detection unit 130 proceeds to step S507. On the other hand, when the pitch and power corresponding to the specified number of frames have not been accumulated (step S506, No), the stress detection unit 130 returns to step S501.

ストレス検出部１３０は、ストレス値を算出する（ステップＳ５０７）。音声処理装置５００の推定部２１０は、有音区間の平均パワーを算出する（ステップＳ５０８）。推定部２１０は、平均パワーが第１閾値以上である場合には（ステップＳ５０９，Ｙｅｓ）、ステップＳ５１０に移行する。一方、推定部２１０は、平均パワーが第１閾値未満である場合には（ステップＳ５０９，Ｎｏ）、ステップＳ５１２に移行する。 The stress detection unit 130 calculates the stress value (step S507). The estimation unit 210 of the voice processing device 500 calculates the average power of the sounded section (step S508). When the average power is equal to or higher than the first threshold value (step S509, Yes), the estimation unit 210 shifts to step S510. On the other hand, when the average power is less than the first threshold value (step S509, No), the estimation unit 210 shifts to step S512.

推定部２１０は、ストレス値が第３閾値以上であるか否かを判定する（ステップＳ５１０）。推定部２１０は、ストレス値が第３閾値以上である場合には（ステップＳ５１０，Ｙｅｓ）、ステップＳ５１５に移行する。推定部２１０は、ストレス値が第３閾値未満である場合には（ステップＳ５１０，Ｎｏ）、情報提示部１６０に第１メッセージを表示させ（ステップＳ５１１）、ステップＳ５１５に移行する。例えば、第１メッセージは、「マイクを口から少し離してください」である。 The estimation unit 210 determines whether or not the stress value is equal to or higher than the third threshold value (step S510). When the stress value is equal to or higher than the third threshold value (step S510, Yes), the estimation unit 210 shifts to step S515. When the stress value is less than the third threshold value (step S510, No), the estimation unit 210 causes the information presentation unit 160 to display the first message (step S511), and proceeds to step S515. For example, the first message is "Please move the microphone away from your mouth."

ステップＳ５１２の説明に移行する。推定部２１０は、平均パワーが第２閾値未満であるか否かを判定する（ステップＳ５１２）。推定部２１０は、平均パワーが第２閾値未満でない場合には（ステップＳ５１２，Ｎｏ）、ステップＳ５１５に移行する。一方、推定部２１０は、平均パワーが第２閾値未満である場合には（ステップＳ５１２，Ｙｅｓ）、ステップＳ５１３に移行する。 The process proceeds to the description of step S512. The estimation unit 210 determines whether or not the average power is less than the second threshold value (step S512). When the average power is not less than the second threshold value (step S512, No), the estimation unit 210 shifts to step S515. On the other hand, when the average power is less than the second threshold value (step S512, Yes), the estimation unit 210 shifts to step S513.

推定部２１０は、ストレス値が第３閾値以上であるか否かを判定する（ステップＳ５１３）。推定部２１０は、ストレス値が第３閾値以上でない場合には（ステップＳ５１３，Ｎｏ）、ステップＳ５１５に移行する。一方、推定部２１０は、ストレス値が第３閾値以上である場合には（ステップＳ５１３，Ｙｅｓ）、情報提示部１６０に第２メッセージを表示させ（ステップＳ５１４）、ステップＳ５１５に移行する。例えば、第２メッセージは、「マイクを口に少し近づけてください」である。 The estimation unit 210 determines whether or not the stress value is equal to or higher than the third threshold value (step S513). When the stress value is not equal to or higher than the third threshold value (step S513, No), the estimation unit 210 proceeds to step S515. On the other hand, when the stress value is equal to or higher than the third threshold value (step S513, Yes), the estimation unit 210 causes the information presentation unit 160 to display the second message (step S514), and proceeds to step S515. For example, the second message is "Please bring the microphone a little closer to your mouth."

音声処理装置５００の更新部２２０は、更新処理を実行する（ステップＳ５１５）。ステップＳ５１５に示す更新処理は、実施例２の図５、図６、図７に示した処理に対応する。推定部２１０は、ストレス値および平均パワーを記憶部１４０に記憶する（ステップＳ５１６）。 The update unit 220 of the voice processing device 500 executes the update process (step S515). The update process shown in step S515 corresponds to the process shown in FIGS. 5, 6, and 7 of the second embodiment. The estimation unit 210 stores the stress value and the average power in the storage unit 140 (step S516).

When the number of updates is equal to or greater than a predetermined number, the update unit 220 transmits the determination reference data 140a to the server 600 (step S517). The update unit 220 adds 1 to the number of updates (step S518) and proceeds to step S501.
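Steps S517 and S518 amount to a counter-gated upload, sketched below. The names `after_update`, `required_updates`, and `send` are hypothetical; `send` stands in for whatever transport delivers the determination reference data 140a to the server 600.

```python
from typing import Callable, Dict


def after_update(
    update_count: int,
    required_updates: int,
    criteria: Dict[str, float],
    send: Callable[[Dict[str, float]], None],
) -> int:
    """Run steps S517 and S518 once and return the new update count."""
    # Step S517: transmit once the count has reached the predetermined number.
    if update_count >= required_updates:
        send(criteria)
    # Step S518: increment the number of updates.
    return update_count + 1
```

The text does not say whether the counter is reset after a transmission, so this sketch leaves it running.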

Next, the processing procedure of the server 600 according to the fourth embodiment is described. FIG. 21 is a flowchart showing the processing procedure of the server according to the fourth embodiment. As shown in FIG. 21, the receiving unit 630a of the server 600 receives the determination reference data 140a (step S601). The receiving unit 630a registers the determination reference data 140a in the threshold table 620a (step S602).

Based on the threshold data 620a, the statistic calculation unit 630b of the server 600 determines whether the number of records of determination reference data is equal to or greater than the fifth threshold (step S603). If the number of records is less than the fifth threshold (step S603, No), the statistic calculation unit 630b ends the process.

If the number of records of determination reference data is equal to or greater than the fifth threshold (step S603, Yes), the statistic calculation unit 630b calculates the mean value μ of the first threshold and of the second threshold (step S604). The statistic calculation unit 630b then calculates the standard deviation σ of the first threshold and of the second threshold (step S605).

If no first threshold in the determination reference data 140a exceeds the first mean value μ1 + 3 × the first standard deviation σ1 (step S606, No), the outlier extraction unit 630c of the server 600 proceeds to step S608.

If any first threshold in the determination reference data 140a exceeds the first mean value μ1 + 3 × the first standard deviation σ1 (step S606, Yes), the outlier extraction unit 630c proceeds to step S607. The message transmission unit 630d of the server 600 transmits the third message, "Please lower your voice a little," to the corresponding voice processing device 500 (step S607).

If no second threshold in the determination reference data 140a falls below the second mean value μ2 − 3 × the second standard deviation σ2 (step S608, No), the outlier extraction unit 630c ends the process.

If any second threshold in the determination reference data 140a falls below the second mean value μ2 − 3 × the second standard deviation σ2 (step S608, Yes), the outlier extraction unit 630c proceeds to step S609. The message transmission unit 630d of the server 600 transmits the fourth message, "Please raise your voice a little," to the corresponding voice processing device 500 (step S609).
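The server-side checks of steps S603 to S609 can be sketched as below. This is an illustration under stated assumptions: the function name `find_outliers`, the `records` mapping from device ID to a (first threshold, second threshold) pair, and the `min_records` parameter (standing in for the fifth threshold) are all hypothetical, and the population standard deviation is used because the text does not say whether the population or sample form is intended.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def find_outliers(
    records: Dict[str, Tuple[float, float]],
    min_records: int,
) -> Tuple[List[str], List[str]]:
    """Return (devices for the third message, devices for the fourth message)."""
    # Step S603: do nothing until enough records have been collected.
    if len(records) < min_records:
        return [], []

    firsts = [first for first, _ in records.values()]
    seconds = [second for _, second in records.values()]

    # Steps S604 and S605: mean and standard deviation of each threshold.
    mu1, sigma1 = mean(firsts), pstdev(firsts)
    mu2, sigma2 = mean(seconds), pstdev(seconds)

    # Steps S606/S607: first threshold above mu1 + 3 * sigma1.
    too_loud = [dev for dev, (first, _) in records.items()
                if first > mu1 + 3 * sigma1]
    # Steps S608/S609: second threshold below mu2 - 3 * sigma2.
    too_quiet = [dev for dev, (_, second) in records.items()
                 if second < mu2 - 3 * sigma2]
    return too_loud, too_quiet
```

With the population form, a single deviating device among n otherwise identical ones lies √(n−1) standard deviations from the mean, so the 3σ rule can only flag it when the group has at least 11 records. That is one practical reason for a minimum record count such as the fifth threshold.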

Next, the effects of the system according to the fourth embodiment are described. After repeatedly executing the update process, the voice processing device 500 sends the determination reference data 140a to the server 600. The server 600 calculates statistics from the collected determination reference data 140a and sends a message to each voice processing device 500 that transmitted an outlying first threshold or an outlying second threshold. Because the server 600 sends the third message to a voice processing device 500 that transmitted an outlying first threshold, it can alert the users who speak relatively loudly among voice processing devices with similar usage environments. Because the server 600 sends the fourth message to a voice processing device 500 that transmitted an outlying second threshold, it can alert the users who speak relatively quietly among voice processing devices with similar usage environments.

In the system according to the fourth embodiment, the voice processing devices 500 are divided into groups with similar usage environments, and the third and fourth messages are transmitted on a per-group basis; however, the system is not limited to this. The voice processing devices 500a to 500l may instead be treated as a single group and subjected to the same processing.

FIG. 22 is a functional block diagram showing the configuration of the voice processing device according to the fifth embodiment. As shown in FIG. 22, the voice processing device 700 is connected to the microphone 10. The voice processing device 700 includes an AD conversion unit 110, a pitch extraction unit 120a, a power extraction unit 120b, a stress detection unit 130, a storage unit 140, an estimation unit 710, and a gain adjustment unit 720. The microphone 10, the AD conversion unit 110, the pitch extraction unit 120a, the power extraction unit 120b, the stress detection unit 130, and the storage unit 140 are the same as those described in the first embodiment, so their description is omitted.

The estimation unit 710 is a processing unit that estimates the utterance state of the input voice based on the stress value of the input voice, the average power, and the determination reference data 140a. The estimation unit 710 outputs the estimation result to the gain adjustment unit 720.

For example, the estimation unit 710 outputs the first estimation result to the gain adjustment unit 720 when the stress is "small" and the average power is equal to or greater than the first threshold. The estimation unit 710 outputs the second estimation result to the gain adjustment unit 720 when the stress is "large" and the average power is less than the second threshold. The estimation unit 710 classifies the stress as large or small and compares the average power with the first and second thresholds in the same way as the estimation unit 150 of the first embodiment.

The gain adjustment unit 720 is a processing unit that adjusts the gain of the microphone 10 based on the estimation result of the estimation unit 710. On receiving the first estimation result from the estimation unit 710, the gain adjustment unit 720 lowers the gain of the microphone 10. For example, the gain adjustment unit 720 lowers the recording level of the microphone 10 by 3 dB. The first estimation result indicates that the current stress is "small" and that the average power is equal to or greater than the first threshold; if the stress later shifts to "large", the power of the input voice may exceed the appropriate power.

On receiving the second estimation result from the estimation unit 710, the gain adjustment unit 720 raises the gain of the microphone 10. For example, the gain adjustment unit 720 raises the recording level of the microphone 10 by 3 dB. The second estimation result indicates that the current stress is "large" and that the average power is less than the second threshold; if the stress later shifts to "small", the power of the input voice may fall below the appropriate power.
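The gain adjustment can be sketched as follows. The names `adjust_gain_db` and `db_to_amplitude_ratio` and the string labels for the estimation results are hypothetical, and the sketch assumes the recording level is an amplitude gain expressed in decibels (20·log10 of the amplitude ratio), which the text does not state.

```python
GAIN_STEP_DB = 3.0  # step size given as an example in the description


def adjust_gain_db(current_gain_db: float, estimation_result: str) -> float:
    """Return the new microphone gain in dB for a given estimation result."""
    if estimation_result == "first":
        # Low stress, average power at or above the first threshold: lower the gain.
        return current_gain_db - GAIN_STEP_DB
    if estimation_result == "second":
        # High stress, average power below the second threshold: raise the gain.
        return current_gain_db + GAIN_STEP_DB
    # Any other result leaves the gain unchanged.
    return current_gain_db


def db_to_amplitude_ratio(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude multiplier (assumed 20*log10 scale)."""
    return 10.0 ** (gain_db / 20.0)
```

Under that assumption a 3 dB step scales the sample amplitude by roughly 1.41, or by about 0.71 when lowering, which corresponds to doubling or halving the signal power.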

Next, the processing procedure of the voice processing device 700 according to the fifth embodiment is described. FIG. 23 is a flowchart showing the processing procedure of the voice processing device according to the fifth embodiment. As shown in FIG. 23, the AD conversion unit 110 of the voice processing device 700 starts accepting the input voice (step S701). The AD conversion unit 110 performs AD conversion (step S702). The pitch extraction unit 120a of the voice processing device 700 extracts the pitch, and the power extraction unit 120b of the voice processing device 700 extracts the power (step S703).

The pitch extraction unit 120a detects a voiced section (step S704). The stress detection unit 130 of the voice processing device 700 accumulates the pitch and power (step S705). If pitch and power values for the specified number of frames have been accumulated (step S706, Yes), the stress detection unit 130 proceeds to step S707. If they have not been accumulated (step S706, No), the stress detection unit 130 returns to step S701.

The stress detection unit 130 calculates the stress value (step S707). The estimation unit 710 of the voice processing device 700 calculates the average power of the voiced section (step S708). If the average power is equal to or greater than the first threshold (step S709, Yes), the estimation unit 710 proceeds to step S710. If the average power is less than the first threshold (step S709, No), the estimation unit 710 proceeds to step S712.

The estimation unit 710 determines whether the stress value is equal to or greater than the third threshold (step S710). If the stress value is equal to or greater than the third threshold (step S710, Yes), the estimation unit 710 returns to step S701. If the stress value is less than the third threshold (step S710, No), the estimation unit 710 proceeds to step S711. The gain adjustment unit 720 of the voice processing device 700 lowers the sound level of the microphone 10 by 3 dB (step S711) and returns to step S701.

Next, step S712 is described. The estimation unit 710 determines whether the average power is less than the second threshold (step S712). If the average power is not less than the second threshold (step S712, No), the estimation unit 710 returns to step S701. If the average power is less than the second threshold (step S712, Yes), the estimation unit 710 proceeds to step S713.

The estimation unit 710 determines whether the stress value is equal to or greater than the third threshold (step S713). If the stress value is less than the third threshold (step S713, No), the estimation unit 710 returns to step S701. If the stress value is equal to or greater than the third threshold (step S713, Yes), the estimation unit 710 proceeds to step S714. The gain adjustment unit 720 raises the sound level of the microphone 10 by 3 dB (step S714) and returns to step S701.

Next, the effects of the voice processing device 700 according to the fifth embodiment are described. The voice processing device 700 estimates the utterance state and adjusts the gain of the microphone 10 based on the transition result. This keeps the volume of the user's input voice at an appropriate level while taking the user's psychological state into account, so that each user's call remains comfortable.

Next, an example of the hardware configuration of a computer that realizes the same functions as the voice processing devices 100, 200, 300, 500, and 700 of the above embodiments is described. FIG. 24 is a diagram showing an example of the hardware configuration of a computer that realizes the same functions as the voice processing device.

As shown in FIG. 24, the computer 800 includes a CPU 801 that executes various arithmetic processes, an input device 802 that accepts data input from a user, and a display 803. The computer 800 also includes a reading device 804 that reads programs and the like from a storage medium, and an interface device 805 that exchanges data with other computers over a wired or wireless network. For example, the interface device 805 is connected to a communication device or the like. The computer 800 is connected to a microphone 806. The computer 800 further includes a RAM 807 that temporarily stores various information, and a hard disk device 808. The devices 801 to 808 are connected to a bus 809.

The hard disk device 808 holds an extraction program 808a, a stress detection program 808b, an estimation program 808c, and an update program 808d. The hard disk device 808 also holds an upload/download program 808e, a reception program 808f, a presentation program 808g, and a gain adjustment program 808h. The CPU 801 reads the extraction program 808a, the stress detection program 808b, the estimation program 808c, and the update program 808d and loads them into the RAM 807. The CPU 801 likewise reads the upload/download program 808e, the reception program 808f, the presentation program 808g, and the gain adjustment program 808h and loads them into the RAM 807.

The extraction program 808a functions as an extraction process 807a. The stress detection program 808b functions as a stress detection process 807b. The estimation program 808c functions as an estimation process 807c. The update program 808d functions as an update process 807d. The upload/download program 808e functions as an upload/download process 807e. The reception program 808f functions as a reception process 807f. The presentation program 808g functions as a presentation process 807g. The gain adjustment program 808h functions as a gain adjustment process 807h.

The extraction process 807a corresponds to the processing of the pitch extraction unit 120a and the power extraction unit 120b. The stress detection process 807b corresponds to the processing of the stress detection unit 130. The estimation process 807c corresponds to the processing of the estimation units 150, 210, and 710. The update process 807d corresponds to the processing of the update unit 220. The upload/download process 807e corresponds to the processing of the upload unit 310 and the download unit 320. The reception process 807f corresponds to the processing of the message receiving unit 510. The presentation process 807g corresponds to the processing of the information presentation unit 160. The gain adjustment process 807h corresponds to the processing of the gain adjustment unit 720.

The programs 808a to 808h do not necessarily have to be stored in the hard disk device 808 from the beginning. For example, each program may be stored on a "portable physical medium" inserted into the computer 800, such as a flexible disk (FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card. The computer 800 may then read the programs 808a to 808h from the medium and execute them.

The following appendices are further disclosed with respect to the embodiments, including each of the examples above.

(Appendix 1) A voice processing program that causes a computer to execute a process of:
extracting a pitch frequency and a frequency power from an input voice;
outputting a determination result indicating whether a condition that a value based on the pitch frequency and the frequency power is equal to or greater than a predetermined threshold is satisfied; and
estimating an utterance state of the input voice based on a relationship between the determination result and an average power of the frequency power.

(Appendix 2) The voice processing program according to Appendix 1, further causing the computer to execute a process of: notifying an estimation result of the utterance state; determining whether the utterance state has improved; treating the determination criterion or the threshold used in the process of estimating the utterance state as correct-answer data when the utterance state has improved, and as incorrect-answer data when it has not improved or has not changed; and updating the determination criterion or the threshold, using the correct-answer data and the incorrect-answer data as learning data, so that the incorrect-answer data decreases.

(Appendix 3) The voice processing program according to Appendix 2, further causing the computer to execute a process of: receiving an initial value of the determination criterion and an initial value of the threshold from a server that collects the learning data from a plurality of terminals connected to a network and generates those initial values based on the collected learning data; setting the received initial value of the determination criterion as the initial value of the determination criterion used in the process of estimating the utterance state; and setting the received initial value of the threshold as the initial value of the threshold used in the determining process.

(Appendix 4) The voice processing program according to Appendix 3, wherein the server collects, from the plurality of terminals, a plurality of thresholds updated based on the learning data, identifies a user with a loud voice by comparing the collected thresholds with one another, and notifies the terminal used by the identified user of an alarm, and
the program further causes the computer to execute a process of outputting an alarm when the alarm is received.

(Appendix 5) The voice processing program according to Appendix 4, wherein the server classifies the plurality of terminals into groups according to usage environment and identifies a user with a loud voice by comparing the thresholds acquired from the terminals classified into a group with similar usage environments.

(Appendix 6) The voice processing program according to any one of Appendices 1 to 5, further causing the computer to execute a process of adding a correction gain to the input voice when the value based on the pitch frequency and the frequency power is less than a third threshold and the average power of the frequency power is equal to or greater than a first threshold, or when the value based on the pitch frequency and the frequency power is equal to or greater than the third threshold and the average power of the frequency power is less than a second threshold.

(Appendix 7) A voice processing method executed by a computer, the method comprising:
extracting a pitch frequency and a frequency power from an input voice;
determining whether a condition that a value based on the pitch frequency and the frequency power is equal to or greater than a predetermined threshold is satisfied, to obtain a determination result; and
estimating an utterance state of the input voice based on a relationship between the determination result and an average power of the frequency power.

(Appendix 8) The voice processing method according to Appendix 7, further comprising: notifying an estimation result of the utterance state; determining whether the utterance state has improved; treating the determination criterion or the threshold used in the process of estimating the utterance state as correct-answer data when the utterance state has improved, and as incorrect-answer data when it has not improved or has not changed; and updating the determination criterion or the threshold, using the correct-answer data and the incorrect-answer data as learning data, so that the incorrect-answer data decreases.

(Appendix 9) The voice processing method according to Appendix 8, further comprising: receiving an initial value of the determination criterion and an initial value of the threshold from a server that collects the learning data from a plurality of terminals connected to a network and generates those initial values based on the collected learning data; setting the received initial value of the determination criterion as the initial value of the determination criterion used in the process of estimating the utterance state; and setting the received initial value of the threshold as the initial value of the threshold used in the determining process.

(Appendix 10) The voice processing method according to Appendix 9, wherein the server collects, from the plurality of terminals, a plurality of thresholds updated based on the learning data, identifies a user with a loud voice by comparing the collected thresholds with one another, and notifies the terminal used by the identified user of an alarm, and
the method further comprises outputting an alarm when the alarm is received.

(Appendix 11) The voice processing method according to Appendix 10, wherein the server classifies the plurality of terminals into groups according to usage environment and identifies a user with a loud voice by comparing the thresholds acquired from the terminals classified into a group with similar usage environments.

(Appendix 12) The voice processing method according to any one of Appendices 7 to 11, further causing the computer to execute a process of adding a correction gain to the input voice when the value based on the pitch frequency and the frequency power is less than a third threshold and the average power of the frequency power is equal to or greater than a first threshold, or when the value based on the pitch frequency and the frequency power is equal to or greater than the third threshold and the average power of the frequency power is less than a second threshold.

(Appendix 13) A voice processing device comprising:
an extraction unit that extracts a pitch frequency and a frequency power from an input voice; and
an estimation unit that determines whether a condition that a value based on the pitch frequency and the frequency power is equal to or greater than a predetermined threshold is satisfied, to obtain a determination result, and estimates an utterance state of the input voice based on a relationship between the determination result and an average power of the frequency power.

(Appendix 14) The voice processing device according to Appendix 13, further comprising an update unit that: notifies an estimation result of the utterance state; determines whether the utterance state has improved; treats the determination criterion or the threshold used in the process of estimating the utterance state as correct-answer data when the utterance state has improved, and as incorrect-answer data when it has not improved or has not changed; and updates the determination criterion or the threshold, using the correct-answer data and the incorrect-answer data as learning data, so that the incorrect-answer data decreases.

(Appendix 15) The voice processing device according to Appendix 14, further comprising a download unit that: receives an initial value of the determination criterion and an initial value of the threshold from a server that collects the learning data from a plurality of terminals connected to a network and generates those initial values based on the collected learning data; sets the received initial value of the determination criterion as the initial value of the determination criterion used in the process of estimating the utterance state; and sets the received initial value of the threshold as the initial value of the threshold used in the determining process.

(Appendix 16) The voice processing device according to Appendix 15, wherein the server collects, from the plurality of terminals, a plurality of thresholds updated based on the learning data, compares the collected thresholds with one another to identify a user with a loud voice, and notifies the terminal used by the identified user of an alarm, the voice processing device further comprising a message receiving unit that outputs an alarm when the alarm is received.

(Appendix 17) The voice processing device according to Appendix 16, wherein the server classifies the plurality of terminals into groups according to usage environment, and identifies a user with a loud voice by comparing a plurality of thresholds acquired from a plurality of terminals classified into groups with similar usage environments.
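
As a non-limiting illustration of Appendices 16 and 17, the server-side comparison might be sketched as follows. The record layout, the function name `find_loud_users`, the use of the group median as the reference, and the `margin` value are all assumptions made for this sketch; the appendices state only that thresholds from terminals in similar usage environments are compared. The sketch further assumes that a larger updated threshold indicates a louder voice.

```python
from collections import defaultdict
from statistics import median


def find_loud_users(reports, margin=6.0):
    """Return IDs of terminals whose threshold exceeds their group's median by more than margin.

    reports: iterable of (terminal_id, environment, threshold) tuples (hypothetical layout).
    margin:  hypothetical allowance, in the same unit as the thresholds.
    """
    # Group the reported thresholds by usage environment.
    groups = defaultdict(list)
    for terminal_id, environment, threshold in reports:
        groups[environment].append((terminal_id, threshold))

    # Within each group, compare every threshold with the group median.
    loud = []
    for members in groups.values():
        reference = median(threshold for _, threshold in members)
        loud.extend(tid for tid, threshold in members if threshold - reference > margin)
    return sorted(loud)


# Example data; the environment names and threshold values are invented.
reports = [
    ("500a", "open_office", 60.0),
    ("500b", "open_office", 62.0),
    ("500c", "open_office", 75.0),
    ("500d", "meeting_room", 50.0),
    ("500e", "meeting_room", 52.0),
]
alarm_targets = find_loud_users(reports)
```

In this example only terminal 500c would receive the alarm, because its threshold stands out from the other terminals in the same usage environment.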

(Appendix 18) The voice processing device according to any one of Appendices 13 to 17, further comprising a gain adjusting unit that applies a correction gain to the input voice when the value based on the pitch frequency and the frequency power is less than a third threshold and the average power of the frequency power is equal to or greater than a first threshold, or when the value based on the pitch frequency and the frequency power is equal to or greater than the third threshold and the average power of the frequency power is less than a second threshold.
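
As a non-limiting illustration of Appendix 18, the two cases in which the gain adjusting unit acts can be written directly as a condition. The function names, the decibel unit, the `step_db` value, and the choice to attenuate in the first case and amplify in the second are assumptions of this sketch; the appendix specifies only that a correction gain is applied.

```python
def needs_correction(stress_value, average_power, th1, th2, th3):
    """True in either of the two cases named in Appendix 18."""
    case_1 = stress_value < th3 and average_power >= th1
    case_2 = stress_value >= th3 and average_power < th2
    return case_1 or case_2


def correction_gain_db(stress_value, average_power, th1, th2, th3, step_db=3.0):
    """Hypothetical gain rule: attenuate in the first case, amplify in the second."""
    if stress_value < th3 and average_power >= th1:
        return -step_db
    if stress_value >= th3 and average_power < th2:
        return step_db
    return 0.0


def apply_gain(samples, gain_db):
    """Scale the samples by a gain given in decibels."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]
```

A caller would evaluate `correction_gain_db` per analysis interval and pass the result to `apply_gain`; a gain of 0 dB leaves the input voice unchanged.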

10 Microphone
100, 200, 300a, 300b, 300c, 500a, 500b, 500c, 500d, 500e, 500f, 500g, 500h, 500i, 500j, 500k, 500l, 700 Voice processing device
110 AD conversion unit
120a Pitch extraction unit
120b Power extraction unit
130 Stress detection unit
140 Storage unit
140a Determination criterion data
160 Information presentation unit
220 Update unit
310 Upload unit
320 Download unit
400, 600 Server
510 Message receiving unit
720 Gain adjustment unit

Claims

A voice processing program that causes a computer to execute a process comprising:
extracting frequency power and a pitch frequency from input voice;
determining whether a value that is based on the frequency power and the pitch frequency, and that indicates the stress of the user who uttered the input voice, is equal to or greater than a predetermined threshold; and
estimating a vocalization state of the input voice based on the relationship between the determination result and the average power of the frequency power.

The voice processing program according to claim 1, further causing the computer to execute a process of: notifying the estimation result of the vocalization state; determining whether the vocalization state has improved; when it has improved, treating the determination criterion or the threshold used in the process of estimating the vocalization state as correct-answer data; when it has not improved or has not changed, treating the determination criterion or the threshold as incorrect-answer data; and using the correct-answer data and the incorrect-answer data as learning data to update the determination criterion or the threshold so that the incorrect-answer data decreases.

The voice processing program according to claim 2, further causing the computer to execute a process of: receiving an initial value of the determination criterion and an initial value of the threshold from a server that collects the learning data from a plurality of terminals connected to a network and generates the initial value of the determination criterion and the initial value of the threshold based on the collected learning data; setting the received initial value of the determination criterion as the initial value of the determination criterion used in the process of estimating the vocalization state; and setting the received initial value of the threshold as the initial value of the threshold used in the determining process.

The voice processing program according to claim 3, wherein the server collects, from the plurality of terminals, a plurality of thresholds updated based on the learning data, compares the collected thresholds with one another to identify a user with a loud voice, and notifies the terminal used by the identified user of an alarm, and wherein the program further causes the computer to execute a process of outputting an alarm when the alarm is received.

The voice processing program according to claim 4, wherein the server classifies the plurality of terminals into groups according to usage environment, and identifies a user with a loud voice by comparing a plurality of thresholds acquired from a plurality of terminals classified into groups with similar usage environments.

The voice processing program according to any one of claims 1 to 5, further causing the computer to execute a process of applying a correction gain to the input voice when the value based on the frequency power and the pitch frequency is less than a third threshold and the average power of the frequency power is equal to or greater than a first threshold, or when the value based on the frequency power and the pitch frequency is equal to or greater than the third threshold and the average power of the frequency power is less than a second threshold.

A voice processing method executed by a computer, the method comprising:
extracting frequency power and a pitch frequency from input voice;
determining whether a value that is based on the frequency power and the pitch frequency, and that indicates the stress of the user who uttered the input voice, is equal to or greater than a predetermined threshold; and
estimating a vocalization state of the input voice based on the relationship between the determination result and the average power of the frequency power.

A voice processing device comprising:
an extraction unit that extracts frequency power and a pitch frequency from input voice; and
an estimation unit that determines whether a value that is based on the frequency power and the pitch frequency, and that indicates the stress of the user who uttered the input voice, is equal to or greater than a predetermined threshold, and estimates a vocalization state of the input voice based on the relationship between the determination result and the average power of the frequency power.
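
As a non-limiting illustration of the three steps recited in the independent claims, the following sketch extracts an average power and a pitch frequency from one frame and then relates a stress determination to the average power. The autocorrelation pitch estimator, the decibel scale, the function names, the state labels, and the reuse of the first to third thresholds are assumptions of this sketch. The claims do not state how the stress value is computed from the two features, so the sketch takes it as an input.

```python
import math


def average_power_db(frame):
    """Mean-square power of one frame in decibels (full scale = 1.0)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)  # small offset avoids log(0)


def pitch_frequency(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate pitch from the autocorrelation peak within the lags for f_min..f_max."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


def estimate_vocalization_state(stress_value, avg_power, th1, th2, th3):
    """Relate the determination result to the average power (labels are hypothetical)."""
    stressed = stress_value >= th3  # determination result
    if not stressed and avg_power >= th1:
        return "loud_without_stress"
    if stressed and avg_power < th2:
        return "quiet_with_stress"
    return "other"


# Example: a 200 Hz tone sampled at 8 kHz stands in for a voiced frame.
sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(400)]
pitch = pitch_frequency(frame, sample_rate)
power = average_power_db(frame)
# The stress value 0.2 and the thresholds below are invented for illustration.
state = estimate_vocalization_state(0.2, power, th1=-12.0, th2=-30.0, th3=0.5)
```

The frame must be longer than the largest lag searched, which is why the example uses 400 samples for a 60 Hz lower bound at 8 kHz.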