JP7176325B2

JP7176325B2 - Speech processing program, speech processing method and speech processing device

Info

Publication number: JP7176325B2
Application number: JP2018181937A
Authority: JP
Inventors: 紗友梨中山; 太郎外川; 清訓森岡
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-09-27
Filing date: 2018-09-27
Publication date: 2022-11-22
Anticipated expiration: 2038-09-27
Also published as: JP2020052256A

Description

The present invention relates to a speech processing program and the like.

In recent years, many companies have wanted to obtain various kinds of information from communication between store clerks and customers and to use it to improve customer satisfaction and to shape marketing strategy. For example, a speech section in which a customer hesitates over a decision contains much information about the factors behind the customer's decision. By detecting such sections and analyzing their speech information, the customer's needs can be estimated.

For example, there is a conventional technique that detects, as thinking time, the period from the time an agent asks a user a question to the end of the silence and meaningless utterance sections that precede the user's answer. A meaningless utterance section is called a "filler section".

JP 2017-111760 A; JP 2000-194386 A; WO 2017/085992

However, the conventional technique described above cannot accurately estimate the section in which the user is hesitating.

FIG. 15 is a diagram for explaining the problem with the conventional technique. The conventional technique detects a filler section as thinking time (a speech section in which the user hesitates over a decision). In a filler section, however, the user may be hesitating over a decision, or may instead be searching for words or performing a memory operation.

In FIG. 15, the store clerk asks, "Is there a model you are interested in?" The customer says, "Um, huh? What was it..." and then answers, "Ah! It's XX." In this example, the section T1 in which "Um, huh? What was it..." was spoken is detected as a filler section. Section T1 can be regarded as a section in which the customer is performing a memory operation.

The store clerk then says, "XX, right? It's popular. But XX doesn't come with YY...", and the customer says, "Is that so, hmm..." and then, "I suppose I do need YY after all." In this example, the section T2 in which "Is that so, hmm..." was spoken is detected as a filler section. Section T2 can be regarded as a section in which the customer is hesitating.

The conventional technique detects both sections T1 and T2 as thinking time, but section T1 is a section in which the customer is performing a memory operation, not one in which the customer is hesitating over a decision.

In one aspect, an object of the present invention is to provide a speech processing program, a speech processing method, and a speech processing device capable of accurately estimating a section in which a user is hesitating.

In a first aspect, a computer is caused to execute the following processing. The computer detects a plurality of utterance sections from speech information. The computer identifies, as a filler section, an utterance section in which a filler is detected among the plurality of utterance sections. Based on a feature amount of the speech information in the filler section, the computer determines whether the speech information in the filler section is speech uttered while the user is hesitating over a decision.

A section in which the user is hesitating can be accurately estimated.

FIG. 1 is a diagram for explaining the processing of the speech processing device according to the first embodiment.
FIG. 2 is a diagram showing the configuration of the system according to the first embodiment.
FIG. 3 is a functional block diagram showing the configuration of the speech processing device according to the first embodiment.
FIG. 4 is a functional block diagram showing the configuration of the state determination unit according to the first embodiment.
FIG. 5 is a diagram for explaining the processing of the brightness calculation unit.
FIG. 6 is a flowchart showing the processing procedure of the speech processing device according to the first embodiment.
FIG. 7 is a diagram showing an example of experimental results.
FIG. 8 is a diagram for explaining the transition of the frequency of speech information.
FIG. 9 is a diagram showing the configuration of the system according to the second embodiment.
FIG. 10 is a functional block diagram showing the configuration of the recording device according to the second embodiment.
FIG. 11 is a functional block diagram showing the configuration of the speech processing device according to the second embodiment.
FIG. 12 is a functional block diagram showing the configuration of the state determination unit according to the second embodiment.
FIG. 13 is a flowchart showing the processing procedure of the speech processing device according to the second embodiment.
FIG. 14 is a diagram showing an example of the hardware configuration of a computer that implements functions similar to those of the speech processing device.
FIG. 15 is a diagram for explaining the problem with the conventional technique.

Embodiments of the speech processing program, speech processing method, and speech processing device disclosed in the present application are described in detail below with reference to the drawings. The invention is not limited by these embodiments.

FIG. 1 is a diagram for explaining the processing of the speech processing device according to the first embodiment. As an example, the first embodiment describes a case in which speech information of a first speaker and a second speaker is acquired, utterance sections are detected, and it is determined whether a given utterance section of the second speaker is a section in which the second speaker is hesitating over a decision. In the following description, an utterance section in which the speaker is hesitating over a decision is referred to as a "hesitation section" where appropriate.

Upon acquiring speech information, the speech processing device detects the utterance sections of the first speaker and of the second speaker, and identifies, among the second speaker's utterance sections that follow an utterance section of the first speaker, those that are filler sections. An utterance section is a section sandwiched between silent sections. A filler section is an utterance section in which a meaningless utterance was made. For example, a section in which "Um, huh, what was it" or "Is that so, hmm" was spoken is a filler section.

In FIG. 1, the utterance sections of the first speaker are T11, T12, and T13, and the utterance sections of the second speaker are T21, T22, T23, and T24. The speech processing device analyzes the speech information of utterance sections T21 and T22, which follow utterance section T11, and of utterance sections T23 and T24, which follow utterance section T12, to identify filler sections. The description here assumes that the speech processing device has identified utterance sections T21 and T23 as filler sections. Based on the feature amount of the speech information in each filler section, the speech processing device determines whether the filler section is a hesitation section.

Graph G1 in FIG. 1 shows the change over time of a feature amount of the speech information (the brightness of the voice). The horizontal axis of graph G1 is time. The vertical axis corresponds to the brightness of the speaker's voice: a value greater than a threshold TH_D indicates that the voice is bright, and a value less than TH_D indicates that the voice is dark. For example, when the brightness in a filler section is "bright", the speech processing device determines that the filler section is "not a hesitation section". Conversely, when the brightness in a filler section is "dark", the speech processing device determines that the filler section is "a hesitation section".

Referring to FIG. 1, the brightness of the voice in filler section T21 is "bright", so the speech processing device determines that filler section T21 is "not a hesitation section". The brightness of the voice in filler section T23 is "dark", so the speech processing device determines that filler section T23 is "a hesitation section". Because the speech processing device according to this embodiment determines whether a filler section is a hesitation section based on the feature amount of the speech information in that filler section, it can accurately estimate the section in which the user is hesitating.
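The basic rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name, the numeric brightness values, and the threshold value are hypothetical, and treating a brightness exactly equal to TH_D as "bright" is an assumption, since the text only covers values above and below the threshold.

```python
def is_hesitation_by_filler(d_filler: float, th_d: float) -> bool:
    """Return True when a filler section is judged to be a hesitation section.

    A dark voice (brightness below TH_D) means hesitation.
    Assumption: brightness equal to TH_D counts as "bright".
    """
    return d_filler < th_d


# Hypothetical brightness values for filler sections T21 and T23.
TH_D = 10.0
t21_is_hesitation = is_hesitation_by_filler(14.2, TH_D)  # bright voice
t23_is_hesitation = is_hesitation_by_filler(6.5, TH_D)   # dark voice
print(t21_is_hesitation, t23_is_hesitation)
```

With these illustrative values, T21 is not a hesitation section and T23 is, matching the outcome described for FIG. 1.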

In addition to the feature amount of the speech information in the filler section, the speech processing device may further use the feature amount of the speech information in the response section that follows the filler section, the average brightness over the entire dialogue, and the response time to determine whether the filler section is a hesitation section. This allows hesitation sections to be determined more accurately. A response section is an utterance section of the second speaker other than a filler section. In the example of FIG. 1, utterance sections T22 and T24 are response sections. The response time is the time from an utterance section of the first speaker to the response section of the second speaker. For example, the time from the end time of utterance section T11 to the start time of utterance section (response section) T22 is the response time.

First, a case is described in which the speech processing device determines a hesitation section based on a filler section and the response section that follows it. When the speaker's voice is "dark" in the filler section and "bright" in the response section, the speech processing device determines that the filler section is "a hesitation section". When this condition is not satisfied, the speech processing device determines that the filler section is "not a hesitation section".

For example, the voice is "dark" in filler section T23 and "bright" in response section T24, so the condition is satisfied and the speech processing device determines that filler section T23 is "a hesitation section". In contrast, the voice is "bright" in filler section T21 and "bright" in response section T22, so the condition is not satisfied and the speech processing device determines that filler section T21 is "not a hesitation section".
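The two-part condition can be written as a small predicate. The sketch below assumes that both sections are compared against the same threshold TH_D and that a value equal to the threshold counts as "bright"; neither detail is stated in the text, and the function name and sample values are hypothetical.

```python
def is_hesitation_with_response(d_filler: float, d_response: float,
                                th_d: float) -> bool:
    """Hesitation only if the filler is dark AND the following response is bright.

    Assumptions: one shared threshold TH_D, and brightness == TH_D is "bright".
    """
    filler_is_dark = d_filler < th_d
    response_is_bright = d_response >= th_d
    return filler_is_dark and response_is_bright


TH_D = 10.0
# Hypothetical values: (filler, response) = (T23, T24) and (T21, T22).
print(is_hesitation_with_response(6.5, 13.0, TH_D))   # dark then bright
print(is_hesitation_with_response(14.2, 12.0, TH_D))  # bright then bright
```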

Next, a case is described in which the speech processing device determines whether a filler section is a hesitation section based on the filler section, the response section that follows it, the average brightness over the entire dialogue, and the response time. Only when the average brightness over the entire dialogue is equal to or greater than a threshold TH_D' and the response time is equal to or greater than a threshold TH_R does the speech processing device perform the determination described above based on the filler section and the following response section. When the average brightness over the entire dialogue is less than TH_D', the speaker can be regarded as uninterested, so the speech processing device determines that the filler section is "not a hesitation section" regardless of the features of the speech information in the filler section and the following response section. Likewise, when the response time is less than TH_R, the speech processing device determines that the filler section is "not a hesitation section" regardless of those features.

By using the average brightness over the entire dialogue and the response time, the speech processing device can determine that a filler section is "not a hesitation section" without analyzing the feature amounts of the speech information in the filler section and the response section.
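The gated decision can be sketched as two early exits followed by the filler/response comparison. All names and numeric values here are hypothetical, and the handling of values exactly equal to TH_D in the final comparison is an assumption; the two gate conditions follow the "equal to or greater than" wording of the text.

```python
def judge_filler_section(d_filler: float, d_response: float,
                         avg_brightness: float, response_time: float,
                         th_d: float, th_d_avg: float, th_r: float) -> bool:
    """Return True when the filler section is judged a hesitation section."""
    # Gate 1: low average brightness over the dialogue -> speaker uninterested.
    if avg_brightness < th_d_avg:
        return False
    # Gate 2: response came quickly -> not treated as hesitation.
    if response_time < th_r:
        return False
    # Both gates passed: dark filler followed by bright response.
    # Assumption: brightness == TH_D counts as "bright".
    return d_filler < th_d and d_response >= th_d


# Hypothetical thresholds: TH_D, TH_D' and TH_R (seconds).
print(judge_filler_section(6.5, 13.0, 11.0, 2.4,
                           th_d=10.0, th_d_avg=9.0, th_r=1.0))
```

The early exits reflect the point made in the text: when either gate fails, the per-section feature amounts do not need to be examined at all.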

Next, the configuration of the system according to the first embodiment is described. FIG. 2 is a diagram showing the configuration of the system according to the first embodiment. As shown in FIG. 2, the system includes a speech processing device 1 and microphones 10a and 10b. The speech processing device 1 is connected to the microphones 10a and 10b.

The microphone 10a collects the voice of a user U101 and outputs the collected speech information to the speech processing device 1. The microphone 10b collects the voice of a user U102 and outputs the collected speech information to the speech processing device 1. For example, the user U101 corresponds to the first speaker and the user U102 corresponds to the second speaker. In the following description, the speech information of the user U101 is referred to as "first speech information" and the speech information of the user U102 as "second speech information" where appropriate.

The speech processing device 1 acquires speech information from the microphones 10a and 10b, determines hesitation sections among the utterance sections of the user U102 based on the acquired speech information, and stores the speech information of the determined hesitation sections in a storage device.

FIG. 3 is a functional block diagram showing the configuration of the speech processing device according to the first embodiment. As shown in FIG. 3, the speech processing device 1 is connected to the microphones 10a and 10b described with reference to FIG. 2. The speech processing device 1 includes AD conversion units 20a and 20b, a preprocessing unit 30, a storage unit 40, and a state determination unit 100. The AD conversion units 20a and 20b, the preprocessing unit 30, and the state determination unit 100 are each realized by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit). Each processing unit may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The AD conversion unit 20a is an AD conversion circuit (analog-to-digital converter) that converts the first speech information input from the microphone 10a from an analog signal into a digital signal. The AD conversion unit 20a outputs the first speech information converted into a digital signal to the preprocessing unit 30. In the following description, the first speech information converted into a digital signal is referred to simply as the first speech information.

The AD conversion unit 20b is an AD conversion circuit that converts the second speech information input from the microphone 10b from an analog signal into a digital signal. The AD conversion unit 20b outputs the second speech information converted into a digital signal to the preprocessing unit 30. In the following description, the second speech information converted into a digital signal is referred to simply as the second speech information.

The preprocessing unit 30 is a processing unit that performs various kinds of preprocessing on the first speech information and the second speech information and outputs the preprocessed first and second speech information to the state determination unit 100. For example, in the system shown in FIG. 2, each of the microphones 10a and 10b may pick up the voices of both users U101 and U102. The preprocessing unit 30 therefore removes the voice of the user U102 from the first speech information so that the first speech information contains only the voice of the user U101. Similarly, the preprocessing unit 30 removes the voice of the user U101 from the second speech information so that the second speech information contains only the voice of the user U102.

The state determination unit 100 acquires the first speech information and the second speech information, detects utterance sections, and identifies filler sections among the utterance sections. Based on the feature amount of the speech information contained in an identified filler section, the state determination unit 100 determines whether the filler section is a hesitation section. When the filler section is a hesitation section, the state determination unit 100 stores the speech information of the hesitation section in the storage unit 40. This embodiment describes a case in which filler sections are identified among the utterance sections of the second speech information and it is determined whether each is a hesitation section, but the invention is not limited to this. The state determination unit 100 may instead identify filler sections among the utterance sections of the first speech information and determine whether each is a hesitation section.

The storage unit 40 is a storage device that stores the speech information of hesitation sections. The storage unit 40 corresponds to, for example, a semiconductor memory element such as a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory.

FIG. 4 is a functional block diagram showing the configuration of the state determination unit according to the first embodiment. As shown in FIG. 4, the state determination unit 100 includes utterance section detection units 110a and 110b, an identification unit 120, a response time calculation unit 130, a brightness calculation unit 140, a long-term average calculation unit 150, a threshold calculation unit 160, a threshold DB 170, and a determination unit 180.

The utterance section detection unit 110a receives the first speech information of the first speaker (user U101) and calculates, as an utterance section, a section sandwiched between silent sections in which the power of the first speech information is low. The utterance section detection unit 110a outputs information on the utterance sections of the first speech information to the response time calculation unit 130.

For example, the utterance section detection unit 110a detects the power of the first speech information x1(n) and sets a variable v1(t) indicating the presence or absence of speech. Here, n is the sample number of the speech information and t is the frame number. One frame is 30 ms. When the power of x1(n) within a frame is less than a threshold, the utterance section detection unit 110a sets v1(t) = 0 ("no speech"). When the power of x1(n) within a frame is equal to or greater than the threshold, it sets v1(t) = 1 ("speech present"). The utterance section detection unit 110a may instead determine the presence or absence of speech based on the technique described in WO 2009/145192, or may use the SNR (signal-to-noise ratio) in place of the power of the speech information. The same applies to the utterance section detection unit 110b described later.
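The per-frame speech flag can be illustrated with a short pure-Python sketch. The text does not say how frame power is computed, so the mean of squared samples used here is an assumption, as are the function name, the sample rate, and the threshold value. Trailing samples that do not fill a whole frame are ignored.

```python
def frame_voice_flags(x, sample_rate, power_threshold, frame_ms=30):
    """Return v(t) per 30 ms frame: 1 if frame power >= threshold, else 0.

    Assumption: frame power is the mean of squared samples.
    """
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("sample_rate too low for the chosen frame length")
    flags = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        flags.append(1 if power >= power_threshold else 0)
    return flags


# Toy signal at a hypothetical 1 kHz rate: silence, one loud frame, silence.
signal = [0.0] * 30 + [1.0] * 30 + [0.0] * 30
print(frame_voice_flags(signal, sample_rate=1000, power_threshold=0.5))
# prints [0, 1, 0]
```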

Based on the variable v1(t), the utterance section detection unit 110a outputs the start time T1s(k1) and the end time T1e(k1) of each utterance section of the first speaker to the response time calculation unit 130. Here, k1 is an utterance section number that identifies an utterance section of the first speech information.

For example, the utterance section detection unit 110a outputs T1s(k1) and T1e(k1) based on equations (1) and (2). As shown in equation (1), the utterance section detection unit 110a takes the timing at which the value of v1(t) changes from 0 to 1 as the start time of an utterance section. As shown in equation (2), it takes the timing at which the value of v1(t) changes from 1 to 0 as the end time of the utterance section.
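The transition rule described for equations (1) and (2) can be sketched as a scan over the flag sequence. The exact form of the equations is not reproduced here, so this is only an illustration of the stated behavior. It assumes that the flag before the first frame is 0, that times are expressed in milliseconds at the start of the frame where the change occurs, and that a section still open at the end of the input is dropped; the function name is hypothetical.

```python
def utterance_sections(v, frame_ms=30):
    """Turn a 0/1 flag sequence v(t) into (start_ms, end_ms) pairs.

    Start: v changes 0 -> 1. End: v changes 1 -> 0.
    Assumptions: v is 0 before frame 0; an unterminated section is dropped.
    """
    sections = []
    start = None
    prev = 0
    for t, flag in enumerate(v):
        if prev == 0 and flag == 1:
            start = t * frame_ms          # corresponds to Ts(k)
        elif prev == 1 and flag == 0:
            sections.append((start, t * frame_ms))  # corresponds to Te(k)
            start = None
        prev = flag
    return sections


print(utterance_sections([0, 1, 1, 0, 0, 1, 0]))
# prints [(30, 90), (150, 180)]
```

The same routine would apply unchanged to the second speaker's flags v2(t).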

The utterance section detection unit 110b receives the second speech information of the second speaker (user U102) and calculates, as an utterance section, a section sandwiched between silent sections in which the power of the second speech information is low. The utterance section detection unit 110b outputs information on the utterance sections to the identification unit 120 and the brightness calculation unit 140.

For example, the utterance section detection unit 110b detects the power of the second speech information x2(n) and sets a variable v2(t) indicating the presence or absence of speech. When the power of x2(n) within a frame is less than a threshold, the utterance section detection unit 110b sets v2(t) = 0 ("no speech"). When the power of x2(n) within a frame is equal to or greater than the threshold, it sets v2(t) = 1 ("speech present").

Based on the variable v2(t), the utterance section detection unit 110b outputs the start time T2s(k2) and the end time T2e(k2) of each utterance section of the second speaker to the identification unit 120 and the brightness calculation unit 140. Here, k2 is an utterance section number that identifies an utterance section of the second speech information.

For example, the utterance section detection unit 110b outputs T2s(k2) and T2e(k2) based on equations (3) and (4). As shown in equation (3), the utterance section detection unit 110b takes the timing at which the value of v2(t) changes from 0 to 1 as the start time of an utterance section. As shown in equation (4), it takes the timing at which the value of v2(t) changes from 1 to 0 as the end time of the utterance section.

The identification unit 120 is a processing unit that receives the second speech information and the information on the second speaker's utterance sections and determines whether each utterance section is a filler section. The identification unit 120 outputs the determination result to the response time calculation unit 130 and the determination unit 180.

The identification unit 120 inputs the portion of the second speech information x2(n) between the start time T2s(k2) and the end time T2e(k2) of an utterance section into a speech recognition engine, and determines whether the utterance section is a filler section by checking the speech recognition result against a filler DB (database). For example, the filler DB stores various meaningless utterances such as "is that so" (そうなんだ), "um" (えーっと), and "what was it" (なんだっけ). When the speech recognition result hits an entry in the filler DB, the identification unit 120 identifies the utterance section as a filler section. The identification unit 120 may also determine whether an utterance section is a filler section using the technique described in JP 2015-082087 A.
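A filler DB lookup on recognized text might look like the sketch below. The text does not define what counts as a "hit", so the rule used here (every phrase in the recognition result must be a filler DB entry) is an assumption, as are the function name and the DB contents beyond the Japanese fillers quoted in the publication. The speech recognition engine itself is outside the scope of this sketch; the function takes its text output as input.

```python
import re

# Hypothetical filler DB: Japanese fillers quoted in the publication.
FILLER_DB = {"そうなんだ", "えーっと", "えっと", "あれ", "なんだっけ", "うーん"}


def filler_flag(recognized_text, filler_db=FILLER_DB):
    """Return F(k2): 1 if the section is judged a filler section, else 0.

    Assumption: a "hit" means every phrase, split on punctuation and
    whitespace, is an entry in the filler DB.
    """
    phrases = [p for p in re.split(r"[、。,.?？!！・…\s]+", recognized_text) if p]
    if phrases and all(p in filler_db for p in phrases):
        return 1
    return 0


print(filler_flag("そうなんだ、うーん・・・"))      # fillers only
print(filler_flag("やっぱりＹＹは必要かなぁ"))      # meaningful answer
```

Returning 1 or 0 mirrors the F(k2) convention that the identification unit 120 passes on to later stages.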

The identification unit 120 outputs the filler determination result F(k2) to the response time calculation unit 130 and the determination unit 180. When the identification unit 120 determines that the utterance section with utterance section number k2 of the second speech information is a filler section, it outputs F(k2) = 1 to the determination unit 180. When it determines that the utterance section with utterance section number k2 is not a filler section, it outputs F(k2) = 0 to the determination unit 180.

Speech recognition generally means converting speech into text. The identification unit 120 can detect fillers after converting the speech into text, but it can also detect fillers without doing so. For example, the identification unit 120 may detect whether an utterance section is a filler section from prosodic features of the utterance section (such as accent, intonation, and rhythm) or from acoustic features (such as the feature amounts on which speech recognition is based).

応答時間算出部１３０は、第１話者による発話区間が終了してから、第２話者による発話区間（フィラー区間ではない発話区間）が開始されるまでの応答時間を算出する処理部である。応答時間算出部１３０は、応答時間の情報を、閾値算出部１６０および判定部１８０に出力する。 The response time calculation unit 130 is a processing unit that calculates the response time from the end of the utterance interval by the first speaker to the start of the utterance interval (not the filler interval) by the second speaker. . Response time calculation section 130 outputs response time information to threshold calculation section 160 and determination section 180 .

たとえば、応答時間算出部１３０は、式（５）を基にして、応答時間Ｒ（ｋ２）を算出する。式（５）において、Ｔ１ｅ（ｋ１）は、第１話者の発話の終了時刻である。Ｔ２ｓ（ｋ２’）は、第１話者の発話の終了時刻Ｔ１ｅ（ｋ１）から次の発話の開始時刻Ｔ１ｓ（ｋ１＋１）に含まれる、第２話者の発話区間番号ｋ２の発話区間のうち、フィラー区間ではない、最初の発話区間の開始時刻を示す。なお、第１話者の発話の終了時刻Ｔ１ｅ（ｋ１）から次の発話の開始時刻Ｔ１ｓ（ｋ１＋１）には、第２話者の複数の発話区間が含まれていてもよい。 For example, the response time calculation unit 130 calculates the response time R(k2) based on Equation (5). In Equation (5), T1e(k1) is the end time of the first speaker's utterance. T2s(k2') is the start time of the first utterance segment that is not a filler segment among the second speaker's utterance segments (segment number k2) falling between the end time T1e(k1) of the first speaker's utterance and the start time T1s(k1+1) of the first speaker's next utterance. Note that a plurality of utterance segments of the second speaker may fall between the end time T1e(k1) of the first speaker's utterance and the start time T1s(k1+1) of the next utterance.
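The response time calculation can be sketched as below. Equation (5) itself is not reproduced in this text, so the sketch assumes the natural reading R(k2) = T2s(k2') - T1e(k1); the `Segment` tuple layout and the function name are illustrative only.

```python
from typing import List, Optional, Tuple

# Assumed layout of a second-speaker segment: (start_frame, end_frame, filler_flag F).
Segment = Tuple[int, int, int]

def response_time(t1e: int, t1s_next: int, segments2: List[Segment]) -> Optional[int]:
    """Frames from the end of the first speaker's utterance, T1e(k1), to the start
    of the second speaker's first non-filler segment before T1s(k1+1)."""
    for start, _end, filler in sorted(segments2):
        if t1e <= start <= t1s_next and filler == 0:
            return start - t1e
    return None  # the second speaker gave no non-filler response in this turn
```

For instance, with T1e(k1) = 100, a filler segment starting at frame 150 and a real answer starting at frame 350, the response time is 250 frames; the filler does not stop the clock.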

明るさ算出部１４０は、第２音声情報および第２話者の発話区間の情報を受け付け、第２話者の発話区間の明るさの推定値Ｄを算出する処理部である。明るさ算出部１４０は、明るさの推定値Ｄの情報を、長期平均算出部１５０、閾値算出部１６０、判定部１８０に出力する。明るさの推定値Ｄは、発話区間に含まれる各基本周波数の分散に対応するものである。 The brightness calculation unit 140 is a processing unit that receives the second voice information and the information of the speech period of the second speaker, and calculates the estimated value D of the brightness of the speech period of the second speaker. The brightness calculation unit 140 outputs the information of the brightness estimated value D to the long-term average calculation unit 150 , the threshold calculation unit 160 and the determination unit 180 . The brightness estimate value D corresponds to the variance of each fundamental frequency included in the speech period.

以下において、明るさ算出部１４０の処理の一例について説明する。まず、明るさ算出部１４０は、第２話者の発話区間Ｔ２ｓ（ｋ２）～Ｔ２ｅ（ｋ２）に含まれる第２音声情報から、基本周波数Ｐ（ｔ）を算出する。明るさ算出部１４０は、第２音声情報の自己相関関数を算出し、自己相関関数の値がピークとなる位置に基づいて、フレーム毎の基本周波数を算出する。明るさ算出部１４０は、特開平８－４４３９５に記載された技術を用いて、基本周波数を算出してもよい。 An example of the processing of the brightness calculation unit 140 will be described below. First, the brightness calculator 140 calculates the fundamental frequency P(t) from the second voice information included in the second speaker's utterance period T2s(k2) to T2e(k2). Brightness calculator 140 calculates the autocorrelation function of the second audio information, and calculates the fundamental frequency for each frame based on the position where the value of the autocorrelation function peaks. The brightness calculator 140 may calculate the fundamental frequency using the technique described in Japanese Patent Laid-Open No. 8-44395.
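As a rough illustration of autocorrelation-based pitch estimation, the following sketch picks the lag at which a frame's autocorrelation peaks. It is a simplified stand-in, not the technique of JP H8-44395: the search range `f_min`/`f_max`, the unnormalized autocorrelation, and the function name are all assumptions.

```python
import math

def estimate_f0(frame, sample_rate, f_min=70.0, f_max=400.0):
    """Estimate the fundamental frequency [Hz] of one frame from the lag
    at which the (unnormalized) autocorrelation is largest."""
    n = len(frame)
    lag_min = int(sample_rate / f_max)               # shortest period considered
    lag_max = min(int(sample_rate / f_min), n - 1)   # longest period considered
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else 0.0  # 0.0 means no pitch found

# Usage: a synthetic 200 Hz tone sampled at 8 kHz (50 ms frame).
tone = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(400)]
f0 = estimate_f0(tone, 8000)
```

Practical pitch trackers add voicing decisions, normalization, and octave-error handling; none of that is attempted here.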

明るさ算出部１４０は、基本周波数Ｐ（ｔ）［Ｈｚ］を、式（６）を基にして、基本周波数Ｐ’（ｔ）［semitone］に変換する。基本周波数Ｐ’（ｔ）は、人の聴覚上の声の高さに合った対数領域での尺度により示されるものである。 The brightness calculator 140 converts the fundamental frequency P(t) [Hz] into a fundamental frequency P'(t) [semitone] based on Equation (6). The fundamental frequency P'(t) is indicated by a scale in the logarithmic domain that matches the human auditory pitch of the voice.
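Equation (6) is not reproduced in this text. A common way to express a frequency on a semitone scale, shown here only as an assumed form, is 12 times the base-2 logarithm of the ratio to a reference frequency. The reference value `ref_hz` below is a hypothetical choice, not taken from the patent.

```python
import math

def hz_to_semitone(f_hz: float, ref_hz: float = 50.0) -> float:
    """Convert a frequency in Hz to semitones above ref_hz (assumed reference)."""
    return 12.0 * math.log2(f_hz / ref_hz)
```

On this scale a doubling of frequency (one octave) always adds 12 semitones, which is why differences in P'(t) track perceived pitch changes more closely than differences in Hz.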

明るさ算出部１４０は、基本周波数の時系列データＰ’（ｔ）から、所定フレームの移動平均により、基本周波数の長期平均Ｐ＿ａｖｅ＿ｌｏｎｇ（ｔ）を算出する。たとえば、明るさ算出部１４０は、式（７）に基づいて、Ｐ＿ａｖｅ＿ｌｏｎｇ（ｔ）を算出する。式（７）に含まれる「Ｌ」は、平均算出時の移動幅を示すものである。 The brightness calculation unit 140 calculates a long-term average P_ave_long(t) of the fundamental frequency from the time-series data P'(t) of the fundamental frequency by moving average of a predetermined frame. For example, brightness calculator 140 calculates P_ave_long(t) based on equation (7). "L" included in equation (7) indicates the movement width at the time of average calculation.

明るさ算出部１４０は、発話区間番号「ｋ２」の発話区間における平均差分量（分散）を、明るさの推定値Ｄ（ｋ２）として、算出する。たとえば、明るさ算出部１４０は、式（８）を基にして、明るさの推定値Ｄ（ｋ２）を算出する。 The brightness calculation unit 140 calculates the average amount of difference (variance) in the speech segment with the speech segment number “k2” as the estimated brightness value D(k2). For example, the brightness calculation unit 140 calculates the brightness estimated value D(k2) based on Equation (8).
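Equations (7) and (8) are not reproduced in this text, so the following is only one plausible reading: the long-term average is taken as a trailing moving average of width L, and the brightness estimate D(k2) as the mean squared deviation of P'(t) from that average over the utterance segment. Both choices, and the function names, are assumptions.

```python
from typing import Sequence

def moving_average(p: Sequence[float], t: int, width: int) -> float:
    """Assumed P_ave_long(t): mean of P' over the last `width` frames up to t."""
    start = max(0, t - width + 1)
    window = p[start:t + 1]
    return sum(window) / len(window)

def brightness(p: Sequence[float], seg_start: int, seg_end: int, width: int) -> float:
    """Assumed D(k2): mean squared deviation of P'(t) from its long-term
    average over frames seg_start..seg_end (inclusive)."""
    diffs = [(p[t] - moving_average(p, t, width)) ** 2
             for t in range(seg_start, seg_end + 1)]
    return sum(diffs) / len(diffs)
```

Under this reading a monotone segment yields D = 0, while a segment with lively pitch movement yields a larger D, matching the statement that D corresponds to the variance of the fundamental frequency.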

図５は、明るさ算出部の処理を説明するための図である。図５において、グラフＧ２の横軸はフレーム数に対応する軸であり、縦軸は基本周波数Ｐ’（ｔ）［semitone］に対応する軸である。グラフＧ３の横軸は第２話者の発話区間番号ｋ２に対応する軸であり、縦軸は明るさの推定値Ｄ（ｋ２）に対応する軸である。たとえば、グラフＧ２の領域Ａ１に含まれるＰ’（ｔ）を、フレーム番号ｔ－Ｌ～Ｌに含まれるＰ’（ｔ）の長期平均Ｐ＿ａｖｅ＿ｌｏｎｇ（ｔ）で除算することで、推定値Ｄ_Ａ１が算出される。 FIG. 5 is a diagram for explaining the processing of the brightness calculation unit. In FIG. 5, the horizontal axis of graph G2 corresponds to the frame number, and the vertical axis corresponds to the fundamental frequency P'(t) [semitone]. The horizontal axis of graph G3 corresponds to the utterance segment number k2 of the second speaker, and the vertical axis corresponds to the estimated brightness value D(k2). For example, the estimated value D_A1 is calculated by dividing P'(t) in region A1 of graph G2 by the long-term average P_ave_long(t) of P'(t) over frame numbers t-L to L.

明るさ算出部１４０は、第２話者の各発話区間について、上記処理を繰り返し実行することで、各発話区間の明るさの推定値Ｄを算出する。 The brightness calculation unit 140 calculates the estimated value D of the brightness of each utterance period of the second speaker by repeatedly executing the above process for each utterance period of the second speaker.

図４の説明に戻る。長期平均算出部１５０は、明るさ算出部１４０から取得する明るさの推定値の時系列データＤ（ｋ２）から、所定フレームの移動平均により、明るさの長期平均Ｄ’（ｋ２）を算出する。たとえば、長期平均算出部１５０は、式（９）を基にして、明るさの長期平均Ｄ’（ｋ２）を算出する。式（９）において、Ｌ２は、発話区間番号ｋ２の発話区間の終了時刻から所定時間後の時刻を示す。Ｌ１は、発話区間番号ｋ２の発話区間の開始時刻から所定時間前の時刻を示す。長期平均算出部１５０は、前後の会話状況の明るさを示す指標として活用する。長期平均算出部１５０は、明るさの長期平均Ｄ’（ｋ２）を判定部１８０に出力する。 Returning to the description of FIG. 4, the long-term average calculation unit 150 calculates the long-term average brightness D'(k2) from the time-series data D(k2) of the estimated brightness values obtained from the brightness calculation unit 140, using a moving average over a predetermined number of frames. For example, the long-term average calculation unit 150 calculates the long-term average brightness D'(k2) based on Equation (9). In Equation (9), L2 denotes the time a predetermined period after the end time of the utterance segment with utterance segment number k2, and L1 denotes the time a predetermined period before the start time of that utterance segment. The long-term average serves as an index of the brightness of the conversation before and after the segment. The long-term average calculation unit 150 outputs the long-term average brightness D'(k2) to the determination unit 180.
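Equation (9) is not reproduced in this text. One way to read the description, sketched here under that assumption, is to average D over all utterance segments that lie between L1 (a fixed time before segment k2 starts) and L2 (a fixed time after it ends). The tuple layout and the parameter names `before` and `after` are illustrative.

```python
from typing import List, Tuple

# Assumed layout: (start_frame, end_frame, brightness D) per utterance segment.
ScoredSegment = Tuple[int, int, float]

def long_term_brightness(segments: List[ScoredSegment], k2: int,
                         before: int, after: int) -> float:
    """Assumed D'(k2): mean of D over segments inside [L1, L2], where
    L1 = start(k2) - before and L2 = end(k2) + after."""
    start_k2, end_k2, _ = segments[k2]
    l1, l2 = start_k2 - before, end_k2 + after
    values = [d for start, end, d in segments if start >= l1 and end <= l2]
    return sum(values) / len(values)  # segment k2 itself is always included
```

Because the window extends both before and after the segment, this value cannot be computed until some later speech has been observed, which is consistent with its use as an indicator of the surrounding conversation.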

閾値算出部１６０は、各種の閾値を算出し、算出した閾値の情報を判定部１８０に出力する処理部である。たとえば、閾値算出部１６０は、閾値ＴＨ＿Ｄ、閾値ＴＨ＿Ｒ、閾値ＴＨ＿Ｄ’を算出する。閾値ＴＨ＿Ｄは、発話区間の明るさの推定値Ｄと比較される閾値である。閾値ＴＨ＿Ｒは、応答時間と比較される閾値である。閾値ＴＨ＿Ｄ’は、明るさ長期平均Ｄ’と比較される閾値である。 The threshold calculation unit 160 is a processing unit that calculates various thresholds and outputs information on the calculated thresholds to the determination unit 180 . For example, threshold calculator 160 calculates threshold TH_D, threshold TH_R, and threshold TH_D'. The threshold TH_D is a threshold to be compared with the estimated brightness value D of the speech period. The threshold TH_R is the threshold that is compared with the response time. The threshold TH_D' is the threshold that is compared with the brightness long-term average D'.

閾値ＴＨ＿Ｄ、閾値ＴＨ＿Ｒ、閾値ＴＨ＿Ｄ’の初期値は、閾値ＤＢ１７０に記録されているものとする。たとえば、閾値ＴＨ＿Ｄの初期値を「１．５［semitone］」とする。閾値ＴＨ＿Ｒの初期値を「２００［Frame］」とする。閾値ＴＨ＿Ｄ’の初期値を「１．０［semitone］」とする。 It is assumed that initial values of the threshold TH_D, the threshold TH_R, and the threshold TH_D' are recorded in the threshold DB 170 . For example, let the initial value of the threshold TH_D be "1.5 [semitone]". Assume that the initial value of the threshold TH_R is "200 [Frame]". Assume that the initial value of the threshold TH_D' is "1.0 [semitone]".

閾値算出部１６０が、閾値ＴＨ＿Ｄを算出する処理について説明する。閾値算出部１６０は、明るさ算出部１４０から、各発話区間の明るさの推定値Ｄを取得し、取得した複数の推定値Ｄの平均ＡＶＥ＿Ｄおよび分散ＶＡＲ＿Ｄを算出する。閾値算出部１６０は、式（１０）を基にして、閾値ＴＨ＿Ｄを更新する。閾値算出部１６０は、更新した閾値ＴＨ＿Ｄの情報を、判定部１８０に出力する。式（１０）のαは係数であり、たとえば「α＝０．５」とする。 The process by which the threshold calculation unit 160 calculates the threshold TH_D will be described. The threshold calculation unit 160 acquires the estimated brightness value D of each utterance segment from the brightness calculation unit 140, and calculates the mean AVE_D and the variance VAR_D of the acquired estimated values D. The threshold calculation unit 160 updates the threshold TH_D based on Equation (10) and outputs the updated threshold TH_D to the determination unit 180. α in Equation (10) is a coefficient, for example, α = 0.5.

ＴＨ＿Ｄ＝ＡＶＥ＿Ｄ－α×ＶＡＲ＿Ｄ・・・（１０） TH_D = AVE_D - α × VAR_D ... (10)

閾値算出部１６０が、閾値ＴＨ＿Ｒを算出する処理について説明する。閾値算出部１６０は、応答時間算出部１３０から、各応答時間Ｒを取得し、取得した複数の応答時間Ｒの平均ＡＶＥ＿Ｒおよび分散ＶＡＲ＿Ｒを算出する。閾値算出部１６０は、式（１１）を基にして、閾値ＴＨ＿Ｒを更新する。閾値算出部１６０は、更新した閾値ＴＨ＿Ｒの情報を、判定部１８０に出力する。式（１１）のβは係数であり、たとえば「β＝０．５」とする。 The process by which the threshold calculation unit 160 calculates the threshold TH_R will be described. The threshold calculation unit 160 acquires each response time R from the response time calculation unit 130, and calculates the mean AVE_R and the variance VAR_R of the acquired response times R. The threshold calculation unit 160 updates the threshold TH_R based on Equation (11) and outputs the updated threshold TH_R to the determination unit 180. β in Equation (11) is a coefficient, for example, β = 0.5.

ＴＨ＿Ｒ＝ＡＶＥ＿Ｒ－β×ＶＡＲ＿Ｒ・・・（１１） TH_R = AVE_R - β × VAR_R ... (11)
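Equations (10) and (11) share one form, so a single helper can illustrate both threshold updates. The use of the population variance is an assumption, since the text does not say which variance estimator is meant, and the function name is illustrative.

```python
from statistics import mean, pvariance
from typing import Sequence

def update_threshold(values: Sequence[float], coeff: float = 0.5) -> float:
    """Threshold = mean - coeff * variance, as in Equations (10) and (11).
    Pass brightness estimates D with coeff = alpha, or response times R
    with coeff = beta."""
    return mean(values) - coeff * pvariance(values)
```

For the values 1.0, 2.0, and 3.0 the mean is 2 and the population variance is 2/3, so with a coefficient of 0.5 the updated threshold is 2 - 1/3, roughly 1.667.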

閾値算出部１６０は、閾値ＴＨ＿Ｄ’に関しては、更新処理を行わないで、そのまま、判定部１８０に出力する。 Threshold calculation section 160 outputs threshold TH_D' to determination section 180 as it is without performing update processing.

閾値算出部１６０は、上記の閾値ＴＨ＿Ｄ、閾値ＴＨ＿Ｒを更新する処理を定期的に行い、更新を行ったタイミングで、判定部１８０に更新した閾値ＴＨ＿Ｄ、閾値ＴＨ＿Ｒを判定部１８０に出力する。また、閾値算出部１６０は、第２話者の識別情報と対応付けて、閾値ＴＨ＿Ｄ、閾値ＴＨ＿Ｒを、閾値ＤＢ１７０に格納しておき、別の機会に第２話者の音声情報を基に「判断迷い区間ではない」を行う場合に、格納しておいた各閾値から、第２話者に対応する閾値を検索して、検索した閾値を、判定部１８０に出力してもよい。これにより、第２話者の音声情報に最適化された閾値をもちいて、処理を行うことができる。 The threshold calculation unit 160 periodically updates the thresholds TH_D and TH_R, and outputs the updated thresholds to the determination unit 180 each time they are updated. The threshold calculation unit 160 may also store the thresholds TH_D and TH_R in the threshold DB 170 in association with identification information of the second speaker. When the hesitation-segment determination is performed on the second speaker's speech information on a later occasion, the threshold calculation unit 160 may retrieve the thresholds corresponding to the second speaker from the stored thresholds and output them to the determination unit 180. This allows the processing to use thresholds optimized for the second speaker's speech.

判定部１８０は、各発話区間に関するフィラー区間の有無Ｆ、応答時間Ｒ、推定値Ｄ、長期平均Ｄ’を取得し、フィラー区間と判定された発話区間について、下記の処理を行うことで、フィラー区間が「判断迷い区間であるかいなか」を判定する。判定部１８０は、判断迷い区間であると判定した場合には、判断迷い区間の第２音声情報を、記憶部４０に格納する。 The determination unit 180 acquires, for each utterance segment, the filler flag F, the response time R, the estimated value D, and the long-term average D', and performs the following processing on each utterance segment determined to be a filler segment in order to determine whether the filler segment is a hesitation segment. When the determination unit 180 determines that the filler segment is a hesitation segment, it stores the second speech information of the hesitation segment in the storage unit 40.

たとえば、判定部１８０は、長期平均判定処理、応答時間判定処理、フィラー明るさ判定処理、応答明るさ判定処理を行う。なお、判定部１８０は、フィラー区間ではない（Ｆ（ｋ２）＝０）発話区間については、長期平均判定処理、応答時間判定処理、フィラー明るさ判定処理、応答明るさ判定処理をスキップし「判断迷い区間ではない」と判定する。 For example, the determination unit 180 performs long-term average determination processing, response time determination processing, filler brightness determination processing, and response brightness determination processing. For an utterance segment that is not a filler segment (F(k2)=0), the determination unit 180 skips all four of these processes and determines that the segment is not a hesitation segment.

長期平均判定処理について説明する。判定部１８０は、フィラー区間と判定された（Ｆ（ｋ２）＝１）の発話区間番号「ｋ２」の発話区間の明るさ長期平均Ｄ’（ｋ２）と、閾値ＴＨ＿Ｄ’とを比較し、明るさ長期平均が「明」か「暗」かを判定する。判定部１８０は、長期平均Ｄ’（ｋ２）が、閾値ＴＨ＿Ｄ’以上である場合に、発話区間番号ｋ２の発話区間の明るさ長期平均が「明」であると判定する。判定部１８０は、長期平均Ｄ’（ｋ２）が、閾値ＴＨ＿Ｄ’未満である場合に、発話区間番号ｋ２の発話区間の明るさ長期平均が「暗」であると判定する。 The long-term average determination processing will be explained. The determination unit 180 compares the long-term average brightness D′(k2) of the speech segment with the speech segment number “k2” of the filler segment (F(k2)=1) with the threshold value TH_D′ to determine the brightness. Determines whether the long-term average is “bright” or “dark”. If the long-term average D'(k2) is equal to or greater than the threshold TH_D', the determining unit 180 determines that the long-term average brightness of the speech segment with the speech segment number k2 is "bright." If the long-term average D'(k2) is less than the threshold TH_D', the determining unit 180 determines that the long-term average brightness of the speech segment with the speech segment number k2 is "dark."

応答時間判定処理について説明する。判定部１８０は、フィラー区間と判定された（Ｆ（ｋ２）＝１）の発話区間番号「ｋ２」の発話区間に対応する応答時間Ｒ（ｋ２）と、閾値ＴＨ＿Ｒとを比較し、応答時間Ｒ（ｋ２）が「長」か「短」かを判定する。判定部１８０は、応答時間Ｒ（ｋ２）が、閾値ＴＨ＿Ｒ以上である場合に、発話区間番号ｋ２の発話区間に対応する応答時間が「長」と判定する。判定部１８０は、応答時間Ｒ（ｋ２）が、閾値ＴＨ＿Ｒ未満である場合に、発話区間番号ｋ２の発話区間に対応する応答時間が「短」と判定する。 The response time determination processing will be described. The determination unit 180 compares the response time R(k2) corresponding to the utterance segment with utterance segment number "k2" that was determined to be a filler segment (F(k2)=1) with the threshold TH_R, and determines whether the response time R(k2) is "long" or "short". If the response time R(k2) is equal to or greater than the threshold TH_R, the determination unit 180 determines that the response time corresponding to the utterance segment with utterance segment number k2 is "long". If the response time R(k2) is less than the threshold TH_R, the determination unit 180 determines that the response time is "short".

判定部１８０は、長期平均判定処理の判定結果が「明」であり、かつ、応答時間判定結果が「長」である場合に、続く、フィラー明るさ判定処理、応答明るさ判定処理を行う。一方、判定部１８０は、長期平均判定処理の判定結果が「暗」である、または、応答時間判定結果が「短」である場合に、発話区間番号「ｋ２」の発話区間が、「判断迷い区間ではない」と判定し、フィラー明るさ判定処理、応答明るさ判定処理をスキップする。 When the result of the long-term average determination processing is "bright" and the result of the response time determination processing is "long", the determination unit 180 goes on to perform the filler brightness determination processing and the response brightness determination processing. On the other hand, when the result of the long-term average determination processing is "dark" or the result of the response time determination processing is "short", the determination unit 180 determines that the utterance segment with utterance segment number "k2" is not a hesitation segment, and skips the filler brightness determination processing and the response brightness determination processing.

フィラー明るさ判定処理について説明する。判定部１８０は、フィラー区間と判定された（Ｆ（ｋ２）＝１）の発話区間番号「ｋ２」の発話区間の明るさ推定値Ｄ（ｋ２）と、閾値ＴＨ＿Ｄとを比較し、明るさが「明」か「暗」かを判定する。判定部１８０は、推定値Ｄ（ｋ２）が、閾値ＴＨ＿Ｄ以上である場合に、発話区間番号ｋ２の発話区間の明るさが「明」であると判定する。判定部１８０は、推定値Ｄ（ｋ２）が、閾値ＴＨ＿Ｄ未満である場合に、発話区間番号ｋ２の発話区間の明るさが「暗」であると判定する。 The filler brightness determination processing will be described. The determination unit 180 compares the estimated brightness value D(k2) of the utterance segment with utterance segment number "k2" that was determined to be a filler segment (F(k2)=1) with the threshold TH_D, and determines whether the brightness is "bright" or "dark". If the estimated value D(k2) is equal to or greater than the threshold TH_D, the determination unit 180 determines that the brightness of the utterance segment with utterance segment number k2 is "bright". If the estimated value D(k2) is less than the threshold TH_D, the determination unit 180 determines that the brightness of that utterance segment is "dark".

応答明るさ判定処理について説明する。判定部１８０は、フィラー区間と判定された（Ｆ（ｋ２）＝１）の発話区間番号「ｋ２」の発話区間に続く応答区間の明るさ推定値Ｄ（ｋ２’）と、閾値ＴＨ＿Ｄとを比較し、明るさが「明」か「暗」かを判定する。判定部１８０は、推定値Ｄ（ｋ２’）が、閾値ＴＨ＿Ｄ以上である場合に、応答区間の明るさが「明」であると判定する。判定部１８０は、推定値Ｄ（ｋ２’）が、閾値ＴＨ＿Ｄ未満である場合に、応答区間の明るさが「暗」であると判定する。 The response brightness determination process will be described. The determination unit 180 compares the estimated brightness value D(k2′) of the response segment following the speech segment with the speech segment number “k2” of the filler segment (F(k2)=1) with the threshold value TH_D. and determines whether the brightness is “bright” or “dark”. The determination unit 180 determines that the brightness of the response section is "bright" when the estimated value D(k2') is equal to or greater than the threshold TH_D. The determination unit 180 determines that the brightness of the response section is "dark" when the estimated value D(k2') is less than the threshold TH_D.

判定部１８０は、フィラー明るさ判定処理の判定結果が「暗」であり、かつ、応答明るさ判定処理の判定結果が「明」である場合に、発話区間番号「ｋ２」の発話区間が「判断迷い区間である」と判定する。 When the result of the filler brightness determination processing is "dark" and the result of the response brightness determination processing is "bright", the determination unit 180 determines that the utterance segment with utterance segment number "k2" is a hesitation segment.

なお、判定部１８０は、長期平均判定処理、応答時間判定処理、フィラー明るさ判定処理、応答明るさ判定処理をそれぞれ実行して、各判定結果をまとめて用いて、発話区間番号「ｋ２」の発話区間が「判断迷い区間であるかいなか」を判定してもよい。判定部１８０は、長期平均判定処理の結果が「明」、応答時間判定処理の結果が「長」、フィラー明るさ判定処理の結果が「暗」、応答明るさ判定処理の結果が「明」である場合に、発話区間番号「ｋ２」の発話区間が「判断迷い区間である」と判定してもよい。 Note that the determination unit 180 executes the long-term average determination process, the response time determination process, the filler brightness determination process, and the response brightness determination process, and collectively uses each determination result to determine the utterance segment number “k2”. It may be determined whether or not the utterance segment is “a doubtful determination segment”. The determination unit 180 determines that the result of the long-term average determination process is "bright", the result of the response time determination process is "long", the result of the filler brightness determination process is "dark", and the result of the response brightness determination process is "bright". , it may be determined that the utterance segment with the utterance segment number “k2” is “a doubtful determination segment”.
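The four determinations can be combined into a single predicate, sketched below. The function and parameter names are illustrative; the default thresholds are the initial values given earlier for TH_D' (1.0 semitone), TH_R (200 frames), and TH_D (1.5 semitone).

```python
def is_hesitation_segment(f: int, d_long: float, r: float,
                          d_filler: float, d_response: float,
                          th_d_long: float = 1.0, th_r: float = 200,
                          th_d: float = 1.5) -> bool:
    """Return True when a segment is judged to be a hesitation segment.

    f          -- filler flag F(k2)
    d_long     -- long-term average brightness D'(k2)
    r          -- response time R(k2) in frames
    d_filler   -- brightness D(k2) of the filler segment
    d_response -- brightness D(k2') of the response segment that follows
    """
    if f != 1:
        return False                       # non-filler segments are skipped
    if d_long < th_d_long or r < th_r:
        return False                       # conversation "dark" or response "short"
    # The filler itself must be "dark" and the following response "bright".
    return d_filler < th_d and d_response >= th_d
```

The early returns mirror the sequential variant, where the two brightness checks are skipped once the long-term average or response time rules the segment out; evaluating all four conditions together gives the same result.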

次に、本実施例１の音声処理装置１の状態判定部１００の処理手順の一例について説明する。図６は、本実施例１に係る音声処理装置の処理手順を示すフローチャートである。図６に示すように、この音声処理装置１の状態判定部１００は、第１音声情報および第２音声情報を取得する（ステップＳ１０１）。 Next, an example of the processing procedure of the state determination unit 100 of the speech processing device 1 of the first embodiment will be described. FIG. 6 is a flow chart showing the processing procedure of the speech processing device according to the first embodiment. As shown in FIG. 6, the state determination unit 100 of the speech processing device 1 acquires first speech information and second speech information (step S101).

状態判定部１００の発話区間検出部１１０ａは、第１話者の発話区間を検出し、発話区間検出部１１０ｂは、第２話者の発話区間を検出する（ステップＳ１０２）。状態判定部１００の特定部１２０は、フィラー区間を検出する（ステップＳ１０３）。状態判定部１００は、フィラー区間が存在しない場合には（ステップＳ１０４，Ｎｏ）、ステップＳ１１１に移行する。 The speech period detection unit 110a of the state determination unit 100 detects the speech period of the first speaker, and the speech period detection unit 110b detects the speech period of the second speaker (step S102). The specifying unit 120 of the state determination unit 100 detects the filler section (step S103). If there is no filler section (step S104, No), the state determination unit 100 proceeds to step S111.

一方、状態判定部１００は、フィラー区間が存在する場合には（ステップＳ１０４，Ｙｅｓ）、ステップＳ１０５に移行する。状態判定部１００の応答時間算出部１３０は、応答時間Ｒを算出する（ステップＳ１０５）。状態判定部１００の明るさ算出部１４０は、明るさの推定値Ｄを算出する（ステップＳ１０６）。 On the other hand, if there is a filler section (step S104, Yes), the state determination unit 100 proceeds to step S105. The response time calculation unit 130 of the state determination unit 100 calculates the response time R (step S105). The brightness calculation unit 140 of the state determination unit 100 calculates an estimated value D of brightness (step S106).

状態判定部１００の長期平均算出部１５０は、明るさ長期平均Ｄ’を算出する（ステップＳ１０７）。状態判定部１００の判定部１８０は、各閾値判定を実行する（ステップＳ１０８）。ステップＳ１０８において、判定部１８０は、長期平均判定処理、応答時間判定処理、フィラー明るさ判定処理、応答明るさ判定処理をそれぞれ実行する。 The long-term average calculator 150 of the state determination unit 100 calculates the long-term average brightness D' (step S107). The determination unit 180 of the state determination unit 100 executes each threshold determination (step S108). In step S108, the determination unit 180 executes a long-term average determination process, a response time determination process, a filler brightness determination process, and a response brightness determination process.

判定部１８０は、発話区間が「判断迷い区間」であるか否かを判定する（ステップＳ１０９）。判定部１８０は、発話区間が「判断迷い区間」である場合には（ステップＳ１０９，Ｙｅｓ）、判断迷い区間の音声情報を記憶部４０に格納する（ステップＳ１１０）。一方、判定部１８０は、発話区間が「判断迷い区間」でない場合には（ステップＳ１０９，Ｎｏ）、ステップＳ１１１に移行する。 The determination unit 180 determines whether the utterance segment is a hesitation segment (step S109). If the utterance segment is a hesitation segment (step S109, Yes), the determination unit 180 stores the speech information of the hesitation segment in the storage unit 40 (step S110). On the other hand, if the utterance segment is not a hesitation segment (step S109, No), the determination unit 180 proceeds to step S111.

状態判定部１００は、次の会話がある場合には（ステップＳ１１１，Ｙｅｓ）、ステップＳ１０３に移行する。状態判定部１００は、次の会話がない場合には（ステップＳ１１１，Ｎｏ）、処理を終了する。 If there is a next conversation (step S111, Yes), the state determination unit 100 proceeds to step S103. If there is no next conversation (step S111, No), the state determination unit 100 terminates the process.

次に、本実施例１に係る音声処理装置１の効果について説明する。音声処理装置１は、話者の発話区間からフィラー区間を特定し、特定したフィラー区間の明るさの推定値が閾値未満である場合に、フィラー区間を判断迷い区間として判定する。これにより、ユーザが迷っている区間を正確に推定することができる。 Next, the effects of the speech processing device 1 according to the first embodiment will be described. The speech processing device 1 identifies filler segments among a speaker's utterance segments and, when the estimated brightness of an identified filler segment is less than a threshold, determines that the filler segment is a hesitation segment. This makes it possible to accurately estimate the segments in which the user is hesitating.

また、音声処理装置１は、フィラー区間の明るさの推定値の他に、応答区間の明るさの推定値、応答時間、長期平均の判定結果の組合せを基にして、フィラー区間が判断迷い区間であるか否かを判定することで、判定精度を向上させることができる。 In addition to the estimated value of the brightness of the filler section, the speech processing device 1 determines whether the filler section is an uncertain section based on a combination of the estimated value of the brightness of the response section, the response time, and the determination result of the long-term average. By determining whether or not, the determination accuracy can be improved.

なお、話者の声の明るさは、声の高さと関係があるといわれており、声が高いと明るいと感じる。発明者は、異なる話者に関して、フィラー区間における音声情報の周波数の中央値と、フィラー区間の状態との関係に関して実験を行った。図７は、実験結果の一例を示す図である。図７に示すように、全体として（一部例外を除けば）、フィラー区間の状態が「記憶操作（判断迷い区間でない）」の場合と比較して、状態が「判断迷い区間」の周波数が低くなっており、話者の声の明るさが「暗い」と、フィラー区間は「判断迷い区間」であると言える。 The brightness of a speaker's voice is said to be related to its pitch: a higher voice is perceived as brighter. The inventors conducted an experiment, across different speakers, on the relationship between the median frequency of the speech information in a filler segment and the state of that filler segment. FIG. 7 is a diagram showing an example of the experimental results. As shown in FIG. 7, overall (with some exceptions), the frequency is lower when the state of the filler segment is "hesitation" than when it is "memory operation (not a hesitation segment)". Hence, when the speaker's voice is "dark", the filler segment can be regarded as a hesitation segment.

図８は、音声情報の周波数の推移を説明するための図である。図８のグラフＧ４，Ｇ５の横軸は時間軸であり、縦軸は周波数（声の高さ）に対応する軸である。たとえば、グラフＧ４において、フィラー区間Ｔ３１の状態は「記憶操作（判断迷い区間でない）」である。このフィラー区間Ｔ３１の周波数は、２００～２５０Ｈｚ付近であり、明るい声である。一方、グラフＧ５において、フィラー区間Ｔ３２の状態は「判断迷い区間」である。このフィラー区間Ｔ３２の周波数は、１５０～２００Ｈｚ付近であり、暗い声である。なお、フィラー区間Ｔ３２に続く発話区間Ｔ３３の周波数は、２００～３００Ｈｚ付近の明るい声である。 FIG. 8 is a diagram for explaining how the frequency of speech information changes over time. The horizontal axis of graphs G4 and G5 in FIG. 8 is the time axis, and the vertical axis corresponds to frequency (pitch of the voice). For example, in graph G4, the state of filler segment T31 is "memory operation (not a hesitation segment)". The frequency in filler segment T31 is around 200 to 250 Hz, which is a bright voice. On the other hand, in graph G5, the state of filler segment T32 is "hesitation". The frequency in filler segment T32 is around 150 to 200 Hz, which is a dark voice. The utterance segment T33 that follows filler segment T32 has a frequency of around 200 to 300 Hz, which is a bright voice.

図９は、本実施例２に係るシステムの構成を示す図である。図９に示すように、このシステムは、音声処理装置２と、マイク１０ａ，１０ｂと、収録機器３００とを有する。マイク１０ａ，１０ｂは、収録機器３００に接続される。収録機器３００は、ネットワーク５を介して、クラウド上の音声処理装置２に接続される。図示を省略するが、音声処理装置２は、複数のサーバによって構成されていてもよい。 FIG. 9 is a diagram showing the configuration of a system according to the second embodiment. As shown in FIG. 9, this system has a speech processing device 2, microphones 10a and 10b, and a recording device 300. The microphones 10a and 10b are connected to the recording device 300. The recording device 300 is connected via a network 5 to the speech processing device 2 on the cloud. Although not illustrated, the speech processing device 2 may be configured from a plurality of servers.

マイク１０ａは、ユーザＵ１０１の音声を集音するマイクである。マイク１０ａは、集音した音声情報を、収録機器３００に出力する。マイク１０ｂは、ユーザＵ１０２の音声を集音するマイクである。マイク１０ｂは、集音した音声情報を、収録機器３００に出力する。たとえば、ユーザＵ１０１は、第１話者に対応し、ユーザＵ１０２は、第２話者に対応する。以下の説明では適宜、ユーザＵ１０１の音声情報を「第１音声情報」と表記し、ユーザＵ１０２の音声情報を「第２音声情報」と表記する。 The microphone 10a collects the voice of the user U101 and outputs the collected speech information to the recording device 300. The microphone 10b collects the voice of the user U102 and outputs the collected speech information to the recording device 300. For example, the user U101 corresponds to the first speaker and the user U102 corresponds to the second speaker. In the following description, the speech information of the user U101 is referred to as "first speech information" and the speech information of the user U102 as "second speech information" where appropriate.

収録機器３００は、第１音声情報および第２音声情報を収録する装置である。収録機器３００は、第１音声情報および第２音声情報を音声ファイル化して、音声処理装置２に送信する。 The recording device 300 is a device that records the first audio information and the second audio information. The recording device 300 converts the first audio information and the second audio information into an audio file and transmits the audio file to the audio processing device 2 .

図１０は、本実施例２に係る収録機器の構成を示す機能ブロック図である。図１０に示すように、この収録機器３００は、マイク１０ａ，１０ｂに接続される。また、収録機器３００は、ＡＤ変換部３１０ａ，３１０ｂ、音声ファイル化部３２０、送信部３３０を有する。ＡＤ変換部３１０ａ，３１０ｂ、音声ファイル化部３２０、送信部３３０の各処理部は、例えば、ＣＰＵやＭＰＵ等により実現される。また、各処理部は、例えば、ＡＳＩＣやＦＰＧＡ等の集積回路により実現されてもよい。 FIG. 10 is a functional block diagram showing the configuration of the recording device according to the second embodiment. As shown in FIG. 10, this recording device 300 is connected to microphones 10a and 10b. The recording device 300 also has AD converters 310 a and 310 b , an audio file generator 320 and a transmitter 330 . Each of the AD converters 310a and 310b, the audio file generator 320, and the transmitter 330 is realized by, for example, a CPU, MPU, or the like. Also, each processing unit may be implemented by an integrated circuit such as an ASIC or FPGA, for example.

ＡＤ変換部３１０ａは、マイク１０ａから入力される第１音声情報を、アナログ信号からデジタル信号に変換するＡＤ変換回路である。ＡＤ変換部３１０ａは、デジタル信号に変換した、第１音声情報を音声ファイル化部３２０に出力する。以下の説明では、デジタル信号に変換した第１音声情報を単に、第１音声情報と表記する。 The AD conversion unit 310a is an AD conversion circuit that converts the first audio information input from the microphone 10a from an analog signal to a digital signal. The AD converting section 310 a outputs the first audio information converted into a digital signal to the audio filing section 320 . In the following description, the first audio information converted into a digital signal is simply referred to as first audio information.

ＡＤ変換部３１０ｂは、マイク１０ｂから入力される第２音声情報を、アナログ信号からデジタル信号に変換するＡＤ変換回路である。ＡＤ変換部３１０ｂは、デジタル信号に変換した、第２音声情報を音声ファイル化部３２０に出力する。以下の説明では、デジタル信号に変換した第２音声情報を単に、第２音声情報と表記する。 The AD conversion unit 310b is an AD conversion circuit that converts the second audio information input from the microphone 10b from an analog signal to a digital signal. The AD conversion unit 310b outputs the second audio information converted into a digital signal to the audio filing unit 320. In the following description, the second audio information converted into a digital signal is simply referred to as second audio information.

音声ファイル化部３２０は、ＡＤ変換部３１０ａから第１音声情報を取得し、取得した第１音声情報を音声ファイル化する。音声ファイル化部３２０は、音声ファイル化した第１音声情報を、送信部３３０に出力する。また、音声ファイル化部３２０は、ＡＤ変換部３１０ｂから第２音声情報を取得し、取得した第２音声情報を音声ファイル化する。音声ファイル化部３２０は、音声ファイル化した第２音声情報を、送信部３３０に出力する。 The audio filing unit 320 acquires the first audio information from the AD converting unit 310a and creates an audio file from the acquired first audio information. Audio filing unit 320 outputs the audio filed first audio information to transmission unit 330 . In addition, the audio filing unit 320 acquires the second audio information from the AD conversion unit 310b and forms the acquired second audio information into an audio file. The voice filing unit 320 outputs the voice filed second voice information to the transmission unit 330 .

送信部３３０は、音声ファイル化された第１音声情報を、ネットワーク５を介して、音声処理装置２に送信する。また、送信部３３０は、音声ファイル化された第２音声情報を、ネットワーク５を介して、音声処理装置２に送信する。 The transmission unit 330 transmits the first audio information converted into an audio file to the audio processing device 2 via the network 5 . Also, the transmission unit 330 transmits the second audio information converted into an audio file to the audio processing device 2 via the network 5 .

音声処理装置２は、収録機器３００から第１音声情報および第２音声情報の音声ファイルを受信する。音声処理装置２は、音声ファイルに含まれる各音声情報を基にして、ユーザＵ１０２の発話区間から、判断迷い区間を判定する。音声処理装置２は、判断迷い区間の音声情報を記憶装置に格納する。 The audio processing device 2 receives audio files of the first audio information and the second audio information from the recording device 300 . The speech processing device 2 determines a doubtful judgment section from the utterance section of the user U102 based on each piece of sound information included in the sound file. The speech processing device 2 stores the speech information of the doubtful decision section in the storage device.

図１１は、本実施例２に係る音声処理装置の構成を示す機能ブロック図である。図１１に示すように、この音声処理装置２は、受信部２１、前処理部２２、記憶部２３、状態判定部２００を有する。受信部２１、前処理部２２、記憶部２３、状態判定部２００の各処理部は、例えば、ＣＰＵやＭＰＵ等により実現される。また、各処理部は、例えば、ＡＳＩＣやＦＰＧＡ等の集積回路により実現されてもよい。 FIG. 11 is a functional block diagram showing the configuration of the speech processing device according to the second embodiment. As shown in FIG. 11 , this speech processing device 2 has a receiving section 21 , a preprocessing section 22 , a storage section 23 and a state determination section 200 . Each processing unit of the receiving unit 21, the preprocessing unit 22, the storage unit 23, and the state determination unit 200 is realized by, for example, a CPU, an MPU, or the like. Also, each processing unit may be implemented by an integrated circuit such as an ASIC or FPGA, for example.

受信部２１は、ネットワーク５を介して、収録機器３００から音声ファイル化された第１音声情報および第２音声情報を受信する処理部である。以下の説明では、音声ファイル化された第１音声情報および第２音声情報を単に、第１音声情報および第２音声情報と表記する。受信部２１は、第１音声情報および第２音声情報を、前処理部２２に出力する。 The receiving unit 21 is a processing unit that receives the first audio information and the second audio information converted into audio files from the recording device 300 via the network 5 . In the following description, the first audio information and the second audio information converted into audio files are simply referred to as the first audio information and the second audio information. The receiving section 21 outputs the first audio information and the second audio information to the preprocessing section 22 .

前処理部２２は、第１音声情報および第２音声情報に対して各種の前処理を実行し、前処理を行った第１音声情報および第２音声情報を、状態判定部２００に出力する処理部である。たとえば、図９に示したシステムでは、マイク１０ａ，１０ｂに、ユーザＵ１０１，Ｕ１０２双方の音声が集音される場合がある。このため、前処理部２２は、第１音声情報に、ユーザＵ１０１の音声のみが含まれるように、第１音声情報から、ユーザＵ１０２の音声を取り除く前処理を行う。前処理部２２は、第２音声情報に、ユーザＵ１０２の音声のみが含まれるように、第２音声情報から、ユーザＵ１０１の音声を取り除く前処理を行う。 The preprocessing unit 22 is a processing unit that performs various kinds of preprocessing on the first audio information and the second audio information and outputs the preprocessed first and second audio information to the state determination unit 200. For example, in the system shown in FIG. 9, each of the microphones 10a and 10b may pick up the voices of both users U101 and U102. The preprocessing unit 22 therefore removes the voice of the user U102 from the first audio information so that the first audio information contains only the voice of the user U101. Likewise, the preprocessing unit 22 removes the voice of the user U101 from the second audio information so that the second audio information contains only the voice of the user U102.

状態判定部２００は、第１音声情報および第２音声情報を取得して、発話区間を検出し、各発話区間からフィラー区間を特定する。状態判定部２００は、特定したフィラー区間に含まれる音声情報の特徴量を基にして、フィラー区間が判断迷い区間であるか否かを判定する。状態判定部２００は、フィラー区間が判断迷い区間である場合には、判断迷い区間の音声情報を、記憶部２３に格納する。本実施例では、第２音声情報の発話区間から、フィラー区間を特定し、判断迷い区間であるか否かを判定する場合について説明するが、これに限定されるものではない。状態判定部２００は、第１音声情報の発話区間から、フィラー区間を特定し、判断迷い区間であるか否かを判定してもよい。 The state determination unit 200 acquires the first audio information and the second audio information, detects utterance sections, and identifies filler sections among the utterance sections. Based on the feature amounts of the audio information contained in an identified filler section, the state determination unit 200 determines whether the filler section is a hesitation section. When it is, the state determination unit 200 stores the audio information of the hesitation section in the storage unit 23. This embodiment describes the case in which filler sections are identified from the utterance sections of the second audio information and tested for hesitation, but the invention is not limited to this case. The state determination unit 200 may instead identify filler sections from the utterance sections of the first audio information and determine whether each is a hesitation section.

たとえば、状態判定部２００は、第２音声情報に対して、短時間離散フーリエ変換を実行することで、第２音声情報を入力スペクトルに変換する。状態判定部２００は、入力スペクトルに関する特徴量を用いて、発話区間の明るさを判定する。 For example, the state determination unit 200 converts the second audio information into an input spectrum by performing a short-time discrete Fourier transform on the second audio information. The state determination section 200 determines the brightness of the speech period using the feature amount related to the input spectrum.

記憶部２３は、判断迷い区間の音声情報を記憶する記憶装置である。記憶部２３は、たとえば、ＲＡＭ、ＲＯＭ、フラッシュメモリなどの半導体メモリ素子などの記憶装置に対応する。 The storage unit 23 is a storage device that stores the audio information of hesitation sections. The storage unit 23 corresponds to, for example, a storage device such as a semiconductor memory element, for instance a RAM, a ROM, or a flash memory.

図１２は、本実施例２に係る状態判定部の構成を示す機能ブロック図である。図１２に示すように、この状態判定部２００は、発話区間検出部２１０ａ，２１０ｂと、特定部２２０と、応答時間算出部２３０と、明るさ算出部２４０と、判定部２５０とを有する。 FIG. 12 is a functional block diagram showing the configuration of the state determination unit according to the second embodiment. As shown in FIG. 12 , the state determination section 200 has speech period detection sections 210 a and 210 b , an identification section 220 , a response time calculation section 230 , a brightness calculation section 240 and a determination section 250 .

発話区間検出部２１０ａは、第１話者（ユーザＵ１０１）の第１音声情報の入力を受け付け、第１音声情報のパワーの小さい無音区間に挟まれた区間を発話区間として算出する。発話区間検出部２１０ａは、第１音声情報の発話区間の情報を、応答時間算出部２３０に出力する。発話区間検出部２１０ａが、発話区間を算出する処理は、発話区間検出部１１０ａの処理と同様である。 The utterance section detection unit 210a receives the first audio information of the first speaker (user U101) and calculates, as an utterance section, each section sandwiched between low-power silent sections of the first audio information. The utterance section detection unit 210a outputs information on the utterance sections of the first audio information to the response time calculation unit 230. The utterance section detection unit 210a calculates utterance sections in the same way as the utterance section detection unit 110a.

発話区間検出部２１０ｂは、第２話者（ユーザＵ１０２）の第２音声情報の入力を受け付け、第２音声情報のパワーの小さい無音区間に挟まれた区間を発話区間として算出する。発話区間検出部２１０ｂは、第２音声情報の発話区間の情報を、特定部２２０、明るさ算出部２４０に出力する。発話区間検出部２１０ｂが、発話区間を算出する処理は、発話区間検出部１１０ｂの処理と同様である。 The utterance period detection unit 210b receives input of the second voice information of the second speaker (user U102), and calculates a period sandwiched between silent periods of low power of the second voice information as a utterance period. The speech period detection unit 210 b outputs information on the speech period of the second audio information to the identification unit 220 and the brightness calculation unit 240 . The process of calculating the utterance period by the utterance period detection section 210b is the same as the process of the utterance period detection section 110b.
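The power-based detection performed by the utterance section detection units 210a and 210b can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `detect_speech_intervals`, the fixed frame length, and the single power threshold are hypothetical and are not taken from the specification, which defers the details to the first embodiment.

```python
def detect_speech_intervals(samples, frame_len, power_threshold):
    """Return (start_frame, end_frame) pairs, end exclusive.

    A frame counts as speech when its mean power is at or above
    power_threshold (hypothetical parameter); consecutive speech frames
    are merged into one utterance section.
    """
    intervals = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        if power >= power_threshold:
            if start is None:
                start = i  # a silent frame (or the signal start) precedes this frame
        elif start is not None:
            intervals.append((start, i))  # silence closes the section
            start = None
    if start is not None:
        # Simplification: speech still running at the end of the signal is closed there.
        intervals.append((start, n_frames))
    return intervals
```

For example, with two-sample frames whose mean powers are 0, 1, 1, 0, 1 and a threshold of 0.5, the function reports the frame ranges (1, 3) and (4, 5).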

特定部２２０は、第２音声情報および第２話者の発話区間の情報を受け付け、発話区間がフィラー区間であるか否かを判定する処理部である。特定部２２０は、判定結果を応答時間算出部２３０および判定部２５０に出力する。特定部２２０が、フィラー区間であるか否かを判定する処理は、実施例１で説明した特定部１２０の処理と同様である。 The identification unit 220 is a processing unit that receives the second audio information and information on the utterance sections of the second speaker and determines whether each utterance section is a filler section. The identification unit 220 outputs the determination result to the response time calculation unit 230 and the determination unit 250. The identification unit 220 determines whether a section is a filler section in the same way as the identification unit 120 described in the first embodiment.

応答時間算出部２３０は、第１話者による発話区間が終了してから、第２話者による発話区間（フィラー区間ではない発話区間）が開始されるまでの応答時間を算出する処理部である。応答時間算出部２３０は、応答時間の情報を、判定部２５０に出力する。応答時間算出部２３０が、応答時間を算出する処理は、実施例１で説明した応答時間算出部１３０の処理と同様である。 The response time calculation unit 230 is a processing unit that calculates the response time from the end of the utterance period by the first speaker to the start of the utterance period (not the filler period) by the second speaker. . Response time calculation section 230 outputs response time information to determination section 250 . The process of calculating the response time by the response time calculator 230 is the same as the process of the response time calculator 130 described in the first embodiment.
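The response time described above can be sketched as follows. This is a minimal sketch, not the procedure of the first embodiment: the function name `response_time`, the list-based inputs, and the `None` return value when no answer is found are assumptions made for illustration.

```python
def response_time(first_end, second_intervals, is_filler):
    """Time from the end of speaker 1's utterance to speaker 2's first non-filler utterance.

    first_end        -- end time of the first speaker's utterance section (seconds)
    second_intervals -- list of (start, end) times for the second speaker, in time order
    is_filler        -- list of booleans, True where the matching section is a filler section
    """
    for (start, _end), filler in zip(second_intervals, is_filler):
        if start >= first_end and not filler:
            return start - first_end
    return None  # hypothetical convention: no non-filler reply was found
```

With a question ending at 2.0 s, a filler at 2.5 s to 3.0 s, and an answer starting at 3.5 s, the sketch returns 1.5 s, because the filler section is skipped when looking for the start of the answer.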

明るさ算出部２４０は、第２音声情報および第２話者の発話区間の情報を受け付け、第２話者の発話区間の明るさの特徴量を算出する処理部である。明るさ算出部２４０は、各発話区間の明るさの特徴量の情報を、判定部２５０に出力する。 The brightness calculation unit 240 is a processing unit that receives the second voice information and the information of the speech period of the second speaker, and calculates the feature amount of the brightness of the speech period of the second speaker. Brightness calculation section 240 outputs information on the brightness feature amount of each speech period to determination section 250 .

以下において、明るさ算出部２４０の処理の一例について説明する。まず、明るさ算出部２４０は、第２話者の発話区間Ｔ２ｓ（ｋ２）～Ｔ２ｅ（ｋ２）に含まれる第２音声情報ｘ２（ｎ）に対して、短時間離散フーリエ変換を実行することで、第２音声情報の入力スペクトルＸ２（ｌ）を生成する。 An example of the processing of the brightness calculation unit 240 is described below. First, the brightness calculation unit 240 generates the input spectrum X2(l) of the second audio information by applying a short-time discrete Fourier transform to the second audio information x2(n) contained in the utterance section T2s(k2) to T2e(k2) of the second speaker.

明るさ算出部２４０は、文献（F.Eyben et al.,“The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing,”in IEEE Transactions on Affective Computing, vol. 7, no. 2,pp.190-202,April-June 1 2016.）等に記載された方法に基づいて、入力スペクトルに関する複数の特徴量の平均、分散、中央値などの統計量を算出し、特徴ベクトルＶ（ｋ２、ｓ）の各要素として格納する。ここで、特徴ベクトルＶの「ｓ」は、特徴量の要素数を示すものである。 Based on the method described in the literature (F. Eyben et al., "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing," in IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190-202, April-June 1 2016) and similar sources, the brightness calculation unit 240 calculates statistics such as the mean, variance, and median of a plurality of feature amounts related to the input spectrum and stores them as the elements of the feature vector V(k2, s). Here, "s" in the feature vector V denotes the number of elements of the feature amounts.

入力スペクトルに関する特徴量には、スペクトルの形状に関する特徴量と、音量に関する特徴量と、話速に関する特徴量とが含まれる。スペクトルの形状に関する特徴量には「Alpha Ratio、Hammarberg Index、Spectral Slope 0-500Hz、Spectral Slope 500-1500Hz、Formant 1,2, and relative energy、Harmonic difference H1-H2、Harmonic difference H1-A3」等が含まれる。音量に関する特徴量には「Loudness、Rate of loudness peaks」等が含まれる。話速に関する特徴量には「Number of continuous voiced regions per second (pseudo syllable rate)」等が含まれる。 The feature amounts related to the input spectrum include feature amounts related to the shape of the spectrum, feature amounts related to loudness, and feature amounts related to speech rate. The feature amounts related to the shape of the spectrum include "Alpha Ratio, Hammarberg Index, Spectral Slope 0-500Hz, Spectral Slope 500-1500Hz, Formant 1,2, and relative energy, Harmonic difference H1-H2, Harmonic difference H1-A3" and the like. The feature amounts related to loudness include "Loudness, Rate of loudness peaks" and the like. The feature amounts related to speech rate include "Number of continuous voiced regions per second (pseudo syllable rate)" and the like.

明るさ算出部２４０は、上記の入力スペクトルに関する複数の特徴量から、特徴ベクトルＶ（ｋ２、ｓ）を生成しても良いし、複数の特徴量の一部から、特徴ベクトルＶ（ｋ２、ｓ）を生成してもよい。明るさ算出部２４０は、話速に関して、音声認識を併用し、１秒当たりの文字数を、話速として算出してもよい。明るさ算出部２４０は、第２話者の各発話区間について、特徴ベクトルを算出し、判定部２５０に出力する。 The brightness calculation unit 240 may generate the feature vector V(k2, s) from all of the above feature amounts related to the input spectrum, or from only some of them. For speech rate, the brightness calculation unit 240 may additionally use speech recognition and calculate the number of characters per second as the speech rate. The brightness calculation unit 240 calculates a feature vector for each utterance section of the second speaker and outputs it to the determination unit 250.
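The step of turning per-frame feature amounts into a feature vector of statistics can be sketched as follows. This is a sketch under assumptions: the function name `build_feature_vector`, the dictionary input, the ordering of the elements, and the use of the population variance are illustrative choices that the specification does not fix, and the extraction of the GeMAPS feature amounts themselves is not shown.

```python
import statistics


def build_feature_vector(frame_features):
    """Flatten per-frame feature tracks into [mean, variance, median, ...].

    frame_features -- dict mapping a feature name (for example "loudness")
                      to the list of its per-frame values in one utterance section.
    Features are emitted in sorted name order so the layout is reproducible.
    """
    vector = []
    for name in sorted(frame_features):
        values = frame_features[name]
        vector.append(statistics.mean(values))
        vector.append(statistics.pvariance(values))  # assumption: population variance
        vector.append(statistics.median(values))
    return vector
```

Using only a subset of the feature amounts, as the text permits, corresponds to passing a dictionary with fewer keys.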

なお、明るさ算出部２４０は、次のような事前処理を行う。明るさ算出部２４０は、判定部２５０から教師データを受け付けた場合に、教師データに対して上記の処理を実行することで、特徴ベクトルＶを算出する。明るさ算出部２４０は、教師データに対して算出した特徴ベクトルＶを、判定部２５０に出力する。教師データに対応する特徴ベクトルＶは、閾値を決定する場合に用いられる。 In addition, the brightness calculation unit 240 performs the following pre-processing. When receiving teacher data from the determination unit 250, the brightness calculation unit 240 calculates the feature vector V by performing the above processing on the teacher data. The brightness calculation section 240 outputs the feature vector V calculated for the teacher data to the determination section 250 . A feature vector V corresponding to the training data is used when determining the threshold.

判定部２５０は、各発話区間に関するフィラー区間の有無Ｆ、応答時間Ｒ、特徴ベクトルＶを取得し、下記の処理を行うことで、フィラー区間が「判断迷い区間であるかいなか」を判定する。判定部２５０は、判断迷い区間であると判定した場合には、判断迷い区間の第２音声情報を、記憶部２３に格納する。 The determination unit 250 acquires, for each utterance section, the filler presence flag F, the response time R, and the feature vector V, and determines whether a filler section is a hesitation section by performing the processing described below. When it determines that the filler section is a hesitation section, the determination unit 250 stores the second audio information of that section in the storage unit 23.

まず、判定部２５０の事前処理について説明する。かかる事前処理を実行することで、特徴ベクトルＶ（ｋ２、ｓ）を「明」または「暗」に分類する閾値ＴＨ＿Ｖ（ｓ）を準備する。判定部２５０は、明るさの判定結果が「明」と判断されたフィラー区間の音声情報と、応答区間の音声情報との教師データを事前に収集しておく。判定部２５０は、教師データを、明るさ算出部２４０に出力し、教師データに対応する特徴ベクトルＶ（ｋ２、ｓ）を取得する。判定部２５０は、特徴ベクトルＶ（ｋ２、ｓ）と正解ラベルとの組をサポートベクターマシンに入力し、「明」または「暗」の２クラス分類を実行する。判定部２５０は、「明」または「暗」の２クラス分類の境界面を、明るさ閾値ＴＨ＿Ｖ（ｓ）とする。応答時間の閾値ＴＨ＿Ｒの値は、予め設定されているものとする。 First, the preprocessing of the determination unit 250 is described. This preprocessing prepares the threshold TH_V(s) used to classify a feature vector V(k2, s) as "bright" or "dark". The determination unit 250 collects in advance teacher data made up of the audio information of filler sections whose brightness has been judged "bright" and the audio information of response sections. The determination unit 250 outputs the teacher data to the brightness calculation unit 240 and acquires the feature vectors V(k2, s) corresponding to the teacher data. The determination unit 250 inputs pairs of a feature vector V(k2, s) and its correct label to a support vector machine and performs two-class classification into "bright" and "dark". The determination unit 250 uses the decision boundary of this two-class classification as the brightness threshold TH_V(s). The value of the response time threshold TH_R is assumed to be set in advance.

判定部２５０が「判断迷い区間であるかいなか」を判定する処理について説明する。判定部２５０は、フィラー区間の特徴ベクトルＶ（ｋ２、ｓ）と、閾値ＴＨ＿Ｖ（ｓ）とを比較し、特徴ベクトルＶ（ｋ２、ｓ）が、閾値ＴＨ＿Ｖ（ｓ）以上である場合に、フィラー区間が「明」であると判定する。判定部２５０は、フィラー区間の特徴ベクトルＶ（ｋ２、ｓ）と、閾値ＴＨ＿Ｖ（ｓ）とを比較し、特徴ベクトルＶ（ｋ２、ｓ）が、閾値ＴＨ＿Ｖ（ｓ）未満である場合に、フィラー区間が「暗」であると判定する。なお、判定部２５０は、フィラー区間の特徴ベクトルＶ（ｋ２、ｓ）を、学習済みのサポートベクターマシンに入力して、「明」または「暗」を判定してもよい。 Next, the processing by which the determination unit 250 determines whether a filler section is a hesitation section is described. The determination unit 250 compares the feature vector V(k2, s) of the filler section with the threshold TH_V(s) and determines that the filler section is "bright" when V(k2, s) is equal to or greater than TH_V(s). The determination unit 250 determines that the filler section is "dark" when V(k2, s) is less than TH_V(s). Alternatively, the determination unit 250 may input the feature vector V(k2, s) of the filler section to the trained support vector machine to determine "bright" or "dark".

判定部２５０は、応答区間の特徴ベクトルＶ（ｋ２、ｓ）と、閾値ＴＨ＿Ｖ（ｓ）とを比較し、特徴ベクトルＶ（ｋ２、ｓ）が、閾値ＴＨ＿Ｖ（ｓ）以上である場合に、応答区間が「明」であると判定する。判定部２５０は、応答区間の特徴ベクトルＶ（ｋ２、ｓ）と、閾値ＴＨ＿Ｖ（ｓ）とを比較し、特徴ベクトルＶ（ｋ２、ｓ）が、閾値ＴＨ＿Ｖ（ｓ）未満である場合に、応答区間が「暗」であると判定する。なお、判定部２５０は、応答区間の特徴ベクトルＶ（ｋ２、ｓ）を、学習済みのサポートベクターマシンに入力して、「明」または「暗」を判定してもよい。 The determination unit 250 compares the feature vector V(k2, s) of the response section with the threshold TH_V(s) and determines that the response section is "bright" when V(k2, s) is equal to or greater than TH_V(s). The determination unit 250 determines that the response section is "dark" when V(k2, s) is less than TH_V(s). Alternatively, the determination unit 250 may input the feature vector V(k2, s) of the response section to the trained support vector machine to determine "bright" or "dark".
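Comparing a feature vector against a decision boundary obtained from a support vector machine can be sketched as follows. The sketch assumes a linear boundary, which the specification does not state; the names `classify_brightness`, `weights`, and `bias` are hypothetical, and the weights are taken as already learned rather than trained here.

```python
def classify_brightness(feature_vector, weights, bias):
    """Label a section "bright" or "dark" from its feature vector.

    Assumes a linear decision boundary w . v + b = 0 (an illustrative
    assumption); vectors on or above the boundary are labeled "bright".
    """
    score = sum(w * v for w, v in zip(weights, feature_vector)) + bias
    return "bright" if score >= 0 else "dark"
```

The same function would be applied once to the filler section and once to the response section that follows it, giving the two brightness labels used in the final decision.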

また、判定部２５０は、実施例１で説明した判定部１８０と同様にして、フィラー区間と判定された（Ｆ（ｋ２）＝１）の発話区間番号「ｋ２」の発話区間に対応する応答時間Ｒ（ｋ２）と、閾値ＴＨ＿Ｒとを比較し、応答時間Ｒ（ｋ２）が「長」か「短」かを判定する。 In addition, similarly to the determination unit 180 described in the first embodiment, the determination unit 250 determines the response time corresponding to the utterance segment with the utterance segment number “k2” (F(k2)=1) determined as the filler segment. R(k2) is compared with the threshold TH_R to determine whether the response time R(k2) is "long" or "short".

判定部２５０は、フィラー区間の判定結果が「暗」、かつ、応答区間の判定結果が「明」、かつ、応答時間が「長」である場合において、フィラー区間が「判断迷い区間である」と判定する。これに対して、判定部２５０は、フィラー区間の判定結果が「明」、応答区間の判定結果が「暗」、または、応答時間が「短」である場合において、フィラー区間が「判断迷い区間でない」と判定する。 The determination unit 250 determines that the filler section is a hesitation section when the filler section is judged "dark", the response section is judged "bright", and the response time is judged "long". Conversely, the determination unit 250 determines that the filler section is not a hesitation section when the filler section is judged "bright", the response section is judged "dark", or the response time is judged "short".
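The final decision of the determination unit 250 reduces to a three-way conjunction, sketched below. The function name `is_hesitation_section` is hypothetical, and treating a response time equal to the threshold TH_R as "long" is an assumption; the text here does not say which side of the threshold counts as long. Any combination that does not satisfy all three conditions is treated as not being a hesitation section.

```python
def is_hesitation_section(filler_label, response_label, response_time, th_r):
    """Return True when a filler section is judged to be a hesitation section.

    filler_label   -- "bright" or "dark" for the filler section
    response_label -- "bright" or "dark" for the response section that follows
    response_time  -- response time R(k2) in seconds
    th_r           -- preset response time threshold TH_R in seconds
    """
    response_is_long = response_time >= th_r  # assumption: equality counts as "long"
    return (
        filler_label == "dark"
        and response_label == "bright"
        and response_is_long
    )
```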

次に、本実施例２の音声処理装置２の状態判定部２００の処理手順の一例について説明する。図１３は、本実施例２に係る音声処理装置の処理手順を示すフローチャートである。図１３に示すように、この音声処理装置２の状態判定部２００は、第１音声情報および第２音声情報を取得する（ステップＳ２０１）。 Next, an example of the processing procedure of the state determination unit 200 of the speech processing device 2 of the second embodiment will be described. FIG. 13 is a flow chart showing the processing procedure of the speech processing device according to the second embodiment. As shown in FIG. 13, the state determination unit 200 of the speech processing device 2 acquires first speech information and second speech information (step S201).

状態判定部２００の発話区間検出部２１０ａは、第１話者の発話区間を検出し、発話区間検出部２１０ｂは、第２話者の発話区間を検出する（ステップＳ２０２）。状態判定部２００の特定部２２０は、フィラー区間を検出する（ステップＳ２０３）。状態判定部２００は、フィラー区間が存在しない場合には（ステップＳ２０４，Ｎｏ）、ステップＳ２０９に移行する。 The speech period detection unit 210a of the state determination unit 200 detects the speech period of the first speaker, and the speech period detection unit 210b detects the speech period of the second speaker (step S202). The specifying unit 220 of the state determination unit 200 detects the filler section (step S203). If there is no filler section (step S204, No), the state determination unit 200 proceeds to step S209.

一方、状態判定部２００は、フィラー区間が存在する場合には（ステップＳ２０４，Ｙｅｓ）、ステップＳ２０５に移行する。状態判定部２００の応答時間算出部２３０は、応答時間Ｒを算出する（ステップＳ２０５）。状態判定部２００の明るさ算出部２４０は、明るさの特徴ベクトルを算出する（ステップＳ２０６）。 On the other hand, if a filler section exists (step S204, Yes), the state determination unit 200 proceeds to step S205. The response time calculation unit 230 of the state determination unit 200 calculates the response time R (step S205). The brightness calculation unit 240 of the state determination unit 200 calculates the brightness feature vector (step S206).

判定部２５０は、発話区間が「判断迷い区間」であるか否かを判定する（ステップＳ２０７）。判定部２５０は、発話区間が「判断迷い区間」である場合には（ステップＳ２０７，Ｙｅｓ）、判断迷い区間の音声情報を記憶部２３に格納する（ステップＳ２０８）。一方、判定部２５０は、発話区間が「判断迷い区間」でない場合には（ステップＳ２０７，Ｎｏ）、ステップＳ２０９に移行する。 The determination unit 250 determines whether the utterance section is a hesitation section (step S207). If the utterance section is a hesitation section (step S207, Yes), the determination unit 250 stores the audio information of the hesitation section in the storage unit 23 (step S208). If the utterance section is not a hesitation section (step S207, No), the determination unit 250 proceeds to step S209.

状態判定部２００は、次の会話がある場合には（ステップＳ２０９，Ｙｅｓ）、ステップＳ２０３に移行する。状態判定部２００は、次の会話がない場合には（ステップＳ２０９，Ｎｏ）、処理を終了する。 If there is a next conversation (step S209, Yes), the state determination unit 200 proceeds to step S203. If there is no next conversation (step S209, No), state determination unit 200 ends the process.
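The control flow of steps S203 to S209 can be sketched as a loop over conversations. This is only an outline under assumptions: `process_conversations`, `find_filler`, and `judge` are hypothetical names, and the per-step processing (response time, brightness features, thresholds) is passed in as callables rather than implemented here.

```python
def process_conversations(conversations, find_filler, judge):
    """Collect the filler sections judged to be hesitation sections.

    conversations -- iterable of conversation records (any structure)
    find_filler   -- callable returning a filler section or None     (S203, S204)
    judge         -- callable(conversation, filler) returning a bool (S205 to S207)
    """
    stored = []
    for conversation in conversations:      # S209: repeat while a next conversation exists
        filler = find_filler(conversation)  # S203: detect a filler section
        if filler is None:                  # S204, No: skip to the next conversation
            continue
        if judge(conversation, filler):     # S207, Yes
            stored.append(filler)           # S208: store the hesitation section
    return stored
```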

次に、本実施例２に係る音声処理装置２の効果について説明する。音声処理装置２は、話者の発話区間からフィラー区間を特定し、特定したフィラー区間のスペクトルの特徴量を基にして、フィラー区間を判断迷い区間として判定する。これにより、ユーザが迷っている区間を正確に推定することができる。 Next, the effects of the speech processing device 2 according to the second embodiment are described. The speech processing device 2 identifies filler sections from the speaker's utterance sections and determines, based on the spectral feature amounts of each identified filler section, whether the filler section is a hesitation section. This makes it possible to accurately estimate the sections in which the user is hesitating.

ところで、上述した状態判定部２００の処理は一例であり、その他の処理を行ってもよい。たとえば、明るさ算出部２４０は、明るさの特徴量を、教師ありの機械学習により予め生成してもよい。明るさ算出部２４０は、明るさの判定結果が「明」と判断されたフィラー区間の音声情報と、応答区間の音声情報との教師データを事前に収集しておく。 By the way, the processing of the state determination unit 200 described above is an example, and other processing may be performed. For example, the brightness calculation unit 240 may generate a feature amount of brightness in advance by supervised machine learning. The brightness calculation unit 240 collects in advance teacher data of voice information in the filler section and voice information in the response section for which the brightness determination result is determined to be “bright”.

明るさ算出部２４０は、文献（SoundNet:Learning Sound Representations from Unlabeled Video Yusuf Aytar,Carl Vondrick,Antonio Torralba NIPS 2016）に記載されているＤＮＮモデルに、上記の教師データを入力して分類器を学習する。 The brightness calculation unit 240 inputs the above teacher data to the DNN model described in the literature (SoundNet: Learning Sound Representations from Unlabeled Video Yusuf Aytar, Carl Vondrick, Antonio Torralba NIPS 2016) to learn the classifier. .

明るさ算出部２４０は、かかる分類器を学習しておき、発話区間の音声情報を分類器に入力し、分類器の出力層（最終層）の一つ手前の層から出力される情報を、特徴量ベクトルＷ（ｋ２，ｕ）として算出する。ここで、特徴ベクトルＷの「ｕ」は、特徴ベクトルの要素数を示すものである。明るさ算出部２４０は、特徴量ベクトルＷを、判定部２５０に出力する。 The brightness calculation unit 240 learns such a classifier in advance, inputs the speech information of the utterance period to the classifier, and converts the information output from the layer immediately before the output layer (final layer) of the classifier to It is calculated as a feature amount vector W(k2, u). Here, "u" of the feature vector W indicates the number of elements of the feature vector. Brightness calculation section 240 outputs feature amount vector W to determination section 250 .

なお、判定部２５０は、事前処理を実行し、特徴ベクトルＷ（ｋ２、ｕ）を「明」または「暗」に分類する閾値ＴＨ＿Ｗ（ｕ）を準備する。判定部２５０は、明るさの判定結果が「明」と判断されたフィラー区間の音声情報と、応答区間の音声情報との教師データを事前に収集しておく。判定部２５０は、教師データを、明るさ算出部２４０に出力し、教師データに対応する特徴量ベクトルＷ（ｋ２、ｕ）を取得する。判定部２５０は、特徴量ベクトルＷ（ｋ２、ｕ）と正解ラベルとの組をサポートベクターマシンに入力し、「明」または「暗」の２クラス分類を実行する。判定部２５０は、「明」または「暗」の２クラス分類の境界面を、明るさ閾値ＴＨ＿Ｗ（ｕ）とする。応答時間の閾値ＴＨ＿Ｒの値は、予め設定されているものとする。 The determination unit 250 performs preprocessing to prepare the threshold TH_W(u) used to classify a feature vector W(k2, u) as "bright" or "dark". The determination unit 250 collects in advance teacher data made up of the audio information of filler sections whose brightness has been judged "bright" and the audio information of response sections. The determination unit 250 outputs the teacher data to the brightness calculation unit 240 and acquires the feature vectors W(k2, u) corresponding to the teacher data. The determination unit 250 inputs pairs of a feature vector W(k2, u) and its correct label to a support vector machine and performs two-class classification into "bright" and "dark". The determination unit 250 uses the decision boundary of this two-class classification as the brightness threshold TH_W(u). The value of the response time threshold TH_R is assumed to be set in advance.

判定部２５０は、フィラー区間の特徴量ベクトルＷ（ｋ２、ｕ）と、閾値ＴＨ＿Ｗ（ｕ）とを比較し、特徴量ベクトルＷ（ｋ２、ｕ）が、閾値ＴＨ＿Ｗ（ｕ）以上である場合に、フィラー区間が「明」であると判定する。判定部２５０は、フィラー区間の特徴量ベクトルＷ（ｋ２、ｕ）と、閾値ＴＨ＿Ｗ（ｕ）とを比較し、特徴量ベクトルＷ（ｋ２、ｕ）が、閾値ＴＨ＿Ｗ（ｕ）未満である場合に、フィラー区間が「暗」であると判定する。なお、判定部２５０は、フィラー区間の特徴量ベクトルＷ（ｋ２、ｕ）を、学習済みのサポートベクターマシンに入力して、「明」または「暗」を判定してもよい。 The determination unit 250 compares the feature vector W(k2, u) of the filler section with the threshold TH_W(u) and determines that the filler section is "bright" when W(k2, u) is equal to or greater than TH_W(u). The determination unit 250 determines that the filler section is "dark" when W(k2, u) is less than TH_W(u). Alternatively, the determination unit 250 may input the feature vector W(k2, u) of the filler section to the trained support vector machine to determine "bright" or "dark".

次に、上記実施例に示した音声処理装置１，２と同様の機能を実現するコンピュータのハードウェア構成の一例について説明する。図１４は、音声処理装置と同様の機能を実現するコンピュータのハードウェア構成の一例を示す図である。 Next, an example of the hardware configuration of a computer that implements the same functions as those of the speech processing apparatuses 1 and 2 shown in the above embodiments will be described. FIG. 14 is a diagram showing an example of the hardware configuration of a computer that implements functions similar to those of the audio processing device.

図１４に示すように、コンピュータ３００は、各種演算処理を実行するＣＰＵ３０１と、ユーザからのデータの入力を受け付ける入力装置３０２と、ディスプレイ３０３とを有する。また、コンピュータ３００は、記憶媒体からプログラム等を読み取る読み取り装置３０４と、有線または無線ネットワークを介して収録機器等との間でデータの授受を行うインターフェース装置３０５とを有する。また、コンピュータ３００は、各種情報を一時記憶するＲＡＭ３０６と、ハードディスク装置３０７とを有する。そして、各装置３０１～３０７は、バス３０８に接続される。 As shown in FIG. 14, a computer 300 has a CPU 301 that executes various arithmetic processes, an input device 302 that receives data input from a user, and a display 303 . The computer 300 also has a reading device 304 that reads programs and the like from a storage medium, and an interface device 305 that exchanges data with recording devices and the like via a wired or wireless network. The computer 300 also has a RAM 306 that temporarily stores various information, and a hard disk device 307 . Each device 301 - 307 is then connected to a bus 308 .

ハードディスク装置３０７は、発話区間検出プログラム３０７ａ、特定プログラム３０７ｂ、応答時間算出プログラム３０７ｃ、明るさ算出プログラム３０７ｄ、長期平均算出プログラム３０７ｅを有する。また、ハードディスク装置３０７は、閾値算出プログラム３０７ｆ、判定プログラム３０７ｇを有する。ＣＰＵ３０１は、各プログラム３０７ａ～３０７ｇを読み出してＲＡＭ３０６に展開する。 The hard disk device 307 holds an utterance section detection program 307a, an identification program 307b, a response time calculation program 307c, a brightness calculation program 307d, and a long-term average calculation program 307e. The hard disk device 307 also holds a threshold calculation program 307f and a determination program 307g. The CPU 301 reads the programs 307a to 307g and loads them into the RAM 306.

発話区間検出プログラム３０７ａは、発話区間検出プロセス３０６ａとして機能する。特定プログラム３０７ｂは、特定プロセス３０６ｂとして機能する。応答時間算出プログラム３０７ｃは、応答時間算出プロセス３０６ｃとして機能する。明るさ算出プログラム３０７ｄは、明るさ算出プロセス３０６ｄとして機能する。長期平均算出プログラム３０７ｅは、長期平均算出プロセス３０６ｅとして機能する。閾値算出プログラム３０７ｆは、閾値算出プロセス３０６ｆとして機能する。判定プログラム３０７ｇは、判定プロセス３０６ｇとして機能する。 The utterance section detection program 307a functions as an utterance section detection process 306a. The identification program 307b functions as an identification process 306b. The response time calculation program 307c functions as a response time calculation process 306c. The brightness calculation program 307d functions as a brightness calculation process 306d. The long-term average calculation program 307e functions as a long-term average calculation process 306e. The threshold calculation program 307f functions as a threshold calculation process 306f. The determination program 307g functions as a determination process 306g.

発話区間検出プロセス３０６ａの処理は、発話区間検出部１１０ａ，１１０ｂ，２１０ａ，２１０ｂの処理に対応する。特定プロセス３０６ｂは、特定部１２０，２２０の処理に対応する。応答時間算出プロセス３０６ｃの処理は、応答時間算出部１３０，２３０の処理に対応する。明るさ算出プロセス３０６ｄは、明るさ算出部１４０，２４０の処理に対応する。長期平均算出プロセス３０６ｅは、長期平均算出部１５０の処理に対応する。閾値算出プロセス３０６ｆは、閾値算出部１６０の処理に対応する。判定プロセス３０６ｇは、判定部１８０，２５０の処理に対応する。 The utterance section detection process 306a corresponds to the processing of the utterance section detection units 110a, 110b, 210a, and 210b. The identification process 306b corresponds to the processing of the identification units 120 and 220. The response time calculation process 306c corresponds to the processing of the response time calculation units 130 and 230. The brightness calculation process 306d corresponds to the processing of the brightness calculation units 140 and 240. The long-term average calculation process 306e corresponds to the processing of the long-term average calculation unit 150. The threshold calculation process 306f corresponds to the processing of the threshold calculation unit 160. The determination process 306g corresponds to the processing of the determination units 180 and 250.

なお、各プログラム３０７ａ～３０７ｇについては、必ずしも最初からハードディスク装置３０７に記憶させておかなくても良い。例えば、コンピュータ３００に挿入されるフレキシブルディスク（ＦＤ）、ＣＤ－ＲＯＭ、ＤＶＤディスク、光磁気ディスク、ＩＣカードなどの「可搬用の物理媒体」に各プログラムを記憶させておく。そして、コンピュータ３００が各プログラム３０７ａ～３０７ｇを読み出して実行するようにしても良い。 The programs 307a to 307g do not necessarily have to be stored in the hard disk device 307 from the beginning. For example, each program may be stored on a "portable physical medium" inserted into the computer 300, such as a flexible disk (FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card. The computer 300 may then read the programs 307a to 307g from the medium and execute them.

以上の各実施例を含む実施形態に関し、さらに以下の付記を開示する。 The following additional remarks are disclosed regarding the embodiments including the above examples.

（付記１）音声情報から複数の発話区間を検出し、
前記複数の発話区間からフィラーを検出した発話区間をフィラー区間として特定し、
前記フィラー区間の音声情報の特徴量を基にして、前記フィラー区間の音声情報が、ユーザが判断に迷っている場合に発話される音声情報であるか否かを判定する
処理をコンピュータに実行させることを特徴とする音声処理プログラム。
(Supplementary Note 1) A speech processing program that causes a computer to execute a process comprising:
detecting a plurality of utterance sections from audio information;
identifying, as a filler section, an utterance section among the plurality of utterance sections in which a filler is detected; and
determining, based on a feature amount of the audio information of the filler section, whether the audio information of the filler section is audio information uttered when a user is unsure of a decision.

（付記２）前記判定する処理は、前記フィラー区間の音声情報の特徴量と、前記フィラー区間に続く前記発話区間の音声情報の特徴量とを基にして、前記フィラー区間の音声情報が、ユーザが判断に迷っている場合に発話される音声情報であるか否かを判定することを特徴とする付記１に記載の音声処理プログラム。 (Supplementary Note 2) The determination process is based on the feature amount of the voice information of the filler section and the feature amount of the voice information of the utterance section following the filler section. The speech processing program according to appendix 1, wherein the speech processing program determines whether or not the speech information is spoken when the judgment is uncertain.

（付記３）前記判定する処理は、第１のユーザの発話区間から第２のユーザの発話区間までの応答時間を更に用いて、前記フィラー区間の音声情報が、前記第２のユーザが判断に迷っている場合に発話される音声情報であるか否かを判定することを特徴とする付記１または２に記載の音声処理プログラム。 (Supplementary Note 3) The speech processing program according to Supplementary Note 1 or 2, wherein the determining further uses a response time from an utterance section of a first user to an utterance section of a second user to determine whether the audio information of the filler section is audio information uttered when the second user is unsure of a decision.

（付記４）前記判定する処理は、前記第２のユーザの各発話区間から算出された各特徴量の平均値を算出し、前記各特徴量の平均値を更に用いて、前記第２のユーザが判断に迷っている場合に発話される音声情報であるか否かを判定することを特徴とする付記３に記載の音声処理プログラム。 (Supplementary Note 4) The speech processing program according to Supplementary Note 3, wherein the determining calculates an average value of the feature amounts calculated from the respective utterance sections of the second user, and further uses the average value of the feature amounts to determine whether the audio information is audio information uttered when the second user is unsure of a decision.

（付記５）前記判定する処理は、前記音声情報の基本周波数を基に算出される分散を、前記特徴量として用いることを特徴とする付記１～４のいずれか一つに記載の音声処理プログラム。 (Appendix 5) The audio processing program according to any one of appendices 1 to 4, wherein the determining process uses a variance calculated based on the fundamental frequency of the audio information as the feature amount. .

（付記６）前記判定する処理は、前記音声情報をスペクトルに変換し、前記スペクトルの特徴を前記特徴量として用いることを特徴とする付記１～４のいずれか一つに記載の音声処理プログラム。 (Appendix 6) The audio processing program according to any one of appendices 1 to 4, wherein the determination process converts the audio information into a spectrum, and uses a feature of the spectrum as the feature amount.

（付記７）前記判定する処理は、前記特徴量と、前記特徴量に対応する音声情報が明るい音声であるか否かの情報とを対応付けた教師データにより学習された分類器に、前記特徴量を入力して得られる結果を基にして、前記フィラー区間の音声情報が、ユーザが判断に迷っている場合に発話される音声情報であるか否かを判定することを特徴とする付記１に記載の音声処理プログラム。 (Supplementary Note 7) The speech processing program according to Supplementary Note 1, wherein the determining inputs the feature amount to a classifier trained with teacher data that associates feature amounts with information indicating whether the corresponding audio information is bright speech, and determines, based on the result obtained, whether the audio information of the filler section is audio information uttered when the user is unsure of a decision.

（付記８）前記判定する処理は、前記第２のユーザに対応する第１閾値、第２閾値、第３閾値を取得し、前記フィラー区間の音声情報の特徴量と前記第１閾値との比較結果、前記フィラー区間に続く前記発話区間の音声情報の特徴量と前記第１閾値との比較結果、前記応答時間と前記第２閾値との比較結果、前記各特徴量の平均値と前記第３閾値との比較結果を基にして、前記フィラー区間の音声情報が、ユーザが判断に迷っている場合に発話される音声情報であるか否かを判定することを特徴とする付記４に記載の音声処理プログラム。 (Supplementary Note 8) The speech processing program according to Supplementary Note 4, wherein the determining acquires a first threshold, a second threshold, and a third threshold corresponding to the second user, and determines whether the audio information of the filler section is audio information uttered when the user is unsure of a decision, based on a result of comparing the feature amount of the audio information of the filler section with the first threshold, a result of comparing the feature amount of the audio information of the utterance section following the filler section with the first threshold, a result of comparing the response time with the second threshold, and a result of comparing the average value of the feature amounts with the third threshold.

(Appendix 9) A speech processing method executed by a computer, the method comprising:
detecting a plurality of utterance sections from voice information;
identifying, as a filler section, an utterance section in which a filler is detected among the plurality of utterance sections; and
determining, based on a feature amount of the voice information of the filler section, whether or not the voice information of the filler section is voice information uttered when a user is hesitating over a decision.

(Appendix 10) The speech processing method according to appendix 9, wherein the determining process determines whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision, based on the feature amount of the voice information of the filler section and a feature amount of the voice information of the utterance section following the filler section.

(Appendix 11) The speech processing method according to appendix 9 or 10, wherein the determining process further uses a response time from an utterance section of a first user to an utterance section of a second user to determine whether or not the voice information of the filler section is voice information uttered when the second user is hesitating over a decision.
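The response time in Appendix 11 can be illustrated with a small helper. The appendix does not define the exact endpoints, so measuring from the end of the first user's most recent utterance section to the start of the second user's utterance section is an assumption, and the function name `response_time` and the (start, end) tuples in seconds are hypothetical.

```python
def response_time(first_user_sections, second_user_section):
    """Seconds between the first user's last preceding utterance and the reply.

    Sections are (start, end) tuples in seconds. Returns None when no
    utterance of the first user ends before the second user starts.
    """
    start, _ = second_user_section
    preceding_ends = [end for _, end in first_user_sections if end <= start]
    if not preceding_ends:
        return None
    return start - max(preceding_ends)


# Hypothetical timeline: the clerk speaks twice, then the customer answers.
clerk_sections = [(0.0, 2.0), (5.0, 6.5)]
customer_section = (8.0, 9.0)
gap = response_time(clerk_sections, customer_section)
```

Here `gap` is 1.5 seconds, the pause between the clerk's second utterance and the customer's reply.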

(Appendix 12) The speech processing method according to appendix 11, wherein the determining process calculates an average value of the feature amounts calculated from the respective utterance sections of the second user, and further uses the average value of the feature amounts to determine whether or not the voice information is voice information uttered when the second user is hesitating over a decision.
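One way to maintain the average value described in Appendix 12 is an incremental mean that is updated once per utterance section of the second user. This is a sketch only; the class name `LongTermAverage` and the sample feature amounts are hypothetical, and the appendix does not say whether the average is computed incrementally or in a batch.

```python
class LongTermAverage:
    """Running mean of per-section feature amounts for one user."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, feature_amount):
        """Fold one more utterance section into the average and return it."""
        self.count += 1
        self.mean += (feature_amount - self.mean) / self.count
        return self.mean


# Hypothetical feature amounts from three utterance sections of the same user.
average = LongTermAverage()
for feature_amount in (2.0, 4.0, 6.0):
    average.update(feature_amount)
```

The incremental form avoids storing every past feature amount, which may suit long conversations.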

(Appendix 13) The speech processing method according to any one of appendices 9 to 12, wherein the determining process uses, as the feature amount, a variance calculated based on the fundamental frequency of the voice information.
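The feature amount in Appendix 13 can be illustrated by taking the variance of per-frame fundamental frequency estimates within one section. The sketch assumes the fundamental frequency has already been estimated for each frame, that unvoiced frames are marked with 0.0 and skipped, and that the population variance is intended; none of these details are stated in the appendix, and the name `f0_variance` is hypothetical.

```python
from statistics import pvariance


def f0_variance(f0_per_frame):
    """Population variance of voiced-frame F0 values in Hz^2.

    Frames with F0 <= 0.0 are treated as unvoiced and ignored (assumption).
    Returns 0.0 when fewer than two voiced frames are available.
    """
    voiced = [f0 for f0 in f0_per_frame if f0 > 0.0]
    if len(voiced) < 2:
        return 0.0
    return pvariance(voiced)


# Hypothetical per-frame F0 track for one utterance section.
f0_track = [0.0, 100.0, 110.0, 120.0, 0.0]
variance = f0_variance(f0_track)
```

For this track the voiced frames are 100, 110, and 120 Hz, giving a variance of 200/3.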

(Appendix 14) The speech processing method according to any one of appendices 9 to 12, wherein the determining process converts the voice information into a spectrum and uses a feature of the spectrum as the feature amount.

(Appendix 15) The speech processing method according to appendix 9, wherein the determining process inputs the feature amount to a classifier trained with teacher data that associates the feature amount with information indicating whether or not the voice information corresponding to the feature amount is a bright voice, and determines, based on the result obtained, whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision.

(Appendix 16) The speech processing method according to appendix 12, wherein the determining process acquires a first threshold, a second threshold, and a third threshold corresponding to the second user, and determines whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision, based on a result of comparing the feature amount of the voice information of the filler section with the first threshold, a result of comparing the feature amount of the voice information of the utterance section following the filler section with the first threshold, a result of comparing the response time with the second threshold, and a result of comparing the average value of the feature amounts with the third threshold.

(Appendix 17) A speech processing device comprising:
an utterance section detection unit that detects a plurality of utterance sections from voice information;
an identification unit that identifies, as a filler section, an utterance section in which a filler is detected among the plurality of utterance sections; and
a determination unit that determines, based on a feature amount of the voice information of the filler section, whether or not the voice information of the filler section is voice information uttered when a user is hesitating over a decision.

(Appendix 18) The speech processing device according to appendix 17, wherein the determination unit determines whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision, based on the feature amount of the voice information of the filler section and a feature amount of the voice information of the utterance section following the filler section.

(Appendix 19) The speech processing device according to appendix 17 or 18, wherein the determination unit further uses a response time from an utterance section of a first user to an utterance section of a second user to determine whether or not the voice information of the filler section is voice information uttered when the second user is hesitating over a decision.

(Appendix 20) The speech processing device according to appendix 19, wherein the determination unit calculates an average value of the feature amounts calculated from the respective utterance sections of the second user, and further uses the average value of the feature amounts to determine whether or not the voice information is voice information uttered when the second user is hesitating over a decision.

(Appendix 21) The speech processing device according to any one of appendices 17 to 20, wherein the determination unit uses, as the feature amount, a variance calculated based on the fundamental frequency of the voice information.

(Appendix 22) The speech processing device according to any one of appendices 17 to 20, wherein the determination unit converts the voice information into a spectrum and uses a feature of the spectrum as the feature amount.

(Appendix 23) The speech processing device according to appendix 17, wherein the determination unit inputs the feature amount to a classifier trained with teacher data that associates the feature amount with information indicating whether or not the voice information corresponding to the feature amount is a bright voice, and determines, based on the result obtained, whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision.

(Appendix 24) The speech processing device according to appendix 20, wherein the determination unit acquires a first threshold, a second threshold, and a third threshold corresponding to the second user, and determines whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision, based on a result of comparing the feature amount of the voice information of the filler section with the first threshold, a result of comparing the feature amount of the voice information of the utterance section following the filler section with the first threshold, a result of comparing the response time with the second threshold, and a result of comparing the average value of the feature amounts with the third threshold.

Reference Signs List
1, 2 Speech processing device
5 Network
10a, 10b Microphone
20a, 20b, 310a, 310b AD conversion unit
21 Receiving unit
22, 30 Preprocessing unit
23, 40 Storage unit
100, 200 State determination unit
110a, 110b, 210a, 210b Utterance section detection unit
120, 220 Identification unit
130, 230 Response time calculation unit
140, 240 Brightness calculation unit
150 Long-term average calculation unit
160 Threshold calculation unit
170 Threshold DB
180, 250 Determination unit
300 Recording device
320 Audio file generation unit
330 Transmission unit

Claims

1. A speech processing program that causes a computer to execute a process comprising:
detecting a plurality of utterance sections from voice information;
identifying, as a filler section, an utterance section in which a filler is detected among the plurality of utterance sections;
identifying, as an utterance section following the filler section, an utterance section that is included in the plurality of utterance sections and comes next after the filler section; and
determining whether or not the voice information of the filler section is voice information uttered when a user is hesitating over a decision, based on a combination of a result of comparing a feature amount of the voice information of the filler section with a first threshold and a result of comparing a feature amount of the voice information of the utterance section following the filler section with a second threshold.

2. The speech processing program according to claim 1, wherein the determining process further uses a response time from an utterance section of a first user to an utterance section of a second user to determine whether or not the voice information of the filler section is voice information uttered when the second user is hesitating over a decision.

3. The speech processing program according to claim 2, wherein the determining process calculates an average value of the feature amounts calculated from the respective utterance sections of the second user, and further uses the average value of the feature amounts to determine whether or not the voice information is voice information uttered when the second user is hesitating over a decision.

4. The speech processing program according to claim 1, wherein the determining process uses, as the feature amount, a variance calculated based on a fundamental frequency of the voice information.

5. The speech processing program according to claim 1, wherein the determining process converts the voice information into a spectrum and uses a feature of the spectrum as the feature amount.

6. The speech processing program according to claim 1, wherein the determining process inputs the feature amount to a classifier trained with teacher data that associates the feature amount with information indicating whether or not the voice information corresponding to the feature amount is a bright voice, and determines, based on the result obtained, whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision.

7. The speech processing program according to claim 3, wherein the determining process acquires the first threshold, the second threshold, and the third threshold corresponding to the second user, and determines whether or not the voice information of the filler section is voice information uttered when the user is hesitating over a decision, based on a result of comparing the feature amount of the voice information of the filler section with the first threshold, a result of comparing the feature amount of the voice information of the utterance section following the filler section with the first threshold, a result of comparing the response time with the second threshold, and a result of comparing the average value of the feature amounts with the third threshold.

8. A speech processing method executed by a computer, the method comprising:
detecting a plurality of utterance sections from voice information;
identifying, as a filler section, an utterance section in which a filler is detected among the plurality of utterance sections;
identifying, as an utterance section following the filler section, an utterance section that is included in the plurality of utterance sections and comes next after the filler section; and
determining whether or not the voice information of the filler section is voice information uttered when a user is hesitating over a decision, based on a combination of a result of comparing a feature amount of the voice information of the filler section with a first threshold and a result of comparing a feature amount of the voice information of the utterance section following the filler section with a second threshold.

9. A speech processing device comprising:
an utterance section detection unit that detects a plurality of utterance sections from voice information;
an identification unit that identifies, as a filler section, an utterance section in which a filler is detected among the plurality of utterance sections, and identifies, as an utterance section following the filler section, an utterance section that is included in the plurality of utterance sections and comes next after the filler section; and
a determination unit that determines whether or not the voice information of the filler section is voice information uttered when a user is hesitating over a decision, based on a combination of a result of comparing a feature amount of the voice information of the filler section with a first threshold and a result of comparing a feature amount of the voice information of the utterance section following the filler section with a second threshold.
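The steps shared by the independent claims can be sketched as follows: each filler section is paired with the utterance section that comes next, and the two threshold comparisons are then combined. The claims do not specify the comparison directions or how the two results are combined, so the "less than" tests and the logical AND below are assumptions, as are the names `Section`, `pair_fillers_with_next`, and `judge` and all numeric values.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: float     # section start time in seconds
    end: float       # section end time in seconds
    feature: float   # feature amount of the section's voice information
    is_filler: bool  # True when a filler was detected in this section


def pair_fillers_with_next(sections):
    """Return (filler, following) pairs in time order."""
    ordered = sorted(sections, key=lambda s: s.start)
    return [(cur, nxt) for cur, nxt in zip(ordered, ordered[1:])
            if cur.is_filler]


def judge(filler, following, first_threshold, second_threshold):
    """Combine the two comparison results (directions and AND are assumed)."""
    return (filler.feature < first_threshold
            and following.feature < second_threshold)


# Hypothetical sections already produced by utterance and filler detection.
sections = [
    Section(0.0, 1.0, 0.8, False),
    Section(1.5, 2.0, 0.2, True),
    Section(2.2, 4.0, 0.3, False),
]
pairs = pair_fillers_with_next(sections)
results = [judge(f, n, first_threshold=0.5, second_threshold=0.5)
           for f, n in pairs]
```

A filler section that is the last section in the recording has no following section and is simply not paired in this sketch.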