JP2018081239A - Voice processing method, voice processing device, and voice processing program

Info

Publication number: JP2018081239A
Application number: JP2016224568A
Authority: JP
Inventors: Shunsuke Takeuchi; Shogo Nakamura; Yoshiteru Tsuchinaga; Masanao Suzuki; Nobuyuki Washio; Chisato Shioda
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-11-17
Filing date: 2016-11-17
Publication date: 2018-05-24
Anticipated expiration: 2036-11-17
Also published as: JP6737141B2

Abstract

PROBLEM TO BE SOLVED: To correctly set the utterance start position in voice data.
SOLUTION: In the voice processing method, a computer executes the following processing. First, when an utterance is detected, according to a prescribed condition, from plural voice data collected simultaneously by plural microphones, the computer sets the utterance start position by going back in time, based on the prescribed condition, from the position in the voice data at which the utterance was detected. Next, the computer identifies the direction of the speaker as seen from the plural microphones, based on the characteristic difference between the plural voice data over the span from the utterance start position to the position at which the utterance was detected. Next, based on the identified direction of the speaker, the computer extracts the utterance section that begins at the utterance start position from one of the plural voice data.
SELECTED DRAWING: Figure 6A

Description

The present invention relates to a voice processing method, a voice processing device, and a voice processing program.

One technique for translating the utterance content of voice data identifies the speaker based on plural voice data collected simultaneously by plural microphones, and translates the voice data while switching the source language and the target language according to the result of the speaker identification. In this type of technique, for example, the speaker is identified based on the characteristic difference between the utterance sections of the plural voice data, and one of the plural voice data is then selected based on the identification result and its utterance content is translated (see, for example, Patent Document 1). Also in this type of technique, for example, the processing that separates the sound sources contained in the voice data and the processing that translates and displays the result are controlled according to the direction of the sound source (speaker) estimated from the plural voice data (see, for example, Patent Document 2).

Furthermore, when an utterance is detected from voice data, the voice data is divided into a plurality of frames, and Voice Activity Detection (VAD) or a similar method is used to determine, frame by frame, whether the frame contains significant voice. The detection of the utterance is confirmed when a predetermined number of frames containing significant voice occur consecutively.

Patent Document 1: JP 2005-148301 A
Patent Document 2: JP 2015-106014 A

In the method that confirms the detection of an utterance when a predetermined number of frames containing significant voice occur consecutively, the detection is confirmed at a position several frames later than the frame in which the utterance actually started. Consequently, if, for example, the frame at which the detection was confirmed is used as the start position of the utterance section when translating the utterance content, the beginning of the first word is cut off.

Conversely, if this delay is compensated by setting the start position of the utterance section to a frame that precedes the detection frame by a fixed length, silent frames or noise frames that precede the frame in which the utterance actually started may be included in the utterance section.

In one aspect, an object of the present invention is to correctly set the utterance start position in voice data.

In a voice processing method according to one aspect, a computer executes the following processing. First, when an utterance is detected, according to a predetermined condition, from plural voice data collected simultaneously by plural microphones, the computer sets the start position of the utterance by going back in time, based on the predetermined condition, from the position in the voice data at which the utterance was detected. Next, the computer identifies the direction of the speaker as seen from the plural microphones, based on the characteristic difference between the plural voice data over the span from the start position of the utterance to the position at which the utterance was detected. Next, based on the identified direction of the speaker, the computer extracts the utterance section from the start position of the utterance onward in one of the plural voice data.

According to the above aspect, the utterance start position in voice data can be set correctly.

A block diagram showing the functional configuration of the voice processing device according to the first embodiment.
A block diagram showing a configuration example of the direction identification unit.
A diagram showing an example of the positional relationship between the microphones and the speakers.
A diagram showing a setting example of the speaker information.
A flowchart explaining the translation processing of utterance content according to the first embodiment.
A flowchart (part 1) explaining the utterance detection processing.
A flowchart (part 2) explaining the utterance detection processing.
A flowchart explaining the direction identification processing.
A diagram explaining the processing performed by the first voice processing unit.
A diagram explaining an example of a method for determining whether a frame is voiced.
A diagram showing an example of a result of the utterance detection processing.
A diagram showing another example of a result of the utterance detection processing.
A diagram explaining the method of calculating the average power in the direction identification processing.
A diagram explaining the relationship between the sound pressure difference and the direction of the speaker.
A diagram explaining the method of switching the translation languages.
A flowchart explaining the translation processing of utterance content according to the second embodiment.
A diagram showing an example of translation processing that can occur in the voice processing according to the first embodiment.
A diagram showing an example of translation processing by the voice processing according to the second embodiment.
A block diagram showing the functional configurations of the voice processing device and the server device according to the third embodiment.
A diagram showing the hardware configuration of a computer.

<First Embodiment>
FIG. 1 is a block diagram showing the functional configuration of the voice processing device according to the first embodiment.

As shown in FIG. 1, the voice processing device 1 of this embodiment includes a first voice processing unit 110, a second voice processing unit 120, and a speaker information setting unit 130. The voice processing device 1 of this embodiment also includes a voice data holding unit 191, a calculation result holding unit 192, a speaker information holding unit 193, and a translation dictionary 194.

The first voice processing unit 110 acquires voice data from each of a plurality of microphones 201 and 202, which can separately output voice data collected at the same time, and causes the voice data holding unit 191 to hold that voice data in units of frames. The first voice processing unit 110 includes a voice data acquisition unit 111 and a held data control unit 112. In the following description, a microphone is referred to simply as a "mic".

The voice data acquisition unit 111 acquires simultaneously collected voice data from each of the mics 201 and 202. The mics 201 and 202 are, for example, the pair formed by the L-channel mic 201 and the R-channel mic 202 of a single stereo mic 2.

The held data control unit 112 divides the acquired voice data into frames (processing units) of a predetermined time length and causes the voice data holding unit 191 to hold the frames in a state in which each frame can be identified. The held data control unit 112 also deletes frames that are no longer needed from among the frames held in the voice data holding unit 191.
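As a rough illustration of this framing-and-holding step, the following Python sketch splits one channel into fixed-length frames and keeps only the most recent frames. The names `split_into_frames` and `FrameStore`, the frame length, and the "drop the oldest frame once a capacity is reached" policy are assumptions made for the example; the text only states that frames that are no longer needed are deleted.

```python
from collections import deque


def split_into_frames(samples, frame_len):
    """Split a sample sequence into consecutive fixed-length frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


class FrameStore:
    """Hypothetical stand-in for the holding unit: keeps recent frames with their indices."""

    def __init__(self, max_frames):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._frames = deque(maxlen=max_frames)
        self._next_index = 0

    def push(self, frame):
        # Tag every frame with a running index so its time order stays identifiable.
        self._frames.append((self._next_index, frame))
        self._next_index += 1

    def indices(self):
        return [index for index, _ in self._frames]


frames = split_into_frames(list(range(10)), frame_len=4)
store = FrameStore(max_frames=2)
for frame in frames + [[0, 0, 0, 0]]:
    store.push(frame)
```

In a two-channel setup, one such store per channel with a shared frame index would keep the L and R frames of the same time aligned.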

The second voice processing unit 120 detects the utterance section in the voice data acquired from each of the mics 201 and 202, and identifies the direction in which the speaker who is speaking in that utterance section is located. In doing so, the second voice processing unit 120 sets the start position of the utterance section by going back a predetermined time length (number of frames) from the position in the voice data at which the utterance was detected. The second voice processing unit 120 also identifies, based on the characteristic difference between the plural voice data, whether the speaker in the utterance section is to the left or to the right as seen from the stereo mic 2. Furthermore, the second voice processing unit 120 in the voice processing device 1 of this embodiment translates the utterance content of the utterance section from the language associated with the speaker into another language, based on the result of identifying the speaker in the utterance section. The second voice processing unit 120 in the voice processing device 1 of this embodiment includes an utterance detection unit 121, a direction identification unit 122, a language switching unit 123, a voice data extraction unit 124, a translation processing unit 125, and an output unit 126.

The utterance detection unit 121 determines, for each frame of each voice data, whether the frame is voiced, and detects the utterance section of each voice data based on the determination results. Here, a voiced frame is a frame that contains a significant voice component, such as a speaker's utterance. The utterance detection unit 121 determines whether a frame is voiced by a known determination method. For each voice data, the utterance detection unit 121 also counts the number of voiced frames among a plurality of consecutive frames that include the frame currently being processed. When the number of voiced frames among those consecutive frames reaches or exceeds a threshold, the utterance detection unit 121 detects an utterance section in the voice data containing those frames. At this time, the utterance detection unit 121 counts voiced frames going back into the past from the frame at which the utterance was detected, and sets the voiced frame that is the d_frame-th in that count, where d_frame is the utterance detection delay length, as the start frame of the utterance section.
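The detection rule and the backward search for the start frame can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure itself: the function name `detect_utterance_start`, the `window` parameter (the number of consecutive frames examined), and the choice to include the detection frame itself in the backward count are all hypothetical. The per-frame voiced decisions are taken as given, since the text leaves them to a known method.

```python
def detect_utterance_start(is_voiced, window, threshold, d_frame):
    """Return (detect_index, start_index), or None if no utterance is detected.

    is_voiced : list of bool, one voiced/unvoiced decision per frame
    window    : number of consecutive frames examined, ending at the current frame
    threshold : minimum count of voiced frames in the window to confirm detection
    d_frame   : utterance detection delay length, counted in voiced frames
    """
    for t in range(len(is_voiced)):
        lo = max(0, t - window + 1)
        if sum(is_voiced[lo:t + 1]) < threshold:
            continue
        # Detection confirmed at frame t: walk back, counting voiced frames only,
        # and stop at the d_frame-th one. Unvoiced frames are skipped, so silence
        # or noise before the real onset does not become the start frame.
        count = 0
        start = t
        for k in range(t, -1, -1):
            if is_voiced[k]:
                count += 1
                start = k
                if count == d_frame:
                    break
        # If fewer than d_frame voiced frames exist, the earliest one found is used.
        return t, start
    return None


voiced = [False, False, True, False, True, True, True, False]
result = detect_utterance_start(voiced, window=5, threshold=4, d_frame=4)
```

With these inputs the window ending at frame 6 is the first to contain four voiced frames, and the fourth voiced frame counted backward from frame 6 is frame 2, so a fixed-length rewind that might land on unvoiced frame 3 is avoided.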

The direction identification unit 122 identifies the direction of the speaker based on the difference in characteristics between the frames of the utterance sections detected from the respective voice data. For example, the direction identification unit 122 identifies the direction of the speaker based on the sound pressure difference between the frames of the first voice data acquired from the L-channel mic 201 and the frames of the second voice data acquired from the R-channel mic 202. When voice data is acquired by the stereo mic 2 that includes the L-channel mic 201 and the R-channel mic 202, the direction identification unit 122 identifies whether the speaker is to the left or to the right of the stereo mic 2.

The translation language switching unit 123 switches the combination of source language and target language based on the direction of the speaker identified by the direction identification unit 122 and on preset speaker information. The speaker information includes the direction in which each speaker is located as seen from the stereo mic 2 and the language that speaker speaks. The speaker information is entered, for example, by a user of the voice processing device 1 performing a predetermined operation on the speaker information setting unit 130. The speaker information setting unit 130 causes the speaker information holding unit 193 to hold the speaker information entered by the user. Suppose, for example, that the speaker in the direction of the L-channel mic 201 speaks Japanese and the speaker in the direction of the R-channel mic 202 speaks English. In this case, when the identified direction of the speaker is the direction of the L-channel mic 201, the translation language switching unit 123 sets the source language to Japanese and the target language to English. When the identified direction of the speaker later changes to the direction of the R-channel mic 202, the translation language switching unit 123 sets the source language to English and the target language to Japanese.

The voice data extraction unit 124 extracts, from among the plural voice data, the utterance section of the voice data to be used for translation, based on the direction of the speaker identified by the direction identification unit 122. For example, when the direction of the speaker is the direction of the L-channel mic 201, the voice data extraction unit 124 extracts the utterance section to be translated from the voice data that was acquired from the L-channel mic 201 and is held in the voice data holding unit 191.

The translation processing unit 125 analyzes the utterance content in the utterance section of the voice data extracted by the voice data extraction unit 124 and translates it from the source language into the target language. Using a known translation method, the translation processing unit 125 generates text data in which the utterance content of the utterance section has been translated (converted) from the source language into the target language. For example, the translation processing unit 125 first extracts a character string representing the utterance content from the frame data of the utterance section, based on the voice characteristics of the source language and similar information, and turns the extracted character string into sentences by morphological analysis. The translation processing unit 125 then translates (converts) the sentences in the source language into text data in the target language. In doing so, the translation processing unit 125 refers to the word dictionary for morphological analysis included in the translation dictionary 194 to turn the utterance content into sentences, and refers to the translation dictionary included in the translation dictionary 194 to translate those sentences into text data in the target language.

The output unit 126 renders the target-language text data generated by the translation processing unit 125 in visible form and outputs it to the display device 3. The output unit 126 may instead output the target-language text data generated by the translation processing unit 125 as voice data to a speaker (not shown).

As described above, the voice data holding unit 191 holds the voice data acquired from the mics 201 and 202 in units of frames. The voice data holding unit 191 holds a plurality of past frames, including the frame currently being processed by the second voice processing unit 120. The calculation result holding unit 192 holds the results of the various calculations performed in the second voice processing unit 120. As described above, the speaker information holding unit 193 holds the speaker information referred to by the language switching unit 123. As described above, the translation dictionary 194 includes the word dictionary for morphological analysis and the translation dictionary referred to by the translation processing unit 125.

FIG. 2 is a block diagram showing a configuration example of the direction identification unit.
As described above, the direction identification unit 122 in the voice processing device 1 of this embodiment uses the sound pressure difference between the average powers in the utterance section as the characteristic difference between the frames of the utterance sections used to identify the speaker. For this purpose, as shown in FIG. 2, the direction identification unit 122 includes a frame power calculation unit 122A, an average power calculation unit 122B, a sound pressure difference calculation unit 122C, and a speaker direction determination unit 122D.

The frame power calculation unit 122A calculates the power of each frame of the voice data held in the voice data holding unit 191. For example, the frame power calculation unit 122A calculates the power of each frame from the amplitude of its time waveform. The frame power calculation unit 122A causes the calculation result holding unit 192 to hold the calculated power of each frame.

The average power calculation unit 122B calculates the average power from the powers of the frames included in the utterance section. The average power calculation unit 122B in the voice processing device 1 of this embodiment calculates the average power over the voiced frames among the frames from the start frame of the utterance section detected by the utterance detection unit 121 to the frame currently being processed.
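The two calculations above can be sketched in Python as follows. The text only says that power is derived from the amplitude of the time waveform, so the mean-square-in-decibels formula, the small `eps` guard, and the function names `frame_power_db` and `average_power` are assumptions chosen for illustration.

```python
import math


def frame_power_db(frame, eps=1e-12):
    """Power of one frame as the mean squared amplitude in dB (assumed formula)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    # eps avoids log10(0) for an all-zero (digitally silent) frame.
    return 10.0 * math.log10(mean_square + eps)


def average_power(powers, is_voiced, start, current):
    """Average the per-frame powers of the voiced frames in [start, current].

    Returns None when the span contains no voiced frame.
    """
    voiced = [
        p
        for p, v in zip(powers[start:current + 1], is_voiced[start:current + 1])
        if v
    ]
    if not voiced:
        return None
    return sum(voiced) / len(voiced)


powers = [10.0, 0.0, 20.0, 30.0]
flags = [True, False, True, True]
ave = average_power(powers, flags, start=0, current=2)
```

Restricting the average to voiced frames keeps short pauses inside the utterance from pulling the channel's average power down.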

The sound pressure difference calculation unit 122C calculates the sound pressure difference based on the average power of the utterance section calculated for each voice data. The sound pressure difference calculation unit 122C calculates the sound pressure difference DiffPow by, for example, the following equation (1).

DiffPow = AvePow_L - AvePow_R   (1)

In equation (1), AvePow_L and AvePow_R are, respectively, the average power of the voice data acquired from the L-channel mic 201 and the average power of the voice data acquired from the R-channel mic 202.

The speaker direction determination unit 122D determines the direction in which the speaker is located based on the magnitude relationship between the sound pressure difference calculated by the sound pressure difference calculation unit 122C and a predetermined threshold. When the sound pressure difference DiffPow calculated by equation (1) is larger than the threshold TH (TH < DiffPow), the speaker direction determination unit 122D determines that the speaker located to the left of the stereo mic 2 (mic left) is speaking. When the sound pressure difference DiffPow calculated by equation (1) is smaller than the threshold -TH (DiffPow < -TH), the speaker direction determination unit 122D determines that the speaker located to the right of the stereo mic 2 (mic right) is speaking. Furthermore, in this embodiment, when the sound pressure difference DiffPow calculated by equation (1) satisfies -TH ≦ DiffPow ≦ TH, the speaker direction determination unit 122D determines that a plurality of speakers are speaking (the direction of the speaker cannot be determined).
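The three-way threshold decision can be written compactly. In this sketch the sound pressure difference is assumed to be the plain difference AvePow_L - AvePow_R, which is consistent with the sign convention of the thresholds (positive means left); the function name `judge_direction` and the string labels are hypothetical.

```python
def judge_direction(ave_pow_l, ave_pow_r, th):
    """Classify the speaker direction from the L/R average powers.

    Returns "left" if TH < DiffPow, "right" if DiffPow < -TH, and
    "undetermined" if -TH <= DiffPow <= TH (several speakers, or no clear side).
    """
    diff_pow = ave_pow_l - ave_pow_r  # assumed form of the sound pressure difference
    if diff_pow > th:
        return "left"
    if diff_pow < -th:
        return "right"
    return "undetermined"


left_case = judge_direction(-20.0, -30.0, th=3.0)
right_case = judge_direction(-30.0, -20.0, th=3.0)
tie_case = judge_direction(-20.0, -21.0, th=3.0)
```

The dead band between -TH and TH is what keeps the language pair from flipping when both channels pick up roughly equal power.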

The determination result of the speaker direction determination unit 122D is input to the language switching unit 123. The language switching unit 123 switches the source language and the target language based on the processing result of the speaker direction determination unit 122D (direction identification unit 122) and the speaker position information held in the speaker information holding unit 193.

FIG. 3 is a diagram showing an example of the positional relationship between the mics and the speakers. FIG. 4 is a diagram showing a setting example of the speaker information.

As described above, the voice processing device 1 of this embodiment acquires voice data from the stereo mic 2, which includes the L-channel mic 201 and the R-channel mic 202. The stereo mic 2 is placed, for example, between the two speakers 4A and 4B who are holding a conversation, as shown in FIG. 3. The stereo mic 2 is oriented so that, as seen from the midpoint between the two mics 201 and 202, the L-channel mic 201 lies in the direction of the first speaker 4A and the R-channel mic 202 lies in the direction of the second speaker 4B.

When the first speaker 4A speaks Japanese and the second speaker 4B speaks English, the speaker information holding unit 193 is made to hold, for example, the speaker information shown in FIG. 4. In the speaker information of FIG. 4, the speaker ID is an identifier that distinguishes the first speaker 4A and the second speaker 4B. The speaker position is information indicating the positions (directions) of speaker 1 and speaker 2 as seen from the stereo mic 2. The utterance language is information indicating the languages spoken by speaker 1 and speaker 2. In FIG. 3, speaker 1, to the left of the mic, is the first speaker 4A, and speaker 1 speaks Japanese. Likewise in FIG. 3, speaker 2, to the right of the mic, is the second speaker 4B, and speaker 2 speaks English. Accordingly, in the speaker information 501 of FIG. 4, speaker 1 is associated with information indicating that the speaker's position is mic left and that the utterance language is Japanese. Similarly, in the speaker information 501 of FIG. 4, speaker 2 is associated with information indicating that the speaker's position is mic right and that the utterance language is English.
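One way to picture the speaker information and the language switching it drives is the small lookup below. The record layout, the field names, the language codes, and the function `select_languages` are assumptions for illustration; the source only specifies that each speaker ID is associated with a position and an utterance language.

```python
# Hypothetical in-memory form of the speaker information (two speakers).
SPEAKER_INFO = [
    {"speaker_id": 1, "position": "left", "language": "ja"},
    {"speaker_id": 2, "position": "right", "language": "en"},
]


def select_languages(direction, speaker_info):
    """Return (source_language, target_language) for the identified direction.

    The source language is that of the speaker at `direction`; the target is the
    language of the other speaker. Returns None when no speaker is registered at
    that direction (for example, when the direction is undetermined).
    """
    source = next(
        (s["language"] for s in speaker_info if s["position"] == direction), None
    )
    if source is None:
        return None
    target = next(
        (s["language"] for s in speaker_info if s["position"] != direction), None
    )
    return source, target


pair_left = select_languages("left", SPEAKER_INFO)
pair_right = select_languages("right", SPEAKER_INFO)
```

Because the pair is derived from the table on every call, changing the registered languages requires no change to the switching logic.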

When the first speaker 4A, to the left of the mic, speaks with the stereo mic 2 placed as described above, the voice processing device 1 identifies, based on the voice data acquired from the stereo mic 2, that the first speaker 4A to the left of the mic is speaking. The voice processing device 1 therefore translates the utterance content of the first speaker 4A using, for example, the voice data acquired from the L-channel mic 201. When the first speaker 4A speaks Japanese and the second speaker 4B speaks English, the voice processing device 1 turns the utterance section of the voice data acquired from the L-channel mic 201 into Japanese sentences, translates those sentences into English, and outputs the result to the display device 3. The second speaker 4B can thus understand what the first speaker 4A said by reading the (English) text displayed on the display device 3.

Conversely, when the second speaker 4B speaks, the voice processing device 1 translates the utterance content of the second speaker 4B from English into Japanese using the voice data acquired from the R-channel mic 202.

As described above, the voice processing device 1 of this embodiment performs two kinds of processing: processing that divides each voice data acquired from the mics 201 and 202 into frames and holds them, and processing that identifies the speaker in the utterance section of the acquired voice data and translates the utterance content.

The processing that divides voice data into frames and holds them is performed by the first voice processing unit 110 (the voice data acquisition unit 111 and the held data control unit 112) of the voice processing device 1. The voice data acquisition unit 111 acquires voice data from each of the mics 201 and 202. The held data control unit 112 divides each acquired voice data into frames of a predetermined time length and causes the voice data holding unit 191 to hold the frames in a form in which their order in time can be identified. The voice data acquisition unit 111 and the held data control unit 112 continue dividing the voice data into frames and holding them until an instruction to end the processing is received.

一方、音声データにおける発話区間の発話内容を翻訳する処理は、音声処理装置１の第２の音声処理部１２０（発話検知部１２１、方向識別部１２２、翻訳言語切替部１２３、音声データ抽出部１２４、翻訳処理部１２５、及び出力部１２６）が行う。音声処理装置１の第２の音声処理部１２０は、発話区間の発話内容を翻訳する処理として、図５に示した処理を行う。 On the other hand, the process of translating the utterance content of an utterance section in the audio data is performed by the second speech processing unit 120 of the speech processing apparatus 1 (the utterance detection unit 121, the direction identification unit 122, the translation language switching unit 123, the audio data extraction unit 124, the translation processing unit 125, and the output unit 126). The second speech processing unit 120 performs the process shown in FIG. 5 to translate the utterance content of the utterance section.

図５は、第１の実施形態に係る発話内容の翻訳処理を説明するフローチャートである。 FIG. 5 is a flowchart for explaining the utterance content translation processing according to the first embodiment.
音声処理装置１の第２の音声処理部１２０が行う処理は、図５に示すように、複数の音声データにおける同時刻のフレームの組のそれぞれに対しステップＳ２以降の処理を行うループ処理（ステップＳ１〜Ｓ１０）となっている。第２の音声処理部１２０は、音声データに含まれる全てのフレームの組でステップＳ２以降の処理を行い、ループ処理の終了端（ステップＳ１０）に到達するとループ処理を終了する。 As shown in FIG. 5, the processing performed by the second speech processing unit 120 of the speech processing apparatus 1 is a loop (steps S1 to S10) that performs the processing from step S2 onward on each set of same-time frames in the plurality of audio data. The second speech processing unit 120 performs the processing from step S2 onward on every frame set included in the audio data, and ends the loop when it reaches the loop end (step S10).

１組のフレームの組に対する処理において、第２の音声処理部１２０は、まず、該フレームの組に含まれるフレーム毎に発話を検知する発話検知処理（ステップＳ２）を行う。ステップＳ２の処理は、第２の音声処理部１２０の発話検知部１２１が行う。発話検知部１２１は、既知の判定方法に従って、処理対象のフレームが有音のフレームであるか否かを判定する。また、発話検知部１２１は、処理対象のフレームが有音のフレームである場合、当該フレームから過去に所定のフレーム数だけ遡り、該過去フレームにおける有音のフレーム数に基づいて発話を検知する。所定数の過去フレームにおける有音のフレーム数が閾値に到達した場合、発話検知部１２１は、発話の検知を確定する（認識する）。 In the process for one set of frames, the second audio processing unit 120 first performs an utterance detection process (step S2) for detecting an utterance for each frame included in the set of frames. The processing in step S2 is performed by the utterance detection unit 121 of the second voice processing unit 120. The utterance detection unit 121 determines whether the processing target frame is a sound frame according to a known determination method. Further, when the processing target frame is a sound frame, the speech detection unit 121 goes back a predetermined number of frames from the frame in the past, and detects the speech based on the number of sound frames in the past frame. When the number of sound frames in a predetermined number of past frames reaches the threshold value, the utterance detection unit 121 determines (recognizes) the detection of the utterance.

ステップＳ２の処理を終えると、発話検知部１２１は、続けて、発話検知処理の処理結果に基づいて、発話を検知したか否かを判定する（ステップＳ３）。発話を検知しなかった場合（ステップＳ３；ＮＯ）、発話検知部１２１は、現在の処理対象であるフレームに対する処理を終了させる。その後、未処理のフレームの組がある場合、第２の音声処理部１２０は、次のフレームの組に対するステップＳ２以降の処理を開始する。 When the process of step S2 is completed, the utterance detection unit 121 continues to determine whether or not an utterance has been detected based on the processing result of the utterance detection process (step S3). When the utterance is not detected (step S3; NO), the utterance detection unit 121 ends the process for the frame that is the current processing target. Thereafter, when there is a set of unprocessed frames, the second sound processing unit 120 starts the processing from step S2 onward for the next set of frames.

一方、発話を検知した場合（ステップＳ３；ＹＥＳ）、発話検知部１２１は、次に、発話が継続中であるか否かを判定する（ステップＳ４）。ステップＳ４において、発話検知部１２１は、例えば、ループ処理における１回前の処理での処理結果に基づいて、発話が継続中であるか否かを判定する。 On the other hand, when an utterance is detected (step S3; YES), the utterance detection unit 121 next determines whether or not the utterance is continuing (step S4). In step S4, the utterance detection unit 121 determines whether or not the utterance is continuing based on, for example, the result of the previous iteration of the loop.

発話を検知したが継続中ではない場合（ステップＳ４；ＮＯ）、第２の音声処理部１２０は、次に、発話者の方向を識別する方向識別処理（ステップＳ５）を行う。ステップＳ５の処理は、第２の音声処理部１２０の方向識別部１２２が行う。方向識別部１２２は、現在の処理対象であるフレームから過去に所定のフレーム数だけ遡り、有音であるフレームのパワーに基づいて平均パワーを算出する。また、方向識別部１２２は、平均パワーに基づいて発話区間同士の音圧差を算出し、該音圧差に基づいて現在の処理対象であるフレームを含む発話区間の発話者の方向を識別する。ステレオマイク２から取得した２個の音声信号によりループ処理Ｓ１〜Ｓ１０を行っている場合、方向識別部１２２は、発話者の方向がマイク左方、マイク右方、及び正面方向（特定できず）のいずれかであるかを識別する。 When an utterance is detected but is not continuing (step S4; NO), the second speech processing unit 120 next performs a direction identification process (step S5) that identifies the direction of the speaker. The process of step S5 is performed by the direction identification unit 122 of the second speech processing unit 120. The direction identification unit 122 goes back a predetermined number of frames from the current frame and calculates the average power from the power of the voiced frames. The direction identification unit 122 then calculates the sound pressure difference between the utterance sections from the average power, and identifies, based on that difference, the direction of the speaker in the utterance section that includes the current frame. When the loop S1 to S10 is performed on the two audio signals acquired from the stereo microphone 2, the direction identification unit 122 identifies whether the speaker is to the left of the microphone, to the right of the microphone, or in front (cannot be determined).

ステップＳ５の処理を終えると、方向識別部１２２は、続けて、発話者の方向を一方のマイクの方向に識別したか否かを判定する（ステップＳ６）。ステレオマイク２から取得した２個の音声信号によりループ処理Ｓ１〜Ｓ１０を行っている場合、方向識別部１２２は、発話者の方向がマイク左方及びマイク右方のいずれかであると識別すると、発話者の方向を一方のマイクの方向に識別したと判定する。発話者の方向を識別できなかった場合（ステップＳ６；ＮＯ）、方向識別部１２２は、現在の処理対象であるフレームに対する処理を終了させる。その後、未処理のフレームの組がある場合、第２の音声処理部１２０は、次のフレームの組に対するステップＳ２以降の処理を開始する。 When the process of step S5 is completed, the direction identifying unit 122 continues to determine whether or not the direction of the speaker has been identified as the direction of one microphone (step S6). When loop processing S1 to S10 is performed using two audio signals acquired from the stereo microphone 2, the direction identification unit 122 identifies that the direction of the speaker is either the left microphone or the right microphone. It is determined that the direction of the speaker is identified as the direction of one microphone. When the direction of the speaker cannot be identified (step S6; NO), the direction identification unit 122 ends the process for the current processing target frame. Thereafter, when there is a set of unprocessed frames, the second sound processing unit 120 starts the processing from step S2 onward for the next set of frames.

一方、発話者の方向を識別した場合（ステップＳ６；ＹＥＳ）、第２の音声処理部１２０は、次に、識別した話者の方向に基づいて翻訳言語を切り替える（ステップＳ７）。ステップＳ７の処理は、言語切替部１２３が行う。言語切替部１２３は、現在の処理対象であるフレームを含む発話区間における発話者の方向の識別結果と、話者情報保持部１９３の話者の位置情報（図４を参照）とに基づいて、翻訳元言語と翻訳先言語との組み合わせを設定する。 On the other hand, when the direction of the speaker is identified (step S6; YES), the second speech processing unit 120 next switches the translation language based on the identified direction of the speaker (step S7). The language switching unit 123 performs the process in step S7. The language switching unit 123 is based on the identification result of the direction of the speaker in the utterance section including the current processing target frame, and the speaker position information (see FIG. 4) in the speaker information holding unit 193. Set the combination of source language and target language.
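
The language switch of step S7 can be pictured as a table lookup keyed by the identified direction. The following is a minimal sketch under stated assumptions: the table `SPEAKER_LANGUAGE` is a hypothetical stand-in for the speaker position information of FIG. 4 (whose actual layout is not reproduced here), and the direction labels and language codes are illustrative only.

```python
# Hypothetical stand-in for the speaker position information (FIG. 4):
# which language is spoken on which side of the stereo microphone.
SPEAKER_LANGUAGE = {"left": "ja", "right": "en"}


def select_translation_pair(direction, table=SPEAKER_LANGUAGE):
    """Return (source_language, target_language) for the given direction.

    Returns None when the direction is not one of the microphone sides,
    which corresponds to step S6 answering NO.
    """
    if direction not in table:
        return None
    source = table[direction]
    # With two speakers, the target is the language of the other side.
    target = next(lang for side, lang in table.items() if side != direction)
    return source, target
```

For example, `select_translation_pair("left")` yields the Japanese-to-English pair used when the first speaker 4A speaks.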

ステップＳ７の処理を行った場合、或いは発話が継続中である場合（ステップＳ４；ＹＥＳ）、第２の音声処理部１２０は、次に、翻訳する音声データを抽出する（ステップＳ８）。ステップＳ８の処理は、音声データ抽出部１２４が行う。音声データ抽出部１２４は、現在の処理対象のフレームを含む音声データにおいて翻訳の対象となる発話区間のフレーム（音声データ）を音声データ保持部１９１から読み出す。 When the process of step S7 is performed or when the utterance is continuing (step S4; YES), the second voice processing unit 120 next extracts voice data to be translated (step S8). The voice data extraction unit 124 performs the process in step S8. The voice data extraction unit 124 reads, from the voice data holding unit 191, a frame (voice data) of an utterance section to be translated in the voice data including the current processing target frame.

次に、第２の音声処理部１２０は、抽出した音声データの発話内容を翻訳して出力する（ステップＳ９）。ステップＳ９の処理は、翻訳処理部１２５と、出力部１２６とが行う。翻訳処理部１２５は、既知の解析方法に従って、音声データ（発話区間）を文字列に変換し、該文字列に対する形態素解析を行って音声データを文章化する。この際、翻訳処理部１２５は、翻訳用辞書１９４に含まれる形態素解析用の辞書を参照し、ステップＳ７で設定した翻訳元言語で音声データを文書化する。その後、翻訳処理部１２５は、既知の翻訳方法に従って、翻訳元言語の文書を翻訳先言語の文書に翻訳する（変換する）。翻訳処理部１２５における翻訳処理が終了すると、出力部１２６が、翻訳後の文書データを可視化する表示データを作成し、表示装置３に出力する。これにより、表示装置３には、現在の処理対象であるフレームを含む発話区間における発話内容の翻訳文が表示される。 Next, the second voice processing unit 120 translates and outputs the utterance content of the extracted voice data (step S9). The translation processing unit 125 and the output unit 126 perform the process in step S9. The translation processing unit 125 converts speech data (speech section) into a character string according to a known analysis method, performs morphological analysis on the character string, and converts the speech data into sentences. At this time, the translation processing unit 125 refers to the morphological analysis dictionary included in the translation dictionary 194 and documents the voice data in the translation source language set in step S7. Thereafter, the translation processing unit 125 translates (converts) the document in the translation source language into the document in the translation destination language in accordance with a known translation method. When the translation processing in the translation processing unit 125 ends, the output unit 126 creates display data for visualizing the translated document data and outputs the display data to the display device 3. Thereby, the translated text of the utterance content in the utterance section including the frame that is the current processing target is displayed on the display device 3.

ステップＳ９の処理を終えると、現在の処理対象のフレームの組に対する処理が終了する。ここで、未処理のフレームの組がある場合、第２の音声処理部１２０は、次のフレームの組に対するステップＳ２以降の処理を開始する。また、全てのフレームの組に対しステップＳ２以降の処理を行った場合、第２の音声処理部１２０は、ループ処理（ステップＳ１〜Ｓ１０）を終了し、発話内容の翻訳処理を終了する。 When the process of step S9 is completed, the process for the current set of frames to be processed ends. Here, when there is a set of unprocessed frames, the second sound processing unit 120 starts the processing from step S2 onward for the next set of frames. In addition, when the processing after step S2 is performed on all sets of frames, the second speech processing unit 120 ends the loop processing (steps S1 to S10) and ends the utterance content translation processing.
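
The control flow of steps S1 to S10 can be summarized in a short sketch. This is an illustration under assumptions, not the patented implementation: the five callables are hypothetical placeholders for the units 121 to 126, and tracking "utterance continuing" with a single boolean carried over from the previous iteration is one possible reading of step S4.

```python
def translate_loop(frame_pairs, detect, identify_direction,
                   switch_language, extract, translate_and_output):
    """Run the loop S1..S10 over sets of same-time frames."""
    speaking = False  # result of the previous iteration, consulted in S4
    for index, pair in enumerate(frame_pairs):        # S1 .. S10
        if not detect(index, pair):                   # S2, S3: NO
            speaking = False
            continue
        if not speaking:                              # S4: NO (new utterance)
            direction = identify_direction(index)     # S5
            if direction not in ("left", "right"):    # S6: NO
                continue
            switch_language(direction)                # S7
            speaking = True
        segment = extract(index)                      # S8
        translate_and_output(segment)                 # S9
```

Passing the units in as callables keeps the sketch independent of how detection, direction identification, and translation are actually realized.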

本実施形態に係る発話検知処理（ステップＳ２）では、上記のように、現在の処理対象であるフレームが有音のフレームである場合に、該フレームから過去に遡り、所定数のフレーム内における有音のフレーム数を調べる。そして、有音のフレーム数が閾値に到達した場合に、発話検知部１２１は、発話を検知したと認識する。言い換えると、発話検知部１２１は、有音のフレームが連続していなくても、現在の処理対象であるフレームを含む発話検知遅延長d_frame個分のフレーム内における有音のフレームが閾値ＴＨ個に到達すると、発話の検知を確定する（認識する）。発話検知遅延長d_frameは、発話の検知を確定するための条件である有音のフレームの個数の計数を開始したフレームから、発話の検知を確定したフレームまでの数に相当する。本実施形態の発話検知部１２１が行う上記の発話検知処理について、図６Ａ及び図６Ｂを参照して説明する。 In the utterance detection process (step S2) according to the present embodiment, as described above, when the current frame is a voiced frame, the process goes back from that frame and checks the number of voiced frames within a predetermined number of frames. When the number of voiced frames reaches the threshold, the utterance detection unit 121 recognizes that an utterance has been detected. In other words, even if the voiced frames are not consecutive, the utterance detection unit 121 confirms (recognizes) the detection of an utterance once the number of voiced frames within the d_frame frames (the utterance detection delay length) that include the current frame reaches the threshold TH. The utterance detection delay length d_frame corresponds to the number of frames from the frame at which counting of voiced frames (the condition for confirming detection) started to the frame at which detection was confirmed. The utterance detection process performed by the utterance detection unit 121 of this embodiment is described with reference to FIGS. 6A and 6B.

図６Ａは、発話検知処理の内容を説明するフローチャート（その１）である。図６Ｂは、発話検知処理の内容を説明するフローチャート（その２）である。 FIG. 6A is a flowchart (part 1) for explaining the contents of the speech detection process. FIG. 6B is a flowchart (part 2) illustrating the content of the speech detection process.

本実施形態の発話検知部１２１が行う発話検知処理は、図６Ａ及び図６Ｂに示すように、１組のフレームの組に含まれる複数のフレームのそれぞれに対しステップＳ２０２以降の処理を行うループ処理（Ｓ２０１〜Ｓ２１１）となっている。発話検知部１２１は、１組のフレームの組に含まれる全てのフレームでＳ２０２以降の処理を行い、ループ処理の終了端（ステップＳ２１１）に到達すると、ループ処理を終了する。 As shown in FIGS. 6A and 6B, the utterance detection process performed by the utterance detection unit 121 according to the present embodiment is a loop process in which the processes after step S202 are performed on each of a plurality of frames included in one set of frames. (S201 to S211). The utterance detection unit 121 performs the processing from S202 on all the frames included in one set of frames, and ends the loop processing when reaching the end of the loop processing (step S211).

１個のフレームに対する処理において、発話検知部１２１は、まず、処理対象のフレームが有音であるか否かを判定し、有音である場合には該フレームに対する判定結果を表す値ＶＡＤを１とする（ステップＳ２０２）。なお、処理対象のフレームが有音ではない場合、発話検知部１２１は、判定結果を表す値ＶＡＤを１とは異なる値（例えば、０）とする。 In the processing for one frame, the speech detection unit 121 first determines whether or not the processing target frame is sounded, and if it is sounded, sets the value VAD representing the determination result for the frame to 1 (Step S202). When the processing target frame is not sounded, the utterance detection unit 121 sets a value VAD representing the determination result to a value different from 1 (for example, 0).

次に、発話検知部１２１は、処理対象のフレームがＶＡＤ＝１（有音）であるか否かをチェックする（ステップＳ２０３）。 Next, the speech detection unit 121 checks whether or not the processing target frame is VAD = 1 (sound) (step S203).

処理対象のフレームがＶＡＤ＝１である場合（ステップＳ２０３；ＹＥＳ）、発話検知部１２１は、ステップＳ２０４及びＳ２０５の処理を行い、発話検知処理の処理結果を表すフラグf_speechの値を１又は０に設定する。ステップＳ２０４において、発話検知部１２１は、ＶＡＤ＝１のフレーム数をカウントするカウンタvad_countをvad_count＋１に更新し、ＶＡＤ≠１のフレーム数をカウントするカウンタn_vad_countを０にリセットする。また、ステップＳ２０５において、発話検知部１２１は、ＶＡＤ＝１のフレーム数をカウントするカウンタvad_countが閾値ＴＨ１よりも大きいか否かを判定する。vad_count＞ＴＨ１の場合（ステップＳ２０５；ＹＥＳ）、発話検知部１２１は、処理結果を表すフラグf_speechを、発話を検知したことを示す値（例えば、１）に設定するとともに、発話検知遅延長d_frameを設定する（ステップＳ２０６）。ステップＳ２０６において、発話検知部１２１は、ＶＡＤ＝１となったフレームから現在の処理対象であるフレームまでのフレーム数を、発話検知遅延長d_frameに設定する。一方、vad_count≦ＴＨ１の場合（ステップＳ２０５；ＮＯ）、発話検知部１２１は、処理結果を表すフラグf_speechを、発話を検知しなかったことを示す値（例えば、０）に設定する（ステップＳ２０７）。 When the processing target frame is VAD = 1 (step S203; YES), the speech detection unit 121 performs the processing of steps S204 and S205, and sets the value of the flag f_speech representing the processing result of the speech detection processing to 1 or 0. Set. In step S204, the speech detection unit 121 updates the counter vad_count that counts the number of frames with VAD = 1 to vad_count + 1, and resets the counter n_vad_count that counts the number of frames with VAD ≠ 1 to zero. In step S205, the speech detection unit 121 determines whether or not a counter vad_count that counts the number of frames with VAD = 1 is greater than a threshold value TH1. When vad_count> TH1 (step S205; YES), the utterance detection unit 121 sets the flag f_speech indicating the processing result to a value (for example, 1) indicating that the utterance has been detected, and sets the utterance detection delay length d_frame. Setting is made (step S206). In step S206, the utterance detection unit 121 sets the number of frames from the frame in which VAD = 1 to the current processing target frame to the utterance detection delay length d_frame. On the other hand, when vad_count ≦ TH1 (step S205; NO), the utterance detection unit 121 sets the flag f_speech indicating the processing result to a value (for example, 0) indicating that no utterance has been detected (step S207). .

一方、処理対象のフレームがＶＡＤ≠１である場合（ステップＳ２０３；ＮＯ）、発話検知部１２１は、図６ＢのステップＳ２０８〜Ｓ２１０の処理を行い、発話検知処理の処理結果を表すフラグf_speechの値を１又は０に設定する。ステップＳ２０８において、発話検知部１２１は、ＶＡＤ≠１のフレーム数をカウントするカウンタn_vad_countを、n_vad_count＋１に更新する。また、ステップＳ２０９において、発話検知部１２１は、ＶＡＤ≠１のフレーム数をカウントするカウンタn_vad_countが閾値ＴＨ２よりも大きいか否かを判定する。n_vad_count＞閾値ＴＨ２の場合（ステップＳ２０９；ＹＥＳ）、発話検知部１２１は、次に、ＶＡＤ＝１のフレーム数をカウントするカウンタvad_countを０にリセットする（ステップＳ２１０）。ステップＳ２１０の後、発話検知部１２１は、処理結果を表すフラグf_speechを、発話を検知しなかったことを示す値（f_speech＝０）に設定する（ステップＳ２０７）。一方、n_vad_count≦閾値ＴＨ２の場合（ステップＳ２０９；ＮＯ）、発話検知部１２１は、次に、処理結果を表すフラグf_speechを、発話を検知したことを示す値（f_speech＝１）に設定するとともに、発話検知遅延長d_frameを設定する（ステップＳ２０６）。 On the other hand, when the frame to be processed has VAD ≠ 1 (step S203; NO), the utterance detection unit 121 performs the processing of steps S208 to S210 in FIG. 6B and sets the flag f_speech, which represents the result of the utterance detection process, to 1 or 0. In step S208, the utterance detection unit 121 updates the counter n_vad_count, which counts the frames with VAD ≠ 1, to n_vad_count + 1. In step S209, the utterance detection unit 121 determines whether n_vad_count is larger than the threshold TH2. If n_vad_count > TH2 (step S209; YES), the utterance detection unit 121 resets the counter vad_count, which counts the frames with VAD = 1, to 0 (step S210). After step S210, the utterance detection unit 121 sets the flag f_speech to the value indicating that no utterance was detected (f_speech = 0) (step S207). If n_vad_count ≦ TH2 (step S209; NO), the utterance detection unit 121 sets the flag f_speech to the value indicating that an utterance was detected (f_speech = 1) and sets the utterance detection delay length d_frame (step S206).

ステップＳ２０６又はＳ２０７を終えると現在の処理対象であるフレームに対する処理が終了し、発話検知部１２１が行う処理は、ループ処理の終了端（ステップＳ２１１）にいたる。ここで、未処理のフレームがある場合、発話検知部１２１は、次のフレームに対するステップＳ２０２以降の処理を開始する。また、全てのフレームに対しステップＳ２０２以降の処理を行った場合、発話検知部１２１は、ループ処理を終了し、発話検知処理を終了する。 When step S206 or S207 is completed, the processing for the current processing target frame is completed, and the processing performed by the utterance detection unit 121 reaches the end of the loop processing (step S211). Here, when there is an unprocessed frame, the utterance detection unit 121 starts processing from step S202 onward for the next frame. Further, when the processing from step S202 is performed on all frames, the utterance detection unit 121 ends the loop processing and ends the utterance detection processing.
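
The per-frame logic of FIGS. 6A and 6B can be sketched as a small state machine. This is a minimal illustration, not the patented implementation. Two points are assumptions of the sketch: the field `first_voiced` (a hypothetical name) records the frame at which counting started so that d_frame can be derived, and on the S209 NO branch the sketch reports f_speech = 1 only if vad_count has already exceeded TH1, whereas the text sets f_speech = 1 on that branch without stating such a condition. The extra condition is added so that leading silence is not reported as an utterance.

```python
from dataclasses import dataclass


@dataclass
class UtteranceDetector:
    th1: int                 # TH1: threshold on the voiced-frame counter
    th2: int                 # TH2: threshold on the unvoiced-frame counter
    vad_count: int = 0       # frames with VAD = 1 since the last reset
    n_vad_count: int = 0     # consecutive frames with VAD != 1
    first_voiced: int = -1   # hypothetical: frame index where counting began
    d_frame: int = 0         # utterance detection delay length

    def step(self, index: int, vad: int) -> int:
        """Process one frame and return f_speech (1 = utterance detected)."""
        if vad == 1:                                      # S203: YES
            if self.vad_count == 0:
                self.first_voiced = index
            self.vad_count += 1                           # S204
            self.n_vad_count = 0
            if self.vad_count > self.th1:                 # S205: YES
                self.d_frame = index - self.first_voiced  # S206
                return 1
            return 0                                      # S207
        self.n_vad_count += 1                             # S208
        if self.n_vad_count > self.th2:                   # S209: YES
            self.vad_count = 0                            # reset of vad_count
            return 0                                      # S207
        if self.vad_count > self.th1:                     # assumption of this sketch
            self.d_frame = index - self.first_voiced      # S206
            return 1
        return 0
```

With this bookkeeping, the start position of the utterance section is the detection frame index minus `d_frame`, which is the first voiced frame counted toward the detection.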

このように、本実施形態に係る発話検知処理において、発話検知部１２１は、連続する所定数のフレーム内における有音のフレーム数が閾値ＴＨ１に到達すると、発話の検知を確定する。更に、発話検知部１２１は、発話の検知を確定する条件である有音のフレーム数のカウントを開始したフレームから発話の検知を確定したフレームまでのフレーム数を発話検知遅延長d_frameに設定し、発話を検知したフレームから発話検知遅延長d_frame分だけ過去に遡ったフレームを、発話区間の開始位置とする。よって、本実施形態によれば、発話の検知を確定したフレームを含む発話区間の開始位置を正しく設定することが可能となる。これにより、例えば、音声データから発話区間を抽出する際に、音声データにおける発話開始位置と、発話を検知したフレームとのずれによる語頭切れを抑制することが可能となる。また、上記の発話検知処理によれば、有音であると判定したフレームが発話区間の開始位置となるため、発話区間に含まれる発話前の雑音区間を低減することが可能となる。 In this way, in the utterance detection process according to the present embodiment, the utterance detection unit 121 confirms the detection of an utterance when the number of voiced frames within a predetermined number of consecutive frames reaches the threshold TH1. Furthermore, the utterance detection unit 121 sets, as the utterance detection delay length d_frame, the number of frames from the frame at which counting of voiced frames (the condition for confirming detection) started to the frame at which detection was confirmed, and takes the frame that lies d_frame frames before the frame at which the utterance was detected as the start position of the utterance section. According to the present embodiment, the start position of the utterance section that includes the frame at which detection was confirmed can therefore be set correctly. As a result, when an utterance section is extracted from the audio data, for example, it is possible to suppress clipping of the beginning of a word caused by the gap between the utterance start position in the audio data and the frame at which the utterance was detected. In addition, because a frame determined to be voiced becomes the start position of the utterance section, the pre-utterance noise interval included in the utterance section can be reduced.

次に、図５のフローチャートにおける方向識別処理（ステップＳ５）の内容について、図７を参照して説明する。 Next, the content of the direction identification process (step S5) in the flowchart of FIG. 5 will be described with reference to FIG.

図７は、方向識別処理の内容を説明するフローチャートである。 FIG. 7 is a flowchart for explaining the contents of the direction identification processing.
方向識別処理は、方向識別部１２２が行う。方向識別部１２２は、まず、１組のフレームの組に含まれる複数のフレームのそれぞれについて、該フレームを含む発話区間の平均パワーを算出するループ処理（Ｓ５０１〜Ｓ５０６）を行う。 The direction identification process is performed by the direction identification unit 122. The direction identification unit 122 first performs a loop (S501 to S506) that calculates, for each of the frames included in one set of frames, the average power of the utterance section that includes that frame.

平均パワーを算出するループ処理において、方向識別部１２２は、まず、処理対象のフレームがＶＡＤ＝１（有音のフレーム）であるか否かを判定する（ステップＳ５０２）。ステップＳ５０２の処理は、方向識別部１２２のパワー算出部１２２Ａが行う。処理対象のフレームがＶＡＤ＝１である場合（ステップＳ５０２；ＹＥＳ）、パワー算出部１２２Ａは、処理対象のフレームのフレームパワーを算出し（ステップＳ５０３）、演算結果保持部１９２に保持させる。処理対象のフレームがステレオマイク２から取得した２個の音声データのうちのＬチャンネルの音声データのフレームである場合、パワー算出部１２２Ａは、例えば、下記式（２−１）によりパワーFramePow_Lを算出する。また、処理対象のフレームがステレオマイク２から取得した２個の音声データのうちのＲチャンネルの音声データのフレームである場合、パワー算出部１２２Ａは、例えば、下記式（２−２）によりパワーFramePow_Rを算出する。 In the loop that calculates the average power, the direction identification unit 122 first determines whether or not the frame to be processed has VAD = 1 (a voiced frame) (step S502). The process of step S502 is performed by the power calculation unit 122A of the direction identification unit 122. When the frame has VAD = 1 (step S502; YES), the power calculation unit 122A calculates the frame power of that frame (step S503) and has the calculation result holding unit 192 hold it. When the frame is a frame of the L-channel audio data of the two audio data acquired from the stereo microphone 2, the power calculation unit 122A calculates the power FramePow_L by, for example, the following equation (2-1). When the frame is a frame of the R-channel audio data, the power calculation unit 122A calculates the power FramePow_R by, for example, the following equation (2-2).

式（２−１）のs_L、及び式（２−２）のs_Rは、それぞれ、Ｌチャンネルの１フレーム分の音声データ、及びＲチャンネルの１フレーム分の音声データである。また、式（２−１）及び式（２−２）の変数ｎはサンプル数である。 S_L in Expression (2-1) and s_R in Expression (2-2) are audio data for one frame of the L channel and audio data for one frame of the R channel, respectively. Moreover, the variable n of Formula (2-1) and Formula (2-2) is the number of samples.

一方、処理対象のフレームがＶＡＤ≠１である場合（ステップＳ５０２；ＮＯ）、パワー算出部１２２Ａは、処理対象のフレームのパワーを０とし（ステップＳ５０４）、演算結果保持部１９２に保持させる。 On the other hand, when the frame to be processed is VAD ≠ 1 (step S502; NO), the power calculation unit 122A sets the power of the frame to be processed to 0 (step S504) and causes the calculation result holding unit 192 to hold it.

ステップＳ５０３又はＳ５０４の処理を終えると、方向識別部１２２は、次に、処理対象のフレームを含む音声データにおける発話検知遅延長d_frame分のフレームのうちのＶＡＤ＝１のフレームについての平均パワーを算出する（ステップＳ５０５）。ステップＳ５０５の処理は、方向識別部１２２の平均パワー算出部１２２Ｂが行う。平均パワー算出部１２２Ｂは、演算結果保持部１９２で保持している発話検知遅延長d_frame分のフレームのうちのＶＡＤ＝１のフレームについてのパワーを読み出し平均パワーを算出する。処理対象のフレームがステレオマイク２から取得した２個の音声データのうちのＬチャンネルの音声データのフレームである場合、パワー算出部１２２Ａは、例えば、下記式（３−１）により平均パワーAvePow_Lを算出する。また、処理対象のフレームがステレオマイク２から取得した２個の音声データのうちのＲチャンネルの音声データのフレームである場合、パワー算出部１２２Ａは、例えば、下記式（３−２）により平均パワーAvePow_Rを算出する。 When the process of step S503 or S504 is completed, the direction identification unit 122 calculates the average power for the frame of VAD = 1 among the frames for the speech detection delay length d_frame in the audio data including the processing target frame. (Step S505). The process of step S505 is performed by the average power calculation unit 122B of the direction identification unit 122. The average power calculation unit 122B calculates the average power by reading the power for the frame of VAD = 1 among the frames corresponding to the speech detection delay length d_frame held by the calculation result holding unit 192. When the processing target frame is a frame of L channel audio data out of the two audio data acquired from the stereo microphone 2, the power calculation unit 122A calculates the average power AvePow_L by the following equation (3-1), for example. calculate. When the processing target frame is a frame of R channel audio data out of the two audio data acquired from the stereo microphone 2, the power calculation unit 122A, for example, calculates the average power by the following equation (3-2). Calculate AvePow_R.

式（３−１）及び式（３−２）における値Ｍは、発話検知遅延長d_frame個のフレームのうちのＶＡＤ＝１であるフレームの数である。 The value M in the equations (3-1) and (3-2) is the number of frames with VAD = 1 among the frames with the speech detection delay length d_frame.
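
Steps S503 to S505 can be illustrated with a short sketch. Equations (2-1) to (3-2) are not reproduced in this text, so the formulas below are assumptions: frame power is taken as the sum of squared samples over the frame, and average power as the mean of the frame powers over the M voiced frames. The function names are hypothetical.

```python
def frame_power(samples, vad):
    """Power of one frame (S503), or 0 for an unvoiced frame (S504).

    Assumed form: sum of squared samples over the frame.
    """
    if vad != 1:
        return 0.0
    return float(sum(s * s for s in samples))


def average_power(powers, vads):
    """Average power over the voiced frames only (S505).

    `powers` and `vads` cover the d_frame frames of one channel;
    M is the number of frames among them with VAD = 1.
    """
    voiced = [p for p, v in zip(powers, vads) if v == 1]
    m = len(voiced)
    return sum(voiced) / m if m else 0.0
```

The same two functions would be applied separately to the L-channel and R-channel frames to obtain AvePow_L and AvePow_R.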

ステップＳ５０５の処理を終えると、現在の処理対象であるフレームに対する処理が終了し、方向識別部１２２が行う処理は、ループ処理の終了端（ステップＳ５０６）にいたる。ここで、未処理のフレームがある場合、方向識別部１２２（パワー算出部１２２Ａ及び平均パワー算出部１２２Ｂ）は、次のフレームについてのステップＳ５０２以降の処理を開始する。また、全てのフレームについてのステップＳ５０２以降の処理を行った場合、方向識別部１２２は、ループ処理を終了し、次に、ループ処理（ステップＳ５０１〜Ｓ５０６）で算出した平均パワーに基づいて音圧差を算出する（ステップＳ５０７）。ステップＳ５０７の処理は、方向識別部１２２の音圧差算出部１２２Ｃが行う。処理対象のフレームの組がステレオマイク２から取得したＬチャンネルの音声データのフレームとＲチャンネルの音声データのフレームとである場合、音圧差算出部１２２Ｃは、上記の式（１）により音圧差DiffPowを算出する。 When the processing of step S505 is finished, the processing for the current frame ends, and the processing performed by the direction identification unit 122 reaches the loop end (step S506). If there is an unprocessed frame, the direction identification unit 122 (the power calculation unit 122A and the average power calculation unit 122B) starts the processing from step S502 onward for the next frame. When the processing from step S502 onward has been performed for all frames, the direction identification unit 122 ends the loop and then calculates the sound pressure difference based on the average power calculated in the loop (steps S501 to S506) (step S507). The process of step S507 is performed by the sound pressure difference calculation unit 122C of the direction identification unit 122. When the set of frames to be processed consists of an L-channel audio data frame and an R-channel audio data frame acquired from the stereo microphone 2, the sound pressure difference calculation unit 122C calculates the sound pressure difference DiffPow by equation (1) above.
次に、方向識別部１２２は、算出した音圧差DiffPowと、閾値ＴＨ３とに基づいて、発話している話者の方向を識別する（ステップＳ５０８）。ステップＳ５０８の処理は、話者方向判定部１２２Ｄが行う。式（１）により音圧差DiffPowを算出する場合、AvePow_L＞AvePow_RであるとDiffPow＞０となり、AvePow_L＜AvePow_RであるとDiffPow＜０となる。よって、閾値ＴＨ３＞０の場合、話者方向判定部１２２Ｄは、ＴＨ３＜DiffPowであれば発話者はマイク左方であると識別し、DiffPow＜−ＴＨ３であれば発話者がマイク右方であると識別する。 Next, the direction identification unit 122 identifies the direction of the speaker who is speaking based on the calculated sound pressure difference DiffPow and the threshold TH3 (step S508). The process of step S508 is performed by the speaker direction determination unit 122D. When DiffPow is calculated by equation (1), DiffPow > 0 if AvePow_L > AvePow_R, and DiffPow < 0 if AvePow_L < AvePow_R. Therefore, with a threshold TH3 > 0, the speaker direction determination unit 122D identifies the speaker as being to the left of the microphone if TH3 < DiffPow, and to the right of the microphone if DiffPow < −TH3.
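
The threshold decision of step S508 can be sketched as follows. Equation (1) is not reproduced in this part of the text, so the plain difference AvePow_L − AvePow_R used here is an assumption that merely matches the stated signs (the actual equation may, for instance, use a logarithmic scale). The function name and the returned labels are hypothetical.

```python
def identify_direction(ave_pow_l, ave_pow_r, th3):
    """Classify the speaker direction from the two average powers (S507, S508).

    th3 is the threshold TH3 and is expected to be positive.
    """
    diff_pow = ave_pow_l - ave_pow_r   # assumed form of equation (1)
    if diff_pow > th3:
        return "left"                  # TH3 < DiffPow
    if diff_pow < -th3:
        return "right"                 # DiffPow < -TH3
    return "front"                     # direction cannot be determined
```

The dead band between −TH3 and TH3 is what makes step S6 answer NO when neither microphone is clearly louder.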

ステップＳ５０８の処理を終えると、方向識別部１２２は、話者の方向の識別結果を言語切替部１２３に送信して方向識別処理を終了する。 When the process of step S508 is completed, the direction identification unit 122 transmits the speaker direction identification result to the language switching unit 123 and ends the direction identification process.

このように、本実施形態に係る方向識別部１２２は、発話を検知したフレームと、発話検知遅延長d_frameとに基づいて定まる発話区間に含まれるフレームのうちの、有音であるフレームの特性（波形）のみに基づいて、発話者の方向を識別する。このため、発話区間内に振幅の小さいフレーム（ＶＡＤ≠１のフレーム）が含まれる場合に、方向識別部１２２は、該振幅の小さいフレームや信号対ノイズ比が小さいフレームを除外して発話者の方向を識別することが可能となる。したがって、本実施形態によれば、発話区間に含まれるＶＡＤ≠１のフレームによる発話者の方向の識別の誤りを低減することが可能となる。 In this way, the direction identification unit 122 according to the present embodiment identifies the direction of the speaker based only on the characteristics (waveforms) of the voiced frames among the frames included in the utterance section, which is determined by the frame at which the utterance was detected and the utterance detection delay length d_frame. Consequently, when the utterance section contains frames with small amplitude (frames with VAD ≠ 1), the direction identification unit 122 can identify the direction of the speaker while excluding those small-amplitude frames and frames with a low signal-to-noise ratio. According to the present embodiment, errors in identifying the direction of the speaker caused by frames with VAD ≠ 1 in the utterance section can therefore be reduced.

図８は、第１の音声処理部が行う処理を説明する図である。 FIG. 8 is a diagram illustrating processing performed by the first speech processing unit.
図８の（ａ）には、ステレオマイク２から取得する音声データ６の例と、音声データ保持部１９１に保持させる際のデータの処理単位（フレーム）とを説明する図を示している。ステレオマイク２から収音する音声データ６は、例えば、Pulse Code Modulation（ＰＣＭ）形式の音声データである。この場合、音声データ６は、Ｌチャンネルのマイク２０１で収音した１フレーム分の音声データと、Ｒチャンネルのマイク２０２で収音した１フレーム分の音声データとを１組のフレームデータとして、ステレオマイク２から出力される。本実施形態の音声処理装置１における第１の音声処理部１１０では、音声データ６を、同時刻に収音されたＬチャンネルのフレームとＲチャンネルのフレームとの組のフレームデータ６０１，６０２，６０３，・・・に分割する。 Part (a) of FIG. 8 shows an example of the audio data 6 acquired from the stereo microphone 2 and the data processing unit (frame) used when the audio data holding unit 191 holds it. The audio data 6 picked up by the stereo microphone 2 is, for example, audio data in Pulse Code Modulation (PCM) format. In this case, the stereo microphone 2 outputs the audio data 6 as sets of frame data, each set consisting of one frame of audio data picked up by the L-channel microphone 201 and one frame of audio data picked up by the R-channel microphone 202. The first speech processing unit 110 of the speech processing apparatus 1 of the present embodiment divides the audio data 6 into frame data 601, 602, 603, and so on, each being a set of an L-channel frame and an R-channel frame picked up at the same time.
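
As a rough illustration of this framing step, the sketch below pairs same-time L-channel and R-channel frames. It assumes the two channels are already available as separate sample sequences and that a trailing partial frame is discarded. Neither detail is specified in the text, and the function name is hypothetical.

```python
def split_into_frame_pairs(left, right, frame_len):
    """Return a list of (L frame, R frame) pairs of frame_len samples each."""
    count = min(len(left), len(right)) // frame_len
    pairs = []
    for i in range(count):
        start = i * frame_len
        end = start + frame_len
        # Frames with the same index were picked up at the same time.
        pairs.append((left[start:end], right[start:end]))
    return pairs
```

Each returned pair plays the role of one set of frame data such as 601, 602, or 603.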

図８の（ｂ）には、音声データ保持部１９１に保持させるフレームデータの量を説明する図を示している。例えば、音声データ保持部１９１には、Ｄ_ｍａｘフレーム分のフレームデータ（ＬチャンネルのフレームとＲチャンネルのフレームとの組）を保持する。ここで、保持するフレーム数の最大値Ｄ_ｍａｘは、発話検知遅延長d_frameとして設定し得る値以上とする。また、第１の音声処理部１１０の保持データ制御部１１２は、新たなフレーム（ｈ＋Ｄ_ｍａｘ番目のフレーム）６１４を音声データ保持部１９１に保持させる際に、既に保持させているフレーム６１１〜６１３のうちの最も早い（最も過去の）フレーム（ｈ番目のフレーム）６１１を削除する。 Part (b) of FIG. 8 illustrates the amount of frame data held in the audio data holding unit 191. For example, the audio data holding unit 191 holds D_max frames of frame data (sets of an L-channel frame and an R-channel frame). The maximum number of held frames, D_max, is set to be no smaller than any value that can be set as the utterance detection delay length d_frame. When the holding data control unit 112 of the first speech processing unit 110 has the audio data holding unit 191 hold a new frame (the (h + D_max)-th frame) 614, it deletes the earliest (oldest) frame (the h-th frame) 611 among the frames 611 to 613 already held.
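
The bounded holding scheme of FIG. 8(b) behaves like a ring buffer. A minimal sketch, assuming Python's `collections.deque` with `maxlen` as the container, is shown below. The class name `FrameStore` and its methods are hypothetical and are not part of the described apparatus.

```python
from collections import deque


class FrameStore:
    """Holds at most d_max frame pairs, discarding the oldest on overflow."""

    def __init__(self, d_max):
        self._frames = deque(maxlen=d_max)
        self._next_index = 0

    def push(self, pair):
        # Store the frame pair with its running index so that the
        # time order of the held frames stays identifiable.
        self._frames.append((self._next_index, pair))
        self._next_index += 1

    def last(self, count):
        """Return the most recent `count` entries, oldest first."""
        if count <= 0:
            return []
        return list(self._frames)[-count:]
```

Because D_max is at least as large as any d_frame, `last(d_frame)` can return every frame needed when going back from the detection frame.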

なお、図８の（ａ）に示した音声データ６は、第１の音声処理部１１０が取得する音声データの一例に過ぎない。第１の音声処理部１１０が取得する音声データは、Ｌチャンネルのマイク２０１で収音した音声データと、Ｒチャンネルのマイク２０２で収音した音声データとが独立していてもよい。また、図８の（ｂ）に示した音声データの保持方法は、一例に過ぎない。第１の音声処理部１１０は、取得した音声データにおける先頭のフレームから、全てのフレームを音声データ保持部１９１に順次保持させてもよい。 Note that the audio data 6 shown in (a) of FIG. 8 is merely one example of the audio data acquired by the first speech processing unit 110. In the audio data acquired by the first speech processing unit 110, the audio data picked up by the L-channel microphone 201 and the audio data picked up by the R-channel microphone 202 may be independent of each other. Likewise, the method of holding audio data shown in (b) of FIG. 8 is merely one example. The first speech processing unit 110 may instead have the audio data holding unit 191 hold all frames sequentially, starting from the first frame of the acquired audio data.

FIG. 9 is a diagram illustrating an example of a method for determining whether a frame is voiced.
FIG. 9 illustrates Voice Activity Detection (VAD), which is one method for determining whether a frame in audio data is a voiced frame. In the time waveform 701 of the audio data shown in FIG. 9, the section from time t0 to t1 and the section from time t4 to t5 are non-speech sections (VAD = 0) that contain no significant speech. In contrast, the section from time t1 to t4 is a speech section (VAD = 1) that contains significant speech.

Looking at the time waveform 701A of the section from time t2 to t3 within the speech section of the audio data, the waveform exhibits periodicity, for example, as shown in FIG. 9.

VAD uses the above features of the waveform in a speech section to determine whether the frame to be processed is a voiced frame. That is, in the process of step S202, the utterance detection unit 121 of the speech processing device 1 according to the present embodiment determines whether the frame to be processed is voiced according to, for example, whether the frame satisfies both Condition 1 and Condition 2 below.
(Condition 1) The fluctuation range of the amplitude level is larger than a threshold.
(Condition 2) The time waveform has periodicity.
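The two conditions can be sketched as follows. This is an illustration under stated assumptions only: the description does not say how the fluctuation range or the periodicity is measured, so the peak-to-peak range, the normalized autocorrelation test, the function name `is_voiced`, and all threshold values are hypothetical choices.

```python
def is_voiced(frame, amp_threshold=0.1, corr_threshold=0.5, min_lag=2):
    """Return True if the frame satisfies both Condition 1 and Condition 2."""
    # Condition 1 (assumed measure): peak-to-peak amplitude range above a threshold.
    if max(frame) - min(frame) <= amp_threshold:
        return False

    # Condition 2 (assumed measure): a strong normalized autocorrelation peak
    # at some non-trivial lag indicates a periodic waveform.
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return False
    n = len(x)
    best = 0.0
    for lag in range(min_lag, n // 2 + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        best = max(best, corr)
    return best > corr_threshold
```

A sine-like frame passes both tests, while a silent frame fails Condition 1 and a single isolated click fails Condition 2.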

FIG. 10 is a diagram illustrating an example of a result of the utterance detection process.
When the second speech processing unit 120 of the speech processing device 1 of the present embodiment performs the utterance detection process on audio data having the waveform 702 shown in FIG. 10, a result such as table 502 is obtained, for example. In table 502, FN is the frame number identifying the frame to be processed, and VAD is a value indicating whether each frame is voiced. vad_count is the value of a counter that counts voiced frames, and n_vad_count is the value of a counter that counts unvoiced frames. Further, f_speech is a flag indicating whether detection of an utterance has been confirmed, and d_frame is a value indicating the utterance detection delay length.

The position and width of each cell of the frame number FN in table 502 correspond to the sections delimited by broken lines in the waveform 702. The amplitude of the waveform in the frame with frame number FN = h is very small compared to the amplitude in the frames with FN > h. Therefore, in the utterance detection process for frame h, VAD = 0 in step S202. At this time, vad_count and n_vad_count for frame h are, for example, 0 and MA, respectively. Here, MA is, for example, an integer that is 1 or more and smaller than the threshold TH2 used in the determination of step S209. Because frame h has VAD = 0, f_speech = 0 and the utterance detection delay length d_frame is undefined.

Next, the utterance detection unit 121 performs the utterance detection process on the frame with frame number h+1. The waveform in frame h+1 has a larger amplitude than that in frame h, and the amplitude fluctuation shows periodicity. Therefore, frame h+1 has VAD = 1 in step S202. Accordingly, as shown in table 502, the utterance detection unit 121 sets the counters vad_count and n_vad_count to 1 and 0, respectively, in the process for frame h+1 (step S204). If the threshold TH1 used in the determination of step S205 is 4, the flag f_speech for frame h+1 is 0. Meanwhile, the utterance detection delay length d_frame becomes 1.

When the same process is then repeated for frame h+2 and later frames, the result shown in table 502 is obtained. In this result, all frames from frame h+1 onward have VAD = 1, and vad_count increases by 1 with each frame. Therefore, detection of the utterance is confirmed at frame h+4, which is the fourth frame counted from frame h+1 where VAD changed from 0 to 1, and f_speech becomes 1. When the utterance detection process is performed on frame h+4, the utterance detection delay length is d_frame = 4. Accordingly, the second speech processing unit 120 (speech processing device 1) recognizes frame h+1 as the first frame of the utterance section, based on the frame number h+4 of the frame at which detection was confirmed and the utterance detection delay length d_frame = 4. This makes it possible to include in the utterance section the three frames h+1 to h+3 that precede the detection frame h+4, so that the utterance content of those three frames can also be translated.

FIG. 11 is a diagram illustrating another example of a result of the utterance detection process.
When the second speech processing unit 120 of the speech processing device 1 of the present embodiment performs the utterance detection process on audio data having the waveform 703 shown in FIG. 11, a result such as table 503 is obtained, for example. In table 503, FN is the frame number identifying the frame to be processed, and VAD is a value indicating whether each frame is voiced. vad_count is the value of a counter that counts voiced frames, and n_vad_count is the value of a counter that counts unvoiced frames. Further, f_speech is a flag indicating whether an utterance has been detected, and d_frame is a value indicating the utterance detection delay length.

The position and width of each cell of the frame number FN in table 503 correspond to the sections delimited by broken lines in the waveform 703. The amplitude of the waveform in the frame with frame number FN = k is very small compared to the amplitude in the frames with FN > k. Therefore, in the utterance detection process for frame k, VAD = 0 in step S202. At this time, the values of the counters vad_count and n_vad_count in the process for frame k are, for example, 0 and MB, respectively. Because frame k has VAD = 0, the flag f_speech is 0 and the utterance detection delay length d_frame is undefined.

Next, the utterance detection unit 121 performs the utterance detection process on the frame with frame number k+1. The waveform in frame k+1 has a larger amplitude than that in frame k, and the amplitude fluctuation shows periodicity. Therefore, frame k+1 has VAD = 1 in step S202. Accordingly, as shown in table 503, the utterance detection unit 121 sets the counters vad_count and n_vad_count to 1 and 0, respectively, in the process for frame k+1 (step S204). If the threshold TH1 used in the determination of step S205 is 4, the flag f_speech for frame k+1 is 0 and the utterance detection delay length d_frame is 1.

When the threshold TH2 used in the determination of step S209 is then set to 3 and the same process is repeated for frame k+2 and later frames, the result shown in table 503 is obtained. That is, VAD = 1 in frames k+1, k+2, k+5, and k+8, and the counter vad_count reaches 4 at frame k+8. Therefore, in the example shown in FIG. 11, the flag f_speech becomes 1 at frame k+8 and detection of the utterance is confirmed. Because counting of the utterance detection delay length d_frame starts when frame k+1 is processed, d_frame is 8 when frame k+8 is processed. Accordingly, the second speech processing unit 120 (speech processing device 1) recognizes frame k+1 as the first frame of the utterance section, based on the frame number k+8 of the frame in which the utterance was detected and the utterance detection delay length d_frame = 8. This makes it possible to include in the utterance section the three voiced frames k+1, k+2, and k+5 that precede the detection frame k+8, so that the utterance content of those three frames can also be translated.
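The counter behavior described for tables 502 and 503 can be reproduced with a small state machine. This is a minimal sketch under stated assumptions: the flowchart of steps S201 to S209 is not reproduced here, so the reset rule (clearing vad_count and d_frame once n_vad_count reaches TH2), the comparison operators, and the function name `detect_utterance` are hypothetical.

```python
def detect_utterance(vad_flags, th1=4, th2=3):
    """Scan per-frame VAD values and return (detect_index, d_frame, start_index).

    Returns None if no utterance is confirmed.
    """
    vad_count = 0    # voiced frames counted for the current candidate utterance
    n_vad_count = 0  # consecutive unvoiced frames
    d_frame = 0      # frames elapsed since the first voiced frame (delay length)
    for fn, vad in enumerate(vad_flags):
        if vad:
            vad_count += 1
            n_vad_count = 0
        else:
            n_vad_count += 1
            if n_vad_count >= th2:  # assumed reset rule for a long unvoiced run
                vad_count = 0
                d_frame = 0
        if vad_count > 0:
            d_frame += 1            # counts voiced and unvoiced frames alike
        if vad_count >= th1:        # f_speech = 1: detection confirmed
            return fn, d_frame, fn - d_frame + 1
    return None
```

With index 0 standing for frame h, the FIG. 10 sequence `[0, 1, 1, 1, 1]` is confirmed at index 4 with d_frame = 4 and a start index of 1 (frame h+1). With index 0 standing for frame k, the FIG. 11 sequence `[0, 1, 1, 0, 0, 1, 0, 0, 1]` is confirmed at index 8 with d_frame = 8 and a start index of 1 (frame k+1).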

FIG. 12 is a diagram illustrating a method of calculating the average power in the direction identification process.
When the second speech processing unit 120 of the speech processing device 1 of the present embodiment performs the process of calculating the frame power on audio data having the waveform 704 shown in FIG. 12, a result such as table 504 is obtained, for example. In table 504, FN is the frame number identifying the frame to be processed, and VAD is a value indicating whether each frame is voiced. The frame power is a value indicating the power of each frame calculated by the processes of steps S502 to S504 in FIG. 7.

The position and width of each cell of the frame number FN in table 504 correspond to the sections delimited by broken lines in the waveform 704. The amplitude of the waveform in the frames with FN <= j-6 is very small compared to the amplitude in frame j+5. Therefore, in the utterance detection process for frames j-10 to j-6, VAD = 0 in step S202. Similarly, frames j-2 and j-1, whose waveform amplitude is small compared to frames with VAD = 1 such as frame j+5, have VAD = 0. Here, if the thresholds TH1 and TH2 in the utterance detection process are 4 and 3, respectively, the utterance detection unit 121 detects the utterance when it processes frame j. Therefore, the start position of the utterance section is frame j-5.

At this time, the direction identification unit 122 in the second speech processing unit 120 calculates the power by equations (2-1) and (2-2) above only for frames with VAD = 1, and sets the power of frames with VAD = 0 to 0 (steps S502 to S504). The direction identification unit 122 then calculates the average of the powers of the VAD = 1 frames among the d_frame frames, including frame j, that correspond to the utterance detection delay length (step S505). That is, when frame j in the example of FIG. 12 is the processing target, the direction identification unit 122 calculates the average of the powers P1, P2, P3, and P4 calculated from frames j-5, j-4, j-3, and j, respectively. By calculating the average power based only on the power of the VAD = 1 frames among the d_frame frames in this way, it is possible to obtain an average power that excludes the power of unvoiced sections and sections with a low SN ratio within the utterance section.
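The averaging of step S505 can be sketched as follows. Equations (2-1) and (2-2) are not reproduced in this passage, so the mean-square `frame_power` below is an assumed stand-in for them, and both function names are hypothetical.

```python
def frame_power(samples):
    # Assumed stand-in for equations (2-1)/(2-2): mean of squared samples.
    return sum(s * s for s in samples) / len(samples)


def average_voiced_power(frames, vad_flags, d_frame):
    """Average the power of VAD = 1 frames among the newest d_frame frames."""
    if d_frame <= 0:
        return 0.0
    recent = list(zip(frames, vad_flags))[-d_frame:]
    # Frames with VAD = 0 are left out of the average entirely.
    powers = [frame_power(f) for f, vad in recent if vad]
    if not powers:
        return 0.0
    return sum(powers) / len(powers)
```

For the FIG. 12 example with d_frame = 6, only the four voiced frames j-5, j-4, j-3, and j contribute, so the two unvoiced frames j-2 and j-1 do not pull the average down.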

FIG. 13 is a diagram illustrating the relationship between the sound pressure difference and the direction of the speaker.
FIG. 13A schematically shows the relationship between the direction of the sound source (speaker) as seen from the two microphones 201 and 202 included in the stereo microphone 2 and the sound pressure in the audio data output from each microphone.

As shown in FIG. 13A, when the stereo microphone 2 is installed such that the R-channel microphone 202 is positioned to the right of the L-channel microphone 201, the side to the left of the L-channel microphone 201 is the microphone-left side. In this case, the side to the right of the R-channel microphone 202 is the microphone-right side.

When a speaker located on the microphone-left side speaks, the distance from that speaker to the L-channel microphone 201 is shorter than the distance from that speaker to the R-channel microphone 202. Therefore, when a speaker in the region 801 on the microphone-left side speaks, the audio data acquired from the L-channel microphone 201 has higher power than the audio data acquired from the R-channel microphone 202. Conversely, when a speaker in the region 803 on the microphone-right side speaks, the audio data acquired from the R-channel microphone 202 has higher power than the audio data acquired from the L-channel microphone 201. In contrast, when a speaker in the region 802, which lies in the front direction as seen from the microphones 201 and 202, speaks, the difference in power between the two sets of audio data is smaller than when a speaker in region 801 or 803 speaks.

Based on the above, the present embodiment identifies the direction of the speaker who is speaking from the relationship between the sound pressure difference calculated by equation (1) and a threshold, for example, as in table 505 shown in FIG. 13B. The sound pressure difference DiffPow calculated by equation (1) satisfies DiffPow > 0 when a speaker on the microphone-left side is speaking and DiffPow < 0 when a speaker on the microphone-right side is speaking. Table 505 therefore uses a threshold TH3 (> 0) to classify the correspondence between the sound pressure difference and the direction of the speaker into three cases. Table 505 also registers an output value indicating the result of identifying the direction of the speaker. That is, output value 1 is output when the frame being processed is one in which a speaker on the microphone-left side is speaking, and output value -1 is output when the frame being processed is one in which a speaker on the microphone-right side is speaking. Further, when the absolute value of the sound pressure difference is equal to or smaller than the absolute value of the threshold TH3, output value 0 is output, indicating that it could not be identified (specified) whether the speaker is on the microphone-left side or the microphone-right side.
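The three-way classification of table 505 can be sketched as follows. The function name `classify_direction` is hypothetical, and the sound pressure difference is taken as an input because equation (1) is not reproduced in this passage.

```python
def classify_direction(diff_pow, th3):
    """Map a sound pressure difference to the output value of table 505.

    Returns 1 for microphone-left, -1 for microphone-right, and 0 when the
    direction cannot be identified.
    """
    if th3 <= 0:
        raise ValueError("TH3 must be positive")
    if diff_pow > th3:
        return 1
    if diff_pow < -th3:
        return -1
    return 0  # |DiffPow| <= TH3: left or right cannot be determined
```

The boundary values DiffPow = TH3 and DiffPow = -TH3 fall into the undetermined case, matching the "equal to or smaller than" wording above.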

FIG. 14 is a diagram illustrating a method of switching the translation language.
FIG. 14A shows the same diagram as FIG. 3 again. As described above, the speech processing device 1 of the present embodiment acquires audio data from the stereo microphone 2, which includes the L-channel microphone 201 and the R-channel microphone 202. The stereo microphone 2 is installed, for example, at a position between the two speakers 4A and 4B who are having a conversation, as shown in FIG. 14. The stereo microphone 2 is also oriented such that, as seen from the midpoint between the two microphones 201 and 202, the L-channel microphone 201 lies in the direction of the first speaker 4A and the R-channel microphone 202 lies in the direction of the second speaker 4B. Further, assume that the first speaker 4A speaks Japanese and the second speaker 4B speaks English. In this case, the speaker information holding unit 193 stores, for example, the speaker position information 501 shown in FIG. 4.

When the first speaker 4A speaks in this state, the speech processing device 1 of the present embodiment performs the processes described above and determines that the speaker on the microphone-left side is speaking. In this case, based on the speaker position information 501 and the output value indicating that the direction of the speaker is microphone-left (see FIG. 13B), the speech processing device 1 extracts the audio data acquired from the L-channel microphone 201 as the audio data for translation. The translation processing unit 125 of the speech processing device 1 then translates the extracted audio data (utterance section). As described above, the speech processing device 1 of the present embodiment determines for each frame whether it is voiced and, when an utterance is detected in a certain frame, sets the start position of the utterance section by going back into the past from that frame. As a result, the frames immediately after the first speaker 4A starts speaking can be included in the utterance section, which prevents the beginning of the utterance from being cut off at translation time.

As shown in FIG. 14B, when the second speaker 4B starts speaking, the speech processing device 1 of the present embodiment performs the processes described above and determines that the speaker on the microphone-right side is speaking. In this case, based on the speaker position information 501 and the output value indicating that the direction of the speaker is microphone-right, the speech processing device 1 extracts the audio data acquired from the R-channel microphone 202 as the audio data for translation. Here too, the speech processing device 1 sets the start position of the utterance section by going back into the past from the frame in which the utterance of the second speaker 4B was detected. Therefore, even when the speaker changes from the first speaker 4A to the second speaker 4B, the frames immediately after the second speaker 4B starts speaking can be included in the utterance section, which prevents the beginning of the utterance from being cut off at translation time.
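The switching between FIG. 14A and FIG. 14B can be sketched as a lookup keyed by the direction output value. This is an illustration under stated assumptions: the layout of the speaker position information 501 is not given in this passage, so the dictionary `SPEAKER_INFO`, the language codes, the choice of target language, and the function name are all hypothetical.

```python
# Assumed stand-in for speaker position information 501:
# direction output value -> channel to extract and translation language pair.
SPEAKER_INFO = {
    1: {"channel": "L", "source": "ja", "target": "en"},   # microphone-left, speaker 4A
    -1: {"channel": "R", "source": "en", "target": "ja"},  # microphone-right, speaker 4B
}


def select_for_translation(direction, left_frames, right_frames,
                           speaker_info=SPEAKER_INFO):
    """Pick the channel and language pair for the identified direction.

    Returns (frames, source_language, target_language), or None when the
    direction could not be identified (output value 0).
    """
    info = speaker_info.get(direction)
    if info is None:
        return None
    frames = left_frames if info["channel"] == "L" else right_frames
    return frames, info["source"], info["target"]
```

The frames passed in would be those from the utterance start position onward, so that the head of the utterance is part of what gets translated.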

Furthermore, in the present embodiment, when identifying the direction of the speaker, the device goes back into the past from the frame in which the utterance was detected and identifies the direction based on the average power of the voiced frames (frames with VAD = 1) among the frames included in the utterance section. That is, sections with small amplitude and sections with a low SN ratio within the utterance section are excluded when identifying the direction of the speaker. This reduces errors in identifying the direction of the speaker that would otherwise arise when the utterance section includes sections in which the speaker is not speaking.

Note that the flowcharts of FIGS. 5, 6A, 6B, and 7 are merely examples of the processing performed by the second speech processing unit 120 in the speech processing device 1 of the present embodiment. The processing performed by the second speech processing unit 120 according to the present embodiment is not limited to the procedures of FIGS. 5, 6A, 6B, and 7 and can be changed as appropriate. For example, the process of step S202 for determining whether a frame is voiced is not limited to the VAD described above; whether the frame is voiced may instead be determined based on the power or SN ratio of the frame to be processed.

The combination of source language and target language in the translation processing performed by the speech processing device 1 according to the present embodiment is not limited to the combination of English and Japanese described above and can be changed as appropriate. The combination of source language and target language may also be selected from among a plurality of combinations.

Furthermore, the audio data acquired by the speech processing device 1 according to the present embodiment is not limited to two-channel audio data collected by the stereo microphone 2 and may be audio data collected by two independent monaural microphones. The direction of the speaker as seen from the microphones is not limited to the left-right direction described above and may be, for example, the front-rear direction.

Instead of outputting the translation result to the display device 3, the speech processing device 1 according to the present embodiment may output the translation result as audio data to a speaker, a receiver, or the like.

In addition, the speech processing device 1 according to the present embodiment identifies the direction of the speaker in an utterance section and then translates the utterance section based on the identification result. However, the speech processing device 1 is not limited to translation and may perform other speech processing based on the result of identifying the direction of the speaker in the utterance section. For example, the speech processing device 1 may perform speech processing that collects and summarizes the utterance content in the audio data for each speaker, based on the result of identifying the direction of the speaker in each utterance section.

<Second Embodiment>
This embodiment describes another example of the speech processing performed by the second speech processing unit 120 in the speech processing device 1 described in the first embodiment. As shown in FIG. 1, the speech processing device 1 according to the present embodiment includes a first speech processing unit 110, a second speech processing unit 120, and a speaker information setting unit 130. The first speech processing unit 110 includes an audio data acquisition unit 111 and a retained data control unit 112. The second speech processing unit 120 includes an utterance detection unit 121, a direction identification unit 122, a language switching unit 123, an audio data extraction unit 124, a translation processing unit 125, and an output unit 126. The speech processing device 1 of the present embodiment also includes an audio data holding unit 191, a calculation result holding unit 192, a speaker information holding unit 193, and a translation dictionary 194.

FIG. 15 is a flowchart illustrating the utterance content translation process according to the second embodiment.

As shown in FIG. 15, the processing performed by the second speech processing unit 120 in the speech processing device 1 of the present embodiment is a loop process (steps S1 to S10) that performs the processing from step S2 onward for each set of frames at the same time in the plurality of audio data. The second speech processing unit 120 performs the processing from step S2 onward for all frame sets included in the audio data and ends the loop process when it reaches the end of the loop (step S10).

In the utterance content translation process of the present embodiment, when an utterance is detected in the processes of steps S2 and S3 (step S3; YES), the direction identification process (step S5) is performed without performing the determination of step S4. That is, whenever an utterance is detected, the direction identification process is executed each time, regardless of whether the utterance is continuing, to identify the direction of the speaker in the frame set being processed. When the direction of the speaker is identified (step S6; YES), the second speech processing unit 120 executes the processes of steps S7 to S9 described in the first embodiment and outputs the translation result for the acquired audio data to the display device 3 or the like.
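The difference between the two embodiments can be illustrated by listing the frame sets at which direction identification would run. This is a simplified sketch under stated assumptions: "utterance continuing" (step S4) is modeled here as "an utterance was also detected in the previous frame set", which may differ from the actual determination in FIG. 5, and the function name is hypothetical.

```python
def direction_check_frames(detected, every_time):
    """Return the indices of frame sets where direction identification runs.

    detected:   per-frame-set flags, True where an utterance is detected (S3; YES).
    every_time: True models the second embodiment (no S4 check);
                False models the first embodiment (skip while the utterance continues).
    """
    indices = []
    previously_detected = False
    for i, flag in enumerate(detected):
        if flag and (every_time or not previously_detected):
            indices.append(i)
        previously_detected = flag
    return indices
```

Under this model, a change of speaker in the middle of a continuous detected stretch is examined only when `every_time` is True, since direction identification keeps running inside the stretch.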

FIG. 16 is a diagram illustrating an example of translation processing that can occur in the voice processing according to the first embodiment.

The four pieces of voice data 620 to 623 shown in FIG. 16 represent the same section of one piece of voice data, classified from different viewpoints. The voice data 620 at the top shows the actual utterance situation in the acquired voice data. In the acquired voice data 620, the section from time t0 to t1 and the section from time t5 to t6 are non-utterance sections in which no one is speaking. In the voice data 620, the speaker on the right side of the microphone is speaking in the section from time t1 to t3, and the speaker on the left side of the microphone is speaking in the section from time t4 to t5. In the section from time t3 to t4, both speakers are speaking.

The voice data 621 below the voice data 620 shows the utterance section and the non-utterance sections obtained when the utterance detection processing is executed on the voice data 620. That is, the section from time t0 to t1 and the section from time t5 to t6 are non-utterance sections, and the section from time t1 to t5 is an utterance section.

The voice data 622 below the voice data 621 shows the section in which the direction identification processing is executed. In the first embodiment, as shown in FIG. 5, the direction identification processing is performed when an utterance is detected (step S3; YES) and the utterance is not continuing (step S4; NO). Therefore, the direction identification processing is executed in the section from time t1 to t2 (< t3), immediately after the start of the utterance section.

The voice data 623 shown below the voice data 622 indicates the section in which the translation processing is executed. In the first embodiment, as shown in FIG. 5, when the utterance is continuing (step S4; YES), the translation processing is performed with the direction identification processing (step S5) and related steps omitted. Therefore, the voice processing device 1 according to the first embodiment performs the translation processing without changing the combination of the translation source language and the translation destination language from time t1, when the speaker on the right side of the microphone starts speaking, to time t5, when the speaker on the left side of the microphone finishes speaking.

Thus, in the voice processing according to the first embodiment, if a section in which both speakers are speaking lies between two sections with different speaker directions in the voice data, the combination of the translation source language and the translation destination language may be wrong, and an incorrect translation result may be output.
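
The failure case above can be reproduced with a small simulation. The sketch below is a simplification under stated assumptions: each frame carries a hypothetical ground-truth label ("N" for no speech, "R" for the right speaker, "L" for the left speaker, "B" for both), and direction identification is treated as reading that label at the first frame of an utterance. The function name is illustrative only.

```python
from typing import List, Optional


def first_embodiment_directions(timeline: List[str]) -> List[Optional[str]]:
    """Return, per frame, the speaker direction assumed for translation.

    The direction is identified only when an utterance starts and is then
    kept while the utterance continues (step S4; YES skips step S5).
    """
    assumed: List[Optional[str]] = []
    current: Optional[str] = None
    for label in timeline:
        if label == "N":
            # Non-utterance frame: the utterance has ended.
            current = None
        elif current is None:
            # Start of an utterance: identify the direction once.
            current = label if label in ("R", "L") else None
        # Otherwise the utterance is continuing and the direction is kept.
        assumed.append(current)
    return assumed


# Timeline modeled on FIG. 16: right speaker, overlap, then left speaker.
timeline = ["N", "R", "R", "B", "L", "N"]
assumed = first_embodiment_directions(timeline)
```

In this timeline the frame labeled "L" is still handled as if the right speaker were talking, which is the wrong source and destination language combination described in the text.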

In contrast, in the voice processing according to the present embodiment, as described above, the direction identification processing is performed each time to identify the direction of the speaker, regardless of whether the utterance is continuing. Therefore, when the voice processing according to the present embodiment is applied to the voice data 620 in FIG. 16, the translation processing shown in FIG. 17 is performed.

FIG. 17 is a diagram illustrating an example of translation processing by the voice processing according to the second embodiment.
The voice data 620 shown in FIG. 17 shows the actual utterance situation in the acquired voice data. In the acquired voice data 620, the section from time t0 to t1 and the section from time t5 to t6 are non-utterance sections in which no one is speaking. In the voice data 620, the speaker on the right side of the microphone is speaking in the section from time t1 to t3, and the speaker on the left side of the microphone is speaking in the section from time t4 to t5. In the section from time t3 to t4, both speakers are speaking.

The voice data 621 shown below the voice data 620 shows the utterance section and the non-utterance sections obtained when the utterance detection processing is executed on the voice data 620. That is, the section from time t0 to t1 and the section from time t5 to t6 are non-utterance sections, and the section from time t1 to t5 is an utterance section. Up to this point, the processing is the same as that described in the first embodiment.

The voice data 625 shown below the voice data 621 indicates the section in which the direction identification processing is executed. In the second embodiment, as shown in FIG. 15, the direction identification processing is performed each time an utterance is detected (step S3; YES). Therefore, the direction identification processing is executed over the section from the start time t1 to the end time t5 of the utterance section.

The voice data 626 shown below the voice data 625 indicates the sections in which the translation processing is executed. The section from time t1 to t3 in the voice data 626 is a section in which the speaker on the right side of the microphone is speaking. Therefore, the voice processing device 1 of the present embodiment translates the section from time t1 to t3 in the voice data 626 using the language spoken by the speaker on the right side of the microphone as the translation source language. The section from time t4 to t5 in the voice data 626, on the other hand, is a section in which the speaker on the left side of the microphone is speaking. Therefore, the voice processing device 1 of the present embodiment translates the section from time t4 to t5 in the voice data 626 using the language spoken by the speaker on the left side of the microphone as the translation source language.

In contrast, the section from time t3 to t4 is a section in which both speakers are speaking. In this section, the voice processing device 1 of the present embodiment cannot determine whether the speaker is on the left side or the right side of the microphone. Consequently, the section from time t3 to t4 in the voice data 626 is not translated.
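
The resulting split of the utterance section into translated and untranslated parts can be sketched as a grouping of per-frame direction results. As before, this is an assumption-laden illustration: the labels "N", "R", "L", and "B" are hypothetical, with "B" standing for frames in which the direction could not be narrowed to one side, and `translation_segments` is not a name from the disclosure.

```python
from itertools import groupby
from typing import List, Tuple


def translation_segments(timeline: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, direction) for each run of frames
    whose direction was identified as a single side ("R" or "L")."""
    segments: List[Tuple[int, int, str]] = []
    index = 0
    for label, run in groupby(timeline):
        length = len(list(run))
        # Frames with no speech or an unidentified direction are not translated.
        if label in ("R", "L"):
            segments.append((index, index + length, label))
        index += length
    return segments


# Timeline modeled on FIG. 17: right speaker, overlap, then left speaker.
segments = translation_segments(["N", "R", "R", "B", "L", "N"])
```

Each returned segment would be translated with the language associated with its direction, and the overlapping frame falls into neither segment.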

As described above, in the utterance content translation processing according to the present embodiment, even when the utterance continues and the direction of the speaker changes partway through, the combination of the translation source language and the translation destination language can be switched at the timing at which the direction of the speaker changes. In addition, when the direction of the speaker can no longer be identified while the utterance continues, for example because both speakers are talking, the voice processing device 1 according to the present embodiment stops the translation. According to the present embodiment, the direction of the speaker in the utterance section can therefore be identified correctly. As a result, switching of the combination of the translation source language and the translation destination language, and starting and stopping of translation, can be performed at appropriate timings based on the result of identifying the speaker direction.

<Third Embodiment>
The present embodiment describes an example in which the translation processing performed by the voice processing device 1 in the first and second embodiments is instead performed by a server device separate from the voice processing device 1.

FIG. 18 is a block diagram illustrating the functional configuration of the voice processing device and the server device according to the third embodiment.

As shown in FIG. 18, the voice processing device 1 of the present embodiment includes a first voice processing unit 110, a second voice processing unit 120, a speaker information setting unit 130, and a communication unit 140. The first voice processing unit 110 includes a voice data acquisition unit 111 and a retained data control unit 112. The second voice processing unit 120 includes an utterance detection unit 121, a direction identification unit 122, a language switching unit 123, and a voice data extraction unit 124. The voice processing device 1 of the present embodiment also includes a voice data holding unit 191, a calculation result holding unit 192, and a speaker information holding unit 193.

The first voice processing unit 110 and the speaker information setting unit 130 in the voice processing device 1 of the present embodiment each have the functions described in the first embodiment. The second voice processing unit 120 of the present embodiment, in contrast, is the second voice processing unit 120 of the first and second embodiments with the translation processing unit 125 and the output unit 126 omitted. That is, the second voice processing unit 120 of the present embodiment performs the processing described in the first or second embodiment up to and including the extraction of the voice data to be translated.

The communication unit 140 communicates with the server device 9 via a communication network (not shown) such as the Internet. For example, the communication unit 140 transmits to the server device 9 a translation request that includes the voice data extracted by the voice data extraction unit 124, the translation source language, and the translation destination language. The communication unit 140 also outputs the translation result received from the server device 9 to the display device 3 or the like.
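
One way to picture the translation request is as a small record carrying the three items named above. The sketch below is an assumption for illustration only: the text does not specify a wire format, so the JSON encoding, the base64 encoding of the audio bytes, the field names, and the class name `TranslationRequest` are all hypothetical.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class TranslationRequest:
    audio: bytes            # extracted voice data (raw bytes, format assumed)
    source_language: str    # translation source language, e.g. "ja"
    target_language: str    # translation destination language, e.g. "en"

    def to_json(self) -> str:
        """Serialize the request; audio bytes are base64-encoded for JSON."""
        return json.dumps({
            "audio": base64.b64encode(self.audio).decode("ascii"),
            "source_language": self.source_language,
            "target_language": self.target_language,
        })

    @classmethod
    def from_json(cls, text: str) -> "TranslationRequest":
        """Rebuild a request from the JSON produced by to_json."""
        data = json.loads(text)
        return cls(
            audio=base64.b64decode(data["audio"]),
            source_language=data["source_language"],
            target_language=data["target_language"],
        )
```

Sending the language pair with the audio means the server does not need to know anything about microphone layout or speaker direction.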

The server device 9 performs translation processing on the voice data received from the voice processing device 1 and returns the translation result to the voice processing device 1. The server device 9 includes a translation processing unit 910, a communication unit 920, and a translation dictionary 991.

The translation processing unit 910 refers to the translation dictionary 991, converts the voice data into text in the translation source language, and then translates (converts) that text into text in the translation destination language. The translation processing unit 910 and the translation dictionary 991 of the server device 9 correspond, respectively, to the translation processing unit 125 and the translation dictionary 194 in the voice processing device 1 described in the first embodiment.

The communication unit 920 receives translation requests from the voice processing device 1 and returns the results of the translation processing by the translation processing unit 910 to the voice processing device 1.
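
The server-side handling of one request can be sketched as a two-stage pipeline: speech to source-language text, then source-language text to destination-language text. This is a minimal sketch with hypothetical names; the recognizer and translator are passed in as callables because the text does not describe how either stage is implemented.

```python
from typing import Callable


def handle_translation_request(
    audio: bytes,
    source_language: str,
    target_language: str,
    transcribe: Callable[[bytes, str], str],
    translate_text: Callable[[str, str, str], str],
) -> str:
    """Turn received voice data into translated text in two stages."""
    # Stage 1: convert the voice data into a sentence in the source language.
    sentence = transcribe(audio, source_language)
    # Stage 2: translate that sentence into the destination language.
    return translate_text(sentence, source_language, target_language)
```

The returned string corresponds to the translation result that the communication unit 920 sends back to the voice processing device 1.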

Performing the translation processing of the voice data on the server device 9, which is separate from the voice processing device 1, as in the present embodiment reduces the processing load on the voice processing device 1. It also allows the translation processing unit 910 and the translation dictionary 991 to be shared by a plurality of voice processing devices 1. This makes it easy, for example, to maintain the translation dictionary and update it to the latest version. Furthermore, because the voice processing device 1 does not need to hold the translation dictionary, the dictionary does not consume the capacity of the storage unit (not shown) of the voice processing device 1.

Although the present embodiment shows an example in which the server device 9 performs translation processing, the server device 9 is not limited to this and may perform other voice processing that associates voice data with the speaker of that voice data.

The voice processing device 1 according to each of the above embodiments can be realized by a computer and a program executed by the computer. The voice processing device 1 realized by a computer and a program is described below with reference to FIG. 19.

FIG. 19 is a diagram illustrating the hardware configuration of a computer.
As shown in FIG. 19, the computer 15 includes a processor 1501, a main storage device 1502, an auxiliary storage device 1503, an input device 1504, an output device 1505, an input/output interface 1506, a communication control device 1507, and a medium driving device 1508. These elements 1501 to 1508 of the computer 15 are connected to one another by a bus 1510 so that data can be exchanged between them.

The processor 1501 is a Central Processing Unit (CPU), a Micro Processing Unit (MPU), or the like. The processor 1501 controls the overall operation of the computer 15 by executing various programs, including an operating system. The processor 1501 also executes, for example, a voice processing program that includes the utterance content translation processing shown in FIG. 5 or FIG. 15. When the computer 15 is operated as the voice processing device 1 according to the third embodiment, the processor 1501 executes a program in which the translation processing of step S9 in the flowchart of FIG. 5 or FIG. 15 is omitted.

The main storage device 1502 includes a Read Only Memory (ROM) and a Random Access Memory (RAM), neither of which is shown. The ROM of the main storage device 1502 stores in advance, for example, a predetermined basic control program that the processor 1501 reads when the computer 15 starts up. The RAM of the main storage device 1502 is used by the processor 1501 as a working storage area as needed when it executes various programs. The RAM of the main storage device 1502 can be used, for example, as the voice data holding unit 191, the calculation result holding unit 192, and the speaker information holding unit 193 shown in FIG. 1 and FIG. 18.

The auxiliary storage device 1503 is a storage device with a larger capacity than the RAM of the main storage device 1502, for example a Hard Disk Drive (HDD) or a non-volatile memory such as a flash memory (including a Solid State Drive (SSD)). The auxiliary storage device 1503 can be used to store various programs executed by the processor 1501 and various data. For example, it can store the voice processing program that includes the utterance content translation processing shown in FIG. 5 or FIG. 15, or the program in which the translation processing of step S9 in the flowchart of FIG. 5 or FIG. 15 is omitted. The auxiliary storage device 1503 can also be used, for example, as the voice data holding unit 191, the calculation result holding unit 192, and the speaker information holding unit 193 shown in FIG. 1 and FIG. 18. Furthermore, the auxiliary storage device 1503 can be used as a storage unit that stores the translation dictionary 194 shown in FIG. 1.

The input device 1504 is, for example, a keyboard device or a touch panel device. When an operator (user) of the computer 15 performs a predetermined operation on the input device 1504, the input device 1504 transmits the input information associated with that operation to the processor 1501. The input device 1504 can be used, for example, to enter a command to start voice processing or commands for other processing executable by the computer 15, and to set speaker information.

The output device 1505 is, for example, a display device such as a liquid crystal display device, or an audio output device such as a speaker or a receiver. The output device 1505 can be used to display the result of the translation processing (the translated text) or to output it as speech.

The input/output interface 1506 connects the computer 15 to other electronic devices. The input/output interface 1506 includes, for example, a connector conforming to the Universal Serial Bus (USB) standard. The input/output interface 1506 can be used, for example, to connect the computer 15 to the plurality of microphones 201 and 202.

The communication control device 1507 connects the computer 15 to a network such as the Internet and controls communication between the computer 15 and other communication devices via the network. The communication control device 1507 can be used, for example, for communication between the computer 15 and the server device 9 described in the third embodiment.

The medium driving device 1508 reads programs and data recorded on the portable recording medium 16, and writes data stored in the auxiliary storage device 1503 and the like to the portable recording medium 16. A memory card reader/writer supporting one or more standards can be used as the medium driving device 1508, for example. When a memory card reader/writer is used as the medium driving device 1508, a memory card (flash memory) conforming to a standard supported by the reader/writer, such as the Secure Digital (SD) standard, can be used as the portable recording medium 16. A flash memory with a USB standard connector can also be used as the portable recording medium 16. Furthermore, when the computer 15 is equipped with an optical disc drive usable as the medium driving device 1508, the various optical discs recognizable by that drive can be used as the portable recording medium 16. Optical discs usable as the portable recording medium 16 include, for example, a Compact Disc (CD), a Digital Versatile Disc (DVD), and a Blu-ray Disc (Blu-ray is a registered trademark). The portable recording medium 16 can store, for example, the voice processing program that includes the utterance content translation processing shown in FIG. 5 or FIG. 15, or the program in which the translation processing of step S9 in the flowchart of FIG. 5 or FIG. 15 is omitted. The portable recording medium 16 can also be used, for example, as the voice data holding unit 191, the calculation result holding unit 192, and the speaker information holding unit 193 shown in FIG. 1 and FIG. 18. Furthermore, the portable recording medium 16 can be used to store the translation dictionary 194 shown in FIG. 1.

When the operator enters a command to start voice processing into the computer 15 using the input device 1504 or the like, the processor 1501 reads and executes the voice processing program stored in a non-transitory recording medium such as the auxiliary storage device 1503. In this processing, the processor 1501 functions (operates) as the first voice processing unit 110 and the second voice processing unit 120 of the voice processing device 1 in FIG. 1 or FIG. 18. The processor 1501 also acquires voice data from the plurality of microphones 201 and 202 via the input/output interface 1506. While the processor 1501 performs the above voice processing, the RAM of the main storage device 1502, the auxiliary storage device 1503, and the like function as the voice data holding unit 191, the calculation result holding unit 192, and the speaker information holding unit 193 of the voice processing device 1 in FIG. 1 or FIG. 18. When the computer 15 is operated as the voice processing device 1 of the first or second embodiment, the RAM of the main storage device 1502, the auxiliary storage device 1503, and the like also function as a storage unit that stores the translation dictionary 194.

The computer 15 operated as the voice processing device 1 does not need to include all of the elements 1501 to 1508 shown in FIG. 19, and some elements may be omitted depending on the application and conditions. For example, the computer 15 operated as the voice processing device 1 described in the first or second embodiment may omit the communication control device 1507 and the medium driving device 1508. The computer 15 operated as the voice processing device 1 may also be a portable computer with a call function, such as a smartphone.

The following supplementary notes are further disclosed with respect to the embodiments described above.
(Supplementary Note 1)
A voice processing method in which a computer executes processing comprising:
when an utterance is detected, in accordance with a predetermined condition, from a plurality of voice data collected simultaneously by a plurality of microphones, setting a start position of the utterance by going back into the past from the position at which the utterance was detected in the voice data, based on the predetermined condition;
identifying a direction of a speaker as viewed from the plurality of microphones, based on a characteristic difference from the start position of the utterance to the position at which the utterance was detected in each of the plurality of voice data; and
extracting, based on the identified direction of the speaker, an utterance section following the start position of the utterance in one of the plurality of voice data.
(Supplementary Note 2)
The voice processing method according to Supplementary Note 1, wherein, in the processing of identifying the direction of the speaker, the computer identifies the direction of the speaker based on the characteristic difference for a section that contains significant voice within the section of the voice data from the start position of the utterance to the position at which the utterance was detected.
(Supplementary Note 3)
The voice processing method according to Supplementary Note 2, wherein, in the processing of extracting the utterance section of the voice data, the computer extracts the utterance section in the voice data, among the plurality of voice data, that is associated with the identified direction of the speaker, based on speaker position information that associates the voice data with the direction of the speaker as viewed from the plurality of microphones.
(Supplementary Note 4)
The voice processing method according to Supplementary Note 1, wherein, in the processing of setting the start position of the utterance, the computer divides the voice data into a plurality of frames, determines for each frame whether significant voice is included, confirms detection of the utterance based on the number of frames containing significant voice among a plurality of consecutive frames, and sets the frame at which counting of the frames containing significant voice started as the start position of the utterance.
(Supplementary Note 5)
The voice processing method according to Supplementary Note 1, wherein the computer performs the processing of identifying the direction of the speaker each time the utterance is detected.
(Supplementary Note 6)
The voice processing method according to Supplementary Note 1, wherein the computer acquires, as the plurality of voice data, two pieces of voice data collected by a stereo microphone.
(Appendix 7)
The voice processing method according to Appendix 1, wherein, after extracting the voice data, the computer further executes a process of translating the utterance content of the extracted voice data based on the utterance language associated with the direction of the speaker.
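A minimal Python sketch of choosing the translation languages from the speaker direction, as in Appendix 7, is shown below. It assumes exactly two speakers and a table that associates each direction with the language spoken from it; the table contents, the language codes, and the function name are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical speaker information: direction -> language spoken from that direction.
SPEAKER_LANGUAGE: Dict[str, str] = {"left": "ja", "right": "en"}


def select_languages(direction: str) -> Tuple[str, str]:
    """Return (source_language, target_language) for an utterance from `direction`.

    Two-speaker assumption: the target is the language of the other direction.
    """
    source = SPEAKER_LANGUAGE[direction]
    target = next(lang for d, lang in SPEAKER_LANGUAGE.items() if d != direction)
    return source, target
```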
(Appendix 8)
The voice processing method according to Appendix 1, wherein, after extracting the voice data, the computer further executes:
a process of transmitting a translation request, including the extracted voice data, the translation source language, and the translation target language, to a server device capable of executing translation processing on the voice data; and
a process of receiving, from the server device, the result of the translation processing on the voice data and outputting the result.
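The translation request of Appendix 8 carries three items: the extracted voice data, the source language, and the target language. The Python sketch below only assembles such a request; the JSON encoding, the Base64 representation of the voice data, and the field names are assumptions for illustration, since the specification does not state a wire format, and no network transfer is shown.

```python
import base64
import json


def build_translation_request(voice_data: bytes, source_lang: str, target_lang: str) -> str:
    """Serialize the three items of a translation request as a JSON string.

    Field names are hypothetical; binary voice data is Base64-encoded so that
    it can be carried inside JSON text.
    """
    payload = {
        "voice_data": base64.b64encode(voice_data).decode("ascii"),
        "source_language": source_lang,
        "target_language": target_lang,
    }
    return json.dumps(payload)
```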
(Appendix 9)
A voice processing device comprising:
an utterance detection unit that, when an utterance is detected according to a predetermined condition in a plurality of pieces of voice data collected simultaneously by a plurality of microphones, sets a start position of the utterance at a point earlier than the position where the utterance was detected in the voice data, based on the predetermined condition;
a direction identification unit that identifies a direction of a speaker viewed from the plurality of microphones, based on a characteristic difference in the section from the start position of the utterance to the position where the utterance was detected in each of the plurality of pieces of voice data; and
a voice data extraction unit that extracts, based on the identified direction of the speaker, an utterance section following the start position of the utterance from one of the plurality of pieces of voice data.
(Appendix 10)
A voice processing program that causes a computer to execute processing comprising:
when an utterance is detected according to a predetermined condition in a plurality of pieces of voice data collected simultaneously by a plurality of microphones, setting a start position of the utterance at a point earlier than the position where the utterance was detected in the voice data, based on the predetermined condition;
identifying a direction of a speaker viewed from the plurality of microphones, based on a characteristic difference in the section from the start position of the utterance to the position where the utterance was detected in each of the plurality of pieces of voice data; and
extracting, based on the identified direction of the speaker, an utterance section following the start position of the utterance from one of the plurality of pieces of voice data.

DESCRIPTION OF SYMBOLS
1 Voice processing device
2 Stereo microphone
3 Display device
4A, 4B Speaker
9 Server device
15 Computer
16 Portable recording medium
110 First voice processing unit
111 Voice data acquisition unit
112 Held data control unit
120 Second voice processing unit
121 Utterance detection unit
122 Direction identification unit
122A Frame power calculation unit
122B Average power calculation unit
122C Sound pressure difference calculation unit
122D Speaker direction determination unit
123 Language switching unit
124 Voice data extraction unit
125, 910 Translation processing unit
126 Output unit
130 Speaker information setting unit
140, 920 Communication unit
191 Voice data holding unit
192 Calculation result holding unit
193 Speaker information holding unit
194, 991 Translation dictionary
201, 202 Microphone
1501 Processor
1502 Main storage device
1503 Auxiliary storage device
1504 Input device
1505 Output device
1506 Input/output interface
1507 Communication control device
1508 Medium drive device
1510 Bus

Claims

A voice processing method in which a computer executes processing comprising:
when an utterance is detected according to a predetermined condition in a plurality of pieces of voice data collected simultaneously by a plurality of microphones, setting a start position of the utterance at a point earlier than the position where the utterance was detected in the voice data, based on the predetermined condition;
identifying a direction of a speaker viewed from the plurality of microphones, based on a characteristic difference in the section from the start position of the utterance to the position where the utterance was detected in each of the plurality of pieces of voice data; and
extracting, based on the identified direction of the speaker, an utterance section following the start position of the utterance from one of the plurality of pieces of voice data.

The voice processing method according to claim 1, wherein, in the process of identifying the direction of the speaker, the computer identifies the direction of the speaker based on the characteristic difference in the portions that contain significant voice within the section from the start position of the utterance to the position where the utterance was detected in the voice data.

The voice processing method according to claim 2, wherein, in the process of extracting the utterance section of the voice data, the computer extracts the utterance section from the piece of voice data, among the plurality of pieces of voice data, that is associated with the identified direction of the speaker, based on speaker position information in which each piece of voice data is associated with a direction of the speaker viewed from the plurality of microphones.

The voice processing method according to claim 1, wherein, in the process of setting the start position of the utterance, the computer divides the voice data into a plurality of frames, determines for each frame whether it contains significant voice, determines whether an utterance is detected based on the number of frames containing significant voice among a plurality of consecutive frames, and sets the frame at which counting of the frames containing significant voice started as the start position of the utterance.

The voice processing method according to claim 1, wherein the computer performs the process of identifying the direction of the speaker each time an utterance is detected.

The voice processing method according to claim 1, wherein, after extracting the voice data, the computer further executes a process of translating the utterance content of the extracted voice data based on the utterance language associated with the direction of the speaker.

A voice processing device comprising:
an utterance detection unit that, when an utterance is detected according to a predetermined condition in a plurality of pieces of voice data collected simultaneously by a plurality of microphones, sets a start position of the utterance at a point earlier than the position where the utterance was detected in the voice data, based on the predetermined condition;
a direction identification unit that identifies a direction of a speaker viewed from the plurality of microphones, based on a characteristic difference in the section from the start position of the utterance to the position where the utterance was detected in each of the plurality of pieces of voice data; and
a voice data extraction unit that extracts, based on the identified direction of the speaker, an utterance section following the start position of the utterance from one of the plurality of pieces of voice data.

A voice processing program that causes a computer to execute processing comprising:
when an utterance is detected according to a predetermined condition in a plurality of pieces of voice data collected simultaneously by a plurality of microphones, setting a start position of the utterance at a point earlier than the position where the utterance was detected in the voice data, based on the predetermined condition;
identifying a direction of a speaker viewed from the plurality of microphones, based on a characteristic difference in the section from the start position of the utterance to the position where the utterance was detected in each of the plurality of pieces of voice data; and
extracting, based on the identified direction of the speaker, an utterance section following the start position of the utterance from one of the plurality of pieces of voice data.