JP2017097160A

JP2017097160A - Speech processing device, speech processing method, and program

Info

Publication number: JP2017097160A
Application number: JP2015228807A
Authority: JP
Inventors: 旭美梅松; Terumi Umematsu; 亮輔磯谷; Ryosuke Isotani; 祥史大西; Yoshifumi Onishi
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2015-11-24
Filing date: 2015-11-24
Publication date: 2017-06-01

Abstract

PROBLEM TO BE SOLVED: To provide a technique that prevents the speaker of non-target speech and/or the content of that speaker's utterance from being identified from an acoustic signal that includes a target sound, even when the non-target speech overlaps the target sound.

SOLUTION: A speech processing device of the present invention comprises: detection means for detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by the speaker of non-target speech, and specifying time information indicating the start time and end time of the utterance section; and processing means for processing the portion of a second input signal, which is an acoustic signal including the target sound and the speech of the speaker of the non-target speech, that falls within the period corresponding to the specified time information, so that at least one of the utterance content and the speaker cannot be identified from that portion.

SELECTED DRAWING: Figure 15

Description

The present invention relates to a speech processing device, a speech processing method, and a program.

There are techniques that emphasize and detect target speech, and suppress or remove everything else, by means of noise cancellers, beamforming, and the like. One known example is the technique of Patent Document 1.

In the technique described in Patent Document 1, an input signal is converted from the time domain to the frequency domain, a coherence value is obtained based on a signal having a first directivity and a signal having a second directivity, each with a blind spot in a predetermined direction, and a coherence gradient is obtained from that value. The target sound section detection means then determines that a section is a target sound section when the coherence value is larger than a predetermined target sound section determination threshold or the coherence gradient is smaller than a coherence gradient determination threshold, and otherwise determines that it is a non-target sound section. A gain is set according to the determination result.

Patent Document 1: JP 2013-126026 A

The technique described in Patent Document 1 uses the coherence value to determine whether a section is a target speech section (target sound section), that is, a speech section of the target speech. Therefore, when the target sound (target speech) and non-target speech overlap in time, both may be present in the target speech section. With the technique of Patent Document 1, the speaker and utterance content of the non-target speech may thus remain in the target speech section to a degree that a person can recognize.

Consider, for example, a case in which a store manager later listens to a recording to check what a store clerk said while talking with a customer at a store payment counter, or in which the recording is processed and analyzed automatically using speech recognition or a similar technique. In this case, the clerk's voice is the target sound and the customer's voice is the non-target speech. If a speech section is detected with the clerk's voice as the target speech and the clerk and the customer speak at the same time, the customer's voice, which is non-target speech, is mixed into the detected speech section.

When non-target speech remains in, or is mixed into, the speech signal of the speech section to be output to a degree that a person can recognize, a third party who listens to that speech signal can identify the utterance content and the speaker who made the utterance. This is the problem to be addressed.

An object of the present invention is to provide a technique that prevents the speaker of non-target speech and/or the content of that speaker's utterance from being identified from an acoustic signal including a target sound, even when the non-target speech overlaps the target sound.

A speech processing device according to one aspect of the present invention includes: detection means for detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech, and specifying time information representing the start time and end time of the utterance section; and processing means for processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal in the period corresponding to the specified time information, so that at least one of the utterance content and the speaker cannot be identified from that signal.

A speech processing method according to one aspect of the present invention includes: detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech; specifying time information representing the start time and end time of the utterance section; and processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal in the period corresponding to the specified time information, so that at least one of the utterance content and the speaker cannot be identified from that signal.

A computer program that causes a computer to realize the above device or method, and a computer-readable non-transitory recording medium storing that computer program, are also included in the scope of the present invention.

According to the present invention, even when non-target speech overlaps a target sound, the speaker of the non-target speech and/or the content of that speaker's utterance can be prevented from being identified from an acoustic signal including the target sound.

FIG. 1 is a diagram illustrating an example of an application scene of the speech processing device according to the first embodiment of the present invention.
FIG. 2 is a functional block diagram showing an example of the functional configuration of the speech processing device according to the first embodiment of the present invention.
FIG. 3 is a diagram showing an image of the first input signal, the second input signal, and the utterance-section-processed signal.
FIG. 4 is a flowchart showing an example of the operation of the speech processing device according to the first embodiment of the present invention.
FIG. 5 is a functional block diagram showing an example of the functional configuration of the speech processing device according to Modification 1 of the first embodiment of the present invention.
FIG. 6 is a functional block diagram showing an example of the functional configuration of the speech recognition device according to the second embodiment of the present invention.
FIG. 7 is a functional block diagram showing an example of the functional configuration of the speech processing device according to the third embodiment of the present invention.
FIG. 8 is a flowchart showing an example of the operation of the speech processing device according to the third embodiment of the present invention.
FIG. 9 is a functional block diagram showing an example of the functional configuration of the speech processing device according to the fourth embodiment of the present invention.
FIG. 10 is a flowchart showing an example of the operation of the speech processing device according to the fourth embodiment of the present invention.
FIG. 11 is a functional block diagram showing an example of the functional configuration of the speech processing device according to the fifth embodiment of the present invention.
FIG. 12 is a flowchart showing an example of the operation of the speech processing device according to the fifth embodiment of the present invention.
FIG. 13 is a functional block diagram showing an example of the functional configuration of the speech processing device according to the sixth embodiment of the present invention.
FIG. 14 is a flowchart showing an example of the operation of the speech processing device according to the sixth embodiment of the present invention.
FIG. 15 is a functional block diagram showing an example of the functional configuration of the speech processing device according to the seventh embodiment of the present invention.
FIG. 16 is a diagram illustrating an example of the hardware configuration of a computer (information processing device) capable of realizing each embodiment of the present invention.

Next, embodiments of the present invention will be described in detail with reference to the drawings.

<First Embodiment>
First, an example of an application scene of the speech processing device according to the first embodiment of the present invention will be described with reference to FIG. 1. In this embodiment, the input signal to the speech processing device is a sensor signal, that is, a signal output from a sensor such as a microphone or a camera. This embodiment uses a microphone as an example of the sensor. Because a microphone collects sound, the sensor signal it outputs is an acoustic signal.

In this embodiment, the input signal to the speech processing device is assumed to be a signal representing sound collected during a conversation between a clerk and a customer at a store payment counter, as shown in FIG. 1. The description uses one clerk and one customer as an example, but neither is limited to one person. As shown in FIG. 1, the conversation is assumed to take place with the clerk and the customer facing each other across the payment counter. Unless otherwise noted, the other embodiments described below assume the same scene as this embodiment: a conversation between a clerk and a customer at a store payment counter, as shown in FIG. 1.

As shown in FIG. 1, assume that a microphone (second microphone) m2, which mainly collects the voice of the clerk operating a point-of-sale (POS) register on the payment counter, and a microphone (first microphone) m1, which mainly collects the customer's voice, are provided. In FIG. 1, the microphone m1 is placed on the POS register at a position closer to the customer than to the clerk so that it mainly collects the customer's voice. Likewise, the microphone m2 is placed on the POS register at a position closer to the clerk than to the customer so that it mainly collects the clerk's voice.

With the microphones m1 and m2 arranged in this way, the acoustic signal output from the microphone m1 mainly contains the speech signal of the customer's voice. In addition, the speech signal of the clerk's voice is mixed into it, because the clerk's voice also leaks into the microphone m1. Similarly, the acoustic signal output from the microphone m2 mainly contains the speech signal of the clerk's voice, and also contains the speech signal of the customer's voice, because the customer's voice leaks into the microphone m2.

This embodiment assumes that the clerk's utterance content is to be checked. That is, the target sound is the clerk's voice, and the sound source of the target sound is the clerk. As described above, the acoustic signal output from the microphone m2 contains both the speech signal of the clerk's voice and that of the customer's voice. If this acoustic signal were converted into sound and output as is, the output would also contain the customer's voice. To protect the customer's privacy, this embodiment therefore processes those sections of the acoustic signal output from the microphone m2 that contain the speech signal of the customer's voice. In this embodiment, the non-target speech is the customer's voice. Unless otherwise noted, the other embodiments described below likewise treat the clerk's voice as the target sound and the customer's voice as the non-target speech.

A third party may also want to evaluate the clerk's work by listening to sounds other than the clerk's utterances, such as an electronic-money payment sound or an announcement, so the target sound is not limited to speech uttered by the clerk. The following description nevertheless assumes that the target sound is speech uttered by the clerk.

In the following, the acoustic signal output from the microphone m1 and input to the speech processing device is called the first input signal, and the acoustic signal output from the microphone m2 and input to the speech processing device is called the second input signal. Signals collected at the same timing by the microphones m1 and m2 and output from them (the first input signal and the second input signal) are assumed to be associated with each other.

(Speech processing device 100)
The speech processing device according to the first embodiment of the present invention will be described with reference to the drawings.

FIG. 2 is a functional block diagram showing an example of the functional configuration of the speech processing device 100 according to the first embodiment of the present invention. The speech processing device 100 processes an acoustic signal containing a target sound and non-target speech so that the content of the utterance by the speaker of the non-target speech and/or that speaker cannot be identified.

As shown in FIG. 2, the speech processing device 100 includes an utterance section detection unit (detection unit) 101 and an utterance section processing unit (processing unit) 102.

(Utterance section detection unit 101)
The utterance section detection unit 101 receives the first input signal, which is the acoustic signal output from the microphone m1. It performs utterance section detection, that is, it detects the sections of the received first input signal in which a person spoke (utterance sections). An existing speech detection technique is used for this. For example, the unit may use a technique that treats as an utterance section any section of the speech signal (acoustic signal) in which the volume of the corresponding sound is at or above a predetermined value, or a technique that determines utterance sections by comparing features of the input signal with a model of speech features built in advance. Because the first input signal mainly contains the speech signal of the customer's voice, an utterance section detected by the utterance section detection unit 101 is a section containing the speech signal of the customer's voice, that is, of the non-target speech. The number of utterance sections detected by the utterance section detection unit 101 is not limited to one.

The utterance section detection unit 101 specifies, as the time information of an utterance section, start time information representing the start time of the utterance section in the first input signal and end time information representing its end time. It then outputs the specified time information to the utterance section processing unit 102. When multiple utterance sections are detected, the utterance section detection unit 101 specifies time information for all of them. The time information output from the utterance section detection unit 101 is associated with information identifying the first input signal from which it was detected.
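The volume-threshold approach mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of the utterance section detection unit 101: the function name, the frame length, the RMS measure of volume, and the threshold value are all assumptions chosen for illustration.

```python
import math

def detect_utterance_sections(samples, sample_rate, frame_len=160, threshold=0.1):
    """Return a list of (start_time, end_time) pairs in seconds.

    Hypothetical helper: a frame counts as speech when its RMS level is at
    or above `threshold`; consecutive speech frames form one utterance section.
    """
    sections = []
    start = None  # sample index at which the current section began
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold and start is None:
            start = i  # a new utterance section begins here
        elif rms < threshold and start is not None:
            sections.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:  # the signal ended while speech was still active
        sections.append((start / sample_rate, len(samples) / sample_rate))
    return sections

# Toy first input signal: silence, a loud burst, silence (1600 samples per second).
first_input = [0.0] * 320 + [0.5] * 480 + [0.0] * 320
print(detect_utterance_sections(first_input, sample_rate=1600))
```

Each returned pair plays the role of the start time information and end time information of one utterance section. A model-based detector, the other option named in the text, would replace only the per-frame decision.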

(Utterance section processing unit 102)
The utterance section processing unit 102 takes as input the time information output from the utterance section detection unit 101 and the second input signal, which is the acoustic signal output from the microphone m2. For the second input signal associated with the first input signal to which the time information relates, the utterance section processing unit 102 checks the time information of the utterance sections in that first input signal against the second input signal and specifies the corresponding period in the second input signal. As described above, the time information includes the start time information and end time information of an utterance section. The utterance section processing unit 102 specifies, in the second input signal, the time corresponding to the start time represented by the start time information (called the first time) and the time corresponding to the end time represented by the end time information (called the second time). In other words, it specifies the period from the first time to the second time in the second input signal (called the time section), and then specifies the signal of that time section within the second input signal. When the time information output from the utterance section detection unit 101 covers multiple utterance sections, the utterance section processing unit 102 specifies the corresponding time section for each of them and thereby specifies the signals of multiple time sections in the second input signal.
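As a rough illustration of how time information could be mapped onto the second input signal, the following Python sketch converts (start, end) times in seconds into sample index ranges. It assumes that both input signals share a common time origin and that the second input signal has a known sample rate; the function and parameter names are hypothetical.

```python
def time_info_to_sample_ranges(time_info, sample_rate, num_samples):
    """Map (start_time, end_time) pairs in seconds to (first, last) sample
    index ranges in the second input signal. `last` is exclusive.

    Assumption: the first and second input signals share one time origin.
    """
    ranges = []
    for start_time, end_time in time_info:
        first = max(0, round(start_time * sample_rate))         # first time
        last = min(num_samples, round(end_time * sample_rate))  # second time
        if first < last:  # skip sections that fall outside the signal
            ranges.append((first, last))
    return ranges

# Two utterance sections; the second one runs past the end of the signal
# and is clamped to the signal length.
print(time_info_to_sample_ranges([(0.2, 0.5), (0.9, 1.5)], 1600, 1920))
```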

Because the time section is a period during which the customer is speaking, it is within this time section that the second input signal contains the speech signal of the customer's voice. The utterance section processing unit 102 therefore processes the signal of the specified time section in the second input signal, and does not process the signal outside the specified time section. That is, it generates a signal in which only the specified time section of the second input signal has been processed, and treats the generated signal as the utterance-section-processed signal. Methods of processing the acoustic signal include replacing the acoustic signal in the time section with a zero signal, and cutting out the acoustic signal in the time section and outputting the preceding and following acoustic signals joined together without the cut portion.
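The two processing methods named above, zero replacement and cut-and-join, can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the time sections are assumed to be given as (first, last) sample index ranges with `last` exclusive.

```python
def replace_with_zero(signal, sections):
    """Return a copy of `signal` with every time section set to zero."""
    out = list(signal)
    for first, last in sections:
        last = min(last, len(out))  # stay inside the signal
        if first < last:
            out[first:last] = [0.0] * (last - first)
    return out

def cut_out(signal, sections):
    """Return `signal` with every time section removed and the remaining
    parts joined together."""
    out = []
    pos = 0  # next sample index that has not been handled yet
    for first, last in sorted(sections):
        out.extend(signal[pos:first])  # keep the part before the section
        pos = max(pos, last)           # skip the section itself
    out.extend(signal[pos:])           # keep the tail after the last section
    return out

second_input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(replace_with_zero(second_input, [(2, 4)]))
print(cut_out(second_input, [(2, 4)]))
```

Zero replacement preserves the length and timing of the recording, whereas cutting shortens it, so any time stamps that refer to the original second input signal no longer line up after a cut.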

The processing method used by the utterance section processing unit 102 is not limited to these. For example, the acoustic signal in the time section may be replaced with noise such as white noise, or a noise signal representing noise such as white noise may be added to the acoustic signal in the time section at a volume at which a listener cannot identify the original utterance content. In that case, the noise volume may be set to the volume of the sound corresponding to the first input signal in the utterance section detected by the utterance section detection unit 101, or the volume of the sound corresponding to the signal in the time section of the second input signal, multiplied by a predetermined constant. The utterance section processing unit 102 may also apply voice quality conversion to the acoustic signal in the time section of the second input signal. Voice quality conversion here means converting the voice quality of the customer's voice represented by the acoustic signal into the voice quality of a third party different from the customer. In this way, the utterance section processing unit 102 can process the acoustic signal in the time section so that the speaker cannot be identified even if the utterance content can. Even if a third party converts the acoustic signal output from the utterance section processing unit 102 into sound and listens to it, the third party cannot identify the speaker, that is, the customer, from that sound. The speech processing device 100 of this embodiment can therefore output an acoustic signal that protects the customer's privacy.
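The noise-based variants can be sketched in Python as below. The sketch assumes Gaussian white noise and uses the RMS level of the second input signal within the time section as the volume measure; the `gain` value stands in for the predetermined constant and is purely illustrative, since the text does not state a value that makes speech unidentifiable. All names are hypothetical.

```python
import math
import random

def rms(values):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    return math.sqrt(sum(v * v for v in values) / len(values)) if values else 0.0

def mask_with_noise(signal, first, last, gain=3.0, replace=False, seed=0):
    """Add white noise to signal[first:last], or replace that range with it.

    The noise standard deviation is `gain` times the RMS level of the
    section, i.e. the section volume multiplied by a constant.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    level = gain * rms(signal[first:last])
    out = list(signal)
    for i in range(first, min(last, len(out))):
        noise = rng.gauss(0.0, level)
        out[i] = noise if replace else out[i] + noise
    return out

second_input = [0.1] * 10
masked = mask_with_noise(second_input, 3, 7)
print(masked[:3], masked[7:])  # samples outside the time section are unchanged
```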

An example of the utterance-section-processed signal obtained when the utterance section processing unit 102 replaces the acoustic signal of the specified time section in the second input signal with a zero signal will be described with reference to FIG. 3. FIG. 3 shows an image of the first input signal, the second input signal, and the utterance-section-processed signal. The first input signal is shown in the upper row, the second input signal in the middle row, and the utterance-section-processed signal in the lower row. The horizontal axis in each row represents time.

In FIG. 3, t0 represents the start time of the utterance section and t1 represents its end time. The utterance section processing unit 102 therefore specifies the period from t0 to t1 of the second input signal as the time section and processes the acoustic signal in that time section. Because the acoustic signal in the time section is replaced with a zero signal in the example of FIG. 3, the utterance-section-processed signal is the second input signal with the period from t0 to t1 replaced by a zero signal.

Next, the operation of the speech processing device 100 according to this embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the operation of the speech processing device 100 according to this embodiment.

As shown in FIG. 4, the utterance section detection unit 101 first detects an utterance section of an utterance by the customer from the first input signal (step S101). The utterance section detection unit 101 then specifies, as the time information of the utterance section, start time information representing the start time of the detected utterance section and end time information representing its end time (step S102).

Next, the utterance section processing unit 102 specifies, in the second input signal, the time section corresponding to the time information specified in step S102 (step S103). The utterance section processing unit 102 then processes the signal of the specified time section in the second input signal (step S104).

The speech processing device 100 then ends the processing.
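Steps S101 to S104 can be strung together in a single Python sketch. This is one possible reading of the flowchart under stated assumptions, not the device's actual implementation: detection uses a simple peak-level threshold per frame, processing uses zero replacement, and both input signals are assumed to share a sample rate and time origin. The function name and parameters are hypothetical.

```python
def process_utterance_sections(first_input, second_input, sample_rate,
                               frame_len=160, threshold=0.1):
    """Return (time_info, processed_signal) for one pair of input signals."""
    # Step S101: mark frames of the first input signal whose peak level
    # reaches the threshold (assumed stand-in for utterance detection).
    active = [max(abs(x) for x in first_input[i:i + frame_len]) >= threshold
              for i in range(0, len(first_input), frame_len)]

    # Step S102: turn runs of active frames into (start, end) times in seconds.
    time_info, start = [], None
    for k, is_speech in enumerate(active + [False]):  # sentinel closes a final run
        if is_speech and start is None:
            start = k
        elif not is_speech and start is not None:
            end_sample = min(k * frame_len, len(first_input))
            time_info.append((start * frame_len / sample_rate,
                              end_sample / sample_rate))
            start = None

    processed = list(second_input)
    for t0, t1 in time_info:
        # Step S103: locate the corresponding time section in the second input.
        first = round(t0 * sample_rate)
        last = min(round(t1 * sample_rate), len(processed))
        # Step S104: process only that section (here: replace with zeros).
        if first < last:
            processed[first:last] = [0.0] * (last - first)
    return time_info, processed

first_input = [0.0] * 320 + [0.5] * 480 + [0.0] * 320   # customer-side microphone m1
second_input = [0.3] * 1120                              # clerk-side microphone m2
time_info, processed = process_utterance_sections(first_input, second_input, 1600)
print(time_info)
```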

As described above, in the speech processing device 100 according to this embodiment, the utterance section detection unit 101 detects, from the first input signal, an utterance section of an utterance by the speaker of the non-target speech and specifies time information representing the start time and end time of that utterance section. The utterance section processing unit 102 then processes, in the second input signal containing the target sound and the speech of the speaker of the non-target speech, the signal in the period corresponding to the specified time information, so that at least one of the utterance content and the speaker cannot be identified from that signal.

Within the second input signal, the period represented by the time information specified by the utterance section detection unit 101 contains the speech signal of the non-target speech. Because the utterance section processing unit 102 processes the signal in this period, the speaker of the non-target speech and/or the content of that speaker's utterance can no longer be identified from the sound corresponding to the signal in this period. The speech processing device 100 can therefore prevent the speaker of the non-target speech and/or the content of that speaker's utterance from being identified from an acoustic signal including the target sound, even when the non-target speech overlaps the target sound.

また、第１入力信号は、目的音の音源（例えば、店員）よりも、非目的音声の話者（例えば、客）により近い位置にあるマイクロホンｍ１で収集され、出力された音声信号を用いる。これにより、加工対象である非目的音声の話者による発話の発話区間をより正確に求めることができる。また、第２入力信号は、非目的音声の話者（例えば、客）よりも、目的音の音源（例えば、店員）により近い位置にあるマイクロホンｍ２で収集され、出力された音声信号を用いる。これにより、発話区間加工部１０２は、発話区間加工信号に対応する音声に、目的音である店員の音声をより明瞭に含めることができる。 The first input signal is an audio signal collected and output by the microphone m1 located closer to the non-target voice speaker (eg, customer) than the target sound source (eg, store clerk). Thereby, the utterance section of the utterance by the speaker of the non-target voice to be processed can be obtained more accurately. The second input signal is a voice signal that is collected and output by the microphone m2 that is closer to the sound source (for example, a clerk) of the target sound than the speaker (for example, the customer) of the non-target voice. Thereby, the utterance section processing unit 102 can more clearly include the clerk's voice as the target sound in the voice corresponding to the utterance section processing signal.

（第１の実施形態の変形例１） (Modification 1 of the first embodiment)
次に、図５を用いて、本実施形態の変形例１について説明する。図５は本実施形態の変形例１に係る音声処理装置１１０の機能構成の一例を示す機能ブロック図である。なお、説明の便宜上、前述した第１の実施形態で説明した図面に含まれる部材と同じ機能を有する部材については、同じ符号を付し、その詳細な説明を省略する。 Next, Modification 1 of the present embodiment will be described with reference to FIG. 5. FIG. 5 is a functional block diagram illustrating an example of a functional configuration of the speech processing apparatus 110 according to Modification 1 of the present embodiment. For convenience of explanation, members having the same functions as those included in the drawings described in the first embodiment are given the same reference numerals, and detailed descriptions thereof are omitted.

図５に示す通り、音声処理装置１１０は、発話区間検出部１０１と、発話区間加工部１０２と、音響信号保存部１０３とを備える。つまり、変形例１に係る音声処理装置１１０は、実施形態１の音声処理装置１００に、音響信号保存部１０３が加わる構成である。本変形例では、発話区間加工信号を、人が後で聞くことを想定する。 As shown in FIG. 5, the speech processing apparatus 110 includes an utterance section detection unit 101, an utterance section processing unit 102, and an acoustic signal storage unit 103. That is, the sound processing apparatus 110 according to the first modification has a configuration in which the acoustic signal storage unit 103 is added to the sound processing apparatus 100 according to the first embodiment. In this modification, it is assumed that a person listens to the speech segment processing signal later.

発話区間加工部１０２は、発話区間加工信号を、音響信号保存部１０３に格納する。 The utterance section processing unit 102 stores the utterance section processing signal in the acoustic signal storage unit 103.

音響信号保存部１０３には、発話区間加工信号が格納される。音響信号保存部１０３は、例えば、ハードディスクドライブなどの記憶装置によって実現される。 The acoustic signal storage unit 103 stores an utterance section processing signal. The acoustic signal storage unit 103 is realized by a storage device such as a hard disk drive, for example.

音響信号保存部１０３に格納された発話区間加工信号に対し、例えば、音声処理装置１１０の外部装置から再生指示等が送信されると、音声処理装置１１０は、発話区間加工信号を録音信号として出力する。この録音信号を受け取った外部装置は、この録音信号を音声に変換して出力することができる。 When, for example, a device external to the speech processing apparatus 110 sends a playback instruction or the like for the utterance section processing signal stored in the acoustic signal storage unit 103, the speech processing apparatus 110 outputs the utterance section processing signal as a recording signal. The external device that receives this recording signal can convert it into sound and output it.

このように、本実施形態に係る音声処理装置１１０は、発話区間加工信号を音響信号保存部１０３で一旦保存するため、後で、人が聞きたいときに、該発話区間加工信号を録音信号として出力することができる。 As described above, because the speech processing apparatus 110 according to the present embodiment temporarily stores the utterance section processing signal in the acoustic signal storage unit 103, it can output that signal as a recording signal later, whenever a person wants to listen to it.

（第１の実施形態の変形例２） (Modification 2 of the first embodiment)
次に、本実施形態の変形例２について説明する。上述した第１の実施形態では、音声処理装置１００に入力される第１入力信号の一例として、マイクロホンｍ１からの音響信号であることについて説明した。しかしながら、第１入力信号は、音響信号に限定されない。本実施形態の変形例２では、第１入力信号の他の例について説明する。なお、本変形例における音声処理装置１００の機能構成は、図２と同様であるため、説明を省略する。 Next, Modification 2 of the present embodiment will be described. In the first embodiment described above, an acoustic signal from the microphone m1 was described as an example of the first input signal input to the speech processing apparatus 100. However, the first input signal is not limited to an acoustic signal. Modification 2 of the present embodiment describes another example of the first input signal. Note that the functional configuration of the speech processing apparatus 100 in this modification is the same as that in FIG. 2, so its description is omitted.

本変形例では、第１入力信号として、例えば、店内に設置したカメラ等の撮像装置（センサ）によって取得された映像信号を用いる。発話区間検出部１０１は、映像信号から発話区間を検出する。以下、発話区間検出部１０１が映像信号から発話区間を検出する方法について説明する。 In the present modification, for example, a video signal acquired by an imaging device (sensor) such as a camera installed in a store is used as the first input signal. The utterance period detection unit 101 detects an utterance period from the video signal. Hereinafter, a method in which the utterance section detection unit 101 detects the utterance section from the video signal will be described.

まず、発話区間検出部１０１は、カメラから映像信号を受信する。発話区間検出部１０１は、受信した映像信号が表す映像から客を判別する。カメラが固定カメラの場合、受信した映像信号が表す映像における、例えば図１に示すような支払カウンタおよびＰＯＳレジスタなどの位置は固定される。また、図１に示す通り、店員の位置は、ＰＯＳレジスタを操作可能な位置であり、客の位置は、支払カウンタを中心に店員の位置とは反対の位置にある。発話区間検出部１０１は、これらの位置を示す情報を用いて、受信した映像信号が表す映像から客を判別する。なお、上記位置を示す情報は、発話区間検出部１０１内または音声処理装置１００内の図示しない記憶部に格納されていてもよいし、音声処理装置１００の外部から受信してもよい。 First, the utterance section detection unit 101 receives a video signal from a camera. The utterance section detection unit 101 determines a customer from the video represented by the received video signal. When the camera is a fixed camera, the positions of the payment counter and the POS register as shown in FIG. 1, for example, in the video represented by the received video signal are fixed. Further, as shown in FIG. 1, the position of the clerk is a position where the POS register can be operated, and the position of the customer is at a position opposite to the position of the clerk with the payment counter as the center. The utterance section detection unit 101 uses the information indicating these positions to determine a customer from the video represented by the received video signal. The information indicating the position may be stored in a storage unit (not shown) in the utterance section detection unit 101 or the voice processing device 100, or may be received from outside the voice processing device 100.

発話区間検出部１０１は、映像信号を表す映像を用いて、客であると判別した人物に対して、該人物の口を特定する。顔に含まれる部品（この場合は、口）の特定方法は、一般的な画像処理技術を適用するため、本変形例ではその詳細な説明を省略する。そして、発話区間検出部１０１は、映像信号を表す映像から、口の動きを検出し、口が動き始めた時刻から口の動きが止まった時刻までを発話区間として検出する。そして、発話区間検出部１０１は、口が動き始めた時刻を、発話区間の開始時刻とし、口の動きが止まった時刻を、発話区間の終了時刻とする。そして、発話区間検出部１０１は、開始時刻を表す始端時刻情報と、終了時刻を表す終端時刻情報とを発話区間の時間情報として特定する。 The utterance section detection unit 101 uses the video representing the video signal to identify the mouth of the person determined to be a customer. Since a general image processing technique is applied to a method for identifying a part (in this case, the mouth) included in the face, detailed description thereof is omitted in this modification. Then, the utterance section detection unit 101 detects the movement of the mouth from the video representing the video signal, and detects from the time when the mouth starts to the time when the movement of the mouth stops as the utterance section. Then, the utterance section detection unit 101 sets the time when the mouth starts to move as the start time of the utterance section and the time when the movement of the mouth stops as the end time of the utterance section. Then, the utterance section detection unit 101 specifies start time information indicating the start time and end time information indicating the end time as time information of the utterance section.
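
The start/end rule described above can be illustrated with a minimal sketch. It assumes that an upstream image-processing step, which the text leaves to general techniques, has already produced one boolean per video frame saying whether the customer's mouth is moving. The function name `utterance_sections`, the `mouth_moving` list, and the `frame_rate` parameter are hypothetical names introduced only for this illustration.

```python
def utterance_sections(mouth_moving, frame_rate):
    """Convert per-frame mouth-movement flags into (start_s, end_s) pairs.

    mouth_moving: list of booleans, one per video frame (assumed input).
    frame_rate:   frames per second of the video signal (assumed input).
    """
    sections = []
    start = None
    for index, moving in enumerate(mouth_moving):
        if moving and start is None:
            # The mouth starts moving: start time of the utterance section.
            start = index
        elif not moving and start is not None:
            # The mouth has stopped moving: end time of the utterance section.
            sections.append((start / frame_rate, index / frame_rate))
            start = None
    if start is not None:
        # The mouth was still moving when the video ended.
        sections.append((start / frame_rate, len(mouth_moving) / frame_rate))
    return sections


flags = [False, True, True, True, False, False, True, True]
sections = utterance_sections(flags, frame_rate=2)
print(sections)
```

Each returned pair plays the role of the start time information and end time information that make up the time information of one utterance section.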

以上のように、本変形例に係る音声処理装置１００は、第１入力信号が映像信号であっても、上述した第１の実施形態に係る音声処理装置１００と同様の効果を得ることができる。 As described above, the speech processing apparatus 100 according to the present modification can obtain the same effects as the speech processing apparatus 100 according to the first embodiment described above, even when the first input signal is a video signal.

＜第２の実施形態＞ <Second Embodiment>
本発明の第２の実施形態に係る音声認識装置について、図６を用いて説明する。本実施形態では、上述した第１の実施形態に係る音声処理装置１００を音声認識装置に適用した場合の一例を示す。なお、説明の便宜上、前述した第１の実施形態で説明した図面に含まれる部材と同じ機能を有する部材については、同じ符号を付し、その詳細な説明を省略する。 A speech recognition apparatus according to the second embodiment of the present invention will be described with reference to FIG. 6. The present embodiment shows an example in which the speech processing apparatus 100 according to the first embodiment described above is applied to a speech recognition apparatus. For convenience of explanation, members having the same functions as those included in the drawings described in the first embodiment are given the same reference numerals, and detailed descriptions thereof are omitted.

図６は、本実施形態に係る音声認識装置の機能構成の一例を示す機能ブロック図である。図６に示す通り、本実施形態に係る音声認識装置６００は、発話区間検出部１０１と発話区間加工部１０２と音声認識部６０１とを含む。本実施形態に係る音声認識装置６００は、上述した第１の実施形態における音声処理装置１００を内部に含む装置である。 FIG. 6 is a functional block diagram illustrating an example of a functional configuration of the speech recognition apparatus according to the present embodiment. As shown in FIG. 6, the speech recognition apparatus 600 according to the present embodiment includes an utterance section detection unit 101, an utterance section processing unit 102, and a speech recognition unit 601. The speech recognition apparatus 600 according to the present embodiment is an apparatus that includes therein the speech processing apparatus 100 according to the first embodiment described above.

音声認識部６０１は、第２入力信号を入力とする。音声認識部６０１は、第２入力信号に基づいて、音声認識等の音声処理を行う。そして、音声認識部６０１は、音声認識結果を生成する。音声認識部６０１が生成する音声認識結果は、本実施形態ではテキストで表現されるものとする。 The voice recognition unit 601 receives the second input signal. The voice recognition unit 601 performs voice processing such as voice recognition based on the second input signal. Then, the voice recognition unit 601 generates a voice recognition result. In this embodiment, the speech recognition result generated by the speech recognition unit 601 is expressed by text.

例えば、発話区間加工部１０２によって生成された発話区間加工信号を用いて、音声認識を行った場合について説明する。上述したとおり、発話区間加工信号は、第２入力信号の一部が加工された信号である。したがって、この発話区間加工信号に対して音声認識を行った場合、音声認識対象である目的音声が削られてしまう可能性がある。よって、目的音声の音声認識精度が劣化する可能性がある。 For example, the case where speech recognition is performed using the speech segment processing signal generated by the speech segment processing unit 102 will be described. As described above, the speech segment processing signal is a signal obtained by processing a part of the second input signal. Therefore, when speech recognition is performed on this speech segment processing signal, the target speech that is the target of speech recognition may be deleted. Therefore, the voice recognition accuracy of the target voice may be deteriorated.

しかしながら、本実施形態に係る音声認識装置６００は、上述したとおり、発話区間加工信号とは別に、音声認識結果を出力することが可能となる。これにより、音声認識装置６００は、音響信号の加工による劣化を受けない音声認識結果と、非目的音声の話者および／または非目的音声の話者による発話内容が特定できない発話区間加工信号の両方を得ることが可能となる。テキストで表現される音声認識結果には、非目的音声の話者による発話内容が含まれることはありうる。しかしながら、音声認識結果はテキストであるため、この音声認識結果から非目的音声の話者の特定はできない。よって、音声認識装置６００は、非目的話者のプライバシーを保護できる。 However, as described above, the speech recognition apparatus 600 according to the present embodiment can output a speech recognition result separately from the speech segment processing signal. Thereby, the speech recognition apparatus 600 has both a speech recognition result that is not deteriorated due to the processing of the acoustic signal and an utterance section processing signal in which the utterance content by the non-target speech speaker and / or the non-target speech speaker cannot be specified. Can be obtained. Speech recognition results expressed in text may include utterances by non-target speech speakers. However, since the speech recognition result is text, the speaker of the non-target speech cannot be specified from the speech recognition result. Therefore, the speech recognition apparatus 600 can protect the privacy of non-target speakers.

＜第３の実施形態＞ <Third Embodiment>
本発明の第３の実施形態に係る音声処理装置について、図面を参照して説明する。 A speech processing apparatus according to a third embodiment of the present invention will be described with reference to the drawings.

上述した第１の実施形態では、第１入力信号をマイクロホンｍ１から出力された信号とし、第２入力信号をマイクロホンｍ２から出力された信号として説明を行った。しかしながら、マイクロホンの数は、複数であってもよい。例えば、図１に示す支払カウンタおよび／またはＰＯＳ端末上に複数のマイクロホンが設けられていてもよい。本実施形態では、ｎ個（ｎは２以上の自然数）のマイクロホンを用いる場合について説明する。 In the first embodiment described above, the first input signal is described as a signal output from the microphone m1, and the second input signal is described as a signal output from the microphone m2. However, the number of microphones may be plural. For example, a plurality of microphones may be provided on the payment counter and / or the POS terminal shown in FIG. In the present embodiment, a case where n (n is a natural number of 2 or more) microphones are used will be described.

（第１入力信号および第２入力信号について） (First input signal and second input signal)
まず、本実施形態に係る音声処理装置で使用する第１入力信号および第２入力信号について説明する。上述したとおり、本実施形態では複数のマイクロホン（ｍ１〜ｍｎ）を用いて音声を集音する。 First, the first input signal and the second input signal used in the speech processing apparatus according to this embodiment will be described. As described above, in this embodiment, sound is collected using a plurality of microphones (m1 to mn).

複数のマイクロホン（ｍ１〜ｍｎ）の少なくとも何れかから出力された１または複数の音響信号をまとめた信号を、本実施形態では第１入力信号および第２入力信号と呼ぶ。第１入力信号および第２入力信号は、ｎ個のマイクロホン（ｍ１〜ｍｎ）の夫々から出力される音響信号の全てをまとめた信号であっても良い。また、第１入力信号および第２入力信号は、それぞれ選択された１または複数個の音響信号をまとめた信号であってもよい。また、第１入力信号および第２入力信号には、同一のマイクロホンから出力された音響信号が含まれていても良いし、含まれていなくても良い。ただし、第１入力信号には、客の音声（非目的音声）を集音できるマイクロホンから出力された音響信号が少なくとも１つ含まれ、第２入力信号には、店員の音声（目的音）を集音できるマイクロホンから出力された音響信号が少なくとも１つ含まれるものとする。 In this embodiment, a signal in which one or a plurality of acoustic signals output from at least one of the plurality of microphones (m1 to mn) is combined is referred to as a first input signal and a second input signal. The first input signal and the second input signal may be signals obtained by collecting all the acoustic signals output from each of the n microphones (m1 to mn). Further, the first input signal and the second input signal may be signals obtained by collecting one or more selected acoustic signals. The first input signal and the second input signal may or may not include an acoustic signal output from the same microphone. However, the first input signal includes at least one acoustic signal output from a microphone capable of collecting the customer's voice (non-target voice), and the second input signal includes the clerk's voice (target sound). It is assumed that at least one acoustic signal output from a microphone that can collect sound is included.

このように、第１入力信号に客の音声を主に集音できるマイクロホンから出力される音響信号を、第２入力信号に店員の音声を主に集音できるマイクロホンから出力される音響信号を含める。これにより、本実施形態に係る音声処理装置は、より精度よく、第２入力信号における客の音声、つまり非目的音声が含まれる期間の信号を加工することが可能になる。 In this way, the first input signal includes an acoustic signal output from a microphone that mainly collects the customer's voice, and the second input signal includes an acoustic signal output from a microphone that mainly collects the clerk's voice. This allows the speech processing apparatus according to the present embodiment to process, with higher accuracy, the signal of the period in the second input signal that contains the customer's voice, that is, the non-target voice.

（音声処理装置２００） (Speech processing apparatus 200)
図７は、本実施形態に係る音声処理装置２００の機能構成の一例を示す機能ブロック図である。なお、説明の便宜上、前述した第１の実施形態で説明した図面に含まれる部材と同じ機能を有する部材については、同じ符号を付し、その詳細な説明を省略する。 FIG. 7 is a functional block diagram illustrating an example of a functional configuration of the speech processing apparatus 200 according to the present embodiment. For convenience of explanation, members having the same functions as those included in the drawings described in the first embodiment are given the same reference numerals, and detailed descriptions thereof are omitted.

図７に示す通り、本実施形態に係る音声処理装置２００は、音声信号生成部（生成部）２１０と、発話区間検出部１０１と、発話区間加工部１０２とを備えている。本実施形態に係る音声処理装置２００は、上述した第１の実施形態に係る音声処理装置１００に、音声信号生成部２１０を更に備える構成である。また、音声信号生成部２１０は、第１音声信号生成部２０１と、第２音声信号生成部２０２とを含む。 As shown in FIG. 7, the speech processing apparatus 200 according to the present embodiment includes a speech signal generation unit (generation unit) 210, a speech segment detection unit 101, and a speech segment processing unit 102. The audio processing device 200 according to the present embodiment is configured to further include an audio signal generation unit 210 in the audio processing device 100 according to the first embodiment described above. The audio signal generation unit 210 includes a first audio signal generation unit 201 and a second audio signal generation unit 202.

（第１音声信号生成部２０１） (First audio signal generation unit 201)
第１音声信号生成部２０１は、音声処理装置２００に入力された、第１入力信号を受信する。そして、第１音声信号生成部２０１は、受信した第１入力信号から、主として客の音声の音声信号を含む１つの音響信号である第１音声信号を生成する。第１音声信号の生成方法としては、例えば第１入力信号に含まれる、１または複数のマイクロホン（ｍ１〜ｍｎ）から出力された音響信号から、音声強調技術を適用し、非目的音声（客の音声）が強調された音響信号を生成する方法が考えられる。音声強調技術は、一般的なものであればよく、例えば、ビームフォーミングを用いてもよい。この場合、第１音声信号生成部２０１は、第１入力信号に含まれる、１または複数の音響信号に対し、客の方向に対してビームを形成する。そして、第１音声信号生成部２０１は、ビームを形成することによって、１または複数の音響信号から、客の音声の成分が強調された音響信号（第１音声信号）を生成する。この客の方向を示す情報は、予め第１音声信号生成部２０１内の記憶部に格納されていてもよい。上述したとおり、本実施形態では、図１に示すような会話場面を想定している。したがって、客の方向は、マイクロホン（ｍ１〜ｍｎ）に対して、予め定めることができる。このように、第１音声信号生成部２０１は、店内で客が存在すると想定される客の方向に向けてビームを形成することで、精度よく客の音声が強調された音響信号を生成することができる。 The first audio signal generation unit 201 receives the first input signal input to the speech processing apparatus 200. From the received first input signal, it generates a first audio signal, which is a single acoustic signal that mainly contains the customer's voice. One conceivable way to generate the first audio signal is to apply a speech enhancement technique to the acoustic signals that are output from one or more of the microphones (m1 to mn) and included in the first input signal, producing an acoustic signal in which the non-target voice (the customer's voice) is emphasized. Any general speech enhancement technique may be used; for example, beamforming may be used. In this case, the first audio signal generation unit 201 forms a beam in the direction of the customer for the one or more acoustic signals included in the first input signal. By forming the beam, it generates, from the one or more acoustic signals, an acoustic signal in which the customer's voice component is emphasized (the first audio signal). Information indicating the direction of the customer may be stored in advance in a storage unit in the first audio signal generation unit 201. As described above, this embodiment assumes a conversation scene such as the one shown in FIG. 1, so the direction of the customer can be determined in advance relative to the microphones (m1 to mn). By forming a beam toward the direction in which a customer is assumed to be present in the store, the first audio signal generation unit 201 can accurately generate an acoustic signal in which the customer's voice is emphasized.
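
The text does not say which beamforming method is used. As one common choice, a delay-and-sum beamformer can be sketched as below. This is a minimal illustration under stated assumptions, not the method of the embodiment: the per-microphone delays are taken to be known, non-negative integer sample counts derived in advance from the fixed customer direction, and the names `delay_and_sum`, `channels`, and `delays` are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer for integer sample delays.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   arrival delay of the steered source at each microphone,
              in samples (assumed known, non-negative integers).
    Each channel is advanced by its delay so that the steered source
    lines up across microphones, then the channels are averaged.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


pulse = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
mic_a = pulse                 # the source reaches this microphone first
mic_b = [0.0] + pulse[:-1]    # the same pulse arrives one sample later

steered = delay_and_sum([mic_a, mic_b], delays=[0, 1])
unsteered = delay_and_sum([mic_a, mic_b], delays=[0, 0])
print(max(steered), max(unsteered))
```

With the correct delays the pulse adds coherently and keeps its full amplitude, while without steering it is smeared across two samples at half amplitude. Sound arriving from other directions is attenuated in the same way, which is why a beam toward the customer emphasizes the customer's voice.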

また、第１音声信号の生成方法は、これに限定されない。第１音声信号生成部２０１は、第１入力信号に含まれる、１または複数の音響信号のうち、客に一番近いと想定されるマイクロホンが出力した音響信号を、該音響信号に対応する音の音量等に基づいて選択し、選択した音響信号を、第１音声信号としてもよい。この場合、第１音声信号生成部２０１は、選択した音響信号に対し雑音抑圧処理を施し、雑音が抑圧された音響信号を第１音声信号としてもよい。 The method for generating the first audio signal is not limited to this. Of the one or more acoustic signals included in the first input signal, the first audio signal generation unit 201 may select the acoustic signal output by the microphone assumed to be closest to the customer, based on, for example, the volume of the sound corresponding to each acoustic signal, and use the selected acoustic signal as the first audio signal. In this case, the first audio signal generation unit 201 may apply noise suppression processing to the selected acoustic signal and use the noise-suppressed acoustic signal as the first audio signal.
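
A volume-based selection of this kind might look like the following sketch. It assumes that root-mean-square (RMS) amplitude is used as the volume measure and that the examined samples contain mainly the customer's speech. Both are assumptions, since the text only says the selection is based on "volume etc." The names `rms` and `select_loudest` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude, used here as a simple volume measure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_loudest(channels):
    """Return (index, samples) of the channel with the highest RMS volume."""
    index = max(range(len(channels)), key=lambda i: rms(channels[i]))
    return index, channels[index]


channels = [
    [0.1, -0.1, 0.1, -0.1],   # far microphone: quiet
    [0.5, -0.4, 0.6, -0.5],   # microphone nearest the speaker: loud
    [0.0, 0.0, 0.0, 0.0],     # silent microphone
]
chosen_index, chosen_signal = select_loudest(channels)
print(chosen_index)
```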

また、第１音声信号生成部２０１は、第１入力信号に対し、ノイズキャンセラ等の雑音抑圧技術を用いて周囲の雑音を抑圧し、抑圧した信号を第１音声信号としてもよい。また、第１音声信号生成部２０１は、第１入力信号に対し、雑音抑圧技術と音声強調技術との両方を適用しても良い。第１入力信号が１つのマイクロホンが出力した音響信号である場合は、第１音声信号生成部２０１は、第１入力信号に対してスペクトル・サブトラクション法などの１入力に対する雑音抑圧技術を適用しても良い。このように、第１音声信号生成部２０１は、非目的音声である客の音声を強調したり、客の音声以外を抑圧したりすることで、第１音声信号を生成することができる。 The first audio signal generation unit 201 may also suppress ambient noise in the first input signal using a noise suppression technique such as a noise canceller, and use the resulting signal as the first audio signal. It may also apply both a noise suppression technique and a speech enhancement technique to the first input signal. When the first input signal is an acoustic signal output from a single microphone, the first audio signal generation unit 201 may apply a single-input noise suppression technique, such as spectral subtraction, to the first input signal. In this way, the first audio signal generation unit 201 can generate the first audio signal by emphasizing the customer's voice, which is the non-target voice, or by suppressing everything other than the customer's voice.
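
For reference, magnitude spectral subtraction on a single frame can be sketched as follows. This is a simplified illustration, not the procedure of the embodiment: it assumes a per-bin noise magnitude estimate is already available (for example from a speech-free stretch of the signal), uses a naive discrete Fourier transform to stay dependency-free, and omits the windowing and overlap-add that a practical implementation would need. The names `dft`, `idft`, and `spectral_subtract` are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_subtract(frame, noise_magnitude):
    """Subtract a per-bin noise magnitude, keeping the noisy phase."""
    cleaned = []
    for value, noise in zip(dft(frame), noise_magnitude):
        # Floor at zero so that magnitudes never become negative.
        magnitude = max(abs(value) - noise, 0.0)
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)


frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
unchanged = spectral_subtract(frame, [0.0] * len(frame))
silenced = spectral_subtract(frame, [abs(v) for v in dft(frame)])
```

With a zero noise estimate the frame is returned unchanged, and with a noise estimate equal to the frame's own spectrum the output is silent; real noise estimates fall between these two extremes.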

そして、第１音声信号生成部２０１は、生成した第１音声信号を、発話区間検出部１０１に供給する。 Then, the first audio signal generation unit 201 supplies the generated first audio signal to the utterance section detection unit 101.

（第２音声信号生成部２０２） (Second audio signal generation unit 202)
第２音声信号生成部２０２は、音声処理装置２００に入力された、第２入力信号を受信する。そして、第２音声信号生成部２０２は、受信した第２入力信号から、主として店員の音声を含む１つの音響信号である第２音声信号を生成する。第２音声信号の生成方法としては、第１音声信号の生成方法と同様にビームフォーミング等の音声強調技術を適用し、目的音（店員の音声）が強調された音響信号を生成してもよい。この場合、第２音声信号生成部２０２は、第２入力信号に含まれる、１または複数の音響信号に対し、店員の方向に対してビームを形成する。そして、第２音声信号生成部２０２は、ビームを形成することによって、１または複数の音響信号から、店員の音声の成分が強調された音響信号（第２音声信号）を生成する。この店員の方向を示す情報は、予め第２音声信号生成部２０２内の記憶部に格納されていてもよい。上述したとおり、本実施形態では、図１に示すような会話場面を想定している。したがって、店員の方向は、マイクロホン（ｍ１〜ｍｎ）に対して、予め定めることができる。 The second audio signal generation unit 202 receives the second input signal input to the speech processing apparatus 200. From the received second input signal, it generates a second audio signal, which is a single acoustic signal that mainly contains the clerk's voice. As with the first audio signal, the second audio signal may be generated by applying a speech enhancement technique such as beamforming to produce an acoustic signal in which the target sound (the clerk's voice) is emphasized. In this case, the second audio signal generation unit 202 forms a beam in the direction of the clerk for the one or more acoustic signals included in the second input signal. By forming the beam, it generates, from the one or more acoustic signals, an acoustic signal in which the clerk's voice component is emphasized (the second audio signal). Information indicating the direction of the clerk may be stored in advance in a storage unit in the second audio signal generation unit 202. As described above, this embodiment assumes a conversation scene such as the one shown in FIG. 1, so the direction of the clerk can be determined in advance relative to the microphones (m1 to mn).

また、第２音声信号の生成方法は、これに限定されない。第２音声信号生成部２０２は、第２入力信号に含まれる、１または複数の音響信号のうち、店員に一番近いと想定されるマイクロホンが出力した音響信号を、該音響信号に対応する音の音量等に基づいて選択し、選択した信号を、第２音声信号としてもよい。この場合、第２音声信号生成部２０２は、選択した音響信号に対し雑音抑圧処理を施し、雑音が抑圧された音響信号を第２音声信号としてもよい。 The method for generating the second audio signal is not limited to this. Of the one or more acoustic signals included in the second input signal, the second audio signal generation unit 202 may select the acoustic signal output by the microphone assumed to be closest to the clerk, based on, for example, the volume of the sound corresponding to each acoustic signal, and use the selected signal as the second audio signal. In this case, the second audio signal generation unit 202 may apply noise suppression processing to the selected acoustic signal and use the noise-suppressed acoustic signal as the second audio signal.

また、第２音声信号生成部２０２は、第１音声信号生成部２０１と同様に、ノイズキャンセラ等の雑音抑圧技術を用いて、第２音声信号を生成してもよい。 Similarly to the first audio signal generation unit 201, the second audio signal generation unit 202 may generate a second audio signal using a noise suppression technique such as a noise canceller.

また、第１音声信号の生成方法と第２音声信号の生成方法とは同じであってもよいし、異なっていてもよい。たとえば、第１音声信号の生成方法は、音声強調技術を用いる方法を採用し、第２音声信号の生成方法には雑音抑圧技術を用いる方法を採用してもよい。 Further, the first audio signal generation method and the second audio signal generation method may be the same or different. For example, a method using a speech enhancement technique may be employed as the first sound signal generation method, and a method using a noise suppression technique may be employed as the second sound signal generation method.

このような方法により、第２音声信号生成部２０２は、目的音である店員の音声が強調された、または、店員の音声以外が抑圧された第２音声信号を生成することができる。 By such a method, the second audio signal generation unit 202 can generate the second audio signal in which the clerk's voice, which is the target sound, is emphasized or other than the clerk's voice is suppressed.

本実施形態に係る音声処理装置２００における発話区間検出部１０１は、第１音声信号生成部２０１から出力された第１音声信号を入力とし、上述した第１の実施形態に係る音声処理装置１００の発話区間検出部１０１と同様の処理を行う。また、本実施形態に係る音声処理装置２００における発話区間加工部１０２は、第２音声信号生成部２０２から出力された第２音声信号と、発話区間検出部１０１から出力された時間情報とを入力とする。そして、発話区間加工部１０２は、上述した第１の実施形態に係る音声処理装置１００の発話区間加工部１０２と同様の処理を行う。 The utterance section detection unit 101 in the speech processing apparatus 200 according to the present embodiment takes as input the first audio signal output from the first audio signal generation unit 201 and performs the same processing as the utterance section detection unit 101 of the speech processing apparatus 100 according to the first embodiment described above. The utterance section processing unit 102 in the speech processing apparatus 200 takes as input the second audio signal output from the second audio signal generation unit 202 and the time information output from the utterance section detection unit 101. It then performs the same processing as the utterance section processing unit 102 of the speech processing apparatus 100 according to the first embodiment described above.

次に、図８を用いて、本実施形態に係る音声処理装置２００の動作について説明する。図８は、本実施形態に係る音声処理装置２００の動作の一例を示すフローチャートである。 Next, the operation of the speech processing apparatus 200 according to the present embodiment will be described with reference to FIG. 8. FIG. 8 is a flowchart showing an example of the operation of the speech processing apparatus 200 according to the present embodiment.

図８に示す通り、まず、音声信号生成部２１０の第１音声信号生成部２０１が、第１入力信号から第１音声信号を生成する（ステップＳ２０１）。また、音声信号生成部２１０の第２音声信号生成部２０２が第２入力信号から第２音声信号を生成する（ステップＳ２０２）。なお、ステップＳ２０１とステップＳ２０２とは同時に行われてもよいし、逆順で行われてもよい。 As shown in FIG. 8, first, the first audio signal generation unit 201 of the audio signal generation unit 210 generates a first audio signal from the first input signal (step S201). Further, the second audio signal generation unit 202 of the audio signal generation unit 210 generates a second audio signal from the second input signal (step S202). Note that step S201 and step S202 may be performed simultaneously or in reverse order.

その後、発話区間検出部１０１が、第１音声信号から客による発話の発話区間を検出する（ステップＳ２０３）。そして、発話区間検出部１０１は、検出した発話区間の開始時刻を表す始端時刻情報と検出した発話区間の終了時刻を表す終端時刻情報とを発話区間の時間情報として、特定する（ステップＳ２０４）。 Thereafter, the utterance section detection unit 101 detects the utterance section of the utterance by the customer from the first voice signal (step S203). Then, the utterance section detection unit 101 specifies the start time information indicating the start time of the detected utterance section and the end time information indicating the end time of the detected utterance section as time information of the utterance section (step S204).

次に、発話区間加工部１０２が、第２音声信号において、ステップＳ２０４で特定した時間情報に対応する時間区間を特定する（ステップＳ２０５）。そして、発話区間加工部１０２は、第２音声信号における、特定した時間区間の信号を加工する（ステップＳ２０６）。 Next, the speech segment processing unit 102 identifies a time segment corresponding to the time information identified in Step S204 in the second audio signal (Step S205). Then, the utterance section processing unit 102 processes the signal of the specified time section in the second voice signal (step S206).

以上により、音声処理装置２００は処理を終了する。 Thus, the voice processing device 200 ends the process.
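
Steps S203 to S206 can be put together in a compact sketch. It rests on several assumptions that the text does not specify: the first and second audio signals (steps S201 and S202) are taken as already generated and time-aligned at the same sample rate, the utterance section is detected with a simple frame-energy threshold, and "processing" is realized as replacing the section with silence. All function and parameter names are hypothetical.

```python
def detect_sections(first_signal, sample_rate, frame_len, threshold):
    """S203-S204: energy-threshold detection, returns (start_s, end_s) pairs."""
    sections, start = [], None
    n_frames = len(first_signal) // frame_len
    for f in range(n_frames):
        frame = first_signal[f * frame_len:(f + 1) * frame_len]
        active = sum(s * s for s in frame) / frame_len > threshold
        t = f * frame_len / sample_rate
        if active and start is None:
            start = t                      # start time of the utterance section
        elif not active and start is not None:
            sections.append((start, t))    # end time of the utterance section
            start = None
    if start is not None:
        sections.append((start, n_frames * frame_len / sample_rate))
    return sections


def process_sections(second_signal, sample_rate, sections):
    """S205-S206: locate each section in the second signal and silence it."""
    out = list(second_signal)
    for start_s, end_s in sections:
        first = int(start_s * sample_rate)
        last = min(len(out), int(end_s * sample_rate))
        for i in range(first, last):
            out[i] = 0.0
    return out


rate = 4
first = [0.0] * 4 + [0.9, -0.8, 0.9, -0.9] + [0.0] * 4   # customer speaks from 1 s to 2 s
second = [0.3] * 12                                      # clerk-side signal
sections = detect_sections(first, rate, frame_len=2, threshold=0.01)
output = process_sections(second, rate, sections)
print(sections, output)
```

Silencing is the bluntest possible form of processing, because it also removes any target sound that overlaps the section; it is used here only to keep the example short.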

本実施形態に係る音声処理装置２００によれば、上述した第１の実施形態における効果に加え、更に、非目的音声を精度よく特定することができる。なぜならば、第１音声信号生成部２０１が第１入力信号から前記非目的音声を主に含む音声の信号である第１音響信号を生成するからである。これにより、発話区間検出部１０１が第１音響信号から検出する発話区間には、非目的音声に対応する音響信号がより精度よく含まれることになる。したがって、発話区間加工部１０２は、非目的音声がより精度よく含まれる区間の信号に対して加工を行うことができる。 According to the sound processing apparatus 200 according to the present embodiment, in addition to the effects of the first embodiment described above, it is possible to specify non-target sound with high accuracy. This is because the first sound signal generation unit 201 generates a first acoustic signal that is a sound signal mainly including the non-target sound from the first input signal. As a result, the speech segment detected by the speech segment detection unit 101 from the first acoustic signal includes the acoustic signal corresponding to the non-target speech with higher accuracy. Therefore, the utterance section processing unit 102 can perform processing on a signal in a section in which non-target speech is included with higher accuracy.

したがって、発話区間加工部１０２によって加工された、発話区間加工信号に対応する音声は、目的音声がより聞き取りやすく含まれる。このような発話区間加工信号に対して、音声認識処理等を行う場合、音声認識の精度がより高くなる。 Therefore, in the sound corresponding to the utterance section processing signal produced by the utterance section processing unit 102, the target voice is easier to hear. When speech recognition processing or the like is performed on such an utterance section processing signal, recognition accuracy is higher.

なお、第２音声信号は、複数であってもよい。例えば、店員と客との会話場面で、複数人の店員がいる場合、第２音声信号生成部２０２は、店員ごとに第２音声信号を生成してもよい。この場合、発話区間加工部１０２は、複数の第２音声信号の夫々に対して、時間区間を特定し、特定した時間区間の信号を加工してもよい。これにより、発話区間加工部１０２は、複数の店員の夫々に対する、発話区間加工信号を生成することができる。 The second audio signal may be plural. For example, when there are a plurality of salesclerks in a conversation scene between a salesclerk and a customer, the second audio signal generation unit 202 may generate a second audio signal for each salesclerk. In this case, the utterance section processing unit 102 may specify a time section for each of the plurality of second audio signals, and may process the signal of the specified time section. Thereby, the utterance section processing unit 102 can generate an utterance section processing signal for each of a plurality of salesclerks.

＜第４の実施形態＞ <Fourth Embodiment>
本発明の第４の実施形態に係る音声処理装置について、図面を参照して説明する。まず、本実施形態に係る音声処理装置に入力される入力信号について説明する。本実施形態では、第３の実施形態と同様に複数のマイクロホン（ｍ１〜ｍｎ）の少なくとも何れかから出力された１または複数の音響信号をまとめた第１入力信号および第２入力信号が、音声処理装置に入力される。 A speech processing apparatus according to a fourth embodiment of the present invention will be described with reference to the drawings. First, the input signals input to the speech processing apparatus according to this embodiment will be described. In the present embodiment, as in the third embodiment, a first input signal and a second input signal, each combining one or more acoustic signals output from at least one of the plurality of microphones (m1 to mn), are input to the speech processing apparatus.

また、本実施形態に係る音声処理装置には、第３入力信号が更に入力される。この第３入力信号は、センサ信号である。センサ信号は、例えばＰＯＳレジスタ付近の映像を撮影するカメラから出力された映像信号である。 Further, the third input signal is further input to the speech processing apparatus according to the present embodiment. This third input signal is a sensor signal. The sensor signal is, for example, a video signal output from a camera that captures a video near the POS register.

（音声処理装置３００） (Speech processing apparatus 300)
図９は、本実施形態に係る音声処理装置３００の機能構成の一例を示す機能ブロック図である。なお、説明の便宜上、前述した各実施形態で説明した図面に含まれる部材と同じ機能を有する部材については、同じ符号を付し、その詳細な説明を省略する。 FIG. 9 is a functional block diagram illustrating an example of a functional configuration of the speech processing apparatus 300 according to the present embodiment. For convenience of explanation, members having the same functions as those included in the drawings described in the above-described embodiments are given the same reference numerals, and detailed descriptions thereof are omitted.

図９に示す通り、本実施形態に係る音声処理装置３００は、信号選択部（選択部）３０１と、音声信号生成部２１０と、発話区間検出部１０１と、発話区間加工部１０２とを、備えている。本実施形態に係る音声処理装置３００は、上述した音声処理装置２００に、信号選択部３０１を更に備える構成である。また、音声信号生成部２１０は、第１音声信号生成部２０１と、第２音声信号生成部２０２とを含む。 As shown in FIG. 9, the speech processing apparatus 300 according to the present embodiment includes a signal selection unit (selection unit) 301, an audio signal generation unit 210, an utterance section detection unit 101, and an utterance section processing unit 102. That is, the speech processing apparatus 300 according to the present embodiment is the speech processing apparatus 200 described above with a signal selection unit 301 added. The audio signal generation unit 210 includes a first audio signal generation unit 201 and a second audio signal generation unit 202.

（信号選択部３０１） (Signal selection unit 301)
信号選択部３０１は、第１入力信号、第２入力信号、および第３入力信号を受信する。そして、信号選択部３０１は、第３入力信号を用いて、話者（店員、客）がいる位置を推定する。まず、信号選択部３０１は、第３入力信号が表す映像から話者を特定する。そして、信号選択部３０１は、特定した話者が店員か客かの判別を行う。このとき、信号選択部３０１は、例えば、信号選択部３０１内または図示しない記憶部に格納された店員の制服の情報や店員の顔情報を用いて、店員を判別する。そして、信号選択部３０１は、残りの話者を客と判別する。信号選択部３０１が行う、店員の判別方法は、一般的な人物検出技術等を用いるとするため、本実施形態では、詳細な説明を省略する。 The signal selection unit 301 receives the first input signal, the second input signal, and the third input signal. The signal selection unit 301 then uses the third input signal to estimate the positions of the speakers (store clerk and customer). First, the signal selection unit 301 identifies the speakers in the video represented by the third input signal. It then determines whether each identified speaker is a store clerk or a customer. To do so, the signal selection unit 301 identifies store clerks using, for example, store clerk uniform information or store clerk face information stored in the signal selection unit 301 or in a storage unit (not shown). The signal selection unit 301 then treats the remaining speakers as customers. Because the signal selection unit 301 identifies store clerks using a general person detection technique or the like, a detailed description is omitted in this embodiment.

そして、信号選択部３０１は、第３入力信号から店員と判別した話者の位置を推定する。また、信号選択部３０１は、第３入力信号から客と判別した話者の位置を推定する。 The signal selection unit 301 then estimates, from the third input signal, the position of the speaker determined to be a store clerk. The signal selection unit 301 also estimates, from the third input signal, the position of the speaker determined to be a customer.

そして、信号選択部３０１は、推定した客の位置に基づいて第１入力信号から１または複数の音響信号を選択する。また、信号選択部３０１は、推定した店員の位置に基づいて、第２入力信号から１または複数の音響信号を選択する。 Then, the signal selection unit 301 selects one or a plurality of acoustic signals from the first input signal based on the estimated customer position. Further, the signal selection unit 301 selects one or a plurality of acoustic signals from the second input signal based on the estimated position of the store clerk.

以下、信号選択部３０１が、第１入力信号から選択した１または複数の音響信号を第１選択信号と呼び、第２入力信号から選択した１または複数の音響信号を第２選択信号と呼ぶ。 Hereinafter, the one or more acoustic signals selected from the first input signal by the signal selection unit 301 are referred to as first selection signals, and the one or more acoustic signals selected from the second input signals are referred to as second selection signals.

信号選択部３０１による第１選択信号の選択方法は、例えば、複数のマイクロホン（ｍ１〜ｍｎ）の位置を表す位置情報をあらかじめ保持しておき、該位置情報に基づいて、推定した客の位置に近いマイクロホンから出力された所定数の音響信号を選択する方法であってもよい。 The signal selection unit 301 may select the first selection signal by, for example, holding in advance position information indicating the positions of the plurality of microphones (m1 to mn) and, based on that position information, selecting a predetermined number of acoustic signals output from the microphones closest to the estimated customer position.

同様に、信号選択部３０１による第２選択信号の選択方法は、例えば、複数のマイクロホン（ｍ１〜ｍｎ）の位置を表す位置情報をあらかじめ保持しておき、該位置情報に基づいて、推定した店員の位置に近いマイクロホンから出力された所定数の音響信号を選択する方法であってもよい。 Similarly, the signal selection unit 301 may select the second selection signal by, for example, holding in advance position information indicating the positions of the plurality of microphones (m1 to mn) and, based on that position information, selecting a predetermined number of acoustic signals output from the microphones closest to the estimated store clerk position.
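
The microphone selection described above can be sketched as follows. This is only an illustrative sketch, not the method disclosed in this publication: the function name `select_nearest_microphones`, the two-dimensional coordinates, and the example positions are all assumptions introduced here.

```python
import math


def select_nearest_microphones(mic_positions, speaker_position, count):
    """Return the indices of the `count` microphones closest to the speaker.

    mic_positions: list of (x, y) tuples held in advance (hypothetical layout).
    speaker_position: (x, y) tuple estimated from the third input signal.
    """
    distances = [
        (math.dist(position, speaker_position), index)
        for index, position in enumerate(mic_positions)
    ]
    distances.sort()  # nearest microphone first
    return [index for _, index in distances[:count]]


# Hypothetical layout: four microphones in a row, one metre apart.
mic_positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
customer_position = (2.2, 0.5)  # assumed estimate for the customer
clerk_position = (0.3, 0.5)     # assumed estimate for the store clerk

# Channels that would make up the first and second selection signals.
first_selection = select_nearest_microphones(mic_positions, customer_position, 2)
second_selection = select_nearest_microphones(mic_positions, clerk_position, 2)
```

With this layout, the customer-side selection picks microphones 2 and 3 and the clerk-side selection picks microphones 0 and 1. The "predetermined number" in the text corresponds to `count`.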

信号選択部３０１は、第１選択信号を第１音声信号生成部２０１に出力する。また、信号選択部３０１は、第２選択信号を第２音声信号生成部２０２に出力する。 The signal selection unit 301 outputs the first selection signal to the first audio signal generation unit 201. Further, the signal selection unit 301 outputs the second selection signal to the second audio signal generation unit 202.

以上、第３入力信号として１台のカメラからの映像信号を用いる例について説明したが、第３入力信号は、複数のカメラの夫々から出力された映像信号であってもよい。例えば、店員の位置を推定するために用いる映像信号と、客の位置を推定するために用いる映像信号とは異なる信号であってもよい。また、第３入力信号は、店員や客の位置を推定可能な信号であれば、カメラからの映像信号に限らない。たとえば、音声処理装置３００は、第３入力信号として、無線タグなどを利用した位置推定システムからの信号を用いてもよい。音声処理装置３００は、第３入力信号として、位置推定システムからの信号とカメラからの映像信号とを併用してもよい。 The example in which the video signal from one camera is used as the third input signal has been described above, but the third input signal may be a video signal output from each of a plurality of cameras. For example, the video signal used for estimating the position of the store clerk may be different from the video signal used for estimating the position of the customer. Further, the third input signal is not limited to the video signal from the camera as long as the position of the store clerk or the customer can be estimated. For example, the audio processing device 300 may use a signal from a position estimation system using a wireless tag or the like as the third input signal. The audio processing device 300 may use a signal from the position estimation system and a video signal from the camera in combination as the third input signal.

本実施形態に係る音声処理装置３００における音声信号生成部２１０の第１音声信号生成部２０１は、信号選択部３０１から出力された第１選択信号を入力とし、上述した第３の実施形態に係る音声処理装置２００の第１音声信号生成部２０１と同様の処理を行う。また、本実施形態に係る音声処理装置３００における音声信号生成部２１０の第２音声信号生成部２０２は、第２選択信号を入力とし、上述した第３の実施形態に係る音声処理装置２００の第２音声信号生成部２０２と同様の処理を行う。 The first audio signal generation unit 201 of the audio signal generation unit 210 in the speech processing apparatus 300 according to the present embodiment takes as input the first selection signal output from the signal selection unit 301 and performs the same processing as the first audio signal generation unit 201 of the speech processing apparatus 200 according to the third embodiment described above. Likewise, the second audio signal generation unit 202 of the audio signal generation unit 210 in the speech processing apparatus 300 according to the present embodiment takes as input the second selection signal and performs the same processing as the second audio signal generation unit 202 of the speech processing apparatus 200 according to the third embodiment described above.

また、発話区間検出部１０１および発話区間加工部１０２は、上述した第３の実施形態における発話区間検出部１０１および発話区間加工部１０２と夫々同様の処理を行う。 Further, the utterance section detection unit 101 and the utterance section processing unit 102 perform the same processes as the utterance section detection unit 101 and the utterance section processing unit 102 in the third embodiment described above, respectively.

次に、図１０を用いて、本実施形態に係る音声処理装置３００の動作について説明する。図１０は、本実施形態に係る音声処理装置３００の動作の一例を示すフローチャートである。 Next, the operation of the speech processing apparatus 300 according to the present embodiment will be described using FIG. 10. FIG. 10 is a flowchart showing an example of the operation of the speech processing apparatus 300 according to this embodiment.

図１０に示す通り、まず、信号選択部３０１が、第１入力信号から第１選択信号を選択する（ステップＳ３０１）。また、信号選択部３０１が第２入力信号から第２選択信号を選択する（ステップＳ３０２）。なお、ステップＳ３０１とステップＳ３０２とは同時に行われてもよいし、逆順で行われてもよい。 As shown in FIG. 10, first, the signal selection unit 301 selects a first selection signal from the first input signal (step S301). Further, the signal selection unit 301 selects the second selection signal from the second input signal (step S302). Note that step S301 and step S302 may be performed simultaneously or in reverse order.

次に、音声信号生成部２１０の第１音声信号生成部２０１が、第１選択信号から第１音声信号を生成する（ステップＳ３０３）。また、音声信号生成部２１０の第２音声信号生成部２０２が第２選択信号から第２音声信号を生成する（ステップＳ３０４）。なお、ステップＳ３０３とステップＳ３０４とは同時に行われてもよいし、逆順で行われてもよい。 Next, the first audio signal generation unit 201 of the audio signal generation unit 210 generates a first audio signal from the first selection signal (step S303). Also, the second audio signal generation unit 202 of the audio signal generation unit 210 generates a second audio signal from the second selection signal (step S304). Note that step S303 and step S304 may be performed simultaneously or in the reverse order.

その後、発話区間検出部１０１が、第１音声信号から客による発話の発話区間を検出する（ステップＳ３０５）。そして、発話区間検出部１０１は、検出した発話区間の開始時刻を表す始端時刻情報と検出した発話区間の終了時刻を表す終端時刻情報とを発話区間の時間情報として、特定する（ステップＳ３０６）。 Thereafter, the utterance section detection unit 101 detects the utterance section of the utterance by the customer from the first voice signal (step S305). Then, the utterance section detecting unit 101 specifies the start time information indicating the start time of the detected utterance section and the end time information indicating the end time of the detected utterance section as time information of the utterance section (step S306).

次に、発話区間加工部１０２が、第２音声信号において、ステップＳ３０６で特定した時間情報に対応する時間区間を特定する（ステップＳ３０７）。そして、発話区間加工部１０２は、第２音声信号における、特定した時間区間の信号を加工する（ステップＳ３０８）。 Next, the speech segment processing unit 102 identifies a time segment corresponding to the time information identified in step S306 in the second audio signal (step S307). Then, the utterance section processing unit 102 processes the signal of the specified time section in the second audio signal (step S308).

以上により、音声処理装置３００は処理を終了する。 Thus, the voice processing device 300 ends the process.
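
Steps S305 to S308 above can be sketched as follows. This is a minimal illustration under stated assumptions: the frame-energy detector stands in for whatever utterance section detection the apparatus uses, and muting (zeroing) stands in for the unspecified processing of the time section. The function names, frame length, threshold, and toy signals are hypothetical.

```python
def detect_utterance_sections(signal, sample_rate, frame_len, threshold):
    """Return (start_time, end_time) pairs in seconds (steps S305 and S306).

    Assumption: a frame counts as speech when its mean energy exceeds
    `threshold`; a real detector would be more elaborate.
    """
    sections = []
    start = None
    n_frames = len(signal) // frame_len
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold and start is None:
            start = i * frame_len          # utterance begins at this frame
        elif energy <= threshold and start is not None:
            sections.append((start / sample_rate, i * frame_len / sample_rate))
            start = None
    if start is not None:                  # utterance runs to the last frame
        sections.append((start / sample_rate, n_frames * frame_len / sample_rate))
    return sections


def process_sections(signal, sample_rate, sections):
    """Mute the samples of `signal` inside each time section (steps S307 and S308)."""
    processed = list(signal)
    for start_time, end_time in sections:
        begin = int(round(start_time * sample_rate))
        end = min(int(round(end_time * sample_rate)), len(processed))
        for n in range(begin, end):
            processed[n] = 0.0             # assumed processing: silence the section
    return processed


# Toy signals at 8 samples per second: the customer speaks from 1.0 s to 2.0 s.
sample_rate = 8
first_audio_signal = [0.0] * 8 + [1.0] * 8 + [0.0] * 8
second_audio_signal = [0.5] * 24

sections = detect_utterance_sections(first_audio_signal, sample_rate, 4, 0.1)
processed = process_sections(second_audio_signal, sample_rate, sections)
```

The time information is detected on the first audio signal but applied to the second audio signal, which is the point of the flow: the customer's utterance section determines which part of the clerk-side signal is processed.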

本実施形態に係る音声処理装置３００によれば、上述した第１の実施形態における効果と同様の効果を奏することができる。また、本実施形態に係る音声処理装置３００は、客や店員の位置が変動しても、精度よく、客の音声を含む区間の信号を加工することができる。なぜならば、信号選択部３０１が第３入力信号に基づいて、第１入力信号から１以上の音響信号を選択し、第３入力信号に基づいて、第２入力信号から１以上の音響信号を選択するからである。これにより、客や店員の位置が移動した場合であっても、移動した位置に応じて信号選択部３０１が客の音声に対応する信号を主に含む音響信号を第１入力信号から選択し、店員の音声に対応する信号を主に含む音響信号を第２入力信号から選択することができる。そして、音声信号生成部２１０は、選択された音響信号から第１音響信号および第２音響信号を生成することができる。 The speech processing apparatus 300 according to the present embodiment provides the same effects as the first embodiment described above. In addition, the speech processing apparatus 300 according to the present embodiment can accurately process the signal of a section containing the customer's voice even when the positions of the customer and the store clerk change. This is because the signal selection unit 301 selects one or more acoustic signals from the first input signal based on the third input signal, and selects one or more acoustic signals from the second input signal based on the third input signal. As a result, even when the customer or the store clerk moves, the signal selection unit 301 can select, according to the new position, an acoustic signal that mainly contains the customer's voice from the first input signal and an acoustic signal that mainly contains the store clerk's voice from the second input signal. The audio signal generation unit 210 can then generate the first acoustic signal and the second acoustic signal from the selected acoustic signals.

また、信号選択部３０１が第１入力信号および第２入力信号から音響信号を選択するため、音声信号生成部２１０で処理する音響信号の数を減らすことができる。したがって、本実施形態に係る音声処理装置３００は、音声処理装置２００よりも、処理量を削減することが可能となる。 In addition, since the signal selection unit 301 selects an acoustic signal from the first input signal and the second input signal, the number of acoustic signals processed by the audio signal generation unit 210 can be reduced. Therefore, the speech processing apparatus 300 according to the present embodiment can reduce the processing amount as compared with the speech processing apparatus 200.

以上では、支払カウンタでの店員と客の会話をＰＯＳレジスタに設置したマイクロホンで集音する場面を想定して説明した。しかしながら、マイクロホンの設置場所はこれに限定されるものではない。例えば、店舗フロア内の任意の場所での店員と客との会話を、店舗内の天井あるいは棚等の様々な場所に設置した複数のマイクロホンで集音する場面も考えられる。そのような場合には、本実施形態における第３の入力信号として、フロア内の異なる箇所に設置された複数のカメラからの映像信号を用いるのが望ましい。この場合、信号選択部３０１は、各カメラの映像信号の画像から客および店員の検出と位置の推定を行う。そして、信号選択部３０１は、第１入力信号のうち、推定した客の位置から所定の距離以内に配置されたマイクロホンから出力された音響信号を、第１選択信号として選択してもよい。同様に、信号選択部３０１は、第２入力信号のうち、推定した店員の位置から所定の距離以内に配置されたマイクロホンから出力された音響信号を、第２選択信号として選択してもよい。また、信号選択部３０１は、検出した客のうち該客に対して推定した位置が、検出した店員に対して推定した位置から所定の距離以上離れている場合、第１選択信号として選択する対象の音響信号から除外してもよい。信号選択部３０１が店員を複数検出した場合、信号選択部３０１は、検出した店員ごとに第２選択信号を選択してもよい。この場合、音声信号生成部２１０および発話区間検出部１０１の処理も信号選択部３０１で検出した店員ごとに行い、発話区間加工部１０２は、店員ごとに発話区間加工信号を生成することができる。 The description above assumed a scene in which a conversation between a store clerk and a customer at a payment counter is picked up by microphones installed at the POS register. However, the installation location of the microphones is not limited to this. For example, a conversation between a store clerk and a customer at an arbitrary location on the store floor may be picked up by a plurality of microphones installed at various places in the store, such as the ceiling or shelves. In such a case, it is desirable to use, as the third input signal in the present embodiment, video signals from a plurality of cameras installed at different locations on the floor. The signal selection unit 301 then detects customers and store clerks and estimates their positions from the images of each camera's video signal. The signal selection unit 301 may select, as the first selection signal, the acoustic signals in the first input signal that were output from microphones located within a predetermined distance of the estimated customer position. Similarly, the signal selection unit 301 may select, as the second selection signal, the acoustic signals in the second input signal that were output from microphones located within a predetermined distance of the estimated store clerk position. Further, when the position estimated for a detected customer is a predetermined distance or more away from the position estimated for a detected store clerk, the signal selection unit 301 may exclude the acoustic signals associated with that customer from the candidates for the first selection signal. When the signal selection unit 301 detects a plurality of store clerks, it may select a second selection signal for each detected store clerk. In this case, the processing of the audio signal generation unit 210 and the utterance section detection unit 101 is also performed for each store clerk detected by the signal selection unit 301, and the utterance section processing unit 102 can generate an utterance section processing signal for each store clerk.

（変形例） (Modification)
本実施形態では、第３入力信号を、第１入力信号および第２入力信号とは異なる信号として説明したが、第３入力信号は、第１入力信号および第２入力信号と同様の信号であってもよい。 In the present embodiment, the third input signal has been described as a signal different from the first input signal and the second input signal. However, the third input signal may be the same kind of signal as the first input signal and the second input signal.

つまり、第１入力信号および第２入力信号を第３入力信号として用いてもよい。例えば、信号選択部３０１は、第１入力信号に含まれる１または複数の音響信号の夫々に対応する音の音量を比較し、音量が大きい音に対応する音響信号を所定数選択し、第１選択信号としてもよい。また、信号選択部３０１は、音量を比較する際に、第１入力信号に含まれる１または複数の音響信号の夫々に対し、雑音抑圧技術、雑音抑圧技術などを適用し、非目的音声以外の音声、および、雑音を除去してもよい。また、信号選択部３０１は、第１入力信号に含まれる１または複数の音響信号の夫々に対し、音源を分離して、抽出したい音声のみを抽出する音源分離技術を適用してもよい。これにより、信号選択部３０１は、非目的音声に対応する音響信号から、第１選択信号を選択することができる。同様に、信号選択部３０１は、第２入力信号に含まれる１または複数の音響信号の夫々に対応する音の音量を比較し、音量が大きい音に対応する音響信号を所定数選択し、第２選択信号としてもよい。 That is, the first input signal and the second input signal may be used as the third input signal. For example, the signal selection unit 301 may compare the volume of the sound corresponding to each of the one or more acoustic signals included in the first input signal, select a predetermined number of acoustic signals corresponding to the louder sounds, and use them as the first selection signal. When comparing the volume, the signal selection unit 301 may apply a noise suppression technique or the like to each of the one or more acoustic signals included in the first input signal to remove noise and any speech other than the non-target speech. The signal selection unit 301 may also apply, to each of the one or more acoustic signals included in the first input signal, a sound source separation technique that separates the sound sources and extracts only the desired speech. This allows the signal selection unit 301 to select the first selection signal from the acoustic signals corresponding to the non-target speech. Similarly, the signal selection unit 301 may compare the volume of the sound corresponding to each of the one or more acoustic signals included in the second input signal, select a predetermined number of acoustic signals corresponding to the louder sounds, and use them as the second selection signal.
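
The volume-based selection in this modification might look like the sketch below. It assumes root-mean-square (RMS) amplitude as the volume measure, which the text does not specify; the function names and the toy channels are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square amplitude, used here as an assumed measure of volume."""
    if not signal:
        return 0.0
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def select_loudest(signals, count):
    """Return the indices of the `count` loudest channels, in channel order."""
    ranked = sorted(range(len(signals)), key=lambda i: rms(signals[i]), reverse=True)
    return sorted(ranked[:count])


# Three toy channels from the first input signal with different levels.
first_input_signal = [
    [0.1, -0.1, 0.1, -0.1],
    [0.8, -0.8, 0.8, -0.8],
    [0.4, -0.4, 0.4, -0.4],
]
first_selection = select_loudest(first_input_signal, 2)
```

Any noise suppression or sound source separation mentioned in the text would be applied to each channel before `rms` is computed; that step is left out of this sketch.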

＜第５の実施形態＞ <Fifth Embodiment>
本発明の第５の実施形態について、図面を参照して説明する。図１１は、本実施形態に係る音声処理装置４００の機能構成の一例を示す機能ブロック図である。なお、説明の便宜上、前述した各実施形態で説明した図面に含まれる部材と同じ機能を有する部材については、同じ符号を付し、その詳細な説明を省略する。図１１に示す通り、本実施形態に係る音声処理装置４００は、発話区間検出部１０１と、発話区間加工部１０２と、音声信号生成部４１０と、位置推定部（推定部）４２０とを備える。また、音声信号生成部４１０は、第１音声信号生成部４０１と、第２音声信号生成部４０２とを備える。また、位置推定部４２０は、第１話者位置推定部４０３と、第２話者位置推定部４０４とを備える。 A fifth embodiment of the present invention will be described with reference to the drawings. FIG. 11 is a functional block diagram illustrating an example of the functional configuration of the speech processing apparatus 400 according to the present embodiment. For convenience of explanation, members having the same functions as members shown in the drawings of the preceding embodiments are given the same reference numerals, and their detailed description is omitted. As shown in FIG. 11, the speech processing apparatus 400 according to the present embodiment includes an utterance section detection unit 101, an utterance section processing unit 102, an audio signal generation unit 410, and a position estimation unit (estimation unit) 420. The audio signal generation unit 410 includes a first audio signal generation unit 401 and a second audio signal generation unit 402. The position estimation unit 420 includes a first speaker position estimation unit 403 and a second speaker position estimation unit 404.

本実施形態に係る音声処理装置４００は、第４の実施形態に係る音声処理装置３００の信号選択部３０１に代え、位置推定部４２０を備え、音声信号生成部２１０に代え、音声信号生成部４１０を備える構成である。なお、本実施形態に係る音声処理装置４００に入力される第１入力信号、第２入力信号および第３入力信号は、第４の実施形態と同様のものであるとする。 The speech processing apparatus 400 according to the present embodiment includes a position estimation unit 420 in place of the signal selection unit 301 of the speech processing apparatus 300 according to the fourth embodiment, and an audio signal generation unit 410 in place of the audio signal generation unit 210. The first input signal, the second input signal, and the third input signal that are input to the speech processing apparatus 400 according to the present embodiment are assumed to be the same as those in the fourth embodiment.

位置推定部４２０の第１話者位置推定部４０３は、第１入力信号および／または第３入力信号を用いて、客の位置を推定する。第１入力信号または第３入力信号を用いて、第１話者位置推定部４０３が行う客の位置の推定方法は、上述した信号選択部３０１が行う客の位置の推定方法と同様の方法であるため、詳細な説明を省略する。また、第１話者位置推定部４０３が第１入力信号を用いて客の位置を推定する場合において、第１入力信号に複数の音響信号が含まれる場合、第１話者位置推定部４０３は、一般的な音源を推定する技術を適用して、音源を特定し、客の位置を推定してもよい。また、第１話者位置推定部４０３が第１入力信号および第３入力信号を用いて客の位置を推定する場合、第１入力信号から音源の候補を特定し、第３入力信号を用いて、音源の候補から客を選択し、選択した客の位置を推定してもよい。 The first speaker position estimation unit 403 of the position estimation unit 420 estimates the position of the customer using the first input signal and/or the third input signal. The method by which the first speaker position estimation unit 403 estimates the customer position using the first input signal or the third input signal is the same as the method used by the signal selection unit 301 described above, so a detailed description is omitted. When the first speaker position estimation unit 403 estimates the customer position using the first input signal and the first input signal includes a plurality of acoustic signals, the first speaker position estimation unit 403 may apply a general sound source estimation technique to identify the sound source and estimate the customer position. When the first speaker position estimation unit 403 estimates the customer position using both the first input signal and the third input signal, it may identify sound source candidates from the first input signal, use the third input signal to select the customer from among those candidates, and estimate the position of the selected customer.

第１話者位置推定部４０３は推定した客の位置（以下、第１話者位置と呼ぶ）を示す第１位置情報を、第１音声信号生成部４０１に出力する。この第１位置情報は、例えば、第３入力信号を用いて推定された場合、第３入力信号が表す映像上の位置を示す情報（例えば、ｘｙ座標）であってもよいし、映像上の位置を実世界上の位置に変換した位置を示す情報（例えば、緯度および経度等）であってもよい。また、第１位置情報は、店舗内に設置されたマイクロホンからの相対位置を示す情報であってもよい。この場合、マイクロホンの位置を示す情報が、第１話者位置推定部４０３内の記憶部に格納されていればよい。 The first speaker position estimation unit 403 outputs first position information indicating the estimated customer position (hereinafter referred to as the first speaker position) to the first audio signal generation unit 401. When the position is estimated using the third input signal, for example, the first position information may be information indicating a position in the video represented by the third input signal (for example, xy coordinates), or information indicating that position converted into a real-world position (for example, latitude and longitude). The first position information may also be information indicating a position relative to a microphone installed in the store. In this case, information indicating the position of the microphone need only be stored in a storage unit in the first speaker position estimation unit 403.

位置推定部４２０の第２話者位置推定部４０４は、第２入力信号および／または第３入力信号を用いて、店員の位置を推定する。第２話者位置推定部４０４が行う店員の位置の推定方法は、第１話者位置推定部４０３が客の位置を推定する際に用いた方法と同様である。第２話者位置推定部４０４は推定した店員の位置（以下、第２話者位置と呼ぶ）を示す第２位置情報を、第２音声信号生成部４０２に出力する。 The second speaker position estimation unit 404 of the position estimation unit 420 estimates the position of the store clerk using the second input signal and / or the third input signal. The method of estimating the position of the clerk performed by the second speaker position estimating unit 404 is the same as the method used when the first speaker position estimating unit 403 estimates the position of the customer. The second speaker position estimation unit 404 outputs second position information indicating the estimated position of the store clerk (hereinafter referred to as a second speaker position) to the second audio signal generation unit 402.

なお、第１話者位置推定部４０３と、第２話者位置推定部４０４とは、同じ推定方法で客および店員の位置を推定してもよいし、異なる推定方法で客および店員の位置を推定してもよい。 Note that the first speaker position estimation unit 403 and the second speaker position estimation unit 404 may estimate the position of the customer and the clerk by the same estimation method, or the position of the customer and the clerk by different estimation methods. It may be estimated.

音声信号生成部４１０の第１音声信号生成部４０１は、第１入力信号と第１位置情報とを入力とする。第１音声信号生成部４０１は、第１入力信号に対し、ビームフォーミング等の音声強調技術を適用し、客の音声を強調した音響信号である第１音声信号を生成する。第１音声信号生成部４０１は、ビームフォーミングを行う際に、形成するビームの方向を、第１位置情報に基づいて決定する。 The first audio signal generation unit 401 of the audio signal generation unit 410 receives the first input signal and the first position information. The first audio signal generation unit 401 applies an audio enhancement technique such as beam forming to the first input signal, and generates a first audio signal that is an acoustic signal in which the customer's voice is emphasized. The first audio signal generation unit 401 determines the direction of the beam to be formed based on the first position information when performing beamforming.

そして、第１音声信号生成部４０１は、生成した第１音声信号を発話区間検出部１０１に出力する。 Then, the first audio signal generation unit 401 outputs the generated first audio signal to the utterance section detection unit 101.

音声信号生成部４１０の第２音声信号生成部４０２は、第２入力信号と第２位置情報とを入力とする。第２音声信号生成部４０２は、第２入力信号に対し、ビームフォーミング等の音声強調技術を適用し、店員の音声を強調した音響信号である第２音声信号を生成する。第２音声信号生成部４０２は、ビームフォーミングを行う際に、形成するビームの方向を、第２位置情報に基づいて決定する。 The second audio signal generation unit 402 of the audio signal generation unit 410 receives the second input signal and the second position information. The second sound signal generation unit 402 applies a sound enhancement technique such as beam forming to the second input signal, and generates a second sound signal that is an acoustic signal in which the clerk's sound is emphasized. The second audio signal generation unit 402 determines the direction of the beam to be formed based on the second position information when performing beamforming.

そして、第２音声信号生成部４０２は、生成した第２音声信号を発話区間加工部１０２に出力する。 Then, the second voice signal generation unit 402 outputs the generated second voice signal to the utterance section processing unit 102.
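
As one way to picture how position information can determine the beam direction, the following sketch implements a basic delay-and-sum beamformer. The text only says "beamforming or the like", so delay-and-sum is an assumption chosen for simplicity, as are the whole-sample delays, the speed of sound constant, the function names, and the example geometry.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value


def steering_delays(mic_positions, speaker_position, sample_rate):
    """Delay of each microphone, in whole samples, relative to the nearest one."""
    distances = [math.dist(p, speaker_position) for p in mic_positions]
    nearest = min(distances)
    return [round((d - nearest) / SPEED_OF_SOUND * sample_rate) for d in distances]


def delay_and_sum(signals, delays):
    """Advance each channel by its delay so the speaker's sound aligns, then average."""
    length = min(len(s) - d for s, d in zip(signals, delays))
    return [
        sum(s[n + d] for s, d in zip(signals, delays)) / len(signals)
        for n in range(length)
    ]


# Hypothetical geometry: the speaker is 1 m from microphone 0 and 4.43 m from
# microphone 1, so the sound reaches microphone 1 about 10 ms (10 samples) later.
sample_rate = 1000
mic_positions = [(0.0, 0.0), (3.43, 0.0)]
speaker_position = (-1.0, 0.0)

channel_0 = [0.0] * 30
channel_1 = [0.0] * 30
channel_0[5] = 1.0    # impulse from the speaker arrives at sample 5
channel_1[15] = 1.0   # the same impulse arrives 10 samples later

delays = steering_delays(mic_positions, speaker_position, sample_rate)
enhanced = delay_and_sum([channel_0, channel_1], delays)
```

Sound from the estimated position adds coherently after alignment, while sound from other positions is averaged out of phase, which is why the speaker's voice is emphasized in the output.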

次に、図１２を用いて、本実施形態に係る音声処理装置４００の動作について説明する。図１２は、本実施形態に係る音声処理装置４００の動作の一例を示すフローチャートである。 Next, the operation of the speech processing apparatus 400 according to the present embodiment will be described using FIG. 12. FIG. 12 is a flowchart showing an example of the operation of the speech processing apparatus 400 according to this embodiment.

図１２に示す通り、まず、位置推定部４２０が第１話者位置および第２話者位置を推定する。具体的には、位置推定部４２０の第１話者位置推定部４０３が、第３入力信号および／または第１入力信号から第１話者位置を推定し、第２話者位置推定部４０４が第３入力信号および／または第２入力信号から第２話者位置を推定する（ステップＳ４０１）。 As shown in FIG. 12, first, the position estimation unit 420 estimates the first speaker position and the second speaker position. Specifically, the first speaker position estimation unit 403 of the position estimation unit 420 estimates the first speaker position from the third input signal and / or the first input signal, and the second speaker position estimation unit 404 A second speaker position is estimated from the third input signal and / or the second input signal (step S401).

次に、音声信号生成部４１０の第１音声信号生成部４０１が、第１話者位置に基づいて、第１入力信号から第１音声信号を生成する（ステップＳ４０２）。また、音声信号生成部４１０の第２音声信号生成部４０２が第２話者位置に基づいて、第２入力信号から第２音声信号を生成する（ステップＳ４０３）。なお、ステップＳ４０２とステップＳ４０３とは同時に行われてもよいし、逆順で行われてもよい。 Next, the first audio signal generation unit 401 of the audio signal generation unit 410 generates a first audio signal from the first input signal based on the first speaker position (step S402). Further, the second audio signal generation unit 402 of the audio signal generation unit 410 generates a second audio signal from the second input signal based on the second speaker position (step S403). Note that step S402 and step S403 may be performed simultaneously or in reverse order.

その後、発話区間検出部１０１が、第１音声信号から客による発話の発話区間を検出する（ステップＳ４０４）。そして、発話区間検出部１０１は、検出した発話区間の開始時刻を表す始端時刻情報と検出した発話区間の終了時刻を表す終端時刻情報とを発話区間の時間情報として、特定する（ステップＳ４０５）。 Thereafter, the utterance section detecting unit 101 detects the utterance section of the utterance by the customer from the first voice signal (step S404). Then, the utterance section detection unit 101 specifies the start time information indicating the start time of the detected utterance section and the end time information indicating the end time of the detected utterance section as time information of the utterance section (step S405).

次に、発話区間加工部１０２が、第２音声信号において、ステップＳ４０５で特定した時間情報に対応する時間区間を特定する（ステップＳ４０６）。そして、発話区間加工部１０２は、第２音声信号における、特定した時間区間の信号を加工する（ステップＳ４０７）。 Next, the utterance section processing unit 102 specifies a time section corresponding to the time information specified in step S405 in the second audio signal (step S406). Then, the utterance section processing unit 102 processes the signal of the specified time section in the second audio signal (step S407).

以上により、音声処理装置４００は処理を終了する。 Thus, the voice processing device 400 ends the process.

なお、本実施形態に係る音声処理装置４００は、信号選択部３０１の代わりに位置推定部４２０を備える構成について説明したが、音声処理装置４００は、信号選択部３０１と位置推定部４２０とを備えていてもよい。 Although the speech processing apparatus 400 according to the present embodiment has been described as including the position estimation unit 420 in place of the signal selection unit 301, the speech processing apparatus 400 may include both the signal selection unit 301 and the position estimation unit 420.

以上のように、本実施形態に係る音声処理装置４００によれば、位置推定部４２０が客の位置および店員の位置を推定し、音声信号生成部４１０が、推定された客の位置および店員の位置に基づいて、第１音声信号および第２音声信号を生成する。このように生成された第１音声信号および第２音声信号は、客または店員の音声がより強調された信号となる。これにより、発話区間加工部１０２は、客の音声が含まれる期間を、より精度よく加工することができる。 As described above, according to the voice processing device 400 according to the present embodiment, the position estimation unit 420 estimates the position of the customer and the salesclerk, and the voice signal generation unit 410 determines the estimated customer position and the salesclerk's position. Based on the position, a first audio signal and a second audio signal are generated. The first audio signal and the second audio signal thus generated are signals in which the voice of the customer or the store clerk is further emphasized. As a result, the utterance section processing unit 102 can process the period in which the customer's voice is included more accurately.

＜第６の実施形態＞ <Sixth Embodiment>
本発明の第６の実施形態について、図面を参照して説明する。本実施形態では、目的音を特定の話者の音声であるとし、以下ではその話者、音声をそれぞれ「目的話者」、「目的音声」と呼ぶ。なお、目的話者は複数の話者を含む場合もある。 A sixth embodiment of the present invention will be described with reference to the drawings. In this embodiment, the target sound is assumed to be the voice of a specific speaker, and that speaker and voice are hereinafter referred to as the "target speaker" and the "target speech", respectively. Note that the target speaker may include a plurality of speakers.

図１３は、本実施形態に係る音声処理装置５００の機能構成の一例を示す機能ブロック図である。なお、説明の便宜上、前述した各実施形態で説明した図面に含まれる部材と同じ機能を有する部材については、同じ符号を付し、その詳細な説明を省略する。 FIG. 13 is a functional block diagram illustrating an example of a functional configuration of the voice processing device 500 according to the present embodiment. For convenience of explanation, members having the same functions as those included in the drawings described in the above-described embodiments are given the same reference numerals, and detailed descriptions thereof are omitted.

本実施形態に係る音声処理装置５００は、音声信号生成部２１０と発話区間検出部１０１と目的話者発話区間検出部５０１と、発話区間加工部５０２と、記憶部５０３とを、備えている。 The speech processing apparatus 500 according to the present embodiment includes a speech signal generation unit 210, a speech segment detection unit 101, a target speaker speech segment detection unit 501, a speech segment processing unit 502, and a storage unit 503.

本実施形態では、音声信号生成部２１０の第２音声信号生成部２０２が生成した第２音声信号が、目的話者発話区間検出部５０１と発話区間加工部５０２とに入力される。 In the present embodiment, the second audio signal generated by the second audio signal generation unit 202 of the audio signal generation unit 210 is input to the target speaker utterance interval detection unit 501 and the utterance interval processing unit 502.

（記憶部５０３） (Storage unit 503)
記憶部５０３は、目的話者の音声の音響信号の特徴量を格納している。特徴量は、例えば、混合ガウス分布モデル（ＧＭＭ：ＧａｕｓｓｉａｎＭｉｘｔｕｒｅＭｏｄｅｌ）で表現されるものであってもよいし、その他のモデルで表現されるものであってもよい。例えば、特徴量は、一般的な話者認識技術で用いられている特徴量等であればよい。 The storage unit 503 stores feature amounts of the acoustic signal of the target speaker's voice. The feature amounts may be represented by, for example, a Gaussian mixture model (GMM), or by some other model. For example, the feature amounts may be those used in general speaker recognition techniques.

また、例えば、目的話者が店舗の店員の場合、記憶部５０３は、店舗内の店員全員の音声の音響信号の特徴量を格納する。 Further, for example, when the target speaker is a store clerk, the storage unit 503 stores the feature amount of the sound signal of the voice of all the store clerk in the store.

(Target speaker utterance section detection unit 501)
The target speaker utterance section detection unit 501 receives the second speech signal from the second speech signal generation unit 202 and detects the utterance sections of the target speaker in the received signal. The second speech signal is the same as the second speech signal in the second embodiment, so it contains the speech of one or more speakers including the target speaker.

Specifically, the target speaker utterance section detection unit 501 converts the second speech signal into a time series of feature amounts and collates the converted feature amounts (referred to as input feature amounts) with the feature amounts stored in the storage unit 503. The target speaker utterance section detection unit 501 may perform this collation at fixed intervals, for example every 10 milliseconds. Alternatively, it may apply a voice detection technique to the second speech signal to detect utterance sections, collate feature amounts for each detected utterance section, and determine whether each utterance section is an utterance of the target speaker.
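The per-section alternative described above can be sketched as follows. This is only an illustration under stated assumptions: the per-frame scores are taken as already computed, a section is decided by averaging its frame scores, and the names `classify_sections`, `frame_scores`, and `threshold` are hypothetical. The text does not specify how a section-level decision is formed.

```python
def classify_sections(frame_scores, sections, threshold):
    """Decide, for each detected utterance section, whether it is the target speaker.

    frame_scores: one collation score per frame (assumed precomputed).
    sections:     (first_frame, end_frame) index pairs, end exclusive, non-empty.
    Averaging the scores is an assumption made for this sketch.
    """
    decisions = []
    for first, end in sections:
        mean_score = sum(frame_scores[first:end]) / (end - first)
        decisions.append(mean_score > threshold)
    return decisions


scores = [-1.0, -1.0, -2.0, -10.0, -12.0, -1.0]
print(classify_sections(scores, [(0, 3), (3, 5)], threshold=-5.0))
```

Deciding once per section rather than once per frame trades time resolution for a more stable decision, since single-frame outliers are averaged out.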

For example, the target speaker utterance section detection unit 501 collates the input feature amounts with the feature amounts stored in the storage unit 503 and obtains the likelihood of the stored feature amounts with respect to each input feature amount. It then compares the likelihood with a predetermined threshold and determines that a section of the second speech signal whose input feature amounts yield a likelihood higher than the threshold is a section in which the target speaker spoke.
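A minimal sketch of the likelihood comparison, assuming the stored feature amounts take the form of a diagonal-covariance GMM (one of the options the text mentions) and that the comparison is made on log-likelihoods. The function names, the toy model, and the threshold value are hypothetical and chosen only for illustration.

```python
import math


def diag_gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector x under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # log-sum-exp over the mixture components for numerical stability
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def is_target_frame(x, gmm, threshold):
    """True when the frame scores above the (assumed) log-likelihood threshold."""
    weights, means, variances = gmm
    return diag_gmm_log_likelihood(x, weights, means, variances) > threshold


# Toy one-component, one-dimensional model standing in for a stored speaker model.
TOY_GMM = ([1.0], [[0.0]], [[1.0]])
flags = [is_target_frame([v], TOY_GMM, threshold=-5.0) for v in (0.0, 0.5, 5.0)]
print(flags)
```

In practice the threshold would have to be tuned on held-out speech. When several target speakers are registered, one possible design is to score each frame against every stored model and keep the maximum.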

Based on the determination result, the target speaker utterance section detection unit 501 identifies section information, which is time information on the sections in which the target speaker is speaking, and outputs the section information to the utterance section processing unit 502. The section information includes start time information indicating the start time of a section and end time information indicating its end time. When there are multiple sections in which the target speaker is speaking, the target speaker utterance section detection unit 501 outputs the start time information and end time information of all of those sections as the section information.
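The conversion from per-frame decisions to section information can be sketched as below, assuming 10 ms frames and integer millisecond timestamps. `frames_to_sections` and `frame_ms` are hypothetical names introduced for this example.

```python
def frames_to_sections(flags, frame_ms=10):
    """Turn per-frame booleans into a list of (start_ms, end_ms) sections."""
    sections = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a section begins at this frame
        elif not flag and start is not None:
            sections.append((start * frame_ms, i * frame_ms))
            start = None                   # the section ended before this frame
    if start is not None:                  # section still open at the end of the signal
        sections.append((start * frame_ms, len(flags) * frame_ms))
    return sections


print(frames_to_sections([False, True, True, False, False, True]))
```

Every section found is returned, which matches the requirement that all start and end times be output when several sections exist.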

(Utterance section processing unit 502)
The utterance section processing unit 502 receives the section information from the target speaker utterance section detection unit 501, the second speech signal from the second speech signal generation unit 202, and the time information from the utterance section detection unit 101. Based on the received section information, the utterance section processing unit 502 extracts from the second speech signal the signal of the sections represented by the section information. Specifically, it generates a target speaker utterance section speech signal in which only the signal of the sections represented by the section information is retained. The utterance section processing unit 502 then applies to the target speaker utterance section speech signal the same processing as the utterance section processing unit 102 in the embodiments described above, and generates an utterance-section-processed signal.

Next, the operation of the speech processing device 500 according to the present embodiment will be described with reference to FIG. 14. FIG. 14 is a flowchart showing an example of the operation of the speech processing device 500 according to the present embodiment.

As shown in FIG. 14, the first speech signal generation unit 201 of the speech signal generation unit 210 first generates a first speech signal from the first input signal (step S501). The second speech signal generation unit 202 of the speech signal generation unit 210 generates a second speech signal from the second input signal (step S502). Steps S501 and S502 may be performed simultaneously or in the reverse order.

The utterance section detection unit 101 then detects, from the first speech signal, the utterance sections of utterances by the speaker of the non-target speech (step S503). The utterance section detection unit 101 identifies start time information indicating the start time of each detected utterance section and end time information indicating its end time as the time information of the utterance sections of the non-target speech (step S504).
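Steps S503 and S504 can be illustrated with a simple energy-based detector. This section does not say which detection technique the utterance section detection unit 101 uses, so thresholding the frame RMS is purely an assumption of this sketch, and `detect_utterance_sections`, `frame_ms`, and `rms_threshold` are hypothetical names.

```python
import math


def detect_utterance_sections(samples, rate_hz, frame_ms=10, rms_threshold=0.1):
    """Return (start_ms, end_ms) pairs for runs of frames whose RMS exceeds a threshold."""
    frame_len = rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len
    sections = []
    start_ms = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        t = i * frame_ms
        if rms > rms_threshold and start_ms is None:
            start_ms = t                       # S503: an utterance section begins
        elif rms <= rms_threshold and start_ms is not None:
            sections.append((start_ms, t))     # S504: record start and end time
            start_ms = None
    if start_ms is not None:
        sections.append((start_ms, n_frames * frame_ms))
    return sections


# 20 ms of silence, 30 ms of "speech", 10 ms of silence at a toy 1 kHz rate.
first_signal = [0.0] * 20 + [0.5] * 30 + [0.0] * 10
print(detect_utterance_sections(first_signal, rate_hz=1000))
```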

Next, the target speaker utterance section detection unit 501 detects the utterance sections of the target speaker from the second speech signal (step S505). The target speaker utterance section detection unit 501 then identifies section information, which is the time information of the target speaker's utterance sections (step S506). Steps S503 and S504 may be performed simultaneously with steps S505 and S506, or steps S503 and S504 may be performed after step S506 ends.

Next, the utterance section processing unit 502 extracts from the second speech signal the signal of the time sections corresponding to the time information identified in step S506 (step S507). The utterance section processing unit 502 then identifies, within the signal extracted in step S507, the time sections corresponding to the time information of the non-target speech identified in steps S503 and S504 (step S508). Finally, the utterance section processing unit 502 processes the signal of the time sections identified in step S508 within the extracted signal (step S509).
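Steps S507 to S509 can be sketched as follows. Two assumptions are made for the sake of a runnable example: "extracting" keeps the original timeline and silences everything outside the target speaker's sections, and "processing" is zero replacement. All function names are hypothetical.

```python
def ms_to_index(ms, rate_hz):
    """Convert a time in milliseconds to a sample index."""
    return ms * rate_hz // 1000


def keep_only_sections(signal, sections_ms, rate_hz):
    """S507: keep the samples inside the given sections, silence everything else."""
    out = [0.0] * len(signal)
    for start_ms, end_ms in sections_ms:
        a, b = ms_to_index(start_ms, rate_hz), ms_to_index(end_ms, rate_hz)
        out[a:b] = signal[a:b]
    return out


def zero_sections(signal, sections_ms, rate_hz):
    """S508-S509: locate the given sections and replace them with a zero signal."""
    out = list(signal)
    for start_ms, end_ms in sections_ms:
        a, b = ms_to_index(start_ms, rate_hz), ms_to_index(end_ms, rate_hz)
        out[a:b] = [0.0] * len(out[a:b])
    return out


def process_target_signal(second_signal, target_ms, non_target_ms, rate_hz):
    extracted = keep_only_sections(second_signal, target_ms, rate_hz)
    return zero_sections(extracted, non_target_ms, rate_hz)


# Toy 1 kHz signal: target speaker at 2-8 ms, non-target speech at 5-7 ms.
print(process_target_signal([1.0] * 10, [(2, 8)], [(5, 7)], rate_hz=1000))
```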

The speech processing device 500 then ends the process.

According to the speech processing device 500 of the present embodiment, the same effects as in the first embodiment described above can be obtained. Furthermore, the speech processing device 500 uses the feature amounts of the acoustic signal of the target speaker's voice to identify the time sections represented by the time information of the utterance sections of the target speech. It then extracts from the second speech signal only the speech signal of the identified time sections, and processes, within the extracted speech signal, the signal of the periods corresponding to the utterance sections of the non-target speech. The speech processing device 500 can thereby further reduce the possibility that a signal corresponding to the non-target speech escapes processing and remains.

The method by which the utterance section processing unit 502 generates the utterance-section-processed signal is not limited to the above. For example, the utterance section processing unit 502 may identify, in the second speech signal, the set of sections other than the sections represented by the received section information. It then obtains the union of this set and the set of sections represented by the received time information, and applies the same processing as the utterance section processing unit 102 to the periods of the second speech signal that are included in the union.
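The set operations in this variation can be sketched with half-open intervals. The sketch assumes sections are given as `(start, end)` pairs on a timeline of known total length. `merge`, `complement`, and `sections_to_process` are hypothetical helper names.

```python
def merge(intervals):
    """Union of intervals: sort, then fuse any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals, total):
    """Sections of [0, total) that are not covered by the given intervals."""
    gaps = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        gaps.append((cursor, total))
    return gaps


def sections_to_process(target, non_target, total):
    """Everything outside the target speaker's sections, plus the non-target sections."""
    return merge(complement(target, total) + non_target)


print(sections_to_process([(2, 8)], [(5, 7)], total=10))
```

Processing this union over the original second speech signal covers the same samples as extracting the target sections first and then processing the non-target periods inside them.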

<Seventh Embodiment>
Next, a seventh embodiment of the present invention will be described. This embodiment describes the minimum configuration that solves the problem addressed by the present invention.

FIG. 15 is a functional block diagram showing an example of the functional configuration of the speech processing device 10 according to the present embodiment. As shown in FIG. 15, the speech processing device 10 includes a detection unit 11 and a processing unit 12.

The detection unit 11 corresponds to the utterance section detection unit 101 in the embodiments described above. The detection unit 11 detects, from a first input signal that is a sensor signal, an utterance section of an utterance by the speaker of the non-target speech. The detection unit 11 then identifies time information indicating the start time and end time of the utterance section and supplies the identified time information to the processing unit 12.

The processing unit 12 corresponds to the utterance section processing units (102, 502) in the embodiments described above. The processing unit 12 receives the time information from the detection unit 11. The processing unit 12 also receives a second input signal, which is an acoustic signal including the target sound and the speech of the speaker of the non-target speech. The processing unit 12 identifies, in the second input signal, the signal of the period corresponding to the time information. It then processes the signal of that period in the second input signal so that at least one of the utterance content and the speaker cannot be identified from it. The processing method is not particularly limited: for example, the signal of the period may be deleted or replaced with a zero signal.
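The minimum configuration can be sketched as a detector callable feeding a processing function that supports the two options named above, deletion and zero replacement. The detector here is a stand-in supplied by the caller, periods are assumed to be non-overlapping sample index pairs, and every name in the sketch is hypothetical.

```python
def process_period(signal, start, end, mode="zero"):
    """Process samples [start, end): replace with zeros or delete them."""
    if mode == "zero":
        return signal[:start] + [0.0] * (end - start) + signal[end:]
    if mode == "delete":
        return signal[:start] + signal[end:]
    raise ValueError("unknown mode: " + mode)


def speech_processing_device(first_input, second_input, detect, mode="zero"):
    """detect(first_input) plays the role of detection unit 11 and returns
    (start, end) sample index pairs; the loop plays the role of processing unit 12."""
    out = list(second_input)
    # Work from the last period to the first so that deletion does not
    # shift the indices of periods that are still to be processed.
    for start, end in sorted(detect(first_input), reverse=True):
        out = process_period(out, start, end, mode)
    return out


def fake_detector(_first_input):
    return [(1, 3), (5, 6)]


second = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(speech_processing_device(None, second, fake_detector, mode="zero"))
print(speech_processing_device(None, second, fake_detector, mode="delete"))
```

Zero replacement preserves the timeline of the recording, while deletion shortens it. Which is preferable depends on whether downstream processing relies on timestamps.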

The period identified by the detection unit 11 contains the signal of the non-target speech. Therefore, when the processing unit 12 replaces the signal of that period with a zero signal, for example, the processed signal contains no non-target speech, and neither the utterance content nor the speaker of the non-target speech can be identified from it. When voice quality conversion is applied to the signal of the period, the speaker cannot be identified from the processed signal. In this way, the processing unit 12 can output an acoustic signal in which the speaker of the non-target speech and/or the content uttered by that speaker has been processed. Therefore, even when the non-target speech overlaps the target sound, the speech processing device 10 can prevent the speaker of the non-target speech and/or the content uttered by that speaker from being identified from an acoustic signal that includes the target sound.

(Hardware configuration)
In each embodiment of the present invention, each component of each device represents a block of a functional unit. Some or all of the components of each device are realized by an arbitrary combination of a program and an information processing device 900 such as the one shown in FIG. 16. As an example, the information processing device 900 includes the following configuration.

・CPU (Central Processing Unit) 901
・ROM (Read Only Memory) 902
・RAM (Random Access Memory) 903
・Program 904 loaded into the RAM 903
・Storage device 905 that stores the program 904
・Drive device 907 that reads from and writes to a recording medium 906
・Communication interface 908 that connects to a communication network 909
・Input/output interface 910 that inputs and outputs data
・Bus 911 that connects the components

Each component of each device in each embodiment is realized by the CPU 901 acquiring and executing the program 904 that implements its function. The program 904 is, for example, stored in advance in the storage device 905 or the RAM 903 and read by the CPU 901 as necessary. The program 904 may also be supplied to the CPU 901 via the communication network 909, or it may be stored in advance on the recording medium 906 and read by the drive device 907, which supplies it to the CPU 901.

There are various modifications to the way each device is realized. For example, each device may be realized by an arbitrary combination of a program and a separate information processing device 900 for each component. Alternatively, a plurality of components of a device may be realized by an arbitrary combination of a program and a single information processing device 900.

Some or all of the components of each device may also be realized by other general-purpose or dedicated circuits, processors, or the like, or by a combination of these. These may be configured as a single chip or as a plurality of chips connected via a bus.

Some or all of the components of each device may be realized by a combination of the circuits described above and a program.

When some or all of the components of each device are realized by a plurality of information processing devices, circuits, or the like, these may be arranged centrally or in a distributed manner. For example, the information processing devices, circuits, and the like may be connected to one another via a communication network, as in a client-server system or a cloud computing system.

The embodiments described above are preferred embodiments of the present invention, and the scope of the present invention is not limited to them. Those skilled in the art can modify or substitute the embodiments without departing from the gist of the present invention and construct forms incorporating various changes.

Some or all of the above embodiments can also be described as in the following supplementary notes, but are not limited to them.

(Supplementary Note 1)
A speech processing device comprising: detection means for detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech, and identifying time information indicating a start time and an end time of the utterance section; and processing means for processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal of a period corresponding to the identified time information so that at least one of the utterance content and the speaker cannot be identified from the signal of that period.

(Supplementary Note 2)
The speech processing device according to Supplementary Note 1, wherein the first input signal is a speech signal mainly including the non-target speech, and the second input signal is a speech signal mainly including the target sound.

(Supplementary Note 3)
The speech processing device according to Supplementary Note 2, wherein the first input signal is an acoustic signal collected by and output from a first microphone placed closer to the speaker of the non-target speech than to the sound source of the target sound, and the second input signal is an acoustic signal collected by and output from a second microphone, different from the first microphone, placed closer to the sound source of the target sound than to the speaker of the non-target speech.

(Supplementary Note 4)
The speech processing device according to any one of Supplementary Notes 1 to 3, wherein the first input signal and the second input signal each include one or more acoustic signals output from at least one of a plurality of microphones, the speech processing device further comprises generation means for generating, from the first input signal, a first acoustic signal that is a speech signal mainly including the non-target speech and generating, from the second input signal, a second acoustic signal that is a speech signal mainly including the target sound, the detection means identifies the time information from the first acoustic signal, and the processing means processes the signal of the period in the second acoustic signal.

(Supplementary Note 5)
The speech processing device according to Supplementary Note 4, further comprising selection means for selecting, based on a third input signal that is a sensor signal and on the first input signal, one or more acoustic signals from the first input signal as a first selection signal, and selecting, based on the third input signal and the second input signal, one or more acoustic signals from the second input signal as a second selection signal, wherein the generation means generates the first acoustic signal from the first selection signal and generates the second acoustic signal from the second selection signal.

(Supplementary Note 6)
The speech processing device according to Supplementary Note 4, further comprising estimation means for estimating, from a third input signal that is a sensor signal, the position of the sound source of the target sound and the position of the speaker of the non-target speech, wherein the generation means generates the first acoustic signal from the first input signal based on the estimated position of the speaker of the non-target speech, and generates the second acoustic signal from the second input signal based on the estimated position of the sound source of the target sound.

(Supplementary Note 7)
The speech processing device according to Supplementary Note 1, wherein the first input signal is a video signal representing video in which the speaker of the non-target speech is captured, and the detection means identifies the speaker of the non-target speech in the video represented by the video signal and uses the video of the identified speaker to detect the utterance section of the utterance by that speaker.

(Supplementary Note 8)
The speech processing device according to any one of Supplementary Notes 1 to 7, wherein the processing means replaces the signal of the period corresponding to the identified time information in the second input signal with a zero signal or a predetermined noise signal.

(Supplementary Note 9)
The speech processing device according to any one of Supplementary Notes 1 to 7, wherein the processing means adds, to the signal of the period corresponding to the identified time information in the second input signal, noise at a volume that makes the content of the speech in the period unidentifiable, the volume being set according to at least one of the volume of the utterance section in the sensor signal and the volume of the period in the second input signal.

(Supplementary Note 10)
The speech processing device according to any one of Supplementary Notes 1 to 7, wherein the processing means applies voice quality conversion to the signal of the period corresponding to the identified time information in the second input signal.

(Supplementary Note 11)
A speech processing method comprising: detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech, and identifying time information indicating a start time and an end time of the utterance section; and processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal of a period corresponding to the identified time information so that at least one of the utterance content and the speaker cannot be identified from the signal of that period.

(Supplementary Note 12)
A program causing a computer to execute: a process of detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech, and identifying time information indicating a start time and an end time of the utterance section; and a process of processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal of a period corresponding to the identified time information so that at least one of the utterance content and the speaker cannot be identified from the signal of that period.
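The noise-based processing of Supplementary Note 9 can be illustrated as below. The sketch assumes that volume is measured as the RMS of the period in the second input signal and that Gaussian noise is added with a standard deviation of `gain` times that RMS. The value of `gain` is hypothetical, and the sketch does not show that any particular value makes the speech unidentifiable.

```python
import math
import random


def add_masking_noise(signal, start, end, gain=2.0, seed=None):
    """Add Gaussian noise to samples [start, end), scaled to the period's own RMS."""
    rng = random.Random(seed)
    period = signal[start:end]
    if not period:
        return list(signal)
    rms = math.sqrt(sum(s * s for s in period) / len(period))
    noisy = [s + rng.gauss(0.0, gain * rms) for s in period]
    return signal[:start] + noisy + signal[end:]


signal = [0.1 * (i % 5) for i in range(20)]
masked = add_masking_noise(signal, 5, 15, gain=2.0, seed=0)
print(len(masked), masked[:5] == signal[:5], masked[15:] == signal[15:])
```

Scaling the noise to the period's own level keeps quiet passages from being drowned in unnecessarily loud noise while still raising the noise floor where the non-target speech is loud.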

Description of Symbols
10 Speech processing device
11 Detection unit
12 Processing unit
100 Speech processing device
101 Utterance section detection unit
102 Utterance section processing unit
103 Acoustic signal storage unit
110 Speech processing device
200 Speech processing device
201 First speech signal generation unit
202 Second speech signal generation unit
210 Speech signal generation unit
300 Speech processing device
301 Signal selection unit
400 Speech processing device
401 First speech signal generation unit
402 Second speech signal generation unit
403 First speaker position estimation unit
404 Second speaker position estimation unit
410 Speech signal generation unit
420 Position estimation unit
500 Speech processing device
501 Target speaker utterance section detection unit
502 Utterance section processing unit
503 Storage unit
600 Speech recognition device
601 Speech recognition unit

Claims

A speech processing device comprising:
detection means for detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech, and identifying time information indicating a start time and an end time of the utterance section; and
processing means for processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal of a period corresponding to the identified time information so that at least one of the utterance content and the speaker cannot be identified from the signal of that period.

The speech processing device according to claim 1, wherein the first input signal is a speech signal mainly including the non-target speech, and
the second input signal is a speech signal mainly including the target sound.

The speech processing device according to claim 1, wherein the first input signal and the second input signal each include one or more acoustic signals output from at least one of a plurality of microphones,
the speech processing device further comprises generation means for generating, from the first input signal, a first acoustic signal that is a speech signal mainly including the non-target speech, and generating, from the second input signal, a second acoustic signal that is a speech signal mainly including the target sound,
the detection means identifies the time information from the first acoustic signal, and
the processing means processes the signal of the period in the second acoustic signal.

The speech processing device according to claim 3, further comprising selection means for selecting, based on a third input signal that is a sensor signal and on the first input signal, one or more acoustic signals from the first input signal as a first selection signal, and selecting, based on the third input signal and the second input signal, one or more acoustic signals from the second input signal as a second selection signal,
wherein the generation means generates the first acoustic signal from the first selection signal and generates the second acoustic signal from the second selection signal.

The speech processing device according to claim 3, further comprising estimation means for estimating, from a third input signal that is a sensor signal, the position of the sound source of the target sound and the position of the speaker of the non-target speech,
wherein the generation means generates the first acoustic signal from the first input signal based on the estimated position of the speaker of the non-target speech, and generates the second acoustic signal from the second input signal based on the estimated position of the sound source of the target sound.

The speech processing device according to claim 1, wherein the first input signal is a video signal representing video in which the speaker of the non-target speech is captured, and
the detection means identifies the speaker of the non-target speech in the video represented by the video signal and uses the video of the identified speaker to detect the utterance section of the utterance by that speaker.

The processing means replaces a signal of a period corresponding to the specified time information in the second input signal with a zero signal or a predetermined noise signal. The speech processing apparatus according to the item.

The speech processing device according to any one of claims 1 to 6, wherein the processing means applies voice quality conversion to the signal of the period corresponding to the identified time information in the second input signal.

A speech processing method comprising:
detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech, and identifying time information indicating a start time and an end time of the utterance section; and
processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal of a period corresponding to the identified time information so that at least one of the utterance content and the speaker cannot be identified from the signal of that period.

A program causing a computer to execute:
a process of detecting, from a first input signal that is a sensor signal, an utterance section of an utterance by a speaker of non-target speech, and identifying time information indicating a start time and an end time of the utterance section; and
a process of processing, in a second input signal that is an acoustic signal including a target sound and the speech of the speaker of the non-target speech, the signal of a period corresponding to the identified time information so that at least one of the utterance content and the speaker cannot be identified from the signal of that period.