JP6478727B2

JP6478727B2 - Audio processing apparatus, audio processing method and program

Info

Publication number: JP6478727B2
Application number: JP2015047658A
Authority: JP
Inventors: 祐介木田; 誠広畑; 尚水吉田; 達馬石原
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2015-03-10
Filing date: 2015-03-10
Publication date: 2019-03-06
Anticipated expiration: 2035-03-10
Also published as: JP2016167782A

Description

Embodiments of the present invention relate to a speech processing apparatus, a speech processing method, and a program.

With the recent growth in communication demand, wireless communication devices (wireless mobile stations and wireless base stations) have spread rapidly. To maintain order and use radio waves effectively in this situation, each wireless communication device must be used under certain conditions. However, because of device failures, illegal operation, and the like, not every wireless communication device is operated in compliance with those conditions. If such devices are left unattended, they may interfere with devices that are operating normally, so monitoring radio-wave usage and preventing the emission of abnormal radio waves has become increasingly important. The frequency band of radio signals is wide, however, and having a person monitor all of it at all times is costly. Attention has therefore turned to techniques for automatically detecting a target signal in a radio signal.

Consider now the detection of voice communication carried by abnormal radio waves. In this case, the target signal is speech (a human utterance). A technique called "utterance section detection" is known for automatically detecting human utterances in an acoustic signal. Utterance section detection is used mainly in speech recognition and similar applications, and various methods have been developed. This technique is also considered useful for detecting voice communication carried by abnormal radio waves.

The frequency band used by the sender of abnormal radio waves usually cannot be known in advance. One way to identify the frequency band in which abnormal radio waves exist is therefore to use a filter bank composed of a plurality of bandpass filters that pass different frequency bands (passbands). The filter bank divides the radio signal into a plurality of subband signals, each subband signal is demodulated, and utterance section detection is performed on the resulting demodulated signals. The frequency band used by the sender can then be identified from the passband of the bandpass filter for which an utterance (speech) was detected.

Patent Document 1: JP 2007-17620 A

However, depending on the configuration of the filter bank, for example when the passbands of adjacent bandpass filters overlap, a single abnormal radio wave may pass through a plurality of bandpass filters. If the abnormal radio wave contains an utterance, the same utterance is then detected from each of the demodulated signals corresponding to those bandpass filters. As a result, the confirmation work becomes cumbersome, for example because a person who listens to the detected utterances to confirm them hears the same utterance many times, and voice communication carried by abnormal radio waves cannot be detected efficiently.

The problem to be solved by the present invention is to provide a speech processing apparatus, a speech processing method, and a program that can efficiently detect voice communication carried by abnormal radio waves.

The speech processing apparatus of the embodiment includes a dividing unit, a demodulation unit, a detection unit, a determination unit, and a selection unit. The dividing unit divides a received radio signal into a plurality of subband signals using a filter bank composed of a plurality of bandpass filters with different passbands. The demodulation unit demodulates the plurality of subband signals individually to generate a plurality of demodulated signals, each corresponding to one of the bandpass filters. The detection unit detects utterances from each of the demodulated signals based on a reliability score representing the likelihood of an utterance. Let an utterance detected from a demodulated signal of interest be a first utterance, and let an utterance detected from another demodulated signal, corresponding to another bandpass filter adjacent in the frequency direction to the bandpass filter of the demodulated signal of interest, be a second utterance. When one or more second utterances overlap the first utterance in time at least partially, the determination unit determines whether the first utterance and the second utterance are the same utterance. When they are determined to be the same utterance, the selection unit selects one of the first utterance and the second utterance.

FIG. 1 is a block diagram showing an example of the functional configuration of the speech processing apparatus of the embodiment.
FIG. 2 is a flowchart showing an example of the processing procedure performed by the speech processing apparatus of the embodiment.
FIG. 3 is a diagram illustrating a configuration example of the filter bank.
FIG. 4 is a diagram showing, on the time-frequency plane, an example of utterances present in a radio signal.
FIG. 5 is a diagram showing a waveform example of a demodulated signal.
FIG. 6 is a diagram showing a waveform example of a demodulated signal.
FIG. 7 is a diagram showing a waveform example of a demodulated signal.
FIG. 8 is a diagram showing the result of detecting an utterance from the demodulated signal shown in FIG. 5.
FIG. 9 is a diagram showing the result of detecting an utterance from the demodulated signal shown in FIG. 6.
FIG. 10 is a diagram showing the result of detecting an utterance from the demodulated signal shown in FIG. 7.
FIG. 11 is a diagram showing an example of information about utterances.
FIG. 12 is a block diagram showing a hardware configuration example of the speech processing apparatus.

The speech processing apparatus, speech processing method, and program of the embodiment are described in detail below with reference to the accompanying drawings. The speech processing apparatus of this embodiment detects human utterances (speech) in a radio signal and outputs information about the detected utterances.

First, the configuration of the speech processing apparatus of this embodiment is described with reference to FIG. 1. FIG. 1 is a block diagram showing an example of the functional configuration of the speech processing apparatus 1 of this embodiment. As shown in FIG. 1, the speech processing apparatus 1 includes a dividing unit 11; a plurality of demodulation units 12_1, 12_2, ..., 12_n (hereinafter collectively referred to as the demodulation units 12); a plurality of detection units 13_1, 13_2, ..., 13_n (hereinafter collectively referred to as the detection units 13); a determination unit 14; a selection unit 15; and an output unit 16.

The dividing unit 11 divides the received radio signal into a plurality of subband signals using a filter bank composed of a plurality of bandpass filters with different passbands. Part of the passband of each bandpass filter in the filter bank may overlap the passband of an adjacent bandpass filter.

The demodulation unit 12 demodulates a subband signal produced by the dividing unit 11 to generate a demodulated signal. Each of the plurality of demodulation units 12 corresponds to an individual subband signal. That is, each demodulation unit 12 individually demodulates a subband signal, which is the signal that has passed through one of the bandpass filters of the filter bank. A plurality of demodulated signals, each corresponding to one of the bandpass filters of the filter bank, is generated in this way. The modulation of the radio signal, and correspondingly the demodulation of the subband signals by the demodulation units 12, may use digital modulation such as frequency shift keying (FSK) or phase shift keying (PSK), or analog modulation such as amplitude modulation (AM) or frequency modulation (FM).

FIG. 1 illustrates a configuration with as many demodulation units 12 as there are subband signals (that is, as many as there are bandpass filters in the filter bank), on the assumption that the subband signals are demodulated in parallel, but the configuration is not limited to this. For example, a single demodulation unit 12, or fewer demodulation units 12 than there are subband signals, may demodulate at least some of the subband signals sequentially. Each of the subband signals produced by the dividing unit 11 and each of the demodulated signals produced by the demodulation units 12 is assumed to carry common time information indicating the position of its signal components in the time direction.

The detection unit 13 calculates, along the time direction of the demodulated signal generated by the demodulation unit 12, a reliability score representing the likelihood of an utterance, and detects utterances from the demodulated signal based on the calculated reliability score. Each of the plurality of detection units 13 corresponds to an individual demodulated signal. That is, each detection unit 13 is provided for one demodulation unit 12 and individually detects utterances in the demodulated signal generated by that demodulation unit 12. A known method, such as the method described in Patent Document 1, can be used to detect utterances based on the reliability score.

FIG. 1 illustrates a configuration with as many detection units 13 as there are subband signals (that is, as many as there are bandpass filters in the filter bank), on the assumption that utterances are detected from the demodulated signals in parallel, but the configuration is not limited to this. For example, a single detection unit 13, or fewer detection units 13 than there are demodulated signals, may detect utterances from at least some of the demodulated signals sequentially.

Based on the utterance detection results of the plurality of detection units 13, the determination unit 14 determines whether utterances detected from different demodulated signals by different detection units 13 are identical. Here, an utterance detected from the demodulated signal of interest among the plurality of demodulated signals is called a "first utterance", and an utterance detected from a demodulated signal corresponding to another bandpass filter, whose passband is close to that of the bandpass filter corresponding to the demodulated signal of interest, is called a "second utterance". The other bandpass filter referred to here may be a bandpass filter adjacent in the frequency direction to the bandpass filter corresponding to the demodulated signal of interest.

The determination unit 14 first searches, for each first utterance detected from the demodulated signal of interest, for second utterances that overlap it in time at least partially. If the search finds one or more such second utterances, the determination unit 14 determines whether the first utterance and the second utterance are the same utterance. The determination unit 14 repeats this processing for each of the plurality of demodulated signals while switching the demodulated signal of interest.

Whether the first utterance and the second utterance are the same utterance can be determined, for example, from the degree to which the times at which the utterances exist overlap. Specifically, for example, the utterances are determined to be the same utterance when the difference between the start times of the first and second utterances is within a predetermined time and the difference between their end times is also within a predetermined time.
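The timing-based check described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the `Utterance` record, the function name `same_by_timing`, and the 0.5-second tolerance are assumptions introduced here, while the start and end times are those of utterances U21, U22, and U23 in the worked example of this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record: bandpass filter index plus start/end time in seconds."""
    band: int
    start: float
    end: float


def same_by_timing(first: Utterance, second: Utterance, max_shift: float) -> bool:
    """Return True when both the start-time and end-time offsets are within max_shift."""
    start_shift = abs(first.start - second.start)
    end_shift = abs(first.end - second.end)
    return start_shift <= max_shift and end_shift <= max_shift


# Times from the worked example; the tolerance of 0.5 s is an assumed value.
u21 = Utterance(band=1, start=26.4, end=30.3)
u22 = Utterance(band=2, start=26.1, end=29.9)
u23 = Utterance(band=3, start=18.4, end=38.1)

print(same_by_timing(u21, u22, max_shift=0.5))  # offsets of about 0.3 s and 0.4 s
print(same_by_timing(u22, u23, max_shift=0.5))  # offsets of several seconds
```

With this assumed tolerance, U21 and U22 are treated as the same utterance and U22 and U23 are not; a different tolerance would of course change the outcome.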

Whether the first utterance and the second utterance are the same utterance can also be determined, for example, from an evaluation of the similarity between features extracted from the first utterance and features extracted from the second utterance. Examples of such features include log power and mel-frequency cepstral coefficients (MFCC). The reliability score calculated by the detection unit 13 may also be used as a feature. The similarity of the features can be evaluated, for example, with a correlation coefficient (such as an inner product) of the features calculated at each time (for example, each frame) from the first utterance and the second utterance. If the times of the first and second utterances do not coincide completely and only partly overlap, the similarity may be evaluated over the overlapping time or over the time span that contains both utterances.
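As a rough illustration of the feature-similarity evaluation, the sketch below computes a Pearson correlation coefficient between two per-frame feature tracks over only the frames the two utterances share. The function names, the use of frame indices on a common time axis, and the sample values are assumptions made for this example; the description leaves the exact feature and similarity measure open.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 frames")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant track carries no shape information
    return cov / math.sqrt(var_x * var_y)


def overlap_similarity(feat_a: Sequence[float], start_a: int,
                       feat_b: Sequence[float], start_b: int) -> float:
    """Correlate per-frame features over the frames both utterances cover.

    start_a and start_b are the first frame indices of each utterance on the
    common time axis (an assumed representation).
    """
    lo = max(start_a, start_b)
    hi = min(start_a + len(feat_a), start_b + len(feat_b))
    if hi - lo < 2:
        return 0.0  # too little overlap to compare
    return pearson(feat_a[lo - start_a:hi - start_a],
                   feat_b[lo - start_b:hi - start_b])


# Invented feature tracks: the second is the first shifted by one frame
# and offset by a constant, so the overlapping frames correlate strongly.
track_a = [0.1, 0.5, 0.9, 0.7, 0.2]   # frames 10..14
track_b = [0.6, 1.0, 0.8, 0.3, 0.1]   # frames 11..15
print(overlap_similarity(track_a, 10, track_b, 11))
```

Evaluating over the span that contains both utterances, the alternative mentioned above, would require the feature tracks to extend beyond each detected section, which this sketch does not model.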

Using the result of the determination by the determination unit 14, the selection unit 15 selects, from among the utterances detected from the plurality of demodulated signals by the plurality of detection units 13, the utterances for which the output unit 16 (described later) outputs information. That is, when the determination unit 14 determines that the first utterance and the second utterance are the same utterance, the selection unit 15 selects one of them, for example the utterance with the higher reliability score calculated by the detection unit 13. When the processing by the determination unit 14 finds no second utterance that overlaps the first utterance in time at least partially, or finds one or more such second utterances but determines that the first utterance and the second utterance are not the same utterance, the selection unit 15 selects the first utterance.
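The selection rule can be sketched as below. This is only an illustration under assumptions: utterances are represented as hypothetical dictionaries with a `"score"` key, the scores are invented, and breaking ties in favor of the first utterance is a choice made here, not something the description specifies.

```python
from typing import Optional


def select(first: dict, second: Optional[dict], same: bool) -> dict:
    """Pick the utterance whose information will be output.

    first/second are hypothetical records with a "score" key holding the
    reliability score; second is None when no overlapping utterance was found.
    """
    if second is None or not same:
        # No overlapping candidate, or judged to be a different utterance:
        # the first utterance is selected as it is.
        return first
    # Same utterance seen through two filters: keep the higher-scoring one
    # (ties go to the first utterance, an assumption of this sketch).
    return first if first["score"] >= second["score"] else second


# Invented scores for illustration only.
a = {"id": "A", "score": 0.71}
b = {"id": "B", "score": 0.84}

print(select(a, b, same=True)["id"])    # the higher-scoring record
print(select(a, b, same=False)["id"])   # the first utterance
print(select(a, None, same=False)["id"])
```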

When the selection unit 15 selects one of a first utterance and a second utterance that the determination unit 14 has determined to be the same utterance, and the times of the selected and unselected utterances do not coincide completely, the selection unit 15 may merge into the selected utterance the part of the unselected utterance that does not overlap the selected utterance in time. In this case, the merged utterance becomes the utterance for which the output unit 16 (described later) outputs information.
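At the level of time spans, the merging step amounts to taking the union of two overlapping intervals. The sketch below shows only that bookkeeping, with spans as assumed `(start, end)` tuples in seconds; how the audio of the non-overlapping part is actually joined to the selected utterance is not modeled here.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start, end) in seconds; an assumed representation


def merge_span(selected: Span, unselected: Span) -> Span:
    """Extend the selected span by the part of the unselected span outside it."""
    s_start, s_end = selected
    u_start, u_end = unselected
    if u_end < s_start or u_start > s_end:
        raise ValueError("utterances do not overlap in time")
    return (min(s_start, u_start), max(s_end, u_end))


# Spans of U22 (taken as selected) and U21 (taken as unselected) from the worked example.
print(merge_span((26.1, 29.9), (26.4, 30.3)))
```

Here the merged span keeps the earlier start of one utterance and the later end of the other, so the output covers every instant at which either filter detected the utterance.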

The output unit 16 outputs information about the utterance selected by the selection unit 15. The information output by the output unit 16 may be, for example, the audio signal of the selected utterance, or information identifying the selected utterance, such as the number of the bandpass filter corresponding to the demodulated signal from which the utterance was detected and the time at which the detected utterance exists. These pieces of information may also be output in combination, and the reliability score of the selected utterance may be added to the output.

The output unit 16 may also output information about the unselected utterance together with the information about the utterance selected by the selection unit 15. For example, along with the audio signal of the selected utterance, it may output not only the number of the bandpass filter corresponding to the demodulated signal from which the selected utterance was detected but also the number of the bandpass filter corresponding to the demodulated signal from which the unselected utterance was detected.

Next, the operation of the speech processing apparatus 1 of this embodiment is described with reference to FIG. 2. FIG. 2 is a flowchart showing an example of the processing procedure performed by the speech processing apparatus 1. The series of processes shown in the flowchart of FIG. 2 is executed repeatedly by the speech processing apparatus 1 at a predetermined cycle while the radio signal is being received.

When the processing shown in the flowchart of FIG. 2 starts, the dividing unit 11 first divides the received radio signal into a plurality of subband signals (step S101). The subband signals produced by the dividing unit 11 are supplied to the respective demodulation units 12.

Next, each of the demodulation units 12 individually demodulates the subband signal supplied from the dividing unit 11, so that a plurality of demodulated signals is generated (step S102). The demodulated signals generated by the demodulation units 12 are supplied to the corresponding detection units 13.

Next, each of the detection units 13 detects utterances from the demodulated signal supplied from its demodulation unit 12, based on the reliability score representing the likelihood of an utterance (step S103). The utterance detection results of the detection units 13 are supplied to the determination unit 14 and the selection unit 15.

Next, based on the utterance detection results of the detection units 13, the determination unit 14 determines whether utterances detected from different demodulated signals by different detection units 13 are identical. That is, for each first utterance detected from the demodulated signal of interest, the determination unit 14 first searches for second utterances that overlap the first utterance in time at least partially (step S104). If one or more such second utterances exist, the determination unit 14 determines whether the first utterance and the second utterance are the same utterance (step S105). The determination unit 14 repeats steps S104 and S105 for all utterances detected by the detection units 13 while switching the demodulated signal of interest. The result of the determination by the determination unit 14 is supplied to the selection unit 15.

Next, using the result of the determination by the determination unit 14, the selection unit 15 selects, from among the utterances detected from the plurality of demodulated signals by the detection units 13, the utterances for which the output unit 16 outputs information (step S106). The result of the selection by the selection unit 15 is supplied to the output unit 16.

Finally, the output unit 16 acquires information about the utterances selected by the selection unit 15 from the units from the dividing unit 11 through the determination unit 14, and outputs it to, for example, an output device such as a display or a speaker, a file storage device such as an HDD, or a communication I/F connected to a network (step S107).

As described above, the speech processing apparatus 1 of this embodiment divides the received radio signal into a plurality of subband signals and detects utterances from each of the demodulated signals obtained by demodulating the subband signals. When the same utterance is detected from different demodulated signals, one of the detected utterances is selected through the processing of the determination unit 14 and the selection unit 15, and information about the selected utterance is output. Consequently, a person who listens to the detected utterances to confirm them, for example, does not have to hear the same utterance many times, and the effort required for confirmation is reduced, so voice communication carried by abnormal radio waves can be detected efficiently.

Next, an example of the processing performed by the speech processing apparatus 1 of this embodiment is described using a specific case. The radio signal to be processed and the processing results up to the detection units 13 are described first.

FIG. 3 is a diagram illustrating a configuration example of the filter bank in the dividing unit 11. The filter bank illustrated in FIG. 3 consists of a plurality of bandpass filters, each with a passband 8000 hertz wide, arranged at intervals of 6000 hertz. Part of the passband of each bandpass filter in the filter bank overlaps the passband of the adjacent bandpass filter.
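The geometry of this filter bank can be checked with a few lines of arithmetic. In the sketch below, the 8000 Hz width and 6000 Hz spacing come from the description; the base frequency of 0 Hz, the function names, and the 12.5 to 13.5 kHz test signal are assumptions chosen only to make the example concrete.

```python
from typing import List, Tuple

Band = Tuple[float, float]  # (lower edge, upper edge) in hertz


def passbands(n_filters: int, base_hz: float = 0.0,
              width_hz: float = 8000.0, spacing_hz: float = 6000.0) -> List[Band]:
    """Passband edges of n_filters bandpass filters placed every spacing_hz."""
    return [(base_hz + i * spacing_hz, base_hz + i * spacing_hz + width_hz)
            for i in range(n_filters)]


def filters_hit(sig_lo: float, sig_hi: float, bands: List[Band]) -> List[int]:
    """Indices of filters whose passband intersects the band [sig_lo, sig_hi]."""
    return [i for i, (lo, hi) in enumerate(bands) if sig_lo < hi and lo < sig_hi]


bands = passbands(4)                      # F0..F3 with an assumed base of 0 Hz
overlap_hz = bands[0][1] - bands[1][0]    # shared width between neighbors
print(bands)
print(overlap_hz)                         # 8000 - 6000 = 2000 Hz
print(filters_hit(12500.0, 13500.0, bands))
```

Because each passband is 2000 Hz wider than the spacing, adjacent filters share a 2000 Hz region, and a narrowband signal that falls inside that region reaches the outputs of both filters.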

FIG. 4 is a diagram showing, on the time-frequency plane, an example of utterances present in a radio signal. In this example, utterance U11 and utterance U12 are assumed to be present in the radio signal. The filter bank illustrated in FIG. 3 is shown on the left side of FIG. 4. When this radio signal is divided using this filter bank, the signal component of utterance U11 is contained in both the subband signal that has passed through bandpass filter F1 and the subband signal that has passed through bandpass filter F2. The signal component of utterance U12 is contained in the subband signal that has passed through bandpass filter F3.

FIG. 5 is a diagram showing a waveform example of the demodulated signal obtained by demodulating the subband signal that has passed through bandpass filter F1. FIG. 6 is a diagram showing a waveform example of the demodulated signal obtained by demodulating the subband signal that has passed through bandpass filter F2. FIG. 7 is a diagram showing a waveform example of the demodulated signal obtained by demodulating the subband signal that has passed through bandpass filter F3. T0 and Tn in the figures each denote a time common to the figures.

FIGS. 8 to 10 are diagrams showing the results of utterance detection by the detection units 13 on the demodulated signals shown in FIGS. 5 to 7, respectively. The graphs in the figures show the time series of the reliability score calculated by the detection units 13. In this example, the detection units 13 detect as an utterance each section in which the reliability score exceeds a threshold. As a result, the section from 26.4 seconds to 30.3 seconds is detected as utterance U21 from the demodulated signal corresponding to bandpass filter F1, as shown in FIG. 8. The section from 26.1 seconds to 29.9 seconds is detected as utterance U22 from the demodulated signal corresponding to bandpass filter F2, as shown in FIG. 9. The section from 18.4 seconds to 38.1 seconds is detected as utterance U23 from the demodulated signal corresponding to bandpass filter F3, as shown in FIG. 10.
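Detecting the sections in which a score exceeds a threshold can be sketched as a single pass over the score time series. The function name, the fixed frame length, and the sample scores below are assumptions for illustration; the description does not specify how the reliability score itself is computed, and practical detectors often add smoothing or minimum-duration rules that are omitted here.

```python
from typing import List, Sequence, Tuple


def detect_utterances(scores: Sequence[float], threshold: float,
                      frame_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs of frames whose score exceeds threshold."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current run, if any
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i                                   # a run begins
        elif score <= threshold and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None                                # the run ended before frame i
    if start is not None:                               # run reaches the end of the data
        segments.append((start * frame_sec, len(scores) * frame_sec))
    return segments


# Invented score track with 0.5 s frames and an assumed threshold of 0.5.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.9]
print(detect_utterances(scores, threshold=0.5, frame_sec=0.5))
# -> [(1.0, 2.5), (3.5, 4.5)]
```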

FIG. 11 is a diagram showing an example of information about the utterances U21, U22, and U23 detected by the detection units 13 in this example. The average reliability score in the figure is the average of the reliability score within the utterance section.

Next, the behavior of the determination unit 14, the selection unit 15, and the output unit 16 in this example is described.

For each detected utterance, the determination unit 14 in this example searches for utterances that were detected from another demodulated signal whose corresponding bandpass filter is adjacent to that of the demodulated signal in which the utterance was detected, and that overlap the utterance in time at least partially. Following this method, the determination unit 14 first takes utterance U21, detected from the demodulated signal corresponding to bandpass filter F1, and searches for target utterances in the results of the detection units 13 for the demodulated signals corresponding to the adjacent bandpass filters F0 and F2. In this example, no utterance is detected from the demodulated signal corresponding to bandpass filter F0, and utterance U22 is detected from the demodulated signal corresponding to bandpass filter F2. Utterance U22 overlaps utterance U21 in the section from 26.4 seconds to 29.9 seconds. The determination unit 14 therefore writes the two utterances U21 and U22 to the storage unit as a pair, so that it can later determine whether they are the same utterance.

The determination unit 14 next takes the utterance U22, detected from the demodulated signal corresponding to the bandpass filter F2, and searches the results of the detection units 13 for the demodulated signals corresponding to the adjacent bandpass filters F1 and F3. In this example, the utterance U21 is detected from the demodulated signal corresponding to the bandpass filter F1, and the utterance U23 is detected from the demodulated signal corresponding to the bandpass filter F3. The utterance U23 overlaps the utterance U22 in the section from 26.1 seconds to 29.9 seconds. The determination unit 14 therefore writes the two utterances U22 and U23 to the storage unit as a pair, so that it can later determine whether they are the same utterance. The pair of the utterances U21 and U22 has already been written to the storage unit, so it is not written again here, which avoids duplication.
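The pair search described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the `Utterance` class, the function names, and the start and end times are hypothetical, with the times chosen only so that the overlaps match the 26.4 to 29.9 second and 26.1 to 29.9 second sections given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    name: str      # label such as "U21"
    band: int      # index of the bandpass filter (F1 -> 1, F2 -> 2, ...)
    start: float   # start time in seconds
    end: float     # end time in seconds


def overlap(a, b):
    """Return the overlapping time section of two utterances, or None."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return (start, end) if start < end else None


def candidate_pairs(utterances):
    """Pair utterances from adjacent bands that overlap in time, without duplicates."""
    pairs = []
    seen = set()
    for first in utterances:
        for second in utterances:
            if abs(first.band - second.band) != 1:
                continue  # only bands adjacent in the frequency direction
            if overlap(first, second) is None:
                continue  # no common time section
            key = frozenset((first.name, second.name))
            if key not in seen:  # skip a pair that is already stored
                seen.add(key)
                pairs.append((first.name, second.name))
    return pairs


# Hypothetical start/end times, chosen to reproduce the overlaps in the text.
detected = [
    Utterance("U21", band=1, start=26.4, end=29.9),
    Utterance("U22", band=2, start=26.1, end=30.2),
    Utterance("U23", band=3, start=25.8, end=29.9),
]
pairs = candidate_pairs(detected)
```

Storing each pair under an unordered key plays the role of the duplicate check: when U22 is processed, the pair with U21 is recognized as already stored.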

Next, for each pair of utterances written to the storage unit, the determination unit 14 determines whether the two utterances are the same utterance. The determination unit 14 in this example calculates the correlation coefficient of the reliability scores over the times at which the two utterances overlap, and determines whether the two utterances are the same utterance according to whether the correlation coefficient exceeds a predetermined threshold (0.60 here).

First, for the pair of the utterances U21 and U22, the determination unit 14 obtains the correlation coefficient between the reliability score calculated from the demodulated signal corresponding to the bandpass filter F1 and the reliability score calculated from the demodulated signal corresponding to the bandpass filter F2, over the section from 26.4 seconds to 29.9 seconds in which the two utterances overlap. The resulting correlation coefficient is 0.91, which exceeds the threshold of 0.60, so the determination unit 14 determines that these two utterances are the same utterance. Next, for the pair of the utterances U22 and U23, the determination unit 14 obtains the correlation coefficient between the reliability score calculated from the demodulated signal corresponding to the bandpass filter F2 and the reliability score calculated from the demodulated signal corresponding to the bandpass filter F3, over the section from 26.1 seconds to 29.9 seconds in which the two utterances overlap. The resulting correlation coefficient is 0.08, which is below the threshold of 0.60, so the determination unit 14 determines that these two utterances are not the same utterance.
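The threshold test can be illustrated with a short sketch. It assumes that the reliability scores of the two demodulated signals are available as equal-length, frame-aligned sequences covering the overlapping section, and that the correlation coefficient is the Pearson coefficient; the text specifies neither point. The function names and the score values are hypothetical and do not reproduce the 0.91 and 0.08 of this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # assumption: a constant score sequence counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def same_utterance(scores_a, scores_b, threshold=0.60):
    """True when the correlation of the two score sequences exceeds the threshold."""
    return pearson(scores_a, scores_b) > threshold


# Hypothetical per-frame reliability scores over an overlapping section.
scores_f1 = [0.1, 0.5, 0.9, 0.4, 0.2]
scores_f2 = [0.2, 0.6, 0.8, 0.5, 0.1]   # rises and falls together with scores_f1
scores_f3 = [0.5, 0.1, 0.4, 0.9, 0.3]   # unrelated pattern

same_12 = same_utterance(scores_f1, scores_f2)
same_13 = same_utterance(scores_f1, scores_f3)
```

With these illustrative values the first pair is judged to be the same utterance and the second is not, mirroring the two outcomes described for the pairs (U21, U22) and (U22, U23).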

On receiving the determination result of the determination unit 14, the selection unit 15 in this example selects an utterance as it is when no other utterance was determined to be the same as it, and, when other utterances were determined to be the same as it, selects from among those utterances the one with the highest reliability score (for example, the highest average reliability score). As a result, the utterance U23, for which no utterance was determined to be the same, is selected as an utterance whose information the output unit 16 is to output. For the utterances U21 and U22, which were determined to be the same, the utterance U21, whose average reliability score is higher than that of the utterance U22, is selected as an utterance whose information the output unit 16 is to output.
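The selection rule can be sketched as below. The average reliability scores are hypothetical stand-ins for the values of FIG. 11 (only the ordering U21 > U22 is taken from the text), and grouping utterances transitively through the pairs judged to be the same is an assumption of this sketch rather than something the description states.

```python
def select_utterances(avg_scores, same_pairs):
    """Keep one utterance per group of utterances judged to be the same.

    avg_scores: dict mapping utterance name -> average reliability score
    same_pairs: iterable of (name, name) pairs judged to be the same utterance
    """
    parent = {name: name for name in avg_scores}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in same_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    best = {}
    for name, score in avg_scores.items():
        root = find(name)
        if root not in best or score > avg_scores[best[root]]:
            best[root] = name  # highest average score in the group so far
    return sorted(best.values())


# Hypothetical average reliability scores; only U21 > U22 follows from the text.
avg_scores = {"U21": 0.80, "U22": 0.55, "U23": 0.70}
selected = select_utterances(avg_scores, [("U21", "U22")])
```

An utterance with no counterpart forms a group of its own and is therefore always selected, which corresponds to the handling of U23.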

On receiving the result of the selection unit 15, the output unit 16 in this example outputs information on the selected utterances. For example, as information on the selected utterance U21, the output unit 16 outputs the audio signal of the utterance U21 together with the number of the bandpass filter F1 corresponding to the demodulated signal from which the utterance U21 was detected, the time at which the utterance exists, the average reliability score, and the like. Likewise, as information on the selected utterance U23, the output unit 16 outputs the audio signal of the utterance U23 together with the number of the bandpass filter F3 corresponding to the demodulated signal from which the utterance U23 was detected, the time at which the utterance exists, the average reliability score, and the like.

As described above, in this example two utterances U21 and U22, both representing the same utterance U11 in the radio signal, were detected from the plurality of demodulated signals obtained by demodulating the plurality of subband signals divided from the radio signal. If a person listens to the detected utterances to check them, applying conventional speech section detection as it is means listening repeatedly to the two utterances U21 and U22 that represent the same utterance U11, which makes the checking work cumbersome. According to the present embodiment, by contrast, only one of the two utterances U21 and U22 representing the same utterance U11, namely the utterance U21, is selected for information output, so the check can be made without listening to the same utterance repeatedly. This reduces the effort required for the checking work, so that voice communication using abnormal radio waves can be detected efficiently.

The voice processing device 1 of the present embodiment can realize each of the units described above (the dividing unit 11, the demodulation units 12, the detection units 13, the determination unit 14, the selection unit 15, and the output unit 16) by, for example, using a general-purpose computer system as basic hardware and executing a predetermined program (software) on that computer system.

FIG. 12 is a block diagram showing an example of the hardware configuration of the voice processing device 1 of the present embodiment. As shown in FIG. 12, for example, the voice processing device 1 has the hardware configuration of an ordinary computer device, comprising a processor such as a CPU 101, storage devices such as a RAM 102 and a ROM 103, a peripheral device I/F 104 that mediates data input and output with peripheral devices such as a display 110 and a speaker 120, a file storage device such as an HDD 105, and a communication I/F 106 that communicates with the outside via a network.

In this case, the program is provided recorded on, for example, a magnetic disk (flexible disk, hard disk, etc.), an optical disc (CD-ROM, CD-R, CD-RW, DVD-ROM, DVD±R, DVD±RW, Blu-ray (registered trademark) Disc, etc.), a semiconductor memory, or a similar recording medium. The recording medium on which the program is recorded may use any storage format, provided that it is readable by the computer system. The program may also be installed in the computer system in advance, or a program distributed via a network may be installed in the computer system as appropriate.

The program executed on the computer system has a module configuration that includes the units described above (the dividing unit 11, the demodulation units 12, the detection units 13, the determination unit 14, the selection unit 15, and the output unit 16), which are the functional components of the voice processing device 1 of the present embodiment. When the processor reads out and executes this program as appropriate, the units described above are generated on a main memory such as the RAM 102.

The units of the voice processing device 1 of the present embodiment described above (the dividing unit 11, the demodulation units 12, the detection units 13, the determination unit 14, the selection unit 15, and the output unit 16) need not be realized by a program (software) alone; some or all of them can also be realized by dedicated hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array).

The voice processing device 1 of the present embodiment may also be configured as a network system in which a plurality of computers are communicably connected, with the units described above distributed across those computers. For example, the voice processing device 1 of the present embodiment may be formed by communicably connecting one computer that has the function of the dividing unit 11, a plurality of computers each of which has the functions of one corresponding demodulation unit 12 and one corresponding detection unit 13 among the plurality of demodulation units 12 and detection units 13, and one computer that has the functions of the determination unit 14, the selection unit 15, and the output unit 16.

Although an embodiment of the present invention has been described above, the embodiment described here is presented as an example and is not intended to limit the scope of the invention. The novel embodiment described here can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. The embodiment described here and its modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

Description of Symbols
1 Voice processing device
11 Dividing unit
12 (12_1, 12_2, ..., 12_n) Demodulation unit
13 (13_1, 13_2, ..., 13_n) Detection unit
14 Determination unit
15 Selection unit
16 Output unit
U11, U12 Utterances (present in the radio signal)
F1, F2, F3 Bandpass filters
U21, U22, U23 Utterances (detected from the demodulated signals)

Claims

A voice processing device comprising:
a dividing unit that divides a received radio signal into a plurality of subband signals using a filter bank composed of a plurality of bandpass filters having different passbands;
a demodulation unit that individually demodulates the plurality of subband signals and generates a plurality of demodulated signals respectively corresponding to the plurality of bandpass filters;
a detection unit that detects an utterance from each of the plurality of demodulated signals based on a reliability score representing the likelihood of an utterance;
a determination unit that, where an utterance detected from a demodulated signal of interest among the plurality of demodulated signals is a first utterance and an utterance detected from another demodulated signal, corresponding to another bandpass filter adjacent in the frequency direction to the bandpass filter corresponding to the demodulated signal of interest, is a second utterance, determines, when one or more second utterances exist that overlap the first utterance in at least part of their time, whether the first utterance and the second utterance are the same utterance; and
a selection unit that selects one of the first utterance and the second utterance when it is determined that the first utterance and the second utterance are the same utterance.

The voice processing device according to claim 1, wherein the determination unit determines whether the first utterance and the second utterance are the same utterance based on the degree to which the times of the first utterance and the second utterance overlap.

The voice processing device according to claim 1, wherein the determination unit determines whether the first utterance and the second utterance are the same utterance based on a result of evaluating the similarity between a feature amount extracted from the first utterance and a feature amount extracted from the second utterance.

The voice processing device according to any one of claims 1 to 3, wherein, when it is determined that the first utterance and the second utterance are the same utterance, the selection unit selects, of the first utterance and the second utterance, the utterance with the higher reliability score.

The voice processing device according to any one of claims 1 to 4, wherein the selection unit selects the first utterance when no second utterance exists that overlaps the first utterance in at least part of its time, or when one or more such second utterances exist but it is determined that the first utterance and the second utterance are not the same utterance.

The voice processing device according to any one of claims 1 to 5, wherein the selection unit integrates into the selected utterance the portion of the unselected utterance whose time does not overlap the selected utterance.

The voice processing device according to any one of claims 1 to 6, further comprising an output unit that outputs information on the selected utterance.

The voice processing device according to claim 7, wherein the output unit further outputs information on the unselected utterance together with the information on the selected utterance.

The voice processing device according to claim 1, wherein, where an utterance detected from a demodulated signal of interest among the plurality of demodulated signals is a first utterance and an utterance detected from another demodulated signal, corresponding to another bandpass filter adjacent in the frequency direction to the bandpass filter corresponding to the demodulated signal of interest, is a second utterance, the determination unit determines that the first utterance and the second utterance are the same utterance when one or more second utterances exist that overlap the first utterance in at least part of their time.

A voice processing method executed by a voice processing device, the method comprising:
dividing a received radio signal into a plurality of subband signals using a filter bank composed of a plurality of bandpass filters having different passbands;
individually demodulating the plurality of subband signals to generate a plurality of demodulated signals respectively corresponding to the plurality of bandpass filters;
detecting an utterance from each of the plurality of demodulated signals based on a reliability score representing the likelihood of an utterance;
where an utterance detected from a demodulated signal of interest among the plurality of demodulated signals is a first utterance and an utterance detected from another demodulated signal, corresponding to another bandpass filter adjacent in the frequency direction to the bandpass filter corresponding to the demodulated signal of interest, is a second utterance, determining, when one or more second utterances exist that overlap the first utterance in at least part of their time, whether the first utterance and the second utterance are the same utterance; and
selecting one of the first utterance and the second utterance when it is determined that the first utterance and the second utterance are the same utterance.

A program for causing a computer to realize:
a function of dividing a received radio signal into a plurality of subband signals using a filter bank composed of a plurality of bandpass filters having different passbands;
a function of individually demodulating the plurality of subband signals to generate a plurality of demodulated signals respectively corresponding to the plurality of bandpass filters;
a function of detecting an utterance from each of the plurality of demodulated signals based on a reliability score representing the likelihood of an utterance;
a function of, where an utterance detected from a demodulated signal of interest among the plurality of demodulated signals is a first utterance and an utterance detected from another demodulated signal, corresponding to another bandpass filter adjacent in the frequency direction to the bandpass filter corresponding to the demodulated signal of interest, is a second utterance, determining, when one or more second utterances exist that overlap the first utterance in at least part of their time, whether the first utterance and the second utterance are the same utterance; and
a function of selecting one of the first utterance and the second utterance when it is determined that the first utterance and the second utterance are the same utterance.