JP6587088B2

JP6587088B2 - Audio transmission system and audio transmission method

Info

Publication number: JP6587088B2
Application number: JP2014222994A
Authority: JP
Inventors: 加藤　修
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2014-10-31
Filing date: 2014-10-31
Publication date: 2019-10-09
Anticipated expiration: 2034-10-31
Also published as: JP2016092529A

Description

The present invention relates to an audio transmission system and an audio transmission method that apply predetermined signal processing to sound collected in a sound space in which a plurality of sounds are mixed, and transmit the result.

Sound source separation techniques have been studied in which sound is collected with a plurality of microphones in a space where a plurality of sounds are mixed (hereinafter, "sound space"), and a sound source (audio signal) in a specific direction is extracted from the audio signal of the collected sound (see, for example, Patent Document 1).

The sound source separation device of Patent Document 1 executes two processes in parallel: independent component analysis, which accounts for sound diffraction and separates sounds without using information on the microphone arrangement, and a filtering process, which is vulnerable to sound diffraction but can extract sound from a designated target direction. Using the power spectrum of the filtered sound as a reference signal, the device regards a sound with a high temporal correlation to the reference signal as a sound with high noise suppression performance, and selects that sound as the sound source separation signal. As a result, the device can separate sounds in a short time even when sound diffracts on the housing, as with a mobile phone.

Patent Document 1: JP 2011-199474 A

In Patent Document 1, sound from an arbitrary direction can be extracted from sound waveform information (audio signals) collected by a microphone array having a plurality of microphones, for a housing whose product size is defined in advance (for example, a mobile phone).

However, the configuration of Patent Document 1 presupposes a number and arrangement of microphones chosen for an assumed product size. In an actual sound space, where many speakers and environmental sounds including noise are mixed, it is therefore difficult to extract a specific sound source from the sound collected by the plurality of microphones.

To solve the above problem, an object of the present invention is to provide an audio transmission system and an audio transmission method that extract a specific sound source from collected sound in an actual sound space in which many speakers and environmental sounds including noise are mixed, and transmit the extracted high-quality sound to the user's own terminal or to the call partner of that terminal.

The present invention is an audio transmission system including a first communication terminal that accepts a predetermined input operation, the system comprising: one or more sound collection units that collect sound in a sound space containing a plurality of sounds; a storage unit that stores audio data of the sound collected by the one or more sound collection units; a separation unit that separates the audio data stored in the storage unit into the same number of types of audio data as the one or more sound collection units, thereby generating individual separated audio data; and a first transmission unit that transmits, to the first communication terminal as switched audio data, whichever of the individual separated audio data generated by the separation unit corresponds to the predetermined input operation on the first communication terminal, wherein the first communication terminal receives the switched audio data and outputs it as sound.

The present invention is also an audio transmission method in an audio transmission system including a first communication terminal that accepts a predetermined input operation, the method comprising: collecting sound with one or more sound collection devices arranged in a sound space containing a plurality of sounds; storing audio data of the sound collected by the one or more sound collection devices; separating the audio data into the same number of types of audio data as the one or more sound collection devices, thereby generating individual separated audio data; transmitting, to the first communication terminal as switched audio data, whichever of the individual separated audio data corresponds to the predetermined input operation on the first communication terminal; and receiving the switched audio data and outputting it as sound.

According to the present invention, a specific sound source can be extracted from collected sound in an actual sound space in which many speakers and environmental sounds including noise are mixed, so the extracted high-quality sound can be transmitted to the user's own terminal or to the call partner of that terminal.

FIG. 1 is a block diagram showing an example of the system configuration of the audio transmission system of the present embodiment.
FIG. 2 is a block diagram showing an example of the internal configuration of the communication terminal of the present embodiment.
FIG. 3 is an explanatory diagram schematically showing a first use case example of the audio transmission system of the present embodiment.
FIG. 4 is an explanatory diagram schematically showing a second use case example of the audio transmission system of the present embodiment.
FIG. 5 is an explanatory diagram schematically showing a third use case example of the audio transmission system of the present embodiment.
FIG. 6 is a flowchart explaining an example of the time-series operation outline of the audio transmission system in the third use case example shown in FIG. 5.
FIG. 7 is an explanatory diagram schematically showing a fourth use case example of the audio transmission system of the present embodiment.
FIG. 8 is a flowchart explaining the operation procedure of a first operation example of the audio transmission system of the present embodiment.
FIG. 9 is a flowchart explaining the operation procedure of a second operation example of the audio transmission system of the present embodiment.
FIG. 10 is a flowchart explaining the operation procedure continuing from FIG. 9.

An embodiment that specifically discloses the audio transmission system and audio transmission method according to the present invention (hereinafter, "the present embodiment") is described in detail below with reference to the drawings.

In the first operation example of the present embodiment, the audio transmission system collects sound with one or more microphones arranged in a sound space AR that contains a mixture of a plurality of sounds (for example, the voices of several people), and stores the audio data of the collected sound at a location other than the sound space AR (for example, a server device on a cloud network). In response to a predetermined input operation by the user of a communication terminal, the system separates the stored audio data into the same number of types of audio data as the one or more microphones (hereinafter, "separated audio data"); this step is the sound source separation. The system then transmits one of the individual separated audio data to the communication terminal.

In the second operation example of the present embodiment, the audio transmission system likewise collects sound with one or more microphones arranged in the sound space AR that contains a mixture of a plurality of sounds (for example, the voices of several people), and stores the audio data of the collected sound at a location other than the sound space AR (for example, a server device on a cloud network). In response to a predetermined input operation by the user of a first communication terminal, the system separates the stored audio data into the same number of types of separated audio data as the one or more microphones (sound source separation). In response to a further predetermined input operation by that user, the system transmits one of the separated audio data to a second communication terminal, which is the call partner of the first communication terminal.

The audio transmission system of the present embodiment is described in detail below. The user of the communication terminal in the first operation example, who is also the user of the first communication terminal in the second operation example (see communication terminal T1 in FIG. 1), is referred to as "user A". The user of the second communication terminal in the second operation example (see communication terminal T2 in FIG. 1) is referred to as "user B".

FIG. 1 is a block diagram showing an example of the system configuration of the audio transmission system 100 of the present embodiment. In the first operation example, the audio transmission system 100 shown in FIG. 1 includes the communication terminal T1, one or more microphones M1, M2, M3, and M4 arranged in the sound space AR, a microphone signal multiplex transmission device MX, a server device CDS provided on a cloud network NW1, and a first selection unit SL1 provided on an LTE network NW2. In the second operation example, the audio transmission system 100 further includes, in addition to the configuration for the first operation example, the communication terminal T2 and a second selection unit SL2 provided on the LTE network NW2.

The microphones M1 to M4, which are examples of the sound collection unit or sound collection device, are each arranged at a predetermined position in the sound space AR. They collect the sound in the sound space AR, in which a plurality of sounds are mixed, and output it to the microphone signal multiplex transmission device MX.

The microphone signal multiplex transmission device MX aggregates and multiplexes the audio data of the sound collected by each of the microphones M1 to M4, and transmits the multiplexed audio data (audio signal) to the server device CDS provided on the cloud network NW1. As long as it is electrically connected to the microphones M1 to M4, the microphone signal multiplex transmission device MX may be placed inside the sound space AR or anywhere outside it.

The server device CDS is provided on the cloud network NW1, which is an example of the Internet or an intranet, and includes at least a sound signal processing transmission unit PR. FIG. 1 shows only the sound signal processing transmission unit PR, as the minimum configuration needed to explain the operation of the server device CDS of the present embodiment; the receiving unit that receives the audio signal transmitted from the microphone signal multiplex transmission device MX is not shown.

The sound signal processing transmission unit PR includes at least a control unit PR1, a sound source separation unit PR2, and a memory (not shown) as an example of the storage unit. The control unit PR1 and the sound source separation unit PR2 are configured using, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or a DSP (Digital Signal Processor). The sound signal processing transmission unit PR demultiplexes the audio signal that was multiplexed by the microphone signal multiplex transmission device MX.
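
The multiplexing in the device MX and the demultiplexing in the unit PR can be pictured with a small sketch. The patent does not state which multiplexing scheme is used, so the sample-by-sample interleaving below is purely an assumption for illustration, and the function names `multiplex` and `demultiplex` are hypothetical.

```python
from typing import List, Sequence


def multiplex(channels: Sequence[Sequence[int]]) -> List[int]:
    """Interleave equal-length per-microphone sample lists into one stream.

    Assumed scheme (not from the patent): frame i of the stream holds
    sample i of every microphone, in microphone order.
    """
    if not channels:
        return []
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    return [ch[i] for i in range(length) for ch in channels]


def demultiplex(stream: Sequence[int], num_channels: int) -> List[List[int]]:
    """Recover the per-microphone sample lists from an interleaved stream."""
    if num_channels <= 0 or len(stream) % num_channels != 0:
        raise ValueError("stream length must be a multiple of num_channels")
    # Every num_channels-th sample, starting at offset c, belongs to channel c.
    return [list(stream[c::num_channels]) for c in range(num_channels)]


# Four microphones (M1 to M4), three samples each; values are made up.
mics = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
stream = multiplex(mics)              # what MX would send to the server
restored = demultiplex(stream, 4)     # what PR would store in memory
```

Here `restored` equals `mics`, which is the property the server side relies on: the demultiplexed data is one channel per microphone, ready for sound source separation.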

The control unit PR1 receives a control signal (described later) for sequentially switching the separated audio data generated by the sound source separation unit PR2; one such signal arrives for each predetermined input operation by users A and B on the communication terminals T1 and T2. In response to this control signal, the control unit PR1 outputs to the sound source separation unit PR2 an instruction to sequentially switch the separated audio data transmitted from the server device CDS. Following the switching instruction from the control unit PR1, the sound signal processing transmission unit PR transmits the separated audio data output from the sound source separation unit PR2 to the first selection unit SL1 or the second selection unit SL2.
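
The sequential switching handled by the control unit PR1 can be sketched as a small state holder that advances to the next separated stream each time a control signal arrives. The class name `SeparatedStreamSwitcher`, the stream labels, and the wrap-around to the first stream after the last one are assumptions for illustration; the patent only says the streams are switched sequentially.

```python
from typing import List


class SeparatedStreamSwitcher:
    """Tracks which separated stream is currently selected for a terminal."""

    def __init__(self, stream_labels: List[str]) -> None:
        if not stream_labels:
            raise ValueError("at least one separated stream is required")
        self._labels = list(stream_labels)
        self._index = 0  # start with the first separated stream

    @property
    def current(self) -> str:
        return self._labels[self._index]

    def on_control_signal(self) -> str:
        """Advance to the next stream; wrap around after the last (assumed)."""
        self._index = (self._index + 1) % len(self._labels)
        return self.current


# Four microphones give four separated streams; labels are hypothetical.
switcher = SeparatedStreamSwitcher(["source-1", "source-2", "source-3", "source-4"])
selected = [switcher.current] + [switcher.on_control_signal() for _ in range(4)]
```

After four control signals the selection has visited every stream once and returned to `source-1`, so a user can step through all separated sounds with repeated input operations.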

The sound source separation unit PR2 performs sound source separation on the demultiplexed audio data stored in the memory, and generates the same number of types of separated audio data as the one or more microphones (for example, the microphones M1 to M4). In other words, when the sound space AR contains as many sound sources as there are microphones, the separated audio data are the audio data of the sound output from each source. Sound source separation in the sound source separation unit PR2 is a known technique; the method disclosed in Patent Document 1 or a method disclosed in other known art may be used, so its details are omitted here.
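
Because the patent leaves the separation algorithm to known art, the following is only a toy illustration of why the number of separated outputs matches the number of microphones. It assumes two sources, two microphones, and an instantaneous mixing matrix that is already known; real separation methods (for example, independent component analysis) must estimate the unmixing without that knowledge. All names and values are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix2 = Tuple[Tuple[float, float], Tuple[float, float]]


def invert_2x2(m: Matrix2) -> Matrix2:
    """Return the inverse of a 2x2 matrix, or raise if it is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is singular; sources cannot be separated")
    return ((d / det, -b / det), (-c / det, a / det))


def apply_2x2(
    m: Matrix2, x1: Sequence[float], x2: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Apply a 2x2 matrix to two sample sequences, sample by sample."""
    out1 = [m[0][0] * p + m[0][1] * q for p, q in zip(x1, x2)]
    out2 = [m[1][0] * p + m[1][1] * q for p, q in zip(x1, x2)]
    return out1, out2


# Two made-up sources and an assumed mixing matrix (how strongly each
# source reaches each microphone).
source_a = [1.0, 0.0, -1.0, 0.0]
source_b = [0.5, 0.5, -0.5, -0.5]
mixing: Matrix2 = ((1.0, 0.6), (0.4, 1.0))

mic1, mic2 = apply_2x2(mixing, source_a, source_b)            # what the mics record
est_a, est_b = apply_2x2(invert_2x2(mixing), mic1, mic2)      # separated outputs
```

With two microphones the mixing matrix is square, so it can be inverted to yield exactly two outputs; this is the determined case that the "same number as the microphones" wording corresponds to.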

The memory functions as a work memory during operation of the control unit PR1 and the sound source separation unit PR2, and also stores the audio data of the audio signal after demultiplexing by the sound signal processing transmission unit PR.

The first selection unit SL1 and the second selection unit SL2 are each configured by, for example, a radio base station provided on the LTE network NW2, which is an example of a public telephone network. The network on which the first selection unit SL1 and the second selection unit SL2 are placed is not limited to LTE (Long Term Evolution); it may be a 3G (3rd Generation) network or an HSPA (High Speed Packet Access) network.

The first selection unit SL1 receives, from among the separated audio data corresponding to the one or more sound sources generated by the sound source separation unit PR2, the separated audio data c1 designated by the control unit PR1. The first selection unit SL1 transmits this separated audio data c1 to the communication terminal T1 via the LTE network NW2.

When users A and B are talking to each other using the communication terminals T1 and T2, the first selection unit SL1 also receives the audio data b of user B's call voice transmitted from the communication terminal T2 (in other words, the raw call voice that has not undergone sound source separation by the server device CDS), and transmits either the separated audio data c1 or the audio data b of user B's call voice to the communication terminal T1.

The second selection unit SL2 receives, from among the separated audio data corresponding to the one or more sound sources generated by the sound source separation unit PR2, the separated audio data c2 designated by the control unit PR1. The second selection unit SL2 transmits this separated audio data c2 to the communication terminal T2 via the LTE network NW2.

When users A and B are talking to each other using the communication terminals T1 and T2, the second selection unit SL2 also receives the audio data a of user A's call voice transmitted from the communication terminal T1 (in other words, the raw call voice that has not undergone sound source separation by the server device CDS), and transmits either the separated audio data c2 or the audio data a of user A's call voice to the communication terminal T2.
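
The behavior of the two selection units during a call can be sketched as a function that forwards one of two inputs to the terminal. The `Mode` enum, the function name `select_output`, and the idea that a single mode value decides the output are assumptions for illustration; the patent states only that either the separated audio data or the partner's call voice is transmitted.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    SEPARATED = "separated"  # forward separated audio data (c1 or c2)
    CALL = "call"            # forward the partner's raw call voice (b or a)


def select_output(
    mode: Mode,
    separated_frame: Optional[bytes],
    call_frame: Optional[bytes],
) -> Optional[bytes]:
    """Return the frame a selection unit would forward to its terminal."""
    if mode is Mode.SEPARATED:
        return separated_frame
    return call_frame


# SL1 serves terminal T1: inputs are separated data c1 and user B's voice b.
sl1_out = select_output(Mode.SEPARATED, b"c1-frame", b"b-frame")
# SL2 serves terminal T2: inputs are separated data c2 and user A's voice a.
sl2_out = select_output(Mode.CALL, b"c2-frame", b"a-frame")
```

In this sketch T1 hears the separated sound while T2 keeps hearing user A's call voice, which shows that the two selection units choose independently of each other.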

The communication terminals T1 and T2 are wireless communication devices that have at least a wireless communication function and a call function, and are operated by users A and B, respectively, who use the sound source separation service of the audio transmission system 100 (that is, the service that separates the audio data of the sound collected in the sound space AR into the same number of types of audio data as the one or more microphones). The communication terminals T1 and T2 are, for example, mobile phones, smartphones, or cordless telephone handsets.

Applications AP1 and AP2 for using the sound source separation service (abbreviated simply as "app" in FIG. 1) are installed in advance on the communication terminals T1 and T2. In other words, as described later with reference to FIG. 2, the programs and data of the applications AP1 and AP2 are stored in advance on the communication terminals T1 and T2, and when users A and B use the sound source separation service, the applications AP1 and AP2 start in response to the users' input operations.

FIG. 2 is a block diagram showing an example of the internal configuration of the communication terminals T1 and T2 of the present embodiment. The communication terminal T1 shown in FIG. 2 includes a sound collection unit 11, a CPU 12, a memory 13, an input unit 14, a display unit 15, an audio output unit 16, and a wireless communication unit 17 to which an antenna 18 is connected. Similarly, the communication terminal T2 shown in FIG. 2 includes a sound collection unit 21, a CPU 22, a memory 23, an input unit 24, a display unit 25, an audio output unit 26, and a wireless communication unit 27 to which an antenna 28 is connected. Because the communication terminals T1 and T2 have the same configuration, the description of each unit in FIG. 2 uses the communication terminal T1 as the example, and the operation of the communication terminal T2 is described only where it differs from that of the communication terminal T1.

The sound collection units 11 and 21 collect the call voice when users A and B talk while holding the communication terminals T1 and T2 (a handset call), or when they talk without holding the terminals (for example, a hands-free call), and output it to the CPUs 12 and 22. The call voice collected by the sound collection units 11 and 21 is an analog signal; it is converted into a digital signal by an ADC (analog-to-digital converter, not shown) included in the sound collection units 11 and 21 before being input to the CPUs 12 and 22.

The CPUs 12 and 22 perform control processing that supervises the overall operation of each unit of the communication terminals T1 and T2, data input/output processing with the other units, data computation, and data storage processing. For example, when the wireless communication units 17 and 27 receive, via the antennas 18 and 28, a signal announcing the sound source separation service (a notification signal), the CPUs 12 and 22 start the applications AP1 and AP2 for using the service. After starting the applications AP1 and AP2, the CPUs 12 and 22 display map information of the sound space AR on the display units 15 and 25 if, for example, that map information is stored in advance in the memories 13 and 23.

The memories 13 and 23 are configured using, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), or a nonvolatile or volatile semiconductor memory. They function as work memory during operation of the CPUs 12 and 22, and also store the predetermined programs (for example, the applications AP1 and AP2) and data used to operate the CPUs 12 and 22.

The input units 14 and 24 are user interfaces (UI) that accept input operations from users A and B and notify the CPUs 12 and 22 of them. They are configured using, for example, a touch panel or touch pad that is arranged to correspond to the screens of the display units 15 and 25 and can be operated with the user's finger or a stylus pen.

The display units 15 and 25 are configured using, for example, an LCD (Liquid Crystal Display) or an organic EL (Electroluminescence) display. In response to a predetermined input operation by users A and B on the input units 14 and 24, they display, for example, the screens of the applications AP1 and AP2 for using the sound source separation service.

The audio output units 16 and 26 (for example, speakers) output sound in response to, for example, the input operations of users A and B. Taking the communication terminal T1 as the example, the audio output unit 16 outputs the sound of either the separated audio data c1 or the audio data b of user B's call voice transmitted from the first selection unit SL1.

The wireless communication unit 17 is connected to the antenna 18 for wireless communication (for example, LTE, 3G, HSPA, Wi-Fi (registered trademark), or Bluetooth (registered trademark)), and receives either the separated audio data c1 or the audio data b of user B's call voice transmitted from the first selection unit SL1. The wireless communication unit 27 is connected to the antenna 28 for wireless communication (for example, LTE, 3G, HSPA, Wi-Fi (registered trademark), or Bluetooth (registered trademark)), and receives either the separated audio data c2 or the audio data a of user A's call voice transmitted from the second selection unit SL2.

In response to an input operation by user A on the input unit 14, the wireless communication unit 17 also transmits to the server device CDS, via the antenna 18, the control signal that the CPU 12 generates to sequentially switch the separated audio data generated by the sound source separation unit PR2. In response to an input operation by user B on the input unit 24, the wireless communication unit 27 transmits to the server device CDS, via the antenna 28, the control signal that the CPU 22 generates to switch between the separated audio data c2 and the audio data a of user A's call voice transmitted from the second selection unit SL2.

Next, various use case examples of the sound source separation service in the audio transmission system of the present embodiment are described with reference to FIGS. 3 to 7. To keep the explanation of the use case examples simple, FIGS. 3, 4, 5, and 7 omit the microphone signal multiplex transmission device MX shown in FIG. 1 and the LTE network NW2 that includes the first selection unit SL1 and the second selection unit SL2.

(First use case example)
FIG. 3 is an explanatory diagram schematically showing a first use case example of the audio transmission system 100 of the present embodiment. In the first use case example, a plurality of microphones M1a, M2a, M3a, M4a, and M5a are arranged in the sound space AR (for example, an interview room). Around a speaker Ps1a who speaks Japanese, an English interpreter Tr1, a French interpreter Tr2, and a Chinese interpreter Tr3 simultaneously interpret what the speaker Ps1a says into multiple languages. In this case, the server device CDS stores audio data in which at least the Japanese voice of the speaker Ps1a, the English voice of the interpreter Tr1, the French voice of the interpreter Tr2, and the Chinese voice of the interpreter Tr3 have been collected.

In the first use case example shown in FIG. 3, the user Ps2a of the communication terminal T1, on which the application AP1 for using the sound source separation service is installed, may or may not be inside the sound space AR.

サーバ装置ＣＤＳは、ユーザＰｓ２ａの通信端末Ｔ１に対する所定の入力操作により、音空間ＡＲ内の各マイクロホンＭ１ａ〜Ｍ５ａにより収音された音声の音声データを、マイクロホンの個数と同数の種類の音源（例えば、日本語音声、英語音声、仏語音声、中国語音声、その他音声）に音源分離し、いずれかの分離音声データ（例えば英語音声）を通信端末Ｔ１に送信する。通信端末Ｔ１は、ユーザＰｓ２ａの所定の入力操作毎に、サーバ装置ＣＤＳから送信された分離音声データを順次切り替えて出力する。これにより、ユーザＰｓ２ａは、音空間ＡＲであるインタビュールームにおいて日本語で会話している話者Ｐｓ１ａの内容を、ユーザＰｓ２ａ自身が理解可能な言語（例えば英語）で高品質な音声を聴取することができる。 In response to a predetermined input operation by the user Ps2a on the communication terminal T1, the server device CDS separates the audio data of the sound picked up by the microphones M1a to M5a in the sound space AR into as many types of sound sources as there are microphones (for example, Japanese speech, English speech, French speech, Chinese speech, and other sound), and transmits one of the separated audio data (for example, the English speech) to the communication terminal T1. Each time the user Ps2a performs the predetermined input operation, the communication terminal T1 switches to and outputs the next separated audio data transmitted from the server device CDS. As a result, the user Ps2a can listen, in high quality and in a language that the user Ps2a understands (for example, English), to what the speaker Ps1a is saying in Japanese in the interview room that is the sound space AR.

（第２のユースケース例） (Second use case example)
図４は、本実施形態の音声伝達システム１００の第２のユースケース例を模式的に示す説明図である。第２のユースケース例では、音空間ＡＲ（例えば東京のコンサートホール）に複数のマイクロホンＭ１ｂ，Ｍ２ｂ，Ｍ３ｂ，Ｍ４ｂ，Ｍ５ｂ，Ｍ６ｂ，Ｍ７ｂ，Ｍ８ｂが配置され、歌手、ピアノの演奏者、バイオリンの演奏者、クラリネットの演奏者、フルートの演奏者によりオーケストラが演奏されている状況を想定する。この場合、サーバ装置ＣＤＳには、歌手の歌の音声と、ピアノの演奏者によるピアノの音と、バイオリンの演奏者によるバイオリンの音と、クラリネットの演奏者によるクラリネットの音と、フルートの演奏者によるフルートの音と、ノイズ源によるノイズ音とが少なくとも収音された音声データが記憶される。 FIG. 4 is an explanatory diagram schematically showing a second use case example of the audio transmission system 100 of the present embodiment. In the second use case example, a plurality of microphones M1b, M2b, M3b, M4b, M5b, M6b, M7b, and M8b are arranged in a sound space AR (for example, a concert hall in Tokyo), and it is assumed that a singer, a pianist, a violinist, a clarinetist, and a flutist are performing as an orchestra. In this case, the server device CDS stores audio data in which at least the singer's singing, the piano played by the pianist, the violin played by the violinist, the clarinet played by the clarinetist, the flute played by the flutist, and the noise from a noise source have been picked up.

なお、図４に示す第２のユースケース例では、音源分離サービスを使用するためのアプリケーションＡＰ１がインストールされた通信端末Ｔ１のユーザＰｓ１ｂ，Ｐｓ２ｂは、例えばコンサートホールから遠く離れた大阪にいてもよいし、コンサートホール付近にいてもよい。また、ユーザＰｓ１ｂ，Ｐｓ２ｂの通信端末Ｔ１の画面には、音源分離サービスを使用するためのアプリケーションＡＰ１の画面例として、音空間ＡＲ（例えばコンサートホールの見取り図）が表示部１５上に表示されている。 In the second use case example shown in FIG. 4, the users Ps1b and Ps2b of communication terminals T1 on which the application AP1 for using the sound source separation service is installed may be, for example, in Osaka, far from the concert hall, or near the concert hall. In addition, on the screen of the communication terminal T1 of each of the users Ps1b and Ps2b, the sound space AR (for example, a floor plan of the concert hall) is displayed on the display unit 15 as an example screen of the application AP1.

サーバ装置ＣＤＳは、例えば大阪にいるユーザＰｓ１ｂの通信端末Ｔ１に対する所定の入力操作により、音空間ＡＲ内の各マイクロホンＭ１ｂ〜Ｍ８ｂにより収音された音声の音声データを、マイクロホンの個数と同数の種類の音源（例えば、歌手の歌の音声と、ピアノの演奏者によるピアノの音と、バイオリンの演奏者によるバイオリンの音と、クラリネットの演奏者によるクラリネットの音と、フルートの演奏者によるフルートの音と、ノイズ源によるノイズ音、その他音声）に音源分離し、いずれかの分離音声データ（例えばピアノの音）を通信端末Ｔ１に送信する。通信端末Ｔ１は、ユーザＰｓ１ｂの所定の入力操作毎に、サーバ装置ＣＤＳから送信された分離音声データを順次切り替えて出力する。これにより、ユーザＰｓ１ｂは、音空間ＡＲである東京のコンサートホールにおいて演奏されているオーケストラの中で、自己が指定した楽器（例えばピアノ）の音源分離後の高品質な音声だけを聴取することができる。 In response to a predetermined input operation on the communication terminal T1 by the user Ps1b, who is in Osaka for example, the server device CDS separates the audio data of the sound picked up by the microphones M1b to M8b in the sound space AR into as many types of sound sources as there are microphones (for example, the singer's singing, the piano, the violin, the clarinet, the flute, the noise from the noise source, and other sound), and transmits one of the separated audio data (for example, the piano) to the communication terminal T1. Each time the user Ps1b performs the predetermined input operation, the communication terminal T1 switches to and outputs the next separated audio data transmitted from the server device CDS. As a result, out of the orchestra performing in the Tokyo concert hall that is the sound space AR, the user Ps1b can listen to only the high-quality separated sound of the instrument that the user has designated (for example, the piano).

同様に、サーバ装置ＣＤＳは、例えばコンサートホール付近にいるユーザＰｓ２ｂの通信端末Ｔ１に対する所定の入力操作により、音空間ＡＲ内の各マイクロホンＭ１ｂ〜Ｍ８ｂにより収音された音声の音声データを、マイクロホンの個数と同数の種類の音源（例えば、歌手の歌の音声と、ピアノの演奏者によるピアノの音と、バイオリンの演奏者によるバイオリンの音と、クラリネットの演奏者によるクラリネットの音と、フルートの演奏者によるフルートの音と、ノイズ源によるノイズ音、その他音声）に音源分離し、いずれかの分離音声データ（例えばフルートの音）を通信端末Ｔ１に送信する。通信端末Ｔ１は、ユーザＰｓ２ｂの所定の入力操作毎に、サーバ装置ＣＤＳから送信された分離音声データを順次切り替えて出力する。これにより、ユーザＰｓ２ｂは、音空間ＡＲである東京のコンサートホールにおいて演奏されているオーケストラの中で、自己が指定した楽器（例えばフルート）の音源分離後の高品質な音声だけを聴取することができる。 Similarly, in response to a predetermined input operation on the communication terminal T1 by the user Ps2b, who is near the concert hall for example, the server device CDS separates the audio data of the sound picked up by the microphones M1b to M8b in the sound space AR into as many types of sound sources as there are microphones (for example, the singer's singing, the piano, the violin, the clarinet, the flute, the noise from the noise source, and other sound), and transmits one of the separated audio data (for example, the flute) to the communication terminal T1. Each time the user Ps2b performs the predetermined input operation, the communication terminal T1 switches to and outputs the next separated audio data transmitted from the server device CDS. As a result, out of the orchestra performing in the Tokyo concert hall that is the sound space AR, the user Ps2b can listen to only the high-quality separated sound of the instrument that the user has designated (for example, the flute).

（第３のユースケース例） (Third use case example)
図５は、本実施形態の音声伝達システムの第３のユースケース例を模式的に示す説明図である。図６は、図５に示す第３のユースケース例における音声伝達システムの時系列に沿った動作概要の一例を説明するフローチャートである。第３のユースケース例では、音空間ＡＲ（例えば複数の店舗が並べて配置されたショッピングモールＳＭＬ）に複数のマイクロホンＭ１ｃ，Ｍ２ｃ，Ｍ３ｃ，Ｍ４ｃ，Ｍ５ｃ，Ｍ６ｃ，Ｍ７ｃ，Ｍ８ｃ，Ｍ９ｃが配置され、ユーザＰｓ１ｃがショッピングモールにショッピングに来た状況を想定する。この場合、サーバ装置ＣＤＳには、各店舗（具体的には、魚屋、雑貨屋、本屋、靴屋、すし屋、ラーメン屋、インフォメーション、服屋）の商品又はサービスのＰＲのスピーチが少なくとも収音された音声データが記憶される。 FIG. 5 is an explanatory diagram schematically showing a third use case example of the audio transmission system of the present embodiment. FIG. 6 is a flowchart illustrating an example of an outline of the chronological operation of the audio transmission system in the third use case example shown in FIG. 5. In the third use case example, a plurality of microphones M1c, M2c, M3c, M4c, M5c, M6c, M7c, M8c, and M9c are arranged in a sound space AR (for example, a shopping mall SML in which a plurality of stores are lined up), and it is assumed that the user Ps1c has come to the shopping mall to shop. In this case, the server device CDS stores audio data in which at least the promotional speech for the products or services of each store (specifically, a fish shop, a general store, a bookstore, a shoe store, a sushi restaurant, a ramen restaurant, an information desk, and a clothing store) has been picked up.

なお、図５に示す第３のユースケース例では、ユーザＰｓ１ｃの通信端末Ｔ１の画面には、音源分離サービスを使用するためのアプリケーションＡＰ１の画面例として、音空間ＡＲ（例えばショッピングモールＳＭＬ）の各店舗の地図情報が表示部１５上に表示されている。 In the third use case example shown in FIG. 5, the screen of the communication terminal T1 of the user Ps1c has a sound space AR (for example, a shopping mall SML) as an example of a screen of the application AP1 for using the sound source separation service. Map information of each store is displayed on the display unit 15.

サーバ装置ＣＤＳは、ユーザＰｓ１ｃの通信端末Ｔ１に対する所定の入力操作により、音空間ＡＲ（例えばショッピングモールＳＭＬ）内の各マイクロホンＭ１ｃ〜Ｍ９ｃにより収音された音声の音声データを、マイクロホンの個数と同数の種類の音源（例えば、魚屋からの音声、雑貨屋からの音声、本屋からの音声、靴屋からの音声、すし屋からの音声、ラーメン屋からの音声、インフォメーションからの音声、服屋からの音声、その他音声）に音源分離し、いずれかの分離音声データ（例えば本屋からの音声）を通信端末Ｔ１に送信する。通信端末Ｔ１は、ユーザＰｓ１ｃの所定の入力操作毎に、サーバ装置ＣＤＳから送信された分離音声データを順次切り替えて出力する。これにより、ユーザＰｓ１ｃは、音空間ＡＲであるショッピングモールＳＭＬにおいて複数の店舗からの音声が混じって聞こえる音声から、ユーザＰｓ１ｃ自身が希望する高品質な音声（例えば本屋からの音声）を聴取することができる。 In response to a predetermined input operation by the user Ps1c on the communication terminal T1, the server device CDS separates the audio data of the sound picked up by the microphones M1c to M9c in the sound space AR (for example, the shopping mall SML) into as many types of sound sources as there are microphones (for example, sound from the fish shop, the general store, the bookstore, the shoe store, the sushi restaurant, the ramen restaurant, the information desk, and the clothing store, and other sound), and transmits one of the separated audio data (for example, the sound from the bookstore) to the communication terminal T1. Each time the user Ps1c performs the predetermined input operation, the communication terminal T1 switches to and outputs the next separated audio data transmitted from the server device CDS. As a result, out of the mixed sound from multiple stores heard in the shopping mall SML that is the sound space AR, the user Ps1c can listen to the high-quality sound that the user wants (for example, the sound from the bookstore).

図６において、買い物客であるユーザＰｓ１ｃがショッピングモールＳＭＬに入ると（ＳＴ１）、ユーザＰｓ１ｃが所持する通信端末Ｔ１は、本実施形態の音声伝達システム１００により実現可能な音源分離サービスの対象エリアに位置することの報告（報知信号）を、例えばＢｌｕｅｔｏｏｔｈ（登録商標）によって受信する（ＳＴ２）。報知信号は、例えば音源分離サービスを提供するショッピングモールの運営業者が設置した所定の音源分離サービス提供装置（不図示）から送信される。 In FIG. 6, when the user Ps1c who is a shopper enters the shopping mall SML (ST1), the communication terminal T1 possessed by the user Ps1c is in the target area of the sound source separation service that can be realized by the voice transmission system 100 of the present embodiment. A position report (notification signal) is received by, for example, Bluetooth (registered trademark) (ST2). The notification signal is transmitted, for example, from a predetermined sound source separation service providing device (not shown) installed by a shopping mall operator who provides the sound source separation service.

通信端末Ｔ１は、ステップＳＴ２の報知信号を受信すると、音源分離サービスを使用するためのアプリケーションＡＰ１を起動し（ＳＴ３）、表示部１５の画面上に、現在ユーザＰｓ１ｃが位置する音空間ＡＲ（例えばショッピングモールＳＭＬ）の見取り図（図５参照）を表示させる（ＳＴ４）。なお、ステップＳＴ４の処理は通信端末Ｔ１において省略されてもよい。 Upon receiving the notification signal of step ST2, the communication terminal T1 starts the application AP1 for using the sound source separation service (ST3) and displays, on the screen of the display unit 15, a floor plan (see FIG. 5) of the sound space AR (for example, the shopping mall SML) in which the user Ps1c is currently located (ST4). Note that the communication terminal T1 may omit the processing of step ST4.

上述したように、サーバ装置ＣＤＳには、各店舗（具体的には、魚屋、雑貨屋、本屋、靴屋、すし屋、ラーメン屋、インフォメーション、服屋）の商品又はサービスのＰＲのスピーチが少なくとも収音された音声データが記憶される。サーバ装置ＣＤＳは、ユーザＰｓ１ｃの通信端末Ｔ１に対する所定の入力操作により、音空間ＡＲ（例えばショッピングモールＳＭＬ）内の各マイクロホンＭ１ｃ〜Ｍ９ｃにより収音された音声の音声データを、マイクロホンの個数と同数の種類の音源に音源分離し、いずれかの分離音声データ（例えば本屋からの音声）を通信端末Ｔ１に送信する。 As described above, the server device CDS stores audio data in which at least the promotional speech for the products or services of each store (specifically, the fish shop, general store, bookstore, shoe store, sushi restaurant, ramen restaurant, information desk, and clothing store) has been picked up. In response to a predetermined input operation by the user Ps1c on the communication terminal T1, the server device CDS separates the audio data of the sound picked up by the microphones M1c to M9c in the sound space AR (for example, the shopping mall SML) into as many types of sound sources as there are microphones, and transmits one of the separated audio data (for example, the sound from the bookstore) to the communication terminal T1.

通信端末Ｔ１は、ユーザＰｓ１ｃの所定の入力操作毎に、サーバ装置ＣＤＳから送信された分離音声データを順次切り替えて出力する（ＳＴ５）。ユーザＰｓ１ｃが希望した分離音声データが通信端末Ｔ１から出力された後、ユーザＰｓ１ｃの入力操作により、音源分離サービスを使用するためのアプリケーションＡＰ１が終了する（ＳＴ６）。 The communication terminal T1 sequentially switches and outputs the separated voice data transmitted from the server device CDS for every predetermined input operation of the user Ps1c (ST5). After the separated voice data desired by the user Ps1c is output from the communication terminal T1, the application AP1 for using the sound source separation service is terminated by the input operation of the user Ps1c (ST6).
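The terminal-side flow of steps ST2 to ST6 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `run_mall_session`, the action strings, and the `"next"`/`"quit"` inputs are hypothetical, and the description does not say what is played first or what happens after the last stream, so playing the first stream on start-up and wrapping around are assumptions.

```python
def run_mall_session(beacon_received, user_inputs, streams, show_map=True):
    """Return the actions the terminal performs for steps ST2 to ST6."""
    actions = []
    if not beacon_received or not streams:
        return actions                    # ST2: no notification signal, nothing starts
    actions.append("start_app")           # ST3: launch the application
    if show_map:
        actions.append("show_map")        # ST4: optional per the description
    index = 0
    actions.append("play:" + streams[index])  # assumption: first stream plays on start
    for user_input in user_inputs:
        if user_input == "next":          # ST5: each input switches the stream
            index = (index + 1) % len(streams)  # assumption: wrap after the last one
            actions.append("play:" + streams[index])
        elif user_input == "quit":        # ST6: user ends the application
            actions.append("end_app")
            break
    return actions
```

For instance, calling it with a received beacon, the inputs `["next", "quit"]`, and two streams plays the first stream, switches to the second, and then ends the application.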

（第４のユースケース例） (Fourth use case example)
図７は、本実施形態の音声伝達システム１００の第４のユースケース例を模式的に示す説明図である。第４のユースケース例では、音空間ＡＲ（例えばオフィスルーム）に複数のマイクロホンＭ１ｄ，Ｍ２ｄ，Ｍ３ｄ，Ｍ４ｄ，Ｍ５ｄが配置され、大声で話している話者Ｐｓ１の周囲に配置された商談スペースＤｋで商談している話者Ｐｓ２，Ｐｓ３が会話している状況を想定する。この場合、サーバ装置ＣＤＳには、話者Ｐｓ１の音声と話者Ｐｓ２の音声と話者Ｐｓ３の１の音声とが少なくとも収音された音声データが記憶される。 FIG. 7 is an explanatory diagram schematically showing a fourth use case example of the audio transmission system 100 of the present embodiment. In the fourth use case example, a plurality of microphones M1d, M2d, M3d, M4d, and M5d are arranged in a sound space AR (for example, an office room), and it is assumed that speakers Ps2 and Ps3 are holding a business meeting in a meeting space Dk located near a speaker Ps1 who is talking loudly. In this case, the server device CDS stores audio data in which at least the voice of the speaker Ps1, the voice of the speaker Ps2, and the voice of the speaker Ps3 have been picked up.

なお、図７に示す第４のユースケース例では、音源分離サービスを使用するためのアプリケーションＡＰ１がインストールされた通信端末Ｔ１のユーザである話者Ｐｓ２は、音空間ＡＲ内にいない通信端末Ｔ２の通話相手Ｐｓ４と通話しており、通信端末Ｔ２にも音源分離サービスを使用するためのアプリケーションＡＰ２がインストールされている。また、話者Ｐｓ２，通話相手Ｐｓ４がともに音源分離サービスを使用するためのアプリケーションＡＰ１，ＡＰ２をそれぞれ通信端末Ｔ１，Ｔ２において起動していない状態で通話する際には、通信端末Ｔ１，Ｔ２は、話者Ｐｓ２，通話相手Ｐｓ４の音声（つまり、通信端末Ｔ１，Ｔ２が収音した生の通話音声）を、例えばＬＴＥネットワーク網ＮＷ２を介して送受信する。 In the fourth use case example shown in FIG. 7, the speaker Ps2, who is the user of the communication terminal T1 on which the application AP1 for using the sound source separation service is installed, is on a call with a call partner Ps4, who uses a communication terminal T2 and is not in the sound space AR. An application AP2 for using the sound source separation service is also installed on the communication terminal T2. When the speaker Ps2 and the call partner Ps4 talk without having started the applications AP1 and AP2 on the communication terminals T1 and T2, respectively, the communication terminals T1 and T2 exchange the voices of the speaker Ps2 and the call partner Ps4 (that is, the raw call voice picked up by the communication terminals T1 and T2) via, for example, the LTE network NW2.

サーバ装置ＣＤＳは、話者Ｐｓ２の通信端末Ｔ１に対する所定の入力操作により、音空間ＡＲ内の各マイクロホンＭ１ｄ〜Ｍ５ｄにより収音された音声の音声データを、マイクロホンの個数と同数の種類の音源（例えば、話者Ｐｓ１の音声、話者Ｐｓ２の音声、話者Ｐｓ３の音声、その他音声）に音源分離し、いずれかの分離音声データ（例えば話者Ｐｓ３の音声）を通信端末Ｔ１に送信する。通信端末Ｔ１は、話者Ｐｓ２の所定の入力操作毎に、サーバ装置ＣＤＳから送信された分離音声データを順次切り替えて出力する。通信端末Ｔ２は、話者Ｐｓ２が通信端末Ｔ１に対していずれかの分離音声データ（例えば話者Ｐｓ３の分離音声データ）を指定操作すると、指定された分離音声データを、サーバ装置ＣＤＳを介して受信し、話者Ｐｓ２の音声（受信音声）を話者Ｐｓ３の分離音声データに切り替えて出力する。これにより、話者Ｐｓ２は、音空間ＡＲであるオフィスルームにおいて話者Ｐｓ２と会話している話者Ｐｓ３の会話内容の高品質な分離音声データを、オフィスルームにいない遠く離れた通話相手Ｐｓ４に聴取させることができる。 In response to a predetermined input operation by the speaker Ps2 on the communication terminal T1, the server device CDS separates the audio data of the sound picked up by the microphones M1d to M5d in the sound space AR into as many types of sound sources as there are microphones (for example, the voice of the speaker Ps1, the voice of the speaker Ps2, the voice of the speaker Ps3, and other sound), and transmits one of the separated audio data (for example, the voice of the speaker Ps3) to the communication terminal T1. Each time the speaker Ps2 performs the predetermined input operation, the communication terminal T1 switches to and outputs the next separated audio data transmitted from the server device CDS. When the speaker Ps2 designates one of the separated audio data (for example, that of the speaker Ps3) on the communication terminal T1, the communication terminal T2 receives the designated separated audio data via the server device CDS and switches its output from the voice of the speaker Ps2 (the received voice) to the separated audio data of the speaker Ps3. As a result, the speaker Ps2 can let the distant call partner Ps4, who is not in the office room, listen to high-quality separated audio data of what the speaker Ps3, who is talking with the speaker Ps2 in the office room that is the sound space AR, is saying.

次に、本実施形態の音声伝達システムの第１動作例の動作手順について、図１及び図８を参照して説明する。図８は、本実施形態の音声伝達システム１００の第１動作例の動作手順を説明するフローチャートである。なお、以下の説明では、説明を簡単にするために、通信端末Ｔ１，Ｔ２とサーバ装置ＣＤＳの動作を主体的に説明し、必要に応じて第１選択部ＳＬ１及び第２選択部ＳＬ２の動作を説明し、マイク信号多重伝送装置ＭＸの動作の説明は省略する。 Next, the operation procedure of the first operation example of the sound transmission system according to the present embodiment will be described with reference to FIGS. 1 and 8. FIG. 8 is a flowchart for explaining the operation procedure of the first operation example of the sound transmission system 100 according to the present embodiment. In the following description, for the sake of simplicity, the operations of the communication terminals T1 and T2 and the server device CDS are mainly described, and the operations of the first selection unit SL1 and the second selection unit SL2 are performed as necessary. The description of the operation of the microphone signal multiplex transmission apparatus MX will be omitted.

図８において、ユーザＡは通信端末Ｔ１に対し、音源分離サービスを使用するためのアプリケーションＡＰ１の起動操作を行う。これにより、通信端末Ｔ１は、音源分離サービスを使用するためのアプリケーションＡＰ１を起動する（ＳＴ１１）。通信端末Ｔ１は、ユーザＡの所定の入力操作により、音源分離サービスの対象エリアである音空間ＡＲにおける音源分離、及び音源分離された分離音声データの聴取の開始を要求するための開始信号をサーバ装置ＣＤＳに送信する（ＳＴ１２）。 In FIG. 8, the user A performs an activation operation of the application AP1 for using the sound source separation service with respect to the communication terminal T1. Thereby, communication terminal T1 starts application AP1 for using a sound source separation service (ST11). The communication terminal T1 receives a start signal for requesting start of sound source separation in the sound space AR, which is a target area of the sound source separation service, and listening to the separated sound data separated by the sound source, by a predetermined input operation of the user A. Transmit to the device CDS (ST12).

サーバ装置ＣＤＳは、音空間ＡＲにおいて収音された複数の種類の音声が混じり合った音声データをサーバ装置ＣＤＳ内のメモリ（不図示）から読み出し、多重分離後の音声データを用いて音源分離することで、１つ以上のマイクロホン（例えばマイクロホンＭ１〜Ｍ４）と同数の種類の分離音声データ（即ち、マイクロホンの配置数と同数の音源が音空間ＡＲに存在する場合に、各音源から出力された音声の音声データのこと）を生成する（ＳＴ２１）。 The server device CDS reads, from a memory (not shown) in the server device CDS, audio data in which the plural types of sound picked up in the sound space AR are mixed, and performs sound source separation on the demultiplexed audio data, thereby generating as many types of separated audio data as there are microphones (one or more; for example, the microphones M1 to M4) (ST21). Here, separated audio data means the audio data of the sound output from each sound source when the sound space AR contains as many sound sources as there are microphones.
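Why the number of separated streams matches the number of microphones can be illustrated with a toy two-microphone, two-source case. The sketch below is not the separation method of this description, which is not specified at this point; it assumes an instantaneous linear mixture whose mixing gains are known, whereas a practical separator (for example, independent component analysis as mentioned in the background) would have to estimate them. All names and numbers are hypothetical.

```python
def invert_2x2(m):
    """Return the inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; sources cannot be separated")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply_matrix(m, channels):
    """Multiply a 2x2 matrix by two equal-length lists of samples."""
    return [
        [row[0] * x0 + row[1] * x1 for x0, x1 in zip(channels[0], channels[1])]
        for row in m
    ]


# Two hypothetical sources (e.g. two talkers), a few samples each.
sources = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.4, 0.6, 0.8]]
# Hypothetical gains from each source to each of the two microphones.
mixing = [[1.0, 0.6], [0.4, 1.0]]

# What the two microphones record: each one hears both sources.
mic_signals = apply_matrix(mixing, sources)
# Undoing the (assumed known) mixing yields one stream per microphone.
separated = apply_matrix(invert_2x2(mixing), mic_signals)
```

With two microphones the system of equations is square, so two streams can be recovered; the same reasoning is one way to read "as many types of separated audio data as there are microphones".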

サーバ装置ＣＤＳは、制御部ＰＲ１において、ステップＳＴ２１において生成した１つ以上の分離音声データの序数を示すパラメータｋを初期化し（ＳＴ２２）、ｋ＝１に対応する分離音声データを、第１選択部ＳＬ１を介して通信端末Ｔ１に送信する（ＳＴ２３）。 In the control unit PR1, the server device CDS initializes a parameter k indicating the ordinal of the one or more separated audio data generated in step ST21 (ST22), and transmits the separated audio data corresponding to k = 1 to the communication terminal T1 via the first selection unit SL1 (ST23).

通信端末Ｔ１は、ステップＳＴ２３において送信された分離音声データを受信し、音声出力部１６において出力する（ＳＴ１３、ユーザＡにおける仮聴取）。なお、このステップＳＴ１３の時点では、サーバ装置ＣＤＳにより生成されたいずれかの分離音声データｃ１が通信端末において出力（再生）される。 The communication terminal T1 receives the separated voice data transmitted in step ST23 and outputs it in the voice output unit 16 (ST13, provisional listening by the user A). At the time of step ST13, any one of the separated audio data c1 generated by the server device CDS is output (reproduced) at the communication terminal.

ここで、ユーザＡは音声出力部１６において出力された分離音声データが気に入らない場合には（ＳＴ１４、ＮＯ）、その旨の入力操作（例えば、現在出力されている分離音声データ以外の他の分離音声データに切り替えるための入力操作）を行い、通信端末Ｔ１は、現在出力されている分離音声データ以外の他の分離音声データに切り替えるための次選択要求信号をサーバ装置ＣＤＳに送信する。 Here, when the user A does not like the separated voice data output from the voice output unit 16 (ST14, NO), an input operation to that effect (for example, other separated voice data other than the currently outputted separated voice data) The communication terminal T1 transmits to the server device CDS a next selection request signal for switching to other separated voice data other than the currently outputted separated voice data.

サーバ装置ＣＤＳは、通信端末Ｔ１から送信された次選択要求信号に応じて、制御部ＰＲ１において、ステップＳＴ２１において生成した１つ以上の分離音声データの序数を示すパラメータｋをインクリメントし（ＳＴ２４）、インクリメント後のパラメータｋ（例えばｋ＝２）に対応する分離音声データを、第１選択部ＳＬ１を介して通信端末Ｔ１に送信する（ＳＴ２３）。このように、本実施形態の音声伝達システム１００の第１動作例では、ユーザＡが聴取したい分離音声データが見つかるまで、通信端末Ｔ１とサーバ装置ＣＤＳとの間で、分離音声データの送受信が繰り返され、通信端末Ｔ１は、聴取したい分離音声データが見つかった後、その分離音声データの聴取を継続することができる。 In response to the next selection request signal transmitted from communication terminal T1, server device CDS increments parameter k indicating the ordinal number of one or more separated audio data generated in step ST21 in control unit PR1 (ST24). The separated voice data corresponding to the parameter k after the increment (for example, k = 2) is transmitted to the communication terminal T1 via the first selection unit SL1 (ST23). As described above, in the first operation example of the voice transmission system 100 according to the present embodiment, transmission / reception of separated voice data is repeated between the communication terminal T1 and the server device CDS until the separated voice data that the user A wants to hear is found. Thus, the communication terminal T1 can continue to listen to the separated voice data after the separated voice data to be heard is found.
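The trial-listening loop of steps ST22 to ST24 and ST13 to ST14 can be sketched as below. The class and function names are hypothetical, and the description does not state what happens when k passes the last separated stream, so wrapping back to k = 1 is an assumption made here for illustration.

```python
class SeparatedStreamSelector:
    """Tracks the ordinal k of the separated stream currently sent to a terminal."""

    def __init__(self, streams):
        if not streams:
            raise ValueError("at least one separated stream is required")
        self.streams = list(streams)
        self.k = 1  # ST22: initialise the ordinal

    def current(self):
        # ST23: the stream for the current k is the one transmitted
        return self.streams[self.k - 1]

    def next(self):
        # ST24: increment k on a next-selection request.
        # Assumption: wrap to 1 after the last stream (not stated in the source).
        self.k = self.k % len(self.streams) + 1
        return self.current()


def trial_listen(selector, accepts):
    """Repeat ST13/ST14 until accepts(stream) is true or every stream was tried."""
    stream = selector.current()
    for _ in range(len(selector.streams)):
        if accepts(stream):       # ST14 YES: keep listening to this stream
            return stream
        stream = selector.next()  # ST14 NO: ask for the next stream
    return None                   # no stream was accepted
```

Here `accepts` stands in for the user's judgment; a caller might pass `lambda s: s == "english"` to model a listener who stops at the English stream.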

一方、ユーザＡは音声出力部１６において出力された分離音声データが気に入った場合には（ＳＴ１４、ＹＥＳ）、その旨の入力操作（例えば、現在出力されている分離音声データの出力（再生）を継続するための入力操作）を行い、通信端末Ｔ１は、現在出力されている分離音声データの出力（再生）を継続する（ＳＴ１５、ユーザＡにおける本聴取）。ユーザＡがアプリケーションＡＰ１の終了を指示するための入力操作を行った場合には（ＳＴ１６）、通信端末Ｔ１は、アプリケーションＡＰ１を終了する。 On the other hand, if the user A likes the separated voice data output from the voice output unit 16 (ST14, YES), the user A performs an input operation (for example, output (reproduction) of the currently outputted separated voice data). The communication terminal T1 continues to output (reproduce) the separated audio data that is currently output (ST15, main listening by the user A). When the user A performs an input operation for instructing to end the application AP1 (ST16), the communication terminal T1 ends the application AP1.

次に、本実施形態の音声伝達システムの第２動作例の動作手順について、図１、図９及び図１０を参照して説明する。図９は、本実施形態の音声伝達システム１００の第２動作例の動作手順を説明するフローチャートである。図１０は、図９の続きの動作手順を説明するフローチャートである。図９及び図１０の説明の前提として、ユーザＡが使用する通信端末Ｔ１を送話側、ユーザＢが使用する通信端末Ｔ２を受話側とする。なお、図９の説明では、図８と重複する動作については同一のステップ番号を付与して説明を省略し、異なる内容について説明する。 Next, the operation procedure of the second operation example of the sound transmission system according to the present embodiment will be described with reference to FIG. 1, FIG. 9, and FIG. FIG. 9 is a flowchart for explaining an operation procedure of the second operation example of the sound transmission system 100 according to the present embodiment. FIG. 10 is a flowchart for explaining the operation procedure subsequent to FIG. 9 and 10, the communication terminal T1 used by the user A is the transmitting side, and the communication terminal T2 used by the user B is the receiving side. In the description of FIG. 9, the same operations as those in FIG. 8 are given the same step numbers and the description thereof is omitted, and different contents will be described.

図９において、通信端末Ｔ１は、ユーザＡとユーザＢとの間の通話を開始するためのユーザＡの所定の入力操作により、通信端末Ｔ２との間で通話音声の送受信を開始し、同様に、通信端末Ｔ２は、ユーザＡとユーザＢとの間の通話を開始するためのユーザＢの所定の入力操作により、通信端末Ｔ１との間で通話音声の送受信を開始する（ＳＴ５１）。従って、通信端末Ｔ１から通信端末Ｔ２に向けて送信される音声の音声データは、ユーザＡの通話音声の音声データａであり、通信端末Ｔ２から通信端末Ｔ１に向けて送信される音声の音声データは、ユーザＢの通話音声の音声データｂである。 In FIG. 9, the communication terminal T1 starts transmission / reception of call voice with the communication terminal T2 by a predetermined input operation of the user A for starting a call between the user A and the user B, and similarly. The communication terminal T2 starts transmission / reception of call voice with the communication terminal T1 by a predetermined input operation of the user B for starting a call between the user A and the user B (ST51). Therefore, the voice data of the voice transmitted from the communication terminal T1 to the communication terminal T2 is the voice data a of the call voice of the user A, and the voice data of the voice sent from the communication terminal T2 to the communication terminal T1. Is voice data b of user B's call voice.

ユーザＡは通信端末Ｔ１に対し、送信者モードとして音源分離サービスを使用するためのアプリケーションＡＰ１の起動操作を行う。これにより、通信端末Ｔ１は、音源分離サービスを使用するためのアプリケーションＡＰ１を送信者モードとして起動する（ＳＴ３２）。同様に、ユーザＢは通信端末Ｔ２に対し、受信者モードとして音源分離サービスを使用するためのアプリケーションＡＰ２の起動操作を行う。これにより、通信端末Ｔ２は、音源分離サービスを使用するためのアプリケーションＡＰ２を受信者モードとして起動する（ＳＴ５２）。 The user A performs an activation operation of the application AP1 for using the sound source separation service as the sender mode with respect to the communication terminal T1. Thereby, communication terminal T1 starts application AP1 for using a sound source separation service as a sender mode (ST32). Similarly, the user B performs an activation operation of the application AP2 for using the sound source separation service as the receiver mode with respect to the communication terminal T2. Thereby, communication terminal T2 starts application AP2 for using a sound source separation service as a receiver mode (ST52).

通信端末Ｔ１は、ステップＳＴ２３においてサーバ装置ＣＤＳから送信されたパラメータｋ＝１に対応する分離音声データを受信し、音声出力部１６において出力する（ＳＴ１３、ユーザＡにおける仮聴取）。なお、このステップＳＴ１３の時点では、サーバ装置ＣＤＳにより生成されたいずれかの分離音声データｃ１が通信端末において出力（再生）される。 The communication terminal T1 receives the separated voice data corresponding to the parameter k = 1 transmitted from the server device CDS in step ST23, and outputs it in the voice output unit 16 (ST13, provisional listening by the user A). At the time of step ST13, any one of the separated audio data c1 generated by the server device CDS is output (reproduced) at the communication terminal.

ユーザＡは音声出力部１６において出力された分離音声データが気に入った場合には（ＳＴ１４、ＹＥＳ）、通話相手であるユーザＢに伝達するための音声データ（分離音声データ）を決定した旨の操作を行う。これにより、通信端末Ｔ１は、ユーザＡの決定操作に従い、通信端末Ｔ２に伝達するための分離音声データを決定した旨の決定信号をサーバ装置ＣＤＳ及び通信端末Ｔ２にそれぞれ送信する（ＳＴ３３）。なお、決定信号には、ユーザＡにより決定されたいずれかの分離音声データの種別を示す情報が含まれる。 When the user A likes the separated voice data output from the voice output unit 16 (ST14, YES), the user A performs an operation indicating that the voice data (separated voice data) to be conveyed to the user B, the call partner, has been determined. In accordance with this determination operation by the user A, the communication terminal T1 transmits, to each of the server device CDS and the communication terminal T2, a determination signal indicating that the separated voice data to be conveyed to the communication terminal T2 has been determined (ST33). The determination signal includes information indicating the type of the separated voice data determined by the user A.

サーバ装置ＣＤＳは、ステップＳＴ３３において通信端末Ｔ１から送信された決定信号に応じて、決定信号に対応するいずれかの分離音声データを通信端末Ｔ２に送信する（ＳＴ４１）。 In response to the determination signal transmitted from communication terminal T1 in step ST33, server apparatus CDS transmits any separated voice data corresponding to the determination signal to communication terminal T2 (ST41).

一方、通信端末Ｔ２は、ステップＳＴ３３において通信端末Ｔ１から送信された決定信号を受信した場合には（ＳＴ５３、ＹＥＳ）、音声出力部２６から出力する受信音声を、通信端末Ｔ１から送信されたユーザＡの通話音声の音声データａから、ステップＳＴ４１においてサーバ装置ＣＤＳが第２選択部ＳＬ２を介して送信したいずれかの分離音声データｃ２に切り替える（ＳＴ５４）。 Meanwhile, when the communication terminal T2 receives the determination signal transmitted from the communication terminal T1 in step ST33 (ST53, YES), it switches the received voice output from the voice output unit 26 from the voice data a of the call voice of the user A transmitted from the communication terminal T1 to the separated voice data c2 transmitted by the server device CDS via the second selection unit SL2 in step ST41 (ST54).

なお、通信端末Ｔ２は、ユーザＢが聴取先の切替操作を行った場合には（ＳＴ５５、ＹＥＳ）、音声出力部２６から出力する受信音声を、通信端末Ｔ１から送信されたユーザＡの通話音声の音声データａと、ステップＳＴ４１においてサーバ装置ＣＤＳが第２選択部ＳＬ２を介して送信したいずれかの分離音声データｃ２との間で切り替える（ＳＴ５６）。 When the user B performs an operation to switch the listening target (ST55, YES), the communication terminal T2 switches the received voice output from the voice output unit 26 between the voice data a of the call voice of the user A transmitted from the communication terminal T1 and the separated voice data c2 transmitted by the server device CDS via the second selection unit SL2 in step ST41 (ST56).

ステップＳＴ５４の後、ユーザＢが聴取先の切替操作を行わない場合には（ＳＴ５５、ＮＯ）、通信端末Ｔ２は、通信端末Ｔ１との間で、ユーザＡ，Ｂ間の通話音声の送受信を継続する（ＳＴ５７）。同様に、通信端末Ｔ１は、通信端末Ｔ２との間で、ユーザＡ，Ｂ間の通話音声の送受信を継続する（ＳＴ３４）。なお、通信端末Ｔ２から通信端末Ｔ１に向けて送信される音声の音声データは、ユーザＢの通話音声の音声データｂであるが、通信端末Ｔ１から通信端末Ｔ２に向けて送信される音声の音声データは、ユーザＡの通話音声の音声データａ、又はサーバ装置ＣＤＳが第２選択部ＳＬ２を介して送信したいずれかの分離音声データｃ２である。 After step ST54, when the user B does not perform the switching operation of the listening destination (ST55, NO), the communication terminal T2 continues to transmit and receive call voice between the users A and B with the communication terminal T1. (ST57). Similarly, communication terminal T1 continues to transmit and receive call voice between users A and B with communication terminal T2 (ST34). Note that the voice data of the voice transmitted from the communication terminal T2 to the communication terminal T1 is the voice data b of the call voice of the user B, but the voice voice transmitted from the communication terminal T1 to the communication terminal T2. The data is the voice data a of the call voice of the user A or any one of the separated voice data c2 transmitted from the server device CDS via the second selection unit SL2.

ステップＳＴ３４の後、ユーザＡがアプリケーションＡＰ１の終了を指示するための入力操作を行った場合には（ＳＴ３５）、通信端末Ｔ１は、アプリケーションＡＰ１を終了し、アプリケーションＡＰ１の終了を通知するためのアプリ終了通知信号をサーバ装置ＣＤＳ及び通信端末Ｔ２にそれぞれ送信する。サーバ装置ＣＤＳは、通信端末Ｔ１からのアプリ終了通知信号を受信した場合には（ＳＴ４２、ＹＥＳ）、音源分離サービスにおける動作を終了する。 After step ST34, when the user A performs an input operation for instructing the end of the application AP1 (ST35), the communication terminal T1 ends the application AP1 and notifies the end of the application AP1. An end notification signal is transmitted to each of the server device CDS and the communication terminal T2. When the server device CDS receives the application end notification signal from the communication terminal T1 (ST42, YES), the server device CDS ends the operation in the sound source separation service.

また、通信端末Ｔ２は、通信端末Ｔ１からのアプリ終了通知信号を受信した場合には、その旨を表示部２５の画面に表示し、ユーザＢのアプリケーションＡＰ２の終了を指示するための入力操作を行った場合に（ＳＴ５８）、音声出力部２６から出力する受信音声を、通信端末Ｔ１から送信されたユーザＡの通話音声の音声データａと、ステップＳＴ４１においてサーバ装置ＣＤＳが第２選択部ＳＬ２を介して送信したいずれかの分離音声データｃ２とのうち前者に切り替える（ＳＴ５９）。 In addition, when the communication terminal T2 receives the application end notification signal from the communication terminal T1, it displays a message to that effect on the screen of the display unit 25. When the user B then performs an input operation for instructing the end of the application AP2 (ST58), the communication terminal T2 switches the received voice output from the voice output unit 26 to the former of the following two: the voice data a of the call voice of the user A transmitted from the communication terminal T1, and the separated voice data c2 transmitted by the server device CDS via the second selection unit SL2 in step ST41 (ST59).

ステップＳＴ３５の後、通信端末Ｔ１は、通信端末Ｔ２との間で、ユーザＡ，Ｂ間の通話音声の送受信を継続する（ＳＴ３６）。同様に、通信端末Ｔ２は、通信端末Ｔ１との間で、ユーザＡ，Ｂ間の通話音声の送受信を継続する（ＳＴ６０）。なお、音源分離サービスのアプリケーションＡＰ１，ＡＰ２が通信端末Ｔ１，Ｔ２において終了しているので、通信端末Ｔ１から通信端末Ｔ２に向けて送信される音声の音声データは、ユーザＡの通話音声の音声データａであり、通信端末Ｔ２から通信端末Ｔ１に向けて送信される音声の音声データは、ユーザＢの通話音声の音声データｂとなる。 After step ST35, the communication terminal T1 continues to transmit and receive call voice between the users A and B with the communication terminal T2 (ST36). Similarly, communication terminal T2 continues to transmit and receive call voice between users A and B with communication terminal T1 (ST60). Since the sound source separation service applications AP1 and AP2 are terminated at the communication terminals T1 and T2, the voice data of the voice transmitted from the communication terminal T1 to the communication terminal T2 is the voice data of the call voice of the user A. The voice data of voice transmitted from the communication terminal T2 to the communication terminal T1 is the voice data b of the call voice of the user B.
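The receiving terminal's choice of what to play in the second operation example (steps ST51, ST53 to ST56, ST58 to ST59) behaves like a small state machine, sketched below. The class name, method names, and string labels are hypothetical; ignoring a toggle before any determination signal has arrived is an assumption, since the description only discusses the toggle after step ST54.

```python
class ReceiverOutput:
    """Chooses what terminal T2 plays: call voice 'a' or separated stream 'c2'."""

    CALL_VOICE = "a"
    SEPARATED = "c2"

    def __init__(self):
        self.source = self.CALL_VOICE  # ST51: an ordinary call plays voice a
        self.decided = False           # True once a determination signal arrived

    def on_determination_signal(self):
        # ST53 -> ST54: switch to the separated stream chosen by the sender
        self.decided = True
        self.source = self.SEPARATED

    def on_toggle(self):
        # ST55 -> ST56: user B toggles between a and c2.
        # Assumption: the toggle does nothing before a determination signal.
        if self.decided:
            if self.source == self.SEPARATED:
                self.source = self.CALL_VOICE
            else:
                self.source = self.SEPARATED

    def on_app_end(self):
        # ST58 -> ST59: fall back to the ordinary call voice
        self.decided = False
        self.source = self.CALL_VOICE
```

Keeping the fallback in `on_app_end` mirrors the statement that, once the applications have ended, only the call voice data a and b are exchanged.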

When the call between users A and B ends, communication terminals T1 and T2 stop transmitting and receiving the call voice to and from each other (ST37, ST61).
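The fallback behavior of communication terminal T2 in steps ST58 and ST59 can be pictured with a minimal sketch. The class name, field names, and the string labels "a" and "c2" are hypothetical illustrations, and the sketch assumes that T2 was outputting the separated voice data c2 before the application ended; it is not the implementation described in the specification.

```python
from dataclasses import dataclass


@dataclass
class TerminalT2:
    """Hypothetical model of how communication terminal T2 picks its output."""

    app_running: bool = True    # application AP2 is active
    notice_shown: bool = False  # end notice shown on display unit 25
    output_source: str = "c2"   # assumed starting point: separated voice data c2

    def on_app_end_notification(self) -> None:
        # T1 reported that AP1 ended; T2 only informs user B at this point.
        self.notice_shown = True

    def on_user_ends_app(self) -> None:
        # ST58/ST59: user B ends AP2, so output falls back to voice data a.
        self.app_running = False
        self.output_source = "a"
```

In this sketch the notification alone does not change the output; the switch to voice data a happens only when user B ends application AP2, which mirrors the order of ST58 and ST59.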

As described above, the audio transmission system 100 of the present embodiment collects sound with the plurality of microphones M1 to M4 arranged in the sound space AR, which contains a plurality of sounds, and stores the voice data of the sound collected by the microphones M1 to M4 in the server device CDS on the cloud network NW1. In its sound source separation unit PR2, the server device CDS separates the stored voice data into as many types of voice data (separated voice data) as there are microphones M1 to M4 and, in response to a predetermined input operation on communication terminal T1, transmits one of the separated voice data to the first selection unit SL1 on the LTE network NW2. The first selection unit SL1 transmits one of the individual separated voice data generated by the sound source separation unit PR2 to communication terminal T1.

As a result, in a real sound space AR in which many speakers and environmental sounds including noise are mixed, the audio transmission system 100 can extract separated voice data corresponding to a specific sound source from the collected sound. A predetermined input operation on communication terminal T1 therefore suffices to deliver the high-quality separated voice data that user A desires to user A's own terminal, communication terminal T1.

In addition, each time a predetermined input operation is performed on communication terminal T1, the control unit PR1 of the server device CDS sequentially switches among the individual separated voice data and causes the sound source separation unit PR2 to transmit the selected data to the first selection unit SL1. Communication terminal T1 receives and outputs the separated voice data transmitted from the first selection unit SL1 in response to user A's predetermined input operation.

As a result, through a simple input operation by user A on communication terminal T1, the audio transmission system 100 can selectively switch to the high-quality separated voice data that user A desires, from among the separated voice data separated into one or more sound sources at the server device CDS, and deliver it to communication terminal T1.
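The sequential switching performed by the control unit PR1 can be illustrated with a short sketch. The class name, the stream identifiers, and the wrap-around from the last separated voice data back to the first are assumptions made for illustration; the description only states that the individual separated voice data are switched sequentially for each input operation.

```python
class SeparatedStreamSwitcher:
    """Hypothetical stand-in for control unit PR1's stream switching."""

    def __init__(self, stream_ids):
        if not stream_ids:
            raise ValueError("at least one separated stream is required")
        self._ids = list(stream_ids)
        self._index = 0  # start with the first separated stream

    @property
    def current(self):
        # Identifier of the separated voice data currently sent to SL1.
        return self._ids[self._index]

    def on_input_operation(self):
        # Each input operation advances to the next stream (assumed cyclic).
        self._index = (self._index + 1) % len(self._ids)
        return self.current
```

With four microphones there would be four separated streams, so four successive input operations in this sketch visit the remaining three streams and then return to the first.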

In addition, in the server device CDS of the audio transmission system 100, the sound source separation unit PR2 separates the stored voice data into as many types of voice data (separated voice data) as there are microphones M1 to M4 and, in response to a predetermined input operation on communication terminal T2, transmits one of the separated voice data to the second selection unit SL2 on the LTE network NW2. The second selection unit SL2 transmits to communication terminal T2 either the voice data transmitted from communication terminal T1 (for example, user A's raw call voice picked up by communication terminal T1) or one of the individual separated voice data generated by the sound source separation unit PR2.

As a result, in a real sound space AR in which many speakers and environmental sounds including noise are mixed, the audio transmission system 100 can extract separated voice data corresponding to a specific sound source from the collected sound. Therefore, through a predetermined input operation on communication terminal T2, the call partner of communication terminal T1, the system can selectively deliver to communication terminal T2 either the voice data of the voice of user A, the call partner, or the high-quality separated voice data generated by the server device CDS, according to the preference of user B of communication terminal T2.
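The choice made by the second selection unit SL2 can be sketched as a small function. The function name, the byte-string representation of voice data, and the convention that `None` means "pass through user A's call voice" are all assumptions for illustration and do not come from the specification.

```python
from typing import Optional, Sequence


def select_for_t2(
    voice_data_a: bytes,
    separated: Sequence[bytes],
    choice: Optional[int],
) -> bytes:
    """Hypothetical model of the second selection unit SL2.

    choice is None  -> forward user A's call voice (voice data a).
    choice is index -> forward that separated voice data stream.
    """
    if choice is None:
        return voice_data_a
    if not 0 <= choice < len(separated):
        raise IndexError(f"no separated stream with index {choice}")
    return separated[choice]
```

Keeping the raw call voice as the default path means that, in this sketch, communication terminal T2 still receives user A's voice when no separated voice data has been requested.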

In addition, the audio transmission system 100 transmits one of the individual separated voice data generated by the sound source separation unit PR2 to communication terminal T1 and, each time user A performs a predetermined input operation on communication terminal T1, sequentially switches among the individual separated voice data and causes the sound source separation unit PR2 to transmit the selected data to the first selection unit SL1. Communication terminal T1 receives and outputs the separated voice data transmitted from the first selection unit SL1 in response to user A's predetermined input operation.

As a result, through a simple input operation by user A on communication terminal T1, the audio transmission system 100 can deliver to communication terminal T1, in accordance with user A's wishes, the high-quality separated voice data desired by user B, the call partner, from among the separated voice data separated into one or more sound sources at the server device CDS. The separated voice data can thus be auditioned in advance at communication terminal T1, and the high-quality separated voice data designated from communication terminal T1 can then be delivered to communication terminal T2.

In addition, in the audio transmission system 100, the control unit PR1 causes the sound source separation unit PR2 to output one of the separated voice data to the second selection unit SL2 in response to a predetermined input operation on communication terminal T1. The second selection unit SL2 transmits the separated voice data output from the sound source separation unit PR2 to the second communication terminal.

As a result, the audio transmission system 100 can deliver to communication terminal T2 the separated voice data that user A selected after listening to the separated voice data delivered to communication terminal T1.

In addition, in the audio transmission system 100, when communication terminal T1 receives a notification signal of a voice separation service that separates the voice data of sound collected in the sound space AR into as many types of voice data as there are microphones M1 to M4, it receives one of the separated voice data transmitted from the first selection unit SL1 on the LTE network NW2 in response to a service start operation for using the voice separation service.

As a result, even if user A of communication terminal T1 is unaware that the voice separation service exists, once communication terminal T1 receives the notification signal of the voice separation service, a simple service start operation by user A allows the separated voice data generated by sound source separation to be received at communication terminal T1 and heard by user A, or to be delivered to communication terminal T2.

In addition, in the audio transmission system 100, communication terminal T1 receives a notification signal carrying information on the target area of the voice separation service that separates the voice data of sound collected in the sound space AR into as many types of voice data as there are microphones M1 to M4.

As a result, even if user A of communication terminal T1 has no prior knowledge of the target areas of the voice separation service, once communication terminal T1 receives the information on the target areas, a simple input operation by user A allows the separated voice data generated by sound source separation for whichever target area user A designates to be received at communication terminal T1 and heard by user A, or to be delivered to communication terminal T2.

In addition, in the audio transmission system 100, the sound source separation unit PR2, in response to a selection signal from communication terminal T1 for selecting one or more microphones from among the one or more microphones M1 to M4, uses the voice data of the sound collected by the selected microphones (for example, microphones M1 and M2) and separates it into as many types of voice data as there are selected microphones.

As a result, when user A performs, on communication terminal T1, an operation selecting the microphones located around the person user A wishes to hear (for example, a lecturer) in the sound space AR, the audio transmission system 100 can exclude sound picked up from a noise source (for example, motor noise) located far from that person and efficiently collect that person's voice.
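The microphone selection step can be sketched as a filter placed in front of the separation stage. The function names, the dictionary keyed by microphone label, and the `passthrough` separator are hypothetical; in particular, `passthrough` is only a placeholder and performs no actual sound source separation. The sketch illustrates just two points from the description: only the selected microphones feed the separation, and the number of outputs equals the number of selected microphones.

```python
from typing import Callable, Dict, List, Sequence

Signal = List[float]


def separate_selected(
    mic_signals: Dict[str, Signal],
    selected: Sequence[str],
    separate: Callable[[List[Signal]], List[Signal]],
) -> List[Signal]:
    """Run a separation function on the selected microphone channels only."""
    unknown = [m for m in selected if m not in mic_signals]
    if unknown:
        raise KeyError(f"unknown microphones: {unknown}")
    # Channels from unselected microphones never reach the separator.
    chosen = [mic_signals[m] for m in selected]
    outputs = separate(chosen)
    # The description ties the output count to the selected microphone count.
    if len(outputs) != len(chosen):
        raise ValueError("separator must return one output per selected microphone")
    return outputs


def passthrough(channels: List[Signal]) -> List[Signal]:
    # Placeholder separator (assumption): copies the channels unchanged.
    return [list(ch) for ch in channels]
```

Selecting M1 and M2 in this sketch yields two outputs, and a loud source captured only by M3 never enters the separation stage.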

Various embodiments have been described above with reference to the drawings, but it goes without saying that the present invention is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes or modifications within the scope of the claims, and it is understood that these naturally fall within the technical scope of the present invention.

The present invention is useful as an audio transmission system and an audio transmission method that extract a specific sound source from sound collected in a real sound space in which many speakers and environmental sounds including noise are mixed, and deliver the extracted high-quality audio to the user's own terminal or to the terminal of the call partner.

11, 21 Sound collection unit
12, 22 CPU
13, 23 Memory
14, 24 Input unit
15, 25 Display unit
16, 26 Audio output unit
17, 27 Wireless communication unit
18, 28 Antenna
a, b Voice data of call voice
c1, c2 Any one of the separated voice data
AP1, AP2 Application
CDS Server device
M1, M2, M3, M4 Microphone
MX Microphone signal multiplex transmission device
NW1 Cloud network
NW2 LTE network
PR Sound signal processing and transmission unit
PR1 Control unit
PR2 Sound source separation unit
SL1 First selection unit
SL2 Second selection unit
T1, T2 Communication terminal

Claims

An audio transmission system including a first communication terminal that accepts a predetermined input operation, the system comprising:
one or more sound collection units that collect sound in a sound space including a plurality of sounds;
a storage unit that stores voice data of the sound collected by the one or more sound collection units;
a separation unit that separates the voice data stored in the storage unit into the same number of types of voice data as the one or more sound collection units, and generates individual separated voice data; and
a first transmission unit that transmits, to the first communication terminal as switching voice data, whichever of the individual separated voice data generated by the separation unit corresponds to the predetermined input operation on the first communication terminal,
wherein the first communication terminal receives the switching voice data and outputs it as sound.

The audio transmission system according to claim 1, further comprising
a control unit that sequentially switches among the individual separated voice data each time the predetermined input operation is performed on the first communication terminal, and outputs the selected data to the first transmission unit.

The audio transmission system according to claim 1 or 2,
wherein, when the first communication terminal receives a notification signal of a voice separation service that separates voice data of sound collected in the sound space into the same number of types of voice data as the sound collection units, the first communication terminal receives the switching voice data transmitted from the first transmission unit in response to a service start operation for the voice separation service.

The audio transmission system according to claim 1 or 2,
wherein the first communication terminal receives a notification signal carrying information on a target area of a voice separation service that separates voice data of sound collected in the sound space into the same number of types of voice data as the sound collection units.

The audio transmission system according to claim 1 or 2,
wherein the separation unit, in response to a selection signal from the first communication terminal for selecting one or more sound collection units from among the one or more sound collection units, uses the voice data of the sound collected by the selected one or more sound collection units and separates it into the same number of types of voice data as the selected one or more sound collection units.

An audio transmission method in an audio transmission system including a first communication terminal that accepts a predetermined input operation, the method comprising:
collecting sound with one or more sound collection devices arranged in a sound space including a plurality of sounds;
storing voice data of the sound collected by the one or more sound collection devices;
separating the voice data into the same number of types of voice data as the one or more sound collection devices, and generating individual separated voice data;
transmitting, to the first communication terminal as switching voice data, whichever of the individual separated voice data corresponds to the predetermined input operation on the first communication terminal; and
receiving the switching voice data and outputting it as sound.