JP2017092608A

JP2017092608A - Telephone conversation device

Info

Publication number: JP2017092608A
Application number: JP2015217777A
Authority: JP
Inventors: Junya Takigami (瀧上 順也); Kei Kikuiri (菊入 圭)
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2015-11-05
Filing date: 2015-11-05
Publication date: 2017-05-25

Abstract

PROBLEM TO BE SOLVED: To prevent a call voice from becoming hard to hear when an external sound different from the call voice is output simultaneously with it.
SOLUTION: A call device (terminal 100) with which a first speaker talks with a second speaker comprises: input means (second voice receiving unit 104) for inputting voice data (encoded sequence B2) of the second speaker; adjustment means (content adjustment unit 112) for adjusting an external sound (content V3), different from the second speaker's voice, in accordance with characteristics of that voice (call voice V2); and output means (content output unit 113) for outputting the adjusted external sound.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a call device that provides a call partner with an external sound, such as music, in addition to the call itself.

Techniques have been proposed for providing a sound different from the speaker's voice during a voice call made with a mobile phone or similar device. Here the speaker's voice is called the call voice and the other sound the external sound. For example, Patent Document 1 listed below discloses a telephone terminal device that can reproduce background music during a call.

Patent Document 1: JP 2005-217614 A

The telephone terminal device of Patent Document 1 reproduces a sound source held in the device in response to an external control signal. Because the sound is not transmitted, the device holding the source can reproduce it at high quality, with no transmission-induced degradation. However, when such sound is used as background sound during a voice call, it stands out against the call voice, which has been degraded by transmission. The call voice may then become hard to hear.

The present invention has been made in view of the above problem. Its object is to provide a call device in which the call voice does not become hard to hear when an external sound different from the call voice is output simultaneously with it.

To solve the problem described above, the present invention provides a call device that performs a call via a communication network. The call device comprises:
- input means for inputting, via the communication network, a call voice and an external sound different from the call voice;
- adjustment means for adjusting the external sound input by the input means in accordance with characteristics of the call voice; and
- output means for outputting the external sound adjusted by the adjustment means together with the call voice.

According to this invention, the external sound, which differs from the call partner's voice, is adjusted to the characteristics of the partner's voice before it is output. Therefore, when the external sound is output simultaneously with the call voice, the call voice does not become hard to hear.

In the call device of the present invention, the adjustment means may adjust the external sound in accordance with the voice frequency band of the call voice.

According to this invention, the external sound is adjusted to the voice frequency band of the call partner's voice. This makes it possible to adjust the external sound without impairing the audibility of the partner's voice.

Further, in the call device of the present invention, the adjustment means may adjust the external sound in accordance with at least one of the following: the type of voice codec used for the call, the sampling rate, or the type of the call partner's communication device. This keeps the partner's voice easy to hear.

Further, the adjustment means in the call device of the present invention identifies a predetermined band within the voice frequency band of the external sound, based on the characteristics of the call voice. It then adjusts the power in that band. As a result, the partner's voice stays easy to hear even when an external sound is output at the same time.

According to the present invention, an external sound different from the call voice is adjusted to the characteristics of the call voice before it is output. As a result, the call voice does not become hard to hear when the external sound is output simultaneously with it.

FIG. 1 is a diagram showing the schematic configuration of a call system according to the call device of the present invention.
FIG. 2 is a diagram showing the functional blocks of the call device of the present invention.
FIG. 3 is a hardware configuration diagram of a terminal according to the call device of the present invention.
FIG. 4 is a flowchart showing an example of processing executed in the call device of the present invention.

Embodiments of the present invention are described below with reference to the drawings. In the description of the drawings, identical elements are given identical reference numerals, and redundant descriptions are omitted.

FIG. 1 is a diagram showing the schematic configuration of a call system according to the call device of the present invention. As shown in FIG. 1, the call system carries a call between user 1 (the first speaker) and user 2 (the second speaker). Two terminals are used as call devices:
- a first terminal (terminal 100, described later); and
- a second terminal (terminal 200, described later), which preferably has the same configuration as terminal 100.

User 1 uses the first terminal and user 2 uses the second terminal, which enables a voice call between them.

In the call system, terminals 100 and 200 are, for example, configured to communicate with each other via a communication network 3. A server 300 may also be configured to communicate with terminals 100 and 200 via the communication network 3.

An example of how the call system operates is outlined below. User 2, the call partner of user 1, speaks into terminal 200. Terminal 200 encodes this call voice V2, outputs it as an encoded sequence, and transmits it to terminal 100. At the same time, an operation by user 2 on terminal 200 (a user operation) generates an instruction signal. The signal instructs terminal 100 to reproduce content V3, an external sound different from the call voice that terminal 100 already holds. The instruction signal is transmitted to terminal 100.

Specifically, as shown in FIG. 1, terminals 100 and 200 each hold a set of contents a to c. Each content is associated with one of instruction signals a to c, which instructs its reproduction. When user 2 selects a content, the corresponding instruction signal is transmitted to user 1's terminal 100. Terminal 100 reproduces the content that matches the signal, so the content desired by user 2 plays on terminal 100. User 1 therefore hears both user 2's call voice V2 and the content V3 chosen by user 2. User 2 may also be able to hear content V3 at the same time. Terminals 100 and 200 may hold identical content sets. Alternatively, it suffices for terminal 100 to hold at least the content selected by user 2.

Because terminals 100 and 200 have the same configuration, the converse also holds. User 2 can hear both user 1's call voice V1 and content V3 chosen by user 1, and user 1 may also be able to hear that content V3.

Examples of content V3 include sound effects and music such as BGM. However, content V3 is not limited to music. It may be an environmental sound, such as birdsong or the bustle of a station, or speech such as spoken lines.

The call system can adjust content V3 based on the characteristics of call voice V1 or V2. Consider the case where user 1 hears call voice V2 and content V3 simultaneously. The call system adjusts the voice frequency band and volume of content V3 to match the voice frequency band of call voice V2. This keeps call voice V2 easy for user 1 to hear.

The present invention is also applicable when terminal 100 does not hold the content. In that case, content V4 may be held in terminal 200 or server 300 and transmitted to terminal 100 when user 2 operates terminal 200. Terminal 100 then adjusts the voice frequency band and volume of the received content V4 to match the voice frequency band of call voice V2. In this case, read content V3 as content V4 in the rest of the description.

FIG. 2 is a diagram showing the functional blocks of the call device of the present invention. As shown in FIG. 1, the call between user 1 and user 2 runs on two call devices: terminal 100 (the first terminal) and terminal 200 (the second terminal).

As shown in FIG. 2, terminal 100 includes the following units:
- first voice input unit 101
- first voice encoding unit 102
- first voice transmission unit 103
- second voice receiving unit 104 (input means)
- second voice decoding unit 105
- second voice output unit 106
- first content reproduction instruction signal input unit 107
- first content reproduction instruction signal transmission unit 108
- second content reproduction instruction signal receiving unit 109
- content holding unit 110
- voice characteristic holding unit 111
- content adjustment unit 112 (adjustment means)
- content output unit 113 (output means)

The first voice input unit 101 receives call voice V1 of one speaker (user 1 in FIG. 1). It includes, for example, a microphone.

The first voice encoding unit 102 encodes call voice V1 from the first voice input unit 101 to generate an encoded sequence B1.

The first voice transmission unit 103 transmits call voice V1 as encoded by the first voice encoding unit 102. Specifically, it transmits encoded sequence B1 to terminal 200.

The second voice receiving unit 104 receives the encoded call voice V2 of the call partner (user 2 in FIG. 1). Specifically, it receives encoded sequence B2, which terminal 200 produced by encoding user 2's call voice V2.

The second voice decoding unit 105 decodes encoded sequence B2 received by the second voice receiving unit 104.

The second voice output unit 106 plays the sound decoded from encoded sequence B2 by the second voice decoding unit 105, which is user 2's call voice V2. It includes, for example, a speaker. This allows user 1 to hear user 2's call voice V2.

The first content reproduction instruction signal input unit 107 receives a first content reproduction instruction signal C1 generated by a user operation. User 1 (FIG. 1) performs the operation on a device such as an operation panel or touch panel on terminal 100. As described with reference to FIG. 1, such operations include:
- selecting the sound user 1 wants, from contents a to c and the like, as content V3; and
- stopping the reproduction of content V3.

The operation panel, touch panel or similar device generates signal C1 for the content selected by the user operation. Signal C1 is then input to the first content reproduction instruction signal input unit 107.

The first content reproduction instruction signal transmission unit 108 transmits the first content reproduction instruction signal C1 received by the first content reproduction instruction signal input unit 107. Specifically, it transmits signal C1 to terminal 200.

Instead of transmitting signal C1 from the first content reproduction instruction signal transmission unit 108, two alternatives are possible:
- The first voice encoding unit 102 superimposes the content V3 indicated by signal C1 on call voice V1 and encodes the combined first voice. The first voice transmission unit 103 then transmits it.
- A superimposition processing unit is placed between the first voice input unit 101 and the first voice encoding unit 102. This unit superimposes the reproduced sound of content V3, output from the content holding unit 110 described later, on call voice V1. The result is transmitted from the first voice transmission unit 103 as encoded sequence B1.

When terminal 200 does not hold the content desired by user 1, a content transmission unit may be provided. This unit transmits content V4, output from the content holding unit 110 described later, to terminal 200. In this case, read content V3 as content V4 in the rest of the description.

The second content reproduction instruction signal receiving unit 109 receives a second content reproduction instruction signal C2 transmitted from terminal 200 (FIG. 1). Like signal C1, signal C2 is generated by a user operation, in this case by user 2 (FIG. 1). Signal C2 instructs reproduction of the sound user 2 wants from among contents a to c and the like, held in the content holding unit 110 described later.

The content holding unit 110 holds content V3, an external sound different from the call voice. As shown in FIG. 1, it stores contents a to c and the like, each associated with one of reproduction instruction signals a to c and the like. When signal C1 or signal C2 is input, the content holding unit 110 outputs the content V3 associated with that signal. For example, if signal C1 is reproduction instruction signal a, which instructs reproduction of content a, the unit outputs content a as content V3. The stored content is of course not limited to contents a to c. For example, a new content set can be downloaded from an external server and used as content V3.
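The lookup performed by the content holding unit 110 can be sketched as a simple table. This is a minimal illustration under two assumptions: instruction signals are short string identifiers, and contents are referenced by file name. The class ContentStore, its methods and the file names are hypothetical and do not come from the patent.

```python
from typing import Dict, Optional


class ContentStore:
    """Hypothetical model of content holding unit 110: signal id -> stored content."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def register(self, signal_id: str, content: str) -> None:
        # Add or replace an entry, e.g. after downloading a new content set.
        self._table[signal_id] = content

    def resolve(self, signal_id: str) -> Optional[str]:
        # Return the content for a received signal (C1 or C2), or None if not held.
        return self._table.get(signal_id)


# Contents a to c, keyed by instruction signals a to c (file names are illustrative).
store = ContentStore()
for sig, path in [("a", "content_a.wav"), ("b", "content_b.wav"), ("c", "content_c.wav")]:
    store.register(sig, path)
```

A None result would mean the content is not held locally. In that case content V4 might instead be obtained from terminal 200 or server 300.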

Instead of signal C2, the content holding unit 110 may receive content V4 transmitted from an external server or from terminal 200. The content holding unit 110 then outputs the received content V4. In this case, read content V3 as content V4 in the rest of the description.

The voice characteristic holding unit 111 holds characteristics of the call partner's voice. Specifically, these are the codec used for the call, the sampling rate and the voice frequency band. Alternatively, the held characteristic may be the type of call device the partner uses, such as a fixed-line telephone or a particular smartphone model. In general, the voice frequency band can be determined from the codec and the sampling rate. The band also differs between fixed-line and mobile phones, and may differ among mobile phone models, so this information may also be held as a voice characteristic.
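The voice frequency band can be derived from the codec and the sampling rate, as sketched below. The passband figures are typical nominal values assumed for illustration: narrowband telephony is roughly 300 to 3400 Hz and wideband roughly 50 to 7000 Hz. The table, the function voice_band and the narrowband fallback are hypothetical choices, not taken from the patent.

```python
from typing import Optional, Tuple

# Nominal passbands (Hz). These are commonly cited typical values, assumed here;
# actual codec modes and terminals may differ.
NOMINAL_BAND_HZ = {
    "G.711": (300.0, 3400.0),   # narrowband telephony
    "AMR": (300.0, 3400.0),     # AMR narrowband
    "G.722": (50.0, 7000.0),    # wideband
    "AMR-WB": (50.0, 7000.0),   # wideband
}
NARROWBAND = (300.0, 3400.0)


def voice_band(codec: Optional[str], sample_rate_hz: Optional[float]) -> Tuple[float, float]:
    """Estimate (low, high) edges of the call voice band from codec or sampling rate."""
    if codec is not None:
        band = NOMINAL_BAND_HZ.get(codec.upper())
        if band is not None:
            return band
    if sample_rate_hz is not None:
        # Nyquist limit: nothing above half the sampling rate can be carried.
        return (0.0, sample_rate_hz / 2)
    # Nothing known: assume narrowband, the most restrictive case.
    return NARROWBAND
```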

These characteristics are preferably acquired at the start of a call. The codec type, the sampling rate and the type of communication device (for example, whether it is a fixed-line telephone) can all be obtained when the communication session is established. From the codec, sampling rate, voice frequency band and device type, the characteristics of the call voice V2 to be output can then be known.
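Suppose the call is set up with SIP and SDP. This is an assumption; the patent does not name a signaling protocol. The codec and RTP clock rate can then be read from the session description. The sketch below uses a hypothetical function, negotiated_audio_codec, which reads them from the m=audio and a=rtpmap lines. It falls back to a few static payload types.

```python
from typing import Dict, Optional, Tuple

# Static RTP audio payload types from RFC 3551 (subset). For G722 the RTP clock
# rate is 8000 although the codec samples at 16 kHz, so clock rate and sampling
# rate are not always the same thing.
STATIC_PAYLOAD_TYPES: Dict[int, Tuple[str, int]] = {
    0: ("PCMU", 8000),
    8: ("PCMA", 8000),
    9: ("G722", 8000),
}


def negotiated_audio_codec(sdp: str) -> Optional[Tuple[str, int]]:
    """Return (encoding name, RTP clock rate) of the first audio format in an SDP body.

    Assumes a single audio media section; the first payload type listed on the
    m=audio line is taken as the one in use.
    """
    rtpmap: Dict[int, Tuple[str, int]] = {}
    first_pt: Optional[int] = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m=audio") and first_pt is None:
            fields = line.split()  # m=audio <port> <proto> <fmt> ...
            if len(fields) > 3:
                first_pt = int(fields[3])
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding name>/<clock rate>[/<params>]
            pt_str, _, desc = line[len("a=rtpmap:"):].partition(" ")
            name, clock = desc.split("/")[:2]
            rtpmap[int(pt_str)] = (name, int(clock))
    if first_pt is None:
        return None
    return rtpmap.get(first_pt, STATIC_PAYLOAD_TYPES.get(first_pt))
```

The returned name and rate could then be mapped to a voice band. Note the G722 case, where the RTP clock rate understates the actual sampling rate.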

The characteristics in the voice characteristic holding unit 111 may also be obtained by analyzing the decoded second voice. Examples are the distribution of frequencies in call voice V2 and its volume level. If call voice V2 is analyzed continuously, the content adjustment unit 112 described later can control the content dynamically. In this case, a voice analysis unit may replace the voice characteristic holding unit 111 and feed its results to the content adjustment unit 112 as they are produced. The content adjustment unit 112 may base its adjustment on the frequency distribution, on the volume level, or on both.
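One possible form of the voice analysis unit is sketched below. It computes a per-frame RMS level in dBFS and a per-band power distribution using a naive DFT. The function names and band edges are assumptions. A plain DFT is used instead of an FFT library only to keep the example dependency-free.

```python
import math
from typing import List, Sequence


def frame_levels_dbfs(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS level of each complete frame in dBFS (samples are floats in [-1, 1])."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        levels.append(20 * math.log10(rms) if rms > 0 else float("-inf"))
    return levels


def band_powers(frame: Sequence[float], sample_rate: float, edges_hz: Sequence[float]) -> List[float]:
    """Sum of DFT power per band [edges_hz[b], edges_hz[b+1]); naive O(n^2) DFT."""
    n = len(frame)
    powers = [0.0] * (len(edges_hz) - 1)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        for b in range(len(powers)):
            if edges_hz[b] <= freq < edges_hz[b + 1]:
                powers[b] += re * re + im * im
                break
    return powers
```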

The content adjustment unit 112 (adjustment means) adjusts the content V3 from the content holding unit to suit the characteristics of the call partner's voice. Specifically, when content V3 and call voice V2 are output simultaneously, content V3 could stand out and make V2 hard to hear. The unit prevents this by adjusting the voice frequency band and volume of content V3. Possible adjustments include:
- removing from content V3 any sound in frequency bands above those of call voice V2; and
- reducing the power of a high frequency band, that is, a predetermined range of the voice frequency band.
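An ordinary low-pass filter is one way to remove content above the call voice band. The sketch below assumes floating-point samples. It uses a Hamming-windowed sinc FIR filter with its cutoff at the voice band's upper edge. The filter type, the tap count of 101 and the function names are illustrative choices; the patent does not specify a method.

```python
import math
from typing import List, Sequence


def lowpass_taps(cutoff_hz: float, sample_rate: float, num_taps: int = 101) -> List[float]:
    """Windowed-sinc low-pass FIR (Hamming window), normalized to unity DC gain."""
    fc = cutoff_hz / sample_rate          # cutoff in cycles per sample
    m = num_taps - 1
    taps = []
    for n in range(num_taps):
        x = n - m / 2
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / m)
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Direct-form convolution; output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j < 0:
                break
            acc += t * signal[i - j]
        out.append(acc)
    return out


def limit_to_voice_band(content: Sequence[float], voice_high_hz: float, sample_rate: float) -> List[float]:
    """Remove content above the call voice's upper band edge (e.g. 3400 Hz)."""
    return fir_filter(content, lowpass_taps(voice_high_hz, sample_rate))
```

If the goal is only to reduce the high band rather than remove it, a shelving or band-attenuation filter could be used instead.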

The content adjustment unit 112 may also work band by band against a predetermined ratio (x%) set for each band. The reference is the power of call voice V2 in each voice frequency band. If the power of content V3 in a band exceeds that reference by more than x%, the unit lowers the content's power in that band by a predetermined amount (f(x) [dB]). This enables fine-grained adjustment of content V3 according to call voice V2.
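The per-band rule can be written as follows, where P_c(b) and P_v(b) are the powers of content V3 and call voice V2 in band b. The sketch reads "exceeds by x%" as P_c(b) > P_v(b)(1 + x_b/100). The patent does not define f, so f_hypothetical below is purely an assumption: it cuts by the dB value of the margin. The function names are also assumptions.

```python
import math
from typing import Callable, List, Sequence


def f_hypothetical(x_percent: float) -> float:
    """Hypothetical f(x): cut by the dB equivalent of the x% margin (patent leaves f open)."""
    return 10 * math.log10(1 + x_percent / 100)


def band_attenuation_db(
    content_power: Sequence[float],
    voice_power: Sequence[float],
    x_percent: Sequence[float],
    f: Callable[[float], float] = f_hypothetical,
) -> List[float]:
    """Per-band cut in dB: f(x_b) where content exceeds voice by more than x_b percent, else 0."""
    cuts = []
    for pc, pv, x in zip(content_power, voice_power, x_percent):
        exceeds = pc > pv * (1 + x / 100)
        cuts.append(f(x) if exceeds else 0.0)
    return cuts


def db_to_amplitude_gain(cut_db: float) -> float:
    """Linear amplitude gain that realizes a cut of cut_db decibels."""
    return 10 ** (-cut_db / 20)
```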

The content adjustment unit 112 may use general audio processing such as dynamics processing, filtering or equalization. Alternatively, it may encode and decode the content. In that case, the encoding and decoding method used for the voice call itself is preferred.
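The sketch below roughly illustrates encoding and decoding content with the call's own codec. It passes samples through a continuous mu-law compander with 8-bit quantization, which approximates the distortion of G.711 mu-law. This is only a stand-in: real G.711 uses a segmented table. A call using AMR or another codec would need that codec's encoder and decoder instead.

```python
import math
from typing import List, Sequence

MU = 255.0  # mu-law parameter of G.711 as used in North America and Japan


def mulaw_roundtrip(samples: Sequence[float], bits: int = 8) -> List[float]:
    """Compress, quantize and expand float samples in [-1, 1] with the continuous mu-law curve.

    This approximates the degradation of a G.711 mu-law encode/decode; real G.711
    uses a segmented 8-bit table, so outputs will not match it bit-for-bit.
    """
    steps = 2 ** (bits - 1) - 1  # magnitude levels per polarity (127 for 8 bits)
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)  # compress
        q = round(y * steps) / steps                                     # quantize
        out.append(math.copysign(math.expm1(abs(q) * math.log1p(MU)) / MU, q))  # expand
    return out
```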

The content adjustment unit 112 may change the adjustment method, including whether to adjust at all, when specific conditions are met:
- It may skip adjustment, or adjust only lightly, depending on the codec used for the call or on the voice frequency band of call voice V2. These characteristics are then preferably input from the voice characteristic holding unit 111.
- It may leave content V3 unadjusted, or adjust it only lightly, while the call partner is silent and no call voice V2 is being output. Either the encoded data of call voice V2 or the decoded call voice V2 is then preferably input to the content adjustment unit 112.

A lighter adjustment means, for example, narrowing the frequency band being adjusted or reducing how much the power is raised or lowered.
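The conditional behavior can be sketched as a small policy function. Everything here is hypothetical: the energy threshold used for voice activity, the set of wideband codecs that trigger only a light adjustment, and the band and dB values that stand in for full and light adjustment.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Adjustment:
    band_low_hz: float  # content power is reduced from this frequency upward
    cut_db: float       # size of the reduction (0 means no adjustment)


# All values below are hypothetical illustrations, not taken from the patent.
FULL = Adjustment(band_low_hz=3400.0, cut_db=20.0)
LIGHT = Adjustment(band_low_hz=7000.0, cut_db=6.0)   # narrower band, smaller cut
NO_ADJUSTMENT = Adjustment(band_low_hz=math.inf, cut_db=0.0)
WIDEBAND_CODECS = {"AMR-WB", "G.722"}


def is_voice_active(frame: Sequence[float], threshold_dbfs: float = -45.0) -> bool:
    """Crude energy-based voice activity decision on one decoded frame of call voice V2."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > 0 and 20 * math.log10(rms) > threshold_dbfs


def choose_adjustment(codec: Optional[str], voice_active: bool) -> Adjustment:
    if not voice_active:
        return NO_ADJUSTMENT   # partner silent: leave content V3 as it is
    if codec is not None and codec.upper() in WIDEBAND_CODECS:
        return LIGHT           # wideband call voice: less contrast with the content
    return FULL
```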

The content adjustment unit 112 may also change the adjustment method, including whether to adjust at all, according to where content V3 came from. Content V3 originates either from signal C1 or from signal C2. For example, content V3 originating from signal C1 may be left unadjusted or adjusted only lightly. Information indicating the origin is then preferably input to the content adjustment unit 112. This information may be provided by inputting signals C1 and C2 directly to the unit.

The content adjustment unit 112 may also change the adjustment method, including whether to adjust at all, according to a separate instruction from the user. For example, a user operation may switch adjustment on or off, or change its degree. Instruction information from the user is then preferably input to the content adjustment unit 112 via a user interface.
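The origin-based rule and the user override described above could be combined into a single strength value that scales the adjustment. The enum, the function and the default of leaving the user's own selection unadjusted are illustrative assumptions.

```python
from enum import Enum
from typing import Optional


class Origin(Enum):
    LOCAL = "C1"   # requested by this terminal's own user (signal C1)
    REMOTE = "C2"  # requested by the call partner (signal C2)


def adjustment_strength(origin: Origin, user_setting: Optional[float] = None) -> float:
    """Strength in [0, 1] that scales the adjustment (0 = off, 1 = full).

    A user setting, when given, overrides the origin-based default.
    """
    if user_setting is not None:
        return max(0.0, min(1.0, user_setting))
    # Hypothetical default: leave the user's own selection untouched.
    return 0.0 if origin is Origin.LOCAL else 1.0
```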

Instead of having the content adjustment unit 112 perform the adjustment, adjustment may be achieved by selecting content that was adjusted in advance. In that case, the content holding unit 110 should hold the pre-adjusted content. The content is then selected according to the characteristics input from the voice characteristic holding unit 111. The adjustment applied in advance is preferably the same as the one the content adjustment unit 112 would perform.
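Selecting pre-adjusted content could be as simple as choosing the stored variant whose band limit best fits the call voice band. The sketch below assumes variants keyed by their upper band limit in Hz. The function select_variant and the file names used with it are hypothetical.

```python
from typing import Dict, Optional


def select_variant(variants: Dict[float, str], voice_high_hz: float) -> Optional[str]:
    """Pick the pre-adjusted variant with the widest band limit not above the call voice's.

    variants maps each variant's upper band limit (Hz) to its stored content.
    Returns None when every variant is wider than the call voice band.
    """
    fitting = [limit for limit in variants if limit <= voice_high_hz]
    if not fitting:
        return None
    return variants[max(fitting)]
```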

Instead of taking content V3 from the content holding unit 110, the content adjustment unit 112 may take content V4 received from outside. Content V4 may pass through the content holding unit 110 or bypass it. Two example arrangements:
- The call partner transmits content V4 in place of the second content reproduction instruction signal, and the content adjustment unit 112 takes that content V4.
- A server on the network receives the reproduction instruction signal sent by the call partner and transmits content V4, which the content adjustment unit 112 then takes.

In this case, read content V3 as content V4 in the rest of the description.

The content output unit 113 outputs the content V3 it receives from the content adjustment unit 112. It includes, for example, a speaker. This allows user 1 to hear the reproduced sound of the content V3 that user 2 selected.

Instead of playing content V3 through the content output unit 113, the reproduced sound of content V3 may be superimposed on the decoded call voice V2. The combined call voice is then output from the second voice output unit 106. In that case, the content V3 from the content adjustment unit 112 is preferably superimposed between the second voice decoding unit 105 and the second voice output unit 106.
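Superimposing the adjusted content on the decoded call voice amounts to sample-wise addition with clipping. The sketch below assumes both signals are 16-bit PCM at the same sampling rate and already time-aligned. The function superimpose and its gain parameter are hypothetical.

```python
from typing import List, Sequence


def superimpose(voice: Sequence[int], content: Sequence[int], content_gain: float = 1.0) -> List[int]:
    """Add adjusted content V3 to decoded call voice V2, sample by sample (16-bit PCM).

    Content shorter than the voice is treated as silence; sums are clipped to int16.
    """
    out = []
    for i, v in enumerate(voice):
        c = content[i] if i < len(content) else 0
        s = int(round(v + content_gain * c))
        out.append(max(-32768, min(32767, s)))
    return out
```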

The hardware configuration of the terminal 100 will now be described with reference to FIG. 3, which is a hardware configuration diagram of the terminal 100. As shown in FIG. 3, the terminal 100 can be physically configured as a computer including hardware such as one or more CPUs (Central Processing Units) 21; a RAM (Random Access Memory) 22 and a ROM (Read Only Memory) 23 serving as main storage; a communication module 26 serving as a data transmission/reception device; an auxiliary storage device 27 such as a semiconductor memory; an input device 28, such as an operation panel (including operation buttons) or a touch panel, that accepts user input; and an output device 29 such as a display. Each function of the terminal 100 shown in FIG. 2 can be realized, for example, by loading one or more predetermined pieces of computer software onto hardware such as the CPU 21 and the RAM 22, operating the communication module 26, the input device 28, and the output device 29 under the control of the CPU 21, and reading and writing data in the RAM 22 and the auxiliary storage device 27. The terminal 200 can have the same hardware configuration as the terminal 100.

Next, the operation of the call device according to the present invention (the call method executed by the terminal 100) will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of processing executed in the terminal 100. The processing of this flowchart is executed during a call between user 1 (FIG. 1), who uses the terminal 100, and user 2, who uses the terminal 200.

As shown in FIG. 4, the terminal 100 can execute two or more of four processing flows (S101 to S103, S104 to S109, S110 to S112, and S113 to S117) in parallel.

First, the processing flow of S101 to S103 will be described. The terminal 100 first inputs the first voice (step S101). Specifically, the first voice input unit 101 inputs the call voice V1 of user 1.

The terminal 100 then encodes the first voice (step S102). Specifically, the first voice encoding unit 102 encodes the call voice V1 of user 1.

The terminal 100 then transmits the encoded sequence (step S103). Specifically, the first voice transmission unit 103 transmits the encoded sequence B1 to the terminal 200.
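Steps S101 to S103 amount to a capture, encode, and transmit loop. The patent does not name a codec, so the sketch below substitutes G.711 mu-law purely as a stand-in encoder (mobile calls typically use codecs such as AMR or AMR-WB instead). The `transmit` callback and the iterable of captured frames are assumptions standing in for the first voice transmission unit 103 and the first voice input unit 101. The matching decoder is included because the same stand-in would serve step S111 on the receiving side.

```python
# Stand-in codec: G.711 mu-law (an assumption; the patent does not name a codec).
BIAS = 0x84      # 132, added to the magnitude before the segment search
CLIP = 32635     # largest magnitude that still fits in 15 bits after adding BIAS

def linear_to_ulaw(sample):
    """Encode one 16-bit PCM sample as an 8-bit mu-law code."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):  # locate the highest set bit
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(code):
    """Decode one mu-law code to the midpoint of its quantization interval."""
    code = ~code & 0xFF
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if code & 0x80 else magnitude

def encode_frame(pcm_frame):
    """S102: encode one frame of call voice V1 into bytes (part of sequence B1)."""
    return bytes(linear_to_ulaw(s) for s in pcm_frame)

def decode_frame(encoded):
    """S111 counterpart: decode bytes of sequence B2 back to PCM samples."""
    return [ulaw_to_linear(b) for b in encoded]

def send_voice(captured_frames, transmit):
    """S101-S103: encode each captured frame and hand it to `transmit`."""
    for frame in captured_frames:        # S101: hypothetical frame source
        transmit(encode_frame(frame))    # S102 encode, S103 transmit
```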

Next, the processing flow of S104 to S109 will be described. The terminal 100 first inputs a first content reproduction instruction signal (step S104). Specifically, the first content reproduction instruction signal input unit 107 inputs the first content reproduction instruction signal C1 based on a user operation.

The terminal 100 then transmits the first content reproduction instruction signal (step S105). Specifically, the first content reproduction instruction signal transmission unit 108 transmits the first content reproduction instruction signal C1 to the terminal 200.

If the terminal 200 does not hold the content desired by user 1, content V4 may be transmitted to the terminal 200 instead of the first content reproduction instruction signal C1. In this case, content V3 in the following description is replaced with content V4.
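The choice between sending the instruction signal C1 and sending content V4 itself can be sketched as follows. This assumes the terminal 100 already knows which content the terminal 200 holds, modeled here as a hypothetical `peer_catalog` set (for example, exchanged at call setup); the source does not say how that is determined. `PlayInstruction` and `ContentPayload` are illustrative message types, not part of the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlayInstruction:
    """Hypothetical form of the first content reproduction instruction signal C1."""
    content_id: str

@dataclass(frozen=True)
class ContentPayload:
    """Hypothetical form of content V4 sent in place of C1."""
    content_id: str
    pcm: bytes

def build_content_message(content_id, peer_catalog, local_store):
    """Return C1 if the peer already holds the content, otherwise ship it as V4.

    peer_catalog: set of content IDs the terminal 200 is assumed to hold.
    local_store: mapping of content ID to audio bytes held by the terminal 100.
    """
    if content_id in peer_catalog:
        return PlayInstruction(content_id)
    return ContentPayload(content_id, local_store[content_id])
```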

The terminal 100 also reads the content (step S106). Specifically, the content holding unit 110 outputs, from among the content it holds, the content V3 designated by the first content reproduction instruction signal C1, and the content adjustment unit 112 reads it.

The terminal 100 also reads the voice characteristics (step S107). Specifically, the voice characteristic holding unit 111 outputs the stored characteristics of the call partner's voice, and the content adjustment unit 112 reads them. When the characteristics held in the voice characteristic holding unit 111 are obtained by analyzing the decoded call voice V2, step S107 is executed after step S111, described later.
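When the characteristics are obtained by analyzing the decoded call voice V2, one possible characteristic is the upper edge of the band the voice actually occupies. The sketch below estimates it as the spectral roll-off frequency, i.e. the frequency at which a chosen fraction of the frame's energy has accumulated; this technique, the 0.99 default, and the function names are assumptions for illustration. A naive DFT keeps the example dependency-free; a practical implementation would use an FFT over windowed frames averaged across the call.

```python
import cmath
import math

def power_spectrum(frame):
    """One-sided power spectrum of a real frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]

def estimate_voice_band_edge(frame, sample_rate, rolloff=0.99):
    """Estimate the upper band edge (Hz) of one frame of decoded call voice V2.

    Returns the frequency of the first bin at which the cumulative one-sided
    energy reaches `rolloff` of the total; 0.0 for a silent frame.
    """
    spectrum = power_spectrum(frame)
    total = sum(spectrum)
    if total == 0:
        return 0.0
    n = len(frame)
    cumulative = 0.0
    for k, power in enumerate(spectrum):
        cumulative += power
        if cumulative >= rolloff * total:
            return k * sample_rate / n
    return sample_rate / 2  # guard against floating-point shortfall
```

The returned value is the kind of quantity the voice characteristic holding unit 111 could store for step S107.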

Steps S106 and S107 may be executed in either order: step S107 may follow step S106, or step S106 may follow step S107.

The terminal 100 then adjusts the content (step S108). Specifically, the content adjustment unit 112 adjusts the content V3 output from the content holding unit 110 according to the characteristics of the call partner's voice.
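One way to realize step S108 is to scale the content's spectrum inside a chosen band, in line with the band-power adjustment attributed to the content adjustment unit 112. The hypothetical `adjust_band_power` below applies an amplitude gain (so power scales by the gain squared) to the DFT bins of one content frame that fall inside `band` and resynthesizes the frame. It processes a single frame without windowing or overlap-add, which a real implementation would need to avoid frame-boundary artifacts; the naive O(N^2) DFT is used only to avoid dependencies.

```python
import cmath
import math

def _dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]

def _idft(spectrum):
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def adjust_band_power(frame, sample_rate, band, gain):
    """Scale the components of one content frame inside `band` (Hz) by `gain`.

    Bins k and N-k share a frequency and are scaled together, so the output stays real.
    """
    n = len(frame)
    low, high = band
    spectrum = _dft(frame)
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # fold upper bins onto negative frequencies
        if low <= freq <= high:
            spectrum[k] *= gain
    return _idft(spectrum)
```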

The terminal 100 then outputs the adjusted content (step S109). Specifically, the content output unit 113 outputs the adjusted content V3.

Step S105 and steps S106 to S109 may also be executed in either order: steps S106 to S109 may follow step S105, or step S105 may follow steps S106 to S109.

Next, the processing flow of S110 to S112 will be described. The terminal 100 first receives the call voice V2 (step S110). Specifically, the second voice receiving unit 104 receives the encoded sequence B2.

The terminal 100 then decodes the call voice V2 (step S111). Specifically, the second voice decoding unit 105 decodes the encoded sequence B2.

The terminal 100 then outputs the call voice V2 (step S112). Specifically, the second voice output unit 106 outputs a sound corresponding to the decoded encoded sequence B2 (that is, the call voice V2 of user 2).

Next, the processing flow of S113 to S117 will be described. The terminal 100 first receives a second content reproduction instruction signal (step S113). Specifically, the second content reproduction instruction signal receiving unit 109 receives the second content reproduction instruction signal C2.

The terminal 100 then reads the content (step S114). Specifically, the content holding unit 110 outputs, from among the content it holds, the content V3 designated by the second content reproduction instruction signal C2, and the content adjustment unit 112 reads it.

In steps S113 and S114, content V4 transmitted from the terminal 200 or from an external server may be received and output instead of the second content reproduction instruction signal C2. In this case, content V3 in the following description is replaced with content V4.
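On the receiving side, steps S113 and S114 must therefore handle either a reproduction instruction C2 (look up content V3 locally) or content V4 delivered directly. A minimal sketch, assuming messages arrive as dictionaries with a hypothetical `kind` field; the message layout and the dictionary standing in for the content holding unit 110 are illustrative only.

```python
def resolve_content(message, content_store):
    """Return the audio to hand to the content adjustment unit 112.

    message: {"kind": "instruction", "content_id": ...} for signal C2, or
             {"kind": "content", "samples": [...]} for content V4 (hypothetical layout).
    content_store: mapping of content ID to samples, standing in for unit 110.
    """
    kind = message.get("kind")
    if kind == "instruction":
        return content_store[message["content_id"]]  # V3 held locally
    if kind == "content":
        return message["samples"]                    # V4 received from outside
    raise ValueError(f"unknown message kind: {kind!r}")
```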

The terminal 100 also reads the voice characteristics (step S115). Specifically, the voice characteristic holding unit 111 outputs the stored characteristics of the call partner's voice, and the content adjustment unit 112 reads them. When the characteristics held in the voice characteristic holding unit 111 are obtained by analyzing the decoded call voice V2, step S115 is executed after step S111.

Steps S114 and S115 may be executed in either order: step S115 may follow step S114, or step S114 may follow step S115.

The terminal 100 then adjusts the content (step S116). Specifically, the content adjustment unit 112 adjusts the content V3 output from the content holding unit 110 according to the characteristics of the call partner's voice.

The terminal 100 then outputs the adjusted content (step S117). Specifically, the content output unit 113 outputs the adjusted content V3.

After executing the four processing flows described above (S101 to S103, S104 to S109, S110 to S112, and S113 to S117), the terminal 100 executes them again; the flows may run in parallel. The call between user 1 and user 2 proceeds as the processing of the flowchart of FIG. 4 is repeated in this way.
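The four repeating flows can be sketched with one worker thread per flow, where each flow is a callable that performs one pass (for example S101 to S103) and returns whether the call should continue. This threading model, the `run_call_flows` name, and the return-value convention are assumptions; the patent only states that the flows can run in parallel and repeat. Flows that share state, such as the content adjustment unit 112, would need their own synchronization.

```python
from concurrent.futures import ThreadPoolExecutor

def run_call_flows(flows):
    """Run each flow repeatedly in its own worker thread.

    Each flow is a zero-argument callable performing one pass of its steps and
    returning True while the call continues. Returns the number of passes each
    flow made, in the order the flows were given.
    """
    def repeat(flow):
        passes = 0
        while True:
            passes += 1
            if not flow():
                return passes

    with ThreadPoolExecutor(max_workers=len(flows)) as pool:
        return list(pool.map(repeat, flows))
```

For instance, a flow that ends the call on its third pass reports 3, independently of how many passes the other flows make.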

Next, the operation and effects of the terminal 100 will be described. In the terminal 100, the first voice input unit 101 inputs the call voice V1 of user 1 (step S101), the first voice encoding unit 102 encodes the call voice V1 (step S102), and the first voice transmission unit 103 transmits the encoded sequence B1, which is the encoded call voice V1 (step S103).

In addition, the second voice receiving unit 104 receives the encoded sequence B2, which is the encoded call voice V2 of user 2 (step S110); the second voice decoding unit 105 decodes the encoded sequence B2 (step S111); and the second voice output unit 106 outputs a sound corresponding to the decoded encoded sequence B2, that is, the call voice V2 of user 2 (step S112).

In addition, the first content reproduction instruction signal input unit 107 inputs the first content reproduction instruction signal C1 based on a user operation (step S104), and the first content reproduction instruction signal transmission unit 108 transmits it (step S105). The second content reproduction instruction signal receiving unit 109 receives the second content reproduction instruction signal C2 (step S113). The content holding unit 110 outputs the content V3 it holds that corresponds to the first content reproduction instruction signal C1 or the second content reproduction instruction signal C2, an external sound different from the call voice (step S106 or S114). The voice characteristic holding unit 111 outputs the characteristics of the call partner's voice (step S107 or S115), the content adjustment unit 112 adjusts the content V3 according to those characteristics (step S108 or S116), and the content output unit 113 outputs the adjusted content V3 (step S109 or S117).

The processing executed by the content adjustment unit 112 (step S108 or S116) adjusts the content V3, an external sound different from the call partner's call voice V2, according to the characteristics of the call partner's voice. In the terminal 100, for example, the content adjustment unit 112 identifies a predetermined band within the frequency band of the external sound based on the voice codec used for the call or on the frequency band of the call partner's voice, and adjusts the power in that band. This keeps the call partner's voice easy to hear even when an external sound different from it is output at the same time.
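The band selection described above can be sketched in two steps: derive the call voice's band from the codec or, failing that, from the sampling rate (as in claims 2 and 3), then pick the content band whose power is adjusted (as in claim 4). The codec table holds commonly quoted nominal audio bands for AMR (narrowband) and AMR-WB. Attenuating content above the voice's upper band edge, so that the content does not sound markedly wider-band than the codec-limited voice, is one plausible reading of the patent's motivation, not a rule it states. All names are hypothetical.

```python
# Commonly quoted nominal audio bands (Hz); illustrative, not taken from the patent.
NOMINAL_VOICE_BANDS_HZ = {
    "AMR": (300.0, 3400.0),      # narrowband telephony
    "AMR-WB": (50.0, 7000.0),    # wideband
}

def voice_band_for_call(codec=None, sample_rate=None):
    """Return (low, high) Hz for the call voice from the codec, else from the sampling rate."""
    if codec in NOMINAL_VOICE_BANDS_HZ:
        return NOMINAL_VOICE_BANDS_HZ[codec]
    if sample_rate:
        return (0.0, sample_rate / 2.0)  # Nyquist limit as a fallback
    raise ValueError("need a known codec or a sampling rate")

def band_to_attenuate(voice_band, content_sample_rate):
    """Pick the content band to turn down: everything above the voice's upper edge.

    Returns None when the content has no bandwidth above that edge.
    """
    upper = voice_band[1]
    nyquist = content_sample_rate / 2.0
    return (upper, nyquist) if upper < nyquist else None
```

The chosen band would then drive the band-power adjustment of step S108 or S116.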

The present invention is not limited to the embodiment described above. A configuration that appropriately combines characteristic features of the means and processing steps included in the embodiment can also be an embodiment of the present invention.

DESCRIPTION OF SYMBOLS 100 ... terminal, 101 ... first voice input unit, 102 ... first voice encoding unit, 103 ... first voice transmission unit, 104 ... second voice receiving unit, 105 ... second voice decoding unit, 106 ... second voice output unit, 107 ... first content reproduction instruction signal input unit, 108 ... first content reproduction instruction signal transmission unit, 109 ... second content reproduction instruction signal receiving unit, 110 ... content holding unit, 111 ... voice characteristic holding unit, 112 ... content adjustment unit, 113 ... content output unit, 200 ... terminal, 300 ... server.

Claims

A call device that makes a call via a communication network, comprising:
input means for inputting, via the communication network, a call voice and an external sound different from the call voice;
adjusting means for adjusting the external sound input by the input means according to characteristics of the call voice; and
output means for outputting the external sound adjusted by the adjusting means together with the call voice.

The call device according to claim 1, wherein the adjusting means adjusts the external sound different from the call voice according to a voice frequency band of the call voice.

The call device according to claim 1, wherein the adjusting means adjusts the external sound different from the call voice according to at least one of a type of voice codec used for the call, a sampling rate, or a type of communication device used by the call partner.

The call device according to any one of claims 1 to 3, wherein the adjusting means specifies a predetermined band within the audio frequency band of the external sound based on the characteristics of the call voice, and adjusts the power in the predetermined band.