JP2023020577A - masking device

Info

Publication number: JP2023020577A
Application number: JP2021126014A
Authority: JP
Inventors: 信昭辻; Nobuaki Tsuji
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2023-02-09

Abstract

To provide a masking device that generates and reproduces a masking sound that responds to a person's utterance in real time.

SOLUTION: A masking device 1 includes: a detection unit 111 that detects a voice signal indicating a voice from an output signal output from a microphone; an analysis unit 112 that analyzes the voice signal to generate feature data indicating features of the voice; and a generation unit 114 that generates, based on the feature data, masking data indicating music that masks the voice.

SELECTED DRAWING: Figure 4

Description

The present invention relates to a masking device.

Conventionally, in places such as automobile cabins and the counters of shops and hospitals, technology that outputs a masking sound to drown out conversational voices has been used to prevent third parties from understanding the content of a conversation between people.

For example, Patent Literature 1 discloses a concealment device for concealing conversational voices. The concealment device includes a storage device in which voice data representing the voices of general conversation and music data representing music are stored in advance. The concealment device also includes a concealment data generation device that generates concealment data by synthesizing the voice data and the music data read from the storage device. Further, the concealment device includes a music reproducing device that reproduces the concealment data. By reproducing this concealment data, a conversation between a bank employee and a customer at a bank counter, for example, can be concealed so that a third party cannot hear it.

Patent Literature 1: JP 2012-141524 A

However, the concealment device according to Patent Literature 1 generates concealment data serving as a masking sound by synthesizing voice data and music data that are stored in advance as sample data. That is, the technique according to Patent Literature 1 does not generate a masking sound that responds to human speech in real time.

In view of the above circumstances, an object of one aspect of the present disclosure is to provide a masking device that generates masking data responding to human speech in real time and, based on the generated masking data, reproduces music that masks the human voice.

In order to solve the above problem, a masking device according to one aspect of the present disclosure includes: a detection unit that detects a voice signal indicating a voice from an output signal output from a microphone; an analysis unit that generates feature data indicating features of the voice by analyzing the voice signal; and a generation unit that generates, based on the feature data, masking data indicating music that masks the voice.

FIG. 1 is a block diagram illustrating the configuration of the masking device 1 according to the first embodiment.
FIG. 2 is an example of a plan view of a vehicle C equipped with the masking device 1 according to the first embodiment.
FIG. 3 is an example of a side view of the vehicle C equipped with the masking device 1 according to the first embodiment.
FIG. 4 is a block diagram illustrating the functional configuration of the control device 11.
FIG. 5 is a diagram showing an example of the frequency bands of a plurality of musical instruments to which a plurality of parts included in the music data respectively correspond.
FIG. 6 is a diagram showing the level of each part included in the music data and the masking data output by the generation unit 114.
FIG. 7 is a flowchart showing the operation of the masking device 1 according to the first embodiment.
FIG. 8 is a block diagram illustrating the functional configuration of the control device 11.
FIG. 9 is a diagram showing the level of each part included in the music data and the masking data output by the generation unit 114A.

[1. First Embodiment]
[1-1. Configuration of the First Embodiment]
FIG. 1 is a block diagram illustrating the configuration of a masking device 1 according to the first embodiment of the present disclosure. The masking device 1 generates masking data Dm indicating music that masks a collected human voice according to the features of that voice, and reproduces the music that masks the voice based on the generated masking data Dm. Specifically, the masking device 1 includes a control device 11, a storage device 12, an operation device 13, a sound collection device 14, and a playback device 15.

The control device 11 in FIG. 1 is, for example, one or more processors that control each element of the masking device 1. For example, the control device 11 is composed of one or more types of processors such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).

The storage device 12 is one or more memories composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium. The storage device 12 stores a control program PR1 executed by the control device 11 and various data used by the control device 11, in particular music data Dx. The storage device 12 may be configured as a combination of multiple types of recording media. The storage device 12 may also be a portable recording medium that is detachable from the masking device 1, or an external recording medium (for example, online storage) with which the masking device 1 can communicate via a communication network.

The operation device 13 is an input device that receives instructions from the user. The operation device 13 is, for example, a plurality of controls operable by the user, or a touch panel that detects contact by the user. In particular, the operation device 13 functions as a switch for instructing the start and end of the operation of the masking device 1. The operation device 13 is also used when music data Dx supplied from outside is stored in the storage device 12.

The sound collection device 14 is a microphone that includes a sound collection unit for picking up ambient sound and that converts the collected sound into an electric signal. The sound collection unit may have any configuration as long as it picks up sound; a windproof structure is one example. The ambient sound may include human voice. The sound collection device 14 of this embodiment generates an analog sound signal based on the collected sound. The sound collection device 14 also includes an AD converter that converts the sound signal into sound data Ds. The sound collection device 14 outputs the sound data Ds.

Under the control of the control device 11, the playback device 15 reproduces music based on the masking data Dm generated by the control device 11. The masking data Dm indicates music. The playback device 15 includes a DA converter, an amplifier, and speakers. The masking data Dm, which is a digital signal, is input to the DA converter and converted into a masking signal, which is an analog signal. The amplifier amplifies the masking signal to an amplitude suitable for sound emission by the speakers in the subsequent stage. The music indicated by the amplified masking signal is emitted from the speakers, which serve as sound emitting devices. The masking device 1 according to this embodiment is suitable for use in, for example, the vehicle C shown in FIG. 2; in this case, the speakers mounted in the vehicle C are used as elements of the playback device 15.

FIG. 2 is an example of a plan view of the vehicle C equipped with the masking device 1 according to this embodiment, and FIG. 3 is an example of a side view of the vehicle C.

In the example shown in FIGS. 2 and 3, the cabin R of the vehicle C contains, in addition to the masking device 1, four seats 51 to 54 arranged in a rectangle, a ceiling 6, a front right door 71, a front left door 72, a rear right door 73, and a rear left door 74. Seat 51 is the driver's seat, seat 52 is the front passenger seat, seat 53 is the rear right seat, and seat 54 is the rear left seat. Each of the seats 51 to 54 is made of cloth or leather and is sound absorbing. The seats 51 to 54 face in a common direction. The seats 51 to 54 have headrests 51-1 to 54-1, respectively.

The masking device 1 includes the microphone serving as the sound collection device 14, and a first speaker 15-1, a second speaker 15-2, a third speaker 15-3, and a fourth speaker 15-4, which are elements of the playback device 15. The sound collection device 14 is arranged on the ceiling 6 of the cabin R. The sound collection device 14 preferably includes a first sound collection device 14-1 and a second sound collection device 14-2. In this case, the first sound collection device 14-1 is installed on the ceiling 6 of the cabin R near the front seats 51 and 52, and preferably has directivity so that it easily picks up the voices of the persons seated in seats 51 and 52. Similarly, the second sound collection device 14-2 is installed on the ceiling 6 of the cabin R near the rear seats 53 and 54, and preferably has directivity so that it easily picks up the voices of the persons seated in seats 53 and 54. However, the configuration of the sound collection device 14 is not limited to this. It is preferable that the sound collection device 14 can separately pick up the voices of the persons in the front seats 51 and 52 and the voices of the persons in the rear seats 53 and 54, but any configuration that achieves this may be used.

The first speaker 15-1 is installed in the headrest 51-1. The second speaker 15-2 is installed in the headrest 52-1. The third speaker 15-3 is installed in the headrest 53-1. The fourth speaker 15-4 is installed in the headrest 54-1. These installation locations are examples and are not limiting. For example, the first speaker 15-1 to the fourth speaker 15-4 may instead be installed in the lower parts of the front right door 71, the front left door 72, the rear right door 73, and the rear left door 74, respectively.

When the first sound collection device 14-1 picks up the voices of the persons seated in the front seats 51 and 52, the music indicated by the masking signal is emitted from the third speaker 15-3 installed in the headrest 53-1 of the rear seat 53 and from the fourth speaker 15-4 installed in the headrest 54-1 of the rear seat 54. This is because emitting that music from the front speakers, namely the first speaker 15-1 and the second speaker 15-2, could interfere with the conversation between the front seats.

This makes it possible, for example, to keep a person seated in the rear from hearing the voice of a driver who is talking. The driver in turn gains the reassurance that the conversation is not being overheard from the rear seats and can concentrate on driving.

Conversely, when the second sound collection device 14-2 picks up the voices of the persons seated in the rear seats 53 and 54, the music indicated by the masking signal is emitted from the first speaker 15-1 installed in the headrest 51-1 of the front seat 51 and from the second speaker 15-2 installed in the headrest 52-1 of the front seat 52. This is because emitting that music from the rear speakers, namely the third speaker 15-3 and the fourth speaker 15-4, could interfere with the conversation between the rear seats.

This makes it possible to keep the driver from hearing the conversation of the persons seated in the rear. Furthermore, because music is used as the masking sound, the driver can concentrate on driving.

When both the first sound collection device 14-1 and the second sound collection device 14-2 pick up the voices of persons seated in seats 51 to 54, the music indicated by the masking signal is not emitted from any of the first speaker 15-1 to the fourth speaker 15-4. This avoids disturbing a conversation between the front and rear seats.
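The routing rules described above can be summarized as a small decision function. The following is a minimal sketch, not the patent's implementation: the function name, the boolean inputs, and the string speaker labels are assumptions introduced for illustration, and the case where neither microphone detects a voice is assumed to emit no masking music.

```python
# Speaker labels follow the reference numerals in the description.
FRONT_SPEAKERS = ["15-1", "15-2"]  # headrests of seats 51 and 52
REAR_SPEAKERS = ["15-3", "15-4"]   # headrests of seats 53 and 54


def masking_speakers(front_voice, rear_voice):
    """Return the speakers that should emit the masking music.

    front_voice: True if sound collection device 14-1 picked up a voice.
    rear_voice:  True if sound collection device 14-2 picked up a voice.
    """
    if front_voice and rear_voice:
        # Front and rear occupants may be talking to each other: do not mask.
        return []
    if front_voice:
        # Mask the front conversation for the rear occupants only.
        return list(REAR_SPEAKERS)
    if rear_voice:
        # Mask the rear conversation for the front occupants only.
        return list(FRONT_SPEAKERS)
    # Assumption: with no voice detected, no masking music is needed.
    return []
```

For example, `masking_speakers(True, False)` selects only the rear headrest speakers, so the front occupants' own conversation is not disturbed.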

FIG. 4 is a block diagram illustrating the functional configuration of the control device 11. By reading and executing the control program PR1, the control device 11 functions as a detection unit 111, an analysis unit 112, an acquisition unit 113, a generation unit 114, and a selection unit 115.

The detection unit 111 detects voice data Dv representing a human voice from the sound data Ds output from the sound collection device 14. The voice data Dv has silent sections, which contain no voice, and voice sections, which contain voice. The detection unit 111 is configured, for example, as a band-pass filter whose passband is the voice band. In addition to voice, the sound indicated by the sound data Ds may include driving noise, musical sounds, and the like. The detection unit 111 extracts the voice data Dv from the sound data Ds.

The detection unit 111 also outputs a control signal S to the selection unit 115. When the detection unit 111 detects a voice section in the voice data Dv, the control signal S takes a value indicating "ON". When the detection unit 111 detects a silent section, the control signal S takes a value indicating "OFF".
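As a rough illustration of how a voice-band detector could drive the control signal S, the sketch below measures the energy of one frame inside a voice band with a plain discrete Fourier transform and compares it with a threshold. This is an assumed approach, not the one the description specifies beyond "a band-pass filter": the function names, the threshold value, and the choice of 73 Hz to 1047 Hz as the passband (the range the description gives for human voice in FIG. 5) are all illustrative.

```python
import cmath
import math

# Assumption: use the voice range quoted for FIG. 5 as the passband.
VOICE_BAND_HZ = (73.0, 1047.0)


def band_energy(frame, sample_rate, band=VOICE_BAND_HZ):
    """Return the normalized spectral energy of `frame` inside `band`."""
    n = len(frame)
    low, high = band
    energy = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if low <= freq <= high:
            # Direct DFT of bin k (slow but dependency-free).
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)
            )
            energy += abs(coeff) ** 2
    return energy / (n * n)


def control_signal(frame, sample_rate, threshold=1e-3):
    """Return "ON" for a voice section and "OFF" for a silent section.

    `threshold` is a hypothetical tuning value, not taken from the patent.
    """
    return "ON" if band_energy(frame, sample_rate) >= threshold else "OFF"
```

A production detector would more likely use a streaming filter and hangover logic so that S does not toggle on every short pause; the frame-based form is used here only because it is easy to follow.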

The analysis unit 112 analyzes the voice data Dv detected by the detection unit 111 to generate feature data Df indicating features of the voice. More specifically, the analysis unit 112 generates the feature data Df by analyzing the voice data Dv in a voice section. Here, the "features of the voice" include at least one of the pitch, the level, and the formants of the voice. The "pitch" of the voice is its fundamental frequency. The "level" of the voice is its volume. A "formant" of the voice is a frequency band in the frequency spectrum of the voice whose intensity is higher than that of its surroundings. These bands are called the "first formant", "second formant", "third formant", and so on, in ascending order of frequency. The frequencies of the formants determine the quality of the voice.
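The pitch and level features defined above can be estimated in many ways; the description does not name a method. The sketch below uses a simple autocorrelation peak search for the pitch and the root mean square for the level. The function names, the search range, and the dictionary layout of the feature data are assumptions made for illustration.

```python
import math


def estimate_pitch_hz(frame, sample_rate, fmin=70.0, fmax=1100.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    Returns None when the frame carries no periodic energy (e.g. silence).
    """
    min_lag = max(int(sample_rate / fmax), 1)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def estimate_level(frame):
    """Return the RMS level of the frame as a simple loudness measure."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def analyze(frame, sample_rate):
    """Build a feature record playing the role of the feature data Df."""
    return {
        "pitch_hz": estimate_pitch_hz(frame, sample_rate),
        "level": estimate_level(frame),
    }
```

Formant estimation is omitted here because it typically needs linear prediction or cepstral analysis, which would not fit in a short sketch.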

In particular, when the feature data Df generated by the analysis unit 112 includes the pitch or the formants of the voice, the analysis unit 112 can determine whether the speaker is male or female by analyzing the pitch or the formants. Specifically, when the pitch of the voice is equal to or greater than a predetermined value, the analysis unit 112 determines that the speaker is female. When the pitch of the voice is less than the predetermined value, the analysis unit 112 determines that the speaker is male. Likewise, when the first and second formants of the vowels contained in the voice are equal to or greater than predetermined values, the analysis unit 112 determines that the speaker is female. When the first and second formants of the vowels contained in the voice are less than the predetermined values, the analysis unit 112 determines that the speaker is male.
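The threshold rules in this paragraph translate directly into code. In the sketch below, the function names are illustrative and the default pitch threshold of 165 Hz is a hypothetical value; the description only says "a predetermined value". The formant thresholds are left as required arguments for the same reason, and the mixed case (one formant above and one below its threshold), which the description does not cover, is reported as undetermined.

```python
# Hypothetical threshold; the description does not give a number.
PITCH_THRESHOLD_HZ = 165.0


def classify_by_pitch(pitch_hz, threshold_hz=PITCH_THRESHOLD_HZ):
    """Pitch at or above the threshold -> "female", otherwise "male"."""
    return "female" if pitch_hz >= threshold_hz else "male"


def classify_by_formants(f1_hz, f2_hz, f1_threshold_hz, f2_threshold_hz):
    """Classify from the first and second vowel formants.

    Returns None when the two formants disagree, a case the description
    leaves unspecified.
    """
    if f1_hz >= f1_threshold_hz and f2_hz >= f2_threshold_hz:
        return "female"
    if f1_hz < f1_threshold_hz and f2_hz < f2_threshold_hz:
        return "male"
    return None
```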

The acquisition unit 113 acquires the music data Dx from the storage device 12. As described later, the music indicated by the masking data Dm generated by the masking device 1 includes a plurality of parts in one-to-one correspondence with a plurality of timbres. The music data Dx includes a plurality of part data Dp1, Dp2, ..., Dpn in one-to-one correspondence with these parts, where n is an integer of 2 or more. When the parts need not be distinguished, each is simply referred to as part data Dp.

FIG. 5 is a diagram showing an example of the frequency band of the human voice and the frequency bands of the plurality of timbres to which the plurality of parts included in the music indicated by the masking data Dm respectively correspond. In FIG. 5, the top row shows frequencies. The second row shows chords. The example in FIG. 5 uses the same C chord throughout, with the frequency rising one octave at a time from C0 to C8. The third to ninth rows show the frequency bands of human voices. The tenth to fourteenth rows show the frequency bands of musical instrument sounds.

As shown in FIG. 5, the human voice has a frequency band from approximately 73 Hz to approximately 1047 Hz.

In particular, the bass, a male voice, has a vocal range of approximately D2 to F4, that is, a frequency band from approximately 73 Hz to approximately 350 Hz. The baritone, a male voice, has a vocal range of approximately G2 to G4, that is, a frequency band from approximately 98 Hz to approximately 392 Hz. The tenor, a male voice, has a vocal range of approximately C3 to C5, that is, a frequency band from approximately 131 Hz to approximately 523 Hz. Overall, male voices span a frequency band from approximately 73 Hz to approximately 523 Hz.

The alto, a female voice, has a vocal range of approximately F3 to E5, that is, a frequency band from approximately 175 Hz to approximately 659 Hz. The mezzo-soprano, a female voice, has a vocal range of approximately A3 to A5, that is, a frequency band from approximately 220 Hz to approximately 880 Hz. The soprano, a female voice, has a vocal range of approximately C4 to C6, that is, a frequency band from approximately 262 Hz to approximately 1047 Hz. Overall, female voices span a frequency band from approximately 175 Hz to approximately 1047 Hz.

Musical instrument sounds, on the other hand, span a frequency band from approximately 25 Hz to approximately 4400 Hz, as shown in FIG. 5. For example, the contrabass corresponding to part data Dp1 has a range of approximately E1 to G3, that is, a frequency band from approximately 41 Hz to approximately 196 Hz. The cello corresponding to part data Dp2 has a range of approximately C2 to C5, that is, a frequency band from approximately 65 Hz to approximately 523 Hz. The viola corresponding to part data Dp3 has a range of approximately C3 to C6, that is, a frequency band from approximately 131 Hz to approximately 1047 Hz. The violin corresponding to part data Dp4 has a range of approximately G3 to E7, that is, a frequency band from approximately 196 Hz to approximately 2637 Hz.
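The note names and frequencies quoted for the voices and instruments are consistent with twelve-tone equal temperament tuned to A4 = 440 Hz, although the description does not state the tuning. Under that assumption, a note name can be converted to a frequency as sketched below; the helper name is illustrative, and only natural notes (no sharps or flats) are handled because those are the only ones the text uses.

```python
# Semitone offset of each natural note within an octave, counted from C.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(name, a4_hz=440.0):
    """Convert a note name such as "D2" or "C6" to a frequency in Hz.

    Assumes equal temperament and scientific pitch notation, where A4 has
    MIDI note number 69.
    """
    letter, octave = name[0], int(name[1:])
    midi_number = 12 * (octave + 1) + NOTE_OFFSETS[letter]
    return a4_hz * 2.0 ** ((midi_number - 69) / 12.0)
```

With this mapping, D2 comes out at about 73.4 Hz and C6 at about 1046.5 Hz, matching the "approximately 73 Hz to approximately 1047 Hz" range given for the human voice.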

Comparing the frequency band of the human voice with the frequency bands of instrument sounds, the frequency band of male voices is largely contained in the frequency band of the cello, while the frequency band of female voices is largely contained in the frequency band of the viola.

Each of the parts included in the music indicated by the masking data Dm is associated with a pitch or a formant of the human voice. For example, the cello part may be associated with the voice pitches that indicate a male voice, or with the voice formants that indicate a male voice. Similarly, the viola part may be associated with the voice pitches that indicate a female voice, or with the voice formants that indicate a female voice.

The music data Dx may be MIDI (Musical Instrument Digital Interface) data. In that case, the music data Dx of a given piece of music contains a plurality of part data Dp, each corresponding to one timbre. The timbre corresponding to each part data Dp is not limited to instrument sounds and may also be a non-instrument timbre such as a human voice or a synthesized sound. Alternatively, the music data Dx may be PCM data obtained by sampling a music signal. In that case, the music data Dx may consist of a plurality of PCM data in one-to-one correspondence with a plurality of timbres. When the music data Dx is PCM data in which a plurality of timbres are mixed, a well-known sound source separation technique may be used to decompose the music data Dx into PCM data for each timbre, from which a given timbre (cello, viola, and so on) is selected and used for masking. The plurality of PCM data correspond to the part data Dp1 to Dpn.

Returning to FIG. 4, the generation unit 114 generates the masking data Dm, which indicates music that masks the voice, based on the feature data Df generated by the analysis unit 112. In this embodiment in particular, the generation unit 114 selects, based on the feature data Df, one part data Dp from among the plurality of part data Dp1 to Dpn included in the music data Dx acquired by the acquisition unit 113. The generation unit 114 then generates the masking data Dm by correcting at least one of the pitch and the level of the sound indicated by the selected part data Dp. When the selected part data is denoted Dps, the masking data Dm contains the corrected part data Dps' obtained from Dps, together with the remaining part data Dp, that is, the part data Dp1 to Dpn other than the corrected part data Dps.

More specifically, when the features of the voice include formants, the generation unit 114 selects the part data Dps whose range overlaps the formants of the voice. Alternatively, when the features of the voice include the pitch, the generation unit 114 selects the part data Dps whose range contains the same frequency as the pitch of the voice. For example, when the formants or pitch of the voice correspond to a male voice, the generation unit 114 selects the cello part. When the formants or pitch of the voice correspond to a female voice, the generation unit 114 selects the viola part.

When no suitable part exists, the generation unit 114 selects, from among the existing part data Dp1 to Dpn, the part data Dps whose range is closest to the formants of the voice. Alternatively, the generation unit 114 selects, from among the existing part data Dp1 to Dpn, the part data Dps whose range is closest in frequency to the pitch of the voice.
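The selection rule, including the fallback to the nearest range, might look like the following sketch. The preferred mapping (male voice to cello, female voice to viola) and the frequency ranges come from the description; the function name, the dictionary representation of the parts, and the use of a plain distance in Hz to the nearest range edge as the "closest" measure are assumptions.

```python
# Instrument ranges in Hz as quoted for FIG. 5.
PART_RANGES_HZ = {
    "contrabass": (41.0, 196.0),
    "cello": (65.0, 523.0),
    "viola": (131.0, 1047.0),
    "violin": (196.0, 2637.0),
}

# Preferred part per speaker type, as given in the description.
PREFERRED_PART = {"male": "cello", "female": "viola"}


def select_part(speaker, pitch_hz, part_ranges=PART_RANGES_HZ):
    """Return the name of the part to correct (the part data Dps).

    Falls back to the part whose range lies nearest to the voice pitch
    when the preferred part is not present in the music data.
    """
    preferred = PREFERRED_PART[speaker]
    if preferred in part_ranges:
        return preferred

    def distance_to_range(name):
        low, high = part_ranges[name]
        if low <= pitch_hz <= high:
            return 0.0
        return min(abs(pitch_hz - low), abs(pitch_hz - high))

    return min(part_ranges, key=distance_to_range)
```

For a piece that has no cello part, `select_part("male", 120.0, {"contrabass": (41.0, 196.0), "violin": (196.0, 2637.0)})` falls back to the contrabass, whose range contains 120 Hz.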

The generation unit 114 then corrects the selected part data Dps so that the level of the sound it indicates changes according to the level of the voice, thereby generating the part data Dps'. More specifically, the generation unit 114 corrects the part data Dps so that, when music based on the part data Dps' is emitted from the speakers, the emitted music can mask the voice indicated by the voice data Dv. The generation unit 114 further generates the masking data Dm from the corrected part data Dps' and the remaining part data Dp, that is, the plurality of part data Dp other than the corrected part data Dps. In particular, when the level of the voice detected by the detection unit 111 is high, the generation unit 114 corrects the selected part data Dp so as to raise the level of the sound it indicates according to the loudness of the voice.
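One way to picture the level correction and the assembly of the masking data Dm is the sketch below, which treats each part as a list of PCM samples. The gain rule (scale the selected part until its RMS level reaches a margin above the voice level, and never attenuate it) and the `margin` value are assumptions; the description only requires that the corrected part be loud enough to mask the voice.

```python
import math


def rms(samples):
    """Root mean square level of a list of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def build_masking_data(parts, selected, voice_level, margin=1.5):
    """Return the masking data Dm as a dict of part name -> samples.

    parts:       dict of part name -> samples (the part data Dp1 to Dpn).
    selected:    name of the part to correct (Dps).
    voice_level: RMS level of the detected voice.
    margin:      hypothetical factor by which the part should exceed the voice.
    """
    part_level = rms(parts[selected])
    if part_level > 0.0:
        gain = max(1.0, margin * voice_level / part_level)
    else:
        gain = 1.0  # a silent part cannot be scaled up
    masking = dict(parts)  # uncorrected parts are passed through unchanged
    masking[selected] = [x * gain for x in parts[selected]]  # Dps'
    return masking
```

Because only one part of the music already being played is raised, the other parts stay exactly as they were, which is what keeps the transition unobtrusive for listeners.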

また、本実施形態において、生成部１１４は、取得部１１３が、記憶装置１２から音楽データＤｘを読み出している期間中に、検出部１１１によって音声データＤｖの音声区間が検出された場合、上記の補正を実行することで、マスキングデータＤｍを生成する。更に生成部１１４は、生成したマスキングデータＤｍを、選択部１１５に出力する。 Further, in the present embodiment, when the detection unit 111 detects a voice section of the voice data Dv while the acquisition unit 113 is reading the music data Dx from the storage device 12, the generation unit 114 generates the masking data Dm by executing the above correction. The generation unit 114 then outputs the generated masking data Dm to the selection unit 115.

また、生成部１１４は、マスキングデータＤｍの出力と並行して、取得部１１３から取得した音楽データＤｘを、選択部１１５に出力する。 In parallel with outputting the masking data Dm, the generation unit 114 outputs the music data Dx acquired from the acquisition unit 113 to the selection unit 115.

選択部１１５は、検出部１１１から入力される制御信号Ｓに基づいて、マスキングデータＤｍと音楽データＤｘのうち一方を選択し、再生装置１５に出力する。より詳細には、制御信号Ｓが“ＯＮ”を示す値である場合には、選択部１１５は、マスキングデータＤｍを選択し、選択したマスキングデータＤｍを再生装置１５に出力する。一方で、制御信号Ｓが“ＯＦＦ”を示す値である場合には、選択部１１５は、音楽データＤｘを選択し、選択した音楽データＤｘを再生装置１５に出力する。 The selection unit 115 selects one of the masking data Dm and the music data Dx based on the control signal S input from the detection unit 111 and outputs it to the playback device 15. More specifically, when the control signal S has a value indicating "ON", the selection unit 115 selects the masking data Dm and outputs the selected masking data Dm to the playback device 15. When the control signal S has a value indicating "OFF", the selection unit 115 selects the music data Dx and outputs the selected music data Dx to the playback device 15.

再生装置１５は、ＭＩＤＩデータ又はＰＣＭデータのフォーマットを、音楽データのフォーマットに変換する機能を有する。これにより、再生装置１５は、常時音楽データＤｘの示す音楽を再生しており、その途中で、マスキングデータＤｍの示す音楽を再生するように動作を切り替える。この際、生成部１１４は、元々再生されていた音楽の一パートを示すパートデータＤｐｓを補正する。このため、再生装置１５によって再生される音楽を聴いていた人間にとって、違和感が発生しない。 The playback device 15 has a function of converting the format of MIDI data or PCM data into the format of music data. As a result, the reproduction device 15 always reproduces the music indicated by the music data Dx, and in the middle of the reproduction, switches the operation so as to reproduce the music indicated by the masking data Dm. At this time, the generation unit 114 corrects the part data Dps indicating one part of the music originally reproduced. Therefore, a person listening to the music played back by the playback device 15 does not feel uncomfortable.

図６は、生成部１１４によって出力される音楽データＤｘ及びマスキングデータＤｍに含まれる各パートデータＤｐのレベルを示す図である。なお、図６に示す例は、音声データＤｖによって示される人間の音声が男性の音声である場合を示す。時刻ｔ１の時点で、生成部１１４は、あらかじめ音楽データＤｘとして、チェロのパートデータＤｐ２と、その他のパートデータＤｐ１、Ｄｐ３及びＤｐ４とをパラレルに選択部１１５に対して出力しておく。この間、選択部１１５は、音楽データＤｘを再生装置１５に出力する。時刻ｔ２の時点で、検出部１１１が人間の音声を検出すると、分析部１１２が、当該音声のレベルと、当該音声のピッチ、及びフォルマントのうち少なくとも１つを含む音声の特徴を示す特徴データＤｆを生成する。生成部１１４は、音域が当該音声のピッチと同じ周波数を含むパートデータＤｐ、あるいは、音域が当該音声のフォルマントに重なるパートデータＤｐとして、チェロのパートデータＤｐ２を選択する。更に、生成部１１４は、当該音声のレベルに応じて、チェロのパートデータＤｐ２の示す音のレベルを上げるように、当該パートデータＤｐ２を補正し、パートデータＤｐ２’を生成する。生成部１１４は、音のレベルを上げたチェロのパートデータＤｐ２を含むマスキングデータＤｍを、再生装置１５に出力する。マスキングデータＤｍに含まれる他のパートデータＤｐ１、Ｄｐ３及びＤｐ４に関しては、引き続き音のレベルが変更されることがない。選択部１１５は、制御信号Ｓに基づいて、音楽データＤｘとマスキングデータＤｍとからマスキングデータＤｍを選択し、選択したマスキングデータＤｍを再生装置１５に出力する。時刻ｔ３の時点で、検出部１１１が人間の音声を検出しなくなると、生成部１１４は、チェロのパートデータＤｐ２のレベルを元に戻す。その上で、生成部１１４は、チェロのパートデータＤｐ２とその他のパートデータＤｐ１、Ｄｐ３及びＤｐ４を含む音楽データＤｘを再生装置１５に出力し続ける。 FIG. 6 is a diagram showing the level of each part data Dp included in the music data Dx and the masking data Dm output by the generation unit 114. The example in FIG. 6 shows a case where the human voice indicated by the voice data Dv is a male voice. At time t1, the generation unit 114 is already outputting, as the music data Dx, the cello part data Dp2 and the other part data Dp1, Dp3, and Dp4 in parallel to the selection unit 115. During this period, the selection unit 115 outputs the music data Dx to the playback device 15. At time t2, when the detection unit 111 detects a human voice, the analysis unit 112 generates feature data Df indicating features of the voice, including at least one of the level, the pitch, and the formants of the voice. The generation unit 114 selects the cello part data Dp2 as the part data Dp whose range includes the same frequency as the pitch of the voice, or as the part data Dp whose range overlaps the formants of the voice. Furthermore, the generation unit 114 corrects the cello part data Dp2 so as to raise the level of the sound it indicates in accordance with the level of the voice, thereby generating part data Dp2'. The generation unit 114 outputs the masking data Dm, which includes the cello part data Dp2 with the raised sound level, to the playback device 15. The sound levels of the other part data Dp1, Dp3, and Dp4 included in the masking data Dm remain unchanged. Based on the control signal S, the selection unit 115 selects the masking data Dm from between the music data Dx and the masking data Dm, and outputs the selected masking data Dm to the playback device 15. At time t3, when the detection unit 111 no longer detects a human voice, the generation unit 114 returns the level of the cello part data Dp2 to its original level. The generation unit 114 then continues to output the music data Dx, which includes the cello part data Dp2 and the other part data Dp1, Dp3, and Dp4, to the playback device 15.

生成部１１４は、音のレベルに係る補正の代わりに、あるいは音のレベルに係る補正に加えて、検出部１１１によって検出された音声データＤｖによって示される音声のピッチに、選択したパートデータＤｐの示す音のピッチを近づけるように、当該選択したパートデータＤｐを補正し、パートデータＤｐ’を生成してもよい。音声のピッチと補正後の音のピッチとの差分は、音声のピッチと補正前の音のピッチとの差分より小さい。従って、音声のピッチと補正後の音のピッチとは、不一致であってよい。 Instead of, or in addition to, the correction of the sound level, the generation unit 114 may correct the selected part data Dp so that the pitch of the sound it indicates approaches the pitch of the voice indicated by the voice data Dv detected by the detection unit 111, thereby generating part data Dp'. The difference between the pitch of the voice and the pitch of the corrected sound is smaller than the difference between the pitch of the voice and the pitch of the uncorrected sound. Accordingly, the pitch of the voice and the pitch of the corrected sound need not coincide.

より詳細には、生成部１１４は、人間の音声のピッチに応じて、選択したパートデータＤｐの示す音のキーをオクターブ単位で上下させるように、当該選択したパートデータＤｐを補正し、パートデータＤｐ’を生成してもよい。これにより、生成部１１４は、音楽データＤｘが示す音楽の曲調を変更することなく、楽曲として成立させた状態で、選択したパートデータＤｐのみを補正することが可能となる。 More specifically, the generation unit 114 may correct the selected part data Dp so as to shift the key of the sound it indicates up or down in octave units in accordance with the pitch of the human voice, thereby generating part data Dp'. This allows the generation unit 114 to correct only the selected part data Dp without changing the character of the music indicated by the music data Dx, so that the result still holds together as a piece of music.

あるいは生成部１１４は、人間の音声のピッチに応じて、選択したパートデータＤｐの示す音のコードを半音単位で上下させるように、選択したパートデータＤｐを補正し、パートデータＤｐ’を生成してもよい。これにより、音楽データＤｘが示す音楽の曲調は変わるものの、生成部１１４は、選択したパートデータＤｐの示す音のピッチを微調整することが可能となる。このように音のピッチを補正することによって、音のピッチが音声のピッチに近づくので、マスキングの効果が向上する。 Alternatively, the generation unit 114 may correct the selected part data Dp so as to shift the chord of the sound it indicates up or down in semitone units in accordance with the pitch of the human voice, thereby generating part data Dp'. Although this changes the character of the music indicated by the music data Dx, it allows the generation unit 114 to finely adjust the pitch of the sound indicated by the selected part data Dp. Correcting the pitch of the sound in this way brings it closer to the pitch of the voice, which improves the masking effect.
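The two pitch corrections described above, in octave units and in semitone units, can be compared with a short Python sketch. It is illustrative only: it assumes that the pitch of the selected part is available as a MIDI note number, that tuning is A4 = 440 Hz, and the function names are hypothetical.

```python
import math


def hz_to_midi(freq_hz):
    """Convert a frequency to a (possibly fractional) MIDI note number, A4 = 440 Hz."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def octave_shift(part_note, voice_pitch_hz):
    """Whole-octave transposition, in semitones, that brings part_note closest to the voice."""
    diff = hz_to_midi(voice_pitch_hz) - part_note
    return 12 * round(diff / 12.0)


def semitone_shift(part_note, voice_pitch_hz):
    """Semitone transposition that brings part_note closest to the voice."""
    return round(hz_to_midi(voice_pitch_hz) - part_note)


# A part playing C5 (note 72) against a voice at 110 Hz (A2, note 45).
print(octave_shift(72, 110.0))    # moves the part two octaves down, to note 48
print(semitone_shift(72, 110.0))  # moves the part onto note 45 itself
```

The octave shift leaves a residual of up to six semitones but keeps every note in its original pitch class, which matches the statement that the corrected pitch only needs to be closer to the voice, not equal to it. The semitone shift removes the residual at the cost of transposing the part relative to the rest of the music.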

〔１－２．第１実施形態の動作〕 [1-2. Operation of the First Embodiment]
図７は、第１実施形態に係るマスキング装置１の動作を示すフローチャートである。以下、図７を参照することにより、第１実施形態に係るマスキング装置１の動作について説明する。 FIG. 7 is a flowchart showing the operation of the masking device 1 according to the first embodiment. The operation of the masking device 1 according to the first embodiment is described below with reference to FIG. 7.

ステップＳ１において、取得部１１３は、記憶装置１２から音楽データＤｘを取得する。 In step S1, the acquisition unit 113 acquires the music data Dx from the storage device 12.

ステップＳ２において、生成部１１４は、取得部１１３から取得した音楽データＤｘを、選択部１１５に出力する。選択部１１５は、音楽データＤｘを再生装置１５に出力する。 In step S2, the generation unit 114 outputs the music data Dx acquired from the acquisition unit 113 to the selection unit 115. The selection unit 115 outputs the music data Dx to the playback device 15.

ステップＳ３において、検出部１１１によって人間の音声が検出された場合（Ｓ３：ＹＥＳ）には、マスキング装置１はステップＳ４の処理を実行する。検出部１１１によって人間の音声が検出されていない場合（Ｓ３：ＮＯ）には、マスキング装置１は、ステップＳ２の処理を実行する。 In step S3, when human voice is detected by the detection unit 111 (S3: YES), the masking device 1 executes the process of step S4. If the detection unit 111 does not detect human voice (S3: NO), the masking device 1 executes the process of step S2.

ステップＳ４において、分析部１１２は、検出部１１１によって検出された音声信号を分析することによって、音声の特徴を示す特徴データＤｆを生成する。 In step S4, the analysis unit 112 analyzes the audio signal detected by the detection unit 111 to generate feature data Df indicating audio features.

ステップＳ５において、生成部１１４は、分析部１１２によって生成された特徴データＤｆに基づいて、音声をマスキングする音楽を示すマスキングデータＤｍを生成する。より詳細には、ステップＳ５において、生成部１１４は、特徴データＤｆに基づいて、取得部１１３によって取得された音楽データＤｘに含まれる複数のパートデータＤｐのうち、１つのパートデータＤｐｓを選択する。次に、生成部１１４は、選択したパートデータＤｐｓの示す音のレベルを特徴データＤｆに応じて変更するように、当該選択したパートデータＤｐｓを補正し、パートデータＤｐｓ’を生成する。更に、生成部１１４は、パートデータＤｐｓ’と、複数のパートデータＤｐのうち、当該補正の対象となったパートデータＤｐｓを除いたパートデータＤｐとから、マスキングデータＤｍを生成する。なお、生成部１１４は、選択したパートデータＤｐｓの示す音のレベルの代わりに、あるいは音のレベルに加えて、音のピッチを特徴データＤｆに応じて変更してもよい。とりわけ、生成部１１４は、取得部１１３が記憶装置１２から音楽データＤｘを読み出している期間中に、検出部１１１によって音声区間が検出された場合、上記の補正を実行する。 In step S5, the generation unit 114 generates masking data Dm indicating music that masks the voice, based on the feature data Df generated by the analysis unit 112. More specifically, in step S5, the generation unit 114 selects, based on the feature data Df, one part data Dps from among the plurality of part data Dp included in the music data Dx acquired by the acquisition unit 113. Next, the generation unit 114 corrects the selected part data Dps so that the level of the sound it indicates changes in accordance with the feature data Df, thereby generating part data Dps'. Furthermore, the generation unit 114 generates the masking data Dm from the part data Dps' and the part data Dp other than the corrected part data Dps among the plurality of part data Dp. The generation unit 114 may change the pitch of the sound indicated by the selected part data Dps in accordance with the feature data Df instead of, or in addition to, the sound level. In particular, the generation unit 114 performs the above correction when the detection unit 111 detects a voice section while the acquisition unit 113 is reading the music data Dx from the storage device 12.

ステップＳ６において、生成部１１４は、生成したマスキングデータＤｍを、選択部１１５に出力する。選択部１１５は、マスキングデータＤｍを再生装置１５に出力する。 In step S6, the generation unit 114 outputs the generated masking data Dm to the selection unit 115. The selection unit 115 outputs the masking data Dm to the playback device 15.
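The flow of steps S2 to S6 can be sketched as a loop over microphone frames. The Python below is a simplified illustration under several assumptions that are not in the description: voice detection is replaced by a plain RMS threshold, the feature data contains only the level, the selected part is fixed by name, and each part is represented by a single linear level value. All names and default values are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square level of one non-empty frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def run_masking(frames, part_levels, selected="cello", threshold=0.05, margin=2.0):
    """Yield ("Dx", levels) or ("Dm", levels) for each microphone frame."""
    for frame in frames:
        voice_level = rms(frame)
        if voice_level < threshold:
            # S3: NO, so back to S2: pass the unmodified music data Dx through.
            yield "Dx", dict(part_levels)
            continue
        feature_data = {"level": voice_level}        # S4: analyse the voice
        masking_levels = dict(part_levels)           # S5: correct the selected part only
        masking_levels[selected] = max(
            part_levels[selected], margin * feature_data["level"]
        )
        yield "Dm", masking_levels                   # S6: output the masking data Dm


frames = [[0.0, 0.0], [0.5, -0.5], [0.0, 0.0]]
levels = {"cello": 0.2, "violin": 0.2}
for label, out in run_masking(frames, levels):
    print(label, out)
```

Only the middle frame contains a voice, so only that frame yields masking data with a raised cello level; the frames before and after it yield the original music data, which corresponds to the return to the original level at time t3 in FIG. 6.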

〔２．第２実施形態〕 [2. Second Embodiment]
以下、本開示の第２実施形態に係るマスキング装置１について説明する。第２実施形態に係るマスキング装置１に備わる構成要素のうち、第１実施形態に係るマスキング装置１に備わる構成要素と同一の構成要素については、同一の符号を用いると共に、その機能の説明を省略する。 A masking device 1 according to a second embodiment of the present disclosure is described below. Components of the masking device 1 according to the second embodiment that are identical to those of the masking device 1 according to the first embodiment are given the same reference numerals, and descriptions of their functions are omitted.

〔２－１．第２実施形態の構成〕 [2-1. Configuration of the Second Embodiment]
図８は、第２実施形態に係るマスキング装置１が備える制御装置１１の機能的な構成を例示するブロック図である。第２実施形態に係るマスキング装置１は、第１実施形態に係るマスキング装置１に備わる生成部１１４の代わりに、生成部１１４Ａを備える。 FIG. 8 is a block diagram illustrating the functional configuration of the control device 11 included in the masking device 1 according to the second embodiment. The masking device 1 according to the second embodiment includes a generation unit 114A instead of the generation unit 114 included in the masking device 1 according to the first embodiment.

生成部１１４Ａは、複数のパートデータＤｐのうち所定のパートデータＤｐを音楽データＤｘとして、選択部１１５に出力する。一方で、生成部１１４Ａは生成部１１４と同様の補正を実行する。その上で、生成部１１４Ａは、上記の所定のパートデータＤｐと、補正後の一のパートデータＤｐｓ’とを含むマスキングデータＤｍを、選択部１１５に出力する。 The generation unit 114A outputs predetermined part data Dp among the plurality of part data Dp to the selection unit 115 as the music data Dx. The generation unit 114A also performs the same correction as the generation unit 114. The generation unit 114A then outputs masking data Dm, which includes the predetermined part data Dp and the corrected part data Dps', to the selection unit 115.

図９は、生成部１１４Ａによって生成されるマスキングデータＤｍに含まれる各パートデータＤｐのレベルを示す図である。なお、図９に示す例は、人間の音声が男性の音声である場合を示す。時刻ｔ１の時点で、生成部１１４は、あらかじめ音楽データＤｘとして、チェロ以外のその他のパートデータＤｐを、選択部１１５に対して出力しておく。「その他のパートデータ」は、例えばバイオリンのパートデータＤｐ４である。この間、選択部１１５は、音楽データＤｘを再生装置１５に出力する。時刻ｔ２の時点で、検出部１１１が人間の音声を検出すると、分析部１１２が、当該音声のピッチ、レベル、及びフォルマントのうち、少なくとも１つを含む音声の特徴データＤｆを生成する。生成部１１４は、音域が当該音声のピッチと同じ周波数を含むパートデータＤｐ、あるいは、音域が当該音声のフォルマントに重なるパートデータＤｐとして、チェロのパートデータＤｐ２を選択する。更に、生成部１１４は、チェロのパートデータＤｐ２によって示される音のレベルを、当該音声のレベルに応じて変更するように、当該チェロのパートデータＤｐ２を補正し、パートデータＤｐ２’を生成する。生成部１１４は、音のレベルを補正したチェロのパートデータＤｐ２’と、音のレベルを補正していないバイオリンのパートデータＤｐ４とを含むマスキングデータＤｍを、選択部１１５に出力する。選択部１１５は、制御信号Ｓに基づいて、音楽データＤｘとマスキングデータＤｍとからマスキングデータＤｍを選択し、選択したマスキングデータＤｍを再生装置１５に出力する。時刻ｔ３の時点で、検出部１１１が人間の音声を検出しなくなると、生成部１１４は、チェロの補正後のパートデータＤｐ２’の出力を停止する。その上で、生成部１１４は、その他のパートデータＤｐであるバイオリンのパートデータＤｐ４を、音楽データＤｘとして選択部１１５に出力し続ける。選択部１１５は、音楽データＤｘを再生装置１５に出力する。 FIG. 9 is a diagram showing the level of each part data Dp included in the masking data Dm generated by the generation unit 114A. The example in FIG. 9 shows a case where the human voice is a male voice. At time t1, the generation unit 114 is already outputting, as the music data Dx, part data Dp other than the cello to the selection unit 115. The "other part data" is, for example, the violin part data Dp4. During this period, the selection unit 115 outputs the music data Dx to the playback device 15. At time t2, when the detection unit 111 detects a human voice, the analysis unit 112 generates feature data Df of the voice including at least one of the pitch, the level, and the formants of the voice. The generation unit 114 selects the cello part data Dp2 as the part data Dp whose range includes the same frequency as the pitch of the voice, or as the part data Dp whose range overlaps the formants of the voice. Furthermore, the generation unit 114 corrects the cello part data Dp2 so that the level of the sound it indicates changes in accordance with the level of the voice, thereby generating part data Dp2'. The generation unit 114 outputs masking data Dm, which includes the level-corrected cello part data Dp2' and the uncorrected violin part data Dp4, to the selection unit 115. Based on the control signal S, the selection unit 115 selects the masking data Dm from between the music data Dx and the masking data Dm, and outputs the selected masking data Dm to the playback device 15. At time t3, when the detection unit 111 no longer detects a human voice, the generation unit 114 stops outputting the corrected cello part data Dp2'. The generation unit 114 then continues to output the violin part data Dp4, which is the other part data Dp, to the selection unit 115 as the music data Dx. The selection unit 115 outputs the music data Dx to the playback device 15.
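The difference from the first embodiment is that the selected part is absent until a voice is detected and is inserted only while the voice lasts. A minimal Python sketch of that behaviour follows; the function name, the dictionary representation of part levels, and the numeric levels are assumptions made for illustration.

```python
def parts_to_output(voice_detected, base_levels, inserted_name, inserted_level):
    """Return the part levels output by the generation unit in the second embodiment."""
    output = dict(base_levels)  # predetermined part data Dp, always present
    if voice_detected:
        # Corrected part data Dps' is added only while a voice is detected.
        output[inserted_name] = inserted_level
    return output


base = {"violin": 0.2}
print(parts_to_output(False, base, "cello", 0.6))  # music data Dx: violin only
print(parts_to_output(True, base, "cello", 0.6))   # masking data Dm: violin plus cello
```

In the first embodiment the cello would already be present in the first call at its original level; here it appears only in the second call.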

〔２－２．第２実施形態の動作〕 [2-2. Operation of the Second Embodiment]
第２実施形態に係るマスキング装置１の動作は、基本的には、第１実施形態に係るマスキング装置１の動作と同様であるため、その図示を省略する。 The operation of the masking device 1 according to the second embodiment is basically the same as that of the masking device 1 according to the first embodiment, so its illustration is omitted.

ステップＳ２において、生成部１１４Ａは、音楽データＤｘに含まれる複数のパートデータＤｐのうち所定のパートデータＤｐを、選択部１１５に出力する。選択部１１５は、所定のパートデータＤｐを音楽データＤｘとして再生装置１５に出力する。 In step S2, the generation unit 114A outputs predetermined part data Dp among the plurality of part data Dp included in the music data Dx to the selection unit 115. The selection unit 115 outputs the predetermined part data Dp to the playback device 15 as the music data Dx.

ステップＳ５において、生成部１１４Ａは、生成部１１４と同様の補正を実行し、ステップＳ２における所定のパートデータＤｐと、補正後の一のパートデータＤｐｓ’とを含むマスキングデータＤｍを、生成する。 In step S5, the generation unit 114A performs the same correction as the generation unit 114 and generates masking data Dm including the predetermined part data Dp of step S2 and the corrected part data Dps'.

〔３．変形例〕 [3. Modifications]
以上の実施態様は多様に変形され得る。具体的な変形の態様を以下に例示する。以下の例示から任意に選択された２以上の態様は相矛盾しない限り適宜に併合され得る。 The above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict each other.

〔３－１．変形例１〕 [3-1. Modification 1]
上記の第１実施形態及び第２実施形態において、生成部１１４及び１１４Ａは、取得部１１３によって記憶装置１２から取得された音楽データＤｘを補正することにより、マスキングデータＤｍを生成していた。しかし、本発明の実施態様におけるマスキングデータＤｍの生成方法は、これには限定されない。例えば、生成部１１４及び１１４Ａは、新たな曲を生成し、生成した曲に対応するマスキングデータＤｍを生成してもよい。例えば、生成部１１４及び１１４Ａは、指定されたキー及びコードに基づいて自動で作曲又は伴奏する従来技術を適用することにより、新たな曲を生成してもよい。この場合、生成部１１４及び１１４Ａは、検出部１１１によって検出された人間の音声のピッチに基づいてキーを決定し、予め選択されたコードに基づいて、自動で新たな曲を生成してもよい。 In the first and second embodiments described above, the generation units 114 and 114A generate the masking data Dm by correcting the music data Dx acquired from the storage device 12 by the acquisition unit 113. However, the method of generating the masking data Dm in embodiments of the present invention is not limited to this. For example, the generation units 114 and 114A may generate a new piece of music and generate masking data Dm corresponding to the generated piece. For example, the generation units 114 and 114A may generate a new piece of music by applying a conventional technique that automatically composes or accompanies based on a designated key and chords. In this case, the generation units 114 and 114A may determine the key based on the pitch of the human voice detected by the detection unit 111 and automatically generate a new piece of music based on chords selected in advance.
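How a key might be derived from the detected voice pitch can be illustrated as follows. This Python sketch is only one possible reading of Modification 1: it assumes equal temperament with A4 = 440 Hz, takes the pitch class nearest to the voice pitch as the key, and represents the pre-selected chords as semitone offsets from the key. The names `key_from_pitch` and `chord_roots` are hypothetical, and no actual automatic composition is performed.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def key_from_pitch(voice_pitch_hz):
    """Name of the pitch class nearest to the voice pitch (A4 = 440 Hz)."""
    midi = round(69 + 12 * math.log2(voice_pitch_hz / 440.0))
    return NOTE_NAMES[midi % 12]


def chord_roots(key, offsets):
    """Root notes of pre-selected chords given as semitone offsets from the key."""
    base = NOTE_NAMES.index(key)
    return [NOTE_NAMES[(base + off) % 12] for off in offsets]


key = key_from_pitch(110.0)          # a voice at 110 Hz
roots = chord_roots(key, (0, 5, 7))  # tonic, subdominant, dominant
print(key, roots)
```

The resulting key and chord roots would then be handed to whatever automatic composition or accompaniment technique is used; that step is outside the scope of this sketch.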

〔３－２．変形例２〕 [3-2. Modification 2]
上記の第１実施形態及び第２実施形態において、再生装置１５は、生成部１１４から出力されるマスキングデータＤｍに基づいて、マスキング音としての音楽を再生していた。本変形例において、当該再生装置１５は、更に、検出部１１１によって人間の音声が検出された場合に特化して、マスキング音としての音楽を再生してもよい。 In the first and second embodiments described above, the playback device 15 plays music as a masking sound based on the masking data Dm output from the generation unit 114. In this modification, the playback device 15 may further play music as a masking sound specifically when a human voice is detected by the detection unit 111.

〔４．付記〕 [4. Supplementary Notes]
上述した実施形態等から、例えば以下のような態様が把握される。 From the embodiments and modifications described above, the following aspects, for example, can be understood.

本開示の態様（第１態様）に係るマスキング装置１は、収音装置１４から出力される出力信号から音声を示す音声信号を検出する検出部１１１を備える。また、当該マスキング装置１は、音声信号を分析することによって、音声の特徴を示す特徴データＤｆを生成する分析部１１２を備える。更に、当該マスキング装置１は、特徴データＤｆに基づいて、音声をマスキングする音楽を示すマスキングデータＤｍを生成する生成部１１４を備える。 The masking device 1 according to the aspect (first aspect) of the present disclosure includes a detection unit 111 that detects an audio signal representing audio from an output signal output from the sound pickup device 14 . The masking device 1 also includes an analysis unit 112 that analyzes the audio signal to generate feature data Df that indicates the feature of the audio. Further, the masking device 1 includes a generation unit 114 that generates masking data Dm representing music for masking voice based on the feature data Df.

この構成を有することにより、検出部１１１によって人間の音声をリアルタイムで検出し、分析部１１２で、音声の特徴を抽出し、生成部１１４で音声の特徴に応じた音楽データＤｘを生成することが可能となる。このため、マスキング装置１は、人間の発話に対してリアルタイムで対応するマスキングデータＤｍを生成し、生成されたマスキングデータＤｍに基づいて、人間の音声をマスキングする音楽を再生できる。また、マスキングに用いる音が音楽であるため、長時間聴いても疲れないといった利点がある。 With this configuration, the detection unit 111 can detect a human voice in real time, the analysis unit 112 can extract the features of the voice, and the generation unit 114 can generate music data Dx corresponding to those features. The masking device 1 can therefore generate masking data Dm that responds to human speech in real time and, based on the generated masking data Dm, play music that masks the human voice. In addition, because the sound used for masking is music, it has the advantage of not being tiring even when listened to for a long time.

また、第１態様の例（第２態様）において、音声の特徴は、音声のピッチ、音声のレベル、及び音声のフォルマントのうち、少なくとも１つを含む。 In addition, in the example of the first aspect (second aspect), the speech features include at least one of speech pitch, speech level, and speech formants.

この構成を有することにより、具体的な特徴として、人間の音声のピッチ、レベル、及びフォルマントのうち少なくとも１つに応じて、マスキング音としての音楽を示すマスキングデータＤｍを生成することが可能になる。例えば、人間の音声のピッチやフォルマントに応じて、当該音声を発話したのが男性か女性かを判別し、判別結果に応じて、マスキング音を生成することが可能となる。 With this configuration, masking data Dm indicating music as a masking sound can be generated in accordance with at least one concrete feature among the pitch, level, and formants of the human voice. For example, whether the speaker is male or female can be determined from the pitch or formants of the voice, and the masking sound can be generated in accordance with the result of that determination.

また、第１態様の例（第３態様）は、音楽を示す音楽データＤｘを取得する取得部１１３を更に備える。生成部１１４は、特徴データＤｆに基づいて、音楽データＤｘを補正することにより、マスキングデータＤｍを生成する。 Further, the example of the first mode (third mode) further includes an acquisition unit 113 that acquires music data Dx representing music. The generator 114 generates masking data Dm by correcting the music data Dx based on the characteristic data Df.

この構成を有することにより、予め記憶された音楽データＤｘを補正してマスキング音を示すマスキングデータＤｍを生成することで、簡便にマスキング音を生成することが可能となる。 With this configuration, the masking sound can be easily generated by correcting the pre-stored music data Dx to generate the masking data Dm representing the masking sound.

また、第１態様の例（第４態様）において、上記の音楽は、複数の音色と１対１に対応する複数のパートを含む。また、上記の音楽データＤｘは、複数のパートと１対１に対応する複数のパートデータＤｐを含む。また、生成部１１４は、特徴データＤｆに基づいて、複数のパートデータＤｐのうち一のパートデータＤｐｓを選択する。更に、生成部１１４は、特徴データＤｆに基づいて、一のパートデータＤｐｓの示す音のピッチ、及び一のパートデータＤｐｓの示す音のレベルのうち少なくとも１つを補正することにより、マスキングデータＤｍを生成する。 In an example of the first aspect (fourth aspect), the music includes a plurality of parts in one-to-one correspondence with a plurality of timbres. The music data Dx includes a plurality of part data Dp in one-to-one correspondence with the plurality of parts. The generation unit 114 selects one part data Dps from among the plurality of part data Dp based on the feature data Df. Furthermore, the generation unit 114 generates the masking data Dm by correcting, based on the feature data Df, at least one of the pitch of the sound indicated by the one part data Dps and the level of the sound indicated by the one part data Dps.

この構成を有することにより、人間の音声の特徴に応じて、音楽データＤｘによって示される音楽内で発せられる音のピッチ、及び音のレベルのうち少なくとも１つを補正することで、マスキング音を示すマスキングデータＤｍを生成することが可能となる。 With this configuration, masking data Dm indicating a masking sound can be generated by correcting, in accordance with the features of the human voice, at least one of the pitch and the level of a sound produced within the music indicated by the music data Dx.

また、第１態様の例（第５態様）において、音声の特徴は、音声のフォルマントと音声のレベルとを含む。生成部１１４は、複数のパートデータＤｐのうち、音域が特徴データＤｆの示す音声のフォルマントに重なる一のパートデータＤｐｓを選択し、特徴データＤｆの示す音声のレベルに応じて、選択した一のパートデータＤｐｓの示す音のレベルを変更するように、当該選択した一のパートデータＤｐｓを変更する。 In addition, in the example of the first aspect (fifth aspect), the speech features include speech formants and speech levels. The generation unit 114 selects one part data Dps whose range overlaps the formant of the voice indicated by the feature data Df from among the plurality of part data Dp, and selects one part data Dps according to the level of the voice indicated by the feature data Df. The selected one part data Dps is changed so as to change the sound level indicated by the part data Dps.

この構成を有することにより、例えば、人間の音声が男性の音声か女性の音声かに応じて、パートデータＤｐｓを選択し、選択したパートデータＤｐｓのレベルを、人間の音声のレベルに合わせることが可能となる。 With this configuration, for example, the part data Dps can be selected according to whether the human voice is male or female, and the level of the selected part data Dps can be matched to the level of the human voice.

また、第１態様の例（第６態様）において、音声の特徴は、音声のピッチと音声のレベルとを含む。生成部１１４は、複数のパートデータＤｐのうち、音域が特徴データＤｆの示す音声のピッチと同じ周波数を含む一のパートデータＤｐｓを選択し、特徴データＤｆの示す音声のレベルに応じて、選択した一のパートデータＤｐｓの示す音のレベルを変更するように、当該選択した一のパートデータＤｐｓを補正する。 Also, in the example of the first aspect (sixth aspect), the audio features include the pitch of the audio and the level of the audio. The generation unit 114 selects one part data Dps whose range includes the same frequency as the pitch of the sound indicated by the feature data Df from among the plurality of part data Dp, and selects according to the level of the sound indicated by the feature data Df. The selected one part data Dps is corrected so as to change the sound level indicated by the one part data Dps.

また、第１態様の例（第７態様）において、音声の特徴は、音声のフォルマントと音声のレベルとを含む。生成部１１４は、複数のパートデータＤｐのうち、音域が特徴データＤｆの示す音声のフォルマントに重なる一のパートデータＤｐｓを選択し、特徴データＤｆの示す音声のピッチに応じて、選択した一のパートデータＤｐｓのピッチを変更するように、当該選択した一のパートデータＤｐｓを補正する。 In addition, in the example of the first aspect (seventh aspect), the speech features include speech formants and speech levels. The generation unit 114 selects one part data Dps whose range overlaps the formants of the speech indicated by the feature data Df from among the plurality of part data Dp, and selects one part data Dps according to the pitch of the speech indicated by the feature data Df. The selected one part data Dps is corrected so as to change the pitch of the part data Dps.

この構成を有することにより、例えば、人間の音声が男性の音声か女性の音声かに応じて、パートデータＤｐを選択し、選択したパートデータＤｐのピッチを、人間の音声のピッチに合わせることが可能となる。 With this configuration, for example, the part data Dp can be selected according to whether the human voice is male or female, and the pitch of the selected part data Dp can be matched to the pitch of the human voice.

また、第１態様の例（第８態様）において、音声の特徴は、音声のピッチを含む。生成部１１４は、複数のパートデータＤｐのうち、音域が特徴データＤｆの示す音声のピッチと同じ周波数を含む一のパートデータＤｐｓを選択し、特徴データＤｆの示す音声のピッチに応じて、選択した一のパートデータＤｐｓのピッチを変更するように、当該選択した一のパートデータＤｐｓを補正する。 In addition, in the example of the first mode (eighth mode), the speech feature includes the pitch of the speech. The generation unit 114 selects one part data Dps whose range includes the same frequency as the pitch of the voice indicated by the feature data Df from among the plurality of part data Dp, and selects according to the pitch of the voice indicated by the feature data Df. The selected one part data Dps is corrected so as to change the pitch of the selected one part data Dps.

この構成を有することにより、例えば、人間の音声が男性の音声か女性の音声かに応じて、パートを選択し、選択したパートのピッチを、人間の音声のピッチに合わせることが可能となる。 With this configuration, it is possible, for example, to select a part according to whether the human voice is male or female, and match the pitch of the selected part to the pitch of the human voice.

また、第１態様の例（第９態様）において、生成部１１４は、選択した一のパートデータＤｐのキーを、オクターブ単位で上下させる。 In addition, in the example of the first mode (the ninth mode), the generation unit 114 moves the key of the selected piece of part data Dp up and down in octave units.

この構成を有することにより、生成部１１４は、音楽データＤｘが示す音楽の曲調を変更することなく、楽曲として成立させた状態で、選択したパートデータＤｐｓのみを補正することが可能となる。 With this configuration, the generation unit 114 can correct only the selected part data Dps without changing the character of the music indicated by the music data Dx, so that the result still holds together as a piece of music.

また、第１態様の例（第１０態様）において、生成部１１４は、選択した一のパートデータＤｐｓのコードを、半音単位で上下させる。 In addition, in the example of the first aspect (tenth aspect), the generation unit 114 moves the chord of the selected piece of part data Dps up and down in semitone units.

この構成を有することにより、生成部１１４は、選択したパートデータＤｐｓのピッチを微調整することが可能となる。 With this configuration, the generator 114 can finely adjust the pitch of the selected part data Dps.

また、第１態様の例（第１１態様）において、マスキングデータＤｍは、補正された一のパートデータＤｐｓ’と、上記の複数のパートデータＤｐのうち、上記の一のパートデータＤｐｓを除いたパートデータＤｐとを含む。 In an example of the first aspect (eleventh aspect), the masking data Dm includes the corrected part data Dps' and, among the plurality of part data Dp, the part data Dp other than the one part data Dps.

この構成を有することにより、一つの楽器による演奏音を示すパートデータＤｐｓを補正し、補正されたパートデータＤｐｓ’と、当該パートデータＤｐが補正された楽器とは異なる楽器による演奏音を示すパートデータＤｐとから、マスキングデータＤｍを生成することが可能となる。 With this configuration, part data Dps representing the performance sound of one instrument can be corrected, and the masking data Dm can be generated from the corrected part data Dps' and part data Dp representing the performance sounds of instruments other than the instrument whose part data was corrected.

また、第１態様の例（第１２態様）は、音楽データＤｘを記憶する記憶装置１２を更に備える。取得部１１３は、記憶装置１２から音楽データＤｘを読み出す。生成部１１４は、取得部１１３が音楽データＤｘを読み出している期間中に、検出部１１１によって音声信号が検出された場合、上記の補正を実行する。 The example of the first mode (twelfth mode) further includes a storage device 12 that stores the music data Dx. Acquisition unit 113 reads music data Dx from storage device 12 . If the detection unit 111 detects an audio signal while the acquisition unit 113 is reading the music data Dx, the generation unit 114 performs the above correction.

この構成を有することにより、マスキング装置１は、予め複数の楽器の演奏音を含む楽曲を流しておき、人間の音声を感知して初めて、当該音声の特徴に応じて、例えば一部の楽器の演奏音を大きくすることが可能となる。これにより、人間が発話すると同時に、突然マスキング音を出力した場合に、発話した人間が感じる違和感を抑制することが可能となる。 With this configuration, the masking device 1 plays a piece of music containing the performance sounds of a plurality of instruments in advance and, only once a human voice is detected, can, for example, raise the volume of some of the instruments in accordance with the features of that voice. This suppresses the sense of unnaturalness that a speaker would feel if a masking sound were suddenly output the moment the speaker began to talk.

また、第１態様の例（第１３態様）は、音楽データＤｘを記憶する記憶装置１２を更に備える。取得部１１３は、記憶装置１２から音楽データＤｘを読み出す。生成部１１４Ａは、検出部１１１によって音声信号が検出されない場合、複数のパートデータＤｐのうち所定のパートを音楽データＤｘとして出力する。また、生成部１１４Ａは、検出部１１１によって音声信号が検出された場合、上記の補正を実行し、所定のパートデータＤｐと補正された一のパートデータＤｐｓ’とを含むマスキングデータＤｍを出力する。 An example of the first aspect (thirteenth aspect) further includes a storage device 12 that stores the music data Dx. The acquisition unit 113 reads the music data Dx from the storage device 12. When the detection unit 111 detects no voice signal, the generation unit 114A outputs a predetermined part among the plurality of part data Dp as the music data Dx. When the detection unit 111 detects a voice signal, the generation unit 114A performs the above correction and outputs masking data Dm including the predetermined part data Dp and the corrected part data Dps'.

この構成を有することにより、マスキング装置１は、予め、あるパートデータＤｐの示す音楽を流しておき、人間の音声を感知して初めて、当該音声の特徴に応じて、他のパートデータＤｐｓの示す音楽を挿入することが可能となる。これにより、人間が発話すると同時に、突然マスキング音を出力した場合に、発話した人間が感じる違和感を抑制することが可能となる。 With this configuration, the masking device 1 plays the music indicated by certain part data Dp in advance and, only once a human voice is detected, can insert the music indicated by other part data Dps in accordance with the features of that voice. This suppresses the sense of unnaturalness that a speaker would feel if a masking sound were suddenly output the moment the speaker began to talk.

また、第１態様の例（第１４態様）において、音楽データＤｘは、ＭＩＤＩデータであってもよい。 Further, in the example of the first mode (14th mode), the music data Dx may be MIDI data.

この構成を有することにより、音楽データＤｘとしてのＭＩＤＩデータを補正することで、マスキング音を示すマスキングデータＤｍを生成することが可能となる。 With this configuration, it is possible to generate masking data Dm representing a masking sound by correcting MIDI data as music data Dx.

あるいは、第１態様の例（第１５態様）において、音楽データＤｘは、音信号であってもよい。 Alternatively, in the example of the first aspect (fifteenth aspect), the music data Dx may be a sound signal.

この構成を有することにより、音楽データＤｘとしての音信号を補正することで、マスキング音を示すマスキングデータＤｍを生成することが可能となる。 With this configuration, it is possible to generate the masking data Dm representing the masking sound by correcting the sound signal as the music data Dx.

また、第１態様の例（第１６態様）において、生成部１１４は、音楽として新たな曲を生成し、生成した曲に対応するマスキングデータＤｍを生成する。 In addition, in the example of the first aspect (sixteenth aspect), the generation unit 114 generates a new piece of music as music, and generates masking data Dm corresponding to the generated piece of music.

この構成を有することにより、マスキング音のメロディを自動で生成することが可能となる。 With this configuration, it is possible to automatically generate the melody of the masking sound.

An example of the first aspect (the seventeenth aspect) further includes a reproducing device 15 that reproduces the music based on the masking data Dm.

With this configuration, music can be reproduced as the masking sound.

In an example of the first aspect (the eighteenth aspect), the reproducing device 15 reproduces the music when the detection unit 111 detects a voice.

With this configuration, music can be reproduced as the masking sound in time with a person's utterance.

DESCRIPTION OF SYMBOLS: 11... control device, 12... storage device, 13... operation device, 14... sound collection device, 14-1... first sound collection device, 14-2... second sound collection device, 15... reproducing device, 15-1... first speaker, 15-2... second speaker, 15-3... third speaker, 15-4... fourth speaker, 51 to 54... seats, 71... front right door, 72... front left door, 73... rear right door, 74... rear left door, 111... detection unit, 112... analysis unit, 113... acquisition unit, 114, 114A... generation unit

Claims

1. A masking device comprising:
a detection unit that detects a voice signal representing a voice from an output signal output from a microphone;
an analysis unit that generates feature data indicating features of the voice by analyzing the voice signal; and
a generation unit that generates, based on the feature data, masking data indicating music that masks the voice.

2. The masking device according to claim 1, wherein the features of the voice include at least one of the pitch of the voice, the level of the voice, and the formants of the voice.

3. The masking device according to claim 1, further comprising an acquisition unit that acquires music data representing the music,
wherein the generation unit generates the masking data by correcting the music data based on the feature data.

4. The masking device according to claim 3, wherein
the music includes a plurality of parts that correspond one-to-one with a plurality of tones,
the music data includes a plurality of part data corresponding to the plurality of parts on a one-to-one basis, and
the generation unit
selects one part data from among the plurality of part data based on the feature data, and
generates the masking data by correcting, based on the feature data, at least one of the pitch of the sound indicated by the one part data and the level of the sound indicated by the one part data.

5. The masking device according to claim 4, wherein
the features of the voice include the formants of the voice and the level of the voice, and
the generation unit
selects, from among the plurality of part data, one part data whose range overlaps with the formants of the voice indicated by the feature data, and
corrects the selected one part data so as to change the level of the sound indicated by the selected one part data according to the level of the voice indicated by the feature data.

6. The masking device according to claim 4, wherein
the features of the voice include the pitch of the voice and the level of the voice, and
the generation unit
selects, from among the plurality of part data, one part data whose range includes the same frequency as the pitch of the voice indicated by the feature data, and
corrects the selected one part data so as to change the level of the sound indicated by the selected one part data according to the level of the voice indicated by the feature data.

7. The masking device according to any one of claims 4 to 6, wherein
the features of the voice include the formants of the voice and the pitch of the voice, and
the generation unit
selects, from among the plurality of part data, one part data whose range overlaps with the formants of the voice indicated by the feature data, and
corrects the selected one part data so as to change the pitch of the sound indicated by the selected one part data according to the pitch of the voice indicated by the feature data.

8. The masking device according to any one of claims 4 to 6, wherein
the features of the voice include the pitch of the voice, and
the generation unit
selects, from among the plurality of part data, one part data whose range includes the same frequency as the pitch of the voice indicated by the feature data, and
corrects the selected one part data so as to change the pitch of the sound indicated by the selected one part data according to the pitch of the voice indicated by the feature data.

9. The masking device according to claim 7, wherein said generation unit raises or lowers the key of said selected one part data in octave units.

10. The masking device according to any one of claims 7 to 9, wherein the generation unit raises and lowers the chord of the selected one part data in units of semitones.

11. The masking device according to any one of claims 4 to 10, wherein the masking data includes the corrected one part data and part data, among the plurality of part data, other than the one part data.

12. The masking device according to any one of claims 4 to 11, further comprising a storage device that stores the music data, wherein
the acquisition unit reads the music data from the storage device, and
the generation unit performs the correction when the detection unit detects the voice signal while the acquisition unit is reading the music data.

13. The masking device according to any one of claims 4 to 11, further comprising a storage device that stores the music data, wherein
the acquisition unit reads the music data from the storage device, and
the generation unit
outputs predetermined part data among the plurality of part data as the music data when the detection unit does not detect the voice signal, and
when the detection unit detects the voice signal, performs the correction and outputs the masking data including the predetermined part data and the corrected one part data.

14. The masking device according to any one of claims 3 to 13, wherein said music data is MIDI data.

15. The masking device according to any one of claims 3 to 13, wherein said music data is a sound signal.

16. The masking device according to claim 1, wherein said generation unit generates a new piece of music as said music, and generates said masking data corresponding to said generated piece of music.

17. The masking device according to any one of claims 1 to 16, further comprising a reproducing unit that reproduces the music based on the masking data.

18. The masking device according to claim 17, wherein said reproducing unit reproduces said music when said voice is detected by said detection unit.