JP5511342B2

JP5511342B2 - Voice changing device, voice changing method and voice information secret talk system

Info

Publication number: JP5511342B2
Application number: JP2009279125A
Authority: JP
Inventors: 孝芳中井; 福司川上; 肖吾木元
Original assignee: Nippon Sheet Glass Environment Amenity Co Ltd
Current assignee: Nippon Sheet Glass Environment Amenity Co Ltd
Priority date: 2009-12-09
Filing date: 2009-12-09
Publication date: 2014-06-04
Anticipated expiration: 2029-12-09
Also published as: JP2011123141A

Description

The present invention relates to a voice changing device that changes speech, a voice changing method, and a voice information secret talk system including the voice changing device.

With the enforcement of the Personal Information Protection Law and similar legislation, the need to protect conversational information in banks and offices has been increasing. Measures proposed to date include sound insulation and soundproofing, which physically divide a space, and BGM and masking systems, which conceal conversational speech in open-plan offices and the like with other noise or music.

The following approaches have conventionally been used to conceal voice information:
(1) Masking systems, which conceal the target speech with other stationary noise
(2) Shading systems, which conceal it with indoor background noise or air-conditioning noise
(3) Sound insulation and soundproofing (the target room is spatially partitioned and acoustically separated)
Approach (1) tries to (forcibly) erase the very presence of the speech and is classified as energy masking. It is used, for example, in booths and conference rooms of open-plan offices.

An example of a system of type (1) is reported in Non-Patent Document 1. There, a dedicated generator and speakers are installed inside the ceiling or elsewhere, and a masking sound is emitted to conceal speech. The principle is to generate music or noise (unrelated to the conversation) at a level that does not disturb the conversation, thereby lowering the signal-to-noise ratio (S/N) to conceal the speech content, or reducing clarity and intelligibility, until the conversation can no longer be understood. The system includes a control device (signal processing device) that controls the masking sound to an optimum level according to the conversation level, the indoor background noise, and the like, as well as a power amplifier and other components.

In another example using this technique, masking noise is radiated from a partition into a booth, limiting the target region to the booth in an attempt to keep the noise level of the room as a whole from rising.

An example of a system of type (2) is reported in Non-Patent Document 2, which describes a "Sound Shading System" that uses the room's own background noise, or familiar everyday air-conditioning noise, as the radiated masking noise. In this system, a speaker is installed at the top of a partition of the kind used at bank counters to block the line of sight for privacy, so that the privacy of conversations is protected as well. A masking sound is reproduced from this speaker to keep the conversation content from leaking to people on the other side of the partition. The reproduced sound is either generated from city street noise or is the air-conditioning noise of the room.

Examples of approach (3) include sound insulation by partitioning off a separate room and soundproofing by partitions and the like.

Non-Patent Document 1: KOKUYO press release, "Sound Masking", October 18, 2006
Non-Patent Document 2: Akiko Sugimoto, Takahiro Nakamura, Shiro Ise, "Evaluation of Sound Shading System that generates sound field in consideration of ease of conversation and privacy", Proceedings of the Acoustical Society of Japan 2005 Spring Meeting, p. 817
Non-Patent Document 3: The Institute of Electronics, Information and Communication Engineers, "Auditory and Speech", 1973, pp. 370-371

The present inventors have recognized the following problems with the masking and shading techniques described above.
(I) Because a new sound unrelated to the original speech is radiated, it can feel unnatural and can raise the noise level of the room.
(II) The noise, that is, the masking sound, can be heard even during so-called silent periods when no speech is being produced.
(III) Radiating a different sound (noise or music) that has nothing to do with the conversation can cause considerable discomfort to the speaker, the conversation partners, and other occupants of the room.
(IV) Concealing speech information with noise or BGM has a fundamental weakness: hearing tends to separate and recognize sounds of different character, so noise or BGM is unlikely to be effective (speech waveforms with similar envelopes and spectra are harder for the auditory system to tell apart).

Regarding (I), experience shows that the noise level needed to mask the original speech completely is about 15 dB relative to the speech (see Non-Patent Document 3). From this standpoint, concealing speech by playing noise or music requires noise or music considerably louder than the original speech, and whether the method is masking or shading, it can raise the indoor noise level substantially.
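As a rough worked illustration of why a masker about 15 dB above the speech raises the room level, the levels of incoherent sources add on an energy basis. The sketch below is not taken from the specification: the 60 dB speech level and the function name `combine_levels_db` are assumptions chosen only for the arithmetic.

```python
import math

def combine_levels_db(*levels_db):
    """Energy sum of incoherent sound levels given in dB."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))

speech_db = 60.0               # assumed speech level (hypothetical value)
masker_db = speech_db + 15.0   # masker roughly 15 dB above the speech
total_db = combine_levels_db(speech_db, masker_db)
print(round(total_db, 1))      # combined level, dominated by the masker
```

Under these assumptions the combined level sits about 15 dB above the speech alone, which is the increase in room noise that problem (I) refers to.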

Regarding (II), hearing a sound even when nobody is speaking feels unnatural. Playing noise or music when there is no speech is also wasteful from the standpoint of concealing conversation content. Worse than wasteful, it can raise the equivalent noise level of the room (Laeq, the A-weighted equivalent sound level: the mean-square sound pressure level over a given interval of a sound signal corrected with the A-weighting characteristic, that is, the average noise level). Even when music is played instead of noise, it is hard to distinguish from ordinary BGM.
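The effect of a continuous masker on the equivalent level can be sketched numerically. The example below assumes the short-term levels are already A-weighted and equally spaced in time; the specific levels (60, 40, and 55 dB) and the name `leq_db` are hypothetical and serve only to illustrate the energy averaging.

```python
import math

def leq_db(levels_db):
    """Equivalent continuous level: energy average of equally spaced short-term levels."""
    mean_energy = sum(10.0 ** (level / 10.0) for level in levels_db) / len(levels_db)
    return 10.0 * math.log10(mean_energy)

# Hypothetical minute: speech at 60 dB for half the time, 40 dB background otherwise.
without_masker = [60.0] * 30 + [40.0] * 30

# Same minute with a constant 55 dB masking noise that also plays during silence.
masker_energy = 10.0 ** (55.0 / 10.0)
with_masker = [
    10.0 * math.log10(10.0 ** (level / 10.0) + masker_energy)
    for level in without_masker
]

print(round(leq_db(without_masker), 1), round(leq_db(with_masker), 1))
```

Because the masker keeps contributing energy during the silent half of the interval, the equivalent level with the masker is higher than without it.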

Approach (3), for its part, is quite costly and detracts from the sense of openness, so it is not suited to open-plan offices and similar spaces.

The present invention has been made in view of these problems, and its object is to provide a technique for concealing speech content while limiting any increase in noise level and listener discomfort.

One aspect of the present invention relates to a voice changing device. The voice changing device includes a partial extraction unit that extracts a change target part from an audio signal, a partial change unit that changes the change target part extracted by the partial extraction unit, and an output unit that outputs at least the change target part changed by the partial change unit to an audio output means.
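The three units of this aspect can be pictured as a small extract, change, output pipeline. The sketch below is only one possible reading: the function `change_voice`, its callback interfaces, and the choice to output the whole signal with the changed parts written back in place are assumptions, not details given in the specification.

```python
def change_voice(signal, extract, change, output):
    """Extract change-target parts, change them, and pass the result to the output.

    extract(signal) -> list of (start, end) index pairs (hypothetical interface)
    change(samples) -> replacement samples of the same length
    output(samples) -> stands in for the audio output means
    """
    result = list(signal)
    for start, end in extract(signal):
        changed = change(signal[start:end])
        if len(changed) != end - start:
            raise ValueError("changed part must keep its original length")
        result[start:end] = changed
    output(result)
    return result

played = []
out = change_voice(
    [1, 2, 3, 4],
    extract=lambda s: [(1, 3)],        # pretend samples 1..2 are the target part
    change=lambda part: part[::-1],    # placeholder change: reverse the part
    output=played.append,
)
print(out)
```

Keeping each changed part the same length as the original keeps the output time-aligned with the input, which matters later when the changed signal has to overlap the original speech.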

According to this aspect, the change target part of the audio signal can be changed before the signal is output to the audio output means. The audio output means may output, as sound, the audio signal in which the change target part has been changed.

Another aspect of the present invention is a voice information secret talk system. The system includes a sound collecting means that receives uttered speech and generates an audio signal representing it, a voice changing device that changes the audio signal generated by the sound collecting means, and an audio output means that converts the audio signal changed by the voice changing device into sound and outputs it to a region where the uttered speech can be heard. The voice changing device includes a partial extraction unit that extracts a change target part from the audio signal generated by the sound collecting means, a partial change unit that changes the change target part extracted by the partial extraction unit, and an output unit that outputs at least the change target part changed by the partial change unit to the audio output means.

According to this aspect, the change target part of the audio signal representing the uttered speech can be changed and output as sound to the region where the uttered speech can be heard.

Any combination of the above components, and any conversion of the components or expressions of the present invention among devices, methods, systems, computer programs, recording media storing computer programs, and the like, is also valid as an aspect of the present invention.

According to the present invention, speech content can be concealed while limiting any increase in noise level and listener discomfort.

FIG. 1 is an explanatory diagram that sorts conventional approaches to masking and the approach according to the embodiment into categories.
FIG. 2 is a perspective view schematically showing a booth equipped with the voice information secret talk system according to the embodiment.
FIG. 3 is a block diagram schematically showing the functions and configuration of the voice information secret talk system of FIG. 2.
FIG. 4 is a side view showing the configuration of the IT partition of FIG. 2.
FIG. 5 is a block diagram showing the functions and configuration of the SD controller unit of FIG. 3.
FIG. 6 is a data structure diagram showing the consonant library of FIG. 5.
FIG. 7 is a waveform diagram of an audio signal representing a maskee.
FIG. 8 is a waveform diagram of the audio signal generated by processing the audio signal of FIG. 7 in the SD controller unit in the consonant-only replacement mode.
FIG. 9 is an explanatory diagram showing a breakdown of the consonant replacement processing.
FIG. 10 is a flowchart showing a series of processes in the SD controller unit and the speakers.
FIG. 11 is a block diagram schematically showing the functions and configuration of a voice information secret talk system according to a first modification.
FIG. 12 is a block diagram schematically showing the functions and configuration of a voice information secret talk system according to a second modification.
FIG. 13 is a waveform diagram of the audio signal obtained by time-reversing the audio signal of FIG. 7 in the SD controller unit in units of syllable groups delimited by the energy envelope.

The present invention is described below on the basis of preferred embodiments with reference to the drawings. Identical or equivalent components, members, and processes shown in the drawings are given the same reference numerals, and duplicate descriptions are omitted as appropriate.

Particularly in offices, it is desirable that only the voice information, that is, the content of the speech, be concealed without sacrificing the openness and ease of communication that an open-plan space offers. Conventional techniques that use BGM or masking, however, basically add a different sound unlike speech in character, and so tend to cause an unnatural auditory impression and raise the background noise of the room. The embodiment of the present invention changes the structure of the audio signal itself, as picked up by a microphone or the like, and thereby conceals the content of the conversation (ideally only the content) without raising the background noise of the room, achieving a smooth and comfortable secret talk environment.

FIG. 1 is an explanatory diagram that sorts conventional approaches to masking and the approach according to the embodiment into categories. (a) is SR (Sound Reinforcement) / PA (Public Address) using electroacoustics: conventional techniques that raise volume and clarity so that speech is "heard well". (f) is sound insulation: a conventional technique that acoustically separates spaces so that speech is, as far as possible, "not heard". In contrast, the approach according to the embodiment is (e) SD (Sound Deformation), a kind of voice information disturbance technique that processes and outputs the speaker's own original speech so that the conversation content is "not understood", regardless of whether it is audible. Moreover, whereas the conventional (b) EM, (c) SS, and (d) IM all raise the noise level of the room or target region to some degree and can increase discomfort and unnaturalness, (e) SD involves almost no increase in volume.

The embodiment of the present invention rests mainly on the inventors' recognition that recognizing and understanding language depends heavily on the consonant parts of speech, particularly in Japanese. If the consonant parts change, "cloud" (KUMO), for example, becomes "RUTO" and can no longer be understood as a word.
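The KUMO to RUTO example can be mimicked at the text level. The toy sketch below works on romanized letters rather than audio, and its consonant set and the name `scramble_consonants` are simplifying assumptions; it only illustrates that the vowels stay in place while each consonant is swapped for a different one.

```python
import random

VOWELS = set("AEIOU")
CONSONANTS = "KSTNHMYRW"  # simplified romaji consonant set (assumption)

def scramble_consonants(word, rng):
    """Replace every consonant letter with a different, randomly chosen consonant."""
    out = []
    for ch in word:
        if ch in VOWELS or ch not in CONSONANTS:
            out.append(ch)  # vowels and unknown characters pass through unchanged
        else:
            out.append(rng.choice([c for c in CONSONANTS if c != ch]))
    return "".join(out)

rng = random.Random(0)
print(scramble_consonants("KUMO", rng))  # same vowels "U" and "O", different consonants
```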

The embodiment of the present invention focuses on this aspect of speech recognition and understanding and, in particular, changes, deletes, or replaces the consonant parts of the original speech. Because mainly the consonant parts are processed, the increase in sound pressure level (volume) relative to the original speech is small. To further reduce the overall volume of the original speech (hereinafter, the maskee) plus the processed speech (hereinafter, the masker), the following measures can be used in combination:
(i) When generating the masker, replace the vowel parts with silence and output only the processed consonant parts at their original timing.
(ii) To strengthen the masker's information-concealing effect, combine it with ANC (Active Noise Control) or fixed-parameter PNC (Passive Noise Control).

FIG. 2 is a perspective view schematically showing a booth 2 equipped with a voice information secret talk system 100 according to the embodiment. FIG. 3 is a block diagram schematically showing the functions and configuration of the voice information secret talk system 100 of FIG. 2.
The voice information secret talk system 100 is installed in a booth 2 divided off by simple partitions, such as a consultation counter at a bank. The system includes a microphone Mic, an SD controller unit SD, two power amplifiers PA, and two speakers SP. The speakers SP and the SD controller unit SD may be built into the IT partitions 4 that visually separate the booths.

Let the speaker be a customer 6 who is talking with a consultant. The speaker's maskee H'(t) is picked up by the microphone Mic provided at or near the counter. The maskee H'(t) picked up by the microphone Mic is converted into an audio signal and sent to the SD controller unit SD, which changes, deletes, or replaces the consonant parts of the audio signal. The processed audio signal passes through the power amplifiers PA and is output from the speakers SP to the adjacent booths 2' on the left and right as the masker H(t).

Because the maskee H'(t) travels through the air around the partition into the adjacent booth 2', the speech of the customer 6 can be heard by a listener 8 (a person other than the customer 6) in the adjacent booth 2'. In the present embodiment, however, the maskee H'(t) that leaks through the air is combined with the masker H(t) before reaching the listener 8 in the adjacent booth 2'. The disturbance caused by the masker H(t) prevents the listener 8 from correctly recognizing the consonant parts of the maskee H'(t), and as a result the listener 8 cannot understand the conversation content carried by the maskee H'(t).

FIG. 4 is a side view showing the configuration of the IT partition 4 of FIG. 2. The IT partition 4 has a laminated structure in which a first sound absorbing layer 42, a sound insulating layer 44, and a second sound absorbing layer 46 are stacked in this order. The first sound absorbing layer 42 and the second sound absorbing layer 46 are each a 20 mm thick layer of glass wool. The sound insulating layer 44 is a 12 mm thick gypsum board.

FIG. 5 is a block diagram showing the functions and configuration of the SD controller unit SD of FIG. 3. In hardware terms, each block shown here can be implemented by elements such as a computer's CPU (central processing unit) and by mechanical devices, and in software terms by a computer program or the like; what is drawn here are functional blocks realized by their cooperation. Those skilled in the art who have read this specification will therefore understand that these functional blocks can be implemented in various forms by combinations of hardware and software.

The SD controller unit SD includes a storage device 10, an A/D unit 20, a partial extraction unit 30, a partial change unit 90, a D/A unit 70, a noise generation unit 80, a consonant library update unit 82, and a vowel library update unit 84. The storage device 10 includes a consonant library 12, a vowel library 14, and a common library 16. The partial extraction unit 30 includes a voice discrimination unit 36, a consonant extraction unit 32, and a vowel extraction unit 34. The partial change unit 90 includes a consonant processing unit 40 and a vowel processing unit 50.

The consonant library 12 stores waveform data for each type of consonant part. The vowel library 14 stores waveform data for each type of vowel part. The common library 16 stores predetermined sample waveform data for each type of consonant part. The consonant sample waveform data stored in the common library 16 is classified into categories such as male, female, child, and adult.

The SD controller unit SD has at least two operation modes: a consonant-only replacement mode and a consonant-vowel replacement mode. The functions of the blocks relevant to each mode are described below.

(1) Consonant-only replacement mode
The maskee H'(t) picked up by the microphone Mic is converted into an audio signal, which is input to the A/D unit 20 via a microphone amplifier (not shown). The A/D unit 20 converts the analog audio signal into a digital signal. The voice discrimination unit 36 distinguishes the consonant parts and vowel parts of the audio signal by comparing the waveform of the audio signal digitized by the A/D unit 20 with past speech waveforms. The consonant extraction unit 32 uses the result to extract the consonant parts.

The consonant library update unit 82 accumulates the waveform data of the consonant parts extracted by the consonant extraction unit 32 in the consonant library 12, by type. The consonant parts are classified on the basis of their duration, spectrum, statistical processing, and the like. Through this sequential processing, the consonant waveform data held in the consonant library 12 is gradually replaced with more accurate data as the conversation proceeds.
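A per-type store that is refreshed as the conversation goes on could look like the following minimal sketch. The class name `ConsonantLibrary`, the type labels, and the policy of keeping only the most recent entries per type are assumptions; the specification does not say how older data is replaced.

```python
from collections import defaultdict, deque

class ConsonantLibrary:
    """Keeps recent waveform snippets per consonant type (hypothetical structure)."""

    def __init__(self, max_per_type=32):
        # A bounded deque drops the oldest snippet once the limit is reached,
        # so later (assumed better) extractions gradually replace early ones.
        self._store = defaultdict(lambda: deque(maxlen=max_per_type))

    def add(self, kind, waveform):
        self._store[kind].append(list(waveform))

    def candidates(self, kind):
        return list(self._store.get(kind, ()))

lib = ConsonantLibrary(max_per_type=2)
lib.add("s", [0.1, 0.2])
lib.add("s", [0.3, 0.4])
lib.add("s", [0.5, 0.6])   # pushes out the oldest "s" snippet
print(lib.candidates("s"))
```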

The noise generation unit 80 generates, from a consonant part extracted by the consonant extraction unit 32, consonant noise whose spectrum is similar to that of the consonant part.
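The specification does not state how the spectrally similar noise is produced. One common way to obtain noise with the same magnitude spectrum as a segment is phase randomization, sketched below with a plain standard-library DFT for short segments. The technique and the function names are a substitute chosen for illustration, not the method of the noise generation unit 80.

```python
import cmath
import math
import random

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def spectrum_matched_noise(segment, rng):
    """Keep each bin's magnitude, randomize its phase, return a real signal."""
    n = len(segment)
    X = dft(segment)
    Y = [0j] * n
    Y[0] = X[0]  # keep the DC component as is
    for k in range(1, n // 2 + 1):
        if 2 * k == n:
            Y[k] = complex(abs(X[k]), 0.0)      # Nyquist bin must stay real
        else:
            phase = rng.uniform(0.0, 2.0 * math.pi)
            Y[k] = abs(X[k]) * cmath.exp(1j * phase)
            Y[n - k] = Y[k].conjugate()         # Hermitian symmetry gives a real output
    return [v.real for v in idft(Y)]

segment = [math.sin(2 * math.pi * 3 * t / 32) + 0.5 * math.sin(2 * math.pi * 7 * t / 32)
           for t in range(32)]
noise = spectrum_matched_noise(segment, random.Random(1))
```

The direct DFT here costs O(n²) and is only meant for short illustrative segments; a real-time system would use an FFT.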

The consonant processing unit 40 processes the consonant parts of the audio signal extracted by the consonant extraction unit 32. It replaces each extracted consonant part with a different consonant part of approximately the same length selected from the consonant library 12. When there are several replacement candidates, it chooses at random so that each combination is approximately equally probable. As an example of how consonant lengths differ, the consonant part corresponding to "s" lasts relatively long, whereas those corresponding to "t" and "p" are short.
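The selection rule described here (a different consonant, about the same length, chosen at random with roughly equal probability) can be sketched as follows. The library layout, the 20 % length tolerance, and the name `pick_replacement` are assumptions made for the example.

```python
import random

def pick_replacement(kind, length, library, rng, tolerance=0.2):
    """Pick a different consonant of similar length, uniformly over eligible types.

    library: dict mapping consonant type -> list of waveforms (lists of samples).
    Returns (type, waveform), or None when nothing fits.
    """
    eligible = {}
    for other_kind, waveforms in library.items():
        if other_kind == kind:
            continue  # the replacement must be a different consonant
        fits = [w for w in waveforms if abs(len(w) - length) <= tolerance * length]
        if fits:
            eligible[other_kind] = fits
    if not eligible:
        return None
    chosen_kind = rng.choice(sorted(eligible))   # each eligible type equally likely
    return chosen_kind, rng.choice(eligible[chosen_kind])

library = {
    "s": [[0.0] * 100],                 # long fricative
    "t": [[1.0] * 20, [1.0] * 22],      # short plosives
    "p": [[2.0] * 21],
}
result = pick_replacement("t", 20, library, random.Random(0))
print(result[0])  # "p": the only other type of similar length
```

Choosing the type first and then a waveform within it keeps the types equally likely even when the library holds more snippets for some consonants than for others.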

Instead of replacing consonant parts using the consonant library 12, the consonant processing unit 40 may replace the consonant part extracted by the consonant extraction unit 32 with consonant noise generated by the noise generation unit 80. In that case, the randomness of the speech formed by combining the maskee H'(t) and the masker H(t) increases further. Alternatively, instead of replacing consonant parts using the consonant library 12, the consonant processing unit 40 may delete the consonant part extracted by the consonant extraction unit 32.

For several seconds to several tens of seconds after speech begins (hereinafter, the utterance start period), the consonant library 12 may not yet hold enough consonant parts collected from the speaker's own voice. During the utterance start period, the consonant processing unit 40 therefore selects a corresponding consonant part from the common library 16 and substitutes it for the consonant part extracted by the consonant extraction unit 32. Alternatively, during the utterance start period the consonant processing unit 40 replaces the extracted consonant part with consonant noise generated by the noise generation unit 80, or reverses the extracted consonant part in time.
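The fallback during the utterance start period can be sketched as a simple switch between the common library and the speaker's own library, together with the time-reversal alternative. The thresholds (`startup_s`, `min_entries`) and the function names are hypothetical; the specification gives only the range of several seconds to several tens of seconds.

```python
def time_reverse(segment):
    """Reverse a consonant segment in time (one of the start-period alternatives)."""
    return segment[::-1]

def choose_source(personal_count, elapsed_s, startup_s=10.0, min_entries=20):
    """Decide which library to draw replacements from.

    Uses the common library while the utterance start period is running or
    while too few of the speaker's own consonants have been collected.
    """
    if elapsed_s < startup_s or personal_count < min_entries:
        return "common"
    return "personal"

print(choose_source(personal_count=0, elapsed_s=1.0))
print(choose_source(personal_count=50, elapsed_s=30.0))
print(time_reverse([1, 2, 3]))
```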

The consonant-changing algorithms used during the utterance start period sound less natural than replacement using the speaker's own consonant library 12. Because they are used only briefly after speech begins, however, this is not a significant problem.

The D/A unit 70 converts the audio signal processed by the consonant processing unit 40 into an analog audio signal for driving the speakers SP and outputs it to the power amplifiers PA. Specifically, the D/A unit 70 converts into an analog signal, and outputs, an audio signal containing the consonant parts replaced by the consonant processing unit 40 together with the corresponding unchanged vowel parts.

The time from when the maskee H'(t) is picked up by the microphone Mic, through processing in the SD controller unit SD, until the corresponding masker H(t) is output from the speakers SP, that is, the SD processing time T_SD, is kept within T + t. Here T is the time from when the maskee H'(t) is uttered until it reaches the listener 8, and t is the delay at which the maskee H'(t) and the masker H(t) do not produce a noticeable echo at the position of the listener 8, or the maximum delay at which the combined speech reaching the listener 8 remains unintelligible to the listener 8. The specific value of t is determined by experiment but is typically about 100 ms.
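The bound T_SD ≤ T + t can be turned into a small budget calculation. In the sketch below, T is estimated from an assumed leak path length and the speed of sound in air; the 2 m path and the helper names are hypothetical, and t defaults to the typical 100 ms given above.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at about 20 degrees C

def max_processing_time_s(path_length_m, t_s=0.100):
    """Upper bound T + t on the SD processing time T_SD."""
    propagation_s = path_length_m / SPEED_OF_SOUND_M_S  # T: speaker's mouth to listener
    return propagation_s + t_s

def within_budget(t_sd_s, path_length_m, t_s=0.100):
    return t_sd_s <= max_processing_time_s(path_length_m, t_s)

budget = max_processing_time_s(2.0)   # hypothetical 2 m leak path
print(round(budget * 1000.0, 1), "ms")
```

For a path of a couple of metres, T is only a few milliseconds, so under these assumptions the budget is dominated by t and comes to roughly 106 ms.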

To conceal information by combining the maskee H'(t) and the masker H(t) at the position of the listener 8, the SD processing in the SD controller unit SD must, as noted above, run in real time or near real time. This time constraint, namely that the SD processing time T_SD must not exceed the short interval T + t, may require sacrificing some accuracy in extracting, replacing, or reversing consonant parts. The aim of the present embodiment, however, is to reduce the clarity and intelligibility of speech, not to carry out the intended processing accurately for its own sake. Processing accuracy is therefore not a major concern as long as superimposing the masker H(t) makes the meaning of the maskee H'(t) hard to understand, because there are countless ways to satisfy that condition.

(2) Vowel replacement mode
In this mode, the vowel parts are changed in addition to the consonant parts as described above. The vowel extraction unit 34 extracts the vowel parts from the audio signal from which the consonant extraction unit 32 has extracted the consonant parts.

The vowel library update unit 84 stores the waveform data of the vowel parts extracted by the vowel extraction unit 34 in the vowel library 14, by type. The vowel parts are classified according to their duration, spectrum, statistical processing, and the like. Through this sequential processing, the waveform data stored in the vowel library 14 is progressively replaced with more accurate data as the conversation proceeds.

The noise generation unit 80 generates, from the vowel part extracted by the vowel extraction unit 34, vowel noise whose spectrum is similar to that of the vowel part.

The vowel processing unit 50 processes the vowel part extracted by the vowel extraction unit 34 in the audio signal whose consonant part has already been processed by the consonant processing unit 40. In particular, when any rise in noise level must be kept to a minimum, the vowel processing unit 50 replaces the vowel part extracted by the vowel extraction unit 34 with silence. In this case, the masker H(t) output through the D/A unit 70 and the speaker SP consists of consonant parts separated by silent parts. That is, each consonant part of the masker H(t) joins with the synchronized vowel part of the maskee H'(t) to form a single phonological unit. The overall volume is thereby reduced by the amount of the vowel parts silenced in the masker H(t), and the room noise level is reduced accordingly.

Instead of replacing the vowel part with silence, the vowel processing unit 50 may perform library-based replacement. That is, the vowel processing unit 50 may replace the vowel part extracted by the vowel extraction unit 34 with a different vowel part selected from the vowel library 14. When there are multiple replacement candidates, the vowel processing unit 50 chooses among them at random so that each combination occurs with approximately equal probability. The algorithm for changing vowel parts during the initial period of an utterance is the same as that for consonant parts.
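The random, approximately equiprobable selection described here might be sketched as follows. This is only an illustration under assumptions: the library is modeled as a dictionary from a phoneme label to a stored waveform, and the function name `pick_replacement` is hypothetical rather than taken from the specification.

```python
import random


def pick_replacement(library, original_label, rng=random):
    """Pick a different entry from the library, uniformly at random.

    library maps a phoneme label to its stored waveform (a list of samples).
    Returns (label, waveform), or None when no other entry has been stored yet.
    """
    # Every entry except the original is a candidate, each equally likely.
    candidates = [label for label in library if label != original_label]
    if not candidates:
        return None
    label = rng.choice(candidates)
    return label, library[label]


if __name__ == "__main__":
    # Hypothetical library contents; real entries would hold recorded waveforms.
    vowel_library = {"a": [0.1, 0.2], "i": [0.3, 0.1], "u": [0.0, 0.4]}
    print(pick_replacement(vowel_library, "a"))
```

Returning None when the library holds no alternative reflects the start of a conversation, when few waveforms have been accumulated; how that case is actually handled is left to the start-of-utterance algorithm the text refers to.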

Alternatively, instead of replacing the vowel part with silence, the vowel processing unit 50 may replace the vowel part extracted by the vowel extraction unit 34 with the vowel noise generated by the noise generation unit 80. In this case too, the randomness of the combined speech of the maskee H'(t) and the masker H(t) is further increased.

The order of consonant and vowel processing, that is, the order of the processing in the consonant processing unit 40 and the processing in the vowel processing unit 50, may also be reversed.

FIG. 6 is a data structure diagram of the consonant library 12. The consonant library 12 stores each consonant 112, as a phoneme, in association with the waveform data 114 of that consonant. The vowel library 14 and the common library 16 have the same data structure as the consonant library 12.

FIG. 7 is a waveform diagram showing the waveform of an audio signal representing the maskee H'(t). The waveform in FIG. 7 is the original utterance "ANO KARETOWA SO-TONAGAINDAYONE ZITSUWA" (roughly, "Well, he and I go back quite a long way, actually") converted into an audio signal by the microphone Mic. The vertical axis of FIG. 7 represents signal intensity in arbitrary units, and the horizontal axis represents time. Each region delimited by vertical broken lines in FIG. 7 corresponds to one phoneme, and the corresponding phoneme is indicated in Roman letters. A "-" denotes a pause in speech. The energy envelope 102 is shown as a solid line. Here, the energy envelope is obtained by smoothing the speech samples in the squared-sound-pressure domain with a time constant of several tens of milliseconds and then taking the square root.
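The energy envelope described here (squared samples smoothed with a time constant of tens of milliseconds, followed by a square root) can be sketched as below. The specification does not say which smoothing filter is used; this sketch assumes a first-order exponential smoother, and the 30 ms default and the function name are illustrative choices.

```python
import math


def energy_envelope(samples, sample_rate_hz, time_constant_s=0.03):
    """Smooth the squared samples with a one-pole low-pass, then take the root.

    time_constant_s is an assumed value in the "several tens of milliseconds"
    range mentioned in the text.
    """
    # Per-sample smoothing coefficient for the chosen time constant.
    alpha = 1.0 - math.exp(-1.0 / (time_constant_s * sample_rate_hz))
    smoothed = 0.0
    envelope = []
    for x in samples:
        smoothed += alpha * (x * x - smoothed)  # smoothing in the squared domain
        envelope.append(math.sqrt(smoothed))    # back to the amplitude domain
    return envelope


if __name__ == "__main__":
    # A 1 s constant-amplitude test signal at 8 kHz: the envelope settles near 0.5.
    env = energy_envelope([0.5] * 8000, 8000)
    print(round(env[-1], 3))
```

A longer time constant gives a smoother envelope with broader humps, while a shorter one follows individual phonemes more closely.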

Table 1 shows the classification of the signal in FIG. 7 into vowels, consonants, and silence. A point in time before the start of speech is taken as the time origin (t = 0).

Consonants, vowels, and silence can be distinguished using features such as energy, the number of zero crossings, and the first PARCOR (PARtial auto-CORrelation) coefficient (spectral tilt).
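A frame-wise decision using two of the features named here, energy and zero-crossing count, might look like the following. This is a deliberately crude sketch: the thresholds are arbitrary assumptions, the PARCOR coefficient is omitted, and voiced consonants with few zero crossings would be misclassified, so it should not be read as the classifier used in the embodiment.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def rms(frame):
    """Root-mean-square level of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def classify_frame(frame, silence_rms=0.01, zcr_threshold=0.25):
    """Label a frame of at least two samples as silence, consonant, or vowel.

    silence_rms and zcr_threshold are hypothetical thresholds that would have
    to be tuned experimentally.
    """
    if rms(frame) < silence_rms:
        return "silence"
    rate = zero_crossings(frame) / (len(frame) - 1)
    # Noise-like (fricative) sounds cross zero often; vowels do so rarely.
    return "consonant" if rate > zcr_threshold else "vowel"


if __name__ == "__main__":
    tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
    print(classify_frame(tone))
```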

FIG. 8 is a waveform diagram showing the audio signal generated when the SD controller unit SD processes the audio signal of FIG. 7 in the consonant-only replacement mode. The consonant parts indicated by sections 104 are those that have been replaced. In these replacements, the length of the excised segment and the level (dB) at reinsertion are adjusted.
The energy envelope 106 after replacement is shown as a solid line. Comparing the energy envelope 102 of FIG. 7 with the energy envelope 106 of FIG. 8 shows that it has changed little. In other words, the intonation and inflection of the speech are largely unchanged. However, when the audio signal of FIG. 8 is converted to sound by the speaker SP and output as the masker H(t), the listener 8 hears the maskee H'(t) and the masker H(t) combined, and the meaning becomes difficult to understand. In most cases the listener simply cannot make it out (in some cases it is heard as different sounds).

FIG. 9 is an explanatory diagram showing the details of the consonant replacement described above. FIG. 9 shows which consonant parts of the audio signal of FIG. 7 were replaced, and how. In this consonant replacement process, consonant parts within the audio signal of FIG. 7 are exchanged with one another.

FIG. 10 is a flowchart showing the sequence of processing in the SD controller unit SD and the speaker SP. The A/D unit 20 acquires an audio signal representing the uttered speech from the microphone Mic (S202). The partial extraction unit 30 extracts the change target part from the acquired audio signal (S204). The partial change unit 90 changes the extracted change target part (S206). The speaker SP converts at least the changed change target part into sound and outputs it (S208).

The operation of the voice information secret talk system 100 configured as above is now described. Consider a customer 6 sitting in booth 2 of a bank and consulting a bank adviser about, for example, a loan. Suppose that a listener 8 is in the adjacent booth 2' next to booth 2, applying to open an account. The customer 6 is explaining the circumstances behind the loan application, such as a deterioration in the cash flow of the customer's business. Naturally, it is better that such a conversation not be overheard by the listener 8. With the voice information secret talk system 100 according to this embodiment, what reaches the listener 8 is the speech of the customer 6 with mainly its consonant parts altered, so the listener 8 cannot understand what the customer 6 is saying. In addition, when the customer 6 is not speaking there is substantially no output from the speaker SP into the adjacent booth 2', so the noise level in the adjacent booth 2' is not raised unnecessarily.

In the embodiment described above, examples of the storage device 10 are a hard disk and a memory. A person skilled in the art who reads this specification will also understand that each block can be realized by a CPU (not shown), a module of an installed application program, a module of a system program, a memory that temporarily stores data read from a hard disk, and the like.

The voice information secret talk system 100 according to this embodiment provides the following effects.

(1) The voice information secret talk system 100 according to this embodiment does not conceal or erase the existence of the conversation itself; it conceals the content, that is, the information contained in the conversational speech. In this regard, the present inventors recognized the following.
In open-plan offices and the lobbies of banks and securities firms, particularly at customer service counters divided by simple partitions, concealment of a conversation is adequately achieved if people other than those conversing cannot understand its content. In other words, the speech itself may be audible as long as the content does not leak. Indeed, when the talker is visible, it is more natural for the spectrum and energy envelope of the speech (its timbre, intonation, and inflection) to be preserved. The voice information secret talk system 100 according to this embodiment addresses these viewpoints and needs, and conceals conversation content in a more natural way.

(2) The masker H(t) is created from the talker's own maskee H'(t), with attention to its consonant parts, and is output from the speaker in parallel with the original speech. Therefore, particularly in the consonant-only replacement mode, the spectrum and energy envelope of the maskee H'(t) can be preserved in the masker H(t). As a result, the spectrum and intonation of the masker H(t) are almost the same as those of the maskee H'(t), so the listener perceives the result as natural, with little sense of incongruity.

(3) The masker H(t) is generated from the maskee H'(t) by replacing only the consonant parts, or by replacing the consonant parts and then replacing the vowel parts with silence or otherwise processing them. The volume (sound pressure level) of the masker H(t), and hence any rise in the room noise level, can therefore be kept to a minimum.

(4) When there is no maskee H'(t) on the time axis, that is, when there is no conversation, no masker H(t) is output either. In other words, the two substantially coincide in time. The masker H(t) therefore does not raise the room noise level during silent periods when no speech is produced.

(5) The system suppresses the unnaturalness that can arise with conventional techniques from intermittent maskers or level fluctuations (the masker cutting out or dropping in level when conversation stops), as well as the discomfort caused to talkers, conversation partners, and other occupants by emitting sounds unrelated to the conversation (noise or music).

(6) Unlike the physical sound insulation or private rooms of conventional techniques, no spatial separation or relocation is required, so the sense of openness and communication are less likely to be impaired.

(7) Because the SD controller unit SD and the speaker SP are built into the IT partition 4, installation and mounting of the system are greatly simplified. The microphone Mic may also be built into the IT partition 4 where appropriate, which simplifies installation further.

(8) The IT partition 4 is itself treated for sound absorption. This improves the clarity of conversation within the booth while reducing sound leakage to adjacent booths.

(9) As a result of processing such as replacement and deletion, the masker H(t) is a signal that is not highly correlated with the maskee H'(t) (the original speech). Feedback-related problems such as howling are therefore unlikely to occur while the voice information secret talk system 100 is operating.

The configuration and operation of the voice information secret talk system 100 according to the embodiment, and of the SD controller unit SD it contains, have been described above. This embodiment is illustrative. Those skilled in the art will understand that various modifications of the combinations of its components and processes are possible, and that such modifications also fall within the scope of the present invention.

In the embodiment, the case where the masker H(t) is output from one side of the adjacent booth was described, but the invention is not limited to this. For example, by signal addition the masker H(t) may be output from both the left and right sides of the adjacent booth. FIG. 11 is a block diagram schematically showing the functions and configuration of a voice information secret talk system according to a first modification. This system includes a microphone Mic, an SD controller unit SD, four speakers SPa to SPd (SPd not shown), four power amplifiers PAa to PAd (PAd not shown), and four adders 210a to 210d (210d not shown).

The audio signal processed by the SD controller unit SD is input to the adder 210a corresponding to the left speaker SPa of booth 2, the adder 210b corresponding to the right speaker SPb of booth 2, the adder 210c corresponding to the left speaker SPc of the adjacent booth 2' to the left of booth 2, and the adder 210d (not shown) corresponding to the right speaker SPd (not shown) of the adjacent booth to the right of booth 2. The audio signals input to the adders 210a to 210d are output from the speakers SPa to SPd through the corresponding power amplifiers PAa to PAd. Each adder receives, from the booths on both sides of the booth into which its speaker outputs sound, the audio signals processed by the SD controller unit SD, and adds them.
According to this modification, the masker H(t) is output from both the left and right sides of the adjacent booth 2', so the conversation in booth 2 is even harder for the listener 8 to follow.
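The role of each adder in this modification is plain sample-wise summation of the masker signals arriving from neighboring booths. A minimal sketch, assuming the signals are lists of samples at a common sample rate and that shorter signals are treated as zero past their end (the function name `mix_signals` is hypothetical):

```python
def mix_signals(*signals):
    """Sum any number of sample lists element by element.

    Signals of different lengths are allowed; a signal contributes nothing
    beyond its last sample.
    """
    length = max((len(s) for s in signals), default=0)
    return [sum(s[i] for s in signals if i < len(s)) for i in range(length)]


if __name__ == "__main__":
    # Hypothetical masker signals from the booths to the left and right.
    from_left = [0.10, 0.20, 0.30]
    from_right = [0.05, -0.20]
    print(mix_signals(from_left, from_right))
```

A practical adder would probably also need to guard against clipping when both neighbors are active at once; that is outside what the text describes.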

A PNC (Passive Noise Controller) may also be used in combination to reduce the level of the maskee H'(t). The PNC is intended to run a known ANC (Active Noise Control) adaptively during adjustment, and then to operate with the resulting parameters fixed.
FIG. 12 is a block diagram schematically showing the functions and configuration of a voice information secret talk system according to a second modification. In this modification, the SD controller unit SD of FIG. 11 is replaced with the part enclosed by the broken line in FIG. 12. In this part, the SD controller unit SD and a PNC unit PNC are arranged in parallel, and the audio signal from the microphone Mic is input to both. A switch SW1 is provided on the output side of the SD controller unit SD and switches the operation of the SD controller unit SD on and off. The output of the switch SW1 and the output of the PNC unit PNC are summed by an adder 406 and output as sound from the speaker SP through the power amplifier PA.

In this modification, a head and torso simulator (HATS) or similar device, connected to a sound source 402 through an amplifier 404, is placed at the talker position P in order to identify the PNC unit PNC. The switch SW1 is opened to disable the SD controller unit SD, a suitable audio signal is emitted from the HATS, and the PNC unit PNC is operated adaptively so that the output of a microphone Mic' placed at the listener position Q in the adjacent booth 2' is minimized, thereby performing system identification.

At this point, the impulse response of the path including the microphone Mic and the speaker SP becomes -h(x), whose absolute value is approximately equal to that of the talker-to-listener impulse response h(x). The switch SW1 is then closed, and the PNC unit is operated with the identified parameters fixed. Because the positions P and Q of the talker and the listener, and the positions of the microphone Mic and the speaker SP, are essentially fixed, the level of the maskee H'(t) is effectively reduced and the masker H(t) becomes dominant. This strengthens the information masking effect. Lowering the level of the masker H(t) as needed can further reduce the overall level of the system including the maskee H'(t), that is, the room noise level.
The PNC function described above may be built into the computer in which the SD controller unit SD is implemented.
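The two-phase use of the PNC (adapt during adjustment, then freeze the parameters) can be illustrated with a toy simulation. This sketch makes strong simplifying assumptions that are not in the specification: the talker-to-listener path is a short FIR filter, the acoustic path from the speaker SP to the listener is ignored (a practical ANC, for example filtered-x LMS, would have to account for it), and the adaptation rule is normalized LMS. All names and values are illustrative.

```python
import random


def fir(signal, coeffs):
    """Filter signal with FIR coefficients: y[n] = sum_k coeffs[k] * signal[n-k]."""
    return [sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(signal))]


def identify_path(reference, observed, taps, mu=0.5, eps=1e-9):
    """Adjustment phase: estimate the path with normalized LMS (assumed rule)."""
    w = [0.0] * taps
    buf = [0.0] * taps
    for x, d in zip(reference, observed):
        buf = [x] + buf[:-1]                       # newest sample first
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - y                                  # estimation error
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
    return w


if __name__ == "__main__":
    rng = random.Random(1)
    test_signal = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # stands in for the HATS output
    h = [0.5, -0.3, 0.1]                           # hypothetical talker-to-listener path
    at_listener = fir(test_signal, h)
    w = identify_path(test_signal, at_listener, taps=3)
    # Operation phase: parameters frozen, control filter is the negated estimate.
    anti = fir(test_signal, [-c for c in w])
    residual = max(abs(a + b) for a, b in zip(at_listener, anti))
    print(residual < 1e-3)
```

Freezing the filter is reasonable in this toy model only because the path never changes, which mirrors the point in the text that the talker, listener, microphone, and speaker positions are essentially fixed.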

ANC and PNC are existing techniques, but they are not suited to controlling a large sound field thoroughly in three dimensions. However, where the listener's head is at a more or less fixed position in a small space enclosed by counter partitions, they are an effective means of sound reduction even in three dimensions.

In the embodiment, the case where the audio signal is divided into consonant parts and vowel parts and at least the consonant parts are changed was described, but the invention is not limited to this. For example, the partial extraction unit of the SD controller unit SD may divide the audio signal into the syllable groups seen in the energy envelope of the maskee H'(t), and the partial change unit of the SD controller unit SD may reverse each syllable group in the time direction. FIG. 13 is a waveform diagram showing the audio signal of FIG. 7 after the SD controller unit has time-reversed it in units of syllable groups delimited by the energy envelope. The parts indicated by sections 108 are the time-reversed parts. Here, time reversal is performed in units of approximately one hump of the energy envelope 110, shown as a solid line.
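Time reversal in units of one envelope hump can be sketched as follows. The specification does not say how the humps are delimited; this sketch assumes that the boundaries are the local minima of an already smoothed envelope, and the function names are illustrative.

```python
def hump_boundaries(envelope):
    """Cut points at the local minima of the envelope, plus both ends."""
    cuts = [0]
    for i in range(1, len(envelope) - 1):
        if envelope[i] < envelope[i - 1] and envelope[i] <= envelope[i + 1]:
            cuts.append(i)
    cuts.append(len(envelope))
    return cuts


def reverse_by_hump(samples, envelope):
    """Reverse the samples within each hump, keeping the humps in order."""
    cuts = hump_boundaries(envelope)
    out = []
    for start, end in zip(cuts, cuts[1:]):
        out.extend(reversed(samples[start:end]))
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6]
    envelope = [0, 2, 1, 3, 4, 0]   # one local minimum at index 2
    print(reverse_by_hump(samples, envelope))  # [2, 1, 6, 5, 4, 3]
```

Because each segment stays in place and only its internal order changes, the overall envelope shape is roughly preserved, which matches the observation that intonation is largely unaffected. An envelope that is not smoothed enough would produce many spurious minima and very short segments.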

Comparing the energy envelope 102 of FIG. 7 with the energy envelope 110 of FIG. 13 shows that they almost coincide. In other words, the intonation and inflection of the speech are almost unchanged. However, when the audio signal of FIG. 13 is converted to sound by the speaker SP and output as the masker H(t), the listener 8 hears the maskee H'(t) and the masker H(t) combined, and the meaning becomes difficult to understand. In most cases the listener simply cannot make it out (in some cases it is heard as different sounds).
In addition, according to this modification, the processing in the SD controller unit is simpler than in the consonant replacement of the embodiment, so the processing time is shortened. Furthermore, experiments conducted by the present inventors confirmed that the information-disrupting effect of time reversal is comparable to that of consonant replacement.
A further advantage of this modification is that no library is required, so a sufficient information-disrupting effect is obtained from the very start of the conversation.

The SD controller unit may also cut the audio signal into segments at a fixed period, at a period that varies slightly and randomly about a fixed period, or at boundaries where the envelope of the original speech can be regarded as forming roughly one group, reverse each segment, and return it to its original position. This processing may be repeated successively so that the delay relative to the maskee H'(t) stays within T + t. In this case, effects similar to those of the voice information secret talk system 100 according to the embodiment can be obtained.

When replacing or deleting a change target part such as a consonant part in the embodiment, a time window such as a Hanning window, or zero-cross detection, may be used in combination to remove clicks and similar artifacts that can occur at cut points. This further reduces any unnaturalness perceived by the listener 8 or other occupants.
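One way a Hanning-type window can suppress clicks at cut points is to taper both ends of each excised or inserted segment. The following is a minimal sketch, assuming a raised-cosine fade of a fixed number of samples at each edge; the fade length and the function name are illustrative assumptions.

```python
import math


def hann_fade(segment, fade_len):
    """Apply a half-Hann fade-in and fade-out to the edges of a segment.

    fade_len is the number of samples tapered at each end; it is capped at
    half the segment length so the two fades never overlap.
    """
    n = len(segment)
    fade_len = min(fade_len, n // 2)
    out = list(segment)
    for i in range(fade_len):
        gain = 0.5 - 0.5 * math.cos(math.pi * i / fade_len)  # rises from 0 toward 1
        out[i] *= gain            # fade in
        out[n - 1 - i] *= gain    # fade out (mirror image)
    return out


if __name__ == "__main__":
    print([round(v, 3) for v in hann_fade([1.0] * 10, 4)])
```

Since the segment now starts and ends at zero amplitude, splicing it back into the signal avoids the abrupt step that is heard as a click.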

The present invention has been described above on the basis of the embodiment. The embodiment merely illustrates the principles and applications of the invention, and many modifications and changes of arrangement are possible without departing from the idea of the invention as defined in the claims.
For example, this audio processing takes a certain amount of time; emitting the processed speech with a further added delay, or emitting several processed versions superimposed on the original speech, are also conceivable techniques.

2 booth, 4 IT partition, 6 customer, 8 listener, 10 storage device, 30 partial extraction unit, 90 partial change unit, 100 voice information secret talk system, SD SD controller unit, SP speaker, Mic microphone.

Claims

A partial extraction unit that extracts a change target part in units of substantially one mountain of an envelope of the waveform of the audio signal from the audio signal;
A partial change unit for changing the change target part extracted by the partial extraction unit;
An output unit that outputs at least the change target portion changed by the partial change unit to an audio output unit;
The audio signal represents speech,
The change target portion output by the output unit is converted into a sound by the sound output means and output to an area where the uttered sound can be heard.

The voice change device according to claim 1, wherein the partial change unit reverses the change target portion extracted by the partial extraction unit in a time direction.

A sound collecting means for receiving the uttered voice and generating a voice signal representing the voice;
A sound changing device for changing a sound signal generated by the sound collecting means;
Voice output means for converting the voice signal changed by the voice changing device into voice and outputting the voice signal to an area where the uttered voice can be heard; and
The voice changing device is
A partial extractor for extracting a change target portion in units of substantially one mountain of the envelope of the waveform of the sound signal from the sound signal generated by the sound collecting means;
A partial change unit for changing the change target part extracted by the partial extraction unit;
And an output unit that outputs at least the change target portion changed by the partial change unit to the voice output unit.

4. The voice information secret system according to claim 3, wherein the voice changing device and the voice output means are incorporated in a booth partition.

5. The voice information secret speech system according to claim 4, wherein the partition is subjected to sound absorption processing.

Obtaining an audio signal;
Extracting a change target portion in units of substantially one mountain of an envelope of the waveform of the audio signal from the acquired audio signal;
Changing the extracted change target part; and
Converting at least the changed change target portion into sound and outputting it, wherein
The audio signal represents speech,
In the outputting step, the change target portion is output to an area where the uttered voice can be heard.

A function to extract a change target portion in units of substantially one peak of the envelope of the waveform of the audio signal from the audio signal;
A function to change the extracted change target part,
A function to output at least the changed change target portion to audio output means; a computer program causing a computer to realize these functions, wherein
The audio signal represents speech,
The computer program characterized in that the change target portion output in the output function is converted into speech by the speech output means and output to an area where the uttered speech can be heard.