JP2023067365A

JP2023067365A - Concealed voice transmitter, concealed voice receiver, concealed voice transmission system, concealed voice transmission method, and concealed voice transmission program

Info

Publication number: JP2023067365A
Application number: JP2021178523A
Authority: JP
Inventors: Yuki Tachioka
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2021-11-01
Filing date: 2021-11-01
Publication date: 2023-05-16

Abstract

PROBLEM TO BE SOLVED: To provide a concealed voice transmitter, a concealed voice receiver, a concealed voice transmission system, a concealed voice transmission method, and a concealed voice transmission program that can deliver voice with high confidentiality and allow it to be restored simply.
SOLUTION: A concealed voice transmitter 12 generates a mask M based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a mixed voice of the secret voice and i disturbance sounds, in which the frequency component of the secret voice exceeds that of the disturbance sounds by at least a predetermined threshold θ. Based on the mask M, it then generates n mutually different share masks M'_n (2≤t≤n) from which the secret voice is restored by superposing any t or more of them. The concealed voice transmitter 12 generates n mutually different share voices by masking the spectrogram of the mixed voice with the n share masks M'_n, and outputs the n share voices after inverse Fourier transform.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a confidential voice transmission device, a confidential voice receiving device, a confidential voice transmission system, a confidential voice transmission method, and a confidential voice transmission program.

Voice distributed to an unspecified number of people often needs to carry a message concealed so that only specific recipients can receive it, and several methods have been proposed for this purpose.

For example, Patent Document 1 discloses a method of embedding encrypted audio in sound. In the method of Patent Document 1, identification information is extracted from a pickup signal obtained by picking up sound emitted by a sound emitting device in a facility, and a plurality of pieces of related information about the facility are received; the method is installed in exhibition facilities and used for their voice guidance. Patent Document 1 also discloses embedding the identification information in sound by modulating it into an inaudible band.

In general, the security of encryption rests on the computational complexity of decryption. The method of Patent Document 1, however, requires a separate means for obtaining the related information from the identification information, which complicates the configuration, and it becomes insecure if the encryption method used to embed the identification information in the sound is leaked. Furthermore, even if the identification information is modulated into an inaudible band to conceal it, it is easily decoded once the modulation frequency is known, so the method is not suitable for confidential voice communication.

In the field of image processing, on the other hand, the visual cryptography scheme (hereinafter "VCS") described in Non-Patent Document 1 is known as a technique that remains secure even if the encryption method is leaked. VCS generates n share images from one secret image, and the original secret image can be restored by collecting any t of them. Conversely, because the secret image cannot be restored from t−1 share images, the secret is not leaked unless t or more share images are collected. Such a scheme is also called a (t,n)-VCS.

A feature of VCS is that restoring the secret image only requires superimposing t or more share images; no complicated decoding process is needed. FIG. 11 shows an example of a (t,n) = (3,4)-VCS. In this example, the share images are generated so that superimposing three or more of them restores whether a pixel of the secret image is white or black. FIG. 11(A) shows the case where a pixel of the secret image is white (0), and FIG. 11(B) the case where it is black (1). There are four share images, and each pixel of the secret image is expanded into six subpixels, three horizontally by two vertically; the share images in FIG. 11 therefore correspond to a single pixel of the secret image. To restore the whole secret image, share images are generated for every pixel of the secret image.

In the example of FIG. 11, superimposing any two share images makes 4 of the 6 subpixels black, whether the original pixel of the secret image is white (0) or black (1). Any two of the share images in FIG. 11(A) are therefore indistinguishable from any two of those in FIG. 11(B). When any three share images are superimposed, by contrast, 4 of the 6 subpixels are black in FIG. 11(A), where the original pixel is white (0), whereas 5 of the 6 subpixels are black in FIG. 11(B), where the original pixel is black (1). The restored pixels of FIG. 11(A) and FIG. 11(B) thus differ in brightness, which reveals whether the original pixel was white or black. In this way, by expanding each pixel of the secret image and generating n share images, VCS ensures that the secret image cannot be recognized unless t or more share images are collected.

Non-Patent Documents 2 and 3 propose methods of generating share images that satisfy the conditions for restoring the secret image for arbitrary t and n. Patent Document 2 discloses a visual cryptography method using augmented reality, in which visual cryptography prevents distributed content from being viewed by unauthorized persons.

Patent Document 1: Japanese Patent Application Laid-Open No. 2020-021101
Patent Document 2: Japanese Translation of PCT International Application Publication No. 2017-538152
Non-Patent Document 1: M. Naor and A. Shamir, “Visual cryptography,” Advances in Cryptology - EUROCRYPT '94, Workshop on the Theory and Application of Cryptographic Techniques, Proceedings, Lecture Notes in Computer Science, vol. 950, pp. 1-12, Springer, 1994.
Non-Patent Document 2: S. J. Shyu and M. C. Chen, “Optimum pixel expansions for threshold visual secret sharing schemes,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 3, pp. 960-969, 2011.
Non-Patent Document 3: M. Iwamoto, “A weak security notion for visual secret sharing schemes,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 372-382, 2012.

As described above, VCS is effective for concealing images for distribution and restoring them, but it has not yet been applied to audio.

In view of the above background, an object of the present invention is to provide a confidential voice transmission device, a confidential voice receiving device, a confidential voice transmission system, a confidential voice transmission method, and a confidential voice transmission program that can distribute voice with high confidentiality and allow it to be restored easily.

The confidential voice transmission device of the present invention includes: a mask generation unit that generates a mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a confidential voice, whose frequency component exceeds a predetermined threshold; a share mask generation unit that generates, based on the mask, n mutually different share masks (2≤t≤n) from which the confidential voice is restored by superimposing t or more of them; a share voice generation unit that generates n mutually different share voices by masking, with the n share masks, a spectrogram obtained by short-time Fourier transforming a mixed voice in which another voice is mixed with the confidential voice; and a share voice output unit that outputs the n share voices after inverse Fourier transform.

In the confidential voice transmission device of the present invention, the other voice may be i interfering sounds, and the mask generation unit may generate the mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a mixed voice in which the i interfering sounds are mixed with the confidential voice, where the frequency component of the confidential voice exceeds that of the interfering sounds by at least a predetermined threshold.

In the confidential voice transmission device of the present invention, the other voice may be n cover voices. In that case, the mask generation unit may generate the mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a mixed voice in which the n cover voices are mixed with the confidential voice, where the frequency component of the confidential voice exceeds that of the cover voices by at least a predetermined threshold, and may generate j cover masks (1≤j≤n) based on the presence or absence of frequency bins where the frequency component of a cover voice exceeds that of the confidential voice and of the other cover voices by at least a predetermined threshold. The share mask generation unit may then generate the n share masks based on the mask and the j cover masks.

The confidential voice transmission system of the present invention includes the confidential voice transmission device described above, and a confidential voice receiving device that restores the confidential voice by superimposing t or more of the n share voices output from the confidential voice transmission device.

In another aspect, the confidential voice transmission device of the present invention includes: a mask generation unit that generates a mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a confidential voice, whose frequency component exceeds a predetermined threshold; a share mask generation unit that generates, based on the mask, n mutually different share masks (2≤t≤n) from which the confidential voice is restored by superimposing t or more of them; a share mask output unit that outputs the n share masks; and a voice output unit that outputs a mixed voice in which i interfering sounds are mixed with the confidential voice.

In this confidential voice transmission device, the mask generation unit may generate the mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming the mixed voice in which the i interfering sounds are mixed with the confidential voice, where the frequency component of the confidential voice exceeds that of the interfering sounds by at least a predetermined threshold.

In the confidential voice transmission device of the present invention, the share mask output unit may output, as sound, n share voices generated by masking, with the n share masks, a spectrogram obtained by short-time Fourier transforming noise whose intensity is at least a predetermined value over a predetermined frequency range, and then inverse Fourier transforming the result.

In the confidential voice transmission device of the present invention, the share mask output unit may output each of the n share masks as digital data.

The confidential voice transmission system of the present invention may include the confidential voice transmission device described above, and a confidential voice receiving device that restores the mask from t or more of the n share masks output from the share mask output unit and restores the confidential voice by applying the restored mask to the mixed voice output from the voice output unit.

In the confidential voice transmission device of the present invention, the mask generation unit may generate the mask by setting frequency bins at or above the threshold to 1 and frequency bins below the threshold to 0.
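A minimal NumPy sketch of this binary mask is shown below. It assumes that the confidential voice and the interfering sounds are available as separate magnitude spectrograms on the same STFT grid, that the threshold θ is expressed in decibels, and that "exceeds the interfering sounds" means exceeding the strongest of them in each bin; none of these choices is stated in the source, and `binary_mask` and `theta_db` are hypothetical names.

```python
import numpy as np

def binary_mask(secret_mag, interf_mags, theta_db=6.0, eps=1e-12):
    """Return a 0/1 mask (frequency bins x time frames): 1 where the secret voice
    exceeds every interfering sound by at least theta_db decibels, else 0."""
    secret_db = 20.0 * np.log10(secret_mag + eps)
    # Assumption: compare against the strongest interfering sound in each bin.
    interf_db = 20.0 * np.log10(np.max(interf_mags, axis=0) + eps)
    return (secret_db - interf_db >= theta_db).astype(np.uint8)

# Toy magnitude spectrograms: 2 frequency bins x 3 time frames, i = 1 interfering sound.
secret = np.array([[1.0, 0.1, 2.0],
                   [0.5, 3.0, 0.4]])
interference = np.array([[[0.1, 1.0, 1.5],
                          [0.5, 0.1, 0.1]]])
print(binary_mask(secret, interference))
```

Only bins where the confidential voice clearly dominates are set to 1, which is the information the share masks later conceal.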

In the confidential voice transmission device of the present invention, the share mask generation unit may expand the mask by a factor of m by increasing the number of frequency bins and the number of time frames, and generate the n share masks as matrices that satisfy the VCS (Visual Cryptography Scheme) basis matrices.
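One way to picture this step is the standard Naor-Shamir share construction applied bin by bin: take X_1 for a mask bin equal to 1 and X_0 for a bin equal to 0, randomly permute its columns, and give row k to share mask k. The sketch below is a hypothetical NumPy illustration, not the claimed implementation. It assumes the (3,4) basis matrices of FIG. 11 and assumes that the m = 6 subpixels are laid out as 3 frequency bins × 2 time frames, mirroring the 3 × 2 pixel layout of FIG. 11; the source does not fix this layout, and `make_share_masks` is an illustrative name.

```python
import numpy as np

# (3,4)-VCS basis matrices of FIG. 11: n = 4 rows (shares), m = 6 columns (subpixels).
X0 = np.array([[0, 1, 1, 1, 0, 0],
               [1, 0, 1, 1, 0, 0],
               [1, 1, 0, 1, 0, 0],
               [1, 1, 1, 0, 0, 0]], dtype=np.uint8)
X1 = np.array([[1, 0, 0, 0, 1, 1],
               [0, 1, 0, 0, 1, 1],
               [0, 0, 1, 0, 1, 1],
               [0, 0, 0, 1, 1, 1]], dtype=np.uint8)
EXP_F, EXP_T = 3, 2  # assumed layout of the m = 6 subpixels: 3 frequency bins x 2 time frames

def make_share_masks(mask, rng):
    """Expand an F x T binary mask into n share masks of shape (3F, 2T)."""
    n, m = X0.shape
    F, T = mask.shape
    shares = np.zeros((n, F * EXP_F, T * EXP_T), dtype=np.uint8)
    for f in range(F):
        for t in range(T):
            # Pick X1 for a 1-bin, X0 for a 0-bin, then shuffle the columns. Because
            # X0[S] and X1[S] match up to a column permutation for any t-1 shares S
            # (Condition 2), the random shuffle makes those shares look the same
            # for 0-bins and 1-bins.
            basis = (X1 if mask[f, t] else X0)[:, rng.permutation(m)]
            # Row k of the permuted matrix becomes the 3 x 2 block of share k.
            shares[:, f*EXP_F:(f+1)*EXP_F, t*EXP_T:(t+1)*EXP_T] = basis.reshape(n, EXP_F, EXP_T)
    return shares

rng = np.random.default_rng(0)
mask = np.array([[1, 0, 0],
                 [0, 1, 1]], dtype=np.uint8)
shares = make_share_masks(mask, rng)
print(shares.shape)  # (4, 6, 6)
```

Every block of every share mask contains exactly three ones, so a single share mask carries no visible trace of the underlying mask value.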

In the confidential voice transmission system of the present invention, the confidential voice receiving device may include a similarity determination unit that determines the similarity between the restored voice and the confidential voice.

The confidential voice transmission method of the present invention includes: a first step of generating a mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a confidential voice, whose frequency component exceeds a predetermined threshold; a second step of generating, based on the mask, n mutually different share masks (2≤t≤n) from which the confidential voice is restored by superimposing t or more of them; a third step of generating n mutually different share voices by masking, with the n share masks, a spectrogram obtained by short-time Fourier transforming a mixed voice in which another voice is mixed with the confidential voice; a fourth step of outputting the n share voices after inverse Fourier transform; and a fifth step of restoring the confidential voice by superimposing t or more of the n output share voices.

In another aspect, the confidential voice transmission method of the present invention includes: a first step of generating a mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a confidential voice, whose frequency component exceeds a predetermined threshold; a second step of generating, based on the mask, n mutually different share masks (2≤t≤n) from which the confidential voice is restored by superimposing t or more of them; a third step of outputting the n share masks and outputting a mixed voice in which i interfering sounds are mixed with the confidential voice; and a fourth step of restoring the mask from t or more of the n output share masks and restoring the confidential voice by applying the restored mask to the mixed voice.
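The receiver side of this method (restoring the mask from t or more share masks and applying it to the mixed voice) can be sketched as follows. This is a hypothetical NumPy illustration, not the claimed implementation. It reuses the (3,4) basis matrices of FIG. 11 and assumes a 3 × 2 (frequency × time) subpixel layout per mask bin. It further assumes that the receiver knows the basis matrices (VCS is designed to stay secure even if the scheme is known), so it can place its decision threshold midway between HW(OR(X_0[S])) and HW(OR(X_1[S])). Finally, it applies the restored mask directly to a random complex array standing in for the STFT of the mixed voice. All function names are illustrative.

```python
import numpy as np

# (3,4)-VCS basis matrices from FIG. 11 (rows = shares, columns = m = 6 subpixels).
X0 = np.array([[0, 1, 1, 1, 0, 0], [1, 0, 1, 1, 0, 0],
               [1, 1, 0, 1, 0, 0], [1, 1, 1, 0, 0, 0]], dtype=np.uint8)
X1 = np.array([[1, 0, 0, 0, 1, 1], [0, 1, 0, 0, 1, 1],
               [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 1, 1]], dtype=np.uint8)
EXP_F, EXP_T = 3, 2  # assumed subpixel layout: 3 frequency bins x 2 time frames

def make_share_masks(mask, rng):
    """Transmitter side (hypothetical): one column-permuted basis matrix per mask bin."""
    F, T = mask.shape
    shares = np.zeros((X0.shape[0], F * EXP_F, T * EXP_T), dtype=np.uint8)
    for f in range(F):
        for t in range(T):
            basis = (X1 if mask[f, t] else X0)[:, rng.permutation(X0.shape[1])]
            shares[:, f*EXP_F:(f+1)*EXP_F, t*EXP_T:(t+1)*EXP_T] = basis.reshape(-1, EXP_F, EXP_T)
    return shares

def restore_mask(shares, subset):
    """Receiver side: OR-superimpose the chosen share masks, then decide each block."""
    rows = list(subset)
    stacked = shares[rows].max(axis=0)                   # element-wise OR of the shares
    F, T = stacked.shape[0] // EXP_F, stacked.shape[1] // EXP_T
    counts = stacked.reshape(F, EXP_F, T, EXP_T).sum(axis=(1, 3))  # "on" subpixels per block
    w0 = int(X0[rows].max(axis=0).sum())                 # HW(OR(X0[S])): count for a 0-bin
    w1 = int(X1[rows].max(axis=0).sum())                 # HW(OR(X1[S])): count for a 1-bin
    return (counts > (w0 + w1) / 2).astype(np.uint8)

rng = np.random.default_rng(1)
mask = np.array([[1, 0, 0], [0, 1, 1]], dtype=np.uint8)
shares = make_share_masks(mask, rng)
mask_hat = restore_mask(shares, (0, 2, 3))               # any t = 3 of the n = 4 shares
mixture = rng.normal(size=mask.shape) + 1j * rng.normal(size=mask.shape)  # stand-in STFT
restored = mixture * mask_hat                            # keep bins dominated by the secret voice
print(np.array_equal(mask_hat, mask))
```

With only two shares, w0 and w1 are both 4, so every block gives the same count and the restored mask carries no information, matching the (3,4) threshold behavior.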

The confidential voice transmission program of the present invention causes a computer to execute: a first step of generating a mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a confidential voice, whose frequency component exceeds a predetermined threshold; a second step of generating, based on the mask, n mutually different share masks (2≤t≤n) from which the confidential voice is restored by superimposing t or more of them; a third step of generating n mutually different share voices by masking, with the n share masks, a spectrogram obtained by short-time Fourier transforming a mixed voice in which another voice is mixed with the confidential voice; a fourth step of outputting the n share voices after inverse Fourier transform; and a fifth step of restoring the confidential voice by superimposing t or more of the n output share voices.

In another aspect, the confidential voice transmission program of the present invention causes a computer to execute: a first step of generating a mask based on the presence or absence of frequency bins, in a spectrogram obtained by short-time Fourier transforming a confidential voice, whose frequency component exceeds a predetermined threshold; a second step of generating, based on the mask, n mutually different share masks (2≤t≤n) from which the confidential voice is restored by superimposing t or more of them; a third step of outputting the n share masks and outputting a mixed voice in which i interfering sounds are mixed with the confidential voice; and a fourth step of restoring the mask from t or more of the n output share masks and restoring the confidential voice by applying the restored mask to the mixed voice.

According to the present invention, voice can be distributed with high confidentiality and restored easily.

FIG. 1 is a schematic configuration diagram of the confidential voice transmission system of the first embodiment.
FIG. 2 is a functional block diagram of the confidential voice transmission device of the first embodiment.
FIG. 3 is a functional block diagram of the confidential voice receiving device of the first embodiment.
FIG. 4 is a schematic diagram of decoding in the first embodiment.
FIG. 5 is a functional block diagram of the confidential voice receiving device of the second embodiment.
FIG. 6 is a functional block diagram of the confidential voice transmission device of the third embodiment.
FIG. 7 is a functional block diagram of the confidential voice receiving device of the third embodiment.
FIG. 8 is a functional block diagram of the confidential voice transmission device of the fourth embodiment.
FIG. 9 is a functional block diagram of the confidential voice transmission device of the fifth embodiment.
FIG. 10 is a functional block diagram of the confidential voice receiving device of the fifth embodiment.
FIG. 11 is a schematic diagram of VCS share images, where (A) shows the case in which a pixel of the secret image is white and (B) the case in which it is black.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described below are examples of carrying out the present invention and do not limit the present invention to the specific configurations described. In carrying out the present invention, a specific configuration suited to the embodiment may be adopted as appropriate.

(Basis matrices of VCS)
The embodiments described below apply VCS (visual cryptography scheme) to the distribution and decoding of confidential voice. An outline of VCS is therefore given first.

VCS is a technique that generates n share images from one secret image and restores the original secret image by collecting any t (2≤t≤n) or more of them. As an application example, participants in a game (also called users) acquire share image data at different places or at different times, and only participants who have acquired t or more share images obtain the secret image and can proceed to the next stage of the game.

To generate share images for arbitrary (t, n), the participants' access structure Γ over the share images must first be defined. For the set of n share images $P = \{1, 2, \ldots, n\}$, the subsets describing which share images a participant holds form the power set $2^P$. The set of subsets of share images whose superposition restores the secret image is called the qualified set $\Gamma_Q$.

Because the secret image can be restored once t share images are collected, the minimal elements of $\Gamma_Q$ are the sets containing exactly t share images; these form the minimal qualified set $\Gamma_Q^*$. Conversely, the sets of share images from which no information whatsoever about the secret image can be obtained form the forbidden set $\Gamma_F$. $\Gamma_F$ is therefore the complement of $\Gamma_Q$ in the power set $2^P$, and $\Gamma = (\Gamma_Q, \Gamma_F)$. The sets containing exactly t−1 share images form the maximal forbidden set $\Gamma_F^*$.
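For a threshold scheme these sets can be enumerated directly. The following Python sketch (function names are hypothetical) lists the power set $2^P$ for $P = \{1, \ldots, n\}$ and splits it into $\Gamma_Q$, $\Gamma_F$, $\Gamma_Q^*$ and $\Gamma_F^*$; for (t, n) = (3, 4) it yields 5 qualified sets, 11 forbidden sets, 4 minimal qualified sets and 6 maximal forbidden sets.

```python
from itertools import chain, combinations

def power_set(P):
    """All subsets of P, i.e. the power set 2^P, as frozensets."""
    P = list(P)
    return [frozenset(c) for c in chain.from_iterable(combinations(P, r) for r in range(len(P) + 1))]

def threshold_access_structure(t, n):
    """Split 2^P for P = {1..n} into the sets of a (t, n) threshold access structure."""
    subsets = power_set(range(1, n + 1))
    qualified = [S for S in subsets if len(S) >= t]        # Gamma_Q
    forbidden = [S for S in subsets if len(S) < t]         # Gamma_F = 2^P minus Gamma_Q
    minimal_q = [S for S in qualified if len(S) == t]      # Gamma_Q^*
    maximal_f = [S for S in forbidden if len(S) == t - 1]  # Gamma_F^*
    return qualified, forbidden, minimal_q, maximal_f

q, f, q_min, f_max = threshold_access_structure(3, 4)
print(len(q), len(f), len(q_min), len(f_max))
```

Conditions on basis matrices only need to be checked on $\Gamma_Q^*$ and $\Gamma_F^*$, because adding shares to a qualified set or removing shares from a forbidden set cannot change its status.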

When the secret image is a binary image and m denotes the pixel expansion factor, a pair of n×m Boolean matrices $(X_0, X_1)$ is called a pair of basis matrices realizing the access structure Γ if it satisfies the following two conditions. Here 0 represents a white pixel and 1 a black pixel.

(Condition 1) Recoverability of the secret image:
There exists a constant α > 0 such that, for every $S \in \Gamma_Q^*$,
$$\mathrm{HW}\bigl(\mathrm{OR}(X_0[S])\bigr) + \alpha m \le \mathrm{HW}\bigl(\mathrm{OR}(X_1[S])\bigr).$$
Here $X[S]$ denotes the operation of extracting from X only the rows corresponding to S, OR is the column-wise OR, and HW is the Hamming weight.

(Condition 2) Security:
For every $S \in \Gamma_F^*$, $X_0[S]$ and $X_1[S]$ can be made equal by a suitable permutation of columns.

Share images are generated from basis matrices $X_0, X_1$ that satisfy the above conditions. For example, $X_0, X_1$ can be obtained by solving an integer programming problem whose criterion is to minimize the pixel expansion factor m; this method is described in detail in References 1 and 2 below. The rows of the obtained $X_0$ and $X_1$ give the pixel values of the pixel-expanded share images corresponding to the secret-image pixel values 0 and 1, respectively.

As an example, for t = 3 and n = 4, i.e., the (3,4)-VCS with pixel expansion factor m = 6 shown in FIG. 11, the matrices are as follows (row k corresponds to share image k):
$$X_0 = \begin{bmatrix} 0&1&1&1&0&0\\ 1&0&1&1&0&0\\ 1&1&0&1&0&0\\ 1&1&1&0&0&0 \end{bmatrix},\qquad X_1 = \begin{bmatrix} 1&0&0&0&1&1\\ 0&1&0&0&1&1\\ 0&0&1&0&1&1\\ 0&0&0&1&1&1 \end{bmatrix}$$
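The two conditions can be checked mechanically for these matrices. The following Python sketch (helper names are hypothetical) computes HW(OR(X[S])) for every minimal qualified set and compares column multisets for every maximal forbidden set. For the FIG. 11 matrices the smallest gap is one subpixel, so α = 1/6, and both conditions hold.

```python
from itertools import combinations

# Basis matrices of the (3,4)-VCS in FIG. 11; row k is share k, columns are the m = 6 subpixels.
X0 = [[0, 1, 1, 1, 0, 0],
      [1, 0, 1, 1, 0, 0],
      [1, 1, 0, 1, 0, 0],
      [1, 1, 1, 0, 0, 0]]
X1 = [[1, 0, 0, 0, 1, 1],
      [0, 1, 0, 0, 1, 1],
      [0, 0, 1, 0, 1, 1],
      [0, 0, 0, 1, 1, 1]]

def hw_or(X, S):
    """HW(OR(X[S])): number of black subpixels after superimposing the rows in S."""
    return sum(max(X[r][c] for r in S) for c in range(len(X[0])))

def columns(X, S):
    """Sorted columns of X[S]; equal lists <=> equal up to a column permutation."""
    return sorted(tuple(X[r][c] for r in S) for c in range(len(X[0])))

def check_basis(X0, X1, t):
    n, m = len(X0), len(X0[0])
    # Condition 1 over the minimal qualified sets (all t-subsets): the largest valid
    # alpha is the smallest gap divided by m.
    alpha = min(hw_or(X1, S) - hw_or(X0, S) for S in combinations(range(n), t)) / m
    # Condition 2 over the maximal forbidden sets (all (t-1)-subsets).
    secure = all(columns(X0, S) == columns(X1, S) for S in combinations(range(n), t - 1))
    return alpha, secure

alpha, secure = check_basis(X0, X1, t=3)
print(f"alpha = {alpha:.4f}, condition 1: {alpha > 0}, condition 2: {secure}")
# prints "alpha = 0.1667, condition 1: True, condition 2: True"
```

Running the same check with t = 2 gives α = 0, matching the observation in FIG. 11 that two share images alone cannot distinguish a white pixel from a black one.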

Reference 1: S. J. Shyu and M. C. Chen, “Optimum pixel expansions for threshold visual secret sharing schemes,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 3, pp. 960-969, 2011.
Reference 2: M. Iwamoto, “A weak security notion for visual secret sharing schemes,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 372-382, 2012.

(First embodiment)
FIG. 1 is a schematic configuration diagram of a confidential voice transmission system 10 in which the VCS described above is applied to confidential voice transmission.

The confidential voice transmission system 10 of this embodiment is used, as one example, for games and leisure. For example, participants may be able to hear a secret message (confidential voice), intelligible only to the intended persons, from voices (share voices) recorded by actually visiting a plurality of predetermined locations. A participant who has heard the confidential voice has thereby shown that they followed the designated course. The confidential voice transmission system 10 can also be used for proof of presence. For example, several participants simultaneously record share voices distributed at the same time, and once a predetermined number of share voices has been collected, the secret message (confidential voice) can be heard. The participants who have heard the confidential voice have thereby shown that they were at a predetermined place at a predetermined time.

As shown in FIG. 1, the confidential voice transmission system 10 of this embodiment includes a confidential voice transmission device 12 and a confidential voice receiving device 14.

The confidential voice transmission device 12 generates n mutually different share voices from a message to be concealed (hereinafter "confidential voice") and outputs them from a speaker 16 installed in a public space, thereby distributing the message while preserving its confidentiality. No single share voice reveals the confidential voice, but superimposing t (2≤t≤n) or more share voices restores it. In other words, the confidential voice transmission device 12 encrypts the confidential voice into n share voices and distributes them.

The t share voices may, for example, be output from one speaker 16 at different times, or separately from a plurality of speakers 16 located at different places.

The confidential voice receiving device 14 is, for example, a mobile terminal device such as a smartphone or tablet owned by a game participant, provided with a function (application) that decodes the confidential voice from t or more share voices. A participant can hear the confidential voice by collecting t or more share voices with the mobile terminal device serving as the confidential voice receiving device 14 and having it decode them. If fewer than t share voices have been collected, however, the participant cannot restore the confidential voice and therefore cannot hear it. The restored confidential voice may be output from the speaker of the mobile terminal device serving as the confidential voice receiving device 14, or may be stored as digital data.

Sound source separation, which recovers the signal of each source from audio in which several different signals are mixed, is an active research topic. Under underdetermined conditions, where there are fewer observation channels than sound sources, it is difficult to separate the signal of the one source of interest from those of the other sources. Separation is especially hard when there is only a single observation channel and no information about the source positions is available.

The confidential voice transmission system 10 of this embodiment exploits this fact. The speaker 16 outputs a mixed voice, obtained by mixing the confidential voice with other sounds (i interfering sounds in this embodiment). The share voices are generated from this mixed voice by processing based on VCS.

FIG. 2 is a functional block diagram of the confidential voice transmission device 12.

In addition to the speaker 16, the confidential voice transmission device 12 includes a Fourier transform unit 20, a mask generation unit 22, a share mask generation unit 24, a masking unit 26, and an inverse Fourier transform unit 28. The processing of these units is carried out by a program stored on a recording medium of the confidential voice transmission device 12. Executing this program carries out the corresponding method.

The Fourier transform unit 20 is an FFT (Fast Fourier Transform) analyzer. It applies a short-time Fourier transform to a mixed voice obtained by mixing the confidential voice with i interfering sounds. The i interfering sounds are mutually different. They may be meaningless sounds, or speech with other content uttered either by the same speaker as the confidential voice or by a different speaker.

The mask generation unit 22 works on the spectrogram obtained by Fourier-transforming the mixed voice. For each frequency bin (time-frequency bin), it checks whether the frequency component of the confidential voice exceeds that of the interfering sounds by at least a predetermined threshold θ, and generates a mask M from the result. Specifically, the mask generation unit 22 of this embodiment sets bins that meet the threshold θ to 1 and all other bins to 0.

Based on the mask M, the share mask generation unit 24 generates n mutually different share masks M'_n (2 ≤ t ≤ n), such that superimposing t or more of them restores the confidential voice. As described in detail later, it enlarges the mask by a factor of m by increasing the number of frequency bins and/or time frames. It then generates the n share masks M'_n so that they satisfy the VCS basis matrices X_0 and X_1.

The masking unit 26 masks the spectrogram obtained by short-time Fourier-transforming the mixed voice with each of the n share masks M'_n, generating n mutually different share voices.

The inverse Fourier transform unit 28 applies an inverse Fourier transform to the n share voices generated by the masking unit 26.
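The STFT, masking and inverse STFT chain performed by units 20, 26 and 28 can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not taken from the patent:

- a naive DFT is used;
- the window is rectangular with no overlap (hop equal to the window length), so the inverse is exact;
- the mask is given per one-sided frequency bin and mirrored onto the negative frequencies, so that the output stays real.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT of one real-valued frame (illustration only)."""
    n = len(frame)
    return [sum(frame[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse DFT; the real part is kept because the input signal is real."""
    n = len(spectrum)
    return [(sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)) / n).real
            for k in range(n)]


def mask_and_resynthesize(signal, mask, win):
    """STFT -> binary mask -> inverse STFT.

    Assumptions: rectangular window, hop == win (no overlap), and mask[tau]
    holds win // 2 + 1 one-sided bin values that are mirrored to the
    negative frequencies so the resynthesized signal is real.
    """
    out = []
    for tau, start in enumerate(range(0, len(signal) - win + 1, win)):
        spec = dft(signal[start:start + win])
        half = mask[tau]
        full = [half[min(f, win - f)] for f in range(win)]  # mirror bins
        out.extend(idft([m * x for m, x in zip(full, spec)]))
    return out
```

With an all-ones mask the signal is returned unchanged. Keeping only one bin isolates the component that lives in that bin.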

The speaker 16 outputs the n inverse-Fourier-transformed share voices as sound, delivering them to users.

FIG. 3 is a functional block diagram of the confidential voice receiving device 14. The confidential voice receiving device 14 includes a microphone 30, a synchronization unit 32, and a voice restoration unit 34. The processing of the synchronization unit 32 and the voice restoration unit 34 is carried out by a program stored on a recording medium of the confidential voice receiving device 14. Executing this program carries out the corresponding method.

The microphone 30 picks up the share voices output from the speaker 16 of the confidential voice transmission device 12.

To restore the confidential voice from the share voices, the synchronization unit 32 aligns the start points of the share voices.

The voice restoration unit 34 restores the confidential voice by superimposing t or more of the synchronized share voices. If the confidential voice receiving device 14 has collected fewer than t share voices, the voice restoration unit 34 cannot restore the confidential voice. The restored confidential voice is output from the speaker of the mobile terminal device serving as the confidential voice receiving device 14.

Next, the generation of the share masks M'_n in this embodiment is described in detail.

The Fourier transform unit 20 applies the short-time Fourier transform to the confidential voice and to each interfering sound i (1 ≤ i ≤ I). This yields the confidential-voice spectrum s(τ,f) and the interfering-sound spectra σ_i(τ,f). Here τ is the time-frame index (1 ≤ τ ≤ T) and f is the frequency-bin index (1 ≤ f ≤ F).

The mask generation unit 22 sets the mask M to 1 in the time-frequency bins where the level |s| of the confidential voice exceeds the level |σ_i| of the interfering sounds by at least the threshold θ. It sets M to 0 in all other bins. In other words, M is 1 wherever the confidential voice is louder than the interfering sounds by a sufficient margin. This is expressed by Equation 1.

Multiplying the spectrum X of the mixed voice, given by Equation 2, by the mask M(τ,f) extracts the dominant part of the confidential-voice spectrum. The mask M(τ,f) is thus a {0,1} representation of the confidential-voice spectrum.
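Equation 1 is not reproduced in this text, so the sketch below makes an explicit assumption: "exceeds by at least θ" is read as a level difference in decibels against the loudest interfering sound in each bin. The sketch builds M from magnitude spectrograms and applies it to the mixed-voice spectrum X. The function names `binary_mask` and `apply_mask` are hypothetical.

```python
import math


def binary_mask(s_mag, interferer_mags, theta_db):
    """M[tau][f] = 1 if |s| beats every |sigma_i| by >= theta_db dB, else 0.

    The dB-difference reading of the threshold is an assumption.
    """
    eps = 1e-12  # avoids division by zero and log(0) in silent bins
    T, F = len(s_mag), len(s_mag[0])
    mask = [[0] * F for _ in range(T)]
    for tau in range(T):
        for f in range(F):
            loudest = max(sig[tau][f] for sig in interferer_mags)
            diff_db = 20 * math.log10((s_mag[tau][f] + eps) / (loudest + eps))
            mask[tau][f] = 1 if diff_db >= theta_db else 0
    return mask


def apply_mask(mask, mixture):
    """Element-wise product M(tau, f) * X(tau, f) on a (complex) spectrogram."""
    return [[m * x for m, x in zip(mrow, xrow)] for mrow, xrow in zip(mask, mixture)]
```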

The share mask generation unit 24 therefore treats the mask M produced by the mask generation unit 22 as the secret image (a binary image) of a VCS, and generates the share masks M'_n from it. That is, the confidential voice transmission device 12 generates the share masks M'_n using the VCS basis matrices X_0 and X_1. It then masks the mixed voice to create share voices from which the confidential voice can be restored once t or more of them are collected.

Generating the share masks M'_n also requires an operation on the mask M equivalent to pixel expansion in VCS. The share mask generation unit 24 of this embodiment therefore enlarges M by a factor of m (m an integer) by increasing the number of frequency bins and/or time frames. It then generates the share masks M'_n so that they satisfy the VCS basis matrices X_0 and X_1.
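The sketch below shows how VCS basis matrices turn M into share masks. It assumes the textbook (2,2) scheme with pixel expansion m = 2; the patent does not list concrete matrices here. It also assumes that M(τ,f) = 1 is encoded with X_1 and that expansion happens along frequency only.

Each mask bin becomes two sub-bins, and one random column permutation per bin is applied to both rows of the chosen basis matrix. The Boolean OR in `stack_or` models VCS stacking. The patent's receiver instead adds or multiplies share voices, so this only illustrates the mask-level structure.

```python
import random

# Textbook (2,2) VCS basis matrices, pixel expansion m = 2 (assumed).
# Rows = shares, columns = sub-bins.
X0 = [[1, 0], [1, 0]]  # used where M = 0: both shares get the same pattern
X1 = [[1, 0], [0, 1]]  # used where M = 1: shares get complementary patterns


def make_share_masks(mask, rng):
    """Expand each bin of `mask` (T x F) into 2 frequency sub-bins and
    return two share masks of size T x 2F."""
    m = 2
    T, F = len(mask), len(mask[0])
    shares = [[[0] * (m * F) for _ in range(T)] for _ in range(2)]
    for tau in range(T):
        for f in range(F):
            basis = X1 if mask[tau][f] else X0
            cols = list(range(m))
            rng.shuffle(cols)  # one random column permutation per bin, used for both rows
            for n in range(2):
                for k, c in enumerate(cols):
                    shares[n][tau][m * f + k] = basis[n][c]
    return shares


def stack_or(shares):
    """Superimpose share masks bin-wise with Boolean OR (VCS stacking)."""
    return [[int(any(vals)) for vals in zip(*rows)] for rows in zip(*shares)]
```

Each share on its own has exactly one active sub-bin per original bin, whatever M is. After stacking, M = 1 bins are fully active and M = 0 bins only half active.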

The mask M can be enlarged m times, for example, in one of the following ways:
(1) Increase the number of frequency bins by a factor of m.
(2) Increase the number of time frames by a factor of m.
(3) Increase the number of frequency bins by a factor of m_1 and the number of time frames by a factor of m_2 (m_1 × m_2 = m).
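Option (3) means that the m sub-pixels of one basis-matrix row must be laid out as a block of m_2 time frames by m_1 frequency bins. Options (1) and (2) are the special cases (m_1, m_2) = (m, 1) and (1, m). The row-major layout below is a hypothetical choice for illustration; any fixed layout used consistently by sender and receiver would serve.

```python
def expand_bin(pattern, m1, m2):
    """Lay one basis-matrix row of length m = m1 * m2 out as an
    m2 (time frames) x m1 (frequency bins) block, row-major in time."""
    if len(pattern) != m1 * m2:
        raise ValueError("pattern length must equal m1 * m2")
    return [pattern[t * m1:(t + 1) * m1] for t in range(m2)]


def expand_mask(patterns, m1, m2):
    """patterns[tau][f] is the length-m row chosen for bin (tau, f).
    Return a (T * m2) x (F * m1) share mask."""
    T, F = len(patterns), len(patterns[0])
    out = [[0] * (F * m1) for _ in range(T * m2)]
    for tau in range(T):
        for f in range(F):
            block = expand_bin(patterns[tau][f], m1, m2)
            for dt in range(m2):
                for df in range(m1):
                    out[tau * m2 + dt][f * m1 + df] = block[dt][df]
    return out
```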

The number of frequency bins can be increased, for example, in one of the following ways:
(1) Multiply by m the sampling frequency of the short-time Fourier transform used to generate the share voices from the mixed voice. This widens the bandwidth of the share voices, so that m times as many frequency bins fall within the same time width.
(2) Perform the short-time Fourier transform with the window length multiplied by m.

The number of time frames can be increased, for example, in one of the following ways:
(1) Multiply the sampling frequency by m before the short-time Fourier transform, so that, for the same shift width in samples, m times as many time frames are obtained.
(2) Perform the short-time Fourier transform with the frame shift multiplied by 1/m.
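The effect of these options on the spectrogram size can be checked with simple arithmetic. The sketch assumes a one-sided STFT without padding, with win // 2 + 1 bins and 1 + (samples − win) // hop frames. The 16 kHz, 512-sample window and 256-sample hop baseline is a hypothetical example with m = 2. The bin count goes from N/2 + 1 to N + 1 (twice the two-sided count), and the frame count roughly doubles, up to edge effects.

```python
def stft_shape(num_samples, win, hop):
    """(frequency bins, time frames) of a one-sided STFT without padding."""
    if num_samples < win:
        return (win // 2 + 1, 0)
    return (win // 2 + 1, 1 + (num_samples - win) // hop)


# 1 s of audio at 16 kHz, 512-sample window, 256-sample hop (assumed baseline).
base = stft_shape(16000, 512, 256)           # (257, 61)
freq_opt2 = stft_shape(16000, 1024, 256)     # window length x2 -> (513, 59)
time_opt2 = stft_shape(16000, 512, 128)      # frame shift x1/2 -> (257, 122)
freq_opt1 = stft_shape(32000, 1024, 512)     # fs x2, same window/hop in seconds -> (513, 61)
time_opt1 = stft_shape(32000, 512, 256)      # fs x2, same window/hop in samples -> (257, 124)
```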

In this way, to generate the share masks M'_n, the share mask generation unit 24 performs an enlargement process that expands the mask M in at least one of the frequency and time directions. This produces share masks M'_n that are m times the size of the spectrogram of M. The masking unit 26 then generates the n share voices by masking the short-time-Fourier-transformed mixed voice with the share masks M'_n. The spectrogram of the mixed voice to be masked is likewise made m times larger.

The share voice Y generated by the masking process is expressed by Equation 3.

The inverse Fourier transform unit 28 then applies an inverse short-time Fourier transform to the share voices, turning them into signals the speaker 16 can play, and the speaker 16 outputs them.

For example, a single user obtains at least t of the n share voices by recording the output of the speaker 16 several times with the microphone 30 of the confidential voice receiving device 14, such as a mobile terminal device. Other users should be kept from obtaining the share voices and hearing the confidential voice. To that end, the share voices can be output from the speaker 16 at a designated time known only to the intended users, which reduces the chance that others hear the confidential voice.

The confidential voice receiving device 14 aligns the start points of the t or more recorded share voices. As shown in FIG. 4, it then restores the confidential voice by adding them in time synchronization. The synchronization can be achieved, for example, by computing the cross-correlation between the share-voice signals. The synchronous addition of t share voices is expressed by Equation 4. When the indices j form a qualified set in Γ_Q, the sum Σ of the share masks M'_n in Equation 4 becomes a mask that keeps the confidential voice within the mixed voice, and the confidential voice is thereby restored.
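The cross-correlation synchronization followed by synchronous addition can be sketched as follows. The brute-force lag search and the `max_lag` bound are assumptions for illustration. A practical receiver would likely use FFT-based correlation on much longer recordings.

```python
def best_lag(ref, sig, max_lag):
    """Lag d (|d| <= max_lag) that maximizes sum_k ref[k] * sig[k + d]."""
    def score(d):
        return sum(ref[k] * sig[k + d] for k in range(len(ref)) if 0 <= k + d < len(sig))
    return max(range(-max_lag, max_lag + 1), key=score)


def align(sig, lag, length):
    """Shift sig so that sig[k + lag] lands at index k; zero-fill the edges."""
    return [sig[k + lag] if 0 <= k + lag < len(sig) else 0.0 for k in range(length)]


def sync_and_add(recordings, max_lag):
    """Align every recording to the first one, then add them sample by sample."""
    ref = recordings[0]
    total = list(ref)
    for rec in recordings[1:]:
        lag = best_lag(ref, rec, max_lag)
        total = [a + b for a, b in zip(total, align(rec, lag, len(ref)))]
    return total
```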

Unlike images, however, audio has redundancy between adjacent spectrogram components, caused by the short-time Fourier transform. The share masks M'_n can therefore become blurred, so that their {0,1} values are not preserved, and synchronous addition of the share voices may fail to improve the S/N ratio. To separate the 1s from the 0s more clearly, the share voices may instead be synchronously multiplied.
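A toy calculation shows why multiplication can separate blurred values more clearly than addition. The values below are hypothetical: 0.9 where the confidential voice is present and 0.3 where it is not, instead of 1 and 0. With t shares, the ratio between the two stays a/b under addition but grows to (a/b)^t under multiplication. This illustrates contrast only and is not the patent's exact decoding rule.

```python
import math


def contrast(active, inactive, combine):
    """Ratio of combined share values in 'active' versus 'inactive' bins."""
    return combine(active) / combine(inactive)


# Hypothetical blurred share-mask values for t = 3 shares.
active = [0.9, 0.9, 0.9]
inactive = [0.3, 0.3, 0.3]

add_ratio = contrast(active, inactive, sum)        # about 3
mul_ratio = contrast(active, inactive, math.prod)  # about 27
```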

As described above, the confidential voice transmission system 10 of this embodiment applies visual cryptography (VCS) to the transmission of a confidential voice. It generates n share voices by embedding the share masks M'_n, derived from the mask M, into the spectrogram of the confidential voice. A user who collects t or more of the n share voices can restore the confidential voice; one who collects fewer than t cannot. The system also exploits the fact that separating a mixture of several sound sources from a monaural observation with no spatial information is hard. Encryption requires only an FFT analyzer that performs the short-time Fourier transform. Decryption requires only synchronizing t or more share voices and adding or multiplying them, so no complex decoder is needed.

In this way, the confidential voice transmission system 10 of this embodiment can distribute voice with high confidentiality and restore it easily.

(Second embodiment)
The confidential voice transmission system 10 of this embodiment determines the similarity between the confidential voice restored by the confidential voice receiving device 14 and the original message that was distributed. This makes it possible to detect whether the user has obtained share masks M'_n forming a correct qualified set. It can therefore detect, for example, that the user was at a designated place during a certain period of time.

FIG. 5 is a functional block diagram of the confidential voice receiving device 14 of this embodiment. It includes a similarity determination unit 36 in addition to the components of the confidential voice receiving device 14 of the first embodiment.

The similarity determination unit 36 determines the similarity between the voice restored by the voice restoration unit 34 and the message that constitutes the confidential voice. The confidential voice receiving device 14 obtains this message in advance as digital data from a server (not shown) or the like and stores it in a storage unit, in a form the user cannot perceive. To judge similarity, the similarity determination unit 36 uses, for example, metrics from the evaluation of sound source separation, such as the S/N ratio or the signal-to-distortion ratio between the restored voice and the message, or a correlation coefficient.
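Two of the metrics named above, the correlation coefficient and the S/N ratio, can be sketched as follows. The `is_similar` decision rule and its 0.8 threshold are hypothetical, since the patent does not specify a threshold. The sketch also assumes that the restored voice and the stored message are already time-aligned and of equal length.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def snr_db(reference, estimate):
    """S/N ratio of `estimate` with respect to `reference`, in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return float("inf") if error == 0 else 10 * math.log10(signal / error)


def is_similar(reference, restored, min_corr=0.8):
    """Hypothetical rule: accept when the correlation reaches min_corr."""
    return correlation(reference, restored) >= min_corr
```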

(Third embodiment)
In the first embodiment, the mixed voice is masked with the share masks M'_n and output from the speaker 16. There, the redundancy of the short-time Fourier transform may reduce the accuracy with which the confidential voice is restored. Human speech and musical-instrument sounds, for example, have few active time-frequency bins, so the sound signals to be transmitted are highly sparse. A mask M generated from such a sparse signal may be estimated less accurately.

In this embodiment, therefore, the share masks M'_n and the mixed voice are transmitted separately. The confidential voice transmission device 12 uses as a carrier a noise, such as white noise, whose intensity is at least a predetermined value over a predetermined frequency range. It masks this noise with the share masks M'_n and outputs the result from the speaker 16 as share voices. When white noise is used, all frequency components are active on a time average, so the spectrogram of a share voice matches the share mask M'_n in a statistical sense. If the confidential voice is limited to, say, 500 Hz to 1 kHz, white noise band-limited to that range can be used. If the confidential voice contains few high-frequency components, pink noise may be used instead.
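The claim that time-averaged white noise activates all frequency components can be checked numerically. The sketch averages the magnitude spectra of many Gaussian white-noise frames and then applies a share-mask frame pattern to the averaged spectrum. The frame length, frame count, seed and mask pattern are hypothetical. The DC and Nyquist bins are excluded from the flatness check because they follow a different (real-valued) distribution.

```python
import cmath
import math
import random


def magnitude_spectrum(frame):
    """One-sided DFT magnitudes |X(f)|, f = 0 .. N/2 (naive DFT)."""
    n = len(frame)
    return [abs(sum(frame[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n)))
            for f in range(n // 2 + 1)]


def average_noise_spectrum(n_fft, n_frames, rng):
    """Time-averaged magnitude spectrum of Gaussian white noise."""
    acc = [0.0] * (n_fft // 2 + 1)
    for _ in range(n_frames):
        frame = [rng.gauss(0.0, 1.0) for _ in range(n_fft)]
        acc = [a + m for a, m in zip(acc, magnitude_spectrum(frame))]
    return [a / n_frames for a in acc]


def masked_carrier(avg_spectrum, share_mask_row):
    """Expected share-voice spectrum for one frame of a share mask."""
    return [m * a for m, a in zip(share_mask_row, avg_spectrum)]
```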

FIG. 6 is a functional block diagram of the confidential voice transmission device 12 of this embodiment. In addition to the Fourier transform unit 20, the mask generation unit 22, the share mask generation unit 24, the masking unit 26, the inverse Fourier transform unit 28, and the speaker 16, it includes a noise generation unit 40 and a Fourier transform unit 42.

The mask generation unit 22 works on the spectrogram obtained by short-time Fourier-transforming the confidential voice. For each frequency bin, it checks whether the frequency component of the confidential voice exceeds that of the interfering sounds by at least the threshold θ, and generates a mask M from the result.

Based on the mask M, the share mask generation unit 24 generates n mutually different share masks M'_n (2 ≤ t ≤ n), such that superimposing t or more of them restores the confidential voice.

The noise generation unit 40 generates and outputs noise such as white noise.

The Fourier transform unit 42 applies a short-time Fourier transform to the noise output from the noise generation unit 40 and outputs the result.

The masking unit 26 masks the spectrogram obtained by short-time Fourier-transforming the noise with each of the n share masks M'_n, generating n mutually different share voices.

The inverse Fourier transform unit 28 applies an inverse Fourier transform to the n share voices generated by the masking unit 26, and the resulting n share voices are output from the speaker 16.

Thus, when the noise is white noise, the n share voices output from the speaker 16 are generated by masking the noise spectrogram with the n share masks M'_n and then applying an inverse Fourier transform. Pink noise or similar noise can be handled in the same way as white noise by applying a filter matched to the noise characteristics. The speaker 16 outputs the mixed voice separately from the share voices.

The speaker 16 of the confidential voice transmission device 12 therefore plays two roles:
(1) a share mask output unit, which outputs the share masks M'_n as sound (n share voices) by using noise as a carrier;
(2) a voice output unit, which outputs the mixed voice separately from the share voices.
The share voices and the mixed voice may, for example, be output from a single speaker 16 at different times, or output separately from a plurality of speakers 16 located at different places.

FIG. 7 is a functional block diagram of the confidential voice receiving device 14 of this embodiment. The device restores the mask M from t or more of the n share voices output from the speaker 16. It then restores the confidential voice by combining the restored mask M with the mixed voice output from the speaker 16.

As shown in FIG. 7, the confidential voice receiving device 14 of this embodiment includes a microphone 30, a synchronization unit 32, a mask restoration unit 50, and a masking unit 52.

The mask restoration unit 50 restores the mask M by superimposing t or more share voices that were picked up by the microphone 30 and synchronized by the synchronization unit 32. If the confidential voice receiving device 14 has collected fewer than t share voices, the mask restoration unit 50 cannot restore the mask M.

The masking unit 52 extracts the confidential voice by masking the mixed voice picked up by the microphone 30 with the mask M restored by the mask restoration unit 50.
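The receiver steps of this embodiment, restoring M from the share masks and then masking the separately received mixture, can be sketched at the mask level. The sketch assumes share masks built with the textbook (2,2) VCS and pixel expansion m = 2 along frequency. Under that assumption a fully active block after OR-stacking decodes to 1, and anything else decodes to 0.

```python
def restore_mask(shares, m=2):
    """OR-stack share masks (each T x m*F) and decode every block of m sub-bins:
    fully active block -> 1, otherwise 0 (assumed (2,2) scheme, m = 2)."""
    stacked = [[int(any(v)) for v in zip(*rows)] for rows in zip(*shares)]
    return [[1 if sum(row[m * f:m * f + m]) == m else 0 for f in range(len(row) // m)]
            for row in stacked]


def extract_confidential(mixture, mask):
    """Apply the restored mask M to the mixed-voice spectrogram (T x F)."""
    return [[m * x for m, x in zip(mrow, xrow)] for mrow, xrow in zip(mask, mixture)]
```

With only one share, every block has a single active sub-bin, so the decoded mask is all zeros and nothing is extracted.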

In this way, the confidential voice transmission system 10 of this embodiment outputs the share masks M'_n as share voices. It extracts the confidential voice by masking the mixed voice with the mask recovered from t or more share voices. The confidential voice can thus be restored accurately from the distributed share voices and mixed voice.

(Fourth embodiment)
In this embodiment, an extended visual cryptography scheme (EVCS), proposed in Reference 3 below, is applied to confidential voice transmission. In EVCS, each share image on its own displays a cover image, while collecting t share images reveals the secret image. Because share images generated with EVCS display cover images, a person who sees a share image is unlikely to notice that it encrypts a secret image.
Reference 3: G. Ateniese, C. Blundo, A. De Santis, and D. R. Stinson, "Extended capabilities for visual cryptography," Theoretical Computer Science, vol. 250, no. 1, pp. 143–161, 2001.

(Basis matrices of EVCS)
Next, the EVCS basis matrices X_0 and X_1 are described. EVCS distinguishes cases not only by the secret-image pixel value but also by the binary pixel values of the cover images C_1, …, C_n. Consider a collection of 2^n pairs of n × m Boolean matrices (X_0^{C_1,…,C_n}, X_1^{C_1,…,C_n}). Such a collection is called a set of basis matrices realizing the access structure Γ if it satisfies the following three conditions.

(Condition 1) Recoverability of the secret image:
For every S ∈ Γ*_Q there exist integers l_S and h_S with 0 ≤ l_S ≤ h_S ≤ m such that Equation 5 holds for all C_1, …, C_n.

(Condition 2) Security:
For every S ∈ Γ*_F and all C_1, …, C_n, the submatrices X_0^{C_1,…,C_n}[S] and X_1^{C_1,…,C_n}[S] (formed by the rows indexed by S) can be made equal by a suitable permutation of their columns.

(Condition 3) Visibility of the cover images:
For every j = 1, …, n there exist integers l_j and h_j with 0 ≤ l_j < h_j ≤ m such that Equation 6 holds for all values of the cover pixels C_1, …, C_n other than C_j.
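Equations 5 and 6 are not reproduced in this text. The sketch below therefore checks the three conditions in their usual stacking-weight form, using the Hamming weight of the OR of the selected rows, for a small case: n = 2 shares, threshold t = 2, pixel expansion m = 4. The basis matrices in `BASIS` were hand-built for this illustration and are not taken from the patent or the cited references. Condition 1 is checked with a strict gap l_S < h_S so that the contrast is nonzero.

```python
from collections import Counter

# Hypothetical (2,2)-EVCS basis matrices, m = 4.
# Key: (c1, c2) cover pixels; value: (X0, X1) for secret pixel 0 / 1.
BASIS = {
    (0, 0): ([[1, 0, 0, 1], [1, 0, 1, 0]], [[1, 0, 0, 1], [0, 1, 1, 0]]),
    (0, 1): ([[1, 0, 0, 1], [1, 0, 1, 1]], [[1, 0, 0, 1], [0, 1, 1, 1]]),
    (1, 0): ([[1, 0, 1, 1], [1, 0, 0, 1]], [[0, 1, 1, 1], [1, 0, 0, 1]]),
    (1, 1): ([[1, 0, 1, 1], [1, 0, 1, 1]], [[1, 1, 0, 1], [1, 0, 1, 1]]),
}


def or_weight(rows):
    """Hamming weight of the Boolean OR of the given rows."""
    return sum(int(any(col)) for col in zip(*rows))


def check_evcs(basis):
    # Condition 1: stacking both shares, every X1 is heavier than every X0.
    l_s = max(or_weight(x0) for x0, _ in basis.values())
    h_s = min(or_weight(x1) for _, x1 in basis.values())
    cond1 = l_s < h_s
    # Condition 2: a single share row of X0 and of X1 has the same multiset of
    # columns, i.e. the rows are equal up to a column permutation.
    cond2 = all(Counter(x0[j]) == Counter(x1[j])
                for x0, x1 in basis.values() for j in range(2))
    # Condition 3: share j is heavier when its cover pixel is 1 than when it is 0.
    cond3 = True
    for j in range(2):
        white = [sum(mat[j]) for cov, pair in basis.items() if cov[j] == 0 for mat in pair]
        black = [sum(mat[j]) for cov, pair in basis.items() if cov[j] == 1 for mat in pair]
        cond3 = cond3 and max(white) < min(black)
    return cond1, cond2, cond3
```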

Methods for optimizing basis matrices X_0 and X_1 that satisfy the above three conditions are proposed in References 4 and 5 below.
Reference 4: S. J. Shyu, "Threshold visual cryptographic scheme with meaningful shares," IEEE Signal Processing Letters, vol. 21, no. 12, pp. 1521–1525, 2014.
Reference 5: K. Sekine and H. Koga, "Optimal basis matrices of a visual cryptography scheme with meaningful shares and analysis of its security," 2020 International Symposium on Information Theory and Its Applications (ISITA), pp. 422–426, 2020.

(Configuration of the fourth embodiment)
FIG. 8 is a functional block diagram of the confidential voice transmission device 12 of this embodiment. It includes the Fourier transform unit 20, the mask generation unit 22, the share mask generation unit 24, the masking unit 26, the inverse Fourier transform unit 28, and the speaker 16.

The Fourier transform unit 20 applies a short-time Fourier transform to a mixed voice obtained by mixing the confidential voice with j cover voices (1 ≤ j ≤ n). The j cover voices are mutually different.

The mask generation unit 22 works on the spectrogram obtained by short-time Fourier-transforming the mixed voice and generates two kinds of masks:
(1) the mask M, from the bins where the frequency component of the confidential voice exceeds that of the cover voices by at least the predetermined threshold θ;
(2) j cover masks M^C_j, from the bins where the frequency component of a cover voice exceeds those of the confidential voice and of the other cover voices by at least θ.

The share mask generation unit 24 generates n share masks M'_n based on the mask M and the j cover masks M^C_j generated by the mask generation unit 22.

The masking unit 26 masks the short-time-Fourier-transformed mixed voice with each of the n share masks M'_n, generating n mutually different share voices. The inverse Fourier transform unit 28 applies an inverse Fourier transform to these n share voices and has the speaker 16 output them.

The confidential voice receiving device 14 of this embodiment is the same as that of the first embodiment. It restores the confidential voice by superimposing t or more of the n share voices output from the confidential voice transmission device 12.

The Fourier transform unit 20 applies the short-time Fourier transform to the mixed voice of the confidential voice and the j cover voices. This yields the confidential-voice spectrum s(τ,f) and the cover-voice spectra κ_j(τ,f).

The mask generation unit 22 then sets the mask M to 1 in the time-frequency bins where the level |s| of the confidential voice exceeds the level |κ_j| of the cover voices by at least the threshold θ. It sets M to 0 in all other bins. This is expressed by Equation 7.

It also sets the cover mask M^C_j to 1 in the time-frequency bins where the level |κ_j| of cover voice j exceeds, by at least the threshold θ, both the levels |κ_j'| of the other cover voices and the level |s| of the confidential voice. It sets M^C_j to 0 in all other bins. This is expressed by Equation 8.
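Equations 7 and 8 are not reproduced here, so the sketch below again assumes that "exceeds by at least θ" means a decibel difference against the loudest competing source in each bin. For M, the competitors are all cover voices. For M^C_j, they are the confidential voice and the other cover voices. A bin in which no source dominates by θ is assigned to no mask. The function names are hypothetical.

```python
import math


def _exceeds(level, others, theta_db, eps=1e-12):
    """True if `level` beats the loudest of `others` by at least theta_db dB."""
    return 20 * math.log10((level + eps) / (max(others) + eps)) >= theta_db


def secret_and_cover_masks(s_mag, cover_mags, theta_db):
    """Return the mask M and one cover mask M^C_j per cover voice (all T x F)."""
    T, F = len(s_mag), len(s_mag[0])
    M = [[int(_exceeds(s_mag[t][f], [c[t][f] for c in cover_mags], theta_db))
          for f in range(F)] for t in range(T)]
    covers = []
    for j, kappa in enumerate(cover_mags):
        covers.append([[int(_exceeds(
            kappa[t][f],
            [s_mag[t][f]] + [c[t][f] for i, c in enumerate(cover_mags) if i != j],
            theta_db)) for f in range(F)] for t in range(T)])
    return M, covers
```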

Here, by setting c_j = M^C_j(τ,f), the cover mask M^C_j can be regarded as a cover image (binary image) in EVCS (extended visual cryptography scheme). That is, by representing the confidential voice as the mask M and each cover voice as the cover mask M^C_j, the share mask generation unit 24 can generate the share masks by the same process as EVCS.

The share mask generation unit 24 expands the mask M and the cover masks M^C_j by a factor of m by increasing the number of frequency bins and time frames, and generates the n share masks so that they satisfy the basis matrices of the EVCS.
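To make the pixel-expansion step concrete, here is a sketch of a (2,2)-EVCS with m = 4, where each time-frequency bin becomes an m₁ × m₂ = 2 × 2 block. The basis matrices are one valid choice, assumed for illustration and not taken from the patent. They satisfy the usual EVCS conditions: each share's block weight depends only on its own cover bit (2 for 0, 3 for 1), and the OR of the two blocks has weight 3 where M = 0 and 4 where M = 1. A 1 is read as "this bin passes the mask"; the names `BASIS` and `evcs_share_masks` are hypothetical.

```python
import random

# Illustrative (2,2)-EVCS basis matrices, pixel expansion m = 4 (an assumption).
# Key: (cover bit of share 1, cover bit of share 2, mask bit M) -> 2 x 4 matrix,
# one row per share. A 1 means "bin passes".
BASIS = {
    (0, 0, 0): [[1, 0, 0, 1], [1, 0, 1, 0]],
    (0, 0, 1): [[1, 0, 0, 1], [0, 1, 1, 0]],
    (0, 1, 0): [[1, 0, 0, 1], [1, 0, 1, 1]],
    (0, 1, 1): [[1, 0, 0, 1], [0, 1, 1, 1]],
    (1, 0, 0): [[1, 0, 1, 1], [1, 0, 0, 1]],
    (1, 0, 1): [[1, 0, 1, 1], [0, 1, 0, 1]],
    (1, 1, 0): [[1, 0, 1, 1], [1, 0, 1, 1]],
    (1, 1, 1): [[1, 0, 1, 1], [0, 1, 1, 1]],
}
M1, M2 = 2, 2  # expansion along time frames and frequency bins (m = M1 * M2 = 4)


def evcs_share_masks(mask, cover1, cover2, rng=random):
    """Expand T x F binary masks (lists of lists) into two (M1*T) x (M2*F) share masks."""
    T, F = len(mask), len(mask[0])
    shares = [[[0] * (F * M2) for _ in range(T * M1)] for _ in range(2)]
    for tau in range(T):
        for f in range(F):
            rows = BASIS[(cover1[tau][f], cover2[tau][f], mask[tau][f])]
            cols = list(range(M1 * M2))
            rng.shuffle(cols)  # one random column permutation, shared by both rows
            for share, row in zip(shares, rows):
                for k, c in enumerate(cols):
                    k1, k2 = divmod(k, M2)  # position inside the M1 x M2 block
                    share[M1 * tau + k1][M2 * f + k2] = row[c]
    return shares
```

The random column permutation is what hides the mask: a single share's block is a uniformly random pattern whose weight reveals only its cover bit.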

Masking the mixed voice in this way makes cover voice j audible as share voice j. In other words, a person listening to a single share voice hears a voice different from the confidential voice, so it is difficult to recognize that the confidential voice is encrypted in it. The confidential voice can nevertheless be obtained by synchronously adding t or more share voices. The confidential voice transmission system 10 of the present embodiment can therefore distribute share voices with higher confidentiality.

(Fifth embodiment)
The confidential voice transmission system 10 of this embodiment transmits the share masks M'_n and the mixed voice separately, as in the third embodiment.

The confidential voice transmission device 12 of this embodiment outputs each of the n share masks M'_n as digital data to the confidential voice receiving device 14. The confidential voice receiving device 14 restores the mask M from t or more of the share masks M'_n output from the transmission device 12; that is, it can compute the mask M directly from the acquired t share masks M'_n. The receiving device 14 then restores the confidential voice by masking the mixed voice output from the transmission device 12 with the restored mask M.

FIG. 9 is a functional block diagram of the confidential voice transmission device 12 of this embodiment, which includes a Fourier transform unit 20, a mask generation unit 22, a share mask generation unit 24, and a data transmission unit 60. The functions of the Fourier transform unit 20, the mask generation unit 22, and the share mask generation unit 24 are the same as in the confidential voice transmission device 12 of the third embodiment.

The data transmission unit 60 outputs each of the share masks M'_n generated by the share mask generation unit 24 as digital data. The share masks M'_n are output to a confidential voice receiving device 14 that can communicate with the confidential voice transmission device 12. The data transmission unit 60 may, for example, communicate with the receiving device 14 by short-range communication, and may output the share masks M'_n at different times or separately at different locations.

The speaker 16 outputs the mixed voice. For example, the mixed voice may be output from a single speaker 16 at different times, or separately from a plurality of speakers 16 located at different places. The time and place at which the mixed voice is output may be the same as, or different from, those at which the share masks M'_n are output.

FIG. 11 is a functional block diagram of the confidential voice receiving device 14 of this embodiment, which includes a microphone 30, a data receiving unit 70, a synchronization unit 72, a mask restoration unit 74, an inverse Fourier transform unit 76, and a masking unit 78.

The microphone 30 receives the mixed voice output from the speaker 16 of the confidential voice transmission device 12.

The data receiving unit 70 receives the share masks M'_n transmitted from the data transmission unit 60 of the confidential voice transmission device 12.

To restore the mask M from the n share masks M'_n received by the data receiving unit 70, the synchronization unit 72 performs synchronization processing that aligns the start points of the n share masks M'_n.

The mask restoration unit 74 restores the mask M by superimposing t or more share masks M'_n synchronized by the synchronization unit 72. If fewer than t share masks M'_n have been collected by the confidential voice receiving device 14, the mask restoration unit 74 cannot restore the mask M.

The inverse Fourier transform unit 76 applies the inverse Fourier transform to the mask M restored by the mask restoration unit 74.

The masking unit 78 obtains the confidential voice by masking the mixed voice input to the microphone 30 with the mask M restored by the mask restoration unit 74.

Here, if the pixel expansion ratios m₁ and m₂ are known, the mask restoration unit 74 may further improve the estimation accuracy of the mask M by summing the m₁ × m₂ mask components {0, 1} of each expanded block and setting the result to 1 if the sum exceeds a predetermined threshold and to 0 otherwise.

That is, the (τ,f) component of the original mask M is expanded to the pixels (m₁(τ-1)+k₁, m₂(f-1)+k₂) of the share masks M'_n, where 1 ≤ k₁ ≤ m₁ and 1 ≤ k₂ ≤ m₂. The pixel-expanded mask M is therefore estimated by adding or multiplying the t share masks M'_n, and M'_est is generated by Equation 9 below.
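Equation 9 itself is not reproduced in this text. One reading consistent with the definitions in this paragraph and the next is given below. Here \tilde{M}' is a symbol introduced only for this reconstruction; it stands for the combination, by addition or multiplication, of the t synchronized share masks M'_n. χ is the indicator function described in the following paragraph.

```latex
M'_{\mathrm{est}}(\tau, f) = \chi\!\left( \sum_{k_1=1}^{m_1} \sum_{k_2=1}^{m_2}
  \tilde{M}'\bigl(m_1(\tau-1)+k_1,\; m_2(f-1)+k_2\bigr) \right)
```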

Here, χ(x) is an indicator function that returns 1 if x is greater than or equal to the threshold and 0 otherwise, so M'_est has the same size as the original mask M. The threshold can be chosen in several ways. For example, the output can be set to 1 only when all m₁ × m₂ components in the block are 1. Alternatively, a histogram over the whole mask can be built and the threshold set to the value at which a given proportion of the components becomes 1, or the threshold can simply be set to the median.
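The block-wise estimate can be sketched in Python with NumPy as follows. One substitution is made deliberately: the t share masks are combined with a logical OR, the standard stacking rule in visual cryptography, in place of the addition or multiplication mentioned in the text. With a (2,2)-VCS this choice lets a threshold of m₁·m₂ recover M exactly. The function name and signature are hypothetical, and indices are 0-based rather than the 1-based indices used above.

```python
import numpy as np


def estimate_mask(share_masks, m1, m2, threshold):
    """Sketch of M'_est: OR-combine t synchronized share masks, then pool each m1 x m2 block.

    share_masks: sequence of t binary arrays of shape (m1*T, m2*F).
    Returns a binary array of shape (T, F), i.e. the size of the original mask M.
    """
    stacked = np.bitwise_or.reduce(np.asarray(share_masks, dtype=np.uint8), axis=0)
    T, F = stacked.shape[0] // m1, stacked.shape[1] // m2
    # blocks[tau, k1, f, k2] == stacked[m1*tau + k1, m2*f + k2] (0-based indices)
    blocks = stacked[: T * m1, : F * m2].reshape(T, m1, F, m2)
    counts = blocks.sum(axis=(1, 3))  # number of ones in each block, 0..m1*m2
    return (counts >= threshold).astype(np.uint8)  # indicator chi(x >= threshold)
```

Usage: for a basic (2,2)-VCS with m₁ = 1 and m₂ = 2, calling `estimate_mask([s1, s2], 1, 2, threshold=2)` returns M. A single share yields all zeros, because each of its blocks holds exactly one 1.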

Although the present invention has been described using the above embodiments, the technical scope of the present invention is not limited to the scope described in those embodiments. Various changes or improvements can be made to the above embodiments without departing from the gist of the invention, and forms with such changes or improvements are also included in the technical scope of the present invention.

For example, the similarity determination unit 36 of the second embodiment has been described as applied to the confidential voice receiving device 14 of the first embodiment, but the present invention is not limited to this; the similarity determination unit 36 may also be applied to the confidential voice receiving devices 14 of the third to fifth embodiments.

In the first embodiment, the mask M and the share masks M'_n are generated from a mixed voice in which i interfering sounds are mixed with the confidential voice. The present invention is not limited to this: the mask M and the share masks M'_n may be generated from the confidential voice alone, without using the mixed voice. In this case, setting i = 0 generates the mask M from the confidential voice only.

Further, although the above embodiments output the mixed voice from the speaker 16, the present invention is not limited to this: the confidential voice transmission device 12 may output the mixed voice as digital data, and the confidential voice receiving device 14 may extract the confidential voice from the mixed voice received as digital data.

10 Confidential voice transmission system
12 Confidential voice transmission device
14 Confidential voice receiving device
16 Speaker (share voice output unit)
22 Mask generation unit
24 Share mask generation unit
26 Masking unit (share voice generation unit)

Claims

A confidential voice transmission device comprising:
a mask generation unit that generates a mask based on the presence or absence of frequency bins having frequency components equal to or greater than a predetermined threshold in a spectrogram obtained by performing a short-time Fourier transform on a confidential voice;
a share mask generation unit that generates, based on the mask, n mutually different share masks (2 ≤ t ≤ n) from which the confidential voice is restored by superimposing t or more of them;
a share voice generation unit that generates n mutually different share voices by masking, with the n share masks, a spectrogram obtained by performing a short-time Fourier transform on a mixed voice in which the confidential voice is mixed with other sounds; and
a share voice output unit that outputs the n share voices after an inverse Fourier transform.

The confidential voice transmission device according to claim 1, wherein
the other sounds are i interfering sounds, and
the mask generation unit generates the mask based on the presence or absence of frequency bins in which, in a spectrogram obtained by performing a short-time Fourier transform on a mixed sound in which the i interfering sounds are mixed with the confidential voice, the frequency components of the confidential voice exceed those of the interfering sounds by a predetermined threshold or more.

The confidential voice transmission device according to claim 1, wherein
the other sounds are n cover sounds,
the mask generation unit generates the mask based on the presence or absence of frequency bins in which, in a spectrogram obtained by performing a short-time Fourier transform on a mixed sound in which the n cover sounds are mixed with the confidential voice, the frequency components of the confidential voice exceed those of the cover sounds by a predetermined threshold or more, and generates j cover masks (1 ≤ j ≤ n) based on the presence or absence of frequency bins in which the frequency components of each cover sound exceed those of the other sounds by the predetermined threshold or more, and
the share mask generation unit generates the n share masks based on the mask and the j cover masks.

A confidential voice transmission system comprising:
the confidential voice transmission device according to any one of claims 1 to 3; and
a confidential voice receiving device that restores the confidential voice by superimposing t or more of the n share voices output from the confidential voice transmission device.

A confidential voice transmission device comprising:
a mask generation unit that generates a mask based on the presence or absence of frequency bins having frequency components equal to or greater than a predetermined threshold in a spectrogram obtained by performing a short-time Fourier transform on a confidential voice;
a share mask generation unit that generates, based on the mask, n mutually different share masks (2 ≤ t ≤ n) from which the confidential voice is restored by superimposing t or more of them;
a share mask output unit that outputs the n share masks; and
a voice output unit that outputs a mixed voice in which i interfering sounds are mixed with the confidential voice.

The confidential voice transmission device according to claim 5, wherein the mask generation unit generates the mask based on the presence or absence of frequency bins in which, in a spectrogram obtained by performing a short-time Fourier transform on the mixed voice in which the i interfering sounds are mixed with the confidential voice, the frequency components of the confidential voice exceed those of the interfering sounds by a predetermined threshold or more.

The confidential voice transmission device according to claim 5, wherein the share mask output unit masks, with the n share masks, a spectrogram obtained by performing a short-time Fourier transform on noise whose intensity is equal to or greater than a predetermined value in a predetermined frequency range, and outputs the results, after an inverse Fourier transform, as n share voices.

The confidential voice transmission device according to claim 5, wherein the share mask output unit outputs each of the n share masks as digital data.

A confidential voice transmission system comprising:
the confidential voice transmission device according to any one of claims 5 to 8; and
a confidential voice receiving device that restores the mask from t or more of the n share masks output from the share mask output unit, and restores the confidential voice by masking the mixed voice output from the voice output unit with the restored mask.

The confidential voice transmission device according to any one of claims 1 to 3 and 5 to 8, wherein the mask generation unit generates the mask by setting frequency bins at or above the threshold to 1 and frequency bins below the threshold to 0.

The confidential voice transmission device according to any one of claims 1 to 3, 5 to 8, and 10, wherein the share mask generation unit expands the mask by a factor of m by increasing the number of frequency bins and time frames, and generates the n share masks so that they satisfy the basis matrices of a VCS (visual cryptography scheme).

The confidential voice transmission system according to claim 4, wherein the confidential voice receiving device includes a similarity determination unit that determines the similarity between the restored voice and the confidential voice.

A confidential voice transmission method comprising:
a first step of generating a mask based on the presence or absence of frequency bins having frequency components equal to or greater than a predetermined threshold in a spectrogram obtained by performing a short-time Fourier transform on a confidential voice;
a second step of generating, based on the mask, n mutually different share masks (2 ≤ t ≤ n) from which the confidential voice is restored by superimposing t or more of them;
a third step of generating n mutually different share voices by masking, with the n share masks, a spectrogram obtained by performing a short-time Fourier transform on a mixed voice in which the confidential voice is mixed with other sounds;
a fourth step of outputting the n share voices after an inverse Fourier transform; and
a fifth step of restoring the confidential voice by superimposing t or more of the n output share voices.

A confidential voice transmission method comprising:
a first step of generating a mask based on the presence or absence of frequency bins having frequency components equal to or greater than a predetermined threshold in a spectrogram obtained by performing a short-time Fourier transform on a confidential voice;
a second step of generating, based on the mask, n mutually different share masks (2 ≤ t ≤ n) from which the confidential voice is restored by superimposing t or more of them;
a third step of outputting the n share masks and outputting a mixed voice in which i interfering sounds are mixed with the confidential voice; and
a fourth step of restoring the mask from t or more of the n output share masks and restoring the confidential voice by masking the mixed voice with the restored mask.

A confidential voice transmission program for causing a computer to execute:
a first step of generating a mask based on the presence or absence of frequency bins having frequency components equal to or greater than a predetermined threshold in a spectrogram obtained by performing a short-time Fourier transform on a confidential voice;
a second step of generating, based on the mask, n mutually different share masks (2 ≤ t ≤ n) from which the confidential voice is restored by superimposing t or more of them;
a third step of generating n mutually different share voices by masking, with the n share masks, a spectrogram obtained by performing a short-time Fourier transform on a mixed voice in which the confidential voice is mixed with other sounds;
a fourth step of outputting the n share voices after an inverse Fourier transform; and
a fifth step of restoring the confidential voice by superimposing t or more of the n output share voices.

A confidential voice transmission program for causing a computer to execute:
a first step of generating a mask based on the presence or absence of frequency bins having frequency components equal to or greater than a predetermined threshold in a spectrogram obtained by performing a short-time Fourier transform on a confidential voice;
a second step of generating, based on the mask, n mutually different share masks (2 ≤ t ≤ n) from which the confidential voice is restored by superimposing t or more of them;
a third step of outputting the n share masks and outputting a mixed voice in which i interfering sounds are mixed with the confidential voice; and
a fourth step of restoring the mask from t or more of the n output share masks and restoring the confidential voice by masking the mixed voice with the restored mask.