JP2022069766A

JP2022069766A - Bone conduction microphone, method for enhancing voice of bone conduction microphone, and voice enhancement program

Info

Publication number: JP2022069766A
Application number: JP2020178602A
Authority: JP
Inventors: マルダンマムティミン; Maldan Mamthimin
Original assignee: Tadano Ltd
Current assignee: Tadano Ltd
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2022-05-12

Abstract

To provide a bone conduction microphone, a voice enhancement method for a bone conduction microphone, and a voice enhancement program that can obtain clearer voice with a simple process. SOLUTION: A bone conduction microphone comprises: a vibration detecting element that contacts a human body and detects vocal cord vibration; a converter 23 that converts data of the vocal cord vibration detected by the vibration detecting element into a first utterance code and a first sound quality code; a storage device 21 that stores a second sound quality code obtained when a second converter, which converts voice data of a human voice produced by air vibration caused by the vocal cord vibration into a second utterance code and the second sound quality code, converts the voice data; and a generator 24 that generates voice-enhanced data in which the voice is enhanced, based on the first utterance code converted by the converter 23 and the second sound quality code stored in the storage device 21. SELECTED DRAWING: Figure 2

Description

The present invention relates to a bone conduction microphone, a speech enhancement method for a bone conduction microphone, and a speech enhancement program.

A bone conduction microphone is a device that detects vocal cord vibration conducted through bone. However, the high-frequency components of vocal cord vibration are attenuated as it travels through bone, so the sound detected by a bone conduction microphone is often unclear. To obtain clearer sound, speech enhancement devices have been developed that emphasize the speech component in the audio data detected by a bone conduction microphone.

For example, Patent Document 1 discloses a speech enhancement device comprising: a discriminating means that analyzes voice data detected by a bone conduction microphone and determines whether it is voiced or unvoiced sound; a first correction means that corrects voice data determined to be voiced sound to generate first air conduction voice data; a second correction means that corrects voice data determined to be unvoiced sound to generate second air conduction voice data; and an output generation means that combines the generated first and second air conduction voice data to generate output data.

Patent Document 1: Japanese Unexamined Patent Publication No. 2012-208177

In the speech enhancement device described in Patent Document 1, the sound from the bone conduction microphone is analyzed and divided into voiced and unvoiced sound, so the processing of the voice data is complicated.

The present invention has been made to solve the above problem, and its object is to provide a bone conduction microphone, a speech enhancement method for a bone conduction microphone, and a speech enhancement program capable of obtaining clearer speech by simple processing.

To achieve the above object, a bone conduction microphone according to a first aspect of the present invention comprises:
a vibration detection element that contacts a human body and detects vocal cord vibration;
a first converter that converts the vocal cord vibration data detected by the vibration detection element into a first utterance code and a first sound quality code;
a storage device that stores a second sound quality code obtained when a second converter, which converts voice data of a human voice produced by air vibrating due to the vocal cord vibration into a second utterance code and the second sound quality code, converts the voice data; and
a generator that generates speech enhancement data in which speech is emphasized, based on the first utterance code converted by the first converter and the second sound quality code stored in the storage device.

The first converter may be a first encoder that forms a first autoencoder when combined with a first decoder that restores the vocal cord vibration data from the first utterance code and the first sound quality code.

The second converter may be a second encoder that forms a second autoencoder when combined with a second decoder that restores the voice data from the second utterance code and the second sound quality code.

A speech enhancement method for a bone conduction microphone according to a second aspect of the present invention comprises:
a conversion step of converting vocal cord vibration data, acquired from a vibration detection element of a bone conduction microphone that contacts a human body and detects vocal cord vibration, into a first utterance code and a first sound quality code; and
a generation step of reading a second sound quality code from a storage device and generating, based on the read second sound quality code and the first utterance code converted in the conversion step, speech enhancement data in which the speech of the voice data is emphasized, the storage device storing the second sound quality code obtained when a converter, which converts voice data of a human voice produced by air vibrating due to the vocal cord vibration into a second utterance code and the second sound quality code, converts the voice data.

A speech enhancement program according to a third aspect of the present invention is a speech enhancement program for a bone conduction microphone that comprises:
a vibration detection element that contacts a human body and detects vocal cord vibration; and
a storage device that stores a second sound quality code obtained when a converter, which converts voice data of a human voice produced by air vibrating due to the vocal cord vibration into a second utterance code and the second sound quality code, converts the voice data,
the program causing a computer to execute:
a conversion step of converting the vocal cord vibration data acquired from the vibration detection element into a first utterance code and a first sound quality code; and
a generation step of reading the second sound quality code from the storage device and generating speech enhancement data in which speech is emphasized, based on the read second sound quality code and the first utterance code converted in the conversion step.

According to the configuration of the present invention, the first converter converts the vocal cord vibration data detected by the vibration detection element into a first utterance code and a first sound quality code. The generator then generates speech enhancement data in which speech is emphasized, based on that first utterance code and on the second sound quality code stored in the storage device, which was obtained when the second converter converted voice data of a human voice. As a result, clear voice data is obtained from the vocal cord vibration data. In addition, no complicated preprocessing of the vocal cord vibration data is required, so the processing is simple.

FIG. 1 is a component configuration diagram of a bone conduction microphone according to an embodiment of the present invention.
FIG. 2 is a block diagram of the speech enhancement device included in the bone conduction microphone.
FIG. 3 is a data configuration diagram of the sound quality table stored in the storage device of the speech enhancement device.
FIG. 4 is a block diagram of a learning device that trains the encoder and decoder included in the speech enhancement device.
FIG. 5 is a block diagram of the learning model included in the learning device.
FIG. 6 is a flowchart of the learning process performed by the learning device.

Hereinafter, a bone conduction microphone, a speech enhancement method for a bone conduction microphone, and a speech enhancement program according to an embodiment of the present invention will be described in detail with reference to the drawings. In the drawings, identical or equivalent parts are given the same reference numerals.

To obtain clear speech, the bone conduction microphone according to the embodiment uses a trained neural network to generate speech enhancement data from the vocal cord vibration data detected by a vibration detection element. First, the configuration of the bone conduction microphone will be described with reference to FIGS. 1 to 3.

FIG. 1 is a component configuration diagram of a bone conduction microphone 100 according to an embodiment of the present invention. FIG. 2 is a block diagram of the speech enhancement device 20 included in the bone conduction microphone 100. FIG. 3 is a data configuration diagram of the sound quality table 212 stored in the storage device 21 of the speech enhancement device 20. For ease of understanding, FIG. 1 also shows a speaker 200 to which the bone conduction microphone 100 outputs.

As shown in FIG. 1, the bone conduction microphone 100 includes a vibration detection element 10 that detects vocal cord vibration, and a speech enhancement device 20 that generates speech enhancement data based on the vocal cord vibration data detected by the vibration detection element 10.

Although not shown, the vibration detection element 10 includes a case and a piezoelectric element housed in the case. When vibration propagates to the case, the piezoelectric element flexes and generates an electric potential. The vibration detection element 10 detects the vibration from that potential.

The case of the vibration detection element 10 is shaped so that it can contact the skin of a part of the human body, for example the top of the head, the side of the head, the pharynx, or the nasal cavity region. The vibration detection element 10 is worn with the case in contact with that part of the body. In this state, it detects vocal cord vibration propagating through part of the body, such as the skull, and transmits the detected vocal cord vibration data to the speech enhancement device 20.

The speech enhancement device 20 receives the vocal cord vibration data. To process that data, the speech enhancement device 20 includes a storage device 21 and a controller 22.

The storage device 21 has an EEPROM (Electrically Erasable Programmable Read-Only Memory), a flash memory, or the like. The storage device 21 stores a speech enhancement program 211 that generates speech enhancement data from vocal cord vibration data. It also stores a model database 213 that holds the parameters of the speech enhancement program 211.

The controller 22 includes a microcomputer having a CPU (Central Processing Unit) that performs arithmetic processing and memory including ROM (Read Only Memory) and RAM (Random Access Memory). The CPU performs various kinds of processing by reading programs stored in the ROM or the storage device 21 into the RAM and executing them. For example, in the controller 22, the CPU executes the speech enhancement program 211 and reads the model database 213, thereby performing speech enhancement processing. To perform this processing, the controller 22 includes, as shown in FIG. 2, two processing blocks implemented in software: a converter 23 and a generator 24.

The converter 23 receives the vocal cord vibration data detected by the vibration detection element 10 shown in FIG. 1. As shown in FIG. 2, the converter 23 includes an encoder E1, which converts the vocal cord vibration data into an utterance code and a microphone sound quality code (hereinafter, utterance code C1 and sound quality code C2). The converter 23 transmits the utterance code C1 and the sound quality code C2 to the generator 24.

Meanwhile, the storage device 21 stores a sound quality table 212. In this table, as shown in FIG. 3, a sound quality code C4 is associated with the sound quality code C2 of the encoder E1. The sound quality code C4 is the one obtained when an encoder E2, which is separate from the encoder E1 and described later, converts voice data recorded with an ordinary microphone into an utterance code and a microphone sound quality code (hereinafter, utterance code C3 and sound quality code C4).

As shown in FIG. 2, the generator 24 receives the utterance code C1 and the sound quality code C2 from the converter 23. The generator 24 then reads the sound quality code C4 corresponding to the received sound quality code C2 from the sound quality table 212 in the storage device 21. The generator 24 includes a decoder D4. Based on the received utterance code C1 and the read sound quality code C4, the decoder D4 generates speech enhancement data in which the speech corresponding to the vocal cord vibration detected by the vibration detection element 10 is emphasized. Specifically, the decoder D4 generates the speech enhancement data by decoding the utterance code C1 together with the sound quality code C4.
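The data flow through the converter 23 and the generator 24 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the toy encoder and decoder, the array layout of the table, and the nearest-neighbour matching of C2 against the table are all assumptions (the text only says that the C4 "corresponding to" C2 is read).

```python
import numpy as np

def lookup_quality_code(c2, table_c2, table_c4):
    """Return the stored C4 paired with the table entry closest to c2.

    Nearest-neighbour matching is an assumption made for this sketch.
    """
    distances = np.linalg.norm(table_c2 - c2, axis=1)
    return table_c4[int(np.argmin(distances))]

def enhance(vibration_frame, encoder_e1, decoder_d4, table_c2, table_c4):
    """Converter 23 then generator 24: E1 -> table lookup -> D4."""
    c1, c2 = encoder_e1(vibration_frame)               # utterance code, sound quality code
    c4 = lookup_quality_code(c2, table_c2, table_c4)   # swap in air-microphone quality
    return decoder_d4(c1, c4)

# Toy stand-ins for the trained networks (hypothetical, for illustration only).
toy_encoder_e1 = lambda x: (x[:2], x[2:])
toy_decoder_d4 = lambda c1, c4: np.concatenate([c1, c4])

table_c2 = np.array([[0.0, 0.0], [1.0, 1.0]])
table_c4 = np.array([[10.0, 10.0], [20.0, 20.0]])

frame = np.array([0.5, -0.5, 0.9, 1.2])
enhanced = enhance(frame, toy_encoder_e1, toy_decoder_d4, table_c2, table_c4)
```

The utterance code C1 passes through unchanged; only the sound quality code is replaced before decoding.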

The generator 24 outputs the generated speech enhancement data to an external device, for example the speaker 200 illustrated in FIG. 1. Because the speech is emphasized in the speech enhancement data, the sound acquired by the bone conduction microphone 100 is clear and easy to hear.

The encoder E1 of the converter 23 and the decoder D4 of the generator 24 perform the conversion and generation processing using trained neural networks. Next, the learning device 300 that trains the neural networks of the encoder E1 and the decoder D4 will be described with reference to FIGS. 4 and 5.

FIG. 4 is a block diagram of the learning device 300 that trains the encoder E1 and the decoder D4 included in the speech enhancement device 20. FIG. 5 is a block diagram of the learning model 330 included in the learning device 300.

In the learning device 300, a CPU (not shown) reads the learning program 311 stored in the storage device 310 shown in FIG. 4 into RAM and executes it, whereby the learning device 300 performs the learning process. The learning device 300 thus includes a learning unit 320 and a learning model 330, both implemented in software.

The storage device 310 stores learning data 312. The learning unit 320 reads the learning data 312 from the storage device 310 and inputs it to the learning model 330.

As shown in FIG. 5, the learning model 330 combines encoders E1 and E2 with decoders D1 to D4. In the learning model 330, the encoder E1 and the decoder D1 form one network, and the encoder E2 and the decoder D2 form another. In addition, the decoders D3 and D4 are each connected to both the encoder E1 and the encoder E2, and each forms a further separate network.

Although not shown, the encoder E1 and the decoder D1 are built from a neural network model having an input layer, a hidden layer, and an output layer. The input and output layers have the same number of dimensions, and the hidden layer has fewer dimensions than either. The encoder E1 is the part of this model from the input layer to the hidden layer, and the decoder D1 is the part from the hidden layer to the output layer.
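The layer structure described above can be sketched in NumPy as follows. The concrete sizes, the tanh activation, the random weights, and in particular the idea of splitting the hidden layer into an utterance part and a sound quality part are assumptions for illustration; the text only states that the hidden layer is smaller than the equally sized input and output layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the source only requires hidden < input == output.
INPUT_DIM = 64     # one frame of vibration features (assumed)
UTTER_DIM = 8      # hidden units treated as utterance code C1 (assumed)
QUALITY_DIM = 4    # hidden units treated as sound quality code C2 (assumed)
HIDDEN_DIM = UTTER_DIM + QUALITY_DIM

W_enc = rng.normal(scale=0.1, size=(INPUT_DIM, HIDDEN_DIM))
W_dec = rng.normal(scale=0.1, size=(HIDDEN_DIM, INPUT_DIM))

def encode(frame):
    """Encoder E1: input layer -> hidden layer, returned as (C1, C2)."""
    hidden = np.tanh(frame @ W_enc)
    return hidden[:UTTER_DIM], hidden[UTTER_DIM:]

def decode(utterance_code, quality_code):
    """Decoder D1: hidden layer -> output layer with the input's dimensionality."""
    hidden = np.concatenate([utterance_code, quality_code])
    return hidden @ W_dec

frame = rng.normal(size=INPUT_DIM)
c1, c2 = encode(frame)
reconstruction = decode(c1, c2)
```

With untrained weights the reconstruction is meaningless; the point is only the bottleneck shape and where the encoder ends and the decoder begins.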

As shown in FIG. 4, the learning data 312 contains vocal cord vibration data A, recorded in advance using the vibration detection element 10, and voice data B, recorded in advance using an ordinary microphone.

Here, the voice data B is a recording of the voice produced by the vocal cord vibration at the time the vocal cord vibration data A was recorded. That is, the voice data B records the voice corresponding to the vocal cord vibration data A. In this specification, an ordinary microphone means a microphone that converts sound generated by air vibrating due to vocal cord vibration into an electrical signal, and is also called an air conduction microphone.

As shown in FIG. 5, the learning unit 320 inputs the vocal cord vibration data A from the learning data 312 to the encoder E1. It then compares the vocal cord vibration data A with the output of the decoder D1 and adjusts the weights between nodes in the neural network model so as to reduce the error between the two. As a result, the encoder E1 and the decoder D1 learn to act as an autoencoder, that is, they learn self-encoding.
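The weight adjustment described above can be illustrated with a small gradient-descent loop. This is a sketch under stated assumptions: a linear encoder and decoder, mean squared error as the comparison, plain gradient descent, and random stand-in data. The patent text does not specify the optimizer, the activation functions, or the error measure used here.

```python
import numpy as np

def train_autoencoder(data, hidden_dim, lr=0.1, epochs=300, seed=0):
    """Train a linear autoencoder by reducing the input/output error.

    Returns the encoder weights, decoder weights, and the loss per epoch.
    """
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    w_enc = rng.normal(scale=0.1, size=(n_features, hidden_dim))
    w_dec = rng.normal(scale=0.1, size=(hidden_dim, n_features))
    losses = []
    for _ in range(epochs):
        code = data @ w_enc              # encoder: input -> hidden
        recon = code @ w_dec             # decoder: hidden -> output
        err = recon - data               # compare output with the input itself
        losses.append(float(np.mean(err ** 2)))
        grad_recon = 2.0 * err / err.size          # d(MSE)/d(recon)
        grad_w_dec = code.T @ grad_recon
        grad_w_enc = data.T @ (grad_recon @ w_dec.T)
        w_enc -= lr * grad_w_enc         # adjust weights to shrink the error
        w_dec -= lr * grad_w_dec
    return w_enc, w_dec, losses

# Random stand-in for frames of vocal cord vibration data A (not real recordings).
frames_a = np.random.default_rng(1).normal(size=(200, 16))
w_enc, w_dec, losses = train_autoencoder(frames_a, hidden_dim=4)
```

The same loop would apply unchanged to the second network by passing frames of voice data B instead.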

The encoder E2 and the decoder D2 are built from a neural network model separate from the one described for the encoder E1 and the decoder D1. This second model has the same layer structure as the first.

The learning unit 320 inputs the voice data B from the learning data 312 to the encoder E2. It then compares the voice data B with the output of the decoder D2 and adjusts the weights between nodes in this second neural network model so as to reduce the error between the two. As a result, the encoder E2 and the decoder D2 learn self-encoding.

Once the encoder E1 with the decoder D1 and the encoder E2 with the decoder D2 have each learned self-encoding, the encoder E1 outputs the utterance code C1 and the sound quality code C2 that encode the vocal cord vibration data A, and the encoder E2 outputs the utterance code C3 and the sound quality code C4 that encode the voice data B.

The decoder D3 is built from a neural network part having the same layer structure as the decoders D1 and D2. The learning unit 320 inputs to the decoder D3 the sound quality code C2 output by the encoder E1 and the utterance code C3 output by the encoder E2. Here, the sound quality code C2 encodes the microphone sound quality of the vibration detection element 10 that recorded the vocal cord vibration data A. To obtain an output matching that microphone sound quality, the learning unit 320 compares the vocal cord vibration data A with the output of the decoder D3 and, based on the comparison, adjusts the weights between nodes of the neural network part to reduce the error between the two. In this way, the learning unit 320 trains the decoder D3 to output the vocal cord vibration data A.

The decoder D4 is built from another neural network part having the same layer structure as that of the decoder D3. The learning unit 320 inputs to the decoder D4 the utterance code C1 output by the encoder E1 and the sound quality code C4 output by the encoder E2. The sound quality code C4 encodes the microphone sound quality of the ordinary microphone that recorded the voice data B. To obtain an output matching that microphone sound quality, the learning unit 320 compares the voice data B with the output of the decoder D4 and adjusts the weights between nodes of the neural network part to reduce the error between the two. In this way, the learning unit 320 trains the decoder D4 to output the voice data B.
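The wiring of the four decoders described above can be summarized in a small lookup structure. The dictionary layout and function name are hypothetical; the pairings themselves are taken from the description (D1 and D2 reconstruct their own encoder's input, while D3 and D4 mix codes from both encoders).

```python
def decoder_training_plan():
    """Which codes each decoder receives and which data it is trained to output."""
    return {
        "D1": {"utterance": "C1", "quality": "C2", "target": "A"},  # autoencoder with E1
        "D2": {"utterance": "C3", "quality": "C4", "target": "B"},  # autoencoder with E2
        "D3": {"utterance": "C3", "quality": "C2", "target": "A"},  # voice content, bone-mic quality
        "D4": {"utterance": "C1", "quality": "C4", "target": "B"},  # bone content, air-mic quality
    }

plan = decoder_training_plan()
for name, spec in plan.items():
    print(f"{name}: ({spec['utterance']}, {spec['quality']}) -> {spec['target']}")
```

In every row the training target follows the sound quality code: C2 pairs with vocal cord vibration data A and C4 pairs with voice data B. This is why D4, fed C1 and C4, is the decoder used at run time.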

After training the encoders E1 and E2 and the decoders D1 to D4, that is, after training the learning model 330, the learning unit 320 stores parameters such as the weight coefficients of the encoder E1 and the decoder D4 of the trained learning model 330 in the storage device 21 as the model database 213 shown in FIG. 2. The learning unit 320 thereby supplies the speech enhancement device 20 with the database required for the converter 23 and the generator 24 to operate.

The learning unit 320 also inputs the learning data 312 into the trained learning model 330 once more. Using the sound quality code C2 output by the encoder E1 and the sound quality code C4 output by the encoder E2 at that time, it creates the sound quality table 212 shown in FIG. 3 and stores the table in the storage device 21. The learning unit 320 thereby supplies the speech enhancement device 20 with the table required for the generator 24 to operate. As a result, as described above, the speech enhancement device 20 generates speech enhancement data and converts the vibration detected by the vibration detection element 10 into clear, easy-to-hear speech.

Next, the learning method of the learning device 300 will be described in more detail with reference to FIG. 6. In the following description, although not shown, the learning device 300 is assumed to be a personal computer or a server (hereinafter, a server or the like). The learning program 311 and the learning data 312 are assumed to be stored in a storage device of the server or the like, with an icon for the learning program 311 displayed on a display device. The server or the like is also assumed to be connected via the Internet to the controller 22 of the speech enhancement device 20 in the bone conduction microphone 100.

FIG. 6 is a flowchart of the learning process performed by the learning device 300.

First, the user of the learning device 300 presses the icon to start the learning program 311. The CPU of the server or personal computer then executes the learning program 311, and the learning process flow begins.

When the learning process flow starts, the learning unit 320 first reads the learning data 312 from the storage device 310, thereby acquiring the learning data 312 (step S1).

To increase the variety of speech that the speech enhancement device 20 can enhance, the learning data 312 preferably contains vocal cord vibration data A and voice data B for a wide range of pronunciations. For example, it is desirable that, for most characters of a given language, the learning data 312 contains the vocal cord vibration data A and the voice data B recorded when each character is read aloud, stored per character.

Next, using the acquired learning data 312, the learning unit 320 trains the neural network of the encoder E1 and the decoder D1 and the neural network of the encoder E2 and the decoder D2 in the learning model 330 (step S2).

Specifically, the vocal cord vibration data A of the learning data 312 is input to the encoder E1, and the voice data B of the learning data 312 is input to the encoder E2. With the output of the decoder D1 denoted A^* and the output of the decoder D2 denoted B^*, the weights between nodes in the networks are adjusted until the cost function L_all given by Equations 1 to 3 converges to within a fixed value. This trains the network of the encoder E1 and the decoder D1 and the network of the encoder E2 and the decoder D2.
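Equations 1 to 3 themselves are not reproduced in this text. Purely as an illustration, a common choice for training two autoencoders jointly is a sum of squared reconstruction errors. The form below is an assumption, not the patent's formulas; the symbols L_A and L_B are introduced here only for the sketch, and B^* denotes the reconstruction of the voice data B by the second autoencoder.

```latex
\begin{aligned}
L_A &= \lVert A - A^{*} \rVert^{2} \\
L_B &= \lVert B - B^{*} \rVert^{2} \\
L_{all} &= L_A + L_B
\end{aligned}
```

Under this assumed form, L_all falls only when both reconstructions approach their inputs.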

Next, the learning unit 320 trains the entire learning model 330 using the learning data 312 (step S3).

Specifically, as in step S2, the vocal cord vibration data A of the learning data 312 is input to the encoder E1, and the voice data B of the learning data 312 is input to the encoder E2. With the output of the decoder D3 denoted A^** and the output of the decoder D4 denoted B^**, the weights between the nodes in the network are adjusted until the cost function L_all given by Equations 4 to 6 converges to within a fixed value. In this way, the entire learning model 330, including the decoders D3 and D4, is trained.

When training of the entire learning model 330 is complete, the learning unit 320 stores the parameters of the trained learning model 330 in the storage device 21 (step S4). Specifically, parameters of the networks of the encoder E1 and the decoder D4, such as the number of layers, the number of nodes, and the weight coefficients between nodes, are stored in the model database 213 of the storage device 21.

In addition, the learning unit 320 creates the sound quality table 212 using the trained learning model 330 and stores the sound quality table 212 in the storage device 21 (step S5).

Specifically, the learning unit 320 inputs the learning data 312 into the trained learning model 330 and, from the resulting outputs of the encoders E1 and E2, associates the sound quality code C4 output by the encoder E2 with the sound quality code C2 output by the encoder E1. For example, when the learning data 312 stores vocal cord vibration data A and voice data B for most of the characters of a specific language, the sound quality code C4 is associated with the sound quality code C2 for each of those characters. The learning unit 320 thereby creates the sound quality table 212 and stores it in the storage device 21.
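One way to picture step S5 is as a list of (C2, C4) pairs, one per training utterance. The sketch below assumes that layout; the text only says that C4 is associated with C2, so the list-of-tuples structure, the function name `build_sound_quality_table`, and the stand-in `toy_encoder` are hypothetical.

```python
def build_sound_quality_table(pairs, encoder_e1, encoder_e2):
    """Pair the quality code C2 (from E1) with the quality code C4 (from E2).

    pairs: iterable of (vibration_data_a, voice_data_b), e.g. one per character.
    Each encoder is assumed to return (utterance_code, sound_quality_code).
    """
    table = []
    for vibration_a, voice_b in pairs:
        _c1, c2 = encoder_e1(vibration_a)
        _c3, c4 = encoder_e2(voice_b)
        table.append((c2, c4))
    return table


def toy_encoder(samples):
    """Stand-in for a trained encoder (not the real network)."""
    utterance_code = (sum(samples) / len(samples),)
    quality_code = (min(samples), max(samples))
    return utterance_code, quality_code


# Two hypothetical training utterances (vibration data A, voice data B).
pairs = [([0.0, 0.2], [0.0, 1.0]), ([0.4, 0.6], [2.0, 3.0])]
table = build_sound_quality_table(pairs, toy_encoder, toy_encoder)
```

Only the sound quality codes are kept; the utterance codes C1 and C3 are discarded here because the table described in the text maps C2 to C4.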

With the above steps, training by the learning device 300 is complete.

After training by the learning device 300 is complete, when a power button (not shown) of the bone conduction microphone 100 is pressed and the bone conduction microphone 100 starts up, the controller 22 reads the model database 213 that was stored in the storage device 21 in step S4 and, based on it, builds the neural network models of the encoder E1 and the decoder D4. The training result of the learning device 300 is thereby reflected in the operation of the bone conduction microphone 100.
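Storing the parameters in step S4 and rebuilding the network at start-up can be sketched as a serialize-then-reconstruct round trip. The JSON layout, the function names, and the purely linear layers below are assumptions made for illustration; the text states only that the number of layers, the number of nodes, and the weight coefficients are stored.

```python
import json


def save_parameters(weights):
    """Serialize per-layer weight matrices (stand-in for model database 213)."""
    params = {
        "num_layers": len(weights),
        "nodes_per_layer": [len(layer) for layer in weights],
        "weights": weights,
    }
    return json.dumps(params)


def build_network(serialized):
    """Rebuild a forward function from stored parameters.

    Assumption: each layer is a plain matrix-vector product with no
    bias or activation, which keeps the sketch short.
    """
    params = json.loads(serialized)
    weights = params["weights"]

    def forward(x):
        for layer in weights:
            x = [sum(w * v for w, v in zip(row, x)) for row in layer]
        return x

    return forward


# Hypothetical two-layer network: 2 inputs -> 2 nodes -> 1 node.
stored = save_parameters([[[1, 0], [0, 2]], [[1, 1]]])
network = build_network(stored)
output = network([3, 4])
```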

Subsequently, when the vibration detection element 10 of the bone conduction microphone 100 detects vocal cord vibration, the controller 22 acquires the vocal cord vibration data from the vibration detection element 10 and converts it into the utterance code C1 and the sound quality code C2 using the encoder E1 of the neural network model built from the model database 213 (this step is also referred to as the conversion step).

The controller 22 reads, from the sound quality table 212 stored in the storage device 21 in step S5, the sound quality code C4 corresponding to the converted sound quality code C2, and inputs the utterance code C1 converted by the encoder E1 together with the read sound quality code C4 to the decoder D4, which decodes them. The controller 22 thereby generates speech enhancement data (this step is also referred to as the generation step). As a result, the bone conduction microphone 100 outputs enhanced speech that is easy to hear.
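The conversion step and generation step together amount to encode, look up, decode. The sketch below makes one assumption that the text does not state: the "corresponding" C4 is taken to be the table entry whose C2 is nearest, by Euclidean distance, to the C2 just produced. The function names and the stub encoder and decoder are hypothetical.

```python
import math


def lookup_c4(table, c2):
    """Return the C4 paired with the stored C2 closest to the given C2.

    Nearest-neighbour matching is an assumption; the text only says the
    C4 'corresponding to' C2 is read from the table.
    """
    nearest = min(table, key=lambda entry: math.dist(entry[0], c2))
    return nearest[1]


def enhance(vibration, encoder_e1, table, decoder_d4):
    """Conversion step (E1), table lookup, then generation step (D4)."""
    c1, c2 = encoder_e1(vibration)      # utterance code, sound quality code
    c4 = lookup_c4(table, c2)           # swap in the air-conducted quality code
    return decoder_d4(c1, c4)


# Stubs standing in for the trained networks.
def stub_encoder(vibration):
    return (0.5,), (0.1, 0.0)


def stub_decoder(c1, c4):
    return (c1, c4)


table = [((0.0, 0.0), (1.0,)), ((1.0, 1.0), (2.0,))]
enhanced = enhance([0.3, 0.7], stub_encoder, table, stub_decoder)
```

The utterance code C1 passes through unchanged, while the sound quality code is replaced; this mirrors the text, in which the decoder D4 receives C1 from the bone-conducted signal and C4 from the table.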

The encoders E1 and E2 described above are examples of, respectively, the first converter (or first encoder) and the second converter (or second encoder) referred to in this specification and the claims. The decoders D1 and D2 are examples of the first decoder and the second decoder. The autoencoder formed by the encoder E1 and the decoder D1 and the autoencoder formed by the encoder E2 and the decoder D2 are examples of the first autoencoder and the second autoencoder. Further, the utterance code C1, sound quality code C2, utterance code C3, and sound quality code C4 produced by the encoders E1 and E2 are examples of the first utterance code, first sound quality code, second utterance code, and second sound quality code.

In the above embodiment, the bone conduction microphone 100 is connected to the learning unit 320 and no ordinary microphone is connected; however, the learning unit 320 may be connected to an ordinary microphone in addition to the bone conduction microphone 100. In that case, it is preferable that, while the user reads text data aloud, the vibration detection element 10 of the bone conduction microphone 100 detects the user's vocal cord vibration and the ordinary microphone detects the voice at the same time. The learning unit 320 may then use the detected vocal cord vibration and voice data as the learning data 312 in step S1, preferably storing them in the storage device 310 as the learning data 312.

As described above, in the bone conduction microphone 100 according to the embodiment, the encoder E1 included in the converter 23 converts the vocal cord vibration data detected by the vibration detection element 10 into the utterance code C1 and the sound quality code C2, and the decoder D4 included in the generator 24 generates speech enhancement data based on the utterance code C1 converted by the encoder E1 and the sound quality code C4 that is stored in the sound quality table 212 of the storage device 21 and corresponds to the sound quality code C2 converted by the encoder E1. The bone conduction microphone 100 can therefore obtain clear, easy-to-hear speech. In addition, the vocal cord vibration data does not need complicated preprocessing, so the processing is simple.

The encoder E1 included in the converter 23 and the decoder D4 included in the generator 24 are trained by the learning device 300. Therefore, by training the encoder E1 and the decoder D4 with learning data 312 that stores the vocal cord vibration data and voice data of the user of the bone conduction microphone 100, speech enhancement data suited to that user's vocal cord vibration and voice can be generated.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. For example, in the embodiment the vibration detection element 10 includes a piezoelectric element, but the present invention is not limited to this. It is sufficient that the vibration detection element 10 can detect vocal cord vibration, and any element that can do so may be used. For example, an electromagnetic element or an electrostatic element may be used instead of the piezoelectric element.

In the above embodiment, the speaker 200 is given as an example of the external device to which the bone conduction microphone 100 is connected, but the present invention is not limited to this. It is sufficient that the bone conduction microphone 100 generates speech enhancement data in which speech is enhanced, and its connection destination is not limited. For example, the bone conduction microphone 100 may be connected to a controller in the cabin of a crane, an aerial work platform, or the like, and through that controller to a speaker 200 arranged in the cabin. It may also be connected to a bone conduction earphone. With such a configuration, a worker's voice can be made easier to hear even in an environment where loud work noise makes it difficult to hear.

In the above embodiment, the learning device 300 is a device separate from the bone conduction microphone 100. However, the present invention is not limited to this, and the bone conduction microphone 100 may include the learning device 300. For example, the controller 22 of the speech enhancement device 20 may include the learning device 300, that is, the learning unit 320 and the learning model 330. In this case, the controller 22 is preferably connected to an ordinary microphone. The speech enhancement device 20 is then preferably switchable between a learning mode and an operation mode, with the learning model 330 being trained in the learning mode on the vocal cord vibration data detected by the vibration detection element 10 and the voice data detected by the ordinary microphone. With such a configuration, the bone conduction microphone 100 can be tuned to the user's vocal cord vibration and voice.
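The switchable learning mode and operation mode described here could be organized as in the sketch below. It is only an illustration of the control flow: the class `SpeechEnhancer`, the `Mode` enum, and the choice to buffer training pairs in a list are hypothetical and not taken from the text.

```python
from enum import Enum


class Mode(Enum):
    LEARNING = "learning"
    OPERATION = "operation"


class SpeechEnhancer:
    """Hypothetical controller that either collects training pairs or enhances."""

    def __init__(self, enhance_fn):
        self.mode = Mode.OPERATION
        self.training_pairs = []
        self._enhance = enhance_fn

    def process(self, vibration, voice=None):
        if self.mode is Mode.LEARNING:
            # Learning mode needs both signals: the vibration detection
            # element's data and the ordinary microphone's voice data.
            if voice is None:
                raise ValueError("learning mode requires voice data")
            self.training_pairs.append((vibration, voice))
            return None
        # Operation mode: only the vocal cord vibration data is needed.
        return self._enhance(vibration)


enhancer = SpeechEnhancer(enhance_fn=lambda vibration: [2 * v for v in vibration])
enhancer.mode = Mode.LEARNING
enhancer.process([0.1, 0.2], voice=[0.3, 0.4])
enhancer.mode = Mode.OPERATION
result = enhancer.process([0.1, 0.2])
```

The doubling lambda is a placeholder for the real encode, look-up, and decode chain; the collected pairs would be handed to the training procedure before switching back to operation mode.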

10 ... vibration detection element, 20 ... speech enhancement device, 21 ... storage device, 22 ... controller, 23 ... converter, 24 ... generator, 100 ... bone conduction microphone, 200 ... speaker, 211 ... speech enhancement program, 212 ... sound quality table, 213 ... model database, 300 ... learning device, 310 ... storage device, 311 ... learning program, 312 ... learning data, 320 ... learning unit, 330 ... learning model, C1, C3 ... utterance codes, C2, C4 ... sound quality codes, D1-D4 ... decoders, E1, E2 ... encoders

Claims

A bone conduction microphone comprising:
a vibration detection element that contacts a human body and detects vocal cord vibration;
a first converter that converts vocal cord vibration data detected by the vibration detection element into a first utterance code and a first sound quality code;
a storage device that stores a second sound quality code obtained when a second converter, which converts voice data of a human voice produced by air vibration caused by the vocal cord vibration into a second utterance code and the second sound quality code, converts the voice data; and
a generator that generates speech enhancement data, in which speech is enhanced, based on the first utterance code converted by the first converter and the second sound quality code stored in the storage device.

The bone conduction microphone according to claim 1,
wherein the first converter is a first encoder that, in combination with a first decoder that restores the vocal cord vibration data from the first utterance code and the first sound quality code, forms a first autoencoder.

The bone conduction microphone according to claim 1 or 2,
wherein the second converter is a second encoder that, in combination with a second decoder that restores the voice data from the second utterance code and the second sound quality code, forms a second autoencoder.

A speech enhancement method for a bone conduction microphone, comprising:
a conversion step of converting vocal cord vibration data, acquired from a vibration detection element of the bone conduction microphone that contacts a human body and detects vocal cord vibration, into a first utterance code and a first sound quality code; and
a generation step of reading a second sound quality code from a storage device that stores the second sound quality code obtained when a converter, which converts voice data of a human voice produced by air vibration caused by the vocal cord vibration into a second utterance code and the second sound quality code, converts the voice data, and generating speech enhancement data, in which the speech of the voice data is enhanced, based on the read second sound quality code and the first utterance code converted in the conversion step.

A speech enhancement program for a bone conduction microphone that comprises:
a vibration detection element that contacts a human body and detects vocal cord vibration; and
a storage device that stores a second sound quality code obtained when a converter, which converts voice data of a human voice produced by air vibration caused by the vocal cord vibration into a second utterance code and the second sound quality code, converts the voice data,
the speech enhancement program causing a computer to execute:
a conversion step of converting vocal cord vibration data acquired from the vibration detection element into a first utterance code and a first sound quality code; and
a generation step of reading the second sound quality code from the storage device and generating speech enhancement data based on the read second sound quality code and the first utterance code converted in the conversion step.