JP7373739B2 - Speech-to-text conversion system and speech-to-text conversion device

Info

Publication number: JP7373739B2
Application number: JP2019103763A
Authority: JP
Inventors: 啓田坂; 克中尾; 浩国本; 賀津雄西郷
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2019-06-03
Filing date: 2019-06-03
Publication date: 2023-11-06
Anticipated expiration: 2039-06-03
Also published as: JP2020197629A

Description

The present disclosure relates to a speech-to-text conversion system and a speech-to-text conversion device.

Patent Document 1 proposes a voice correction device that reduces noise and generates an audio signal that is easy to hear. This voice correction device includes: an air conduction microphone that picks up air-conducted sound using vibration of the air; a bone conduction microphone that picks up bone-conducted sound using vibration of the user's bones; a calculation unit that calculates the ratio of the user's voice to noise in the air-conducted sound; a storage unit that stores a correction coefficient for matching the frequency spectrum of the bone-conducted sound to the frequency spectrum of the air-conducted sound obtained when the ratio is equal to or greater than a first threshold; a correction unit that corrects the bone-conducted sound using the correction coefficient; and a generation unit that generates an output signal from the corrected bone-conducted sound when the ratio falls below a second threshold.

Japanese Unexamined Patent Application Publication No. 2014-239346 (JP2014-239346A)

The present disclosure was devised in view of the conventional circumstances described above, and its object is to provide a speech-to-text conversion system and a speech-to-text conversion device that can recognize voice and convert it into text according to the type of connected microphone.

The present disclosure provides a speech-to-text conversion system in which a terminal device, to which a sound receiver that picks up voice is connected, and a server can communicate with each other. The terminal device transmits an audio signal of the voice picked up by the sound receiver to the server. The server determines, based on the ratio between a high frequency component and a low frequency component in the spectral characteristics of the audio signal received from the terminal device, whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum through the air; converts the bone conduction voice into air conduction voice when the voice is bone conduction voice; converts the air conduction voice into text information; and outputs the converted text information.

The present disclosure also provides a speech-to-text conversion device capable of communicating with a sound receiver that picks up voice, the device including: a voice discrimination unit that determines, based on the ratio between a high frequency component and a low frequency component in the spectral characteristics of the voice picked up by the sound receiver, whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum through the air; a voice conversion unit that converts the bone conduction voice into air conduction voice; a voice recognition unit that converts the air conduction voice into text information; and an output unit that outputs the converted text information.

According to the present disclosure, voice can be recognized and converted into text according to the type of connected microphone.

Diagram showing an example use case of the speech-to-text conversion system according to Embodiment 1
Block diagram showing an example of the internal configuration of the speech-to-text conversion system according to Embodiment 1
Diagram showing an example of use of the bone conduction microphone
Diagram showing an example of use of the air conduction microphone
Sequence diagram showing an example of the operation procedure of the speech-to-text conversion system according to Embodiment 1
Flowchart showing an example of the voice discrimination procedure of the speech-to-text conversion system according to Embodiment 1
Diagram showing voice recognition example 1 of the speech-to-text conversion system according to Embodiment 1
Diagram showing voice recognition example 2 of the speech-to-text conversion system according to Embodiment 1
Diagram showing an example of the speech-to-text conversion device

(Background leading to Embodiment 1)
Patent Document 1 proposes a voice correction device that corrects bone-conducted sound picked up by a bone conduction microphone based on the ratio of the user's voice to noise (SNR: Signal to Noise Ratio) in the picked-up air-conducted sound. When the ratio is equal to or greater than a first threshold, this voice correction device uses a correction coefficient (for example, the signal intensity obtained from the air conduction microphone divided by the signal intensity obtained from the bone conduction microphone) to match the frequency spectrum of the bone-conducted sound to the frequency spectrum of the air-conducted sound. The voice correction device repeats the correction until the ratio becomes smaller than a second threshold, and generates an output signal from the corrected bone-conducted sound. However, this voice correction device must pick up sound with the bone conduction microphone and the air conduction microphone at the same time, and it was difficult to correct sound picked up by only one of the microphones.

The following embodiments therefore describe examples of a speech-to-text conversion system and a speech-to-text conversion device that can recognize voice and convert it into text according to the type of connected microphone.

Hereinafter, embodiments that specifically disclose the configuration and operation of the speech-to-text conversion system and the speech-to-text conversion device according to the present disclosure will be described in detail with reference to the drawings as appropriate. Unnecessarily detailed description may be omitted. For example, detailed description of well-known matters and redundant description of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily redundant and to make it easier for those skilled in the art to understand. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.

(Embodiment 1)
FIG. 1 is a diagram showing an example use case of the speech-to-text conversion system 100 according to Embodiment 1. The speech-to-text conversion system 100 includes a sound receiver 1, a terminal device 2, and a server 3. Either a bone conduction microphone MC1 or an air conduction microphone MC2 is connected to the sound receiver 1.

The sound receiver 1 is a bone conduction headset connected to the terminal device 2 so as to be capable of wired communication, and includes a microphone connection terminal 11 and a speaker (not shown). The microphone connection terminal 11 is configured so that the bone conduction microphone MC1 and the air conduction microphone MC2 can be connected interchangeably. The sound receiver 1 transmits to the terminal device 2 the analog audio signal picked up by the connected microphone (for example, an analog audio signal converted from bone conduction voice picked up by the bone conduction microphone MC1, or from air conduction voice picked up by the air conduction microphone MC2). The speaker of the sound receiver 1 is a bone conduction speaker (not shown).

The bone conduction microphone MC1 is worn near the user's vocal cords and includes a piezoelectric element that picks up the vibration of the vocal cords (bone conduction voice). The bone conduction microphone MC1 generates an electric potential from the mechanical stress that accompanies the picked-up vocal cord vibration, and converts the potential into an audio signal (that is, an analog audio signal). The bone conduction microphone MC1 is not limited to the vicinity of the vocal cords; it may, for example, be worn on the zygomatic arch to pick up the vibration of nasal sound to which the vocal cord vibration has propagated.

The bone conduction microphone MC1 has a built-in amplifier (not shown) that amplifies the picked-up bone conduction voice (vibration). Because the gain of the analog audio signal produced by the bone conduction microphone MC1 is raised by this amplifier, the voltage drop during conversion to a digital signal is larger than with the air conduction microphone MC2. Therefore, by detecting the voltage value of the received analog or digital audio signal, the terminal device 2 and the server 3 can determine whether the audio signal was converted from bone conduction voice or from air conduction voice.

Because the bone conduction microphone MC1 is worn near the user's vocal cords, it has better noise resistance than the air conduction microphone MC2. The bone conduction microphone MC1 can therefore pick up the user's voice even at a construction site or under an elevated structure where noise of, for example, 80 to 90 dB occurs.

The air conduction microphone MC2 converts the user's air conduction voice, which propagates through the air, into an audio signal (that is, an analog audio signal). The air conduction microphone MC2 may be an omnidirectional microphone, a unidirectional microphone, or a phase directional microphone, or a combination of these, with each treated as a distinct type of microphone.

The bone conduction speaker (not shown) converts an audio signal into mechanical vibration and propagates the vibration through the user's skin and skull to the auditory nerve. That is, with an ordinary speaker the user hears sound transmitted by vibration of the air (air-conducted sound), whereas with a bone conduction speaker the user hears sound transmitted by vibration of the bones (bone-conducted sound). Sound transmitted by bone conduction through the bone conduction speaker is hardly affected by external noise. Accordingly, the sound receiver 1 equipped with the bone conduction speaker is less susceptible to external noise, which improves noise resistance. Furthermore, the sound receiver 1 equipped with the bone conduction microphone MC1 leaves the area around the mouth completely open. The sound receiver 1 therefore allows normal communication even when the user also wears, for example, a dust mask or gas mask.

The terminal device 2 is, for example, a smartphone, a tablet terminal, or a PC (Personal Computer), and is connected to the sound receiver 1 so as to be capable of wired communication. The terminal device 2 is also connected to the server 3 via a network NW1 so as to be capable of wireless communication. The terminal device 2 converts the analog audio signal received from the sound receiver 1 into a digital audio signal and transmits it to the server 3. The terminal device 2 also receives text information obtained by converting the analog audio signal into text, or an audio signal generated from that text information.

The network NW1 is a wireless network. The wireless network is, for example, a wireless LAN (Local Area Network), a wireless WAN (Wide Area Network), 4G (fourth generation mobile communication system), LTE (Long Term Evolution), LTE-Advanced, 5G (fifth generation mobile communication system), Wi-Fi (registered trademark), or WiGig (Wireless Gigabit).

The server 3 is connected to the terminal device 2 via the network NW1 so as to be capable of wireless communication. The server 3 converts the received digital audio signal into text information and transmits it to the terminal device 2. The server 3 may also convert the text information back into a digital audio signal and transmit it to the terminal device 2.

FIG. 2 is a block diagram showing an example of the internal configuration of the speech-to-text conversion system 100 according to Embodiment 1. The sound receiver 1 has already been described with reference to FIG. 1, so its detailed description is omitted here.

First, an example of the internal configuration of the terminal device 2 will be described. The terminal device 2 includes a communication unit 20, a processor 21, a memory 22, and an A/D (Analog-to-Digital) conversion unit 23.

The communication unit 20 is communicably connected to the server 3 via the network NW1. The communication unit 20 transmits the digital audio signal converted by the A/D conversion unit 23 to the server 3, and receives from the server 3 text information or a digital audio signal generated from the text information.

The processor 21 is configured using, for example, a CPU (Central Processing Unit) or an FPGA (Field Programmable Gate Array), and performs various kinds of processing and control in cooperation with the memory 22. Specifically, the processor 21 refers to the programs and data held in the memory 22 and executes the programs to realize the function of each unit. One such function is, for example, converting the analog audio signal received from the sound receiver 1 into a digital audio signal.

The memory 22 includes, for example, a RAM (Random Access Memory) used as a work memory when the processor 21 executes each process, and a ROM (Read Only Memory) that stores programs and data defining the operation of the processor 21. Data or information generated or acquired by the processor 21 is temporarily stored in the RAM. A program that defines the operation of the processor 21 is written in the ROM. The memory 22 also stores the digital audio signals transmitted to the server 3 and the text information received from the server 3.

The A/D conversion unit 23 converts the analog audio signal received from the sound receiver 1 into a digital audio signal and transmits the converted digital audio signal to the server 3 via the network NW1. The A/D conversion unit 23 also measures the voltage value after the voltage drop that occurs when the analog audio signal is received from the sound receiver 1. Information on the measured voltage value is transmitted to the server 3.

Next, an example of the internal configuration of the server 3 will be described. The server 3 includes a communication unit 30, a processor 31, a memory 32, a voice discrimination unit 33, a voice conversion unit 34, a voice recognition unit 35, an output unit 36, a storage unit 37, and a text-to-speech conversion unit 38. The text-to-speech conversion unit 38 is not an essential component; it may be omitted, or it may be provided in the terminal device 2 instead.

The communication unit 30 is communicably connected to the terminal device 2 via the network NW1. The communication unit 30 receives the digital audio signal from the terminal device 2, and transmits to the terminal device 2 text information or a digital audio signal generated from the text information.

The processor 31 is configured using, for example, a CPU (Central Processing Unit) or an FPGA (Field Programmable Gate Array), and performs various kinds of processing and control in cooperation with the memory 32. Specifically, the processor 31 refers to the programs and data held in the memory 32 and executes the programs to realize the function of each unit. These functions include, for example, determining whether a digital audio signal is bone conduction voice or air conduction voice, and converting a digital audio signal into text information based on learning data generated in advance.

The memory 32 includes, for example, a RAM (Random Access Memory) used as a work memory when the processor 31 executes each process, and a ROM (Read Only Memory) that stores programs and data defining the operation of the processor 31. Data or information generated or acquired by the processor 31 is temporarily stored in the RAM. A program that defines the operation of the processor 31 is written in the ROM. The memory 32 also stores the learning data, acoustic model, pronunciation dictionary, language model, recognition decoder, and the like.

The voice discrimination unit 33 receives the digital audio signal and the information on the dropped voltage value from the terminal device 2 and, based on the voltage value information, determines whether the voice underlying the digital audio signal is bone conduction voice or air conduction voice.

When the voice discrimination unit 33 determines that the digital audio signal was converted from bone conduction voice, it attaches an identifier to the signal. The identifier is, for example, an artificial sound in a frequency band different from that of the human voice, or a specific identification signal (for example, "110011"). When the voice discrimination unit 33 determines that the voltage value is equal to or less than a predetermined threshold and that the identifier is attached, it outputs to the voice conversion unit 34 a discrimination result indicating that the voice underlying the digital audio signal is bone conduction voice.
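The voltage-and-identifier check described above can be sketched as follows. This is only an illustration: the `ReceivedAudio` container, the field names, and the threshold of 1.2 used in the usage note are assumptions, and only the example identification signal "110011" comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Example identification signal taken from the description.
BONE_IDENTIFIER = "110011"


@dataclass
class ReceivedAudio:
    """Hypothetical container for what the server receives from the terminal."""
    voltage_value: float          # voltage value measured after the drop
    identifier: Optional[str]     # identifier attached to the signal, if any


def is_bone_conduction(audio: ReceivedAudio, voltage_threshold: float) -> bool:
    """Return True when the voltage is at or below the threshold AND the
    bone conduction identifier is attached, as in the described check."""
    return (audio.voltage_value <= voltage_threshold
            and audio.identifier == BONE_IDENTIFIER)
```

With an assumed threshold of 1.2, `is_bone_conduction(ReceivedAudio(0.9, "110011"), 1.2)` is true, while a signal without the identifier or with a voltage value above the threshold is treated as air conduction voice.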

The voice discrimination unit 33 may also calculate, from the spectral characteristics of the digital audio signal, the ratio of the signal level (dB) in a low frequency band (for example, 0 to 1000 Hz) to the signal level (dB) in a high frequency band (for example, 1001 to 8000 Hz), and determine from this ratio whether the voice underlying the digital audio signal is bone conduction voice or air conduction voice. Because bone conduction voice is picked up as vibration of the user's voice transmitted through the body, attenuation inside the body lowers its signal level in the high frequency band. Accordingly, the voice discrimination unit 33 determines that the voice is air conduction voice when the ratio is small, that is, when the high frequency band is only slightly attenuated relative to the low frequency band, and determines that it is bone conduction voice when the relative attenuation is large.
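A minimal sketch of this ratio-based discrimination follows, using the example band edges from the text. Several things here are assumptions rather than part of the disclosure: the ratio is expressed as a difference of band levels in dB, the sampling rate is taken to be 16 kHz, the decision threshold of 20 dB is arbitrary, and the function names are hypothetical. A naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def band_level_db(samples, sample_rate, f_lo, f_hi):
    """Total power (in dB) of the DFT bins whose frequency lies in [f_lo, f_hi]."""
    n = len(samples)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            # Naive DFT of a single bin; adequate for a short frame.
            x = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
            power += abs(x) ** 2
    # Small offset avoids log10(0) for an empty or silent band.
    return 10.0 * math.log10(power + 1e-12)


def classify_voice(samples, sample_rate=16000, threshold_db=20.0):
    """Return "bone" when the low band dominates the high band by more than
    threshold_db (assumed value), otherwise "air"."""
    low = band_level_db(samples, sample_rate, 0, 1000)
    high = band_level_db(samples, sample_rate, 1001, 8000)
    ratio_db = low - high  # low-to-high level ratio, expressed in dB
    return "bone" if ratio_db > threshold_db else "air"
```

A frame whose high band is strongly attenuated yields a large low-to-high ratio and is classified as bone conduction voice; a frame with comparable energy in both bands is classified as air conduction voice. In practice the threshold would have to be tuned on recorded data.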

Note that, with the ratio-based discrimination method described above, the spectral characteristics obtained vary with individual differences among users and with the environment. Therefore, when discrimination data generated from a plurality of spectral characteristics collected in advance is stored in the memory 32, the voice discrimination unit 33 may use the spectral characteristics together with this discrimination data to determine whether the voice underlying the digital audio signal is bone conduction voice or air conduction voice.

When the digital audio signal input from the voice discrimination unit 33 is bone conduction voice, the voice conversion unit 34 uses a learning model generated in advance to map the feature values of the bone conduction voice onto feature values of air conduction voice, thereby converting the digital audio signal of the bone conduction voice into a digital audio signal of air conduction voice. The learning model is stored in the memory 32 in advance. It is generated by extracting the feature values of bone conduction voice and of air conduction voice from voice picked up simultaneously by the bone conduction microphone MC1 and the air conduction microphone MC2, and mapping the bone conduction feature values onto the air conduction feature values. The voice feature values are information such as the fundamental frequency (voice pitch), the spectral characteristics of the audio signal (voice quality), and the aperiodic signal (hoarseness). The voice conversion unit 34 outputs the converted air conduction voice to the voice recognition unit 35. When the digital audio signal input from the voice discrimination unit 33 is air conduction voice, the voice conversion unit 34 outputs it to the voice recognition unit 35 unchanged.
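The idea of learning a mapping from paired recordings can be illustrated with a deliberately simple stand-in. The description does not specify the form of the learning model; the sketch below assumes an independent least-squares line per feature dimension, and the function names and feature layout are hypothetical.

```python
def fit_linear_mapping(bone_features, air_features):
    """Fit air ~= a * bone + b separately for each feature dimension, using
    feature vectors extracted from simultaneously recorded bone and air voice."""
    n = len(bone_features)
    dims = len(bone_features[0])
    params = []
    for d in range(dims):
        xs = [f[d] for f in bone_features]
        ys = [f[d] for f in air_features]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        a = cov / var if var else 1.0  # fall back to identity slope
        params.append((a, mean_y - a * mean_x))
    return params


def convert_bone_to_air(bone_feature, params):
    """Map one bone conduction feature vector to an estimated air conduction one."""
    return [a * x + b for x, (a, b) in zip(bone_feature, params)]
```

A real system would more plausibly use a richer model over spectral features, but the training data has the same shape: pairs of feature vectors taken from the two microphones at the same time.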

The voice recognition unit 35 is, for example, a speech recognition engine. Using an acoustic model built from a database of voice picked up by the air conduction microphone MC2, it determines the phonemes (for example, /a/, /k/) contained in the digital audio signal of the air conduction voice input from the voice conversion unit 34. The acoustic model is generated in advance by statistically processing the frequency and temporal characteristics of voice from several thousand speakers over several thousand hours, picked up by the air conduction microphone MC2, and is stored in the memory 32.

The voice recognition unit 35 also uses a language model to evaluate whether a character string or word string contained in the digital audio signal of the air conduction voice input from the voice conversion unit 34 is appropriate as language. The language model is generated by collecting text in each language and processing it statistically. Specifically, the language model formalizes the parts of speech and syntactic structure of sentences, the relationships between words or between documents, and the like through natural language processing, and is defined probabilistically from a statistical standpoint. The language model is, for example, an N-gram model, a hidden Markov model, or a maximum entropy model, and is stored in the memory 32.
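As a small illustration of how an N-gram model can score whether a word string is plausible, the sketch below trains a bigram model with add-one smoothing on a toy corpus. The corpus, the function names, and the choice of smoothing are assumptions made for the example, not details from the disclosure.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words  # sentence-start marker
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(words, unigrams, bigrams):
    """Log-probability of a word string under the bigram model,
    with add-one smoothing so unseen pairs get a small nonzero probability."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + words
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1)
                          / (unigrams[prev] + vocab_size))
    return total
```

A word order seen in the training text receives a higher log-probability than a scrambled one, which is the kind of signal the recognizer uses to prefer linguistically plausible output.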

The voice recognition unit 35 uses the pronunciation dictionary to link the phonemes determined with the acoustic model to the character strings or word strings evaluated with the language model, forming word utterances (for example, /sakura/). The recognition decoder then decodes the linguistic expression that fits best both acoustically and linguistically and converts it into text information. The pronunciation dictionary is data that links the acoustic model and the language model, and is stored in the memory 32. The recognition decoder uses the acoustic model, the pronunciation dictionary, and the language model to decode the audio signal into the linguistic expression corresponding to the content of the utterance. The voice recognition unit 35 outputs the text information produced by the recognition decoder to the output unit 36.

The process described above, in which bone conduction voice that has been converted into air conduction voice is converted into text information, is referred to as first voice recognition processing. The voice recognition unit 35 may additionally execute second voice recognition processing, which converts the bone conduction voice itself into text information. In this case, when the digital audio signal input from the voice discrimination unit 33 is determined to be bone conduction voice, the voice conversion unit 34 outputs both the bone conduction voice and the air conduction voice converted from it to the voice recognition unit 35.

The voice recognition unit 35 determines a reliability for the first text information and for the second text information, and outputs the text information with the higher reliability to the output unit 36. The reliability determination is based on word reliability. Specifically, the voice recognition unit 35 determines word reliability based on the acoustic model and the language model that the recognition decoder used to convert the audio signal into the first text information and the second text information. For each word in the text information, the voice recognition unit 35 checks whether there are other candidate words with similar scores. If no other candidate has a score close to that of the word, the reliability is judged to be high; the more candidates there are with comparable scores, the lower the reliability is judged to be. The voice recognition unit 35 outputs the text information with the higher reliability to the output unit 36.
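One way to picture this reliability comparison is sketched below. The scoring scheme is an assumption: each recognized word carries the decoder scores of its competing candidates, word reliability falls as more candidates land within a fixed margin of the best score, and a text's reliability is the mean over its words. The margin of 0.1 and all names are hypothetical.

```python
def word_confidence(candidate_scores, margin=0.1):
    """Reliability of one word: 1.0 when no rival candidate scores within
    `margin` of the best score, lower as the number of close rivals grows."""
    best = max(candidate_scores)
    close = sum(1 for s in candidate_scores if best - s <= margin)
    rivals = close - 1  # exclude the best candidate itself
    return 1.0 / (1 + rivals)


def text_confidence(words):
    """Mean word reliability; `words` is a list of (word, candidate_scores)."""
    return sum(word_confidence(scores) for _, scores in words) / len(words)


def select_text(first, second):
    """Return the word list of whichever recognition result is more reliable."""
    chosen = first if text_confidence(first) >= text_confidence(second) else second
    return [word for word, _ in chosen]
```

For example, if the first result recognizes "sakura" with a clear score gap and the second result hesitates between several near-tied candidates, `select_text` returns the first result.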

出力部３６は、音声認識部３５より入力されたテキスト情報を記憶部３７およびテキスト音声変換部３８に出力し、通信部３０に出力する。通信部３０は、ネットワークＮＷ１を介してテキスト情報を端末装置２に送信する。 The output unit 36 outputs the text information input from the speech recognition unit 35 to the storage unit 37 and the text-to-speech conversion unit 38, and outputs it to the communication unit 30. The communication unit 30 transmits text information to the terminal device 2 via the network NW1.

記憶部３７は、所謂ストレージであり、音声認識部３５によって変換されたテキスト情報を記憶する。また、記憶部３７は、端末装置ごと（つまり、ユーザごと）にテキスト情報を記憶してもよい。 The storage unit 37 is a so-called storage, and stores text information converted by the speech recognition unit 35. Furthermore, the storage unit 37 may store text information for each terminal device (that is, for each user).

テキスト音声変換部３８は、出力部３６より入力されたテキスト情報を音声信号に変換する。変換された音声信号は、ネットワークＮＷ１を介して端末装置２に送信され、再生される。これにより、ユーザは、発話内容が正しくテキスト情報に変換されたか否かを音声によって確認することができる。また、この音声信号は、一度テキスト情報に変換されたことでノイズレスの音声信号として生成されるため、より聞き取りやすい音声となる。したがって、ユーザは、ノイズが低減された音声を再生することができる。 The text-to-speech conversion unit 38 converts the text information input from the output unit 36 into an audio signal. The converted audio signal is transmitted to the terminal device 2 via the network NW1 and played back. Thereby, the user can confirm by voice whether or not the content of the utterance has been correctly converted into text information. Moreover, since this audio signal is generated as a noiseless audio signal once converted into text information, the audio becomes easier to hear. Therefore, the user can reproduce audio with reduced noise.

また、テキスト音声変換部３８は、ユーザの音声データに基づいて生成された音響モデルを用いた音声合成エンジンを有してもよい。これにより、テキスト音声変換部３８は、ユーザの音声に変換して音声信号を再生することができる。 Furthermore, the text-to-speech converter 38 may include a speech synthesis engine that uses an acoustic model generated based on the user's speech data. Thereby, the text-to-speech converter 38 can convert the audio signal into the user's voice and reproduce the audio signal.

以上により、音声テキスト変換システム１００は、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる。 As described above, the speech-to-text conversion system 100 can recognize speech and convert it into text depending on the type of connected microphone.

図３Ａおよび図３Ｂを参照して、受音器１の使用例を説明する。図３Ａは、骨導マイクロホンＭＣ１の使用例を示す図である。図３Ｂは、気導マイクロホンＭＣ２の使用例を示す図である。 An example of how the sound receiver 1 is used will be described with reference to FIGS. 3A and 3B. FIG. 3A is a diagram showing an example of how the bone conduction microphone MC1 is used. FIG. 3B is a diagram showing an example of how the air conduction microphone MC2 is used.

受音器１は、骨導マイクロホンＭＣ１あるいは気導マイクロホンＭＣ２のいずれか一方を、マイク接続端子１１に接続して使用される。骨導マイクロホンＭＣ１は、ユーザの声帯付近に接触して装着されて使用される。また、気導マイクロホンＭＣ２は、ユーザの口の前に位置するように配置されて使用される。 The sound receiver 1 is used with either the bone conduction microphone MC1 or the air conduction microphone MC2 connected to the microphone connection terminal 11. The bone conduction microphone MC1 is worn in contact with the user's body near the vocal cords. The air conduction microphone MC2 is positioned in front of the user's mouth.

図４は、実施の形態１に係る音声テキスト変換システム１００の動作手順例を示すシーケンス図である。なお、図４において、ネットワークＮＷ１の図示は省略されている。 FIG. 4 is a sequence diagram showing an example of the operation procedure of the speech-to-text conversion system 100 according to the first embodiment. Note that in FIG. 4, illustration of the network NW1 is omitted.

受音器１は、接続された骨導マイクロホンＭＣ１あるいは気導マイクロホンＭＣ２のいずれか一方のマイクロホンを用いて、ユーザの音声を収音してアナログ音声信号に変換する（Ｔ１）。 The sound receiver 1 uses either the connected bone conduction microphone MC1 or the air conduction microphone MC2 to collect the user's voice and convert it into an analog voice signal (T1).

受音器１は、アナログ音声信号を端末装置２に送信する（Ｔ２）。 The sound receiver 1 transmits the analog audio signal to the terminal device 2 (T2).

端末装置２は、受信されたアナログ音声信号をデジタル音声信号に変換する（Ｔ３）。なお、端末装置２は、アナログ信号を受信した際の電圧降下に基づいて、降下した電圧値を測定する。 The terminal device 2 converts the received analog audio signal into a digital audio signal (T3). The terminal device 2 also measures the voltage value after the voltage drop that occurs when the analog signal is received.

端末装置２は、変換したデジタル音声信号を、ネットワークＮＷ１を介してサーバ３に送信する（Ｔ４）。また、端末装置２は、測定された電圧値の情報を、ネットワークＮＷ１を介してサーバ３に送信する。 The terminal device 2 transmits the converted digital audio signal to the server 3 via the network NW1 (T4). Further, the terminal device 2 transmits information on the measured voltage value to the server 3 via the network NW1.

サーバ３は、端末装置２から受信した電圧降下に基づく電圧値の情報に基づいて、デジタル音声信号の基となる音声が骨導音声か否かを判別する（Ｔ５）。また、ステップＴ５の処理においてサーバ３は、デジタル音声信号の基となる音声が骨導音声である場合、デジタル音声信号に識別子を付与する。 The server 3 determines whether or not the sound on which the digital audio signal is based is bone conduction sound based on the voltage value information based on the voltage drop received from the terminal device 2 (T5). Furthermore, in the process of step T5, the server 3 adds an identifier to the digital audio signal when the audio on which the digital audio signal is based is bone conduction audio.

サーバ３は、ステップＴ５の処理の結果、骨導音声である場合には気導音声のデジタル音声信号に変換する（Ｔ６）。なお、サーバ３は、気導音声である場合には何の処理も実行しない。 If the result of the process in step T5 is bone conduction sound, the server 3 converts it into a digital sound signal of air conduction sound (T6). Note that the server 3 does not perform any processing if the sound is air conduction sound.

サーバ３は、気導音声を音声認識してテキスト情報に変換する（Ｔ７）。なお、図４には示していないが、さらにサーバ３は、テキスト情報に基づいて音声信号を生成してもよい。 The server 3 recognizes the air conduction voice and converts it into text information (T7). Although not shown in FIG. 4, the server 3 may further generate an audio signal based on the text information.

サーバ３は、変換されたテキスト情報を、ネットワークＮＷ１を介して端末装置２に送信する（Ｔ８）。なお、サーバ３は、ステップＴ７の処理においてさらに音声信号を生成する場合には、生成された音声信号とテキスト情報とのうち少なくとも一方を端末装置２に送信する。 The server 3 transmits the converted text information to the terminal device 2 via the network NW1 (T8). Note that when the server 3 further generates an audio signal in the process of step T7, it transmits at least one of the generated audio signal and text information to the terminal device 2.

端末装置２は、テキスト情報を受信する（Ｔ９）。受信されたテキスト情報は、端末装置２に表示されてもよいし、さらに音声信号に変換されて受音器１に送信されてもよい。 The terminal device 2 receives the text information (T9). The received text information may be displayed on the terminal device 2 or may be further converted into an audio signal and transmitted to the sound receiver 1.
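The server-side part of this sequence (steps T5 to T7) can be sketched as a small function. This is a hedged outline only: `process_on_server` and its three callable parameters are hypothetical names introduced here, and the real discrimination, conversion and recognition are passed in as stand-ins rather than implemented.

```python
from typing import Callable, List

Samples = List[float]


def process_on_server(
    samples: Samples,
    is_bone_conduction: Callable[[Samples], bool],
    bone_to_air: Callable[[Samples], Samples],
    recognize: Callable[[Samples], str],
) -> str:
    """Mirror steps T5 to T7: discriminate, convert only if needed, then recognize."""
    if is_bone_conduction(samples):      # T5: bone conduction voice or not
        samples = bone_to_air(samples)   # T6: skipped for air conduction voice
    return recognize(samples)            # T7: speech recognition into text
```

Passing the three stages as parameters keeps the branch structure visible: air conduction voice goes straight to recognition, while bone conduction voice takes one extra conversion step first.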

図５は、実施の形態１に係る音声テキスト変換システム１００の音声判別手順例を示すフローチャートである。図５に示す音声判別処理は、サーバ３における音声判別部３３によって実行される。 FIG. 5 is a flowchart illustrating an example of a speech discrimination procedure of the speech-to-text conversion system 100 according to the first embodiment. The voice discrimination process shown in FIG. 5 is executed by the voice discrimination unit 33 in the server 3.

音声判別部３３は、端末装置２よりデジタル音声信号を受信する（Ｓｔ１）。また、音声判別部３３は、この際に電圧降下により降下した電圧値の情報を受信する。 The voice discrimination unit 33 receives a digital voice signal from the terminal device 2 (St1). In addition, the voice discrimination unit 33 receives information on the voltage value dropped due to the voltage drop at this time.

音声判別部３３は、端末装置２より受信された電圧降下により降下した電圧値の情報に基づいて、電圧値が閾値Ｔｈ以下であるか否かを判定する（Ｓｔ２）。 Based on the information on the dropped voltage value received from the terminal device 2, the voice discrimination unit 33 determines whether the voltage value is equal to or less than the threshold Th (St2).

音声判別部３３は、ステップＳｔ２の処理において、降下した電圧値が閾値Ｔｈ以下の場合（Ｓｔ２，ＹＥＳ）には、デジタル音声信号に識別子を付与する（Ｓｔ３）。 In the process of step St2, if the dropped voltage value is equal to or less than the threshold Th (St2, YES), the voice discrimination unit 33 adds an identifier to the digital audio signal (St3).

音声判別部３３は、デジタル音声信号に識別子があるか否かを判別する（Ｓｔ４）。これにより、音声判別部３３は、デジタル音声信号の基となる音声が気導音声であるにも関わらず、降下した電圧値が大きくなってしまった場合に骨導音声と誤判別する可能性を低くすることができる。 The voice discrimination unit 33 determines whether the digital audio signal carries an identifier (St4). This lowers the possibility that the voice discrimination unit 33 misjudges the voice as bone conduction voice when the dropped voltage value has become large even though the voice underlying the digital audio signal is air conduction voice.

音声判別部３３は、ステップＳｔ４の処理において、識別子が付与されている場合（Ｓｔ４，ＹＥＳ）には、デジタル音声信号の基となる音声が骨導音声であると判定する（Ｓｔ５）。 In the process of step St4, if an identifier has been added (St4, YES), the voice discrimination unit 33 determines that the voice underlying the digital audio signal is bone conduction voice (St5).

音声判別部３３は、ステップＳｔ４の処理において、識別子が付与されていない場合（Ｓｔ４，ＮＯ）には、デジタル音声信号の基となる音声が気導音声であると判定する（Ｓｔ６）。 In the process of step St4, if no identifier has been added (St4, NO), the voice discrimination unit 33 determines that the voice underlying the digital audio signal is air conduction voice (St6).

以上により、音声テキスト変換システム１００は、音声判別処理を終了する。 With the above steps, the speech-to-text conversion system 100 ends the speech discrimination process.
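The flow of steps St2 to St6 can be sketched as below. This is an assumption-laden illustration: the `DigitalAudio` type, the boolean `bone_flag` standing in for the identifier, and the example threshold values are hypothetical; the text does not give a concrete value for the threshold Th.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DigitalAudio:
    samples: List[float]
    bone_flag: bool = False   # stands in for the identifier added in St3


def discriminate(audio: DigitalAudio, dropped_voltage: float, threshold: float) -> str:
    """Follow St2 to St6 and return "bone" or "air"."""
    # St2: is the dropped voltage value at or below the threshold Th?
    if dropped_voltage <= threshold:
        audio.bone_flag = True   # St3: add the identifier
    # St4: decide from the presence of the identifier
    if audio.bone_flag:
        return "bone"            # St5
    return "air"                 # St6
```

For example, `discriminate(DigitalAudio([0.0]), dropped_voltage=1.2, threshold=1.5)` takes the St2 YES branch and yields `"bone"`, where 1.2 and 1.5 are made-up numbers in arbitrary units.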

図６Ａおよび図６Ｂを参照して、実施の形態１に係る音声テキスト変換システム１００によって実行された音声認識結果の一例について説明する。図６Ａは、実施の形態１に係る音声テキスト変換システム１００の音声認識例１を示す図である。図６Ｂは、実施の形態１に係る音声テキスト変換システム１００の音声認識例２を示す図である。図６Ａおよび図６Ｂでは、骨導音声、気導音声、学習モデルを用いて骨導音声から変換された気導音声のそれぞれを音声認識した音声認識結果の一例を示す。発話内容Ｕ１１，Ｕ１２のそれぞれは、ユーザによって実際に発話された音声である。 An example of speech recognition results produced by the speech-to-text conversion system 100 according to the first embodiment will be described with reference to FIGS. 6A and 6B. FIG. 6A is a diagram showing speech recognition example 1 of the speech-to-text conversion system 100 according to the first embodiment. FIG. 6B is a diagram showing speech recognition example 2 of the speech-to-text conversion system 100 according to the first embodiment. FIGS. 6A and 6B show examples of the results of performing speech recognition on bone conduction voice, on air conduction voice, and on air conduction voice converted from bone conduction voice using a learning model. Each of the utterance contents U11 and U12 is speech actually uttered by the user.

発話内容Ｕ１１は、「テレビゲームやパソコンでゲームをして遊ぶ」である。音声認識結果Ａｎ１１は、骨導音声に基づいて音声認識を実行して得られた結果であり、「テレビゲームや若く音で、ゲームをして遊ぶ」というテキスト情報に変換される。音声認識結果Ａｎ２１は、気導音声に基づいて音声認識を実行して得られた結果であり、「あれはテレビゲームやパソコンでワンゲームをして遊ぶなあ」というテキスト情報に変換される。音声認識結果Ａｎ３１は、学習モデルを用いて骨導音声から変換された気導音声に基づいて音声認識を実行して得られた結果であり、「テレビゲームやパソコンでゲームをして遊ぶ」というテキスト情報に変換される。 Utterance content U11 is "I play video games and games on my computer." Speech recognition result An11 is obtained by performing speech recognition on the bone conduction voice, and is converted into text information roughly equivalent to "Play video games and games with young sounds." Speech recognition result An21 is obtained by performing speech recognition on the air conduction voice, and is converted into text information roughly equivalent to "That should be played with a video game or one game on a computer." Speech recognition result An31 is obtained by performing speech recognition on the air conduction voice converted from the bone conduction voice using the learning model, and is converted into the text information "I play video games and games on my computer," which matches the utterance.

発話内容Ｕ１２は、「あらゆる現実をすべて自分の方へねじ曲げたのだ」である。音声認識結果Ａｎ１２は、骨導音声に基づいて音声認識を実行して得られた結果であり、「あらゆる現Ｆを、らすべて自分の方へ、ねじ曲げたのだ」というテキスト情報に変換される。音声認識結果Ａｎ２２は、気導音声に基づいて音声認識を実行して得られた結果であり、「うーんあらゆる現実をらすべての主婦の方へ、ねじ曲げたのだろう」というテキスト情報に変換される。音声認識結果Ａｎ３２は、学習モデルを用いて骨導音声から変換された気導音声に基づいて音声認識を実行して得られた結果であり、「あらゆる現実を、すべて自分の方へ、ねじ曲げたのだ」というテキスト情報に変換される。 Utterance content U12 is "I twisted all reality towards myself." Speech recognition result An12 is obtained by performing speech recognition on the bone conduction voice, and is converted into text information roughly equivalent to "I twisted all the current F towards myself." Speech recognition result An22 is obtained by performing speech recognition on the air conduction voice, and is converted into text information roughly equivalent to "Hmm, I guess you twisted all reality towards all the housewives." Speech recognition result An32 is obtained by performing speech recognition on the air conduction voice converted from the bone conduction voice using the learning model, and is converted into the text information "I twisted all reality towards myself," which matches the utterance apart from punctuation.

以上により、音声テキスト変換システム１００は、学習モデルを用いて骨導音声を気導音声に変換することにより、ユーザの発話内容を類似する音声認識結果（テキスト情報）を得ることができる。 As described above, the speech-to-text conversion system 100 can obtain speech recognition results (text information) similar to the content of the user's utterance by converting bone conduction speech into air conduction speech using the learning model.
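One way to put a number on how close each recognition result is to the utterance is the character error rate (edit distance divided by reference length). This metric is not named in the source; it is brought in here purely as an illustration, and the function and constant names below are assumptions. The strings are the first example's utterance and recognition results as quoted above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


U11 = "テレビゲームやパソコンでゲームをして遊ぶ"
AN11 = "テレビゲームや若く音で、ゲームをして遊ぶ"                 # bone conduction voice
AN21 = "あれはテレビゲームやパソコンでワンゲームをして遊ぶなあ"   # air conduction voice
AN31 = "テレビゲームやパソコンでゲームをして遊ぶ"                 # converted voice

for name, hyp in (("An11", AN11), ("An21", AN21), ("An31", AN31)):
    print(name, round(cer(U11, hyp), 3))
```

Because An31 is character-for-character identical to U11, its error rate is zero, while An11 and An21 both come out above zero.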

また、音声テキスト変換システム１００は、音声認識結果（テキスト情報）を用いることにより、ノイズを低減した音声信号を生成することができる。 Furthermore, the speech-to-text conversion system 100 can generate a speech signal with reduced noise by using the speech recognition result (text information).

また、実施の形態１に係る音声テキスト変換システム１００について、その他の実施例について説明する。 Further, other examples of the speech-to-text conversion system 100 according to the first embodiment will be described.

端末装置２は、図２に示す内部構成例に限定されない。端末装置２は、例えば、サーバ３の構成を含んで構成されてもよい。この場合、音声テキスト変換装置１００Ａは、ネットワークＮＷ１およびサーバ３が不要となり省略することができる。以下、図７を参照して説明する。 The terminal device 2 is not limited to the internal configuration example shown in FIG. 2. For example, the terminal device 2 may be configured to include the components of the server 3. In this case, the speech-to-text conversion device 100A no longer needs the network NW1 and the server 3, so they can be omitted. This configuration is described below with reference to FIG. 7.

図７は、音声テキスト変換装置１００Ａの一例を示す図である。なお、図７に示す音声テキスト変換装置１００Ａの構成は、実施の形態１に係る音声テキスト変換システム１００において説明した構成が有する機能と略同一の機能を有するため、同一の構成については同一の符号を付与して説明を省略する。 FIG. 7 is a diagram showing an example of the speech-to-text conversion device 100A. The components of the speech-to-text conversion device 100A shown in FIG. 7 have substantially the same functions as the components described for the speech-to-text conversion system 100 according to the first embodiment, so identical components are given the same reference numerals and their description is omitted.

図７に示す音声テキスト変換装置１００Ａは、受音器１と、端末装置２と、を含んで構成される。端末装置２は、さらに音声判別部３３と、音声変換部３４と、音声認識部３５と、出力部３６と、記憶部３７と、テキスト音声変換部３８と、を含んで構成される。なお、テキスト音声変換部３８は必須の構成でなく、省略されてもよい。また、端末装置２は、さらにテキスト情報を表示する表示部（不図示）などを備えてもよい。 The speech-to-text conversion device 100A shown in FIG. 7 includes a sound receiver 1 and a terminal device 2. The terminal device 2 further includes a speech discrimination section 33, a speech conversion section 34, a speech recognition section 35, an output section 36, a storage section 37, and a text-to-speech conversion section 38. Note that the text-to-speech converter 38 is not an essential component and may be omitted. Furthermore, the terminal device 2 may further include a display unit (not shown) that displays text information.

以上により、実施の形態１に係る音声テキスト変換システム１００は、音声を収音する受音器１が接続された端末装置２とサーバ３との間が通信可能であり、端末装置２は、受音器により収音された音声の音声信号をサーバ３に送信し、サーバ３は、端末装置２から受信された音声信号に基づいて、音声がユーザの声帯の振動に基づく骨導音声あるいは空気を介したユーザの鼓膜の振動に基づく気導音声のいずれかを判別し、骨導音声を気導音声に変換し、気導音声をテキスト情報に変換し、変換されたテキスト情報を出力する。 As described above, in the speech-to-text conversion system 100 according to the first embodiment, the terminal device 2, to which the sound receiver 1 that picks up voice is connected, and the server 3 can communicate with each other. The terminal device 2 transmits an audio signal of the voice picked up by the sound receiver to the server 3. Based on the audio signal received from the terminal device 2, the server 3 determines whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air, converts the bone conduction voice into air conduction voice, converts the air conduction voice into text information, and outputs the converted text information.

これにより、実施の形態１に係る音声テキスト変換システム１００は、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる。 Thereby, the speech-to-text conversion system 100 according to the first embodiment can recognize speech and convert it into text depending on the type of connected microphone.

また、音声テキスト変換システム１００は、音声が気導音声の場合、気導音声をテキスト情報に変換し、変換されたテキスト情報を出力する。これにより、実施の形態１に係る音声テキスト変換システム１００は、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる。 Furthermore, when the voice is air conduction voice, the speech-to-text conversion system 100 converts the air conduction voice into text information and outputs the converted text information. Thereby, the speech-to-text conversion system 100 according to the first embodiment can recognize speech and convert it into text depending on the type of connected microphone.

また、受音器１は、骨導音声を取得する骨導マイクロホンＭＣ１または気導音声を取得する気導マイクロホンＭＣ２のいずれか一方を備える。これにより、音声テキスト変換システム１００は、複数の種類のマイクロホンを接続可能であり、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる。 Further, the sound receiver 1 includes either a bone conduction microphone MC1 that acquires bone conduction sound or an air conduction microphone MC2 that acquires air conduction sound. Thereby, the speech-to-text conversion system 100 can connect a plurality of types of microphones, and can recognize speech and convert it into text according to the type of connected microphone.

また、サーバ３は、音声信号のスペクトル特性のうち高周波数成分と低周波数成分との比率に基づいて、音声信号が骨導音声あるいは気導音声のいずれかを判別する。これにより、音声テキスト変換システム１００は、音声信号の基となる音声が骨導音声あるいは気導音声のいずれかを判別可能であり、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる。 Further, the server 3 determines whether the audio signal is bone conduction voice or air conduction voice based on the ratio of high-frequency components to low-frequency components among the spectral characteristics of the audio signal. Thereby, the speech-to-text conversion system 100 can determine whether the voice underlying the audio signal is bone conduction voice or air conduction voice, and can recognize the voice and convert it into text according to the type of connected microphone.
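A discrimination based on the high-to-low frequency ratio could look roughly like the sketch below. Several things here are assumptions and not taken from the source: the direction of the test (bone-conducted speech is commonly reported to carry less high-frequency energy), the 2000 Hz split point, the `max_ratio` threshold, and the function names. A naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math
from typing import List


def band_energy_ratio(samples: List[float], sample_rate: float, split_hz: float) -> float:
    """Return (energy at or above split_hz) / (energy below split_hz) from a naive DFT."""
    n = len(samples)
    low = high = 0.0
    for k in range(1, n // 2 + 1):   # skip DC, use the positive-frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        power = abs(coeff) ** 2
        if k * sample_rate / n >= split_hz:
            high += power
        else:
            low += power
    return high / low if low > 0 else math.inf


def looks_like_bone_conduction(samples: List[float], sample_rate: float,
                               split_hz: float = 2000.0, max_ratio: float = 0.1) -> bool:
    """Assumed rule: a small high/low energy ratio suggests bone conduction voice."""
    return band_energy_ratio(samples, sample_rate, split_hz) < max_ratio
```

In practice the split frequency and threshold would have to be tuned on recordings from the actual microphones; the defaults above are placeholders.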

また、サーバ３は、端末装置２から音声信号を受信した際に降下する電圧値（つまり、電圧降下値）に基づいて、音声信号が骨導音声あるいは気導音声のいずれかを判別する。これにより、音声テキスト変換システム１００は、音声信号の基となる音声が骨導音声あるいは気導音声のいずれかを判別可能であり、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる。 Further, the server 3 determines whether the audio signal is bone conduction voice or air conduction voice based on the voltage value that drops when the audio signal is received from the terminal device 2 (that is, the voltage drop value). Thereby, the speech-to-text conversion system 100 can determine whether the voice underlying the audio signal is bone conduction voice or air conduction voice, and can recognize the voice and convert it into text according to the type of connected microphone.

また、サーバ３は、受音器１により収音された音声が骨導音声の場合、音声信号に前記音声が骨導音声であることを示す識別子を付与する。これにより、音声テキスト変換システム１００は、音声信号の基となる音声が骨導音声であることを、より明確にしてその後の処理を実行できる。 Further, when the sound picked up by the sound receiver 1 is a bone-conducted sound, the server 3 gives the audio signal an identifier indicating that the sound is a bone-conducted sound. Thereby, the speech-to-text conversion system 100 can execute subsequent processing while making it clearer that the audio that is the basis of the audio signal is bone conduction audio.

また、サーバ３は、識別子が付与されているか否かに基づいて、音声が骨導音声または気導音声であるかを判別する。これにより、音声テキスト変換システム１００は、音声信号の基となる音声が骨導音声であることをより確実に判別できる。また、音声テキスト変換システム１００は、デジタル音声信号の基となる音声が気導音声であるにも関わらず、降下した電圧値が大きくなってしまった場合に骨導音声と誤判別する可能性を低くすることができる。 Furthermore, the server 3 determines whether the voice is bone conduction voice or air conduction voice based on whether an identifier has been added. Thereby, the speech-to-text conversion system 100 can more reliably determine that the voice underlying the audio signal is bone conduction voice. The speech-to-text conversion system 100 can also lower the possibility of misjudging the voice as bone conduction voice when the dropped voltage value has become large even though the voice underlying the digital audio signal is air conduction voice.

また、識別子は、音声と異なる周波数帯域の音源である。これにより、音声テキスト変換システム１００は、ユーザの音声を損なうことなく識別子を付与することができ、さらに誤判別する可能性を低くすることができる。 Further, the identifier is a sound source in a frequency band different from that of the voice. Thereby, the speech-to-text conversion system 100 can assign an identifier without damaging the user's voice, and can further reduce the possibility of misidentification.
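An identifier in a frequency band separate from the voice could, for instance, be a faint tone mixed into the signal and later detected with the Goertzel algorithm. The sketch below is hypothetical throughout: the source does not say how the identifier is generated or detected, and the marker frequency, amplitude, detection rule and function names are all assumptions.

```python
import math
from typing import List


def add_marker(samples: List[float], sample_rate: float, marker_hz: float,
               amplitude: float = 0.05) -> List[float]:
    """Mix a low-level sine tone at marker_hz into the signal."""
    return [x + amplitude * math.sin(2 * math.pi * marker_hz * t / sample_rate)
            for t, x in enumerate(samples)]


def goertzel_power(samples: List[float], sample_rate: float, target_hz: float) -> float:
    """Squared DFT magnitude at the bin nearest target_hz (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def has_marker(samples: List[float], sample_rate: float, marker_hz: float,
               amplitude: float = 0.05) -> bool:
    """Detect the tone if its power reaches that of a tone at half the expected amplitude."""
    expected = (amplitude * len(samples) / 2) ** 2   # |X[k]|^2 of a full-strength tone
    return goertzel_power(samples, sample_rate, marker_hz) >= 0.25 * expected
```

With a 48 kHz sample rate, a marker around 20 kHz would sit well above the band that carries most speech energy, which is the property the text relies on; whether such a tone survives the actual transmission path is not addressed here.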

また、サーバ３は、骨導音声を気導音声に変換するための学習モデルを有し、学習モデルは、骨導マイクロホンと気導マイクロホンとから同時に収音された音声に基づいて、骨導音声と気導音声の特徴量をそれぞれ抽出する。サーバ３は、抽出された骨導音声の特徴量を気導音声の特徴量に変換する。これにより、音声テキスト変換システム１００は、効率的な音声認識を実行することができるとともに、気導音声の特徴量に変換する際に骨導音声特有の雑音を除去することができる。 The server 3 also has a learning model for converting bone conduction voice into air conduction voice. Based on voice picked up simultaneously by a bone conduction microphone and an air conduction microphone, the learning model extracts the feature amounts of the bone conduction voice and of the air conduction voice. The server 3 converts the extracted feature amount of the bone conduction voice into a feature amount of air conduction voice. Thereby, the speech-to-text conversion system 100 can perform efficient speech recognition, and can also remove noise specific to bone conduction voice when converting to the feature amount of air conduction voice.
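The idea of learning a feature mapping from simultaneously recorded pairs can be illustrated with a deliberately simple stand-in: a per-dimension least-squares gain and offset. The source does not specify the type of learning model, so this linear fit, the `Vector` alias and both function names are assumptions, and a practical system would likely use a far richer model.

```python
from typing import List, Tuple

Vector = List[float]


def fit_per_dimension(bone: List[Vector], air: List[Vector]) -> List[Tuple[float, float]]:
    """Least-squares (gain, offset) per feature dimension from paired frames."""
    if not bone or len(bone) != len(air):
        raise ValueError("need the same non-zero number of bone and air frames")
    count = len(bone)
    params = []
    for d in range(len(bone[0])):
        xs = [frame[d] for frame in bone]
        ys = [frame[d] for frame in air]
        mean_x = sum(xs) / count
        mean_y = sum(ys) / count
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        gain = cov / var_x if var_x > 0 else 0.0
        params.append((gain, mean_y - gain * mean_x))
    return params


def bone_to_air(frame: Vector, params: List[Tuple[float, float]]) -> Vector:
    """Map one bone conduction feature frame to an estimated air conduction frame."""
    return [gain * x + offset for x, (gain, offset) in zip(frame, params)]
```

The paired `bone` and `air` frame lists correspond to features extracted from the same utterance captured by both microphones at once, which is what makes a supervised mapping of this kind possible.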

また、サーバ３は、気導音声をデータベースとする音響モデルを用いて音声認識する。これにより、音声テキスト変換システム１００は、効率的な音声認識を実行することができる。 Furthermore, the server 3 recognizes speech using an acoustic model that uses air conduction speech as a database. Thereby, the speech-to-text conversion system 100 can perform efficient speech recognition.

また、サーバ３は、受音器１により収音された音声が骨導音声の場合に、骨導音声に基づいて変換された気導音声を第１のテキスト情報に変換する第１の音声認識処理と、骨導音声を第２のテキスト情報に変換する第２の音声認識処理とを実行する。サーバ３は、第１のテキスト情報および第２のテキスト情報のそれぞれにおける信頼度を判定して比較し、信頼度が高い方のテキスト情報を出力する。これにより、音声テキスト変換システム１００は、受音器１によって収音された音声をより正確にテキスト情報に変換できる。 In addition, when the sound picked up by the sound receiver 1 is bone conduction sound, the server 3 performs first voice recognition that converts the air conduction sound converted based on the bone conduction sound into first text information. and a second voice recognition process that converts the bone conduction voice into second text information. The server 3 determines and compares the reliability of each of the first text information and the second text information, and outputs the text information with the higher reliability. Thereby, the speech-to-text conversion system 100 can more accurately convert the speech picked up by the sound receiver 1 into text information.

実施の形態１の変形例に係る音声テキスト変換装置１００Ａは、音声を収音する受音器１との間で通信可能な音声テキスト変換装置１００Ａであって、受音器により収音された音声が、ユーザの声帯の振動に基づく骨導音声あるいは空気を介したユーザの鼓膜の振動に基づく気導音声のいずれかを判別する音声判別部と、骨導音声を前記気導音声に変換する音声変換部と、気導音声をテキスト情報に変換する音声認識部と、変換された前記テキスト情報を出力する出力部と、を備える。 The speech-to-text conversion device 100A according to the modification of the first embodiment is a speech-to-text conversion device 100A capable of communicating with the sound receiver 1 that picks up voice. It includes a voice discrimination unit that determines whether the voice picked up by the sound receiver is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air, a voice conversion unit that converts the bone conduction voice into the air conduction voice, a speech recognition unit that converts the air conduction voice into text information, and an output unit that outputs the converted text information.

これにより、実施の形態１の変形例に係る音声テキスト変換装置１００Ａは、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる。 Thereby, the speech-to-text conversion device 100A according to the modification of the first embodiment can recognize speech and convert it into text depending on the type of connected microphone.

以上、添付図面を参照しながら各種の実施の形態について説明したが、本開示はかかる例に限定されない。当業者であれば、特許請求の範囲に記載された範疇内において、各種の変更例、修正例、置換例、付加例、削除例、均等例に想到し得ることは明らかであり、それらについても本開示の技術的範囲に属すると了解される。また、発明の趣旨を逸脱しない範囲において、上述した各種の実施の形態における各構成要素を任意に組み合わせてもよい。 Although various embodiments have been described above with reference to the accompanying drawings, the present disclosure is not limited to such examples. It is clear that those skilled in the art can come up with various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and It is understood that it falls within the technical scope of the present disclosure. Further, each of the constituent elements in the various embodiments described above may be arbitrarily combined without departing from the spirit of the invention.

本開示は、音声テキスト変換システムおよび音声テキスト変換装置の提示において、接続されたマイクロホンの種類に応じて、音声を音声認識し、テキスト変換できる音声テキスト変換システムおよび音声テキスト変換装置の提示として有用である。 The present disclosure is useful as a presentation of a speech-to-text conversion system and a speech-to-text conversion device that can recognize speech and convert it into text according to the type of connected microphone.

１受音器
１１マイク接続端子
２端末装置
２０，３０通信部
２１，３１プロセッサ
２２，３２メモリ
２３Ａ／Ｄ変換部
３サーバ
３３音声判別部
３４音声変換部
３５音声認識部
３６出力部
３７記憶部
１００音声テキスト変換システム
１００Ａ音声テキスト変換装置
ＮＷ１ネットワーク
ＭＣ１骨導マイクロホン
ＭＣ２気導マイクロホン

1 Sound receiver
11 Microphone connection terminal
2 Terminal device
20, 30 Communication unit
21, 31 Processor
22, 32 Memory
23 A/D conversion unit
3 Server
33 Voice discrimination unit
34 Voice conversion unit
35 Speech recognition unit
36 Output unit
37 Storage unit
100 Speech-to-text conversion system
100A Speech-to-text conversion device
NW1 Network
MC1 Bone conduction microphone
MC2 Air conduction microphone

Claims

A speech-to-text conversion system in which a server and a terminal device, to which a sound receiver that picks up voice is connected, can communicate with each other, wherein
the terminal device
transmits an audio signal of the voice picked up by the sound receiver to the server, and
the server
determines, based on the ratio of high-frequency components to low-frequency components among the spectral characteristics of the audio signal received from the terminal device, whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air, and,
when the voice is the bone conduction voice,
converts the bone conduction voice into the air conduction voice,
converts the air conduction voice into text information, and
outputs the converted text information.

A speech-to-text conversion system in which a server and a terminal device, to which a sound receiver that picks up voice is connected, can communicate with each other, wherein
the terminal device
transmits an audio signal of the voice picked up by the sound receiver to the server, and
the server
determines, based on the voltage drop value in the sound receiver when the audio signal is received from the terminal device, whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air, and,
when the voice is the bone conduction voice,
converts the bone conduction voice into the air conduction voice,
converts the air conduction voice into text information, and
outputs the converted text information.

The speech-to-text conversion system according to claim 1 or 2, wherein,
when the voice is the air conduction voice, the air conduction voice is converted into text information and the converted text information is output.

The speech-to-text conversion system according to any one of claims 1 to 3, wherein
the sound receiver includes either a bone conduction microphone that acquires the bone conduction voice or an air conduction microphone that acquires the air conduction voice.

The speech-to-text conversion system according to claim 1 or 2, wherein,
when the voice picked up by the sound receiver is the bone conduction voice, the server adds to the audio signal an identifier indicating that the voice is the bone conduction voice.

The speech-to-text conversion system according to claim 5, wherein
the server determines whether the voice is the bone conduction voice or the air conduction voice based on the presence or absence of the identifier.

The speech-to-text conversion system according to claim 5, wherein
the identifier is an audio signal of a sound source in a frequency band different from that of the voice.

The speech-to-text conversion system according to claim 4, wherein
the server has a learning model for converting the bone conduction voice into the air conduction voice, and
the learning model extracts the respective feature amounts of the bone conduction voice and the air conduction voice based on voice picked up simultaneously by the bone conduction microphone and the air conduction microphone, and converts the feature amount of the bone conduction voice into the feature amount of the air conduction voice.

The speech-to-text conversion system according to any one of claims 1 to 3, wherein
the server performs speech recognition using an acoustic model that uses the air conduction voice as its database.

A speech-to-text conversion system in which a server and a terminal device, to which a sound receiver that picks up voice is connected, can communicate with each other, wherein
the terminal device
transmits an audio signal of the voice picked up by the sound receiver to the server, and
the server
determines, based on the audio signal received from the terminal device, whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air, and,
when the voice is the bone conduction voice,
executes a first voice recognition process that converts the air conduction voice converted from the bone conduction voice into first text information, and a second voice recognition process that converts the bone conduction voice into second text information, and
determines and compares the reliability of each of the first text information and the second text information, and outputs the text information with the higher reliability.

A speech-to-text conversion device capable of communicating with a sound receiver that picks up voice, the device comprising:
a voice discrimination unit that determines, based on the ratio of high-frequency components to low-frequency components among the spectral characteristics of the voice picked up by the sound receiver, whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air;
a voice conversion unit that converts the bone conduction voice into the air conduction voice;
a speech recognition unit that converts the air conduction voice into text information; and
an output unit that outputs the converted text information.

A speech-to-text conversion device capable of communicating with a sound receiver that picks up voice, the device comprising:
a voice discrimination unit that determines, based on the voltage drop value in the sound receiver when the voice picked up by the sound receiver is received, whether the voice is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air;
a voice conversion unit that converts the bone conduction voice into the air conduction voice;
a speech recognition unit that converts the air conduction voice into text information; and
an output unit that outputs the converted text information.

A speech-to-text conversion device capable of communicating with a sound receiver that picks up voice, the device comprising:
a voice discrimination unit that determines whether the voice picked up by the sound receiver is bone conduction voice based on vibration of the user's vocal cords or air conduction voice based on vibration of the user's eardrum via the air;
a voice conversion unit that converts the bone conduction voice into the air conduction voice;
a speech recognition unit that converts the air conduction voice into text information, and that executes a first voice recognition process that converts the air conduction voice converted from the bone conduction voice into first text information and a second voice recognition process that converts the bone conduction voice into second text information; and
an output unit that determines and compares the reliability of each of the first text information and the second text information, and outputs the text information with the higher reliability.