HK40112939A - Detection of silent speech - Google Patents
- Publication number
- HK40112939A HK62024100993.0A
- Authority
- HK
- Hong Kong
- Prior art keywords
- user
- face
- sensed
- response
- processing circuitry
- Prior art date
Description
Cross-Reference to Related Applications
This application claims the benefit of U.S. Provisional Patent Application 63/229,091, filed August 4, 2021, which is incorporated herein by reference.
Field of the Invention
The present invention relates generally to physiological sensing, and particularly to methods and apparatus for sensing human speech.
Background
Speaking activates nerves and muscles in the chest, neck, and face. Electromyography (EMG), for example, has therefore been used to capture muscle impulses for speech sensing.
Secondary speckle patterns have been used to monitor the movement of skin on the human body. Secondary speckle typically arises in the diffuse reflection of a laser beam from a rough surface, such as skin. By tracking the temporal and amplitude variations of the secondary speckle produced by reflection from human skin illuminated by a laser beam, researchers have measured blood pulse pressure and other vital signs. For example, U.S. Patent 10,398,314 describes a method for monitoring a condition of a subject's body using image data indicative of a sequence of speckle patterns generated by the body.
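The speckle-tracking principle behind such measurements can be illustrated with a short sketch: the translation of a speckle pattern between two consecutive sensor frames is estimated from the peak of their cross-correlation. This is a generic illustration of the technique, not the specific method of the cited patent; the FFT-based correlator and frame sizes are assumptions.

```python
import numpy as np

def speckle_shift(frame_a: np.ndarray, frame_b: np.ndarray) -> tuple:
    """Estimate the (row, col) translation between two speckle frames
    by locating the peak of their FFT-based cross-correlation."""
    a = frame_a - frame_a.mean()
    b = frame_b - frame_b.mean()
    corr = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices beyond the midpoint correspond to negative (wrapped) shifts.
    shifts = [int(p) if p <= s // 2 else int(p) - s
              for p, s in zip(peak, corr.shape)]
    return shifts[0], shifts[1]

# Demo: shift a random speckle-like pattern by (3, -2) and recover the shift.
rng = np.random.default_rng(0)
frame = rng.random((64, 64))
moved = np.roll(frame, shift=(3, -2), axis=(0, 1))
print(speckle_shift(moved, frame))  # -> (3, -2)
```

Tracking the time series of such shifts for each illuminated spot yields the skin-motion traces from which vital signs (or, in the present invention, speech) can be decoded.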
Overview
The embodiments of the present invention described below provide novel methods and devices for sensing human speech.
According to an embodiment of the present invention, a sensing device is additionally provided, comprising a support and an optical sensing head. The support is configured to fit on the ear of a user of the device, and the optical sensing head is held by the support in a position in proximity to the user's face and is configured to sense light reflected from the face and to output a signal in response to the sensed light. Processing circuitry is configured to process the signal so as to generate a speech output.
In one embodiment, the support comprises an ear clip. Alternatively, the support comprises an eyeglass frame. In a disclosed embodiment, the optical sensing head is configured to sense light reflected from the user's cheek.
In some embodiments, the optical sensing head comprises an emitter and a sensor array, the emitter being configured to direct coherent light toward the face, and the sensor array being configured to sense a secondary speckle pattern produced by reflection of the coherent light from the face. In disclosed embodiments, the emitter is configured to direct multiple beams of the coherent light toward different, respective locations on the face, and the sensor array is configured to sense the secondary speckle patterns reflected from these locations. Additionally or alternatively, the locations illuminated by the beams and sensed by the sensor array extend over an area of at least 1 cm². Further additionally or alternatively, the optical sensing head comprises multiple emitters configured to generate respective sets of beams covering different, respective areas of the face, and the processing circuitry is configured to select and actuate a subset of the emitters, without actuating all the emitters.
In disclosed embodiments, the processing circuitry is configured to detect changes in the sensed secondary speckle pattern and to generate the speech output in response to the detected changes.
Alternatively or additionally, the processing circuitry is configured to operate the sensor array at a first frame rate, to sense motion of the face in response to the signal while operating at the first frame rate, and in response to the sensed motion, to increase the frame rate to a second frame rate, greater than the first frame rate, in order to generate the speech output.
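The two-rate scheme described above can be sketched as a simple controller: monitor at a low frame rate, and switch to a high rate when inter-frame motion exceeds a threshold. The specific rates, threshold, and motion metric below are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

IDLE_FPS = 50            # low-rate monitoring (assumed value)
ACTIVE_FPS = 500         # high-rate capture for speech decoding (assumed value)
MOTION_THRESHOLD = 5.0   # mean absolute frame difference that signals activity

class FrameRateController:
    """Duty-cycle the sensor array: monitor at a low frame rate, and switch
    to a high frame rate once facial motion is detected (a sketch of the
    two-rate scheme; a real controller would add hysteresis and timeouts)."""

    def __init__(self):
        self.fps = IDLE_FPS
        self.prev = None

    def ingest(self, frame: np.ndarray) -> int:
        motion = 0.0
        if self.prev is not None:
            motion = float(np.mean(np.abs(frame - self.prev)))
        self.prev = frame
        if motion > MOTION_THRESHOLD:
            self.fps = ACTIVE_FPS        # speech-like motion: capture densely
        elif motion < MOTION_THRESHOLD / 10:
            self.fps = IDLE_FPS          # quiescent: conserve battery power
        return self.fps
```

Running at the low rate most of the time is what makes the battery-powered form factor practical; the high rate is engaged only while the skin is actually moving.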
In disclosed embodiments, the processing circuitry is configured to generate the speech output in response to changes in the signal output by the optical sensing head due to movement of the surface of the user's skin while the user utters no sound.
Typically, the optical sensing head is held by the support at a distance of at least 5 mm from the surface of the user's skin.
In one embodiment, the device includes one or more electrodes configured to contact the surface of the user's skin, wherein the processing circuitry is configured to generate the speech output in response to both the electrical activity sensed by the one or more electrodes and the signal output by the optical sensing head.
Additionally or alternatively, the device includes a microphone configured to sense sounds uttered by the user. In one embodiment, the processing circuitry is configured to compare the signal output by the optical sensing head with the sounds sensed by the microphone in order to calibrate the optical sensing head. Additionally or alternatively, the processing circuitry is configured to change an operational state of the device in response to sensing the sounds uttered by the user.
In some embodiments, the device includes a communication interface, wherein the processing circuitry is configured to encode the signal for transmission over the communication interface to a processing device, which processes the encoded signal so as to generate the speech output. In a disclosed embodiment, the communication interface comprises a wireless interface.
Additionally or alternatively, the device includes a user control, which is connected to the support and configured to sense gestures made by the user, wherein the processing circuitry is configured to change an operational state of the device in response to the sensed gestures.
Further additionally or alternatively, the device includes a speaker configured to fit into the user's ear, wherein the processing circuitry is configured to synthesize an audio signal corresponding to the speech output for playback by the speaker.
According to an embodiment of the present invention, a sensing method is further provided, which includes sensing movement of the skin on the face of a human subject, without contacting the skin, in response to the subject articulating a word without vocalizing it. In response to the sensed movement, a speech output including the articulated word is generated.
In some embodiments, sensing the movement includes sensing light reflected from the subject's face. In disclosed embodiments, sensing the light includes directing coherent light toward the skin and sensing a secondary speckle pattern produced by reflection of the coherent light from the skin. In one embodiment, directing the coherent light includes directing multiple beams of the coherent light toward different, respective locations on the face, and sensing the secondary speckle patterns reflected from each of the locations using a sensor array.
In disclosed embodiments, generating the speech output includes synthesizing an audio signal corresponding to the speech output. Alternatively or additionally, generating the speech output includes transcribing the words articulated by the subject.
Brief Description of the Drawings
The present invention will be more fully understood from the following detailed description of embodiments thereof, taken together with the drawings, in which:
Fig. 1 is a schematic pictorial illustration of a system for speech sensing, in accordance with an embodiment of the invention;
Fig. 2 is a schematic sectional view of an optical sensing head, in accordance with an embodiment of the invention;
Fig. 3 is a schematic pictorial illustration of a speech sensing device, in accordance with another embodiment of the invention;
Fig. 4 is a block diagram that schematically shows functional components of a system for speech sensing, in accordance with an embodiment of the invention; and
Fig. 5 is a flow chart that schematically illustrates a method for speech sensing, in accordance with an embodiment of the invention.
Detailed Description
People communicate almost anytime, anywhere via their mobile phones. The widespread use of mobile phones in public places introduces discordant noise and often raises privacy concerns, since conversations are easily overheard by passersby. Furthermore, when one party to a phone conversation is in a noisy location, the other party or parties may have difficulty understanding what they hear due to the background noise. Text communication offers a solution to these problems, but text input on a mobile phone is slow and interferes with the users' ability to see where they are going.
The embodiments of the present invention that are described herein address these problems using silent speech, enabling users to say words and sentences without actually vocalizing the words, or indeed without making any sound at all. Normal vocalization uses multiple groups of muscles and nerves, beginning in the chest and abdomen, continuing through the throat, and up through the mouth and face. To articulate a given phoneme, motor neurons activate muscle groups in the face, larynx, and mouth in preparation for the flow of air pushed out of the lungs, and these muscles continue to move in the course of speech to create words and sentences. Without this airflow, the mouth produces no sound. Silent speech occurs when there is no airflow from the lungs, while the muscles of the face, larynx, and mouth continue to articulate the desired sounds.
Silent speech can result from neurological and muscular disorders; but it can also occur intentionally, for example when we articulate words that we do not wish others to hear. This sort of articulation takes place even when we conceptualize spoken words without opening our mouths. The resulting activation of our facial muscles gives rise to minute movements of the skin surface. The inventors have found that by appropriate sensing and decoding of these movements, it is possible to reconstruct reliably the actual sequence of words articulated by the user.
The embodiments of the present invention that are described herein thus sense the minute movements of the skin and underlying nerves and muscles on a subject's face that occur in response to words articulated by the subject, with or without vocalization, and use the sensed movements to generate a speech output including the articulated words. These embodiments provide methods and devices for sensing these minute movements without contacting the skin, for example by sensing light reflected from the subject's face. They thus enable users to communicate silently with other parties, or to record their own thoughts, in a manner that is substantially imperceptible to others. Devices and methods in accordance with these embodiments are also insensitive to ambient noise and can be used in essentially any environment, without requiring users to divert their eyes and attention from their surroundings.
Some embodiments of the present invention provide sensing devices in the form of common consumer items, such as clip-on headphones or eyeglasses. In these embodiments, an optical sensing head is held in a position close to the user's face by a support that fits in or over the user's ear. The optical sensing head senses light reflected from the face, for example by directing coherent light toward an area of the face, such as the cheek, and sensing changes in the secondary speckle patterns produced by reflection of the coherent light from the face. Processing circuitry in the device processes the signal that is output by the optical sensing head in response to the reflected light so as to generate a corresponding speech output.
Alternatively, the principles of the present invention may be implemented without an ear clip or other support. For example, in an alternative embodiment, a silent-speech sensing module, comprising a coherent light source and a sensor, may be integrated into a mobile communication device, such as a smartphone. The integrated sensing module senses silent speech when the user holds the mobile communication device in a suitable position close to the user's face.
The term "light," as used in the present description and in the claims, refers to electromagnetic radiation in any or all of the infrared, visible, and ultraviolet ranges.
Fig. 1 is a schematic pictorial illustration of a system 18 for speech sensing, in accordance with an embodiment of the invention. System 18 is based on a sensing device 20, in which a support in the form of an ear clip 22 fits over the ear of a user 24 of the device. An earphone 26 attached to ear clip 22 fits into the user's ear. An optical sensing head 28 is connected by an arm 30 to ear clip 22 and is thus held in a position close to the user's face. In the pictured embodiment, device 20 has the form and appearance of a clip-on headphone, with the optical sensing head in place of (or in addition to) a microphone.
Optical sensing head 28 directs one or more beams of coherent light toward different, respective locations on the face of user 24, creating an array of spots 32 extending over an area 34 of the face, specifically on the user's cheek. In the present embodiment, optical sensing head 28 does not contact the user's skin at all, but is rather held at a certain distance from the skin surface. Typically this distance is at least 5 mm, and it may be even greater, for example at least 1 cm, or even 2 cm or more from the skin surface. To enable sensing of the movements of different parts of the facial muscles, area 34 covered by spots 32 and sensed by optical sensing head 28 typically has an extent of at least 1 cm²; and larger areas, for example at least 2 cm² or even greater than 4 cm², may be advantageous.
Optical sensing head 28 senses the coherent light reflected from spots 32 on the face and outputs signals in response to the detected light. Specifically, optical sensing head 28 senses the secondary speckle patterns produced by reflection of the coherent light from each of spots 32 within its field of view. To cover a large enough area 34, this field of view typically has a wide angular range, typically with an angular width of at least 60°, or possibly 70°, or even 90° or more. Within this field of view, device 20 may sense and process the signals arising from the secondary speckle patterns of all of spots 32, or of only a certain subset of spots 32. For example, device 20 may select a subset of the spots that is found to give the greatest amount of useful, reliable information with respect to the relevant movements of the skin surface of user 24. Details of the structure and operation of optical sensing head 28 are described below with reference to Fig. 2.
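The selection of an informative subset of spots might, for example, rank the spots by the temporal variance of their motion traces. This is a minimal stand-in criterion; the disclosure does not specify the selection metric, and the array sizes below are illustrative.

```python
import numpy as np

def select_spots(signals: np.ndarray, k: int) -> list:
    """signals: array of shape (n_spots, n_frames), one motion trace per
    illuminated spot. Return indices of the k spots with the highest
    temporal variance, a simple proxy for information content."""
    variances = signals.var(axis=1)
    return sorted(np.argsort(variances)[-k:].tolist())

# Demo: six spots, of which two carry an articulation-like oscillation.
rng = np.random.default_rng(1)
traces = rng.normal(0.0, 0.1, size=(6, 100))          # baseline sensor noise
traces[2] += np.sin(np.linspace(0, 20, 100))          # spot 2: skin movement
traces[5] += np.cos(np.linspace(0, 20, 100))          # spot 5: skin movement
print(select_spots(traces, 2))  # -> [2, 5]
```

Restricting processing to such a subset reduces both the computational load and, when paired with selective emitter actuation, the device's power consumption.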
Within system 18, processing circuitry processes the signals output by optical sensing head 28 to generate a speech output. As explained earlier, the processing circuitry is able to sense the movements of the skin of user 24 and generate the speech output even when the user does not vocalize the speech or utter any other sound. The speech output may take the form of a synthesized audio signal, or a textual transcription, or both. The synthesized audio signal can be played back via a speaker in earphone 26 (and is useful in giving user 24 feedback with regard to the speech output). Additionally or alternatively, the synthesized audio signal may be transmitted over a network, for example via a communication link with a mobile communication device, such as a smartphone 36.
The functions of the processing circuitry in system 18 may be carried out entirely within device 20, or they may alternatively be distributed between device 20 and an external processor, such as a processor in smartphone 36 running suitable application software. For example, the processing circuitry within device 20 may digitize and encode the signals output by optical sensing head 28 and transmit the encoded signals over a communication link to smartphone 36. The communication link may be wired or wireless, for example using the Bluetooth™ wireless interface provided by the smartphone. The processor in smartphone 36 processes the encoded signals so as to generate the speech output. Smartphone 36 may also access a server 38 over a data network, such as the Internet, for example in order to upload data and download software updates. Details of the design and operation of the processing circuitry are described below with reference to Fig. 4.
In the pictured embodiment, device 20 also includes a user control 35, for example in the form of a push-button or proximity sensor, which is connected to ear clip 22. User control 35 senses gestures performed by the user, such as pressing on user control 35 or otherwise bringing the user's finger or hand into proximity with the user control. In response to an appropriate user gesture, the processing circuitry changes the operational state of device 20. For example, user 24 may switch device 20 in this manner from an idle mode to an active mode, and thus signal that the device should begin sensing and generating the speech output. This sort of switching is useful in conserving battery power in device 20. Alternatively or additionally, other means may be applied in controlling the operational state of device 20 and reducing unnecessary power consumption, for example as described below with reference to Fig. 5.
Fig. 2 is a schematic sectional view of optical sensing head 28 of device 20, showing details of the components and functionality of the optical sensing head, in accordance with an embodiment of the invention. Optical sensing head 28 comprises a transmitter module 40 and a receiver module 48, along with an optional microphone 54.
Transmitter module 40 comprises a light source, such as an infrared laser diode 42, which emits an input beam of coherent radiation. A beamsplitting element 44, such as a Dammann grating or another suitable type of diffractive optical element (DOE), splits the input beam into multiple output beams 46, which form respective spots 32 at a matrix of locations extending over area 34. In one embodiment (not shown in the figures), transmitter module 40 comprises multiple laser diodes or other emitters, which create respective sets of output beams 46 covering different, respective sub-areas within area 34 of the user's face. In this case, the processing circuitry in device 20 may select and actuate only a subset of the emitters, without actuating all the emitters. For example, to reduce the power consumption of device 20, the processing circuitry may actuate only a single emitter, or a subset of two or more emitters, that illuminates an area of the user's face that has been found to give the most useful information for generating the desired speech output.
Receiver module 48 comprises an array 52 of optical sensors, such as a CMOS image sensor, with an objective lens 50 that images area 34 onto array 52. Because of the small size of optical sensing head 28 and its proximity to the skin surface, as described above, receiver module 48 has a sufficiently wide field of view and observes many of spots 32 at high angles away from the normal. Because the skin surface is rough, the secondary speckle patterns at spots 32 can be detected even at these high angles.
Microphone 54 senses sounds uttered by user 24, enabling the user to use device 20 as a conventional headphone when desired. Additionally or alternatively, microphone 54 may be used in conjunction with the silent-speech sensing capabilities of device 20. For example, microphone 54 may be used in a calibration procedure, in which optical sensing head 28 senses the movements of the skin while user 24 utters certain phonemes or words. The processing circuitry may then compare the signal output by optical sensing head 28 with the sounds sensed by microphone 54 in order to calibrate the optical sensing head. This calibration may include prompting user 24 to shift the position of optical sensing head 28 so as to align the optics at a desired location relative to the user's cheek.
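One simple way such a calibration comparison might be scored is the zero-lag correlation between the envelope of the optically sensed motion and the envelope of the microphone signal. The metric and any repositioning threshold are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def calibration_score(optical: np.ndarray, audio: np.ndarray) -> float:
    """Normalized correlation at zero lag between the optical motion
    envelope and the microphone envelope. A score near 1.0 indicates the
    sensing head is well aligned; a low score might prompt the user to
    reposition it (the threshold would be chosen empirically)."""
    o = (optical - optical.mean()) / (optical.std() + 1e-9)
    a = (audio - audio.mean()) / (audio.std() + 1e-9)
    return float(np.mean(o * a))

# Demo: identical envelopes correlate perfectly.
t = np.linspace(0.0, 1.0, 200)
envelope = np.abs(np.sin(6 * np.pi * t))
print(round(calibration_score(envelope, envelope), 2))  # -> 1.0
```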
In another embodiment, the audio signal output by microphone 54 may be used in changing the operational state of device 20. For example, the processing circuitry may generate the speech output only when microphone 54 does not detect vocalization of the words by user 24. Other applications of the combination of the optical and acoustic sensing afforded by optical sensing head 28 and microphone 54 will be apparent to those skilled in the art after reading the present description and are considered to be within the scope of the present invention.
Fig. 3 is a schematic pictorial illustration of a speech sensing device 60, in accordance with another embodiment of the invention. In this embodiment, ear clip 22 is integrated with or otherwise attached to an eyeglass frame 62. A nasal electrode 64 and a temporal electrode 66 are attached to frame 62 and contact the surface of the user's skin. Electrodes 64 and 66 receive surface electromyogram (sEMG) signals, which provide additional information with regard to activation of the user's facial muscles. The processing circuitry in device 60 uses the electrical activity sensed by electrodes 64 and 66, together with the signals output by optical sensing head 28, in generating the speech output from device 60.
Additionally or alternatively, device 60 comprises one or more additional optical sensing heads 68, similar to optical sensing head 28, for sensing skin movements in other areas of the user's face. These additional optical sensing heads may be used together with or instead of optical sensing head 28.
Fig. 4 is a block diagram that schematically shows functional components of system 18 for speech sensing, in accordance with an embodiment of the invention. The pictured system is built around the components shown in Fig. 1, including sensing device 20, smartphone 36, and server 38. Alternatively, the functions that are shown in Fig. 4 and described below may be implemented and distributed differently among the components of the system. For example, some or all of the processing capabilities attributed to smartphone 36 may be implemented in the sensing device; or the sensing capabilities of device 20 may be implemented in smartphone 36.
In the pictured example, sensing device 20 comprises transmitter module 40, receiver module 48, speaker 26, microphone 54, and user control (UI) 35, as described above. For the sake of completeness, sensing device 20 is shown in Fig. 4 as also comprising other sensors 71, such as electrodes and/or environmental sensors; but as noted earlier, sensing device 20 is capable of operating solely on the basis of the non-contact measurements made by the transmitter and receiver modules.
Sensing device 20 comprises processing circuitry in the form of an encoder 70 and a controller 75. Encoder 70 comprises hard-wired or programmable hardware processing logic and/or a digital signal processor, which extracts and encodes features of the output signals from receiver module 48. Sensing device 20 transmits the encoded signals via a communication interface 72, such as a Bluetooth interface, to a corresponding communication interface 77 in smartphone 36. A battery 74 provides operating power to the components of sensing device 20.
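The sort of temporal and spectral feature extraction that such an encoder might perform on each spot's motion trace can be sketched as follows. The particular features (mean, deviation, mean absolute difference, and banded spectral magnitudes) are illustrative assumptions; the actual feature set is dictated by the training interface, as described below.

```python
import numpy as np

def encode_features(trace: np.ndarray, n_bands: int = 8) -> np.ndarray:
    """Extract a compact temporal + spectral feature vector from one spot's
    motion trace, the kind of encoding an on-device encoder might apply
    before transmitting the signal over a low-bandwidth link."""
    # Temporal features: level, spread, and mean frame-to-frame change.
    temporal = np.array([trace.mean(),
                         trace.std(),
                         np.abs(np.diff(trace)).mean()])
    # Spectral features: magnitude spectrum averaged over n_bands bands.
    spectrum = np.abs(np.fft.rfft(trace - trace.mean()))
    spectral = np.array([band.mean() for band in np.array_split(spectrum, n_bands)])
    return np.concatenate([temporal, spectral])
```

Transmitting an 11-element feature vector per spot per frame window, rather than raw images, is what makes a low-power wireless link such as Bluetooth practical here.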
控制器75包括可编程的微控制器,例如,该微控制器基于从用户控件35、接收器模块48和智能手机36(经由通信接口72)接收的输入来设置感测设备20的操作状态和操作参数。下面参照图5描述此功能的一些方面。在替代实施例中,控制器75包括更强大的微处理器和/或处理阵列,其独立于智能手机36,在感测设备内本地处理来自接收器模块48的输出信号的特征并生成语音输出。The controller 75 includes a programmable microcontroller, which, for example, sets the operating state and operating parameters of the sensing device 20 based on inputs received from the user control 35, the receiver module 48, and the smartphone 36 (via the communication interface 72). Some aspects of this functionality are described below with reference to FIG5. In an alternative embodiment, the controller 75 includes a more powerful microprocessor and/or processing array, independent of the smartphone 36, which locally processes the characteristics of the output signal from the receiver module 48 and generates voice output within the sensing device.
然而,在本实施例中,来自感测设备20的经编码的输出信号被接收到智能手机36的存储器78中,并由在智能手机36中的处理器上运行的语音生成应用80处理。语音生成应用80将输出信号中的特征转换成文本和/或音频输出信号形式的单词序列。通信接口77将音频输出信号传递回感测设备20的扬声器26,以便回放给用户。来自语音生成应用80的文本和/或音频输出也被输入到其他应用84,例如话音和/或文本通信应用以及记录应用。通信应用例如经由数据通信接口86通过蜂窝或Wi-Fi网络进行通信。However, in this embodiment, the encoded output signal from the sensing device 20 is received in the memory 78 of the smartphone 36 and processed by a speech generation application 80 running on the processor in the smartphone 36. The speech generation application 80 converts features in the output signal into a sequence of words in the form of text and/or audio output signals. The communication interface 77 transmits the audio output signal back to the speaker 26 of the sensing device 20 for playback to the user. The text and/or audio output from the speech generation application 80 is also input to other applications 84, such as voice and/or text communication applications and recording applications. The communication applications communicate, for example, via a cellular or Wi-Fi network through a data communication interface 86.
编码器70和语音生成应用80的操作由本地训练接口82控制。例如,接口82可以向编码器70指示从由接收器模块48输出的信号中提取哪些时间特征和频谱特征,并且可以向语音生成应用80提供神经网络的系数,神经网络将这些特征转换成单词。在本示例中,语音生成应用80实现推断网络,该推断网络查找与从感测设备20接收的经编码的信号特征相对应的、具有最高概率的单词序列。本地训练接口82从服务器38接收推断网络的系数,服务器38也可以周期性地更新系数。The operation of encoder 70 and speech generation application 80 is controlled by local training interface 82. For example, interface 82 can instruct encoder 70 which temporal and spectral features to extract from the signal output by receiver module 48, and can provide speech generation application 80 with coefficients of a neural network that converts these features into words. In this example, speech generation application 80 implements an inference network that finds the word sequence with the highest probability corresponding to the encoded signal features received from sensing device 20. Local training interface 82 receives coefficients from inference network from server 38, which can also periodically update the coefficients.
To generate the instructions provided through local training interface 82, server 38 uses a data repository 88, which contains speckle images and corresponding ground-truth spoken words from a corpus of training data 90. Repository 88 also receives training data collected in the field from sensing devices 20. For example, the training data may comprise signals collected from sensing device 20 while the user utters certain sounds and words, possibly including both silent and voiced speech. This combination of general training data 90 with personal training data received from the user of each sensing device 20 enables server 38 to derive optimal inference network coefficients for each individual user.
Server 38 applies image analysis tools 94 to extract features from the speckle images in repository 88. These image features are input as training data to a neural network 96, together with a corresponding word dictionary 104 and a language model 100, which defines the phonetic structure and syntactic rules of the particular language used in the training data. Neural network 96 generates optimal coefficients for an inference network 102, which converts input sequences of feature sets, extracted from corresponding sequences of speckle measurements, into corresponding phonemes and ultimately into output sequences of words. Further details of the network architecture and training process are described in the above-mentioned provisional patent application. Server 38 downloads the coefficients of inference network 102 to smartphone 36 for use in speech generation application 80.
Fig. 5 is a flow chart that schematically illustrates a method for speech sensing, in accordance with an embodiment of the invention. For the sake of convenience and clarity, the method is described with reference to the elements of system 18, as shown in Figs. 1 and 4 and described above. Alternatively, the principles of this method may be applied in other system configurations, for example using sensing device 60 (Fig. 3) or a sensing device integrated into a mobile communication device.
At an idle step 110, as long as user 24 is not speaking, sensing device 20 operates in a low-power idle mode in order to conserve power in battery 74. In this mode, controller 75 drives the array 52 of sensors in receiver module 48 at a low frame rate, for example 20 frames/sec. Transmitter module 40 may also operate at reduced output power. While receiver module 48 operates at this low frame rate, controller 75 processes the images output by array 52 in order to detect facial movements indicative of speech, at a motion detection step 112. When such movement is detected, at an active capture step 114, controller 75 instructs receiver module 48, along with the other components of sensing device 20, to increase the frame rate, for example to the range of 100-200 frames/sec, in order to enable detection of the changes in the secondary speckle patterns that occur due to silent speech. Alternatively or additionally, controller 75 may increase the frame rate and power up the other components of sensing device 20 in response to other inputs, such as actuation of user control 35 or instructions received from smartphone 36.
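The idle/capture switching logic described above can be sketched as a simple state machine. This is an illustrative sketch only: the class name, the specific frame rates, and the hypothetical `motion_score` threshold are assumptions for the example, not part of the disclosed design.

```python
IDLE_FPS = 20      # low frame rate while no speech is detected (idle step 110)
ACTIVE_FPS = 150   # within the 100-200 frames/sec active capture range (step 114)

class FrameRateController:
    """Toggles the sensor frame rate between idle and active capture modes."""

    def __init__(self, motion_threshold=0.5):
        self.fps = IDLE_FPS
        self.motion_threshold = motion_threshold

    def on_frame(self, motion_score, user_control_pressed=False):
        """motion_score: a scalar measure of facial movement in the frame."""
        if self.fps == IDLE_FPS:
            # Wake up on detected facial motion or an explicit user control
            if motion_score > self.motion_threshold or user_control_pressed:
                self.fps = ACTIVE_FPS
        elif motion_score <= self.motion_threshold and not user_control_pressed:
            # Drop back to the low-power idle rate when motion stops
            self.fps = IDLE_FPS
        return self.fps
```

In practice the wake-up decision could also weigh inputs from the smartphone, as noted above; the single threshold here only illustrates the two-mode flow.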
The images captured by receiver module 48 typically contain a matrix of the projected laser spots 32, as shown in Fig. 1. At a spot detection step 116, encoder 70 detects the locations of the spots in the images. The encoder could extract features from all the spots; but to conserve power and processing resources, it is desirable for the encoder to select a subset of the spots. For example, local training interface 82 may indicate which subset of the spots contains the largest amount of information with respect to the user's speech, and encoder 70 may then select the spots in this subset. At a cropping step 118, encoder 70 crops a small window out of each image, with each such window containing one of the selected spots.
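The cropping step can be illustrated as follows. The window size and the clamping behavior at the image edges are assumptions made for this sketch; the patent does not specify them.

```python
import numpy as np

def crop_spot_windows(image, spot_centers, window=16):
    """Crop a small square window around each selected laser-spot center.

    image: 2-D intensity array from the sensor array; spot_centers:
    (row, col) pairs for the selected subset of spots; window: side
    length of each cropped window, in pixels.
    """
    half = window // 2
    crops = []
    for r, c in spot_centers:
        # Clamp to the image bounds so spots near an edge still yield
        # a full-sized window
        r0 = max(0, min(r - half, image.shape[0] - window))
        c0 = max(0, min(c - half, image.shape[1] - window))
        crops.append(image[r0:r0 + window, c0:c0 + window])
    return crops
```

Each returned window then feeds the per-spot feature extraction of step 120.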
At a feature extraction step 120, encoder 70 extracts features of the speckle motion from each of the selected spots. For example, encoder 70 may estimate the total energy in each speckle based on the average intensity of the pixels in the corresponding window, and may measure the change in the energy of each speckle over time. Additionally or alternatively, encoder 70 may extract other temporal and/or spectral features of the speckles in the selected subset of spots. Encoder 70 conveys these features to speech generation application 80 (running on smartphone 36), which inputs vectors of the feature values to inference network 102, downloaded from server 38, at a feature input step 122.
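The energy-based features mentioned above can be sketched as below. Using the mean pixel intensity as the energy estimate follows the example in the text; the frame-difference as the temporal feature is an assumption for illustration.

```python
import numpy as np

def speckle_energy_features(window_frames):
    """Estimate per-spot speckle energy and its change over time.

    window_frames: array of shape (T, H, W) holding one cropped spot
    window per captured frame. Returns (energy, delta), where energy[t]
    is the mean pixel intensity of frame t (the total-energy estimate)
    and delta[t] is its frame-to-frame change, a simple temporal feature.
    """
    frames = np.asarray(window_frames, dtype=float)
    energy = frames.mean(axis=(1, 2))            # energy estimate per frame
    delta = np.diff(energy, prepend=energy[0])   # change in energy over time
    return energy, delta
```

Spectral features could be derived similarly, for example by applying an FFT to the energy time series, though the patent leaves the exact feature set to the training interface.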
At a speech output step 124, based on the sequence of feature vectors input to the inference network over time, speech generation application 80 outputs a stream of words, which are concatenated together into sentences. As noted earlier, the speech output is used in synthesizing an audio signal for playback via speaker 26. Other applications 84 running on smartphone 36 may post-process the speech and/or audio signal, for example to record the corresponding text and/or transmit the speech or text data over a network, at a post-processing step 126.
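The data flow from feature vectors to a word stream can be illustrated with the loop below. The `infer` callable is a stand-in for the downloaded inference network, and the greedy argmax-per-step decoding is an assumption for the sketch; the actual network decodes through phoneme sequences and a language model, as described above.

```python
def decode_word_stream(feature_vectors, infer, vocabulary):
    """Convert a sequence of feature vectors into a sentence string.

    infer: maps one feature vector to a list of per-word probabilities
    over `vocabulary`. Each step's most probable word is appended to
    the output, and the words are joined into a sentence.
    """
    words = []
    for vec in feature_vectors:
        probs = infer(vec)
        # Pick the word with the highest probability for this step
        best = max(range(len(probs)), key=probs.__getitem__)
        words.append(vocabulary[best])
    return " ".join(words)
```

A real decoder would search over whole word sequences rather than choosing each word independently; the loop here only shows how the per-step outputs are concatenated.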
It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.
Claims (30)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US63/229,091 | 2021-08-04 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK40112939A true HK40112939A (en) | 2025-01-28 |