CN116721657A - Head wearing device for sound enhanced recording - Google Patents
- Publication number
- CN116721657A (application number CN202310627833.XA)
- Authority
- CN
- China
- Prior art keywords
- signal
- sound signal
- sound
- head
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G02—OPTICS
- G02C—SPECTACLES; SUNGLASSES OR GOGGLES INSOFAR AS THEY HAVE THE SAME FEATURES AS SPECTACLES; CONTACT LENSES
- G02C11/00—Non-optical adjuncts; Attachment thereof
- G02C11/10—Electronic devices other than hearing aids
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02165—Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a head wearing device for sound-enhanced recording, comprising a front portion, side portions, and a signal synthesis module. The front portion carries two forward microphone modules for recording the user's own speech and the speech of the person facing the user, and the side portions carry two lateral microphone modules for recording environmental noise. The signal synthesis module performs enhancement processing on the two forward sound signals to obtain an enhanced forward sound signal, generates masking thresholds of the enhanced forward sound signal on different frequency bands using each forward sound signal and each lateral sound signal, and performs noise-reduction processing on the enhanced forward sound signal using those masking thresholds to obtain the output speech signal. The invention can effectively separate the speech of the user and the facing speaker from noise, and improves the quality of the recorded sound.
Description
Technical Field
The present invention relates to a sound recording device, and more particularly to a head-worn device with a sound-enhanced recording function, such as glasses with sound pickup and playback functions.
Background
Head wearing devices with a sound recording function are not uncommon. For portability, recording elements have been integrated into various portable items such as watches, mobile phones, headphones, helmets, and glasses. Integrating the recording element into glasses offers good portability and good control over the recording direction. For example, US Patent No. US7792552B2 proposes glasses for wireless communication. As shown in fig. 1, the glasses comprise a front frame 1 and two bendable arms, a left arm 2 and a right arm 3, located on the two sides of the front frame 1. The left arm 2 and the right arm 3 fold against the front frame 1 after being bent, and the front frame holds two lenses. A microphone 4 for recording sound is provided in the front frame 1, midway between the two lenses. However, these glasses cannot effectively remove environmental noise, so the recording quality is low; they are also poorly suited to processing the sound to improve its quality, which hinders subsequent higher-level operations such as speech recognition.
As shown in fig. 2, Chinese patent application publication No. CN114339524A proposes a head wearing device with two microphones at the front and two microphones at the sides. However, those microphones are positioned for recording left-right stereo sound, not for noise reduction or speech enhancement.
Disclosure of Invention
The invention aims to solve the problem that existing head wearing devices cannot effectively remove environmental noise during sound pickup and therefore cannot deliver good recording quality.
To solve the above technical problems, the present invention provides a head wearing device for sound-enhanced recording, comprising a front portion and side portions. When the head wearing device is worn on the user's head, the front portion faces the front of the user's face, and the side portions, located at the two ends of the front portion, face the two sides of the face. A first forward microphone module and a second forward microphone module are arranged in the front portion and are used to record the user's own speech and the speech of the person facing the user. A first lateral microphone module and a second lateral microphone module are arranged on the two sides of the side portions and are used to record environmental noise. The head wearing device further comprises a signal synthesis module for: acquiring the first and second forward sound signals generated by the first and second forward microphone modules, and the first and second lateral sound signals generated by the first and second lateral microphone modules; performing enhancement processing on the combination of the first and second forward sound signals to obtain an enhanced forward sound signal; generating masking thresholds of the enhanced forward sound signal on different frequency bands using each forward sound signal and each lateral sound signal; and performing noise-reduction processing on the enhanced forward sound signal using the masking thresholds to obtain an output speech signal.
According to a preferred embodiment of the present invention, the first and second forward microphone modules lie in the same horizontal plane when the head wearing device is worn on the user's head, and the step of combining the first and second forward sound signals for enhancement processing comprises: adding the sound signals generated by the first and second forward microphone modules and dividing by two to obtain a signal d(n), and subtracting them and dividing by two to obtain a signal x(n), where n denotes the time index; processing the signal x(n) with an adaptive filter to obtain a signal y(n); and taking the enhanced output signal e(n), obtained by subtracting y(n) from d(n), as the enhanced forward sound signal, while the signal e(n) is input to the parameter-updating module of the adaptive filter to update the filter's parameters.
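The sum/difference front end described above can be sketched as follows (an illustrative Python sketch, not part of the patent text; function and variable names are assumptions):

```python
import numpy as np

def sum_difference_channels(s1, s2):
    """Form the sum channel d(n) and difference channel x(n) from the
    two forward microphone signals, per the enhancement step above."""
    d = (s1 + s2) / 2.0  # frontal speech adds coherently in d(n)
    x = (s1 - s2) / 2.0  # frontal speech cancels in x(n), leaving noise
    return d, x

# A source directly in front arrives at both microphones identically,
# so it is preserved in d(n) and suppressed in x(n).
s = np.array([1.0, -2.0, 3.0])
d, x = sum_difference_channels(s, s)
```

This is why the patent can treat d(n) as the speech-dominant channel and x(n) as a noise reference for the adaptive filter.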
According to a preferred embodiment of the present invention, the step of generating masking thresholds of the enhanced forward sound signal on different frequency bands using each forward sound signal and each side sound signal comprises:
<1> performing time-frequency decomposition on each forward sound signal and each lateral sound signal, converting them from time-domain signals to frequency-domain signals;
<2> adding the time-frequency units of the same frequency band of the first and second forward sound signals to obtain the time-frequency units of the forward sound signal; adding the time-frequency units of the same frequency band of the first and second lateral sound signals to obtain the time-frequency units of the lateral sound signal;
<3> performing time-frequency compensation on each time-frequency unit of the lateral sound signal;
<4> calculating the IID (interaural intensity difference) and ITD (interaural time difference) values for each pair of time-frequency units of the forward and lateral sound signals, thereby generating a preliminary masking threshold for the enhanced forward sound signal.
According to a preferred embodiment of the present invention, the first and second forward sound signals are equalized before their same-band time-frequency units are added, so that the energy values of the two signals in each frequency band tend to coincide.
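One way to read this band equalization is as a per-band gain on one channel; a minimal sketch under that interpretation (not the patented method):

```python
import numpy as np

def equalize_forward_bands(F1, F2, eps=1e-12):
    """Scale each frequency band of the second forward spectrogram so its
    band energy matches the first, making the two channels' per-band
    energies tend to coincide.  F1, F2: (frames, bands) magnitude arrays."""
    e1 = np.sum(F1 ** 2, axis=0)            # per-band energy, channel 1
    e2 = np.sum(F2 ** 2, axis=0)            # per-band energy, channel 2
    gain = np.sqrt((e1 + eps) / (e2 + eps))  # per-band gain for channel 2
    return F1, F2 * gain                     # gain broadcasts over frames

rng = np.random.default_rng(1)
F1 = rng.random((20, 8)) + 0.1
F2 = 3.0 * (rng.random((20, 8)) + 0.1)       # mismatched second channel
F1eq, F2eq = equalize_forward_bands(F1, F2)
```

After this step the two channels contribute comparably when their same-band time-frequency units are added.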
According to a preferred embodiment of the present invention, the step of performing noise reduction processing on the enhanced forward sound signal using the masking threshold to obtain an output speech signal includes:
<1> performing voice activation detection on the forward sound signal;
<2> obtaining a final masking threshold of the enhanced forward sound signal based on the voice activation detection result and the preliminary masking threshold;
<3> smoothing the final masking threshold and masking the enhanced speech signal using the smoothed final masking threshold;
<4> each time-frequency unit of the enhanced speech signal subjected to the masking processing is converted into the time domain, resulting in an output speech signal.
According to a preferred embodiment of the present invention, the step of performing time-frequency compensation on each time-frequency unit of the side-direction sound signal includes: and updating the time-frequency compensation parameters according to the voice detection result.
According to a preferred embodiment of the present invention, the head wearing device is a pair of glasses, the front portion of which is a front frame for holding the lenses, and the side portions of which are two temples located on the two sides of the front frame; the first and second forward microphone modules are positioned in the middle of the front frame, and the first lateral microphone module, the second lateral microphone module, and the signal synthesis module are located in the temples.
According to a preferred embodiment of the present invention, the head wearing apparatus further includes a sound playing module for playing sound to a user according to the output signal.
According to a preferred embodiment of the present invention, the head wearing apparatus further comprises a voice recognition module for performing voice recognition on the output signal to convert the sound signal into text information.
According to a preferred embodiment of the present invention, the head wearing apparatus further includes an information display module for displaying the text information.
According to a preferred embodiment of the present invention, the voice recognition module is further configured to translate the recognized text information into a language specified by the user and send the translated text to the information display module for display.
According to a preferred embodiment of the present invention, the information display module is a retina projection device for projecting the text information onto the retina of the user.
According to a preferred embodiment of the present invention, the voice recognition module is further configured to translate the recognized text information into the language specified by the user, convert the translated text into a sound signal, and send the sound signal to the sound playing module for playing.
The invention can effectively separate the speech of the user and the facing speaker from environmental noise, reduce that noise, and improve the quality of the recorded sound.
Drawings
To make the technical problems solved by the present invention, the technical means adopted, and the technical effects achieved clearer, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings described below illustrate only exemplary embodiments of the present invention, and that those skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a prior art wireless communication eyewear;
fig. 2 is a schematic structural view of a prior-art head wearing device provided with four microphones;
FIG. 3 is a schematic diagram of an NLMS algorithm employed by one embodiment of the present invention to obtain an enhanced forward sound signal;
FIG. 4 is a front view of a head wearable device with sound enhancement recording capabilities in accordance with an embodiment of the present invention;
FIG. 5 is a schematic view showing a state in which a head-wearing apparatus having a sound-enhancement recording function of an embodiment of the present invention is worn on a user's head;
FIG. 6 is a schematic side view of a head-mounted device with sound collection capabilities in accordance with an embodiment of the present invention;
fig. 7 is a flow diagram of one embodiment of a signal synthesis module according to the present invention generating an output speech signal.
Fig. 8 is a block diagram of a sound pickup apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments are shown. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.
The structures, capabilities, effects, or other features described in a particular embodiment may be incorporated in one or more other embodiments in any suitable manner without departing from the spirit of the present invention.
In describing particular embodiments, specific details of construction, performance, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by those skilled in the art. It is not excluded, however, that one skilled in the art may implement the present invention in a particular situation in a solution that does not include the structures, properties, effects, or other characteristics described above.
The flow diagrams in the figures are merely exemplary flow illustrations and do not represent that all of the elements, operations, and steps in the flow diagrams must be included in the aspects of the invention, nor that the steps must be performed in the order shown in the figures. For example, some operations/steps in the flowcharts may be decomposed, some operations/steps may be combined or partially combined, etc., and the order of execution shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The same reference numerals in the drawings denote the same or similar elements, components, or portions, so repeated descriptions of them may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components, or portions, these devices, elements, components, or portions should not be limited by these terms; the terms merely distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the invention. Furthermore, the term "and/or" is meant to include all combinations of any one or more of the items listed.
The invention provides a head wearing device for sound-enhanced recording. The term head wearing device as used herein refers to any device that can be worn on a human head and is not limited to a particular type. However, to implement the sound recording scheme of the invention, the head wearing device must have a front portion and side portions for holding the microphones: when the device is worn on the user's head, the front portion faces the front of the user's face and the side portions face the two sides of the face. Equivalently, the side portions are located at the two ends of the front portion, so that when the device is worn, the front portion and the side portions at its ends enclose the front and both sides of the user's face. Typical examples of the head wearing device of the invention are therefore framed glasses, VR glasses, hoods, helmets, headbands, hats, and headphones that have portions at the front and sides of the head.
According to the invention, two microphone modules are arranged in the front portion; a microphone module here means a functional microphone unit with a sound pickup element and a signal preprocessing element. These two modules are referred to herein as the first forward microphone module and the second forward microphone module. They may lie in the same horizontal plane when the head wearing device is worn on the user's head, and are primarily intended to capture the user's own speech and the speech of the person facing the user. For example, when the head wearing device is a pair of glasses, the two forward microphones are located in the middle of the front frame of the glasses.
According to the invention, two microphone modules are also provided in the side portions, referred to herein as the first lateral microphone module and the second lateral microphone module. These two modules may likewise lie in the same horizontal plane when the head wearing device is worn, and are primarily intended to capture ambient noise.
In the present invention, the two lateral microphones are preferably located on the two sides of the user's face, i.e., on the side portions at different ends of the front portion. For example, when the head wearing device is a pair of glasses, the two lateral microphones are located on the two different temples.
According to the present invention, the head wearing device further includes a signal synthesis module that acquires the sound signals recorded by the first and second forward microphone modules (the first and second forward sound signals) together with the sound signals generated by the first and second lateral microphone modules (the first and second lateral sound signals). The signal synthesis module performs enhancement processing on the combination of the first and second forward sound signals, boosting the signal arriving from in front of the user and suppressing noise from other directions; the result is therefore called the enhanced forward sound signal. The signal synthesis module then uses the lateral sound signals to filter out environmental noise: it generates masking thresholds of the enhanced forward sound signal on different frequency bands from each forward sound signal and each lateral sound signal, and applies those thresholds in a noise-reduction step to obtain the output speech signal. The invention can thus clearly record the sound in front of the wearer while reducing lateral and surrounding environmental noise. For glasses, helmets, and the like with a recording function, this improves sound recording quality and user experience. The signal synthesis module is built from elements with signal processing capability, such as a single-chip microcontroller, a DSP, or an FPGA.
More preferably, the signal synthesis module of the present invention may apply the classical NLMS (normalized least mean squares) adaptive-filter algorithm to the first and second forward sound signals for speech enhancement.
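As a rough illustration of this step, a minimal NLMS noise canceller over the sum/difference channels might look like the following (a sketch under assumed parameter values M, mu; not the patent's implementation):

```python
import numpy as np

def nlms_enhance(d, x, M=8, mu=0.5, eps=1e-8):
    """Filter the difference channel x(n) with an adaptive M-tap filter to
    estimate the noise component of the sum channel d(n); the residual
    e(n) = d(n) - y(n) is the enhanced forward signal."""
    w = np.zeros(M)                  # filter taps W_1 .. W_M
    xbuf = np.zeros(M)               # delay line x(n) .. x(n-M+1)
    e = np.zeros(len(d))
    for n in range(len(d)):
        xbuf = np.roll(xbuf, 1)      # shift the delay line by one sample
        xbuf[0] = x[n]
        y = w @ xbuf                 # filter output y(n)
        e[n] = d[n] - y              # enhanced output e(n)
        w += mu * e[n] * xbuf / (xbuf @ xbuf + eps)  # NLMS parameter update
    return e

# When d(n) and x(n) carry fully correlated noise, the filter converges
# and the residual decays toward zero.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
e = nlms_enhance(noise, noise)
```

The normalization by the delay-line energy is what distinguishes NLMS from plain LMS and keeps the step size stable for loud and quiet passages alike.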
Fig. 3 is a schematic diagram of the NLMS algorithm employed by an embodiment of the present invention to obtain the enhanced forward sound signal. As shown in fig. 3, the sound signals (time-domain sampled signals) obtained by the first and second forward microphone modules are added and divided by two to obtain the signal d(n), and are subtracted and divided by two to obtain the signal x(n); the signal x(n) is processed by an M-order adaptive filter to obtain the signal y(n). Subtracting the signal y(n) from the signal d(n) yields the enhanced output signal e(n) (the enhanced forward sound signal of the signal synthesis module), while the signal e(n) is input to the parameter-updating module of the adaptive filter to update the parameters of the M-order filter. In the figure, n denotes the time index, z^(-1) denotes a one-sample delay unit, M is the order of the filter, W_1, W_2, ..., W_M are the parameters of the M-order filter, and the M delayed versions of the signal x(n), namely x(n), x(n-1), x(n-2), ..., x(n-M+1), are denoted x_1(n), x_2(n), ..., x_M(n), respectively. In this embodiment, the adaptive filter may be an NLMS (normalized least mean squares) filter, and the algorithm adopted by its parameter-updating module is the NLMS algorithm. As one embodiment, the step in which the signal synthesis module generates the masking thresholds of the enhanced forward sound signal on different frequency bands using each forward sound signal and each lateral sound signal includes:
(1) performing time-frequency decomposition on each forward sound signal and each lateral sound signal, converting the time-domain signals into frequency-domain signals;
(2) adding the time-frequency units of the same frequency band of the first and second forward sound signals to obtain the time-frequency units of the forward sound signal, and adding the time-frequency units of the same frequency band of the first and second lateral sound signals to obtain the time-frequency units of the lateral sound signal; before the same-band time-frequency units of the first and second forward sound signals are added, the two signals are preferably equalized so that their energy values in each frequency band tend to coincide;
(3) performing time-frequency compensation on each time-frequency unit of the lateral sound signal, preferably updating the time-frequency compensation parameters according to the voice detection result;
(4) calculating the IID and ITD values for each pair of time-frequency units of the forward and lateral sound signals, thereby generating a preliminary masking threshold for the enhanced forward sound signal.
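A toy version of steps (1), (2), and (4) above, using only an IID-like level-difference cue and omitting ITD and time-frequency compensation, could look like this (illustrative only; the frame length and dB threshold are assumptions, not values from the patent):

```python
import numpy as np

def preliminary_mask(fwd, side, frame=256, threshold_db=0.0):
    """Per time-frequency-unit binary mask from an IID-like cue: keep a
    unit when the forward (speech) channel dominates the lateral (noise)
    channel in that band."""
    def tf_units(sig):
        n_frames = len(sig) // frame
        frames = sig[:n_frames * frame].reshape(n_frames, frame)
        return np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectra

    F, S = tf_units(fwd), tf_units(side)
    iid = 20 * np.log10((F + 1e-12) / (S + 1e-12))   # level difference, dB
    return (iid > threshold_db).astype(float)        # 1 = keep, 0 = mask out

fs = 8000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 440 * t)   # stand-in for frontal speech
mask = preliminary_mask(speech_like, 0.1 * speech_like)
```

Time-frequency units where the forward channel is much louder than the lateral channel survive; units dominated by lateral (ambient) energy are zeroed out.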
According to a preferred embodiment of the present invention, the step of performing noise reduction processing on the enhanced forward sound signal using the masking threshold to obtain an output speech signal includes:
(1) Performing voice activation detection on the forward sound signal;
(2) Obtaining a final masking threshold of the enhanced forward sound signal according to the voice activation detection result and the preliminary masking threshold;
(3) Smoothing the final masking threshold, and masking the enhanced speech signal by using the smoothed final masking threshold;
(4) And converting each time-frequency unit of the enhanced voice signal subjected to masking processing into a time domain to obtain an output voice signal.
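Steps (1)-(3) above can be sketched as follows (a simplified interpretation with a toy energy-based voice activity detector and first-order recursive smoothing; the smoothing constant is an assumption):

```python
import numpy as np

def energy_vad(frames, ratio=0.5):
    """Toy energy-based voice-activity detector: a frame counts as speech
    when its energy exceeds a fraction of the mean frame energy."""
    e = np.sum(frames ** 2, axis=1)
    return (e > ratio * e.mean()).astype(float)

def apply_smoothed_mask(tf_units, prelim_mask, vad, alpha=0.7):
    """Gate the preliminary mask by the per-frame VAD decision to get the
    final mask, smooth it recursively over time, and apply it to the
    enhanced signal's time-frequency units."""
    final = prelim_mask * vad[:, None]       # zero out frames with no speech
    smoothed = np.empty_like(final)
    prev = np.zeros(final.shape[1])
    for i, row in enumerate(final):          # first-order recursive smoothing
        prev = alpha * prev + (1 - alpha) * row
        smoothed[i] = prev
    return tf_units * smoothed               # masked time-frequency units

tf = np.vstack([np.ones((5, 4)), np.zeros((5, 4))])  # 10 frames, 4 bands
vad = energy_vad(tf)
out = apply_smoothed_mask(tf, np.ones_like(tf), vad)
```

The smoothing avoids abrupt frame-to-frame gain changes ("musical noise") before the masked units are converted back to the time domain in step (4).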
Fig. 4 is a front view of a head wearing device for sound-enhanced recording in accordance with an embodiment of the present invention. As shown in fig. 4, the head wearing device of this embodiment is a pair of glasses comprising two temples, a first temple 1 and a second temple 2, located on the two sides of a front frame 3. The front frame 3 holds two lenses, with a first forward microphone module mic1 and a second forward microphone module mic2 located between them. The first lateral microphone module mic3 and the second lateral microphone module mic4 are arranged in the first temple 1 and the second temple 2, respectively. The signal synthesis module 5 of this embodiment is located in the second temple 2 (not shown in fig. 4).
Fig. 5 is a schematic view showing the head wearing device of the embodiment of the present invention worn on the head of a user. As shown in fig. 5, when the glasses serving as the head wearing device are worn, the first and second forward microphone modules mic1 and mic2 are located in the middle of the front frame 3 and thus directly above the bridge of the nose. The two forward microphone modules lie in the same horizontal plane, i.e., the line connecting mic1 and mic2 is horizontal. The first lateral microphone module mic3 and the second lateral microphone module mic4 are located in the first temple 1 and the second temple 2, respectively.
The glasses of this embodiment add a sound playback function, i.e., playing sound to the user, so the head wearing device of this embodiment further includes a sound playing module. The sound source to be played may in principle be any source, but the invention preferably takes as the source signal the output speech signal obtained by the noise-reduction processing of the forward sound signal described above, as this is particularly useful for people with hearing impairment.
Fig. 6 is a schematic side view of the glasses serving as the head wearing device in accordance with an embodiment of the present invention. As shown in fig. 6, the signal synthesis module 5 is disposed in the second temple 2, and the sound playing module 6 is fixed to the two temples at the portion near the user's ear; it may be, for example, a bone-conduction hearing-aid element, and is electrically connected to the signal synthesis module 5 to receive the output signal. It should be appreciated that the invention is not limited to a particular type of sound playing device: any sound playing element suitable for playing the processed output signal may be used to implement the solution of the invention and achieve the corresponding beneficial effects.
As further shown in fig. 6, the head-wearing apparatus of the invention preferably further includes a voice recognition module 7 for performing speech recognition on the output signal so as to convert the sound signal into text information. In particular, for devices such as framed glasses, VR glasses and helmets, which lend themselves to placing a display element in front of the user's eyes, a preferred scheme of the invention is to perform speech recognition on the output signal; the text information obtained by speech recognition can then be used in various application scenarios.
One application scenario is again that of hearing-impaired users. For a user whose hearing is impaired or nearly lost, converting the opposite speaker's voice into text and displaying it is the most intuitive way to let the user follow the conversation in real time. In this case, the glasses of the above embodiments further include an information display module for displaying the text information. In one embodiment shown in fig. 7, the information display module includes a light source 8, a projection element 9, and a lens adapted to receive the projection from the projection element 9. The information display module is thus a retina projection device composed of the light source 8, the projection element 9 and the lens, which can project the text information onto the retina of the user. After the user puts on the glasses, the opposite speaker's words, as converted by the voice recognition module 7, are displayed in real time in front of the user's eyes. In this embodiment, the projection element 9 receives the converted text information from the speech recognition module 7.
It will be appreciated that the invention is not limited to the type of information display module provided, and that existing components suitable for displaying images in front of the eyes of a user may be employed.
Another application scenario is real-time translation. Here the voice recognition module is further configured to convert the recognized text information into text in a language specified by the user and send it to the information display module for display. If the user and the opposite speaker do not share a language, the invention can convert the recognized words of the opposite speaker into words in the user-specified language. For example, after the user puts on the glasses, a foreign language or dialect uttered by the opposite speaker can be displayed in real time in front of the user as text in a familiar language. Besides being displayed in front of the user's eyes via retina projection, the converted text can also be converted into speech and sent to the sound playing module for the user to listen to. The text-to-speech function may be implemented within the speech recognition module or by a separate module.
Fig. 7 is a flow diagram of one embodiment of a signal synthesis module according to the present invention generating an output speech signal.
As shown in fig. 7, the signal synthesis module first performs time-frequency decomposition on the first and second forward sound signals generated by the first and second forward microphone modules, and on the first and second lateral sound signals generated by the first and second lateral microphone modules, converting the four sound signals from the time domain to the frequency domain and obtaining four frequency signals, each divided into time-frequency units by frame and frequency band. That is, the time-frequency unit of the first forward sound signal, acquired by the first forward microphone module mic1, in the jth frequency band of the ith frame is denoted h1(i, j); similarly, the time-frequency unit of the second forward sound signal in the jth frequency band of the ith frame is denoted h2(i, j), that of the first lateral sound signal h3(i, j), and that of the second lateral sound signal h4(i, j).
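The decomposition step above can be sketched as follows. This is only an illustrative sketch: the sampling rate, frame length and overlap (16 kHz, 512 samples, 50 %) are assumptions, not values taken from the patent, and the random signals stand in for the four microphone channels.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
# Four mono channels standing in for mic1/mic2 (forward) and mic3/mic4 (lateral).
rng = np.random.default_rng(0)
channels = [rng.standard_normal(fs) for _ in range(4)]

# Decompose each channel into time-frequency units h_k(i, j):
# frame index i, frequency-band index j.
h = []
for x in channels:
    f, t, Z = stft(x, fs=fs, nperseg=512, noverlap=256)
    h.append(Z.T)  # shape: (frames i, bands j), complex-valued units

h1, h2, h3, h4 = h
```

Each `h_k[i, j]` then corresponds to one time-frequency unit hk(i, j) in the notation of the text.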
Next, the time-frequency units h1(i, j) and h2(i, j) of the two forward sound signals are first band-equalized and then added to each other: hm(i, j) = h1(i, j) + h2(i, j) · b(i, j), where hm(i, j) is the synthesized forward sound signal and b(i, j) is the band equalization coefficient. The band equalization ensures that the energy values of h1(i, j) and h2(i, j) in each frequency band tend to be consistent.
The time-frequency units h3(i, j) and h4(i, j) of the two lateral sound signals are likewise added: hp(i, j) = h3(i, j) + h4(i, j), where hp(i, j) is the synthesized lateral sound signal. Time-frequency compensation is then applied to each unit of the synthesized lateral sound signal: hg(i, j) = hp(i, j) · g(i, j), where hg(i, j) is the time-frequency-compensated lateral sound signal and g(i, j) is the time-frequency compensation parameter. The parameter g(i, j) can be updated according to the result of the subsequent voice activation detection (VAD).
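The two synthesis steps can be sketched as below. The concrete estimator for b(i, j) (a per-unit magnitude ratio) and the initialization g(i, j) = 1 are illustrative assumptions; the patent specifies only the roles of these coefficients, not how they are computed.

```python
import numpy as np

rng = np.random.default_rng(1)
frames, bands = 8, 257

def units():
    """Random complex time-frequency units standing in for one channel."""
    return rng.standard_normal((frames, bands)) + 1j * rng.standard_normal((frames, bands))

h1, h2, h3, h4 = units(), units(), units(), units()

# Band-equalization coefficient b(i, j): scale h2 so its per-unit energy
# tracks that of h1 (simple ratio estimate; a real system would smooth it).
eps = 1e-12
b = np.abs(h1) / (np.abs(h2) + eps)

# hm(i, j) = h1(i, j) + h2(i, j) * b(i, j): synthesized forward signal
h_m = h1 + h2 * b

# hp(i, j) = h3(i, j) + h4(i, j): synthesized lateral signal
h_p = h3 + h4

# hg(i, j) = hp(i, j) * g(i, j): time-frequency compensation; g starts at 1
# here and would be updated from the VAD result per the text.
g = np.ones((frames, bands))
h_g = h_p * g
```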
Next, IID and ITD values are calculated from the time-frequency units of the two processed paths, namely the time-frequency-compensated synthesized lateral signal hg(i, j) and the synthesized forward signal hm(i, j). ITD (Interaural Time Difference), also called the two-channel time difference, can be understood simply as the difference in a sound's arrival time between the two ears; IID (Interaural Intensity Difference), also called the two-channel energy difference, can be understood simply as the difference in a sound's intensity between the two ears. In this embodiment, the IID and ITD of the time-frequency unit in the jth frequency band of the ith frame are denoted IID(i, j) and ITD(i, j), respectively, and the masking threshold is generated based on them. At the same time, voice activation detection (VAD) is performed on the time-frequency units of the synthesized forward sound signal hm(i, j).
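A minimal sketch of computing IID(i, j) and ITD(i, j) per time-frequency unit follows. The patent gives no explicit formulas, so the choices here — IID as a log-magnitude ratio in dB, ITD derived from the per-unit phase difference — are assumptions:

```python
import numpy as np

def iid_itd(h_m, h_g, freqs):
    """Per-unit level difference (dB) and time difference (s) between the
    compensated lateral signal h_g and the forward signal h_m."""
    eps = 1e-12
    iid = 20.0 * np.log10((np.abs(h_g) + eps) / (np.abs(h_m) + eps))
    # Phase difference of each time-frequency unit, mapped to a delay.
    phase = np.angle(h_g * np.conj(h_m))
    omega = 2.0 * np.pi * np.maximum(freqs, 1.0)  # avoid division by zero at DC
    itd = phase / omega
    return iid, itd
```

For identical inputs both quantities are zero, as expected for a source equidistant from both channels.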
Specifically, the invention can generate three preliminary masking thresholds and select one as the optimal, i.e. final, masking threshold G(i, j) based on a comprehensive judgment. First, an initial masking threshold can be obtained from a preset mapping based on the two-channel energy difference IID(i, j) and the two-channel time difference ITD(i, j) of the jth frequency band of the ith frame; this first masking threshold is denoted G1(i, j). Second, a steady-state noise reduction mode using per-band spectral subtraction can yield a second masking threshold, denoted G2(i, j). Third, different masking thresholds can be taken according to the speech classification result obtained by the VAD, determining a third masking threshold, denoted G3(i, j).
In a specific embodiment, when the VAD classifies the synthesized forward sound signal hm(i, j) as a noise signal, the speech masking threshold G(i, j) can be set directly to the third masking threshold G3(i, j). When hm(i, j) is classified as a speech signal, the first masking threshold G1(i, j) and the second masking threshold G2(i, j) are compared, and the optimal speech masking threshold is determined from the comparison result, for example by selecting the smaller of the two as G(i, j).
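The selection rule of this embodiment can be sketched compactly; the per-frame boolean VAD decision and the array shapes are illustrative assumptions:

```python
import numpy as np

def select_mask(G1, G2, G3, is_speech):
    """Pick the final masking threshold G(i, j) per the embodiment above.

    is_speech: boolean VAD decision per frame i.
    Noise frames take G3 directly; speech frames take the smaller
    (more suppressive) of G1 and G2.
    """
    return np.where(is_speech[:, None], np.minimum(G1, G2), G3)
```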
Finally, the final masking threshold is smoothed and applied to the synthesized forward sound signal hm(i, j) to obtain noise-reduced time-frequency units, which are converted back into a time-domain signal to produce the final output speech signal. The enhanced forward sound signal contains the voices of the wearer and the opposite speaker; through the computation of the above embodiments, the four microphone signals yield an output containing only this speech, with environmental noise removed.
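The final step can be sketched as follows. First-order recursive smoothing over frames and SciPy's inverse STFT are assumed implementation choices; the patent specifies only "smoothing" and conversion back to the time domain.

```python
import numpy as np
from scipy.signal import istft

def smooth_mask_and_reconstruct(h_m, G, fs=16000, nperseg=512, alpha=0.8):
    """Smooth the final mask G over frames, apply it to the synthesized
    forward signal h_m, and return the time-domain output speech signal."""
    G_s = np.empty_like(G)
    G_s[0] = G[0]
    for i in range(1, len(G)):  # first-order recursive smoothing over frames
        G_s[i] = alpha * G_s[i - 1] + (1.0 - alpha) * G[i]
    masked = h_m * G_s          # noise-reduced time-frequency units
    _, x = istft(masked.T, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    return x                    # final output speech signal
```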
Those skilled in the art will appreciate that all or part of the steps of the above embodiments may be implemented as a program, i.e. a computer program, executed by a data processing apparatus (including a computer). When the computer program is executed, the above-described method provided by the present invention is carried out. Moreover, the computer program may be stored in a computer-readable storage medium, such as a magnetic disk, an optical disk, a ROM, a RAM, or a storage array composed of multiple storage media, for example a magnetic disk or tape storage array. The storage medium is not limited to centralized storage and may also be distributed, such as cloud storage based on cloud computing.
The following describes apparatus embodiments of the invention that may be used to perform method embodiments of the invention. Details described in the embodiments of the device according to the invention should be regarded as additions to the embodiments of the method described above; for details not disclosed in the embodiments of the device according to the invention, reference may be made to the above-described method embodiments.
Fig. 8 is a block diagram of the sound pickup apparatus according to an embodiment of the present invention. As shown in fig. 8, the signal synthesis module acquires the sound signals generated by the first, second, third and fourth microphone modules and processes them to generate the final processed sound signal as the output signal, which enhances the voice of the opposite speaker while eliminating or attenuating environmental noise. The voice recognition module is connected to the signal synthesis module to acquire the output speech signal and perform speech recognition; the recognized text information can be sent directly to the retina projection module for display, or converted from text to speech and sent to the sound playing module for playing. Optionally, the voice recognition module is further configured to convert the recognized text information into text in a user-specified language and send it to the retina projection module for display, or, after text-to-speech conversion, to the sound playing module for playing.
It will be appreciated by those skilled in the art that the modules in the above embodiments may be distributed in a device as described, or may be distributed in one or more devices different from the above embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or may be further split into a plurality of sub-modules. For example, in the above embodiment, the text translation and text-to-speech functions of the speech recognition module may be implemented together, or may be implemented by a sub-module thereof, or may be implemented by another module independent of the speech recognition module.
The above-described specific embodiments further describe the objects, technical solutions and advantageous effects of the present invention in detail, and it should be understood that the present invention is not inherently related to any particular computer, virtual device or electronic apparatus, and various general-purpose devices may also implement the present invention. The foregoing description of the embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (13)
1. A head wearing apparatus for sound enhancement recording, comprising a front portion and side portions, the front portion facing the front of the user's face when the head wearing apparatus is worn on the user's head, the side portions being located at both ends of the front portion, respectively facing both sides of the user's face, characterized in that:
a first forward microphone module and a second forward microphone module are arranged in the front part, and the two forward microphone modules are used for recording the own speaking voice of the user and the speaking voice of the user to the opposite speaker;
the two sides of the side face part are respectively provided with a first side microphone module and a second microphone module, and the two side microphone modules are used for recording environmental noise;
the head wearing device further comprises a signal synthesis module for: acquiring a first forward sound signal and a second forward sound signal generated by a first forward microphone module and a second forward microphone module, and a first side sound signal and a second side sound signal generated by a first side microphone module and a second side microphone module; performing enhancement processing on the first forward sound signal and the second forward sound signal in combination to obtain an enhanced forward sound signal; generating masking thresholds of the enhanced forward sound signal on different frequency bands using each forward sound signal and each side sound signal; and carrying out noise reduction processing on the enhanced forward sound signal by utilizing the masking threshold value to obtain an output voice signal.
2. The head-worn device for sound enhancement recording as in claim 1, wherein:
the first and second forward microphone modules are positioned in a same horizontal plane when the head-worn device is worn on the head of a user; and is also provided with
The step of combining the first forward sound signal and the second forward sound signal for enhancement processing to obtain an enhanced forward sound signal comprises:
adding the sound signals generated by the first forward microphone module and the second forward microphone module and dividing by two to obtain a signal d(n), and subtracting them and dividing by two to obtain a signal x(n), where n denotes a time-sequence index;
processing the signal x (n) by an adaptive filter to obtain a signal y (n);
taking the enhanced output signal e(n), obtained by subtracting the signal y(n) from the signal d(n), as the enhanced forward sound signal, and inputting the signal e(n) to a parameter updating module of the adaptive filter for updating the parameters of the adaptive filter.
3. The head-worn device for sound enhancement recording as in claim 1, wherein: the step of generating masking thresholds for the enhanced forward sound signal over different frequency bands using each forward sound signal and each side sound signal comprises:
<1> time-frequency-decomposing each of the forward sound signals and each of the side sound signals, and converting from a time-domain signal to a frequency signal;
<2> adding time-frequency units of the same frequency band of the first and second forward sound signals to obtain each time-frequency unit of the forward sound signal; adding the time-frequency units of the same frequency band of the first lateral sound signal and the second lateral sound signal to obtain each time-frequency unit of the lateral sound signal;
<3> performing time-frequency compensation for each time-frequency unit of the side-direction sound signal;
<4> calculating an IID value and an ITD value for each time-frequency unit of the forward sound signal and the side sound signal, thereby generating the preliminary masking threshold of the enhanced forward sound signal.
4. A head-worn device for sound enhancement recording as in claim 3, wherein: before adding the time-frequency units of the same frequency band of the first forward sound signal and the second forward sound signal, the first forward sound signal and the second forward sound signal are subjected to equalization processing so that the energy values of the first forward sound signal and the second forward sound signal in each frequency band tend to be consistent.
5. A head-worn device for sound enhancement recording as in claim 3, wherein: the step of performing noise reduction processing on the enhanced forward sound signal by using the masking threshold to obtain an output voice signal includes:
<1> performing voice activation detection on the forward sound signal;
<2> obtaining a final masking threshold of the enhanced forward sound signal based on the voice activation detection result and the preliminary masking threshold;
<3> smoothing the final masking threshold and masking the enhanced speech signal using the smoothed final masking threshold;
<4> each time-frequency unit of the enhanced speech signal subjected to the masking processing is converted into the time domain, resulting in an output speech signal.
6. The head-worn device for sound enhancement recording as in claim 5, wherein: the step of performing time-frequency compensation on each time-frequency unit of the side direction sound signal comprises the following steps: and updating the time-frequency compensation parameters according to the voice detection result.
7. A head wear device for sound enhancement recording as in any one of claims 1-6, wherein:
the head wearing device is a pair of glasses, the front part of the head wearing device is a front frame for fixing lenses, and the side part of the head wearing device is two glasses legs positioned at two sides of the front frame;
the first forward microphone module and the second forward microphone module are positioned in the middle of the front frame, and the first lateral microphone module, the second lateral microphone module and the signal synthesis module are positioned in the glasses legs.
8. A head wear device for sound enhancement recording according to any one of claims 1 to 6, wherein the head wear device further comprises a sound playing module for playing sound to a user in accordance with the output signal.
9. A head wear device for sound enhancement recording according to any one of claims 1 to 6, further comprising a speech recognition module for speech recognition of the output signal to convert the sound signal into text information.
10. The head wear device for sound enhancement recording of claim 9, wherein the head wear device further comprises an information display module for displaying the text information.
11. The head wear device for sound enhancement recording as in claim 10, wherein the voice recognition module is further configured to convert the recognized text information into text information in a user-specified language and send the text information to the information display module for display.
12. The head wear device for sound enhancement recording of claim 11, wherein the information display module is a retina projection device for projecting the text information onto a retina of a user.
13. The head wear device for sound enhancement recording as in claim 8, wherein the voice recognition module is further configured to convert the recognized text information into text information in a user-specified language, and then convert the converted text information into a sound signal, and send the sound signal to the sound playing module for playing.
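The sum/difference enhancement of claim 2 can be sketched as below. The claim specifies only "an adaptive filter"; the LMS update rule, tap count and step size used here are assumptions, and the input signals are synthetic:

```python
import numpy as np

def enhance_forward(m1, m2, taps=16, mu=0.01):
    """Claim-2-style enhancement: d(n) = (m1+m2)/2, x(n) = (m1-m2)/2,
    adaptive filter output y(n), enhanced signal e(n) = d(n) - y(n),
    with e(n) driving the parameter update (LMS assumed)."""
    d = (m1 + m2) / 2.0         # sum channel: speech-dominated
    x = (m1 - m2) / 2.0         # difference channel: noise reference
    w = np.zeros(taps)
    buf = np.zeros(taps)
    e = np.zeros_like(d)
    for n in range(len(d)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        y = w @ buf             # adaptive filter output y(n)
        e[n] = d[n] - y         # enhanced forward sound signal e(n)
        w += mu * e[n] * buf    # LMS parameter update driven by e(n)
    return e
```

When both forward microphones receive an identical signal, x(n) is zero and the output equals the sum channel, i.e. the common (speech) component passes through unchanged.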
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310627833.XA CN116721657A (en) | 2023-05-31 | 2023-05-31 | Head wearing device for sound enhanced recording |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116721657A true CN116721657A (en) | 2023-09-08 |
Family
ID=87865257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310627833.XA Pending CN116721657A (en) | 2023-05-31 | 2023-05-31 | Head wearing device for sound enhanced recording |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116721657A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||