WO2022033236A1 - Audio enhancement method and apparatus, storage medium, and wearable device - Google Patents
- Publication number: WO2022033236A1 (application PCT/CN2021/104553)
- Authority: WIPO (PCT)
- Prior art keywords: audio signal, bone conduction, wearer, enhancement, signal
Classifications
- H04R1/08 — Mouthpieces; microphones; attachments therefor
- G02C11/10 — Spectacles: electronic devices other than hearing aids
- G10L15/063 — Speech recognition: training of reference templates
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments
- G10L21/0216 — Speech enhancement: noise filtering characterised by the method used for estimating noise
- G10L21/0264 — Speech enhancement: noise filtering characterised by the type of parameter measurement
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; beamforming
- H04R2410/00 — Microphones
- H04R2460/13 — Hearing devices using bone conduction transducers
- H04R2499/15 — Transducers incorporated in visual displaying devices
Definitions
- the present application relates to the technical field of audio processing, and in particular, to an audio enhancement method, device, storage medium and wearable device.
- wearable devices such as smart glasses and smart helmets have gradually entered people's lives.
- the wearable device can realize various functions. For example, by installing an application with audio collection capability on the wearable device and combining it with the audio collection hardware (such as a microphone) configured on the wearable device itself, an audio collection function can be provided to the user.
- Embodiments of the present application provide an audio enhancement method, apparatus, storage medium, and wearable device, which can improve the flexibility of the wearable device for audio enhancement.
- an embodiment of the present application provides an audio enhancement method, which is applied to a wearable device, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone, and the audio enhancement method includes:
- a first audio signal is collected through the bone conduction microphone, and a second audio signal is synchronously collected through the non-bone conduction microphone;
- when the wearer's pronunciation is detected according to the first audio signal, enhancement processing is performed on the first audio signal to obtain the wearer's pronunciation enhancement signal; and
- when it is detected according to the first audio signal that the wearer has not spoken, enhancement processing is performed on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- an embodiment of the present application provides an audio enhancement device, which is applied to a wearable device, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone, and the audio enhancement device includes:
- an audio acquisition module configured to acquire a first audio signal through the bone conduction microphone, and acquire a second audio signal through the non-bone conduction microphone synchronously;
- a pronunciation enhancement module configured to perform enhancement processing on the first audio signal when the wearer's pronunciation is detected according to the first audio signal, to obtain the wearer's pronunciation enhancement signal;
- an ambient sound enhancement module configured to perform enhancement processing on the second audio signal when it is detected according to the first audio signal that the wearer has not spoken, to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- the embodiments of the present application provide a storage medium on which a computer program is stored; when the computer program is loaded by a wearable device including a bone conduction microphone and a non-bone conduction microphone, the audio enhancement method provided by the embodiments of the present application is executed.
- an embodiment of the present application provides a wearable device, which includes a bone conduction microphone, a non-bone conduction microphone, a processor, and a memory; the memory stores a computer program, and the processor loads the computer program to execute the audio enhancement method provided by the embodiments of the present application.
- FIG. 1 is a schematic flowchart of an audio enhancement method provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of the arrangement positions of the bone conduction microphone and the non-bone conduction microphone when the physical representation form of the wearable device is smart glasses according to the embodiment of the present application.
- FIG. 3 is a schematic view of wearing the smart glasses in FIG. 2 .
- FIG. 4 is a schematic diagram of enhancing processing of a first audio signal in an embodiment of the present application.
- FIG. 5 is another schematic diagram of performing enhancement processing on the first audio signal in an embodiment of the present application.
- FIG. 6 is a schematic diagram of the arrangement positions of multiple non-bone conduction microphones when the physical representation form of the wearable device is smart glasses according to an embodiment of the present application.
- FIG. 7 is a schematic diagram of predicting the direction of a sound source in a real environment in an embodiment of the present application.
- FIG. 8 is a schematic diagram of the setting position of the camera module when the physical presentation form of the wearable device is smart glasses according to the embodiment of the present application.
- FIG. 9 is another schematic flowchart of an audio enhancement method provided by an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of an audio enhancement apparatus provided by an embodiment of the present application.
- FIG. 11 is a schematic structural diagram of a wearable device provided by an embodiment of the present application.
- the positions of the user and the microphone need to be designed and measured in advance, so that the position of the user's mouth and the position of the microphone are relatively fixed.
- the relative positions of the mouth and the microphone are further used to adjust the parameters of the algorithm, so that the microphone has a good enhancement effect on the user's voice and a good suppression effect on the environmental noise.
- in such solutions, the wearable device only enhances the user's own voice and suppresses other sounds; when the user wants to collect sounds other than his or her own voice, those sounds are suppressed as noise.
- the present application provides an audio enhancement method, an audio enhancement apparatus, a storage medium, and a wearable device. The executing body of the audio enhancement method may be the audio enhancement apparatus provided in the embodiments of the present application, or a wearable device integrating the audio enhancement apparatus, where the audio enhancement apparatus may be implemented in hardware or software. It should be noted that the embodiments of the present application do not specifically limit the physical presentation form of the wearable device; for example, it may be smart glasses, a smart helmet, and the like.
- An embodiment of the present application provides an audio enhancement method, which is applied to a wearable device, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone, and the audio enhancement method includes:
- a first audio signal is collected through the bone conduction microphone, and a second audio signal is synchronously collected through the non-bone conduction microphone;
- when the wearer's pronunciation is detected according to the first audio signal, the first audio signal is enhanced to obtain the wearer's pronunciation enhancement signal; and
- when it is detected that the wearer has not spoken, the second audio signal is enhanced to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- the performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal includes:
- acquiring, according to the wearer's historical bone conduction noise signal, the bone conduction noise signal during the collection of the first audio signal, and performing anti-phase superposition of the bone conduction noise signal and the first audio signal to obtain the pronunciation enhancement signal.
- the acquiring the bone conduction noise signal of the wearer during the collection of the first audio signal according to the historical bone conduction noise signal includes:
- performing model training with the historical bone conduction noise signal as sample data to obtain a noise prediction model, and predicting the bone conduction noise signal during the acquisition of the first audio signal according to the noise prediction model.
- the performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal includes:
- the pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal by using a pre-trained pronunciation enhancement model.
- the second audio signal is enhanced to obtain an enhanced ambient sound signal of the actual environment where the wearer is located, including:
- the second audio signal is subjected to beamforming processing according to the sound source direction to obtain the ambient sound enhancement signal.
- the audio enhancement method further includes:
- spectral features are extracted from the first audio signal, and pronunciation detection is performed on the spectral features by using a pre-trained pronunciation detection model to obtain a detection result of whether the wearer has spoken.
- after audio collection is performed synchronously through the bone conduction microphone and the non-bone conduction microphone to obtain the first audio signal collected by the bone conduction microphone and the second audio signal collected by the non-bone conduction microphone, the method also includes:
- the first audio signal is enhanced according to the second audio signal
- the second audio signal is enhanced according to the first audio signal.
- the audio enhancement method further includes:
- the wearable device further includes a camera module, and the three-dimensional reconstruction of the real environment to obtain a virtual reality environment includes:
- FIG. 1 is a schematic flowchart of an audio enhancement method provided by an embodiment of the present application.
- the following description takes the wearable device provided by the present application as an example of the execution subject of the audio enhancement method, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone.
- the process of the audio enhancement method provided by the embodiment of the present application may be as follows:
- a first audio signal is acquired through a bone conduction microphone, and a second audio signal is simultaneously acquired through a non-bone conduction microphone.
- the transmission media of sound include solids, liquids, and gases such as air; bone conduction belongs to the solid conduction of sound waves.
- Bone conduction is a very common physiological phenomenon. For example, the sound of chewing food is heard because it is transmitted to the inner ear through the jawbone. When the user speaks, vibration is generated by the vocal organs and is then transmitted through body tissues such as bones, muscles, and skin to other parts, such as the bridge of the nose and the ear bones.
- the bone conduction microphone can convert the vibration of the body part caused by the user's pronunciation into sound signals, thereby restoring the sound made by the user.
- the wearable device includes a bone conduction microphone and a non-bone conduction microphone.
- the non-bone conduction microphone includes any type of microphone that collects audio based on the air and/or liquid transmission of sound, including but not limited to electrodynamic, condenser, piezoelectric, electromagnetic, carbon particle, and semiconductor microphones.
- the non-bone conduction microphone is arranged at the end of the temple near the frame, and the bone conduction microphone is arranged at the other end of the temple. Referring to FIG. 3, when the smart glasses are worn, the bone conduction microphone directly contacts the user's ear bones, while the non-bone conduction microphone does not directly contact the user.
- FIG. 2 shows only an optional arrangement of the bone conduction microphone and the non-bone conduction microphone, and the bone conduction microphone and the non-bone conduction microphone are not limited to this arrangement in a specific implementation.
- the bone conduction microphone can also be arranged on the frame of the smart glasses and contact the user's nose bridge when wearing it.
- when audio collection is triggered, the wearable device synchronously performs audio collection through the bone conduction microphone and the non-bone conduction microphone; the audio signal collected by the bone conduction microphone is recorded as the first audio signal, and the audio signal collected by the non-bone conduction microphone is recorded as the second audio signal.
- the embodiments of this application do not specifically limit how audio collection is triggered. For example, it may be triggered when the user operates the wearable device to start an audio collection application for recording, when the user starts an instant messaging application to conduct an audio or video call, or when the wearable device performs voice wake-up or voice recognition.
- when the wearer's pronunciation is detected according to the first audio signal, the first audio signal is enhanced to obtain the wearer's pronunciation enhancement signal.
- owing to the audio collection principle of the bone conduction microphone, there is little interference from sounds other than the wearer's when the bone conduction microphone collects audio. Therefore, the first audio signal collected by the bone conduction microphone can be effectively used to detect whether the wearer of the wearable device pronounces.
- the wearable device may detect whether the wearer of the wearable device has spoken based on the first audio signal, according to the configured pronunciation detection strategy.
- the embodiment of the present application does not specifically limit the configuration of the pronunciation detection strategy, which can be configured by those of ordinary skill in the art according to actual needs.
- the pronunciation detection strategy can be configured as follows: using a Voice Activity Detection (VAD) algorithm to detect whether the wearer of the wearable device is speaking.
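The patent does not fix a particular VAD algorithm. A minimal sketch of frame-level detection on the bone conduction signal is a plain energy threshold (the threshold value and majority-vote rule here are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def detect_pronunciation(frames: np.ndarray, energy_thresh: float = 0.01) -> bool:
    """Very simple frame-energy VAD over a (n_frames, frame_len) array.

    Because the bone conduction signal carries little ambient sound, a
    plain energy threshold already separates wearer speech from silence
    fairly well; real systems would add smoothing/hangover logic.
    """
    frame_energy = np.mean(frames ** 2, axis=1)   # per-frame mean power
    active = frame_energy > energy_thresh         # frames above threshold
    return bool(np.mean(active) > 0.5)            # majority vote over frames

# Illustrative use: a tone burst counts as "speaking", silence does not.
t = np.linspace(0, 1, 16000, endpoint=False)
speech_like = np.sin(2 * np.pi * 200 * t).reshape(100, 160)
silence = np.zeros((100, 160))
print(detect_pronunciation(speech_like))  # True
print(detect_pronunciation(silence))      # False
```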
- if the pronunciation of the wearer of the wearable device is detected according to the first audio signal, the wearer's pronunciation is used as the enhancement object, and the first audio signal is enhanced according to the configured pronunciation enhancement strategy to correspondingly obtain the wearer's pronunciation enhancement signal.
- the enhancement processing on the first audio signal can be regarded as a process of eliminating the non-wearer's pronunciation component in the first audio signal, and the purpose of enhancing the wearer's pronunciation is achieved by eliminating the non-wearer's pronunciation component.
- the embodiments of the present application do not specifically limit the configuration of pronunciation enhancement strategies, which can be configured by those of ordinary skill in the art according to actual needs, including artificial intelligence-based pronunciation enhancement strategies and non-artificial intelligence pronunciation enhancement strategies.
- an enhancement process is performed on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- the ambient sound of the real environment where the wearer is located is used as the enhancement object, and the second audio signal is enhanced according to the configured ambient sound enhancement strategy to correspondingly obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- the configuration of the ambient sound enhancement strategy is likewise not specifically limited in the embodiments of the present application, and can be configured by those of ordinary skill in the art according to actual needs, including artificial-intelligence-based and non-artificial-intelligence ambient sound enhancement strategies.
- the audio enhancement method provided by the present application is suitable for wearable devices including bone conduction microphones and non-bone conduction microphones. Whether the wearer of the wearable device pronounces is detected according to the first audio signal. If the wearer's pronunciation is detected, the wearer's pronunciation is used as the enhancement object, and the first audio signal is enhanced to obtain the wearer's pronunciation enhancement signal; if it is detected that the wearer has not spoken, the ambient sound of the wearer's real environment is used as the enhancement object, and the second audio signal is enhanced to obtain the corresponding ambient sound enhancement signal. Therefore, the present application flexibly performs audio enhancement processing by arranging bone conduction and non-bone conduction microphones in the wearable device and using the audio collection result of the bone conduction microphone to dynamically determine the audio enhancement object.
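The branching described above can be sketched as a simple dispatch (the `enhance_*` helper names are placeholders standing in for the pronunciation and ambient sound enhancement strategies, not APIs from the patent):

```python
import numpy as np

def enhance_pronunciation(sig: np.ndarray) -> np.ndarray:
    # Placeholder: e.g. bone conduction noise cancellation on the first signal.
    return sig

def enhance_ambient(sig: np.ndarray) -> np.ndarray:
    # Placeholder: e.g. beamforming on the second (air-conducted) signal.
    return sig

def enhance_audio(bone_signal, air_signal, is_speaking: bool):
    """Dispatch enhancement by the bone conduction detection result.

    `is_speaking` would come from VAD on the bone conduction signal.
    """
    if is_speaking:
        return "pronunciation", enhance_pronunciation(bone_signal)
    return "ambient", enhance_ambient(air_signal)

kind, out = enhance_audio(np.zeros(160), np.ones(160), is_speaking=True)
print(kind)  # pronunciation
```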
- the enhancement processing is performed on the first audio signal to obtain the wearer's pronunciation enhancement signal, including:
- the embodiments of the present application further provide an optional pronunciation enhancement strategy.
- the first audio signal collected by the wearable device includes not only the pronunciation components of the wearer, but also the wearer's breath sounds, heartbeat sounds, chewing sounds and/or cough sounds.
- the interference of the bone conduction microphone mainly comes from the wearer himself.
- the first audio signal can be regarded as composed of two parts: the pure voice signal of the wearer and the wearer's bone conduction noise (including but not limited to breath sounds, heartbeat sounds, chewing sounds, coughing sounds, etc.). Therefore, enhancement of the wearer's voice signal can be achieved by eliminating the bone conduction noise in the first audio signal.
- if the wearer speaks, the wearable device will collect a mixture of the wearer's voice signal and the bone conduction noise signal through the bone conduction microphone; if the wearer does not speak, the wearable device will collect only the wearer's bone conduction noise signal through the bone conduction microphone. In the embodiment of the present application, when the wearer does not speak, the wearable device buffers the collected bone conduction noise signal.
- when the first audio signal is to be enhanced, the wearable device first acquires the wearer's bone conduction noise signal collected before the first audio signal, which is recorded as the historical bone conduction noise signal. For example, the wearable device takes the collection start time of the first audio signal as the end time and obtains the buffered bone conduction noise signal of a preset time length collected before that end time (the preset time length can be set to an appropriate value by a person of ordinary skill in the art according to actual needs, and is not specifically limited in the embodiments of the present application; for example, it can be set to 500 ms).
- for example, suppose the wearable device starts collecting the first audio signal at 09:13:56.500 on July 28, 2020; it then uses the 500-millisecond historical bone conduction noise signal buffered from the bone conduction microphone between 09:13:56.000 and 09:13:56.500 on July 28, 2020.
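One way to keep the most recent noise available for this lookup is a fixed-length ring buffer. The sketch below assumes a 500 ms window at a 16 kHz sample rate (both values are illustrative, matching the 500 ms example above):

```python
from collections import deque

import numpy as np

SAMPLE_RATE = 16000   # assumed sample rate
HISTORY_MS = 500      # preset history length from the example above

# deque with maxlen silently discards the oldest samples as new ones arrive.
history = deque(maxlen=SAMPLE_RATE * HISTORY_MS // 1000)

def on_noise_samples(samples: np.ndarray) -> None:
    """Append bone conduction samples captured while the wearer is silent."""
    history.extend(samples.tolist())

# Feed 1 s of noise; only the trailing 500 ms (8000 samples) is retained.
on_noise_samples(np.random.default_rng(0).normal(size=16000))
print(len(history))  # 8000
```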
- the wearable device After acquiring the wearer's historical bone conduction noise signal before the first audio signal acquisition, the wearable device further acquires the wearer's bone conduction noise signal during the first audio signal acquisition according to the historical bone conduction noise signal.
- the wearable device can obtain the noise distribution characteristics of the historical bone conduction noise signal, and then generate a noise signal with the same duration as the first audio signal according to the noise distribution characteristics, as the bone conduction noise signal of the wearer during the collection of the first audio signal.
- the wearable device can also directly use the acquired historical bone conduction noise signal as the bone conduction noise signal during the first audio signal collection period. If the duration of the historical bone conduction noise signal is greater than that of the first audio signal, a portion with the same duration as the first audio signal may be cut from the historical bone conduction noise signal and used as the wearer's bone conduction noise signal during the collection of the first audio signal; if the duration of the historical bone conduction noise signal is less than that of the first audio signal, the historical bone conduction noise signal can be copied to obtain multiple copies, which are then spliced into a noise signal with the same duration as the first audio signal and used as the wearer's bone conduction noise signal during the collection of the first audio signal.
- after acquiring the bone conduction noise signal of the wearer during the first audio signal collection period, the wearable device first performs anti-phase processing on the bone conduction noise signal to obtain an anti-phase bone conduction noise signal; the wearable device then superimposes the anti-phase bone conduction noise signal and the first audio signal, thereby eliminating the bone conduction noise part of the first audio signal and obtaining the pure voice part of the wearer, that is, the pronunciation enhancement signal.
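The crop/splice length matching and the anti-phase superposition can be sketched together as follows. This assumes the noise is stationary enough for direct time-domain subtraction, which is a strong simplification of the embodiment:

```python
import numpy as np

def match_length(noise: np.ndarray, n: int) -> np.ndarray:
    """Crop the noise if it is too long; tile (splice copies) if too short."""
    if len(noise) >= n:
        return noise[:n]
    reps = -(-n // len(noise))          # ceiling division: how many copies
    return np.tile(noise, reps)[:n]

def antiphase_enhance(first_audio: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Superimpose the inverted noise estimate onto the first audio signal."""
    noise = match_length(noise, len(first_audio))
    return first_audio + (-noise)       # anti-phase superposition

# If the noise estimate were exact, the clean voice would be recovered.
voice = np.sin(np.linspace(0, 2 * np.pi, 1000))
noise = 0.1 * np.ones(300)              # shorter than the signal: gets tiled
mixed = voice + match_length(noise, 1000)
enhanced = antiphase_enhance(mixed, noise)
print(np.allclose(enhanced, voice))  # True
```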
- acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal including:
- the embodiments of the present application further provide a solution for optionally acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal.
- the wearable device uses the historical bone conduction noise signal as sample data, and performs model training according to a preset training algorithm to obtain a noise prediction model corresponding to the wearer.
- the training algorithm is a machine learning algorithm, and the machine learning algorithm can predict the data by continuously learning features.
- a wearable device can predict the current noise distribution based on the historical noise distribution.
- the machine learning algorithm may include decision tree algorithms, regression algorithms, Bayesian algorithms, neural network algorithms (which may include deep neural network, convolutional neural network, and recurrent neural network algorithms, etc.), clustering algorithms, and so on; which training algorithm is selected as the preset training algorithm for model training can be decided by those of ordinary skill in the art according to actual needs.
- for example, the training algorithm configured on the wearable device is a Gaussian mixture model algorithm (a regression algorithm). The wearable device obtains the historical bone conduction noise signal, uses it as sample data, and performs model training according to the Gaussian mixture model algorithm to obtain a Gaussian mixture model (which includes a plurality of Gaussian units describing the noise distribution); this Gaussian mixture model is used as the noise prediction model.
- the wearable device uses the start time and end time of the first audio signal collection period as the input of the noise prediction model, and obtains the predicted noise signal output by the noise prediction model, which is used as the wearer's bone conduction noise signal during the collection of the first audio signal.
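One illustrative realization of the Gaussian mixture noise model uses scikit-learn: fit a `GaussianMixture` to short frames of the historical noise, then sample from it to synthesize a noise estimate covering the first audio signal's duration. This is a sketch under assumptions (frame length, component count, and sampling-as-prediction are choices made here, not the patent's exact formulation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

FRAME = 64  # assumed frame length in samples

rng = np.random.default_rng(0)
historical_noise = 0.05 * rng.normal(size=8000)            # buffered noise
frames = historical_noise[: 8000 // FRAME * FRAME].reshape(-1, FRAME)

# Fit a small GMM describing the distribution of noise frames.
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(frames)

# "Predict" noise for the first audio signal's duration by sampling frames.
n_needed = 16000
n_frames = -(-n_needed // FRAME)                           # ceiling division
sampled, _ = gmm.sample(n_frames)                          # (n_frames, FRAME)
predicted_noise = sampled.reshape(-1)[:n_needed]
print(predicted_noise.shape)  # (16000,)
```

The sampled frames reproduce the historical noise's distribution, not its waveform; that is sufficient for the anti-phase strategy only under the stationarity assumption noted above.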
- acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal including:
- the embodiments of the present application further provide a solution for optionally acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal.
- after acquiring the wearer's historical bone conduction noise signal collected before the first audio signal, the wearable device further acquires a pre-trained general bone conduction noise model.
- the wearable device extracts the acoustic features of the historical bone conduction noise signal, and performs adaptive processing on the general bone conduction noise model based on the acoustic features to obtain a noise prediction model corresponding to the wearer.
- the adaptive processing refers to using, as the wearer's acoustic characteristics, the part of the non-specific users' acoustic characteristics in the general bone conduction noise model that is similar to the wearer's bone conduction noise; the adaptive processing can be implemented with a maximum a posteriori estimation algorithm.
- maximum a posteriori estimation obtains an estimate of a quantity that is difficult to observe directly on the basis of empirical data.
- the prior probability and Bayes' theorem are used to obtain the posterior probability of the objective function (that is, the expression of the wearer's bone conduction noise model).
- by maximizing the likelihood function of this posterior probability, the parameter values at which the likelihood function is largest can be obtained (for example, a gradient descent algorithm can be used to find the maximum of the likelihood function); this realizes the effect of training the general bone conduction noise model with the wearer's acoustic features, using the part of the non-specific speakers' acoustic features that is similar to the wearer's, so that the bone conduction noise model corresponding to the wearer can be obtained from the parameter values at which the likelihood function is largest.
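A common concrete form of such adaptation (assumed here for illustration; the application does not fix one) is the relevance-MAP update of a Gaussian mean, which interpolates between the general model's mean and the wearer's own statistics:

```python
import numpy as np

def map_adapt_mean(general_mean, wearer_frames, tau=16.0):
    # `tau` is an assumed relevance factor, not a value from the application.
    n = len(wearer_frames)
    xbar = np.mean(wearer_frames, axis=0)
    # With little wearer data the result stays close to the general model;
    # with much data it moves toward the wearer's own statistics.
    return (n * xbar + tau * general_mean) / (n + tau)

general = np.array([0.0, 0.0])            # mean from the general model
frames = np.full((48, 2), 1.0)            # wearer's bone-conduction noise features
adapted = map_adapt_mean(general, frames)
print(adapted)  # 48/(48+16) = 0.75 of the way toward the wearer's mean
```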
- the bone conduction noise signal of the wearer during the first audio signal collection period can be predicted and obtained according to the noise prediction model.
- the enhancement processing is performed on the first audio signal to obtain the wearer's pronunciation enhancement signal, including:
- the pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal through the pre-trained pronunciation enhancement model.
- An optional pronunciation enhancement strategy is further provided in the embodiment of the present application.
- a pronunciation enhancement model is pre-trained, and the pronunciation enhancement model is configured to enhance the user's pronunciation part in the audio signal.
- the wearable device calls the pre-trained pronunciation enhancement model and inputs the first audio signal into the called pronunciation enhancement model for pronunciation enhancement processing, so that the corresponding pronunciation enhancement signal can be obtained.
- the first audio signal is composed of two parts, namely the pronunciation signal of the wearer's pronunciation and other noise signals; performing pronunciation enhancement processing on the first audio signal through the pre-trained pronunciation enhancement model yields a pronunciation enhancement signal in which the noise signal is suppressed.
- the pronunciation enhancement process is performed on the first audio signal through a pre-trained pronunciation enhancement model to obtain a pronunciation enhancement signal, including:
- the pronunciation types include but are not limited to speaking, singing, humming, and the like.
- for each pronunciation type, a pronunciation enhancement model suitable for performing pronunciation enhancement processing on audio signals of that pronunciation type is trained, and a first preset correspondence between pronunciation types and pronunciation enhancement models is obtained accordingly.
- the wearable device when the pronunciation enhancement processing is performed on the first audio signal by the pre-trained pronunciation enhancement model, the wearable device first identifies the pronunciation type of the wearer to obtain the pronunciation type of the wearer.
- the recognition mode of pronunciation type is not specifically limited, and can be configured by those of ordinary skill in the art according to actual needs.
- a pronunciation type recognition model for pronunciation type recognition can be pre-trained. The pronunciation type recognition model is called to identify the pronunciation type.
- after recognizing the wearer's pronunciation type, the wearable device determines the pronunciation enhancement model corresponding to that pronunciation type according to the first preset correspondence between pronunciation types and pronunciation enhancement models, and records it as the target pronunciation enhancement model. For example, suppose a pronunciation enhancement model A suitable for pronunciation enhancement processing of the pronunciation type "speaking", a pronunciation enhancement model B suitable for the pronunciation type "singing", and a pronunciation enhancement model C suitable for the pronunciation type "humming" are pre-trained; if the wearer's pronunciation type is recognized as "speaking", the pronunciation enhancement model A is determined as the target pronunciation enhancement model.
- after determining the target pronunciation enhancement model, the wearable device calls the target pronunciation enhancement model and inputs the first audio signal into it for pronunciation enhancement processing to obtain the corresponding pronunciation enhancement signal.
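The first preset correspondence can be pictured as a simple lookup table from pronunciation type to enhancement model; the stand-in model objects below are purely illustrative:

```python
# Hypothetical sketch: each "model" is a stand-in callable standing for a
# pre-trained pronunciation enhancement model (A/B/C in the example above).
def make_model(name):
    return lambda signal: f"{name}({signal})"

first_preset_correspondence = {
    "speaking": make_model("model_A"),
    "singing":  make_model("model_B"),
    "humming":  make_model("model_C"),
}

def enhance(first_audio_signal, pronunciation_type):
    # Select the target pronunciation enhancement model by the recognised type,
    # then run the first audio signal through it.
    target_model = first_preset_correspondence[pronunciation_type]
    return target_model(first_audio_signal)

print(enhance("x", "speaking"))  # model A handles the "speaking" type
```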
- the second audio signal is enhanced to obtain an enhanced ambient sound signal of the actual environment where the wearer is located, including:
- the wearable device includes a plurality of non-bone conduction microphones, and a microphone array is formed by the plurality of non-bone conduction microphones.
- the wearable device is provided with two non-bone conduction microphones at the end of the temple near the frame.
- the arrangement of the non-bone conduction microphones shown in FIG. 6 is only an optional arrangement, and those skilled in the art can select the number and arrangement of the non-bone conduction microphones according to actual needs.
- the wearable device will collect multiple second audio signals through multiple non-bone conduction microphones.
- the wearable device may predict the sound source direction in the wearer's real environment according to the time difference between the second audio signals collected by the non-bone conduction microphones.
- the following takes the setting method of the non-bone conduction microphone shown in FIG. 6 as an example to describe the method for the wearable device to predict the direction of the sound source in the real environment:
- as shown in FIG. 7, there is another speaker, "User B", in the real environment; the sound signal emitted by User B is collected successively by non-bone conduction microphone 1 and non-bone conduction microphone 2.
- the time difference between the sound signal arriving at non-bone conduction microphone 1 and at non-bone conduction microphone 2 is denoted t, and the distance between non-bone conduction microphone 1 and non-bone conduction microphone 2 is denoted L1.
- the angle between the incident direction of the sound signal and the line connecting non-bone conduction microphone 1 and non-bone conduction microphone 2 is denoted θ.
- the calculated angle θ represents the direction of User B relative to the wearable device, that is, the sound source direction.
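Under a far-field assumption, the angle θ follows from the time difference t and the microphone spacing L1 as cos θ = c·t / L1, where c is the speed of sound; the following sketch (with illustrative values for t and L1) shows the computation:

```python
import math

def source_angle(t, L1, c=343.0):
    # Far-field model: the extra path length to the farther microphone is c*t,
    # so cos(theta) = c * t / L1 (clamped against rounding).
    cos_theta = max(-1.0, min(1.0, c * t / L1))
    return math.degrees(math.acos(cos_theta))

t = 0.0002   # seconds between mic 1 and mic 2 picking up User B (illustrative)
L1 = 0.14    # metres between the two non-bone conduction microphones (illustrative)
print(round(source_angle(t, L1), 1))
```

A time difference of zero yields θ = 90°, i.e. the source is broadside to the microphone pair, which matches the intuition that both microphones then hear the sound simultaneously.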
- beamforming processing can be performed on the second audio signal in the sound source direction according to the configured beamforming algorithm to eliminate noise signals outside the sound source direction, so as to obtain the corresponding ambient sound enhancement signal.
- the embodiments of the present application do not specifically limit which beamforming algorithm is used, and can be configured by those of ordinary skill in the art according to actual needs.
- a generalized sidelobe canceler (GSC) algorithm may be used to perform beamforming processing on the second audio signal.
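A full GSC implementation is beyond a short sketch; the delay-and-sum stage below illustrates the steering principle the GSC builds on — aligning each microphone signal toward the sound source direction and averaging, which reinforces the source and attenuates off-axis noise (all signals and delays are illustrative):

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    # Undo each microphone's known delay toward the source, then average.
    aligned = [np.roll(sig, -d) for sig, d in zip(signals, delays_samples)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(1)
n = 1000
source = np.sin(2 * np.pi * 5 * np.arange(n) / n)     # desired (periodic) signal
mic1 = source + rng.normal(0, 0.5, n)                 # mic 1: no delay
mic2 = np.roll(source, 3) + rng.normal(0, 0.5, n)     # mic 2: 3-sample delay
out = delay_and_sum([mic1, mic2], [0, 3])

# Averaging uncorrelated noise halves its power, so the output is closer
# to the clean source than either microphone alone.
err_single = np.mean((mic1 - source) ** 2)
err_beamformed = np.mean((out - source) ** 2)
print(err_beamformed < err_single)
```

The GSC named above extends this with an adaptive blocking branch that further cancels residual off-axis interference.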
- performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located including:
- An ambient sound enhancement signal is obtained by performing an ambient sound enhancement process on the second audio signal by using the pre-trained ambient sound enhancement model.
- An optional ambient sound enhancement strategy is further provided in the embodiment of the present application.
- an ambient sound enhancement model is pre-trained, and the ambient sound enhancement model is configured to enhance the ambient sound part in the audio signal.
- when performing enhancement processing on the second audio signal, the wearable device invokes the pre-trained ambient sound enhancement model and inputs the second audio signal into the invoked ambient sound enhancement model for ambient sound enhancement processing, so as to obtain the corresponding ambient sound enhancement signal.
- performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located including:
- the ambient sound enhancement processing is performed on the second audio signal by the target ambient sound enhancement model to obtain an ambient sound enhancement signal.
- there is no specific limitation on the division of environment types, which can be performed by those of ordinary skill in the art according to actual needs.
- the types of environments can be divided into indoor environments and outdoor environments.
- for each environment type, an environmental sound enhancement model suitable for performing environmental sound enhancement processing on audio signals of that environment type is trained, and a second preset correspondence between environment types and environmental sound enhancement models is obtained accordingly.
- when performing environmental sound enhancement processing on the second audio signal through the pre-trained environmental sound enhancement model, the wearable device first identifies the environment type of the real environment where the wearer is located, and obtains the wearer's environment type.
- the environment type identification model is called to identify the environment type.
- after recognizing the environment type of the wearer's real environment, the wearable device determines the environmental sound enhancement model corresponding to that environment type according to the second preset correspondence between environment types and environmental sound enhancement models, and records it as the target ambient sound enhancement model. For example, an ambient sound enhancement model D suitable for performing ambient sound enhancement processing on the environment type "indoor environment" and an ambient sound enhancement model E suitable for the environment type "outdoor environment" are pre-trained; if the environment type of the real scene where the wearer is located is "outdoor environment", the ambient sound enhancement model E is determined as the target ambient sound enhancement model.
- after determining the target ambient sound enhancement model, the wearable device invokes the target ambient sound enhancement model and inputs the second audio signal into it for ambient sound enhancement processing to obtain the corresponding ambient sound enhancement signal.
- the audio enhancement method provided by this application further includes:
- pronunciation detection is performed on the spectral features through the pre-trained pronunciation detection model to obtain a detection result of whether the wearer has spoken.
- a pronunciation detection model is also pre-trained, and the pronunciation detection model is configured to detect whether the wearer speaks.
- a sample pronunciation signal of the wearer's pronunciation can be collected in advance, and the spectral feature of the sample pronunciation signal can be extracted as a training sample, and then the training sample can be used for model training to obtain a pronunciation detection model.
- when detecting whether the wearer of the wearable device has spoken according to the first audio signal, the wearable device can extract the spectral features of the first audio signal, call the pre-trained pronunciation detection model, and input the extracted spectral features into the pronunciation detection model for pronunciation detection, so as to obtain a detection result of whether the wearer has spoken.
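The pipeline of framing the first audio signal, extracting spectral features, and scoring them can be sketched as follows; the energy-threshold "model" here is a stand-in assumption, not the pre-trained pronunciation detection model itself:

```python
import numpy as np

def spectral_features(signal, frame_len=256):
    # Frame the signal and take the magnitude spectrum of each windowed frame.
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

def wearer_is_speaking(signal, threshold=1.0):
    feats = spectral_features(signal)
    # A trained detection model would score `feats`; this peak-magnitude
    # threshold is only an illustrative stand-in.
    return bool(feats.max() > threshold)

t = np.arange(4096) / 16000.0
voiced = 0.5 * np.sin(2 * np.pi * 200 * t)   # simulated wearer pronunciation
silence = np.zeros_like(t)
print(wearer_is_speaking(voiced), wearer_is_speaking(silence))
```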
- the method further includes:
- the first audio signal is enhanced according to the second audio signal
- the second audio signal is enhanced according to the first audio signal.
- both the first audio signal and the second audio signal are regarded as composed of two parts, that is, the wearer's pronunciation part and the ambient sound part.
- the wearable device can perform enhancement processing on the wearer's pronunciation part or the ambient sound part according to the configured priority level.
- when the current configuration enhances the wearer's pronunciation, that is, when the priority of the wearer's pronunciation is higher than that of the ambient sound, the wearable device performs enhancement processing on the first audio signal according to the second audio signal and suppresses the ambient sound.
- when the current configuration enhances the ambient sound, that is, when the priority of the ambient sound is higher than that of the wearer's pronunciation, the wearable device performs enhancement processing on the second audio signal according to the first audio signal to suppress the wearer's pronunciation.
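The priority-based dispatch described above can be pictured as follows (all names are illustrative placeholders):

```python
# Minimal sketch of the configured priority level: whichever part has the
# higher priority is enhanced, and the other signal serves as the noise
# reference to suppress.
def enhance_by_priority(first_signal, second_signal, priority="pronunciation"):
    if priority == "pronunciation":
        # Enhance the wearer's pronunciation (first audio signal),
        # suppressing the ambient part using the second audio signal.
        return ("enhanced_pronunciation", first_signal, second_signal)
    # Otherwise enhance the ambient sound (second audio signal),
    # suppressing the wearer's pronunciation using the first audio signal.
    return ("enhanced_ambient", second_signal, first_signal)

result = enhance_by_priority("bone_mic_signal", "air_mic_signal", priority="ambient")
print(result[0])
```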
- the audio enhancement method provided by this application further includes:
- in order to prevent the wearer of the wearable device from being too focused on pronunciation (such as speaking) and ignoring risks in the real environment, the wearable device also provides safety prompts to the wearer.
- the wearable device first identifies the wearer's state to determine whether the wearer is in a motion state. It should be noted that the embodiments of the present application do not specifically limit how to identify whether the wearer is in a motion state; the manner can be selected by those of ordinary skill in the art according to actual needs. For example, the wearable device can identify whether its current speed in any direction has reached a preset speed; if so, it determines that it is in a motion state, and thus that the wearer is also in a motion state; if not, it determines that it is in a non-motion state.
- similarly, the wearable device can identify whether its current displacement in any direction has reached a preset displacement; if so, it determines that it is in a motion state, and thus that the wearer is also in a motion state; if not, it determines that it is in a non-motion state, and thus that the wearer is also in a non-motion state.
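The two criteria (preset speed, preset displacement) can be sketched as a simple check; the threshold values below are assumptions for illustration:

```python
# Illustrative motion-state check from the two criteria described above:
# reaching either the preset speed or the preset displacement marks a
# motion state (threshold values are assumed, not from the application).
def is_moving(speed_mps, displacement_m,
              preset_speed=0.5, preset_displacement=1.0):
    return speed_mps >= preset_speed or displacement_m >= preset_displacement

print(is_moving(0.8, 0.0), is_moving(0.1, 0.2))
```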
- the wearable device performs three-dimensional reconstruction of the real environment where the wearer is located according to the configured three-dimensional reconstruction strategy, and obtains a virtual reality environment corresponding to the real environment.
- the three-dimensional reconstruction of the real environment can be regarded as the process of digitizing the real existing environment, that is, the process of the wearable device recognizing the real environment in which it is located.
- the configuration of the 3D reconstruction strategy is not specifically limited in the embodiments of the present application, and a person of ordinary skill in the art can configure an appropriate 3D reconstruction strategy according to actual needs. For example, a 3D reconstruction strategy that achieves a certain balance between reconstruction accuracy and reconstruction efficiency can be selected according to the processing capability of the wearable device.
- after reconstructing the virtual reality environment corresponding to the real environment, the wearable device identifies whether the wearer has a movement risk according to the virtual reality environment and current movement data (including but not limited to movement direction, movement speed, movement acceleration, movement trend, etc.), where the movement risk includes but is not limited to the risk of falling, the risk of collision, etc.
- if a movement risk exists, the wearable device will give a safety prompt to the wearer, so as to avoid possible harm to the wearer caused by the movement risk. It should be noted that, in the embodiments of the present application, there are no specific restrictions on the manner in which the wearable device performs safety prompts, including but not limited to audio prompts, text prompts, video prompts, and fused prompts (such as the fusion of audio and text, or the fusion of video and text), etc.
- the wearable device further includes a camera module, which performs three-dimensional reconstruction on the real environment where the wearer is located to obtain a virtual reality environment, including:
- the wearable device further includes a camera module.
- the camera module includes a depth camera and an RGB camera.
- FIG. 8 shows an optional setting method of the depth camera and the RGB camera. It should be noted that the depth camera and the RGB camera are not limited to the setting manner shown in FIG. 8 , and may also be set by those of ordinary skill in the art according to actual needs.
- the depth camera selected in the embodiment of the present application is a time-of-flight camera.
- the time-of-flight camera uses the reflection characteristics of light to calculate depth (that is, the distance from an object to the time-of-flight camera).
- the electronic device can obtain a depth image of the real scene through the time-of-flight camera; the depth image includes depth values at different positions in the real scene, where each value reflects the distance from a position in the real scene to the time-of-flight camera (the smaller the value, the closer to the time-of-flight camera; the larger the value, the farther away).
- An RGB camera receives light reflected (or emitted) by an object and converts the light into a color image that carries the color characteristics of the object.
- the electronic device may obtain a color image of the real scene through the RGB camera when the RGB camera is aimed at the real scene, and the color image includes color data at different positions in the real scene.
- the acquired depth image includes the depth value of the real scene, but does not include the color data of the real scene
- the acquired color image includes the color data of the real scene, but does not include the depth value of the real scene
- the electronic device can synchronize the depth image and color image of the real scene acquired by the time-of-flight camera and the RGB camera.
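One illustrative way to fuse the synchronized depth and color images is to back-project each pixel with a pinhole camera model, attaching the color data to the resulting 3-D points; the intrinsic parameters below are assumed values, not details from the application:

```python
import numpy as np

def depth_rgb_to_points(depth, color, fx=500.0, fy=500.0, cx=2.0, cy=2.0):
    # Back-project every pixel (u, v) with its depth z into camera-space
    # coordinates using an assumed pinhole model, and pair it with its color.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = color.reshape(-1, 3)
    return points, colors

depth = np.full((4, 4), 2.0)           # every pixel 2 m from the ToF camera
color = np.zeros((4, 4, 3), np.uint8)  # matching synchronized color image
points, colors = depth_rgb_to_points(depth, color)
print(points.shape, float(points[:, 2].mean()))
```

A three-dimensional reconstruction algorithm would then fuse such colored point sets from multiple views into the virtual reality scene.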
- after acquiring the depth image and color image of the real scene, the wearable device performs three-dimensional reconstruction of the real scene based on the aforementioned depth image and color image according to the configured three-dimensional reconstruction algorithm, and correspondingly obtains a virtual reality scene corresponding to the real scene. It should be noted that there is no specific limitation on which three-dimensional reconstruction algorithm is adopted in the embodiments of the present application; it can be selected by those of ordinary skill in the art according to actual needs.
- FIG. 9 is another schematic flowchart of an audio enhancement method provided by an embodiment of the present application.
- the wearable device is smart glasses as an example for description.
- the process of the audio enhancement method provided by the embodiment of the present application may be as follows:
- the smart glasses acquire a first audio signal through a bone conduction microphone, and synchronously acquire a second audio signal through a non-bone conduction microphone.
- the transmission media of sound include solids, liquids, and air; bone conduction of sound belongs to the solid conduction of sound waves.
- bone conduction is a very common physiological phenomenon; for example, the sound we hear when chewing food is transmitted to the inner ear through the jawbone. When the user speaks, the vocal organs generate vibration, which is then transmitted through body tissues such as bones, muscles, and skin to other parts, such as the bridge of the nose and the ear bones.
- the bone conduction microphone can convert the vibration of the body part caused by the user's pronunciation into sound signals, thereby restoring the sound made by the user.
- the smart glasses include a bone conduction microphone and a non-bone conduction microphone.
- the non-bone conduction microphone includes any type of microphone that collects audio based on the principle of air and/or liquid transmission of sound, including but not limited to dynamic, condenser, piezoelectric, electromagnetic, carbon, and semiconductor microphones.
- the non-bone conduction microphone is arranged at the end of the temple of the smart glasses close to the frame, and the bone conduction microphone is arranged at the end of the temple away from the frame.
- when the smart glasses are in the wearing state, the bone conduction microphone directly contacts the user's ear bones, while the non-bone conduction microphone does not directly contact the user.
- FIG. 2 shows only an optional arrangement of the bone conduction microphone and the non-bone conduction microphone, and the bone conduction microphone and the non-bone conduction microphone are not limited to this arrangement in a specific implementation.
- the bone conduction microphone can also be arranged on the frame of the smart glasses and contact the user's nose bridge when wearing it.
- when triggered to perform audio collection, the smart glasses synchronously collect audio through the bone conduction microphone and the non-bone conduction microphone; the audio signal collected by the bone conduction microphone is denoted as the first audio signal, and the audio signal collected by the non-bone conduction microphone is denoted as the second audio signal.
- the embodiments of this application do not specifically limit how audio collection is triggered. For example, it may be triggered when the user operates the smart glasses to start an audio collection application to record, when the user operates the smart glasses to start an instant messaging or similar application to make an audio or video call, or when the smart glasses perform voice wake-up or voice recognition.
- the smart glasses extract spectral features of the first audio signal.
- the first audio signal collected by the bone conduction microphone can be effectively used to detect whether the wearer of the smart glasses has spoken.
- a pronunciation detection model is also pre-trained, and the pronunciation detection model is configured to detect whether the wearer speaks.
- a sample pronunciation signal of the wearer's pronunciation can be collected in advance, and the spectral feature of the sample pronunciation signal can be extracted as a training sample, and then the training sample can be used for model training to obtain a pronunciation detection model.
- when detecting whether the wearer of the smart glasses has spoken according to the first audio signal, the smart glasses can extract the spectral features of the first audio signal.
- the smart glasses perform pronunciation detection on the spectral features through the pre-trained pronunciation detection model to determine whether the wearer has spoken, if yes, go to 204 , otherwise go to 207 .
- in addition to extracting the spectral features of the first audio signal, the smart glasses also call the pre-trained pronunciation detection model and input the extracted spectral features into it for pronunciation detection, so as to obtain a detection result of whether the wearer has spoken. If the detection result indicates that the wearer has spoken, go to 204; if it indicates that the wearer has not spoken, go to 207.
- the smart glasses identify the pronunciation type of the wearer to obtain the pronunciation type of the wearer.
- the pronunciation types include but are not limited to speaking, singing, humming, and the like.
- a pronunciation enhancement model suitable for performing pronunciation enhancement processing on the pronunciation type audio signal is respectively trained, and a first preset correspondence between the pronunciation type and the pronunciation enhancement model is obtained accordingly.
- when performing pronunciation enhancement processing on the first audio signal through the pre-trained pronunciation enhancement model, the smart glasses first identify the wearer's pronunciation type to obtain the wearer's pronunciation type.
- the recognition mode of pronunciation type is not specifically limited, and can be configured by those of ordinary skill in the art according to actual needs.
- a pronunciation type recognition model for pronunciation type recognition can be pre-trained. The pronunciation type recognition model is called to identify the pronunciation type.
- the smart glasses determine a target pronunciation enhancement model corresponding to the wearer's pronunciation type according to the preset correspondence between the pronunciation type and the pronunciation enhancement model.
- after recognizing the wearer's pronunciation type, the smart glasses determine the pronunciation enhancement model corresponding to that pronunciation type according to the first preset correspondence between pronunciation types and pronunciation enhancement models, and record it as the target pronunciation enhancement model. For example, suppose a pronunciation enhancement model A suitable for pronunciation enhancement processing of the pronunciation type "speaking", a pronunciation enhancement model B suitable for the pronunciation type "singing", and a pronunciation enhancement model C suitable for the pronunciation type "humming" are pre-trained; if the wearer's pronunciation type is recognized as "speaking", the pronunciation enhancement model A is determined as the target pronunciation enhancement model.
- the smart glasses perform pronunciation enhancement processing on the first audio signal through the target pronunciation enhancement model to obtain a pronunciation enhancement signal.
- after determining the target pronunciation enhancement model, the smart glasses call the target pronunciation enhancement model and input the first audio signal into it for pronunciation enhancement processing to obtain the corresponding pronunciation enhancement signal.
- the smart glasses identify the environment type of the real environment to obtain the environment type of the real environment.
- the division of the environment types there is no specific limitation on the division of the environment types, and the division can be performed by those of ordinary skill in the art according to actual needs.
- the types of environments can be divided into indoor environments and outdoor environments.
- an environmental sound enhancement model suitable for performing environmental sound enhancement processing on the audio signal of the environmental type is respectively trained, and the second preset correspondence between the environmental type and the environmental sound enhancement model is obtained accordingly. relation.
- the smart glasses first identify the environment type of the real environment where the wearer is located, and obtain the environment type of the real scene where the wearer is located.
- the smart glasses invoke the environment type identification model to identify the environment type.
- the smart glasses determine a target ambient sound enhancement model corresponding to the environment type of the real environment according to the second preset correspondence between the environment type and the ambient sound enhancement model.
- after recognizing the environment type of the wearer's real environment, the smart glasses determine the environmental sound enhancement model corresponding to that environment type according to the second preset correspondence between environment types and environmental sound enhancement models, and denote it as the target ambient sound enhancement model.
- for example, an ambient sound enhancement model D suitable for performing ambient sound enhancement processing on the environment type "indoor environment" and an ambient sound enhancement model E suitable for the environment type "outdoor environment" are pre-trained; if the environment type of the real scene where the wearer is located is "outdoor environment", the ambient sound enhancement model E is determined as the target ambient sound enhancement model.
- the smart glasses perform ambient sound enhancement processing on the second audio signal through the target ambient sound enhancement model to obtain an ambient sound enhancement signal.
- after determining the target ambient sound enhancement model, the smart glasses invoke the target ambient sound enhancement model and input the second audio signal into it for ambient sound enhancement processing to obtain the corresponding ambient sound enhancement signal.
- FIG. 10 is a schematic structural diagram of an audio enhancement apparatus provided by an embodiment of the present application.
- the audio enhancement device is applied to the wearable device provided in the present application, and the wearable device includes a bone conduction microphone and a non-bone conduction microphone.
- the audio enhancement device may include:
- the audio collection module 301 is used to collect the first audio signal through the bone conduction microphone, and obtain the second audio signal synchronously through the non-bone conduction microphone;
- the pronunciation enhancement module 302 is configured to perform enhancement processing on the first audio signal when the wearer's pronunciation is detected according to the first audio signal to obtain the wearer's pronunciation enhancement signal;
- the ambient sound enhancement module 303 is configured to perform enhancement processing on the second audio signal when it is detected that the wearer does not speak according to the first audio signal, so as to obtain an ambient sound enhancement signal of the actual environment where the wearer is located.
- the pronunciation enhancement module 302, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, is configured to:
- the anti-phase superposition of the bone conduction noise signal and the first audio signal is performed to obtain a pronunciation enhancement signal.
- the pronunciation enhancement module 302, when acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal, is configured to:
- the bone conduction noise signal during the acquisition of the first audio signal is predicted according to the noise prediction model.
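A minimal sketch of the two steps above, under simplifying assumptions: (1) "predict" the bone conduction noise during capture by averaging historical noise frames, a naive stand-in for the trained noise prediction model the text describes, and (2) perform the anti-phase superposition, i.e. add the inverted estimate to the captured first audio signal so the noise component cancels.

```python
import numpy as np

def predict_noise(history: np.ndarray, frame_len: int, n_frames: int) -> np.ndarray:
    """Average the historical noise frames and repeat the mean frame forward.
    (Illustrative predictor; the disclosure's model is learned, not specified.)"""
    usable = history[: len(history) // frame_len * frame_len]
    mean_frame = usable.reshape(-1, frame_len).mean(axis=0)
    return np.tile(mean_frame, n_frames)

def anti_phase_superpose(first_audio: np.ndarray, noise_estimate: np.ndarray) -> np.ndarray:
    """Superpose the inverted noise estimate onto the signal: first_audio + (-noise)."""
    return first_audio - noise_estimate
```

With a periodic noise source, the averaged historical frame matches the noise during capture and the superposition leaves only the wearer's voice component.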
- the pronunciation enhancement module 302, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, is configured to:
- the pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal by using the pre-trained pronunciation enhancement model.
- the ambient sound enhancement module 303, when performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located, is configured to:
- the second audio signal is subjected to beamforming processing according to the sound source direction to obtain an ambient sound enhancement signal.
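One common form of the beamforming step above is delay-and-sum, sketched here under the assumption that the estimated source direction has already been converted into an integer sample delay per microphone: aligning the channels and averaging reinforces the target source while uncorrelated noise partially cancels. (`np.roll` wraps at the edges, which is acceptable for a short sketch but not for production audio.)

```python
import numpy as np

def delay_and_sum(mic_signals: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """mic_signals: (n_mics, n_samples); delays[m]: samples by which mic m lags.
    Advance each channel by its delay, then average across microphones."""
    aligned = [np.roll(sig, -int(d)) for sig, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)
```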
- the audio enhancement device provided by the present application further includes a pronunciation detection module for:
- the pre-trained pronunciation detection model is used to perform pronunciation detection on the spectral features to obtain a detection result of whether the wearer is speaking.
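The detection path can be sketched as follows: frame the bone conduction signal, extract a spectral feature per frame, and classify. A fixed energy threshold stands in for the pre-trained pronunciation detection model, whose internals the disclosure does not specify; the frame length and threshold are assumptions.

```python
import numpy as np

def spectral_features(signal: np.ndarray, frame_len: int = 256) -> np.ndarray:
    """Log of the summed magnitude spectrum for each frame."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)).sum(axis=1))

def wearer_is_speaking(signal: np.ndarray, threshold: float = 3.0) -> bool:
    """True when the majority of frames carry enough spectral energy to suggest voicing."""
    feats = spectral_features(signal)
    return bool(np.mean(feats > threshold) > 0.5)
```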
- the first audio signal is acquired by the bone conduction microphone, and the second audio signal is simultaneously acquired by the non-bone conduction microphone;
- the pronunciation enhancement module 302 is further configured to perform enhancement processing on the first audio signal according to the second audio signal when the current configuration is to enhance the wearer's pronunciation; or,
- the ambient sound enhancement module 303 is further configured to perform enhancement processing on the second audio signal according to the first audio signal when the current configuration is to enhance the ambient sound.
- the audio enhancement device provided by the present application further includes a safety prompt module for:
- the wearable device further includes a camera module, and when performing three-dimensional reconstruction of the real environment to obtain a virtual reality environment, the safety prompt module is used for:
- the three-dimensional reconstruction of the real environment is carried out according to the depth image and the color image, and the virtual reality environment is obtained.
- the audio enhancement apparatus provided in the embodiment of the present application and the audio enhancement method in the above embodiments belong to the same concept; the audio enhancement apparatus can execute any of the methods provided in the audio enhancement method embodiments, and its specific implementation process is described in detail in the above related embodiments and is not repeated here.
- the embodiments of the present application provide a storage medium on which a computer program is stored; when the computer program is loaded by a wearable device including a bone conduction microphone and a non-bone conduction microphone, the steps in the audio enhancement method provided by the embodiments of the present application are executed.
- the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
- the wearable device includes a processor 401 , a memory 402 , a bone conduction microphone 403 and a non-bone conduction microphone 404 .
- the processor in the embodiment of the present application is a general-purpose processor, such as a processor of an ARM architecture.
- a computer program is stored in the memory 402, which may be a high-speed random access memory or a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
- the bone conduction microphone 403 includes a microphone for audio collection based on the principle of sound bone conduction.
- the non-bone conduction microphone 404 includes any type of microphone that collects audio based on the principle of sound transmission through air and/or liquid, including but not limited to dynamic, condenser, piezoelectric, electromagnetic, carbon, and semiconductor microphones.
- the memory 402 may further include a memory controller to provide access to the memory 402 by the processor 401, and the processor 401 implements the following functions by loading a computer program in the memory 402:
- the first audio signal is acquired by the bone conduction microphone 403, and the second audio signal is acquired by the non-bone conduction microphone 404 simultaneously;
- the first audio signal is enhanced to obtain the wearer's pronunciation enhancement signal
- enhancement processing is performed on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- the processor 401, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, executes:
- the anti-phase superposition of the bone conduction noise signal and the first audio signal is performed to obtain a pronunciation enhancement signal.
- the processor 401, when acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal, executes:
- the bone conduction noise signal during the acquisition of the first audio signal is predicted according to the noise prediction model.
- the processor 401, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, executes:
- the pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal by using the pre-trained pronunciation enhancement model.
- the processor 401 executes:
- the second audio signal is subjected to beamforming processing according to the sound source direction to obtain an ambient sound enhancement signal.
- the processor 401 further executes:
- the pre-trained pronunciation detection model is used to perform pronunciation detection on the spectral features to obtain a detection result of whether the wearer is speaking.
- audio collection is performed simultaneously through the bone conduction microphone 403 and the non-bone conduction microphone 404 to obtain the first audio signal collected by the bone conduction microphone 403 and the second audio signal collected by the non-bone conduction microphone 404.
- the processor 401 executes:
- the first audio signal is enhanced according to the second audio signal
- the second audio signal is enhanced according to the first audio signal.
- the processor 401 further executes:
- the wearable device further includes a camera module, and when performing three-dimensional reconstruction on the real environment to obtain a virtual reality environment, the processor 401 executes:
- the three-dimensional reconstruction of the real environment is carried out according to the depth image and the color image, and the virtual reality environment is obtained.
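One plausible first stage of the reconstruction step above can be sketched as back-projecting the depth image into a coloured 3-D point cloud with a pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`) are assumed inputs; the disclosure does not specify the reconstruction algorithm itself.

```python
import numpy as np

def depth_to_point_cloud(depth, color, fx, fy, cx, cy):
    """depth: (H, W) in metres; color: (H, W, 3); returns (N, 6) rows [x, y, z, r, g, b]."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx          # pinhole back-projection
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = color.reshape(-1, 3).astype(float)
    valid = points[:, 2] > 0           # discard pixels without a depth reading
    return np.hstack([points[valid], colors[valid]])
```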
- the wearable device provided by the embodiment of the present application and the audio enhancement method in the above embodiments belong to the same concept; any method provided in the audio enhancement method embodiments can be executed on the wearable device, and for details of its specific implementation process, please refer to the above embodiments, which will not be repeated here.
- those of ordinary skill in the art can understand that all or part of the process of implementing the audio enhancement method of the embodiments of the present application can be completed by controlling the relevant hardware through a computer program.
- the computer program can be stored in a computer-readable storage medium, such as the memory of the wearable device provided by this application, and executed by the processor in the wearable device; the execution process can include the flow of an embodiment of the audio enhancement method.
- the storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- General Health & Medical Sciences (AREA)
- Otolaryngology (AREA)
- General Physics & Mathematics (AREA)
- Ophthalmology & Optometry (AREA)
- Optics & Photonics (AREA)
- Artificial Intelligence (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
Disclosed in embodiments of the present application are an audio enhancement method and apparatus, a storage medium, and a wearable device. The method comprises: acquiring a first audio signal by a bone conduction microphone, and acquiring a second audio signal by a non-bone conduction microphone; according to the first audio signal, detecting whether the wearer produces voice; if yes, performing enhancement on the first audio signal to obtain a voice-enhanced signal; and if not, performing enhancement on the second audio signal to obtain an environment sound-enhanced signal of the real environment.
Description
This application claims the priority of the Chinese patent application with application number 202010802651.8, entitled "Audio Enhancement Method, Apparatus, Storage Medium and Wearable Device", filed with the China Patent Office on August 11, 2020, the entire contents of which are incorporated in this application by reference.
The present application relates to the technical field of audio processing, and in particular to an audio enhancement method, apparatus, storage medium, and wearable device.
At present, wearable devices such as smart glasses and smart helmets have gradually entered people's lives. By installing different applications on a wearable device, the wearable device can realize a variety of functions. For example, by installing an application with audio collection capability on the wearable device and combining it with the audio collection hardware (such as a microphone) configured on the wearable device itself, an audio collection function can be provided to the user.
SUMMARY OF THE INVENTION
Embodiments of the present application provide an audio enhancement method, apparatus, storage medium, and wearable device, which can improve the flexibility with which a wearable device performs audio enhancement.
In a first aspect, an embodiment of the present application provides an audio enhancement method, applied to a wearable device, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone, and the audio enhancement method includes:
collecting a first audio signal through the bone conduction microphone, and synchronously collecting a second audio signal through the non-bone conduction microphone;
when it is detected according to the first audio signal that the wearer is speaking, performing enhancement processing on the first audio signal to obtain a pronunciation enhancement signal of the wearer;
when it is detected according to the first audio signal that the wearer is not speaking, performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
In a second aspect, an embodiment of the present application provides an audio enhancement apparatus, applied to a wearable device, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone, and the audio enhancement apparatus includes:
an audio collection module, configured to collect a first audio signal through the bone conduction microphone, and synchronously collect a second audio signal through the non-bone conduction microphone;
a pronunciation enhancement module, configured to perform enhancement processing on the first audio signal when it is detected according to the first audio signal that the wearer is speaking, to obtain a pronunciation enhancement signal of the wearer;
an ambient sound enhancement module, configured to perform enhancement processing on the second audio signal when it is detected according to the first audio signal that the wearer is not speaking, to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when the computer program is loaded by a wearable device including a bone conduction microphone and a non-bone conduction microphone, the audio enhancement method provided by the embodiments of the present application is executed.
In a fourth aspect, an embodiment of the present application provides a wearable device, where the wearable device includes a bone conduction microphone, a non-bone conduction microphone, a processor, and a memory; the memory stores a computer program, and the processor executes the audio enhancement method provided by the embodiments of the present application by loading the computer program.
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of an audio enhancement method provided by an embodiment of the present application.
FIG. 2 is a schematic diagram of the positions of the bone conduction microphone and the non-bone conduction microphone when the wearable device takes the physical form of smart glasses in an embodiment of the present application.
FIG. 3 is a schematic diagram of the smart glasses in FIG. 2 being worn.
FIG. 4 is a schematic diagram of performing enhancement processing on the first audio signal in an embodiment of the present application.
FIG. 5 is another schematic diagram of performing enhancement processing on the first audio signal in an embodiment of the present application.
FIG. 6 is a schematic diagram of the positions of multiple non-bone conduction microphones when the wearable device takes the physical form of smart glasses in an embodiment of the present application.
FIG. 7 is a schematic diagram of predicting the direction of a sound source in a real environment in an embodiment of the present application.
FIG. 8 is a schematic diagram of the position of the camera module when the wearable device takes the physical form of smart glasses in an embodiment of the present application.
FIG. 9 is another schematic flowchart of an audio enhancement method provided by an embodiment of the present application.
FIG. 10 is a schematic structural diagram of an audio enhancement apparatus provided by an embodiment of the present application.
FIG. 11 is a schematic structural diagram of a wearable device provided by an embodiment of the present application.
Please refer to the drawings, wherein the same component symbols represent the same components; the principles of the present application are exemplified by implementation in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application, which should not be construed as limiting other specific embodiments not detailed herein.
It should be noted that relational terms such as first and second in the following embodiments of this application are only used to distinguish one entity or operation from another, and are not used to imply any actual sequential relationship between these entities or operations.
In the related art, in order to better enhance the user's voice, devices such as smart glasses and smart helmets require the positions of the user and the microphone to be designed and measured at the design stage, so that the position of the user's mouth and the position of the microphone are relatively fixed. After that, the relative positions of the mouth and the microphone are further used to adjust the parameters of the algorithm, so that the microphone has a good enhancement effect on the user's voice and a good suppression effect on environmental noise. However, in the related art, the wearable device only enhances the user's own voice and suppresses other sounds; when the user wants to collect sounds other than his or her own, these sounds are suppressed as noise.
To this end, the present application provides an audio enhancement method, an audio enhancement apparatus, a storage medium, and a wearable device. The execution subject of the audio enhancement method may be the audio enhancement apparatus provided in the embodiments of the present application, or a wearable device integrated with the audio enhancement apparatus, where the audio enhancement apparatus may be implemented in hardware or software. It should be noted that the embodiments of the present application do not specifically limit the physical form of the wearable device; for example, the wearable device may take the form of smart glasses, a smart helmet, and the like.
An embodiment of the present application provides an audio enhancement method, applied to a wearable device, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone, and the audio enhancement method includes:
collecting a first audio signal through the bone conduction microphone, and synchronously collecting a second audio signal through the non-bone conduction microphone;
when it is detected according to the first audio signal that the wearer is speaking, performing enhancement processing on the first audio signal to obtain a pronunciation enhancement signal of the wearer;
when it is detected according to the first audio signal that the wearer is not speaking, performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
In an embodiment, performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal includes:
acquiring a historical bone conduction noise signal of the wearer from before the collection of the first audio signal;
acquiring a bone conduction noise signal of the wearer during the collection of the first audio signal according to the historical bone conduction noise signal;
superposing the bone conduction noise signal in anti-phase with the first audio signal to obtain the pronunciation enhancement signal.
In an embodiment, acquiring the bone conduction noise signal of the wearer during the collection of the first audio signal according to the historical bone conduction noise signal includes:
performing model training according to the historical bone conduction noise signal to obtain a noise prediction model corresponding to the wearer;
predicting the bone conduction noise signal during the collection of the first audio signal according to the noise prediction model.
In an embodiment, performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal includes:
performing pronunciation enhancement processing on the first audio signal through a pre-trained pronunciation enhancement model to obtain the pronunciation enhancement signal.
In an embodiment, there are multiple non-bone conduction microphones, and performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located includes:
predicting the sound source direction of the real environment according to the multiple second audio signals collected by the multiple non-bone conduction microphones;
performing beamforming processing on the second audio signal according to the sound source direction to obtain the ambient sound enhancement signal.
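The direction-prediction step above can be sketched under a common assumption: estimate the time difference of arrival (TDOA) between two of the non-bone conduction microphones by cross-correlation, from which, given the microphone spacing, a source bearing can be derived. The disclosure does not name a specific estimation method; cross-correlation is one standard choice.

```python
import numpy as np

def estimate_delay(ref: np.ndarray, other: np.ndarray) -> int:
    """Sample delay of `other` relative to `ref` (positive means `other` lags)."""
    corr = np.correlate(other, ref, mode="full")   # full cross-correlation over all lags
    return int(np.argmax(corr)) - (len(ref) - 1)   # shift the peak index to a signed lag
```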
In an embodiment, the audio enhancement method further includes:
extracting spectral features of the first audio signal;
performing pronunciation detection on the spectral features through a pre-trained pronunciation detection model to obtain a detection result of whether the wearer is speaking.
In an embodiment, after synchronously performing audio collection through the bone conduction microphone and the non-bone conduction microphone to obtain the first audio signal collected by the bone conduction microphone and the second audio signal collected by the non-bone conduction microphone, the method further includes:
when the current configuration is to enhance the wearer's pronunciation, performing enhancement processing on the first audio signal according to the second audio signal; or,
when the current configuration is to enhance the ambient sound, performing enhancement processing on the second audio signal according to the first audio signal.
In an embodiment, the audio enhancement method further includes:
when the wearer is in motion, performing three-dimensional reconstruction of the real environment to obtain a virtual reality environment;
identifying, according to the virtual reality environment, whether the wearer is at risk while moving;
giving the wearer a safety prompt when it is identified that the wearer is at risk.
In an embodiment, the wearable device further includes a camera module, and performing three-dimensional reconstruction of the real environment to obtain the virtual reality environment includes:
acquiring a depth image and a color image of the real environment through the camera module;
performing three-dimensional reconstruction of the real environment according to the depth image and the color image to obtain the virtual reality environment.
Please refer to FIG. 1, which is a schematic flowchart of an audio enhancement method provided by an embodiment of the present application. The following description takes the wearable device provided by the present application as the execution subject of the audio enhancement method, where the wearable device includes a bone conduction microphone and a non-bone conduction microphone. As shown in FIG. 1, the flow of the audio enhancement method provided by the embodiment of the present application may be as follows:
In 101, a first audio signal is acquired through the bone conduction microphone, and a second audio signal is synchronously acquired through the non-bone conduction microphone.
It should be noted that the transmission media of sound include solids, liquids, and air, and bone conduction of sound belongs to the solid conduction of sound waves. Bone conduction is a very common physiological phenomenon; for example, the sound of chewing food that we hear is transmitted to the inner ear through the jawbone. When the user speaks, the vocal organs produce vibrations, which are transmitted through body tissues such as bones, muscles, and skin to other parts of the body, such as the bridge of the nose and the ear bones. Using this principle, a bone conduction microphone can convert the vibration of the body parts caused by the user's speech into a sound signal, thereby restoring the sound made by the user.
In the embodiment of the present application, the wearable device includes a bone conduction microphone and a non-bone conduction microphone. The non-bone conduction microphone includes any type of microphone that collects audio based on the principle of sound transmission through air and/or liquid, including but not limited to dynamic, condenser, piezoelectric, electromagnetic, carbon, and semiconductor microphones.
For example, referring to FIG. 2, taking smart glasses as the physical form of the wearable device, the non-bone conduction microphone is arranged at the end of the temple near the frame, and the bone conduction microphone is arranged at the end of the temple away from the frame. Referring to FIG. 3, when the smart glasses are worn, the bone conduction microphone is in direct contact with the user's ear bone, while the non-bone conduction microphone is not in direct contact with the user.
It should be noted that FIG. 2 shows only one optional arrangement of the bone conduction microphone and the non-bone conduction microphone; in a specific implementation, the arrangement is not limited thereto. For example, the bone conduction microphone may also be arranged on the frame of the smart glasses and contact the bridge of the user's nose when worn.
In the embodiment of the present application, when audio collection is triggered, the wearable device synchronously performs audio collection through the bone conduction microphone and the non-bone conduction microphone; the audio signal collected by the bone conduction microphone is recorded as the first audio signal, and the audio signal collected by the non-bone conduction microphone is recorded as the second audio signal.
It should be noted that the embodiments of the present application do not specifically limit how audio collection is triggered. For example, audio collection may be triggered when the user operates the wearable device to start an audio collection application for recording, when the user operates the wearable device to start an instant messaging application for an audio or video call, or when the wearable device performs voice wake-up or voice recognition.
In 102, when it is detected according to the first audio signal that the wearer is speaking, enhancement processing is performed on the first audio signal to obtain the wearer's pronunciation enhancement signal.
According to the audio collection principle of the bone conduction microphone, audio collection by the bone conduction microphone is subject to little interference from sources other than the wearer; therefore, the first audio signal collected by the bone conduction microphone can be effectively used to detect whether the wearer of the wearable device is speaking.
The wearable device may detect, based on the first audio signal and according to the configured pronunciation detection strategy, whether the wearer of the wearable device is speaking. It should be noted that the embodiments of the present application do not specifically limit the configuration of the pronunciation detection strategy, which can be configured by those of ordinary skill in the art according to actual needs. For example, the pronunciation detection strategy may be configured as: using a Voice Activity Detection (VAD) algorithm to detect whether the wearer of the wearable device is speaking.
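A minimal energy-based VAD, one simple instance of the detection strategy mentioned above, can be sketched as follows: split the bone conduction signal into short frames and mark frames whose short-time energy exceeds a threshold as voice-active. The frame length and threshold are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def vad_frames(signal: np.ndarray, frame_len: int = 160, threshold: float = 0.01) -> np.ndarray:
    """Return one boolean per frame: True where voice activity is detected."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1)   # short-time energy per frame
    return energy > threshold
```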
In the embodiments of the present application, if it is detected according to the first audio signal that the wearer of the wearable device is vocalizing, the wearer's vocalization is taken as the enhancement object, and enhancement processing is performed on the first audio signal according to a configured voice enhancement strategy, correspondingly obtaining the wearer's voice enhancement signal. The enhancement processing on the first audio signal can be regarded as a process of eliminating the non-vocalization components in the first audio signal; by eliminating these components, the purpose of enhancing the wearer's voice is achieved.
The embodiments of the present application do not specifically limit the configuration of the voice enhancement strategy, which can be configured by those of ordinary skill in the art according to actual needs, including artificial-intelligence-based voice enhancement strategies and non-artificial-intelligence voice enhancement strategies.
In 103, when it is detected according to the first audio signal that the wearer is not vocalizing, enhancement processing is performed on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
In the embodiments of the present application, if it is detected according to the first audio signal that the wearer of the wearable device is not vocalizing, the ambient sound of the real environment where the wearer is located is taken as the enhancement object, and enhancement processing is performed on the second audio signal according to a configured ambient sound enhancement strategy, correspondingly obtaining the ambient sound enhancement signal of the real environment where the wearer is located.
The configuration of the ambient sound enhancement strategy is likewise not specifically limited in the embodiments of the present application and can be configured by those of ordinary skill in the art according to actual needs, including artificial-intelligence-based and non-artificial-intelligence ambient sound enhancement strategies.
It can be seen from the above that the audio enhancement method provided by the present application is suitable for wearable devices including a bone conduction microphone and a non-bone conduction microphone: a first audio signal is collected through the bone conduction microphone, and a second audio signal is synchronously collected through the non-bone conduction microphone; whether the wearer of the wearable device is vocalizing is detected according to the first audio signal; if the wearer is detected to be vocalizing, the wearer's vocalization is taken as the enhancement object and the first audio signal is enhanced, obtaining the wearer's voice enhancement signal; if the wearer is detected not to be vocalizing, the ambient sound of the real environment where the wearer is located is taken as the enhancement object and the second audio signal is enhanced, obtaining the ambient sound enhancement signal of that environment. Thus, by arranging a bone conduction microphone and a non-bone conduction microphone on the wearable device and using the audio collection result of the bone conduction microphone to dynamically determine the audio enhancement object, the present application performs audio enhancement processing flexibly.
Optionally, in an embodiment, performing enhancement processing on the first audio signal to obtain the wearer's voice enhancement signal includes:
(1) obtaining a historical bone conduction noise signal of the wearer from before the first audio signal was collected;
(2) obtaining, according to the historical bone conduction noise signal, the wearer's bone conduction noise signal during the collection of the first audio signal;
(3) superimposing the bone conduction noise signal onto the first audio signal in anti-phase to obtain the voice enhancement signal.
The embodiments of the present application further provide an optional voice enhancement strategy.
It is easy to understand that, just as there are various noises in the environment (for example, the noise of running computers and of keyboard typing in an office), the human body also produces various noises, such as breathing sounds, heartbeat sounds, chewing sounds, and coughing sounds. Therefore, when the wearable device collects audio through the bone conduction microphone, it is clearly difficult to collect a pure voice signal: the collected first audio signal includes not only the wearer's voice component but also components such as the wearer's breathing, heartbeat, chewing, and/or coughing sounds.
As mentioned above, the external environment interferes little with the bone conduction microphone; that is, the interference picked up by the bone conduction microphone mainly comes from the wearer. In the embodiments of the present application, the first audio signal is regarded as composed of two parts: the pure voice signal of the wearer, and the bone conduction noise produced by the wearer (including but not limited to breathing, heartbeat, chewing, and coughing sounds). Therefore, eliminating the bone conduction noise in the first audio signal enhances the wearer's voice signal.
It can be understood that if the wearer of the wearable device vocalizes, the wearable device collects, through the bone conduction microphone, a mixture of the wearer's voice signal and the bone conduction noise signal; if the wearer does not vocalize, the wearable device collects only the wearer's bone conduction noise signal through the bone conduction microphone. In the embodiments of the present application, when the wearer is not vocalizing, the wearable device buffers the collected bone conduction noise signal.
When performing enhancement processing on the first audio signal, the wearable device first obtains the wearer's bone conduction noise signal from before the first audio signal was collected, recorded as the historical bone conduction noise signal. For example, the wearable device takes the collection start moment of the first audio signal as an end moment, and obtains the historical bone conduction noise signal of a preset duration collected before that end moment (the preset duration can be set to a suitable value by those of ordinary skill in the art according to actual needs; the embodiments of the present application do not specifically limit it, and it may, for example, be set to 500 ms).
For example, if the preset duration is configured to be 500 milliseconds and the start moment of the aforementioned first audio signal is 09:13:56.500 on July 28, 2020, the wearable device obtains the 500-millisecond historical bone conduction noise signal buffered from the aforementioned bone conduction microphone between 09:13:56.000 and 09:13:56.500 on July 28, 2020.
After obtaining the wearer's historical bone conduction noise signal from before the first audio signal was collected, the wearable device further obtains, according to the historical bone conduction noise signal, the wearer's bone conduction noise signal during the collection of the first audio signal.
For example, the wearable device may obtain the noise distribution characteristics of the historical bone conduction noise signal, and then generate, according to these characteristics, a noise signal of the same duration as the first audio signal to serve as the wearer's bone conduction noise signal during the collection of the first audio signal.
As another example, considering the stability of noise (its variation over continuous time is usually small), the wearable device may also directly use the obtained historical bone conduction noise signal as the bone conduction noise signal during the collection of the aforementioned first audio signal. If the duration of the historical bone conduction noise signal is greater than that of the first audio signal, a portion of the same duration as the first audio signal may be cut from the historical bone conduction noise signal to serve as the wearer's bone conduction noise signal during the collection of the first audio signal. If the duration of the historical bone conduction noise signal is less than that of the first audio signal, the historical bone conduction noise signal may be copied to obtain multiple copies, which are then spliced to obtain a noise signal of the same duration as the first audio signal, serving as the wearer's bone conduction noise signal during the collection of the first audio signal.
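The truncate-or-splice rule just described can be sketched as follows (an illustrative sketch, not the patent's implementation; sample values are made up):

```python
import numpy as np

def match_noise_length(history: np.ndarray, target_len: int) -> np.ndarray:
    """Truncate or tile a buffered noise signal to a target number of samples."""
    if len(history) >= target_len:
        # History is long enough: cut out a segment of the target duration.
        return history[:target_len]
    # History is too short: splice copies end to end, then trim the excess.
    repeats = int(np.ceil(target_len / len(history)))
    return np.tile(history, repeats)[:target_len]

noise = np.array([0.1, -0.2, 0.3])
short = match_noise_length(noise, 2)   # truncation case
long = match_noise_length(noise, 7)    # splicing case
```

Splicing by plain concatenation can introduce clicks at the copy boundaries; a practical system might cross-fade the joins.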
Referring to FIG. 4, after obtaining the wearer's bone conduction noise signal during the collection of the first audio signal, the wearable device first performs anti-phase processing on the bone conduction noise signal to obtain an anti-phase bone conduction noise signal; the wearable device then superimposes the anti-phase bone conduction noise signal onto the first audio signal, thereby cancelling the bone conduction noise portion of the first audio signal and leaving the pure voice portion of the wearer, i.e., the voice enhancement signal.
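A toy numerical sketch of the anti-phase superposition (assuming, for illustration, that the noise estimate matches the actual noise exactly; the tone and noise here are stand-ins, not real bone-conduction data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(800) / 16000
voice = 0.4 * np.sin(2 * np.pi * 150 * t)    # stand-in for the wearer's voice
noise = 0.05 * rng.standard_normal(t.size)   # stand-in for bone-conduction noise
first_audio = voice + noise                  # what the bone conduction mic records

anti_phase = -noise                          # anti-phase processing of the noise estimate
enhanced = first_audio + anti_phase          # superposition cancels the noise

residual = float(np.max(np.abs(enhanced - voice)))
```

In practice the estimate is only approximate (it comes from the history buffer or a prediction model), so cancellation reduces rather than fully removes the noise.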
Optionally, in an embodiment, obtaining the wearer's bone conduction noise signal during the collection of the first audio signal according to the historical bone conduction noise signal includes:
(1) performing model training according to the historical bone conduction noise signal to obtain a noise prediction model corresponding to the wearer;
(2) predicting the bone conduction noise signal during the collection of the first audio signal according to the noise prediction model.
The embodiments of the present application further provide an optional scheme for obtaining the wearer's bone conduction noise signal during the collection of the first audio signal according to the historical bone conduction noise signal.
After obtaining the historical bone conduction noise signal, the wearable device uses it as sample data and performs model training according to a preset training algorithm, obtaining a noise prediction model corresponding to the wearer.
It should be noted that the training algorithm is a machine learning algorithm; machine learning algorithms can predict data by continuously learning features. For example, the wearable device may predict the current noise distribution based on the historical noise distribution. The machine learning algorithm may include decision tree algorithms, regression algorithms, Bayesian algorithms, neural network algorithms (which may include deep neural networks, convolutional neural networks, recurrent neural networks, etc.), clustering algorithms, and so on. Which training algorithm is selected as the preset training algorithm for model training can be chosen by those of ordinary skill in the art according to actual needs.
For example, suppose the training algorithm configured on the wearable device is the Gaussian mixture model algorithm (a regression algorithm). After obtaining the historical bone conduction noise signal, the wearable device uses it as sample data and performs model training according to the Gaussian mixture model algorithm, obtaining a Gaussian mixture model (comprising multiple Gaussian components that describe the noise distribution), which is used as the noise prediction model. Afterwards, the wearable device takes the start and end moments of the first audio signal's collection period as inputs to the noise prediction model, and obtains the predicted noise signal output by the model as the wearer's bone conduction noise signal during the collection of the first audio signal.
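As a drastically simplified stand-in for the step above (the patent trains a Gaussian *mixture* model; here a single Gaussian is fitted to the buffered samples purely to illustrate "learn the noise distribution, then generate noise of the needed duration"; all values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(7)
history = 0.02 * rng.standard_normal(8000)   # 500 ms of buffered noise at 16 kHz

# "Training": fit the parameters of a single Gaussian to the history buffer.
mu = float(np.mean(history))
sigma = float(np.std(history))

# "Prediction": draw a synthetic noise signal for the first audio signal's duration.
target_len = 16000                           # 1 s of first-audio-signal collection
predicted_noise = rng.normal(mu, sigma, target_len)
```

A real mixture model would capture multimodal noise (e.g., heartbeat pulses on top of breathing) that a single Gaussian cannot.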
Optionally, in an embodiment, obtaining the wearer's bone conduction noise signal during the collection of the first audio signal according to the historical bone conduction noise signal includes:
(1) obtaining a pre-trained universal bone conduction noise model;
(2) performing adaptation processing on the universal bone conduction noise model according to the historical bone conduction noise signal to obtain a noise prediction model corresponding to the wearer;
(3) predicting the bone conduction noise signal during the collection of the first audio signal according to the noise prediction model.
The embodiments of the present application further provide an optional scheme for obtaining the wearer's bone conduction noise signal during the collection of the first audio signal according to the historical bone conduction noise signal.
After obtaining the wearer's historical bone conduction noise signal from before the first audio signal was collected, the wearable device further obtains a pre-trained universal bone conduction noise model.
The wearable device then extracts the acoustic features of the historical bone conduction noise signal and, based on these acoustic features, performs adaptation processing on the universal bone conduction noise model to obtain a noise prediction model corresponding to the wearer.
Here, adaptation processing refers to a processing method that takes, from the universal bone conduction noise model, the acoustic features of non-specific users that are close to the wearer's bone conduction noise and treats them as the wearer's acoustic features. This adaptation can be implemented using a maximum a posteriori (MAP) estimation algorithm. MAP estimation derives an estimate of a quantity that is difficult to observe from empirical data: during estimation, the prior probability and Bayes' theorem are used to obtain the posterior probability, and the objective function (i.e., the expression of the wearer's bone conduction noise model) is the likelihood function of the posterior probability. The parameter values that maximize this likelihood function are then solved for (for example, a gradient descent algorithm can be used to find the maximum of the likelihood function). This achieves the effect of training the acoustic features of the non-specific speakers in the universal bone conduction noise model that are close to the wearer together with the wearer's own acoustic features, so that the bone conduction noise model corresponding to the wearer is obtained from the parameter values that maximize the likelihood function. Afterwards, the wearer's bone conduction noise signal during the collection of the first audio signal can be predicted according to this noise prediction model.
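The MAP step described above can be written compactly in the standard form (the symbols here are ours, not the patent's: θ denotes the noise model parameters, X the wearer's buffered noise features):

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(\theta \mid X)
  = \arg\max_{\theta}\; p(X \mid \theta)\, p(\theta)
```

Here p(θ) is the prior carried over from the universal bone conduction noise model, and p(X | θ) is the likelihood of the wearer's historical noise under candidate parameters, so the adapted model is pulled from the universal prior toward the wearer's own data.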
Optionally, in an embodiment, performing enhancement processing on the first audio signal to obtain the wearer's voice enhancement signal includes:
performing voice enhancement processing on the first audio signal through a pre-trained voice enhancement model to obtain the voice enhancement signal.
The embodiments of the present application further provide an optional voice enhancement strategy.
It should be noted that in the embodiments of the present application a voice enhancement model is trained in advance, and this voice enhancement model is configured to enhance the user-voice portion of an audio signal.
Correspondingly, when performing enhancement processing on the first audio signal, the wearable device invokes the pre-trained voice enhancement model and inputs the first audio signal into the invoked voice enhancement model for voice enhancement processing, thereby obtaining the corresponding voice enhancement signal.
For example, referring to FIG. 5, the first audio signal is composed of two parts: the voice signal of the wearer's vocalization, and other noise signals. Voice enhancement processing is performed on the first audio signal through the pre-trained voice enhancement model, obtaining a voice enhancement signal in which the noise signal is suppressed.
Optionally, in an embodiment, performing voice enhancement processing on the first audio signal through the pre-trained voice enhancement model to obtain the voice enhancement signal includes:
(1) recognizing the wearer's vocalization type to obtain the wearer's vocalization type;
(2) determining, according to a first preset correspondence between vocalization types and voice enhancement models, a target voice enhancement model corresponding to the wearer's vocalization type;
(3) performing voice enhancement processing on the first audio signal through the target voice enhancement model to obtain the voice enhancement signal.
Vocalization types include, but are not limited to, speaking, singing, humming, and so on. In the embodiments of the present application, for each different vocalization type, a voice enhancement model suited to performing voice enhancement processing on audio signals of that type is trained, and the first preset correspondence between vocalization types and voice enhancement models is obtained accordingly.
In the embodiments of the present application, when performing voice enhancement processing on the first audio signal through the pre-trained voice enhancement model, the wearable device first recognizes the wearer's vocalization type. The embodiments of the present application do not specifically limit the manner of recognizing the vocalization type, which can be configured by those of ordinary skill in the art according to actual needs; for example, a vocalization type recognition model may be pre-trained and invoked when needed to recognize the vocalization type.
After recognizing the wearer's vocalization type, the wearable device determines, according to the first preset correspondence between vocalization types and voice enhancement models, the voice enhancement model corresponding to the wearer's vocalization type, recorded as the target voice enhancement model. For example, suppose voice enhancement model A suited to the vocalization type "speaking", voice enhancement model B suited to the vocalization type "singing", and voice enhancement model C suited to the vocalization type "humming" are trained in advance; if the wearer's vocalization type is recognized as "speaking", voice enhancement model A can be determined as the target voice enhancement model.
After determining the target voice enhancement model, the wearable device invokes the target voice enhancement model and inputs the first audio signal into it for voice enhancement processing, obtaining the corresponding voice enhancement signal.
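The "first preset correspondence" lookup amounts to a table from vocalization type to model; a hedged sketch (the model names and the function are illustrative placeholders, not the patent's API):

```python
# Illustrative mapping from vocalization type to enhancement model identifier.
VOICE_ENHANCEMENT_MODELS = {
    "speaking": "model_A",
    "singing": "model_B",
    "humming": "model_C",
}

def select_target_model(vocalization_type: str) -> str:
    """Map a recognized vocalization type to its target enhancement model."""
    return VOICE_ENHANCEMENT_MODELS[vocalization_type]

target = select_target_model("speaking")
```

A deployed system would also define a fallback (e.g., a general-purpose model) for vocalization types absent from the table.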
Optionally, in an embodiment in which there are multiple non-bone conduction microphones, performing enhancement processing on the second audio signal to obtain the ambient sound enhancement signal of the real environment where the wearer is located includes:
(1) predicting the sound source direction in the real environment according to multiple second audio signals collected by the multiple non-bone conduction microphones;
(2) performing beamforming processing on the second audio signal according to the sound source direction to obtain the ambient sound enhancement signal.
In the embodiments of the present application, the wearable device includes multiple non-bone conduction microphones, and these non-bone conduction microphones form a microphone array.
For example, referring to FIG. 6, the wearable device is provided with two non-bone conduction microphones at the end of the temple near the frame. It should be noted that the arrangement of the non-bone conduction microphones shown in FIG. 6 is only one optional arrangement; those of ordinary skill in the art can select the number and arrangement of the non-bone conduction microphones according to actual needs.
In the embodiments of the present application, the wearable device collects multiple second audio signals through the multiple non-bone conduction microphones. When performing enhancement processing on the second audio signal, the wearable device may predict the sound source direction in the real environment where the wearer is located according to the time differences with which the non-bone conduction microphones collect the second audio signal.
Taking the arrangement of the non-bone conduction microphones shown in FIG. 6 as an example, the manner in which the wearable device predicts the sound source direction in the real environment is described below:
Referring to FIG. 7, another speaker, "user B", is present in the real environment. The sound signal emitted by user B is collected successively by non-bone conduction microphone 1 and non-bone conduction microphone 2. Let the time difference between the two microphones collecting the sound signal be t, and record the distance between non-bone conduction microphone 1 and non-bone conduction microphone 2 as L1. Suppose the angle between the incident direction of the sound signal and the line connecting non-bone conduction microphone 1 and non-bone conduction microphone 2 is θ. Since the propagation speed C of the sound signal in air is known, the sound path difference from user B to the two microphones is L2 = C × t. Then, according to the principles of trigonometry, the following formula holds:
θ = cos⁻¹(L2 / L1)
The calculated angle θ thus represents the direction of user B relative to the wearable device, i.e., the sound source direction.
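A small numerical check of the direction-of-arrival formula above (the microphone spacing and delay are made-up illustration values, not dimensions from the patent):

```python
import math

SPEED_OF_SOUND = 343.0                             # C, speed of sound in air (m/s)

def source_angle(mic_distance_m: float, delay_s: float) -> float:
    """Angle between the incident sound and the microphone axis, in degrees."""
    path_diff = SPEED_OF_SOUND * delay_s           # L2 = C * t
    cos_theta = path_diff / mic_distance_m         # L2 / L1
    return math.degrees(math.acos(cos_theta))      # theta = cos^-1(L2 / L1)

# 14 cm microphone spacing; a delay that puts the path difference at exactly 7 cm
theta = source_angle(0.14, 0.07 / SPEED_OF_SOUND)
```

Note that |L2| can never exceed L1 physically; a robust implementation clamps cos_theta to [-1, 1] before calling acos to guard against measurement noise.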
After the sound source direction in the real environment where the wearer is located is predicted, beamforming processing can be performed on the second audio signal toward that sound source direction according to a configured beamforming algorithm, eliminating noise signals outside the sound source direction and correspondingly obtaining the ambient sound enhancement signal. The embodiments of the present application do not specifically limit which beamforming algorithm is used; it can be configured by those of ordinary skill in the art according to actual needs. For example, a Generalized Sidelobe Canceller (GSC) algorithm may be used to perform beamforming processing on the second audio signal.
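GSC itself is involved; a far simpler delay-and-sum beamformer illustrates the basic idea of steering toward the estimated direction (integer-sample delays only; signals and delays here are synthetic):

```python
import numpy as np

def delay_and_sum(channels: list, delays: list) -> np.ndarray:
    """Align each channel by its known delay (in samples) and average them."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    aligned = [ch[d:d + n] for ch, d in zip(channels, delays)]  # line up the source
    return np.mean(aligned, axis=0)   # in-phase source adds up; off-axis noise averages down

sig = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
mic1 = sig.copy()                                   # source reaches mic 1 first...
mic2 = np.concatenate([np.zeros(3), sig])[:400]     # ...and mic 2 three samples later
out = delay_and_sum([mic1, mic2], delays=[0, 3])
```

Real beamformers use fractional delays (derived from the angle θ above) and, in GSC, an additional adaptive branch that subtracts the residual off-axis interference.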
Optionally, in an embodiment, performing enhancement processing on the second audio signal to obtain the ambient sound enhancement signal of the real environment where the wearer is located includes:
performing ambient sound enhancement processing on the second audio signal through a pre-trained ambient sound enhancement model to obtain the ambient sound enhancement signal.
The embodiments of the present application further provide an optional ambient sound enhancement strategy.
It should be noted that in the embodiments of the present application an ambient sound enhancement model is trained in advance, and this ambient sound enhancement model is configured to enhance the ambient sound portion of an audio signal.
Correspondingly, when performing enhancement processing on the second audio signal, the wearable device invokes the pre-trained ambient sound enhancement model and inputs the second audio signal into the invoked ambient sound enhancement model for ambient sound enhancement processing, thereby obtaining the corresponding ambient sound enhancement signal.
Optionally, in an embodiment, performing enhancement processing on the second audio signal to obtain the ambient sound enhancement signal of the real environment where the wearer is located includes:
(1) recognizing the environment type of the real environment to obtain the environment type of the real environment;
(2) determining, according to a second preset correspondence between environment types and ambient sound enhancement models, a target ambient sound enhancement model corresponding to the environment type of the real environment;
(3) performing ambient sound enhancement processing on the second audio signal through the target ambient sound enhancement model to obtain the ambient sound enhancement signal.
It should be noted that the embodiments of the present application do not specifically limit the division of environment types, which can be divided by those of ordinary skill in the art according to actual needs. For example, environment types may be divided into indoor environments and outdoor environments. In the embodiments of the present application, for each different environment type, an ambient sound enhancement model suited to performing ambient sound enhancement processing on audio signals of that environment type is trained, and the second preset correspondence between environment types and ambient sound enhancement models is obtained accordingly.
本申请实施例中,在通过预训练的环境音增强模型对第二音频信号进行环境音增强处理时,可穿戴设备首先对穿戴者所处现实环境的环境类型进行识别,得到穿戴者的环境类型。其中,本申请实施例中对于环境类型的识别方式不做具体限定,可由本领域普通技术人员根据实际需要进行配置,比如,可以预训练用于环境类型识别的环境类型识别模型,在需要时,调用该环境类型识别模型进行环境类型的识别。In the embodiment of the present application, when the environmental sound enhancement processing is performed on the second audio signal through the pre-trained environmental sound enhancement model, the wearable device first identifies the environmental type of the actual environment where the wearer is located, and obtains the wearer's environmental type . There is no specific limitation on the identification method of the environment type in the embodiments of the present application, which can be configured by those of ordinary skill in the art according to actual needs. The environment type identification model is called to identify the environment type.
在识别得到穿戴者所处现实环境的环境类型之后,可穿戴设备即根据环境类型和环境音增强模型的预设对应关系,确定对应穿戴者所处现实环境的环境类型的环境音增强模型,记为目标环境音增强模型。比如,预先训练有适于对环境类型“室内环境”进行环境音增强处理的环境音增强模型D,适于对环境类型“室外环境”进行环境音增强处理的环境音增强模型E,若识别到穿戴者所处现实场景的环境类型为“室外环境”,则可将环境音增强模型E确定为目标环境音增 强模型。After recognizing the environmental type of the wearer's real environment, the wearable device determines the environmental sound enhancement model corresponding to the wearer's real environment according to the preset correspondence between the environmental type and the environmental sound enhancement model. Enhance the model for the target ambient sound. For example, an ambient sound enhancement model D suitable for performing ambient sound enhancement processing on the environment type "indoor environment" and an ambient sound enhancement model E suitable for performing ambient sound enhancement processing on the environment type "outdoor environment" are pre-trained. If the environment type of the real scene where the wearer is located is "outdoor environment", the ambient sound enhancement model E may be determined as the target ambient sound enhancement model.
在确定目标环境音增强模型之后,可穿戴设备即调用该目标环境音增强模型,并将第二音频信号输入到调用的目标环境音增强模型进行环境音增强处理,得到对应的环境音增强信号。After determining the target ambient sound enhancement model, the wearable device invokes the target ambient sound enhancement model, and inputs the second audio signal into the invoked target ambient sound enhancement model for ambient sound enhancement processing to obtain a corresponding ambient sound enhancement signal.
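The environment-type-to-model dispatch described above can be expressed as a simple lookup. The following sketch is purely illustrative: the model names and the trivial "enhancement" functions are placeholders, not part of this application.

```python
# Hypothetical sketch of the environment-type -> enhancement-model dispatch.
# The gain-only "models" stand in for the trained ambient sound enhancement
# models D and E mentioned in the example above.

def enhance_indoor(signal):
    # Placeholder for ambient sound enhancement model D (indoor).
    return [s * 1.2 for s in signal]

def enhance_outdoor(signal):
    # Placeholder for ambient sound enhancement model E (outdoor).
    return [s * 1.5 for s in signal]

# Second preset correspondence: environment type -> ambient sound enhancement model.
MODEL_TABLE = {
    "indoor": enhance_indoor,
    "outdoor": enhance_outdoor,
}

def enhance_ambient(second_audio, environment_type):
    """Select the target model for the recognized environment type and apply it."""
    target_model = MODEL_TABLE[environment_type]
    return target_model(second_audio)
```

The same table-driven pattern extends to any number of environment types without changing the dispatch code.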
Optionally, in an embodiment, the audio enhancement method provided by the present application further includes:

(1) extracting spectral features of the first audio signal;

(2) performing pronunciation detection on the spectral features through a pre-trained pronunciation detection model to obtain a detection result indicating whether the wearer is vocalizing.

It should be noted that a pronunciation detection model is also pre-trained in the embodiments of the present application, and this model is configured to detect whether the wearer is vocalizing. For example, sample vocalization signals of the wearer may be collected in advance, spectral features of the sample signals extracted as training samples, and model training then performed with these samples to obtain the pronunciation detection model.

Correspondingly, when detecting according to the first audio signal whether the wearer of the wearable device is vocalizing, the wearable device may extract the spectral features of the first audio signal, invoke the pre-trained pronunciation detection model, and input the extracted spectral features into it for pronunciation detection, obtaining a detection result indicating whether the wearer is vocalizing.
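A minimal sketch of this feature-extraction-plus-detection step follows. The magnitude spectrum via a naive DFT and the energy-threshold classifier are both illustrative assumptions standing in for the actual feature extractor and trained detection model.

```python
import cmath

def spectral_features(frame):
    """Magnitude spectrum of one audio frame via a naive DFT (illustrative only;
    a real implementation would use an FFT and e.g. mel filterbanks)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def detect_vocalization(features, threshold=1.0):
    """Stand-in for the pre-trained pronunciation detection model: classifies
    the frame as 'vocalizing' when the spectral energy exceeds a hypothetical
    threshold."""
    energy = sum(f * f for f in features)
    return energy > threshold
```

A silent frame yields near-zero spectral energy and is classified as not vocalizing; a frame with signal content exceeds the threshold.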
Optionally, in an embodiment, after the first audio signal is collected through the bone conduction microphone and the second audio signal is synchronously collected through the non-bone conduction microphone, the method further includes:

enhancing the first audio signal according to the second audio signal when the current configuration is to enhance the wearer's vocalization; or,

enhancing the second audio signal according to the first audio signal when the current configuration is to enhance the ambient sound.

It should be noted that in the embodiments of the present application, both the first audio signal and the second audio signal are regarded as composed of two parts: a wearer-vocalization part and an ambient sound part. Correspondingly, the wearable device can enhance either part according to the configured priority.

When the current configuration is to enhance the wearer's vocalization, that is, when the wearer's vocalization has higher priority than the ambient sound, the wearable device enhances the first audio signal according to the second audio signal, suppressing the ambient sound in it.

When the current configuration is to enhance the ambient sound, that is, when the ambient sound has higher priority than the wearer's vocalization, the wearable device enhances the second audio signal according to the first audio signal, suppressing the wearer's vocalization in it.
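One way to realize this cross-signal suppression, sketched here purely for illustration (the application does not fix a specific algorithm), is to treat the other microphone's signal as a reference and subtract a scaled copy of it; the fixed leakage coefficient is a hypothetical simplification of what an adaptive filter would estimate.

```python
def suppress_reference(primary, reference, alpha=0.5):
    """Suppress the component of `reference` that leaks into `primary` by
    subtracting a scaled copy; `alpha` is a hypothetical fixed leakage
    coefficient (an adaptive filter would normally estimate it)."""
    return [p - alpha * r for p, r in zip(primary, reference)]

def enhance_by_priority(first_audio, second_audio, priority):
    """Apply the configured priority: 'wearer' enhances the bone conduction
    signal using the air mic as ambient reference, and vice versa."""
    if priority == "wearer":
        return suppress_reference(first_audio, second_audio)
    elif priority == "ambient":
        return suppress_reference(second_audio, first_audio)
    raise ValueError("unknown priority: " + priority)
```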
Optionally, in an embodiment, the audio enhancement method provided by the present application further includes:

(1) performing three-dimensional reconstruction of the real environment where the wearer is located when the wearer is in a motion state, obtaining a virtual reality environment;

(2) identifying, according to the virtual reality environment, whether the wearer is at motion risk;

(3) issuing a safety prompt to the wearer when a motion risk is identified.

In the embodiments of the present application, to prevent the wearer of the wearable device from becoming so absorbed while vocalizing (for example, speaking) that risks in the real environment are overlooked, the wearable device also issues safety prompts to the wearer.
The wearable device first identifies the wearer's state to determine whether the wearer is in motion. It should be noted that the embodiments of the present application place no specific restriction on how to identify whether the wearer is in motion; the approach can be chosen by those of ordinary skill in the art according to actual needs. For example, the wearable device may check whether its current speed in any direction reaches a preset speed: if so, it judges itself to be in motion and therefore determines that the wearer is in motion; if not, it judges itself to be stationary and determines that the wearer is stationary. As another example, the wearable device may check whether its current displacement in any direction reaches a preset displacement, judging itself (and thus the wearer) to be in motion if so, and stationary otherwise.
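A minimal sketch of the speed-threshold check described above; the threshold value and the three-axis layout are assumptions for illustration, not values specified by the application.

```python
def is_moving(velocity_xyz, preset_speed=0.5):
    """Report motion when the speed along any axis reaches the preset speed.

    `velocity_xyz` is a (vx, vy, vz) tuple in m/s; 0.5 m/s is a hypothetical
    threshold, not a value fixed by the application.
    """
    return any(abs(v) >= preset_speed for v in velocity_xyz)
```

The displacement-based variant described above has the same shape, with displacement components and a preset displacement in place of speeds.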
If the wearer is identified as being in motion, the wearable device performs three-dimensional reconstruction of the wearer's real environment according to a configured three-dimensional reconstruction strategy, obtaining a virtual reality environment corresponding to the real environment. Three-dimensional reconstruction of the real environment can be regarded as the process of digitizing the actually existing environment, that is, the process by which the wearable device perceives the environment it is in. It should be noted that the embodiments of the present application place no specific restriction on the configuration of the reconstruction strategy; a suitable strategy can be configured by those of ordinary skill in the art according to actual needs. For example, a strategy that balances reconstruction accuracy against reconstruction efficiency may be chosen according to the processing capability of the wearable device.

After the virtual reality environment corresponding to the real environment is reconstructed, the wearable device identifies whether the wearer is at motion risk according to the virtual reality environment and the current motion data (including but not limited to motion direction, speed, acceleration, and trend). Motion risks include but are not limited to fall risks, collision risks, and the like.

If a motion risk is identified, the wearable device issues a safety prompt to the wearer, thereby avoiding the harm the motion risk might cause. It should be noted that the embodiments of the present application place no specific restriction on the prompting manner, which includes but is not limited to audio prompts, text prompts, video prompts, and combined prompts (for example, audio combined with text, or video combined with text).
Optionally, in an embodiment, the wearable device further includes a camera module, and performing three-dimensional reconstruction of the real environment where the wearer is located to obtain a virtual reality environment includes:

(1) acquiring a depth image and a color image of the wearer's real environment through the camera module;

(2) performing three-dimensional reconstruction of the real environment according to the depth image and the color image to obtain the virtual reality environment.

In the embodiments of the present application, the wearable device further includes a camera module. For example, the camera module includes a depth camera and an RGB camera; FIG. 8 shows one optional arrangement of the depth camera and the RGB camera. It should be noted that the depth camera and the RGB camera are not limited to the arrangement shown in FIG. 8 and may also be arranged by those of ordinary skill in the art according to actual needs.
The embodiments of the present application place no specific restriction on the type of depth camera, which can be chosen by those of ordinary skill in the art according to actual needs. For example, the depth camera chosen in the embodiments of the present application is a time-of-flight camera. A time-of-flight camera exploits the reflective property of light: it emits a modulated light pulse toward an object, receives the pulse reflected back from the object, and uses the round-trip time (or phase difference) of the pulse to compute the object's depth (that is, the distance from the object to the time-of-flight camera). Correspondingly, in the embodiments of the present application, when the time-of-flight camera is aimed at the real scene, the electronic device can acquire through it a depth image of the scene, which comprises depth values at different positions of the scene (a depth value reflects the distance from a position in the real scene to the time-of-flight camera: the smaller the value, the closer to the camera; the larger the value, the farther away).
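The round-trip-time relation described here is depth = c·t/2: the pulse covers the camera-to-object distance twice. A one-line sketch (pure illustration, not the camera's actual firmware):

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_depth(round_trip_seconds):
    """Depth from a time-of-flight measurement: the light pulse travels the
    camera-to-object distance twice, so halve the total path length."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0
```

For instance, a round trip of about 6.67 ns corresponds to an object roughly one metre away, which is why practical ToF cameras measure the phase of a modulated signal rather than timing the pulse directly.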
An RGB camera receives light reflected (or emitted) by an object and converts it into a color image carrying the object's color characteristics. Correspondingly, in the embodiments of the present application, when the RGB camera is aimed at the real scene, the electronic device can acquire through it a color image of the scene, which comprises color data at different positions of the scene.

It should be noted that the acquired depth image includes the depth values of the real scene but not its color data, while the acquired color image includes the color data but not the depth values. To ensure that the acquired depth image and color image of the real scene are consistent with each other, the electronic device can acquire the depth image and the color image synchronously through the time-of-flight camera and the RGB camera.

After acquiring the depth image and the color image of the real scene, the wearable device performs three-dimensional reconstruction of the scene from them according to a configured three-dimensional reconstruction algorithm, correspondingly obtaining a virtual reality scene corresponding to the real scene. It should be noted that the embodiments of the present application place no specific restriction on which three-dimensional reconstruction algorithm is used; it can be chosen by those of ordinary skill in the art according to actual needs.
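As one common realization of reconstruction from a registered depth-plus-color pair, each pixel can be back-projected into a colored 3-D point under the pinhole camera model. This is a sketch under that assumption; the intrinsic parameters are illustrative, not calibrated values, and the application itself leaves the algorithm open.

```python
def backproject(depth, color, fx, fy, cx, cy):
    """Back-project a registered depth/color image pair into a colored point
    cloud under the pinhole camera model. `depth` is a 2-D list of metric
    depth values; `color` is a same-shaped 2-D list of RGB tuples; the
    intrinsics (fx, fy, cx, cy) are assumed, not calibrated."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # no valid depth return at this pixel
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append(((x, y, z), color[v][u]))
    return points
```

A full reconstruction pipeline would fuse such point clouds across frames (e.g. into a mesh or voxel volume), but the per-pixel back-projection above is the step that ties the synchronized depth and color images together.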
FIG. 9 is another schematic flowchart of the audio enhancement method provided by an embodiment of the present application. The following description takes smart glasses as an example of the wearable device. As shown in FIG. 9, the flow of the audio enhancement method provided by the embodiment of the present application may be as follows.

In 201, the smart glasses collect a first audio signal through the bone conduction microphone and synchronously collect a second audio signal through the non-bone conduction microphone.

It should be noted that sound propagates through solids, liquids, and air, and bone conduction of sound is solid conduction of sound waves. Bone conduction is a very common physiological phenomenon: for example, the sound we hear while chewing food is transmitted to the inner ear through the jawbone. When a user vocalizes, the vocal organs produce vibrations that are transmitted through body tissues such as bone, muscle, and skin to other parts of the body, such as the bridge of the nose and the ear bones. Exploiting this principle, a bone conduction microphone converts the vibrations of body parts caused by the user's vocalization into a sound signal, thereby recovering the sound the user makes.
In the embodiments of the present application, the smart glasses include a bone conduction microphone and a non-bone conduction microphone. The non-bone conduction microphone may be any type of microphone that collects audio on the principle of airborne and/or liquid-borne sound transmission, including but not limited to dynamic, condenser, piezoelectric, electromagnetic, carbon, and semiconductor microphones.

For example, referring to FIG. 2, the non-bone conduction microphone is arranged at the end of the temple of the smart glasses near the frame, and the bone conduction microphone is arranged at the end of the temple away from the frame. Referring to FIG. 3, when the smart glasses are worn, the bone conduction microphone is in direct contact with the user's ear bones, while the non-bone conduction microphone is not in direct contact with the user.

It should be noted that FIG. 2 shows only one optional arrangement of the bone conduction and non-bone conduction microphones, and in a specific implementation they are not limited to this arrangement. For example, the bone conduction microphone may instead be arranged on the frame of the smart glasses and contact the bridge of the user's nose when worn.

In the embodiments of the present application, when audio collection is triggered, the smart glasses collect audio synchronously through the bone conduction microphone and the non-bone conduction microphone; the audio signal collected by the bone conduction microphone is denoted the first audio signal, and the audio signal collected by the non-bone conduction microphone is denoted the second audio signal.

It should be noted that the embodiments of the present application place no specific restriction on how audio collection is triggered. For example, it may be triggered when the user operates the smart glasses to start an audio recording application, when the user operates the smart glasses to start an instant messaging application for an audio or video call, or when the smart glasses perform voice wake-up or speech recognition.
In 202, the smart glasses extract the spectral features of the first audio signal.

From the audio collection principle of the bone conduction microphone, it is known that bone conduction audio collection suffers little interference from anyone other than the wearer; therefore, the first audio signal collected by the bone conduction microphone can be used effectively to detect whether the wearer of the smart glasses is vocalizing.

It should be noted that a pronunciation detection model is also pre-trained in the embodiments of the present application, configured to detect whether the wearer is vocalizing. For example, sample vocalization signals of the wearer may be collected in advance, spectral features of the sample signals extracted as training samples, and model training then performed with these samples to obtain the pronunciation detection model.

Correspondingly, when detecting according to the first audio signal whether the wearer of the smart glasses is vocalizing, the smart glasses can extract the spectral features of the first audio signal.

In 203, the smart glasses perform pronunciation detection on the spectral features through the pre-trained pronunciation detection model to determine whether the wearer is vocalizing; if so, the flow proceeds to 204, otherwise to 207.

In the embodiments of the present application, besides extracting the spectral features of the first audio signal, the smart glasses invoke the pre-trained pronunciation detection model and input the extracted spectral features into it for pronunciation detection, obtaining a detection result indicating whether the wearer is vocalizing. If the result indicates that the wearer is vocalizing, the flow proceeds to 204; if it indicates that the wearer is not vocalizing, the flow proceeds to 207.
In 204, the smart glasses identify the wearer's vocalization type to obtain the wearer's vocalization type.

Vocalization types include but are not limited to speaking, singing, humming, and the like. In the embodiments of the present application, for each vocalization type, a pronunciation enhancement model suited to performing pronunciation enhancement processing on audio signals of that type is trained, and a first preset correspondence between vocalization types and pronunciation enhancement models is obtained accordingly.

In the embodiments of the present application, when performing pronunciation enhancement processing on the first audio signal through a pre-trained pronunciation enhancement model, the smart glasses first identify the wearer's vocalization type. The embodiments of the present application place no specific restriction on how the vocalization type is identified; this can be configured by those of ordinary skill in the art according to actual needs. For example, a vocalization type recognition model may be pre-trained and invoked when needed to identify the vocalization type.

In 205, the smart glasses determine, according to the preset correspondence between vocalization types and pronunciation enhancement models, a target pronunciation enhancement model corresponding to the wearer's vocalization type.

After identifying the wearer's vocalization type, the smart glasses determine, according to the first preset correspondence, the pronunciation enhancement model corresponding to that type, denoted the target pronunciation enhancement model. For example, suppose a pronunciation enhancement model A suited to the vocalization type "speaking", a model B suited to the type "singing", and a model C suited to the type "humming" are pre-trained; if the wearer's vocalization type is identified as "speaking", model A is determined as the target pronunciation enhancement model.

In 206, the smart glasses perform pronunciation enhancement processing on the first audio signal through the target pronunciation enhancement model to obtain a pronunciation enhancement signal.

After determining the target pronunciation enhancement model, the smart glasses invoke it and input the first audio signal into it for pronunciation enhancement processing, obtaining the corresponding pronunciation enhancement signal.
In 207, the smart glasses identify the environment type of the real environment to obtain the environment type of the real environment.

It should be noted that the embodiments of the present application place no specific restriction on how environment types are divided; the division can be made by those of ordinary skill in the art according to actual needs. For example, environment types may be divided into indoor environments and outdoor environments. In the embodiments of the present application, for each environment type, an ambient sound enhancement model suited to performing ambient sound enhancement processing on audio signals of that environment type is trained, and the second preset correspondence between environment types and ambient sound enhancement models is obtained accordingly.

In the embodiments of the present application, the smart glasses first identify the environment type of the real environment where the wearer is located. The embodiments of the present application place no specific restriction on how the environment type is identified; this can be configured by those of ordinary skill in the art according to actual needs. For example, an environment type recognition model may be pre-trained, and the smart glasses invoke it when needed to identify the environment type.

In 208, the smart glasses determine, according to the second preset correspondence between environment types and ambient sound enhancement models, a target ambient sound enhancement model corresponding to the environment type of the real environment.

After identifying the environment type of the wearer's real environment, the smart glasses determine, according to the preset correspondence, the ambient sound enhancement model corresponding to that environment type, denoted the target ambient sound enhancement model. For example, suppose an ambient sound enhancement model D suited to the environment type "indoor environment" and a model E suited to the environment type "outdoor environment" are pre-trained; if the environment type of the wearer's real environment is identified as "outdoor environment", model E is determined as the target ambient sound enhancement model.

In 209, the smart glasses perform ambient sound enhancement processing on the second audio signal through the target ambient sound enhancement model to obtain an ambient sound enhancement signal.

After determining the target ambient sound enhancement model, the smart glasses invoke it and input the second audio signal into it for ambient sound enhancement processing, obtaining the corresponding ambient sound enhancement signal.
Referring to FIG. 10, FIG. 10 is a schematic structural diagram of the audio enhancement apparatus provided by an embodiment of the present application. The audio enhancement apparatus is applied to the wearable device provided by the present application, which includes a bone conduction microphone and a non-bone conduction microphone. As shown in FIG. 10, the audio enhancement apparatus may include:

an audio collection module 301, configured to collect a first audio signal through the bone conduction microphone and synchronously collect a second audio signal through the non-bone conduction microphone;

a pronunciation enhancement module 302, configured to perform enhancement processing on the first audio signal when it is detected according to the first audio signal that the wearer is vocalizing, obtaining the wearer's pronunciation enhancement signal; and

an ambient sound enhancement module 303, configured to perform enhancement processing on the second audio signal when it is detected according to the first audio signal that the wearer is not vocalizing, obtaining an ambient sound enhancement signal of the real environment where the wearer is located.
Optionally, in an embodiment, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, the pronunciation enhancement module 302 is configured to:

acquire the wearer's historical bone conduction noise signal from before the collection of the first audio signal;

acquire, according to the historical bone conduction noise signal, the wearer's bone conduction noise signal during the collection of the first audio signal; and

superimpose the bone conduction noise signal onto the first audio signal in inverse phase to obtain the pronunciation enhancement signal.
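A sketch of the inverse-phase superposition, assuming sample-aligned signals of equal length (an illustrative simplification):

```python
def antiphase_cancel(first_audio, noise_estimate):
    """Superimpose the estimated bone conduction noise in inverse phase:
    inverting the noise estimate and adding it is equivalent to subtraction,
    so a perfect estimate cancels the noise component exactly."""
    return [s + (-n) for s, n in zip(first_audio, noise_estimate)]
```

In practice the quality of the result depends entirely on how well the noise during the collection period is predicted from the historical noise.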
可选地,在一实施例中,在根据历史骨传导噪声信号获取穿戴者在第一音频信号采集期间的骨传导噪声信号时,发音增强模块302用于:Optionally, in an embodiment, when acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal, the pronunciation enhancement module 302 is configured to:
根据历史骨传导噪声信号进行模型训练，得到对应穿戴者的噪声预测模型；Model training is performed based on the historical bone conduction noise signal to obtain a noise prediction model corresponding to the wearer;
根据噪声预测模型预测第一音频信号采集期间的骨传导噪声信号。The bone conduction noise signal during the acquisition of the first audio signal is predicted according to the noise prediction model.
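As one concrete (hypothetical) realization of "train a model on historical noise, then predict the noise during capture", a linear autoregressive predictor can be fit to the historical bone conduction noise by least squares and rolled forward over the capture window. The patent does not specify the model family, so this is only a sketch:

```python
import numpy as np

def fit_ar(history, order=8):
    """Fit autoregressive coefficients to historical noise by least squares."""
    X = np.column_stack(
        [history[i:len(history) - order + i] for i in range(order)]
    )  # each row: `order` consecutive past samples, oldest first
    y = history[order:]  # the sample each row should predict
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_ar(history, coeffs, steps):
    """Roll the AR model forward to predict noise during the capture window."""
    order = len(coeffs)
    buf = list(history[-order:])
    for _ in range(steps):
        buf.append(float(np.dot(coeffs, buf[-order:])))
    return np.array(buf[order:])
```

Any more expressive learned model could take the place of the AR fit; the interface stays the same — fit on the historical signal, then predict the window that overlaps the first audio signal's collection.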
可选地,在一实施例中,在对第一音频信号进行增强处理,得到穿戴者的发音增强信号时,发音增强模块302用于:Optionally, in an embodiment, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, the pronunciation enhancement module 302 is configured to:
通过预训练的发音增强模型对第一音频信号进行发音增强处理,得到发音增强信号。The pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal by using the pre-trained pronunciation enhancement model.
可选地，在一实施例中，非骨传导麦克风为多个，在对第二音频信号进行增强处理，得到穿戴者所处现实环境的环境音增强信号时，环境音增强模块303用于：Optionally, in an embodiment, there are multiple non-bone conduction microphones, and when performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located, the ambient sound enhancement module 303 is configured to:
根据多个非骨传导麦克风采集到的多个第二音频信号,对现实环境的声源方向进行预测;According to multiple second audio signals collected by multiple non-bone conduction microphones, predict the sound source direction of the real environment;
根据声源方向对第二音频信号进行波束形成处理,得到环境音增强信号。The second audio signal is subjected to beamforming processing according to the sound source direction to obtain an ambient sound enhancement signal.
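Direction prediction from multiple microphones followed by beamforming can be illustrated with the simplest classical pair of steps: estimate the inter-microphone time delay by cross-correlation (for a known microphone spacing, the delay encodes the source direction), then delay-and-sum. This is a sketch with assumed integer-sample delays and circular shifts, not the embodiments' (unspecified) algorithm:

```python
import numpy as np

def estimate_delay(ref, other, max_lag):
    """Return the integer lag d maximizing correlation of roll(ref, d) with
    `other`, i.e. `other` lags `ref` by d samples (the direction cue)."""
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [float(np.dot(np.roll(ref, int(d)), other)) for d in lags]
    return int(lags[int(np.argmax(scores))])

def delay_and_sum(channels, delays):
    """Undo each channel's delay and average, reinforcing sound from the
    estimated direction while averaging down uncorrelated noise."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

A real implementation would use fractional delays and frame-wise processing; the circular `np.roll` stands in for proper delay lines here.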
可选地,在一实施例中,本申请提供的音频增强装置还包括发音检测模块,用于:Optionally, in an embodiment, the audio enhancement device provided by the present application further includes a pronunciation detection module for:
提取第一音频信号的频谱特征;extracting spectral features of the first audio signal;
通过预训练的发音检测模型对频谱特征进行发音检测，得到穿戴者是否发音的检测结果。Pronunciation detection is performed on the spectral features by a pre-trained pronunciation detection model to obtain a detection result indicating whether the wearer is speaking.
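As an illustration of this detection path, the sketch below extracts framewise magnitude-spectrum features and applies a simple low-band energy rule as a stand-in for the pre-trained detection model (bone conduction picks up the wearer's own voice far more strongly than external sound, so strong low-band energy in the first audio signal suggests the wearer is speaking). The frame length, band split, and threshold are all assumptions:

```python
import numpy as np

def spectral_features(signal, frame_len=256):
    """Framewise magnitude spectra (one simple choice of spectral feature)."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

def is_wearer_speaking(features, energy_thresh=1.0):
    """Stand-in for the pre-trained detection model: decide from the mean
    energy in the lowest quarter of frequency bins."""
    low_band = features[:, : features.shape[1] // 4]
    return bool(np.mean(low_band ** 2) > energy_thresh)
```

A pre-trained classifier would consume the same feature matrix but learn the decision boundary instead of using a fixed threshold.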
可选地,在一实施例中,通过骨传导麦克风采集得到第一音频信号,以及同步通过非骨传导麦克风采集得到第二音频信号之后,Optionally, in an embodiment, after the first audio signal is acquired by the bone conduction microphone, and the second audio signal is acquired by the non-bone conduction microphone simultaneously,
发音增强模块302还用于在当前配置为对穿戴者发音进行增强时,根据第二音频信号对第一音频信号进行增强处理;或者,The pronunciation enhancement module 302 is further configured to perform enhancement processing on the first audio signal according to the second audio signal when the current configuration is to enhance the wearer's pronunciation; or,
环境音增强模块303还用于在当前配置为对环境音进行增强时,根据第一音频信号对第二音频信号进行增强处理。The ambient sound enhancement module 303 is further configured to perform enhancement processing on the second audio signal according to the first audio signal when the current configuration is to enhance the ambient sound.
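One common way to "enhance one microphone's signal according to the other" is spectral subtraction: treat the other microphone's magnitude spectrum as a noise reference and keep the target signal's phase. The embodiments do not name a specific algorithm, so the following single-frame sketch with an idealized noise reference is only illustrative:

```python
import numpy as np

def spectral_subtract(target, reference, alpha=1.0):
    """Enhance `target` using `reference` as a noise estimate: subtract the
    scaled reference magnitude spectrum, keep the target phase."""
    n = min(len(target), len(reference))
    T = np.fft.rfft(target[:n])
    R = np.fft.rfft(reference[:n])
    mag = np.maximum(np.abs(T) - alpha * np.abs(R), 0.0)  # floor at zero
    return np.fft.irfft(mag * np.exp(1j * np.angle(T)), n=n)
```

Here `alpha` is an over-subtraction factor; in practice the reference would be framed, smoothed, and level-matched before subtraction rather than used raw.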
可选地,在一实施例中,本申请提供的音频增强装置还包括安全提示模块,用于:Optionally, in an embodiment, the audio enhancement device provided by the present application further includes a safety prompt module for:
在穿戴者处于运动状态时,对现实环境进行三维重建,得到虚拟现实环境;When the wearer is in motion, three-dimensional reconstruction of the real environment is performed to obtain a virtual reality environment;
根据虚拟现实环境识别穿戴者是否存在运动风险;Identify whether the wearer is at risk of exercise based on the virtual reality environment;
在识别到穿戴者存在运动风险时,对穿戴者进行安全提示。When it is recognized that the wearer is at risk of exercise, the wearer is reminded of safety.
可选地,在一实施例中,可穿戴设备还包括摄像模组,在对现实环境进行三维重建,得到虚拟现实环境时,安全提示模块用于:Optionally, in an embodiment, the wearable device further includes a camera module, and when performing three-dimensional reconstruction of the real environment to obtain a virtual reality environment, the safety prompt module is used for:
通过摄像模组获取现实环境的深度图像和彩色图像;Obtain the depth image and color image of the real environment through the camera module;
根据深度图像和彩色图像对现实环境进行三维重建,得到虚拟现实环境。The three-dimensional reconstruction of the real environment is carried out according to the depth image and the color image, and the virtual reality environment is obtained.
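The first step of RGB-D three-dimensional reconstruction is usually back-projecting each valid depth pixel through the pinhole camera model into a colored 3D point; later stages (frame fusion, meshing) assemble the virtual environment from such point clouds. A sketch with hypothetical intrinsics `fx, fy, cx, cy`, which the embodiments do not give:

```python
import numpy as np

def backproject(depth, color, fx, fy, cx, cy):
    """Back-project a depth map (H, W) and color image (H, W, 3) into a
    colored point cloud using the pinhole model u = fx*x/z + cx, v = fy*y/z + cy."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]      # pixel grid
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    valid = depth > 0              # zero depth = no measurement at that pixel
    points = np.stack([x[valid], y[valid], depth[valid]], axis=1)
    return points, color[valid]
```

The color image supplies per-point appearance, which downstream risk recognition (e.g., obstacle classification) can use alongside the geometry.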
应当说明的是，本申请实施例提供的音频增强装置与上文实施例中的音频增强方法属于同一构思，音频增强装置可以运行音频增强方法实施例中提供的任一方法，其具体实现过程详见以上相关实施例，此处不再赘述。It should be noted that the audio enhancement apparatus provided in the embodiment of the present application and the audio enhancement method in the above embodiments belong to the same concept, and the audio enhancement apparatus can execute any of the methods provided in the audio enhancement method embodiments; for the specific implementation process, refer to the above related embodiments, which will not be repeated here.
本申请实施例提供一种存储介质，其上存储有计算机程序，当其存储的计算机程序被包括骨传导麦克风和非骨传导麦克风的可穿戴设备加载时执行如本申请实施例提供的音频增强方法中的步骤。其中，存储介质可以是磁碟、光盘、只读存储器(Read Only Memory, ROM)或者随机存取存储器(Random Access Memory, RAM)等。The embodiments of the present application provide a storage medium on which a computer program is stored. When the stored computer program is loaded by a wearable device including a bone conduction microphone and a non-bone conduction microphone, the steps in the audio enhancement method provided by the embodiments of the present application are executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
本申请实施例还提供一种可穿戴设备,请参照图11,可穿戴设备包括处理器401、存储器402、骨传导麦克风403和非骨传导麦克风404。An embodiment of the present application further provides a wearable device. Please refer to FIG. 11 . The wearable device includes a processor 401 , a memory 402 , a bone conduction microphone 403 and a non-bone conduction microphone 404 .
本申请实施例中的处理器是通用处理器,比如ARM架构的处理器。The processor in the embodiment of the present application is a general-purpose processor, such as a processor of an ARM architecture.
存储器402中存储有计算机程序，其可以为高速随机存取存储器，还可以为非易失性存储器，比如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件等。A computer program is stored in the memory 402, which may be a high-speed random access memory or a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
骨传导麦克风403包括以声音骨传导为原理进行音频采集的麦克风。The bone conduction microphone 403 includes a microphone for audio collection based on the principle of sound bone conduction.
非骨传导麦克风404包括以声音空气传播和/或液体传播为原理进行音频采集的任意类型的麦克风，包括但不限于电动式、电容式、压电式、电磁式、碳粒式以及半导体式等类型麦克风。The non-bone conduction microphone 404 includes any type of microphone that collects audio based on the principle of sound transmission through air and/or liquid, including but not limited to electrodynamic, condenser, piezoelectric, electromagnetic, carbon, and semiconductor microphones.
本申请实施例中,存储器402还可以包括存储器控制器,以提供处理器401对存储器402的访问,处理器401通过加载存储器402中的计算机程序实现如下功能:In this embodiment of the present application, the memory 402 may further include a memory controller to provide access to the memory 402 by the processor 401, and the processor 401 implements the following functions by loading a computer program in the memory 402:
通过骨传导麦克风403采集得到第一音频信号,以及同步通过非骨传导麦克风404采集得到第二音频信号;The first audio signal is acquired by the bone conduction microphone 403, and the second audio signal is acquired by the non-bone conduction microphone 404 simultaneously;
在根据第一音频信号检测到穿戴者发音时,对第一音频信号进行增强处理,得到穿戴者的发音增强信号;When the wearer's pronunciation is detected according to the first audio signal, the first audio signal is enhanced to obtain the wearer's pronunciation enhancement signal;
在根据第一音频信号检测到穿戴者未发音时，对第二音频信号进行增强处理，得到穿戴者所处现实环境的环境音增强信号。When it is detected from the first audio signal that the wearer is not speaking, enhancement processing is performed on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
可选地,在一实施例中,在对第一音频信号进行增强处理,得到穿戴者的发音增强信号时,处理器401执行:Optionally, in an embodiment, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, the processor 401 executes:
获取穿戴者在第一音频信号采集之前的历史骨传导噪声信号;Obtain the wearer's historical bone conduction noise signal before the first audio signal is collected;
根据历史骨传导噪声信号获取穿戴者在第一音频信号采集期间的骨传导噪声信号;Acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal;
将骨传导噪声信号与第一音频信号进行反相位叠加,得到发音增强信号。The anti-phase superposition of the bone conduction noise signal and the first audio signal is performed to obtain a pronunciation enhancement signal.
可选地,在一实施例中,在根据历史骨传导噪声信号获取穿戴者在第一音频信号采集期间的骨传导噪声信号时,处理器401执行:Optionally, in an embodiment, when acquiring the wearer's bone conduction noise signal during the first audio signal acquisition period according to the historical bone conduction noise signal, the processor 401 executes:
根据历史骨传导噪声信号进行模型训练，得到对应穿戴者的噪声预测模型；Model training is performed based on the historical bone conduction noise signal to obtain a noise prediction model corresponding to the wearer;
根据噪声预测模型预测第一音频信号采集期间的骨传导噪声信号。The bone conduction noise signal during the acquisition of the first audio signal is predicted according to the noise prediction model.
可选地,在一实施例中,在对第一音频信号进行增强处理,得到穿戴者的发音增强信号时,处理器401执行:Optionally, in an embodiment, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, the processor 401 executes:
通过预训练的发音增强模型对第一音频信号进行发音增强处理,得到发音增强信号。The pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal by using the pre-trained pronunciation enhancement model.
可选地,在一实施例中,非骨传导麦克风404为多个,在对第二音频信号进行增强处理,得到穿戴者所处现实环境的环境音增强信号时,处理器401执行:Optionally, in an embodiment, there are multiple non-bone conduction microphones 404, and when the second audio signal is enhanced to obtain an ambient sound enhancement signal of the actual environment where the wearer is located, the processor 401 executes:
根据多个非骨传导麦克风404采集到的多个第二音频信号,对现实环境的声源方向进行预测;predicting the sound source direction of the real environment according to the plurality of second audio signals collected by the plurality of non-bone conduction microphones 404;
根据声源方向对第二音频信号进行波束形成处理,得到环境音增强信号。The second audio signal is subjected to beamforming processing according to the sound source direction to obtain an ambient sound enhancement signal.
可选地,在一实施例中,处理器401还执行:Optionally, in an embodiment, the processor 401 further executes:
提取第一音频信号的频谱特征;extracting spectral features of the first audio signal;
通过预训练的发音检测模型对频谱特征进行发音检测，得到穿戴者是否发音的检测结果。Pronunciation detection is performed on the spectral features by a pre-trained pronunciation detection model to obtain a detection result indicating whether the wearer is speaking.
可选地，在一实施例中，同步通过骨传导麦克风403和非骨传导麦克风404进行音频采集，得到骨传导麦克风403采集的第一音频信号，以及得到非骨传导麦克风404采集的第二音频信号之后，处理器401执行：Optionally, in an embodiment, after audio collection is performed simultaneously through the bone conduction microphone 403 and the non-bone conduction microphone 404 to obtain the first audio signal collected by the bone conduction microphone 403 and the second audio signal collected by the non-bone conduction microphone 404, the processor 401 executes:
在当前配置为对穿戴者发音进行增强时,根据第二音频信号对第一音频信号进行增强处理;或者,When the current configuration is to enhance the pronunciation of the wearer, the first audio signal is enhanced according to the second audio signal; or,
在当前配置为对环境音进行增强时,根据第一音频信号对第二音频信号进行增强处理。When the current configuration is to enhance the ambient sound, the second audio signal is enhanced according to the first audio signal.
可选地,在一实施例中,处理器401还执行:Optionally, in an embodiment, the processor 401 further executes:
在穿戴者处于运动状态时,对现实环境进行三维重建,得到虚拟现实环境;When the wearer is in motion, three-dimensional reconstruction of the real environment is performed to obtain a virtual reality environment;
根据虚拟现实环境识别穿戴者是否存在运动风险;Identify whether the wearer is at risk of exercise based on the virtual reality environment;
在识别到穿戴者存在运动风险时,对穿戴者进行安全提示。When it is recognized that the wearer is at risk of exercise, the wearer is reminded of safety.
可选地,在一实施例中,可穿戴设备还包括摄像模组,在对现实环境进行三维重建,得到虚拟现实环境时,处理器401执行:Optionally, in an embodiment, the wearable device further includes a camera module, and when performing three-dimensional reconstruction on the real environment to obtain a virtual reality environment, the processor 401 executes:
通过摄像模组获取现实环境的深度图像和彩色图像;Obtain the depth image and color image of the real environment through the camera module;
根据深度图像和彩色图像对现实环境进行三维重建,得到虚拟现实环境。The three-dimensional reconstruction of the real environment is carried out according to the depth image and the color image, and the virtual reality environment is obtained.
应当说明的是，本申请实施例提供的可穿戴设备与上文实施例中的音频增强方法属于同一构思，在可穿戴设备上可以运行音频增强方法实施例中提供的任一方法，其具体实现过程详见以上实施例，此处不再赘述。It should be noted that the wearable device provided by the embodiment of the present application and the audio enhancement method in the above embodiments belong to the same concept, and any of the methods provided in the audio enhancement method embodiments can be executed on the wearable device; for the specific implementation process, refer to the above embodiments, which will not be repeated here.
应当说明的是，对本申请实施例的音频增强方法而言，本领域普通技术人员可以理解实现本申请实施例的音频增强方法的全部或部分流程，是可以通过计算机程序来控制相关的硬件来完成，所述计算机程序可存储于一计算机可读取存储介质中，如存储在本申请提供的可穿戴设备的存储器中，并被该可穿戴设备内的处理器执行，在执行过程中可包括如音频增强方法的实施例的流程。其中，所述的存储介质可为磁碟、光盘、只读存储器、随机存取记忆体等。It should be noted that, for the audio enhancement method of the embodiments of the present application, those of ordinary skill in the art can understand that all or part of the process of implementing the audio enhancement method can be completed by controlling the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, such as the memory of the wearable device provided by the present application, and executed by the processor in the wearable device; the execution process can include the flow of an embodiment of the audio enhancement method. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
以上对本申请实施例所提供的一种音频增强方法、装置、存储介质及可穿戴设备进行了详细介绍，本文中应用了具体个例对本申请的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本申请的方法及其核心思想；同时，对于本领域的技术人员，依据本申请的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本申请的限制。The audio enhancement method, apparatus, storage medium, and wearable device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art will, following the idea of the present application, make changes to the specific implementations and application scope. In summary, the contents of this description should not be construed as limiting the present application.
Claims (20)
- 一种音频增强方法，应用于可穿戴设备，其中，所述可穿戴设备包括骨传导麦克风和非骨传导麦克风，所述音频增强方法包括：An audio enhancement method, applied to a wearable device, wherein the wearable device includes a bone conduction microphone and a non-bone conduction microphone, the audio enhancement method comprising:通过所述骨传导麦克风采集得到第一音频信号，以及同步通过所述非骨传导麦克风采集得到第二音频信号；collecting a first audio signal through the bone conduction microphone, and synchronously collecting a second audio signal through the non-bone conduction microphone;在根据所述第一音频信号检测到所述穿戴者发音时，对所述第一音频信号进行增强处理，得到所述穿戴者的发音增强信号；when it is detected from the first audio signal that the wearer is speaking, performing enhancement processing on the first audio signal to obtain a pronunciation enhancement signal of the wearer;在根据所述第一音频信号检测到所述穿戴者未发音时，对所述第二音频信号进行增强处理，得到所述穿戴者所处现实环境的环境音增强信号。when it is detected from the first audio signal that the wearer is not speaking, performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- 根据权利要求1所述的音频增强方法,所述对所述第一音频信号进行增强处理,得到所述穿戴者的发音增强信号,包括:The audio enhancement method according to claim 1, wherein the enhancement processing is performed on the first audio signal to obtain the pronunciation enhancement signal of the wearer, comprising:获取所述穿戴者在所述第一音频信号采集之前的历史骨传导噪声信号;acquiring a historical bone conduction noise signal of the wearer before the first audio signal is collected;根据所述历史骨传导噪声信号获取所述穿戴者在所述第一音频信号采集期间的骨传导噪声信号;Acquiring a bone conduction noise signal of the wearer during the collection of the first audio signal according to the historical bone conduction noise signal;将所述骨传导噪声信号与所述第一音频信号进行反相位叠加,得到所述发音增强信号。The anti-phase superposition of the bone conduction noise signal and the first audio signal is performed to obtain the pronunciation enhancement signal.
- 根据权利要求2所述的音频增强方法,其中,所述根据所述历史骨传导噪声信号获取所述穿戴者在所述第一音频信号采集期间的骨传导噪声信号,包括:The audio enhancement method according to claim 2, wherein the acquiring the wearer's bone conduction noise signal during the collection of the first audio signal according to the historical bone conduction noise signal comprises:根据所述历史骨传导噪声信号进行模型训练,得到对应所述穿戴者的噪声预测模型;Perform model training according to the historical bone conduction noise signal to obtain a noise prediction model corresponding to the wearer;根据所述噪声预测模型预测所述第一音频信号采集期间的骨传导噪声信号。The bone conduction noise signal during the acquisition of the first audio signal is predicted according to the noise prediction model.
- 根据权利要求1所述的音频增强方法,其中,所述对所述第一音频信号进行增强处理,得到所述穿戴者的发音增强信号,包括:The audio enhancement method according to claim 1, wherein the performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, comprising:通过预训练的发音增强模型对所述第一音频信号进行发音增强处理,得到所述发音增强信号。The pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal by using a pre-trained pronunciation enhancement model.
- 根据权利要求1所述的音频增强方法，其中，所述非骨传导麦克风为多个，所述对所述第二音频信号进行增强处理，得到所述穿戴者所处现实环境的环境音增强信号，包括：The audio enhancement method according to claim 1, wherein there are multiple non-bone conduction microphones, and the performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located includes:根据所述多个非骨传导麦克风采集到的多个第二音频信号，对所述现实环境的声源方向进行预测；predicting a sound source direction of the real environment according to the multiple second audio signals collected by the multiple non-bone conduction microphones;根据所述声源方向对所述第二音频信号进行波束形成处理，得到所述环境音增强信号。performing beamforming processing on the second audio signal according to the sound source direction to obtain the ambient sound enhancement signal.
- 根据权利要求1所述的音频增强方法,其中,还包括:The audio enhancement method of claim 1, further comprising:提取所述第一音频信号的频谱特征;extracting spectral features of the first audio signal;通过预训练的发音检测模型对所述频谱特征进行发音检测,得到所述穿戴者是否发音的检测结果。Pronunciation detection is performed on the spectral feature by using a pre-trained pronunciation detection model to obtain a detection result of whether the wearer has spoken.
- 根据权利要求1所述的音频增强方法，其中，所述同步通过所述骨传导麦克风和所述非骨传导麦克风进行音频采集，得到骨传导麦克风采集的第一音频信号，以及得到非骨传导麦克风采集的第二音频信号之后，还包括：The audio enhancement method according to claim 1, wherein after audio collection is performed synchronously through the bone conduction microphone and the non-bone conduction microphone to obtain the first audio signal collected by the bone conduction microphone and the second audio signal collected by the non-bone conduction microphone, the method further includes:在当前配置为对穿戴者发音进行增强时，根据所述第二音频信号对所述第一音频信号进行增强处理；或者，when the current configuration is to enhance the wearer's pronunciation, performing enhancement processing on the first audio signal according to the second audio signal; or,在当前配置为对环境音进行增强时，根据所述第一音频信号对所述第二音频信号进行增强处理。when the current configuration is to enhance the ambient sound, performing enhancement processing on the second audio signal according to the first audio signal.
- 根据权利要求1-7任一项所述的音频增强方法,其中,还包括:The audio enhancement method according to any one of claims 1-7, further comprising:在所述穿戴者处于运动状态时,对所述现实环境进行三维重建,得到虚拟现实环境;When the wearer is in motion, three-dimensional reconstruction is performed on the real environment to obtain a virtual reality environment;根据所述虚拟现实环境识别所述穿戴者是否存在运动风险;Identifying whether the wearer is at risk of exercise according to the virtual reality environment;在识别到所述穿戴者存在运动风险时,对所述穿戴者进行安全提示。When it is recognized that the wearer has a movement risk, a safety prompt is given to the wearer.
- 根据权利要求8所述的音频增强方法,其中,所述可穿戴设备还包括摄像模组,所述对所述现实环境进行三维重建,得到虚拟现实环境,包括:The audio enhancement method according to claim 8, wherein the wearable device further comprises a camera module, and the three-dimensional reconstruction of the real environment to obtain a virtual reality environment comprises:通过所述摄像模组获取所述现实环境的深度图像和彩色图像;Obtain the depth image and color image of the real environment through the camera module;根据所述深度图像和所述彩色图像对所述现实环境进行三维重建,得到所述虚拟现实环境。The virtual reality environment is obtained by performing three-dimensional reconstruction of the real environment according to the depth image and the color image.
- 一种音频增强装置,应用于可穿戴设备,其中,所述可穿戴设备包括骨传导麦克风和非骨传导麦克风,所述音频增强装置包括:An audio enhancement device is applied to a wearable device, wherein the wearable device includes a bone conduction microphone and a non-bone conduction microphone, and the audio enhancement device includes:音频采集模块,用于通过所述骨传导麦克风采集得到第一音频信号,以及同步通过所述非骨传导麦克风采集得到第二音频信号;an audio acquisition module, configured to acquire a first audio signal through the bone conduction microphone, and acquire a second audio signal through the non-bone conduction microphone synchronously;发音增强模块,用于在根据所述第一音频信号检测到所述穿戴者发音时,对所述第一音频信号进行增强处理,得到所述穿戴者的发音增强信号;A pronunciation enhancement module, configured to perform enhancement processing on the first audio signal when the wearer's pronunciation is detected according to the first audio signal, to obtain the wearer's pronunciation enhancement signal;环境音增强模块,用于在根据所述第一音频信号检测到所述穿戴者未发音时,对所述第二音频信号进行增强处理,得到所述穿戴者所处现实环境的环境音增强信号。An ambient sound enhancement module, configured to perform enhancement processing on the second audio signal when it is detected that the wearer has not spoken according to the first audio signal, to obtain an ambient sound enhancement signal of the actual environment where the wearer is located .
- 一种存储介质，其上存储有计算机程序，其中，当所述计算机程序被包括骨传导麦克风和非骨传导麦克风的可穿戴设备加载时执行如权利要求1-9任一项所述的音频增强方法。A storage medium having a computer program stored thereon, wherein when the computer program is loaded by a wearable device including a bone conduction microphone and a non-bone conduction microphone, the audio enhancement method according to any one of claims 1-9 is performed.
- 一种可穿戴设备，包括骨传导麦克风、非骨传导麦克风、处理器和存储器，所述存储器储存有计算机程序，其中，所述处理器通过调用所述计算机程序，用于执行：A wearable device, comprising a bone conduction microphone, a non-bone conduction microphone, a processor, and a memory, wherein the memory stores a computer program, and the processor, by calling the computer program, is configured to execute:通过所述骨传导麦克风采集得到第一音频信号，以及同步通过所述非骨传导麦克风采集得到第二音频信号；collecting a first audio signal through the bone conduction microphone, and synchronously collecting a second audio signal through the non-bone conduction microphone;在根据所述第一音频信号检测到所述穿戴者发音时，对所述第一音频信号进行增强处理，得到所述穿戴者的发音增强信号；when it is detected from the first audio signal that the wearer is speaking, performing enhancement processing on the first audio signal to obtain a pronunciation enhancement signal of the wearer;在根据所述第一音频信号检测到所述穿戴者未发音时，对所述第二音频信号进行增强处理，得到所述穿戴者所处现实环境的环境音增强信号。when it is detected from the first audio signal that the wearer is not speaking, performing enhancement processing on the second audio signal to obtain an ambient sound enhancement signal of the real environment where the wearer is located.
- 根据权利要求12所述的电子设备,其中,在对所述第一音频信号进行增强处理,得到所述穿戴者的发音增强信号时,所述处理器用于执行:The electronic device according to claim 12, wherein, when the first audio signal is enhanced to obtain the wearer's voice enhancement signal, the processor is configured to execute:获取所述穿戴者在所述第一音频信号采集之前的历史骨传导噪声信号;acquiring a historical bone conduction noise signal of the wearer before the first audio signal is collected;根据所述历史骨传导噪声信号获取所述穿戴者在所述第一音频信号采集期间的骨传导噪声信号;Acquiring a bone conduction noise signal of the wearer during the collection of the first audio signal according to the historical bone conduction noise signal;将所述骨传导噪声信号与所述第一音频信号进行反相位叠加,得到所述发音增强信号。The anti-phase superposition of the bone conduction noise signal and the first audio signal is performed to obtain the pronunciation enhancement signal.
- 根据权利要求13所述的电子设备，其中，在根据所述历史骨传导噪声信号获取所述穿戴者在所述第一音频信号采集期间的骨传导噪声信号时，所述处理器用于执行：The electronic device of claim 13, wherein, when acquiring the wearer's bone conduction noise signal during the first audio signal acquisition based on the historical bone conduction noise signal, the processor is configured to execute:根据所述历史骨传导噪声信号进行模型训练，得到对应所述穿戴者的噪声预测模型；Perform model training according to the historical bone conduction noise signal to obtain a noise prediction model corresponding to the wearer;根据所述噪声预测模型预测所述第一音频信号采集期间的骨传导噪声信号。The bone conduction noise signal during the acquisition of the first audio signal is predicted according to the noise prediction model.
- 根据权利要求12所述的电子设备,其中,在对所述第一音频信号进行 增强处理,得到所述穿戴者的发音增强信号时,所述处理器用于执行:The electronic device according to claim 12, wherein, when performing enhancement processing on the first audio signal to obtain the wearer's pronunciation enhancement signal, the processor is configured to execute:通过预训练的发音增强模型对所述第一音频信号进行发音增强处理,得到所述发音增强信号。The pronunciation enhancement signal is obtained by performing pronunciation enhancement processing on the first audio signal by using a pre-trained pronunciation enhancement model.
- 根据权利要求12所述的电子设备,其中,所述非骨传导麦克风为多个,在对所述第二音频信号进行增强处理,得到所述穿戴者所处现实环境的环境音增强信号时,所述处理器用于执行:The electronic device according to claim 12, wherein there are multiple non-bone conduction microphones, and when the second audio signal is enhanced to obtain an ambient sound enhancement signal of the real environment where the wearer is located, The processor is used to execute:根据所述多个非骨传导麦克风采集到的多个第二音频信号,对所述现实环境的声源方向进行预测;predicting the sound source direction of the real environment according to the plurality of second audio signals collected by the plurality of non-bone conduction microphones;根据所述声源方向对所述第二音频信号进行波束形成处理,得到所述环境音增强信号。The second audio signal is subjected to beamforming processing according to the sound source direction to obtain the ambient sound enhancement signal.
- 根据权利要求12所述的电子设备,其中,所述处理器还用于执行:The electronic device of claim 12, wherein the processor is further configured to perform:提取所述第一音频信号的频谱特征;extracting spectral features of the first audio signal;通过预训练的发音检测模型对所述频谱特征进行发音检测,得到所述穿戴者是否发音的检测结果。Pronunciation detection is performed on the spectral feature by using a pre-trained pronunciation detection model to obtain a detection result of whether the wearer has spoken.
- 根据权利要求12所述的电子设备，其中，在通过所述骨传导麦克风和所述非骨传导麦克风进行音频采集，得到骨传导麦克风采集的第一音频信号，以及得到非骨传导麦克风采集的第二音频信号之后，所述处理器还用于执行：The electronic device according to claim 12, wherein, after audio collection is performed through the bone conduction microphone and the non-bone conduction microphone to obtain the first audio signal collected by the bone conduction microphone and the second audio signal collected by the non-bone conduction microphone, the processor is further configured to execute:在当前配置为对穿戴者发音进行增强时，根据所述第二音频信号对所述第一音频信号进行增强处理；或者，When the current configuration is to enhance the pronunciation of the wearer, the first audio signal is enhanced according to the second audio signal; or,在当前配置为对环境音进行增强时，根据所述第一音频信号对所述第二音频信号进行增强处理。When the current configuration is to enhance the ambient sound, the second audio signal is enhanced according to the first audio signal.
- 根据权利要求12-18任一项所述的电子设备,其中,所述处理器还用于执行:The electronic device according to any one of claims 12-18, wherein the processor is further configured to execute:在所述穿戴者处于运动状态时,对所述现实环境进行三维重建,得到虚拟现实环境;When the wearer is in motion, three-dimensional reconstruction is performed on the real environment to obtain a virtual reality environment;根据所述虚拟现实环境识别所述穿戴者是否存在运动风险;Identifying whether the wearer is at risk of exercise according to the virtual reality environment;在识别到所述穿戴者存在运动风险时,对所述穿戴者进行安全提示。When it is recognized that the wearer has a movement risk, a safety prompt is given to the wearer.
- 根据权利要求19所述的电子设备,其中,所述可穿戴设备还包括摄像模组,在对所述现实环境进行三维重建,得到虚拟现实环境时,所述处理器用于执行:The electronic device according to claim 19, wherein the wearable device further comprises a camera module, and when performing three-dimensional reconstruction of the real environment to obtain a virtual reality environment, the processor is configured to execute:通过所述摄像模组获取所述现实环境的深度图像和彩色图像;Obtain the depth image and color image of the real environment through the camera module;根据所述深度图像和所述彩色图像对所述现实环境进行三维重建,得到所述虚拟现实环境。The virtual reality environment is obtained by performing three-dimensional reconstruction of the real environment according to the depth image and the color image.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010802651.8 | 2020-08-11 | ||
CN202010802651.8A CN111935573B (en) | 2020-08-11 | 2020-08-11 | Audio enhancement method and device, storage medium and wearable device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022033236A1 true WO2022033236A1 (en) | 2022-02-17 |
Family
ID=73310697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/104553 WO2022033236A1 (en) | 2020-08-11 | 2021-07-05 | Audio enhancement method and apparatus, storage medium, and wearable device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111935573B (en) |
WO (1) | WO2022033236A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111935573B (en) * | 2020-08-11 | 2022-06-14 | Oppo广东移动通信有限公司 | Audio enhancement method and device, storage medium and wearable device |
WO2022134103A1 (en) * | 2020-12-25 | 2022-06-30 | 深圳市韶音科技有限公司 | Eyeglasses |
CN113115190B (en) * | 2021-03-31 | 2023-01-24 | 歌尔股份有限公司 | Audio signal processing method, device, equipment and storage medium |
CN113395629B (en) * | 2021-07-19 | 2022-07-22 | 歌尔科技有限公司 | Earphone, audio processing method and device thereof, and storage medium |
CN114466270A (en) * | 2022-03-11 | 2022-05-10 | 南昌龙旗信息技术有限公司 | Signal processing method, notebook computer and storage medium |
CN115662436B (en) * | 2022-11-14 | 2023-04-14 | 北京探境科技有限公司 | Audio processing method and device, storage medium and intelligent glasses |
CN118338219A (en) * | 2023-01-11 | 2024-07-12 | 上海又为智能科技有限公司 | Audio signal processing apparatus and method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070160243A1 (en) * | 2005-12-23 | 2007-07-12 | Phonak Ag | System and method for separation of a user's voice from ambient sound |
US9324313B1 (en) * | 2013-10-23 | 2016-04-26 | Google Inc. | Methods and systems for implementing bone conduction-based noise cancellation for air-conducted sound |
CN106686494A (en) * | 2016-12-27 | 2017-05-17 | 广东小天才科技有限公司 | Voice input control method of wearable device and wearable device |
CN108156291A (en) * | 2017-12-29 | 2018-06-12 | 广东欧珀移动通信有限公司 | Speech signal collection method, apparatus, electronic equipment and readable storage medium storing program for executing |
CN109348334A (en) * | 2018-10-26 | 2019-02-15 | 歌尔科技有限公司 | A kind of wireless headset and its ambience listening method and apparatus |
CN210442589U (en) * | 2019-08-20 | 2020-05-01 | 科大讯飞股份有限公司 | A spectacle-frame and glasses for role separation pronunciation are gathered |
CN111935573A (en) * | 2020-08-11 | 2020-11-13 | Oppo广东移动通信有限公司 | Audio enhancement method and device, storage medium and wearable device |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2974655B1 (en) * | 2011-04-26 | 2013-12-20 | Parrot | Microphone/headset audio combination comprising means for denoising a near-end speech signal, in particular for a hands-free telephony system |
US20150199950A1 (en) * | 2014-01-13 | 2015-07-16 | DSP Group | Use of microphones with vsensors for wearable devices |
US20160379661A1 (en) * | 2015-06-26 | 2016-12-29 | Intel IP Corporation | Noise reduction for electronic devices |
US9978397B2 (en) * | 2015-12-22 | 2018-05-22 | Intel Corporation | Wearer voice activity detection |
CN108076398A (en) * | 2016-11-11 | 2018-05-25 | Feng Li | Sound-pickup earphone with controllable flip structure |
CN106850963A (en) * | 2016-12-27 | 2017-06-13 | Guangdong Xiaotiancai Technology Co., Ltd. | Call control method for a wearable device, and wearable device |
US10455324B2 (en) * | 2018-01-12 | 2019-10-22 | Intel Corporation | Apparatus and methods for bone conduction context detection |
CN110931027B (en) * | 2018-09-18 | 2024-09-27 | Beijing Samsung Telecommunication Technology Research Co., Ltd. | Audio processing method and apparatus, electronic device, and computer-readable storage medium |
CN111261181A (en) * | 2020-01-15 | 2020-06-09 | Chengdu Falante Technology Co., Ltd. | Speech recognition method, noise recognition method, sound pickup device, and telephone communication apparatus |
CN111464905A (en) * | 2020-04-09 | 2020-07-28 | University of Electronic Science and Technology of China | Hearing enhancement method and system based on a smart wearable device, and wearable device |
2020
- 2020-08-11: CN application CN202010802651.8A granted as patent CN111935573B (status: active)
2021
- 2021-07-05: PCT application PCT/CN2021/104553 published as WO2022033236A1 (application filing)
Also Published As
Publication number | Publication date |
---|---|
CN111935573B (en) | 2022-06-14 |
CN111935573A (en) | 2020-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022033236A1 (en) | Audio enhancement method and apparatus, storage medium, and wearable device | |
US11823679B2 (en) | Method and system of audio false keyphrase rejection using speaker recognition | |
Akbari et al. | Lip2audspec: Speech reconstruction from silent lip movements video | |
US20240087565A1 (en) | Determining input for speech processing engine | |
EP1503368B1 (en) | Head mounted multi-sensory audio input system | |
WO2021022094A1 (en) | Per-epoch data augmentation for training acoustic models | |
JP7038210B2 (en) | Systems and methods for interactive session management | |
CN110874137B (en) | Interaction method and device | |
WO2020214844A1 (en) | Identifying input for speech recognition engine | |
EP3923198A1 (en) | Method and apparatus for processing emotion information | |
US20180054688A1 (en) | Personal Audio Lifestyle Analytics and Behavior Modification Feedback | |
US20230386461A1 (en) | Voice user interface using non-linguistic input | |
Zhang et al. | EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing | |
CN111326152A (en) | Voice control method and device | |
Bilac et al. | Gaze and filled pause detection for smooth human-robot conversations | |
CN113924542B (en) | Headset signal for determining emotional state | |
JP2018087847A (en) | Dialogue control device, its method and program | |
JP2020126195A (en) | Voice interactive device, control device for voice interactive device and control program | |
JP6480351B2 (en) | Speech control system, speech control device and speech control program | |
JP2018149625A (en) | Communication robot, program, and system | |
JPWO2018135304A1 (en) | Information processing apparatus, information processing method, and program | |
WO2020102943A1 (en) | Method and apparatus for generating gesture recognition model, storage medium, and electronic device | |
US11862147B2 (en) | Method and system for enhancing the intelligibility of information for a user | |
CN113362432A (en) | Facial animation generation method and device | |
EP3288035A2 (en) | Personal audio lifestyle analytics and behavior modification feedback |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21855289; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: PCT application non-entry in European phase | Ref document number: 21855289; Country of ref document: EP; Kind code of ref document: A1 |