CN114822565A - Audio signal generation method and system, and non-transitory computer readable medium


Info

Publication number
CN114822565A
Authority
CN
China
Prior art keywords
audio data
bone conduction
frequency
air conduction
conduction audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210237943.0A
Other languages
Chinese (zh)
Inventor
周美林
廖风云
齐心
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Voxtech Co Ltd
Original Assignee
Shenzhen Voxtech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Voxtech Co Ltd filed Critical Shenzhen Voxtech Co Ltd
Priority to CN202210237943.0A
Publication of CN114822565A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The application relates to an audio signal generation method and system, and a non-transitory computer readable medium. The audio signal generation method comprises the following steps: acquiring first audio data acquired by a bone conduction sensor; acquiring second audio data acquired by an air conduction sensor, wherein the first audio data and the second audio data represent the voice of a user and each consist of different frequency components; dividing the first audio data and the second audio data into a plurality of segments according to one or more frequency thresholds, wherein each segment of the first audio data corresponds to one segment of the second audio data; and splicing, fusing, and/or combining each of the plurality of segments of the first audio data and the second audio data based on weights to generate third audio data.

Description

Audio signal generation method and system, and non-transitory computer readable medium
This application is a divisional application of the Chinese patent application with application number CN201910864002.8, entitled "System and method for generating audio signals", filed with the Chinese Patent Office on September 12, 2019.
Technical Field
The present application relates generally to the field of signal processing, and in particular to audio signal generation methods and systems, and to non-transitory computer readable media.
Background
With the widespread use of electronic devices, communication between people is becoming more and more convenient. When using an electronic device for communication, a user may rely on a microphone to collect speech signals while speaking. The speech signal picked up by the microphone may represent the user's speech. However, due to, for example, the performance of the microphone itself, noise, etc., it is sometimes difficult to ensure that the speech signal picked up by the microphone is sufficiently intelligible (i.e., has sufficient fidelity). Especially in public places such as factories, automobiles, airplanes, ships, and shopping malls, different background noises seriously affect communication quality. Accordingly, it is desirable to provide systems and methods for generating audio signals with less noise and/or improved fidelity.
Disclosure of Invention
The embodiment of the application provides an audio signal generation method, which comprises the following steps: acquiring first audio data acquired by a bone conduction sensor; acquiring second audio data acquired by an air conduction sensor, wherein the first audio data and the second audio data represent the voice of a user and each consist of different frequency components; dividing the first audio data and the second audio data into a plurality of segments according to one or more frequency thresholds, wherein each segment of the first audio data corresponds to one segment of the second audio data; and splicing, fusing, and/or combining each of the plurality of segments of the first audio data and the second audio data based on weights to generate third audio data.
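These steps can be read as a frequency-domain segmentation followed by a weighted recombination. The following is a minimal numerical sketch of that reading, assuming the two signals are time-aligned and share one sampling rate; the function name, the default 2000Hz threshold, and the per-segment weights are illustrative choices, not values prescribed by the disclosure.

```python
import numpy as np

def fuse_audio(bone, air, fs, thresholds=(2000.0,),
               bone_weights=(1.0, 0.0), air_weights=(0.0, 1.0)):
    """Split two aligned signals into frequency segments at the given
    thresholds (Hz) and recombine corresponding segments with per-segment
    weights (here, corresponding weights sum to 1)."""
    n = min(len(bone), len(air))
    B = np.fft.rfft(bone[:n])                    # first audio data, frequency domain
    A = np.fft.rfft(air[:n])                     # second audio data, frequency domain
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    edges = [0.0, *thresholds, fs / 2.0]         # segment boundaries
    fused = np.zeros_like(B)
    for i in range(len(edges) - 1):
        upper = freqs <= edges[i + 1] if i == len(edges) - 2 else freqs < edges[i + 1]
        band = (freqs >= edges[i]) & upper
        fused[band] = bone_weights[i] * B[band] + air_weights[i] * A[band]

    return np.fft.irfft(fused, n)                # third audio data, time domain

# Toy usage: keep the bone conduction signal below 2 kHz, the air conduction signal above it.
fs = 16000
t = np.arange(fs) / fs
bone = np.sin(2 * np.pi * 300 * t)               # stand-in for bone conduction audio
air = np.sin(2 * np.pi * 3000 * t)               # stand-in for air conduction audio
third = fuse_audio(bone, air, fs)
```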
An embodiment of the present application further provides an audio signal generation system, including: at least one processor; and executable instructions that, when executed by the at least one processor, cause the system to perform the audio signal generation method described in the above embodiments.
An embodiment of the present application further provides a system for generating an audio signal, including: an acquisition module configured to acquire first audio data acquired by a bone conduction sensor and second audio data acquired by an air conduction sensor, wherein the first audio data and the second audio data represent the voice of a user and each consist of different frequency components; a weight determination unit configured to divide the first audio data and the second audio data into a plurality of segments according to one or more frequency thresholds, wherein each segment of the first audio data corresponds to one segment of the second audio data; and a combining unit configured to splice, fuse, and/or combine each of the plurality of segments of the first audio data and the second audio data based on weights to generate third audio data.
Embodiments of the present application also provide a non-transitory computer-readable medium storing computer instructions that, when executed, perform the audio signal generation method according to the above embodiments.
Additional features of the present application will be set forth in part in the description which follows. Additional features of some aspects of the present application will be apparent to those of ordinary skill in the art in view of the following description and accompanying drawings, or in view of the production or operation of the embodiments. The features of the present application may be realized and attained by practice or use of the methods, instrumentalities and combinations of the various aspects of the specific embodiments described below.
Drawings
The present application will be further described by way of exemplary embodiments. These exemplary embodiments will be described in detail by means of the accompanying drawings. These embodiments are non-limiting exemplary embodiments in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
fig. 1 is a schematic diagram of an exemplary audio signal generation system shown in accordance with some embodiments of the present application.
Fig. 2 is a block diagram of an exemplary processing device shown in accordance with some embodiments of the present application.
Fig. 3 is a block diagram of an exemplary audio data generation module shown in accordance with some embodiments of the present application.
Fig. 4 is a flow diagram of an exemplary process for generating an audio signal, shown in accordance with some embodiments of the present application.
Fig. 5 is a flow diagram of an exemplary process for reconstructing bone conduction audio data using a trained machine learning model, according to some embodiments of the present application.
Fig. 6 is a flow diagram illustrating an exemplary process for reconstructing bone conduction audio data using a harmonic correction model according to some embodiments of the present application.
Fig. 7 is a flow diagram illustrating an exemplary process for reconstructing bone conduction audio data using sparse matrix techniques, according to some embodiments of the present application.
Fig. 8 is a flow diagram of an exemplary process for generating audio data, shown in accordance with some embodiments of the present application.
Fig. 9 is a flow diagram of an exemplary process for generating audio data, shown in accordance with some embodiments of the present application.
Fig. 10 is a graph of frequency responses of bone conduction audio data, corresponding reconstructed bone conduction audio data, and corresponding air conduction audio data, according to some embodiments of the present application.
Fig. 11 is a graph of frequency response of bone conduction audio data collected by bone conduction sensors located at different parts of a user's body, according to some embodiments of the present application.
Fig. 12 is a graph of frequency response of bone conduction audio data collected by bone conduction sensors located at different parts of a user's body, according to some embodiments of the present application.
Fig. 13 is a time-frequency diagram of spliced audio data generated by splicing bone conduction audio data and air conduction audio data according to a frequency splicing point of 2000Hz according to some embodiments of the present application.
Fig. 14 is a time-frequency diagram of spliced audio data generated by splicing, at a frequency splicing point of 2000Hz, bone conduction audio data and air conduction audio data denoised using a Wiener filter, according to some embodiments of the present application.
Fig. 15 is a time-frequency diagram of spliced audio data generated by splicing, at a frequency splicing point of 2000Hz, bone conduction audio data and air conduction audio data denoised using spectral subtraction, according to some embodiments of the present application.
Fig. 16 is a time-frequency diagram of bone conduction audio data, according to some embodiments of the present application.
Fig. 17 is a time-frequency diagram of air conduction audio data, shown in accordance with some embodiments of the present application.
Fig. 18 is a time-frequency diagram illustrating spliced audio data generated by splicing bone conduction audio data and air conduction audio data according to a frequency splicing point of 2000Hz according to some embodiments of the present application.
Fig. 19 is a time-frequency diagram illustrating spliced audio data generated by splicing bone conduction audio data and air conduction audio data according to a frequency splicing point of 3000Hz according to some embodiments of the present application.
Fig. 20 is a time-frequency diagram of spliced audio data generated by splicing bone conduction audio data and air conduction audio data according to a frequency splicing point of 4000Hz according to some embodiments of the present application.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description refers to the accompanying drawings and sets forth specific details by way of example. However, it will be apparent to one skilled in the art that the present application may be practiced without these specific details. In other instances, well-known methods, procedures, systems, components, and/or circuits are described at a high level (without detail) in order to avoid unnecessarily obscuring aspects of the present application. It will be apparent to those skilled in the art that various modifications to the disclosed embodiments are possible, and that the general principles defined in this application may be applied to other embodiments and applications without departing from the spirit and scope of the application. Thus, the present application is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
These and other features, aspects, and advantages of the present application, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description of the accompanying drawings, all of which form a part of this specification. It is to be understood, however, that the drawings are designed solely for the purposes of illustration and description and are not intended as a definition of the limits of the application. It should be understood that the drawings are not to scale.
Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the operations in the flow charts need not be performed exactly in the order shown; rather, various steps may be processed in reverse order or simultaneously. Also, one or more other operations may be added to the flow charts, and one or more operations may be deleted from the flow charts.
Systems and methods for audio signal generation are provided. The system and method may acquire first audio data (also referred to as bone conduction audio data) acquired by a bone conduction sensor. The system and method may acquire second audio data (also referred to as air conduction audio data) acquired by an air conduction sensor. The bone conduction audio data and the air conduction audio data may represent a user's voice, and each may consist of different frequency components. The system and method may generate audio data based on the bone conduction audio data and the air conduction audio data. Frequency components above a certain frequency point in the generated audio data are increased relative to frequency components above that frequency point in the bone conduction audio data. The system and method may determine, based on the generated audio data, target audio data representing the user's voice, the target audio data having a higher fidelity than the bone conduction audio data and the air conduction audio data. According to the present application, audio data generated based on bone conduction audio data and air conduction audio data has more high frequency components than the bone conduction audio data and less noise than the air conduction audio data, which may improve the fidelity and intelligibility of the generated audio data relative to the bone conduction audio data and the air conduction audio data. In some embodiments, reconstructed bone conduction audio data may be obtained by reconstructing the bone conduction audio data, i.e., by increasing high frequency components in the bone conduction audio data; the reconstructed bone conduction audio data is closer to the air conduction audio data and of higher quality than the original bone conduction audio data, which may further improve the quality of the generated audio data. In some embodiments, the audio data may be generated by splicing the bone conduction audio data and the air conduction audio data at different frequency splicing points selected based on factors such as environmental noise, which may reduce noise in the audio data while preserving its fidelity.
Fig. 1 is a schematic diagram of an exemplary audio signal generation system 100 shown in accordance with some embodiments of the present application. The audio signal generation system 100 may include an audio capture device 110, a server 120, a terminal 130, a storage device 140, and a network 150.
The audio capture device 110 may capture audio data (e.g., audio signals) by capturing a user's voice or speech while the user is speaking. For example, when a user speaks, the sound emitted by the user may cause the air surrounding the user's mouth to vibrate and/or the tissue of the user's body (e.g., the skull) to vibrate. The audio capture device 110 may receive the vibrations and convert the vibrations into an electrical signal (e.g., an analog signal or a digital signal), which may also be referred to as audio data. The audio data may be transmitted in the form of electrical signals via the network 150 to the server 120/terminal 130 and/or storage device 140. In some embodiments, the audio capture device 110 may include a sound recorder, a headset (e.g., a bluetooth headset, a wired headset), a hearing aid device, and the like.
In some embodiments, the audio capture device 110 may be connected to a speaker wirelessly (e.g., via the network 150) and/or by wire. The audio capture device 110 may send captured audio data to the speaker to play and/or reproduce the user's voice. In some embodiments, the speaker and the audio capture device 110 may be integrated into a single device, such as a headset. In some embodiments, the audio capture device 110 and the speaker may be separate from each other. For example, the audio capture device 110 may be installed in a first terminal (e.g., a headset) and the speaker may be installed in another terminal (e.g., the terminal 130).
In some embodiments, the audio capture device 110 may include a bone conduction microphone 112 and an air conduction microphone 114. The bone conduction microphone 112 may include a bone conduction sensor for acquiring bone conduction audio data. The bone conduction sensor may acquire vibration signals conducted through the bone (e.g., skull) tissue of the user while the user is speaking to generate bone conduction audio data. In some embodiments, the bone conduction sensors may form a bone conduction sensor array. In some embodiments, the bone conduction microphone 112 may be placed at and/or in contact with a portion of the user's body to collect bone conduction audio data. The portion of the user's body may include the forehead, the neck (e.g., at the throat), the face (e.g., the area around the mouth, the chin), the top of the head, the mastoid process, the area around the ears, the area inside the ears, the temples, etc., or any combination thereof. For example, the bone conduction microphone 112 may be placed at and/or in contact with the tragus, the pinna, the inner ear canal, the outer ear canal, etc. In some embodiments, the acoustic characteristics of the bone conduction audio data may differ depending on the part of the user's body where the bone conduction microphone 112 is located and/or which it contacts. For example, bone conduction audio data collected by a bone conduction microphone 112 located in the area around the ears has a higher energy than bone conduction audio data collected by a bone conduction microphone 112 located on the forehead. The air conduction microphone 114 may include one or more air conduction sensors for collecting air conduction audio data conducted through the air while the user is speaking. In some embodiments, the air conduction sensors may form an air conduction sensor array. In some embodiments, the air conduction microphone 114 may be placed within a certain distance (e.g., 0cm, 1cm, 2cm, 5cm, 10cm, 20cm, etc.) from the user's mouth. The acoustic characteristics of the air conduction audio data (e.g., the average amplitude of the air conduction audio data) may differ depending on the distance between the air conduction microphone 114 and the user's mouth. For example, the greater the distance between the air conduction microphone 114 and the user's mouth, the smaller the average amplitude of the air conduction audio data may be.
In some embodiments, the server 120 may be a single server or a group of servers. The server groups may be centralized (e.g., a data center) or distributed (e.g., the servers 120 may be a distributed system). In some embodiments, the server 120 may be local or remote. For example, server 120 may access information and/or data stored in terminals 130 and/or storage devices 140 via network 150. As another example, server 120 may be directly connected to terminal 130 and/or storage device 140 to access stored information and/or data. In some embodiments, the server 120 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tiered cloud, and the like, or any combination thereof.
In some embodiments, the server 120 may include a processing device 122. The processing device 122 may process information and/or data related to audio signal generation to perform one or more of the functions described herein. For example, the processing device 122 may obtain bone conduction audio data collected by the bone conduction microphone 112 and air conduction audio data collected by the air conduction microphone 114, where the bone conduction audio data and the air conduction audio data represent the voice of the same user. The processing device 122 may generate the target audio data based on the bone conduction audio data and the air conduction audio data. As another example, the processing device 122 may obtain a trained machine learning model and/or a constructed filter from the storage device 140 or any other storage device. The processing device 122 may reconstruct the bone conduction audio data using the trained machine learning model and/or the constructed filter. As another example, the processing device 122 may train a preliminary machine learning model using multiple sets of speech samples (i.e., training data) to determine a trained machine learning model. Each of the sets of speech samples may include bone conduction audio data and air conduction audio data representing the same user speech. As yet another example, the processing device 122 may perform a noise reduction operation on the air conduction audio data to obtain noise-reduced air conduction audio data. The processing device 122 may generate the target audio data based on the reconstructed bone conduction audio data and the noise-reduced air conduction audio data. In some embodiments, the processing device 122 may include one or more processing engines (e.g., a single-chip processing engine or a multi-chip processing engine). By way of example only, the processing device 122 may include a central processing unit (CPU), an application specific integrated circuit (ASIC), an application specific instruction set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.
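The disclosure does not specify the form of the trained machine learning model at this point. Purely as a toy illustration of training on paired bone/air conduction speech samples, the sketch below fits a least-squares linear map from bone conduction magnitude spectra to air conduction magnitude spectra; the data layout and the linear form are our assumptions, not the patent's actual model.

```python
import numpy as np

def fit_linear_spectral_map(bone_frames, air_frames):
    """Fit a least-squares linear map W such that bone_frames @ W approximates
    air_frames. Inputs are arrays of shape (n_samples, n_bins) holding
    magnitude spectra of paired bone/air conduction speech frames."""
    X = np.asarray(bone_frames, dtype=float)
    Y = np.asarray(air_frames, dtype=float)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def apply_spectral_map(bone_frame, W):
    """Estimate an air-conduction-like magnitude spectrum from a bone
    conduction magnitude spectrum (toy reconstruction step)."""
    return np.asarray(bone_frame, dtype=float) @ W
```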
In some embodiments, the terminal 130 may include a mobile device 130-1, a tablet 130-2, a laptop 130-3, a built-in device in a vehicle 130-4, a wearable device 130-5, or the like, or any combination thereof. In some embodiments, the mobile device 130-1 may include a smart home device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a smart appliance control device, a smart monitoring device, a smart television, a smart camera, an intercom, and the like, or any combination thereof. In some embodiments, the smart mobile device may include a smart phone, a personal digital assistant (PDA), a gaming device, a navigation device, a point-of-sale (POS) device, etc., or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, virtual reality eyeshields, an augmented reality helmet, augmented reality glasses, augmented reality eyeshields, and the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include Google™ glasses, an Oculus Rift, a HoloLens, a Gear VR, etc. In some embodiments, the in-vehicle device 130-4 may include an in-vehicle computer, an in-vehicle television, or the like. In some embodiments, the terminal 130 may be a device having positioning technology for locating the position of the user and/or the terminal 130. In some embodiments, the wearable device 130-5 may include a smart bracelet, smart footwear, smart glasses, a smart helmet, a smart watch, smart clothing, a smart backpack, smart accessories, etc., or any combination thereof. In some embodiments, the audio capture device 110 may be integrated with the terminal 130.
Storage device 140 may store data and/or instructions. For example, the storage device 140 may store data of multiple sets of speech samples, one or more machine learning models, trained machine learning models and/or constructed filters, audio data collected by the bone conduction microphone 112 and the air conduction microphone 114, and so on. In some embodiments, the storage device 140 may store data obtained from the terminal 130 and/or the audio capture device 110. In some embodiments, the storage device 140 may store data and/or instructions that the server 120 may execute to perform the exemplary methods described in this disclosure. In some embodiments, the storage device 140 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), etc., or any combination thereof. Exemplary mass storage devices may include magnetic disks, optical disks, solid state drives, and the like. Exemplary removable memories may include flash drives, floppy disks, optical disks, memory cards, compact disks, magnetic tape, and the like. Exemplary volatile read-write memory may include random access memory (RAM). Exemplary RAM may include dynamic random access memory (DRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), static random access memory (SRAM), thyristor random access memory (T-RAM), zero-capacitance random access memory (Z-RAM), and the like. Exemplary ROM may include mask read-only memory (MROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM), digital versatile disc read-only memory, and the like. In some embodiments, the storage device 140 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tiered cloud, and the like, or any combination thereof.
In some embodiments, the storage device 140 may be connected to a network 150 to communicate with one or more components of the audio signal generation system 100 (e.g., the audio capture device 110, the server 120, and the terminal 130). One or more components of the audio signal generation system 100 may access data or instructions stored in the storage device 140 via the network 150. In some embodiments, the storage device 140 may be directly connected to or in communication with one or more components of the audio signal generation system 100 (e.g., the audio capture device 110, the server 120, and the terminal 130). In some embodiments, the storage device 140 may be part of the server 120.
The network 150 may facilitate the exchange of information and/or data. In some embodiments, one or more components of the audio signal generation system 100 (e.g., the audio capture device 110, the server 120, the terminal 130, and the storage device 140) may send information and/or data to other components of the audio signal generation system 100 via the network 150. For example, the server 120 may obtain bone conduction audio data and air conduction audio data from the terminal 130 via the network 150. In some embodiments, the network 150 may be any form of wired or wireless network, or any combination thereof. By way of example only, network 150 may include a cable network, a wired network, a fiber optic network, a telecommunications network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a bluetooth network, a zigbee network, a Near Field Communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 150 may include one or more network access points. For example, the network 150 may include wired or wireless network access points, such as base stations and/or internet exchange points, through which one or more components of the audio signal generation system 100 may connect to the network 150 to exchange data and/or information.
It will be understood by those of ordinary skill in the art that when an element (or component) of the audio signal generation system 100 operates, the element may operate via electrical and/or electromagnetic signals. For example, when the bone conduction microphone 112 transmits bone conduction audio data to the server 120, the processor of the bone conduction microphone 112 may generate an electrical signal encoding the bone conduction audio data. The processor of the bone conduction microphone 112 may then transmit the electrical signal to an output port. If the bone conduction microphone 112 communicates with the server 120 via a wired network, the output port may be physically connected to a cable, which may transmit the electrical signal to an input port of the server 120. If the bone conduction microphone 112 communicates with the server 120 via a wireless network, the output port of the bone conduction microphone 112 may be one or more antennas that convert the electrical signal to an electromagnetic signal. Similarly, the air conduction microphone 114 may transmit air conduction audio data to the server 120 via electrical or electromagnetic signals. Within an electronic device, such as the terminal 130 and/or the server 120, when its processor processes instructions, issues instructions, and/or performs actions, the instructions and/or actions are carried out by electrical signals. For example, when the processor retrieves or acquires data from a storage medium, an electrical signal may be sent to a read/write device of the storage medium, which can read or write structured data in or to the storage medium. The structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device. Herein, an electrical signal may refer to one electrical signal, a series of electrical signals, and/or at least two discrete electrical signals.
Fig. 2 is a block diagram illustrating an exemplary processing device according to some embodiments of the present application. As shown in fig. 2, the processing device 122 may include an acquisition module 210, a pre-processing module 220, an audio data generation module 230, and a storage module 240. Each of the modules described above may be hardware circuitry designed to perform certain actions, for example, in accordance with instructions stored in one or more storage media, and/or any combination of hardware circuitry and one or more storage media.
The acquisition module 210 may be configured to acquire data for generating an audio signal. For example, the acquisition module 210 may acquire raw audio data, one or more models, training data for training machine learning models, and so on. In some embodiments, the acquisition module 210 may acquire first audio data acquired by a bone conduction sensor. As used herein, a bone conduction sensor may refer to any sensor (e.g., the bone conduction microphone 112) capable of acquiring a vibration signal conducted by bone tissue (e.g., the skull) of a user while the user is speaking, as described elsewhere in this application (e.g., fig. 1 and its description). In some embodiments, the first audio data may include an audio signal in the time domain, an audio signal in the frequency domain, and so on. The first audio data may include an analog signal or a digital signal. The acquisition module 210 may also acquire second audio data acquired by an air conduction sensor. The air conduction sensor may refer to any sensor (e.g., the air conduction microphone 114) capable of collecting a vibration signal conducted through the air when the user speaks, as described elsewhere in this application (e.g., fig. 1 and its description). In some embodiments, the second audio data may include an audio signal in the time domain, an audio signal in the frequency domain, and so on. The second audio data may include an analog signal or a digital signal. In some embodiments, the acquisition module 210 may obtain a trained machine learning model, a constructed filter, a harmonic correction model, and/or the like for reconstructing the first audio data. In some embodiments, the processing device 122 may retrieve the one or more models, the first audio data, and/or the second audio data from the bone conduction sensor (e.g., the bone conduction microphone 112), the air conduction sensor (e.g., the air conduction microphone 114), the terminal 130, the storage device 140, or any other storage device, in real time or periodically, over the network 150.
The pre-processing module 220 may be configured to pre-process the first audio data and/or the second audio data. The first audio data and the second audio data after being preprocessed may also be referred to as preprocessed first audio data and preprocessed second audio data, respectively. Exemplary pre-processing operations may include domain transform operations, signal alignment operations, audio reconstruction operations, speech enhancement operations, and the like. In some embodiments, the pre-processing module 220 may perform a domain transform operation by performing a fourier transform or an inverse fourier transform. In some embodiments, the pre-processing module 220 may perform a normalization operation on the first audio data and/or the second audio data to obtain normalized first audio data and/or normalized second audio data for calibrating the first audio data and/or the second audio data. In some embodiments, the pre-processing module 220 may perform a speech enhancement operation on the second audio data (or the normalized second audio data). In some embodiments, the pre-processing module 220 may perform a noise reduction operation on the second audio data (or the normalized second audio data) to obtain noise-reduced second audio data. In some embodiments, the pre-processing module 220 may perform an audio reconstruction operation on the first audio data (or the normalized first audio data) using a trained machine learning model, a constructed filter, a harmonic correction model, a sparse matrix technique, or the like, or any combination thereof, to generate reconstructed first audio data.
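As a rough sketch of such a pre-processing chain (normalization, a noise-reduction pass, and a domain transform), one possible combination is shown below; the Wiener filter window size and the peak-normalization choice are assumptions for illustration, not parameters given in the disclosure.

```python
import numpy as np
from scipy.signal import wiener

def preprocess(x, fs):
    """Illustrative pre-processing: peak normalization (calibration step),
    a Wiener-filter noise-reduction pass, and a Fourier transform to the
    frequency domain for the later combination step."""
    peak = np.max(np.abs(x))
    x = x / peak if peak > 0 else x              # normalization operation
    x = wiener(x, mysize=29)                     # simple speech-enhancement pass
    spectrum = np.fft.rfft(x)                    # domain transform operation
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return x, spectrum, freqs
```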
The audio data generation module 230 may be configured to generate third audio data based on the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data). In some embodiments, the noise level associated with the third audio data may be lower than the noise level associated with the second audio data (or the pre-processed second audio data). In some embodiments, the audio data generation module 230 may generate the third audio data according to one or more frequency thresholds based on the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data). In some embodiments, the audio data generation module 230 may determine a single frequency threshold. The audio data generation module 230 may splice the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data) in the frequency domain according to a single frequency threshold to generate the third audio data.
In some embodiments, the audio data generation module 230 may determine the first weight and the second weight for the low frequency portion of the first audio data (or the pre-processed first audio data) and the high frequency portion of the first audio data (or the pre-processed first audio data), respectively, based at least in part on a frequency threshold. The low frequency part of the first audio data (or the pre-processed first audio data) comprises frequency components of the first audio data (or the pre-processed first audio data) which are smaller than the frequency threshold. The high frequency part of the first audio data (or the preprocessed first audio data) includes frequency components of the first audio data (or the preprocessed first audio data) that are greater than the frequency threshold. In some embodiments, the audio data generation module 230 may determine the third weight and the fourth weight for the low frequency portion of the second audio data (or the pre-processed second audio data) and the high frequency portion of the second audio data (or the pre-processed second audio data), respectively, based at least in part on a frequency threshold. The low frequency part of the second audio data (or the pre-processed second audio data) includes frequency components of the second audio data (or the pre-processed second audio data) that are less than the frequency threshold. The high frequency part of the second audio data (or the preprocessed second audio data) includes frequency components of the second audio data (or the preprocessed second audio data) that are greater than the frequency threshold. In some embodiments, the audio data generation module 230 may determine the third audio data by weighting the low frequency part and the high frequency part of the first audio data (or the pre-processed first audio data) and the low frequency part and the high frequency part of the second audio data (or the pre-processed second audio data) using the first weight, the second weight, the third weight, and the fourth weight, respectively.
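Written out, the single-threshold weighting described here amounts to the piecewise combination below, where S1, S2, and S3 denote the spectra of the first, second, and third audio data, f_th the frequency threshold, and w1 to w4 the first to fourth weights; the notation is ours, not the disclosure's.

```latex
S_3(f) =
\begin{cases}
  w_1\, S_1(f) + w_3\, S_2(f), & f < f_{th} \quad \text{(low frequency parts)} \\
  w_2\, S_1(f) + w_4\, S_2(f), & f \ge f_{th} \quad \text{(high frequency parts)}
\end{cases}
```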
In some embodiments, the audio data generation module 230 may determine the weights corresponding to the first audio data (or the pre-processed first audio data) and the weights corresponding to the second audio data (or the pre-processed second audio data) based at least in part on the first audio data (or the pre-processed first audio data) and/or the second audio data (or the pre-processed second audio data). The audio data generation module 230 may weight the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data) using a weight corresponding to the first audio data (or the pre-processed first audio data) and a weight corresponding to the second audio data (or the pre-processed second audio data) to determine the third audio data.
In some embodiments, the audio data generation module 230 may determine, based on the third audio data, target audio data representing user speech that has a higher fidelity than the first audio data and the second audio data. In some embodiments, the audio data generation module 230 may designate the third audio data as the target audio data. In some embodiments, the audio data generation module 230 may perform a post-processing operation on the third audio data to obtain the target audio data. In some embodiments, the audio data generation module 230 may perform an inverse fourier transform operation on the third audio data in the frequency domain to obtain the target audio data in the time domain. In some embodiments, the audio data generation module 230 may perform a noise reduction operation on the third audio data to obtain the target audio data. In some embodiments, the audio data generation module 230 may transmit the signal to a client terminal (e.g., terminal 130), the storage device 140, and/or any other storage device (not shown in the audio signal generation system 100) via the network 150. The signal may include target audio data. The signal may also be configured to cause the client terminal to play the target audio data.
The storage module 240 may be configured to store data and/or instructions associated with the audio signal generation system 100. For example, the storage module 240 may store voice sample data, machine learning models, trained machine learning models and/or constructed filters, audio data collected by the bone conduction microphone 112 and/or the air conduction microphone 114, and so on. In some embodiments, the storage module 240 may have the same configuration as the storage device 140.
It should be noted that the foregoing is provided for illustrative purposes only and is not intended to limit the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications can be made in the light of the description of the application. However, such changes and modifications do not depart from the scope of the present application. For example, the storage module 240 may be omitted. For another example, the audio data generation module 230 and the storage module 240 may be integrated into one module.
Fig. 3 is a block diagram illustrating an exemplary audio data generation module according to some embodiments of the present application. As shown in fig. 3, the audio data generation module 230 may include a frequency determination unit 310, a weight determination unit 320, and a combination unit 330. Each of the units described above may be a hardware circuit designed to perform certain actions, for example, in accordance with instructions stored in one or more storage media, and/or any combination of a hardware circuit and one or more storage media.
The frequency determination unit 310 may be configured to determine one or more frequency thresholds based at least in part on the bone conduction audio data and/or the air conduction audio data. In some embodiments, the frequency threshold may be a frequency point of the bone conduction audio data and/or the air conduction audio data. In some embodiments, the frequency threshold may be different from the frequency points of the bone conduction audio data and/or the air conduction audio data. In some embodiments, the frequency determination unit 310 may determine the frequency threshold based on a frequency response curve associated with the bone conduction audio data. The frequency response curve associated with the bone conduction audio data may include frequency response values that vary as a function of frequency. In some embodiments, the frequency determination unit 310 may determine one or more frequency thresholds based on frequency response values of a frequency response curve associated with the bone conduction audio data. In some embodiments, the frequency determination unit 310 may determine one or more frequency thresholds according to the variation characteristics of the frequency response curve. In some embodiments, the frequency determination unit 310 may determine one or more frequency thresholds based on a frequency response curve associated with the reconstructed bone conduction audio data. In some embodiments, frequency determination unit 310 may determine one or more frequency thresholds based on a noise level associated with at least a portion of the air conduction audio data. In some embodiments, the noise level may be represented by a signal-to-noise ratio of the air conduction audio data. The greater the signal-to-noise ratio, the lower the noise level may be. The greater the signal-to-noise ratio associated with the air conduction audio data, the greater the frequency threshold.
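One way to realize the stated monotonic relation (a higher SNR of the air conduction audio data yields a higher frequency threshold) is a simple interpolated mapping, sketched below; the anchor SNR values and the 2000Hz to 4000Hz range are illustrative choices echoing the splice points shown in figs. 18 to 20, not values fixed by the disclosure.

```python
import numpy as np

def estimate_snr_db(speech_estimate, noise_estimate):
    """SNR in dB from a speech power estimate and a noise power estimate."""
    p_sig = np.mean(np.asarray(speech_estimate, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise_estimate, dtype=float) ** 2) + 1e-12
    return 10.0 * np.log10(p_sig / p_noise + 1e-12)

def frequency_threshold(snr_db, lo_hz=2000.0, hi_hz=4000.0,
                        snr_lo_db=0.0, snr_hi_db=30.0):
    """Map the SNR of the air conduction audio data to a frequency threshold:
    the higher the SNR, the higher the threshold (clamped to [lo_hz, hi_hz])."""
    return float(np.interp(snr_db, [snr_lo_db, snr_hi_db], [lo_hz, hi_hz]))
```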
The weight determination unit 320 may be configured to divide the bone conduction audio data and the air conduction audio data into a plurality of segments according to one or more frequency thresholds. Each segment of bone conduction audio data may correspond to a segment of air conduction audio data. As used herein, a segment of air conduction audio data corresponding to a segment of bone conduction audio data may mean that the two segments of bone conduction audio data and air conduction audio data are defined by one or two identical frequency thresholds. In some embodiments, the count or number of frequency thresholds may be one, and the weight determination unit 320 may divide the bone conduction audio data and the air conduction audio data into two segments.
The weight determination unit 320 may also be configured for determining a weight for each of a plurality of segments of bone conduction audio data and air conduction audio data. In some embodiments, the weight of the particular segment of bone conduction audio data and the weight of the corresponding particular segment of air conduction audio data satisfy a condition such that the sum of the weight of the particular segment of bone conduction audio data and the weight of the corresponding particular segment of air conduction audio data is equal to 1. In some embodiments, the weight determination unit 320 may determine the weights of the bone conduction audio data or different segments of the air conduction audio data based on the SNR of the air conduction audio data.
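A hedged sketch of such an SNR-driven weight assignment is given below; the logistic mapping, its parameters, and the direction of the mapping (more weight on the air conduction segment as its SNR improves) are our assumptions, the text above only requiring that corresponding weights sum to 1 and depend on the SNR.

```python
import numpy as np

def segment_weights(snr_db, midpoint_db=15.0, slope=0.3):
    """Complementary weights for a corresponding pair of bone/air conduction
    segments, constructed so that they always sum to 1."""
    w_air = 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))  # assumed direction
    w_bone = 1.0 - w_air
    return w_bone, w_air

# e.g. segment_weights(25.0) gives a small bone weight and a large air weight.
```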
The combining unit 330 may be configured to splice, fuse and/or combine the bone conduction audio data and the air conduction audio data for each of the plurality of segments of bone conduction audio data and air conduction audio data based on the weights to generate spliced, or fused, or combined audio data. In some embodiments, the combining unit 330 may determine the low frequency portion of the bone conduction audio data and the high frequency portion of the air conduction audio data from a single frequency threshold. The combination unit 330 may splice and/or combine the low frequency portion of the bone conduction audio data and the high frequency portion of the air conduction audio data to generate spliced audio data. The combining unit 330 may determine the low frequency portion of the bone conduction audio data and the high frequency portion of the air conduction audio data based on one or more filters. In some embodiments, the combining unit 330 may weight the low frequency portion of the bone conduction audio data, the high frequency portion of the bone conduction audio data, the low frequency portion of the air conduction audio data, and the high frequency portion of the air conduction audio data using the first weight, the second weight, the third weight, and the fourth weight, respectively, to determine the spliced, fused, or combined audio data. In some embodiments, the combining unit 330 may determine the fused or combined audio data by weighting using the weight of the bone conduction audio data and the weight of the air conduction audio data, respectively.
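For the single-threshold case, the filter-based splice described here can be sketched in the time domain as a low-pass/high-pass pair summed at the splice frequency; the Butterworth type, the filter order, and the zero-phase filtering are illustrative choices, since the text only refers to "one or more filters".

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def splice_with_filters(bone, air, fs, splice_hz=2000.0, order=6):
    """Low-pass the bone conduction data and high-pass the air conduction
    data at the splice frequency, then add them to form the spliced audio."""
    sos_lp = butter(order, splice_hz, btype='lowpass', fs=fs, output='sos')
    sos_hp = butter(order, splice_hz, btype='highpass', fs=fs, output='sos')
    n = min(len(bone), len(air))
    low = sosfiltfilt(sos_lp, bone[:n])      # low frequency portion (bone conduction)
    high = sosfiltfilt(sos_hp, air[:n])      # high frequency portion (air conduction)
    return low + high                        # spliced audio data
```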
It should be noted that the foregoing is provided for illustrative purposes only and is not intended to limit the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications can be made in the light of the description of the application. However, such changes and modifications do not depart from the scope of the present application. For example, the audio data generation module 230 may further include an audio data partitioning sub-module (not shown in fig. 3). The audio data partitioning sub-module may be configured to partition each of the bone conduction audio data and the air conduction audio data into a plurality of segments according to one or more frequency thresholds. For another example, the weight determination unit 320 and the combining unit 330 may be integrated into one module.
Fig. 4 is a flow diagram of an exemplary process for generating an audio signal, shown in accordance with some embodiments of the present application. In some embodiments, process 400 may be implemented as instructions (e.g., an application program) stored in storage device 140. Processing device 122 may execute the instructions, and when executing the instructions, processing device 122 may be configured to perform process 400. The operation of the process shown below is for illustration purposes only. In some embodiments, process 400 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order of the operations of process 400 shown in FIG. 4 and described below is non-limiting.
In 410, the processing device 122 (e.g., the acquisition module 210) may acquire first audio data acquired by the bone conduction sensor. As used herein, a bone conduction sensor refers to any sensor (e.g., bone conduction microphone 112) that can acquire a vibration signal conducted by a user's bone tissue (e.g., skull) while the user (or user) is speaking, as described elsewhere in this application (e.g., fig. 1 and its description). The vibration signals collected by the bone conduction sensor may be converted to audio data (e.g., audio signals) by the bone conduction sensor or other device (e.g., amplifier, analog-to-digital converter (ADC), etc.). The audio data (e.g., the first audio data) acquired by the bone conduction sensor may also be referred to as bone conduction audio data. In some embodiments, the first audio data may include an audio signal in the time domain, an audio signal in the frequency domain, and so on. The first audio data may include an analog signal or a digital signal. In some embodiments, the processing device 122 may retrieve the first audio data from the bone conduction sensor (e.g., bone conduction microphone 112), the terminal 130, the storage device 140, or any other storage device in real-time or periodically over the network 150.
The first audio data may be represented by a superposition of multiple waves (e.g., sinusoids, harmonics, etc.) having different frequencies and/or intensities (i.e., amplitudes). As used herein, a wave having a particular frequency may also be referred to as a frequency component having a particular frequency. In some embodiments, the frequency components included in the first audio data acquired by the bone conduction sensor may be in a frequency range of 0Hz to 20kHz, or 20Hz to 10kHz, or 20Hz to 4000Hz, or 20Hz to 3000Hz, or 800Hz to 3500Hz, or 800Hz to 3000Hz, or 1500Hz to 3000Hz, or the like. The first audio data may be collected and/or generated by the bone conduction sensor while the user speaks. The first audio data may represent content of a user speaking (i.e., the user's speech). For example, the first audio data may include acoustic features and/or semantic information that may reflect the user's speech content. The acoustic features of the first audio data may include duration-related features, energy-related features, fundamental-frequency-related features, frequency-spectrum-related features, phase-spectrum-related features, and the like. The duration-related features may also be referred to as duration features. Exemplary duration characteristics may include speech rate, short-term average zero-crossing rate, and the like. The features associated with energy may also be referred to as energy or amplitude features. Exemplary energy or amplitude characteristics may include short-term average energy, short-term average amplitude, short-term energy gradient, average amplitude rate of change, short-term maximum amplitude, and the like. The feature associated with the fundamental frequency may also be referred to as a fundamental frequency feature. Exemplary fundamental frequency characteristics may include a fundamental frequency, a pitch of the fundamental frequency, an average fundamental frequency, a maximum fundamental frequency, a range of fundamental frequencies, and so forth. Exemplary features associated with the frequency spectrum may include formant features, Linear Predictive Cepstral Coefficients (LPCCs), mel-frequency cepstral coefficients (MFCCs), and the like. Exemplary characteristics associated with the phase spectrum may include instantaneous phase, initial phase, and the like.
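To make two of the named acoustic features concrete, the sketch below computes short-term average energy and the short-term average zero-crossing rate over fixed frames; the 25ms/10ms framing is a common convention assumed here, not a value from the disclosure.

```python
import numpy as np

def short_term_features(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Per-frame short-term average energy and zero-crossing rate."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    energies, zcrs = [], []
    for start in range(0, len(x) - frame + 1, hop):
        w = np.asarray(x[start:start + frame], dtype=float)
        energies.append(np.mean(w ** 2))                        # short-term average energy
        zcrs.append(np.mean(np.abs(np.diff(np.sign(w))) > 0))   # short-term average zero-crossing rate
    return np.array(energies), np.array(zcrs)
```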
In some embodiments, the first audio data may be acquired and/or generated by placing the bone conduction sensor at a certain part of the user's body and/or by contacting the bone conduction sensor with the user's skin. The portion of the user's body that comes into contact with the bone conduction sensor includes, but is not limited to, the forehead, the neck (e.g., the throat), the mastoid, the area around the ear, the area inside the ear, the temple, the face (e.g., the area around the mouth, the chin), the crown of the head, and the like. For example, the bone conduction microphone 112 may be placed at and/or in contact with the tragus, the pinna, the inner ear canal, the outer ear canal, etc. In some embodiments, the first audio data may differ depending on the part of the user's body that is in contact with the bone conduction sensor. For example, a difference in the portion of the user's body that is in contact with the bone conduction sensor may cause a change in a frequency characteristic (e.g., the amplitude of a frequency component) of the first audio data, the noise included in the first audio data, or the like. For example, the signal strength of the first audio data acquired by a bone conduction sensor located at the neck is greater than the signal strength of the first audio data acquired by a bone conduction sensor located at the tragus, and the signal strength of the first audio data acquired by a bone conduction sensor located at the tragus is greater than the signal strength of the first audio data acquired by a bone conduction sensor located in the ear canal. As another example, bone conduction audio data collected by a first bone conduction sensor located in a region around a user's ear has more frequency content than bone conduction audio data collected simultaneously by a second bone conduction sensor having the same configuration but located at the top of the user's head. In some embodiments, the first audio data may be acquired by a bone conduction sensor located at a part of the user's body and applying a pressure within a certain range (e.g., 0N to 1N, or 0N to 0.8N, etc.) to that part. For example, the first audio data may be acquired by a bone conduction sensor located at the tragus and applying a particular pressure (e.g., 0N, 0.2N, 0.4N, or 0.8N, etc.) to that part. A difference in the pressure exerted by the bone conduction sensor on the same body part may cause changes in the frequency components, acoustic characteristics (e.g., the amplitudes of frequency components), noise, etc. of the first audio data acquired by the bone conduction sensor. For example, when the pressure increases from 0N to 0.8N, the signal strength of the first audio data gradually increases and then gradually reaches saturation. More description of the effect of bone conduction sensor placement at different body parts on bone conduction audio data may be found elsewhere in this application (e.g., fig. 11 and its description). More description of the effect of different pressures applied by the bone conduction sensor to a user's body part on the bone conduction audio data may be found elsewhere in this application (e.g., fig. 12 and its description).
In 420, the processing device 122 (e.g., the acquisition module 210) may acquire second audio data acquired by the air conduction sensor. As used herein, an air conduction sensor may refer to any sensor (e.g., the air conduction microphone 114) capable of collecting vibration signals conducted through the air while a user is speaking, as described elsewhere in this application (e.g., fig. 1 and its description). The vibration signals collected by the air conduction sensor may be converted to audio data (e.g., audio signals) by the air conduction sensor or other devices (e.g., amplifiers, analog-to-digital converters (ADCs), etc.). The audio data (e.g., the second audio data) collected by the air conduction sensor may also be referred to as air conduction audio data. In some embodiments, the second audio data may include an audio signal in the time domain, an audio signal in the frequency domain, and so on. The second audio data may include an analog signal or a digital signal. In some embodiments, the processing device 122 may retrieve the second audio data from the air conduction sensor (e.g., the air conduction microphone 114), the terminal 130, the storage device 140, or any other storage device in real time or periodically over the network 150. In some embodiments, the second audio data may be acquired by placing the air conduction sensor within a certain distance (e.g., 0cm, 1cm, 2cm, 5cm, 10cm, 20cm, etc.) from the user's mouth. In some embodiments, a difference in the distance between the air conduction sensor and the user's mouth may result in a difference in the second audio data (e.g., the average amplitude of the second audio data) being captured.
The second audio data may be represented by a superposition of multiple waves (e.g., sine waves, harmonics, etc.) having different frequencies and/or intensities (i.e., amplitudes). In some embodiments, the frequency components included in the second audio data collected by the air conduction sensor may be in a frequency range of 0 Hz to 20 kHz, or 20 Hz to 20 kHz, or 800 Hz to 10 kHz. The air conduction sensor may collect and/or generate the second audio data when the user speaks. The second audio data may represent the content of the user's speech (i.e., the user's voice). For example, the second audio data includes acoustic features and/or semantic information that may reflect the content of the user's speech. The acoustic features of the second audio data may include features associated with a duration, features associated with an energy, features associated with a fundamental frequency, features associated with a frequency spectrum, features associated with a phase spectrum, and the like, as described in operation 410.
In some embodiments, the first audio data and the second audio data may represent the same voice of the same user with different frequency components. The first audio data and the second audio data representing the same voice of the same user may refer to the first audio data and the second audio data being simultaneously collected by the bone conduction sensor and the air conduction sensor, respectively, when the user speaks. The first audio data acquired by the bone conduction sensor may include first frequency components. The second audio data may include second frequency components. In some embodiments, the second frequency components include at least a portion of the first frequency components. The semantic information included in the second audio data may be the same as or different from the semantic information included in the first audio data. The acoustic characteristics of the second audio data may be the same as or different from the acoustic characteristics of the first audio data. For example, the amplitude of a certain frequency component of the first audio data may be different from the amplitude of the same frequency component of the second audio data. As another example, the first audio data may include more frequency components below a certain frequency point (e.g., 2000 Hz) or within a certain frequency range (e.g., 20 Hz to 2000 Hz) than the second audio data, while the first audio data may include fewer frequency components above a certain frequency point (e.g., 3000 Hz) or within a certain frequency range (e.g., 3000 Hz to 20 kHz) than the second audio data. As used herein, the first audio data including more frequency components than the second audio data below a certain frequency point or within a certain frequency range may refer to the count or number of frequency components of the first audio data below that frequency point or within that frequency range being greater than the count or number of frequency components of the second audio data below that frequency point or within that frequency range.
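The relationship between the two signals' frequency content can be illustrated with a short numerical check. The following is a minimal sketch (not taken from the patent text); the signal arrays, the sampling rate, and the 2000 Hz / 3000 Hz boundaries are assumptions chosen for illustration only.

```python
import numpy as np

def band_energy(signal: np.ndarray, fs: int, f_lo: float, f_hi: float) -> float:
    """Return the spectral energy of `signal` between f_lo and f_hi (in Hz)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs < f_hi)
    return float(np.sum(np.abs(spectrum[band]) ** 2))

fs = 16000                                   # assumed sampling rate
t = np.arange(fs) / fs
bone = np.sin(2 * np.pi * 300 * t)                                        # stand-in for the first audio data
air = 0.8 * np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 5000 * t)  # stand-in for the second audio data

# Bone conduction tends to carry more low-frequency content, air conduction more high-frequency content.
print(band_energy(bone, fs, 20, 2000) > band_energy(air, fs, 20, 2000))      # low band
print(band_energy(bone, fs, 3000, 8000) < band_energy(air, fs, 3000, 8000))  # high band
```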
In 430, the processing device 122 (e.g., the pre-processing module 220) may pre-process at least one of the first audio data or the second audio data. The pre-processed first audio data and second audio data may also be referred to as the pre-processed first audio data and the pre-processed second audio data, respectively. Exemplary pre-processing operations may include domain transform operations, signal calibration operations, audio reconstruction operations, speech enhancement operations, and the like.
A domain transform operation may be performed to convert the first audio data and/or the second audio data from the time domain to the frequency domain or from the frequency domain to the time domain. In some embodiments, the processing device 122 may perform the domain transform operation by performing a Fourier transform or an inverse Fourier transform. In some embodiments, to perform the domain transform operation, the processing device 122 may perform a framing operation, a windowing operation, etc., on the first audio data and/or the second audio data. For example, the first audio data may be divided into one or more speech frames. Each speech frame may consist of the audio data within a short duration (e.g., 5 ms, 10 ms, 15 ms, 20 ms, 25 ms, etc.) over which the audio data may be considered approximately stationary. A windowing operation may be performed on the speech frames using a window function to obtain processed speech frames. Exemplary window functions may include Hanning windows, Hamming windows, Blackman-Harris windows, and the like. Finally, the first audio data may be converted from the time domain to the frequency domain based on the processed speech frames using a Fourier transform operation.
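As a rough illustration of the framing, windowing, and Fourier transform steps described above, the sketch below splits a signal into fixed-length frames, applies a Hanning window, and transforms each frame. The frame length, sampling rate, and input signal are assumptions; practical implementations typically also use overlapping frames.

```python
import numpy as np

def frames_to_spectra(audio: np.ndarray, fs: int, frame_ms: float = 20.0) -> np.ndarray:
    """Frame the signal, apply a Hanning window, and FFT each frame."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    window = np.hanning(frame_len)
    spectra = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        spectra.append(np.fft.rfft(frame * window))   # frequency-domain frame
    return np.array(spectra)

fs = 16000                          # assumed sampling rate
audio = np.random.randn(fs)         # stand-in for the first audio data
print(frames_to_spectra(audio, fs).shape)   # (number of frames, frequency bins)
```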
The signal calibration operation may be used to unify the magnitudes (e.g., amplitudes) of the first audio data and the second audio data, so as to eliminate differences between the magnitudes of the first audio data and the second audio data caused by, for example, sensitivity differences between the bone conduction sensor and the air conduction sensor. In some embodiments, the processing device 122 may perform a normalization operation on the first audio data and/or the second audio data to calibrate the first audio data and/or the second audio data, obtaining normalized first audio data and/or normalized second audio data. For example, the processing device 122 may determine the normalized first audio data and/or the normalized second audio data according to equation (1), as follows:
$$ S_{\text{normalized}} = \frac{S_{\text{initial}}}{\left| S_{\max} \right|} \qquad (1) $$

where $S_{\text{normalized}}$ refers to the normalized first audio data (or the normalized second audio data), $S_{\text{initial}}$ refers to the first audio data (or the second audio data), and $\left| S_{\max} \right|$ refers to the maximum value among the absolute values of the amplitudes of the first audio data (or the second audio data).
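A minimal sketch of the normalization in equation (1); the input array is a stand-in for the first or second audio data.

```python
import numpy as np

def normalize(audio: np.ndarray) -> np.ndarray:
    """Scale the signal by its maximum absolute amplitude, per equation (1)."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

bone = np.array([0.1, -0.4, 0.25])   # stand-in for the first audio data
print(normalize(bone))               # peak absolute amplitude becomes 1.0
```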
The speech enhancement operation may be used to reduce noise or other extraneous and undesirable information in the audio data (e.g., the first audio data and/or the second audio data). The speech enhancement operation performed on the first audio data (or normalized first audio data) and/or the second audio data (or normalized second audio data) may use speech enhancement algorithms including spectral subtraction based speech enhancement algorithms, wavelet analysis based speech enhancement algorithms, kalman filter based speech enhancement algorithms, signal subspace based speech enhancement algorithms, auditory masking effect based speech enhancement algorithms, independent component analysis based speech enhancement algorithms, neural network techniques, or the like, or combinations thereof. In some embodiments, the speech enhancement operation may include a noise reduction operation. In some embodiments, the processing device 122 may perform a noise reduction operation on the second audio data (or the normalized second audio data) to obtain noise-reduced second audio data. In some embodiments, the normalized second audio data and/or the noise-reduced second audio data may also be referred to as pre-processed second audio data. In some embodiments, the noise reduction operation may include using a wiener filter, spectral subtraction, an adaptive algorithm, a Minimum Mean Square Error (MMSE) estimation algorithm, or the like, or any combination thereof.
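As one hedged example of a speech enhancement operation, the sketch below applies a very simple single-pass spectral subtraction. The signals and the way the noise spectrum is estimated are assumptions; practical systems estimate noise from speech-free frames and process the signal frame by frame.

```python
import numpy as np

def spectral_subtraction(noisy: np.ndarray, noise_estimate: np.ndarray) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum and resynthesize with the noisy phase."""
    spectrum = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_estimate))
    clean_mag = np.maximum(np.abs(spectrum) - noise_mag, 0.0)   # floor negative values at zero
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spectrum)), n=len(noisy))

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)        # stand-in for clean air conduction speech
noise = 0.3 * np.random.randn(fs)           # in practice, estimated from noise-only segments
denoised = spectral_subtraction(speech + noise, noise)
```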
The audio reconstruction operation may be used to emphasize or increase the frequency components of the initial bone conduction audio data (e.g., the first audio data or the normalized first audio data) that are above a certain frequency point (e.g., 2000 Hz, 3000 Hz) or within a certain frequency range (e.g., 2000 Hz to 20 kHz, 3000 Hz to 20 kHz), such that the reconstructed bone conduction audio data has improved fidelity relative to the initial bone conduction audio data. The reconstructed bone conduction audio data may be similar to, close to, or the same as ideal air conduction audio data with no or little noise that represents the same voice of the same user as the initial bone conduction audio data and that would be acquired by an air conduction sensor at the same time the bone conduction sensor acquired the initial bone conduction audio data. The reconstructed bone conduction audio data may be equivalent to air conduction audio data and may also be referred to as equivalent air conduction audio data corresponding to the initial bone conduction audio data. As used herein, the reconstructed bone conduction audio data being similar to, close to, or the same as the ideal air conduction audio data may refer to a degree of similarity between the reconstructed bone conduction audio data and the ideal air conduction audio data being greater than a certain threshold (e.g., 90%, 80%, 70%, etc.). More descriptions of the reconstructed bone conduction audio data, the initial bone conduction audio data, and the ideal air conduction audio data may be found elsewhere in this application (e.g., fig. 10 and its description).
In some embodiments, the processing device 122 may reconstruct the first audio data using a trained machine learning model, a constructed filter, a harmonic correction model, a sparse matrix technique, or the like, or any combination thereof, to generate reconstructed first audio data. In some embodiments, only one of the trained machine learning model, the constructed filter, the harmonic correction model, the sparse matrix technique, etc., may be used to generate the reconstructed first audio data. In some embodiments, the reconstructed first audio data may be generated using at least two of the trained machine learning model, the constructed filter, the harmonic correction model, the sparse matrix technique, or the like. For example, the processing device 122 may generate intermediate first audio data by reconstructing the first audio data using the trained machine learning model, and may then generate the reconstructed first audio data by reconstructing the intermediate first audio data using one of the constructed filter, the harmonic correction model, the sparse matrix technique, or the like. As another example, the processing device 122 may reconstruct the first audio data to generate intermediate first audio data using one of the trained machine learning model, the constructed filter, the harmonic correction model, the sparse matrix technique, etc., and may reconstruct the first audio data to generate another intermediate first audio data using another of these techniques; the processing device 122 may then generate the reconstructed first audio data by averaging the intermediate first audio data and the other intermediate first audio data. As yet another example, the processing device 122 may reconstruct the first audio data to generate a plurality of intermediate first audio data using two or more of the trained machine learning model, the constructed filter, the harmonic correction model, the sparse matrix technique, etc., and may generate the reconstructed first audio data by averaging the plurality of intermediate first audio data.
In some embodiments, the processing device 122 may reconstruct the first audio data (or the normalized first audio data) using the trained machine learning model to obtain reconstructed first audio data. Frequency components in the reconstructed first audio data that are above a certain frequency point (e.g., 2000Hz, 3000Hz) or within a certain frequency range (e.g., 2000Hz to 20kHz, 3000Hz to 20kHz, etc.) are increased relative to frequency components in the first audio data that are above the certain frequency point (e.g., 2000Hz, 3000Hz) or within the frequency range (e.g., 2000Hz to 20kHz, 3000Hz to 20kHz, etc.). The trained machine learning model may be constructed based on a deep learning model, a traditional machine learning model, or the like, or any combination thereof. Exemplary deep learning models may include Convolutional Neural Network (CNN) models, Recurrent Neural Network (RNN) models, long short term memory network (LSTM) models, and the like. Exemplary conventional machine learning models may include Hidden Markov Models (HMMs), multi-layer perceptron (MLP) models, and the like.
In some embodiments, a preliminary machine learning model may be trained using multiple sets of training data to determine the trained machine learning model. Each of the sets of training data may include bone conduction audio data and air conduction audio data. A set of training data may also be referred to as a speech sample. During training of the preliminary machine learning model, the bone conduction audio data in a speech sample may be an input of the preliminary machine learning model, and the air conduction audio data in the speech sample corresponding to the bone conduction audio data may be a desired output of the preliminary machine learning model. The bone conduction audio data and the air conduction audio data in a speech sample may represent the same speech and may be simultaneously acquired by a bone conduction sensor and an air conduction sensor, respectively, in a noise-free environment. As used herein, a noise-free environment may refer to an environment in which one or more noise assessment parameters (e.g., a noise criterion curve, a statistical noise level, etc.) satisfy a certain condition, such as being less than a certain threshold. The trained machine learning model may be configured to provide a correspondence between bone conduction audio data (e.g., the first audio data) and reconstructed bone conduction audio data (e.g., equivalent air conduction audio data), and may reconstruct the bone conduction audio data based on the correspondence. In some embodiments, the bone conduction audio data in the sets of training data may be acquired by bone conduction sensors placed at the same part of a user's (e.g., a tester's) body (e.g., the area around the ear). In some embodiments, the part of the body at which the bone conduction sensor that acquired the bone conduction audio data used to train the machine learning model is located may be consistent with and/or identical to the part of the body at which the bone conduction sensor that acquired the bone conduction audio data (e.g., the first audio data) to be reconstructed using the trained machine learning model is located. For example, the part of the body at which the bone conduction sensor that acquired the bone conduction audio data in each set of training data is located may coincide with and/or be the same as the part of the body at which the bone conduction sensor that acquired the first audio data is located. As another example, if the bone conduction sensor that acquires the first audio data is located at the neck, the bone conduction sensor that acquires the bone conduction audio data used to train the machine learning model is also located at the neck. The placement of the bone conduction sensors used to acquire the sets of training data at a body part of a user (e.g., a tester) may affect the correspondence between the bone conduction audio data (e.g., the first audio data) and the reconstructed bone conduction audio data (e.g., the equivalent air conduction audio data), and therefore may affect the reconstructed bone conduction audio data generated based on the correspondence using the trained machine learning model. Sets of training data collected by bone conduction sensors located at different parts of a user's body may yield different correspondences between bone conduction audio data (e.g., the first audio data) and reconstructed bone conduction audio data (e.g., equivalent air conduction audio data).
For example, multiple bone conduction sensors of the same configuration may be located at different parts of the body, such as the mastoid, the temple, the crown of the head, the external auditory canal, and the like. The multiple bone conduction sensors may simultaneously acquire bone conduction audio data generated while a user is speaking. A plurality of training sets may be formed based on the bone conduction audio data acquired by the multiple bone conduction sensors. Each of the plurality of training sets may include sets of training data collected by one of the multiple bone conduction sensors and an air conduction sensor. Each of the sets of training data may include bone conduction audio data and air conduction audio data representing the same speech. Each of the plurality of training sets may be used to train a machine learning model to obtain a trained machine learning model, so that a plurality of trained machine learning models may be obtained based on the plurality of training sets. The plurality of trained machine learning models may provide different correspondences between particular bone conduction audio data and reconstructed bone conduction audio data. For example, the same bone conduction audio data may be input into the plurality of trained machine learning models, respectively, to generate different reconstructed bone conduction audio data. In some embodiments, the bone conduction audio data (e.g., frequency response curves, signal strengths, acoustic characteristics, etc.) acquired by bone conduction sensors of different configurations may be different. Thus, the bone conduction sensor that collects the bone conduction audio data used to train the machine learning model may have the same configuration as the bone conduction sensor that collects the bone conduction audio data (e.g., the first audio data) to be reconstructed using the trained machine learning model. In some embodiments, the collected bone conduction audio data (e.g., its frequency response curve) may differ depending on the pressure applied to the part of the user's body. Thus, the pressure under which the bone conduction audio data used to train the machine learning model is acquired may be the same as the pressure under which the bone conduction audio data (e.g., the first audio data) to be reconstructed using the trained machine learning model is acquired. Further description regarding determining a trained machine learning model and/or reconstructing bone conduction audio data may be found in fig. 5 and its description.
In some embodiments, the processing device 122 (e.g., the pre-processing module 220) may reconstruct the first audio data (or the normalized first audio data) using a constructed filter to obtain reconstructed bone conduction audio data. The constructed filter may be configured to provide a relationship between particular air conduction audio data and particular bone conduction audio data corresponding to the particular air conduction audio data. As used herein, bone conduction audio data and air conduction audio data that correspond to each other may refer to bone conduction audio data and air conduction audio data representing the same voice of the same user. The particular air conduction audio data may also be referred to as equivalent air conduction audio data or reconstructed bone conduction audio data corresponding to the particular bone conduction audio data. The frequency components of the particular air conduction audio data that are above a certain frequency point (e.g., 2000 Hz, 3000 Hz) or within a certain frequency range (e.g., 2000 Hz to 20 kHz, 3000 Hz to 20 kHz, etc.) are more than the frequency components of the particular bone conduction audio data that are above the frequency point or within the frequency range. The processing device 122 may convert the particular bone conduction audio data into the particular air conduction audio data based on the relationship. For example, the processing device 122 may convert the first audio data using the constructed filter to obtain the reconstructed first audio data. In some embodiments, the bone conduction audio data in a speech sample may be denoted as d(n) and the corresponding air conduction audio data in the speech sample may be denoted as s(n). The bone conduction audio data d(n) and the corresponding air conduction audio data s(n) may be regarded as being generated from the same initial sound excitation signal e(n) by a bone conduction system and an air conduction system, which may be equivalent to a filter B and a filter V, respectively. The constructed filter may then be equivalent to a filter H. The filter H may be determined according to equation (2) as shown below:
$$ H(f) = \frac{V(f)}{B(f)} \qquad (2) $$

where $B(f)$ and $V(f)$ denote the frequency responses of the filters B and V, respectively.
in some embodiments, long-term spectral techniques may be used, for example, to determine the constructed filter. For example, the processing device 122 may determine the constructed filter according to equation (3) as shown below:
$$ H(f) = \frac{\bar{S}(f)}{\bar{D}(f)} \qquad (3) $$

where $H(f)$ refers to the constructed filter in the frequency domain, $\bar{S}(f)$ refers to the long-term spectral representation corresponding to the air conduction audio data s(n), and $\bar{D}(f)$ refers to the long-term spectral representation corresponding to the bone conduction audio data d(n). In some embodiments, the processing device 122 may acquire one or more sets of bone conduction audio data and air conduction audio data (also referred to as speech samples); in each set, the bone conduction audio data and the air conduction audio data are collected by the bone conduction sensor and the air conduction sensor, respectively, while an operator (e.g., a tester) speaks in a noise-free environment. The processing device 122 may determine the constructed filter based on the one or more sets of bone conduction audio data and air conduction audio data according to equation (3). For example, the processing device 122 may construct a candidate filter according to equation (3) based on the bone conduction audio data and the air conduction audio data corresponding to each other in each set, and may then determine the constructed filter based on the candidate filters. In some embodiments, the processing device 122 may perform an inverse Fourier transform (IFT) (e.g., a fast IFT) operation on the frequency-domain filter H(f) to obtain the constructed filter in the time domain.
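One possible reading of equation (3) in code is sketched below: the constructed filter is estimated as the ratio of the long-term (frame-averaged) magnitude spectra of paired air conduction and bone conduction recordings and then applied frame by frame to new bone conduction audio. The frame length, the averaging scheme, and the frame-wise application are assumptions made for illustration.

```python
import numpy as np

def build_filter(bone_samples: np.ndarray, air_samples: np.ndarray, frame_len: int = 512) -> np.ndarray:
    """Estimate H(f) as the ratio of long-term magnitude spectra, per equation (3)."""
    def long_term_spectrum(signal: np.ndarray) -> np.ndarray:
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        return np.mean(np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)), axis=0)

    s_bar = long_term_spectrum(air_samples)     # long-term spectrum of s(n)
    d_bar = long_term_spectrum(bone_samples)    # long-term spectrum of d(n)
    return s_bar / np.maximum(d_bar, 1e-8)      # avoid division by zero

def apply_filter(bone_audio: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Apply H(f) frame by frame to obtain equivalent air conduction audio."""
    frame_len = (len(h) - 1) * 2
    n_frames = len(bone_audio) // frame_len
    out = np.zeros(n_frames * frame_len)
    for i in range(n_frames):
        frame = bone_audio[i * frame_len:(i + 1) * frame_len]
        out[i * frame_len:(i + 1) * frame_len] = np.fft.irfft(np.fft.rfft(frame) * h, n=frame_len)
    return out

fs = 16000
bone = np.random.randn(5 * fs)   # stand-ins for a paired speech sample
air = np.random.randn(5 * fs)
h = build_filter(bone, air)
equivalent_air = apply_filter(bone, h)
```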
In some embodiments, the bone conduction sensor that acquired the bone conduction audio data used to determine the constructed filter is located at the same body part as the bone conduction sensor that acquired the bone conduction audio data to be reconstructed using the constructed filter. For example, the part of the body at which the bone conduction sensor that acquired the bone conduction audio data used to determine the constructed filter is located may be the same as the part of the body at which the bone conduction sensor that acquired the first audio data is located. As another example, if the bone conduction sensor that acquired the first audio data is located at the neck, the bone conduction sensor that acquired the bone conduction audio data used to determine the constructed filter is also located at the neck. Different filters may be generated from sets of training data collected by bone conduction sensors located at different parts of the body. For example, a first set of bone conduction audio data and corresponding air conduction audio data may be acquired, while the user is speaking, by a bone conduction sensor located at a first part of the user's body and an air conduction sensor, respectively. A second set of bone conduction audio data and corresponding air conduction audio data may be acquired, while the user is speaking, by a bone conduction sensor located at a second part of the user's body and an air conduction sensor, respectively. A first filter may be determined based on the first set of bone conduction audio data and corresponding air conduction audio data, and a second filter may be determined based on the second set of bone conduction audio data and corresponding air conduction audio data. The first filter and the second filter are different; that is, the first filter and the second filter provide different correspondences between bone conduction audio data and air conduction audio data.
In some embodiments, the processing device 122 (e.g., the pre-processing module 220) may reconstruct the first audio data (or the normalized first audio data) using the harmonic correction model to obtain reconstructed first audio data. The harmonic correction model may be configured to provide a relationship between the magnitude spectrum of particular air conduction audio data and the magnitude spectrum of particular bone conduction audio data corresponding to the particular air conduction audio data. As used herein, the particular air conduction audio data may also be referred to as equivalent air conduction audio data or reconstructed bone conduction audio data corresponding to the particular bone conduction audio data. The magnitude spectrum of the particular air conduction audio data may also be referred to as a corrected magnitude spectrum of the particular bone conduction audio data. The processing device 122 may determine a magnitude spectrum and a phase spectrum of the first audio data (or the normalized first audio data) in the frequency domain. The processing device 122 may correct the magnitude spectrum of the first audio data (or the normalized first audio data) using the harmonic correction model to obtain a corrected magnitude spectrum of the first audio data (or the normalized first audio data). The processing device 122 may then determine the reconstructed first audio data based on the corrected magnitude spectrum and the phase spectrum of the first audio data (or the normalized first audio data). Further description regarding the reconstruction of the first audio data using the harmonic correction model may be found elsewhere in this application (e.g., fig. 6 and its description).
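The magnitude-correction-plus-original-phase idea can be sketched as follows; the correction function here is a hypothetical placeholder standing in for the harmonic correction model.

```python
import numpy as np

def reconstruct_with_corrected_magnitude(bone_audio: np.ndarray, correct_magnitude) -> np.ndarray:
    """Correct the magnitude spectrum and resynthesize using the original phase spectrum."""
    spectrum = np.fft.rfft(bone_audio)
    magnitude, phase = np.abs(spectrum), np.angle(spectrum)
    corrected = correct_magnitude(magnitude)     # corrected magnitude spectrum
    return np.fft.irfft(corrected * np.exp(1j * phase), n=len(bone_audio))

# Placeholder "correction" that boosts bins above 2000 Hz, for illustration only.
fs = 16000
bone = np.random.randn(fs)                       # stand-in for the first audio data
freqs = np.fft.rfftfreq(len(bone), d=1.0 / fs)
boost = np.where(freqs > 2000, 3.0, 1.0)
reconstructed = reconstruct_with_corrected_magnitude(bone, lambda m: m * boost)
```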
In some embodiments, the processing device 122 (e.g., the pre-processing module 220) may reconstruct the first audio data (or the normalized first audio data) using a sparse matrix technique to obtain reconstructed first audio data. For example, the processing device 122 may obtain a first transformation relationship configured to convert the dictionary matrix of initial bone conduction audio data (e.g., the first audio data) into a dictionary matrix of reconstructed bone conduction audio data (e.g., the reconstructed first audio data) corresponding to the initial bone conduction audio data. The processing device 122 may obtain a second transformation relationship configured to convert the sparse code matrix of the initial bone conduction audio data into a sparse code matrix of the reconstructed bone conduction audio data corresponding to the initial bone conduction audio data. The processing device 122 may determine the dictionary matrix of the reconstructed first audio data based on the dictionary matrix of the first audio data using the first transformation relationship. The processing device 122 may determine the sparse code matrix of the reconstructed first audio data based on the sparse code matrix of the first audio data using the second transformation relationship. The processing device 122 may determine the reconstructed first audio data based on the determined dictionary matrix and sparse code matrix of the reconstructed first audio data. In some embodiments, the first transformation relationship and/or the second transformation relationship may be a default setting of the audio signal generation system 100. In some embodiments, the processing device 122 may determine the first transformation relationship and/or the second transformation relationship based on one or more sets of mutually corresponding bone conduction audio data and air conduction audio data. Further description regarding the reconstruction of the first audio data using the sparse matrix technique may be found elsewhere in this application (e.g., fig. 7 and its description).
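A toy sketch of the sparse-matrix reconstruction described above, assuming (as one possible interpretation) that the two transformation relationships act as matrices applied to the dictionary and sparse code matrices; all shapes and the identity transformations are placeholders.

```python
import numpy as np

def reconstruct_sparse(dictionary, sparse_code, dict_transform, code_transform):
    """Transform the dictionary and sparse code matrices, then multiply them to
    obtain the reconstructed (equivalent air conduction) audio representation."""
    new_dictionary = dict_transform @ dictionary   # first transformation relationship
    new_code = code_transform @ sparse_code        # second transformation relationship
    return new_dictionary @ new_code

D = np.random.randn(64, 32)        # dictionary matrix of the first audio data (toy shape)
C = np.random.randn(32, 100)       # sparse code matrix of the first audio data (toy shape)
T1 = np.eye(64)                    # placeholder first transformation relationship
T2 = np.eye(32)                    # placeholder second transformation relationship
reconstructed = reconstruct_sparse(D, C, T1, T2)   # shape (64, 100)
```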
In 440, the processing device 122 (e.g., the audio data generation module 230) may generate third audio data based on the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data). The frequency components of the third audio data above a certain frequency point (or threshold) are increased relative to the frequency components of the first audio data (or the pre-processed first audio data) above the frequency point (or threshold). In other words, there may be more frequency components above the frequency point (or threshold) in the third audio data than in the first audio data (or the pre-processed first audio data). In some embodiments, the noise level associated with the third audio data may be lower than the noise level associated with the second audio data (or the pre-processed second audio data). As used herein, an increase in the frequency components of the third audio data above the frequency point (or threshold) relative to the frequency components of the first audio data (or the pre-processed first audio data) above the frequency point may refer to the count or number of waves (e.g., sine waves or harmonics) in the third audio data having frequencies above the frequency point being greater than the count or number of waves (e.g., sine waves or harmonics) in the first audio data having frequencies above the frequency point. In some embodiments, the frequency point may be a constant in the range of 20 Hz to 20 kHz. For example, the frequency point may be 2000 Hz, 3000 Hz, 4000 Hz, 5000 Hz, etc. In some embodiments, the frequency point may be a frequency value of a frequency component in the third audio data and/or the first audio data.
In some embodiments, the processing device 122 may generate the third audio data according to one or more frequency thresholds based on the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data). For example, the processing device 122 may determine the one or more frequency thresholds based at least in part on the first audio data (or the pre-processed first audio data) and/or the second audio data (or the pre-processed second audio data). The processing device 122 may divide the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data) into a plurality of segments, respectively, according to one or more frequency thresholds. The processing device 122 may determine a weight for each of a plurality of segments of the first audio data (or pre-processed first audio data) and the second audio data (or pre-processed second audio data), respectively. The processing device 122 may then determine the third audio data based on the weight of each of the plurality of segments of the first audio data (or pre-processed first audio data) and the second audio data (or pre-processed second audio data).
In some embodiments, the processing device 122 may determine a single frequency threshold. The processing device 122 may splice the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data) in the frequency domain according to a single frequency threshold to generate third audio data. For example, the processing device 122 may determine, using the first filter, a low frequency portion of the first audio data (or the pre-processed first audio data) that includes frequency components below a single frequency threshold. The processing device 122 may determine, using the second filter, a high frequency portion of the second audio data (or the pre-processed second audio data) that includes frequency components above a single frequency threshold. The processing device 122 may splice and/or combine the low frequency portion of the first audio data (or the pre-processed first audio data) and the high frequency portion of the second audio data (or the pre-processed second audio data) to generate the third audio data. In some embodiments, the first filter may be a low pass filter with a single frequency threshold as a cutoff frequency, which may allow frequency components in the first audio data below the single frequency threshold to pass. The second filter may be a high pass filter with a single frequency threshold as a cutoff frequency, which may allow frequency components in the second audio data above the single frequency threshold to pass. In some embodiments, the processing device 122 may determine the single frequency threshold based at least in part on the first audio data (or the pre-processed first audio data) and/or the second audio data (or the pre-processed second audio data). Further description of determining a single frequency threshold may be found with reference to fig. 8 and the description thereof.
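A minimal sketch of the single-threshold splicing described above, using Butterworth low-pass and high-pass filters as stand-ins for the first and second filters; the threshold, filter order, and signals are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def splice(bone_audio: np.ndarray, air_audio: np.ndarray, fs: int,
           f_threshold: float = 2000.0, order: int = 4) -> np.ndarray:
    """Keep bone conduction content below the threshold and air conduction content above it."""
    low_sos = butter(order, f_threshold, btype="lowpass", fs=fs, output="sos")
    high_sos = butter(order, f_threshold, btype="highpass", fs=fs, output="sos")
    low_part = sosfiltfilt(low_sos, bone_audio)    # first filter: low-pass
    high_part = sosfiltfilt(high_sos, air_audio)   # second filter: high-pass
    return low_part + high_part                    # spliced third audio data

fs = 16000
bone = np.random.randn(fs)   # stand-in for the (pre-processed) first audio data
air = np.random.randn(fs)    # stand-in for the (pre-processed) second audio data
third = splice(bone, air, fs)
```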
In some embodiments, the processing device 122 may determine the first weight and the second weight for the low frequency portion of the first audio data (or the pre-processed first audio data) and the high frequency portion of the first audio data (or the pre-processed first audio data), respectively, based at least in part on a single frequency threshold. The processing device 122 may determine the third weight and the fourth weight for the low frequency portion of the second audio data (or the pre-processed second audio data) and the high frequency portion of the second audio data (or the pre-processed second audio data), respectively, based at least in part on a single frequency threshold. In some embodiments, the processing device 122 may weight the low frequency portion of the first audio data (or the pre-processed first audio data), the high frequency portion of the first audio data (or the pre-processed first audio data), the low frequency portion of the second audio data (or the pre-processed second audio data), the high frequency portion of the second audio data (or the pre-processed second audio data) using the first weight, the second weight, the third weight, and the fourth weight, respectively, to determine the third audio data. More description on determining the third audio data (or the spliced audio data) can be found in the description of fig. 8.
In some embodiments, the processing device 122 may determine the weights corresponding to the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data), respectively, based at least in part on the first audio data (or the pre-processed first audio data) and/or the second audio data (or the pre-processed second audio data). The processing device 122 may determine the third audio data by weighting the first audio data (or the pre-processed first audio data) and the second audio data (or the pre-processed second audio data) using a weight corresponding to the first audio data (or the pre-processed first audio data) and a weight corresponding to the second audio data (or the pre-processed second audio data). Further description regarding determining the third audio data may be found elsewhere in the present application (e.g., fig. 9 and its description).
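For the weighting variant, a minimal sketch with fixed placeholder weights is shown below; in the method described above the weights would be derived from the audio data itself (e.g., per segment or based on noise level), which is omitted here.

```python
import numpy as np

def fuse(bone_audio: np.ndarray, air_audio: np.ndarray,
         w_bone: float = 0.6, w_air: float = 0.4) -> np.ndarray:
    """Generate third audio data as a weighted combination of the two inputs."""
    length = min(len(bone_audio), len(air_audio))
    return w_bone * bone_audio[:length] + w_air * air_audio[:length]

bone = np.random.randn(16000)   # stand-in for the (pre-processed) first audio data
air = np.random.randn(16000)    # stand-in for the (pre-processed) second audio data
third = fuse(bone, air)
```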
In 450, the processing device 122 (e.g., the audio data generation module 230) may determine target audio data representing the user's voice based on the third audio data, the target audio data having a higher fidelity than the first audio data and the second audio data. The target audio data may represent the voice of the user represented by the first audio data and the second audio data. As used herein, fidelity may be used to represent the similarity between the output audio data (e.g., target audio data, first audio data, second audio data) and the original input audio data (e.g., the user's voice). Fidelity may represent intelligibility of the output audio data (e.g., target audio data, first audio data, second audio data).
In some embodiments, the processing device 122 may designate the third audio data as the target audio data. In some embodiments, the processing device 122 may perform post-processing operations on the third audio data to obtain the target audio data. In some embodiments, the post-processing operations may include a noise reduction operation, a domain transform operation (e.g., a Fourier transform (FT) operation), and the like, or combinations thereof. In some embodiments, the noise reduction operation performed on the third audio data may include using a wiener filter, spectral subtraction, an adaptive algorithm, a minimum mean square error (MMSE) estimation algorithm, or the like, or any combination thereof. In some embodiments, the noise reduction operation performed on the third audio data may be the same as or different from the noise reduction operation performed on the second audio data. For example, both the noise reduction operation performed on the second audio data and the noise reduction operation performed on the third audio data may use spectral subtraction. As another example, the noise reduction operation performed on the second audio data may use a wiener filter, while the noise reduction operation performed on the third audio data may use spectral subtraction. In some embodiments, the processing device 122 may perform an inverse Fourier transform operation on the third audio data in the frequency domain to obtain the target audio data in the time domain.
In some embodiments, the processing device 122 may transmit a signal to a client terminal (e.g., the terminal 130), the storage device 140, and/or any other storage device (not shown in the audio signal generation system 100) via the network 150. The signal may include the target audio data. The signal may also be configured to instruct the client terminal to play the target audio data.
It should be noted that the foregoing is provided for illustrative purposes only and is not intended to limit the scope of the present application. Various changes and modifications will occur to those skilled in the art based on the description herein. However, such changes and modifications do not depart from the scope of the present application. For example, operation 450 may be omitted. As another example, operations 410 and 420 may be integrated into a single operation.
Fig. 5 is a flow diagram of an exemplary process for reconstructing bone conduction audio data using a trained machine learning model, according to some embodiments of the present application. The operation of the process shown below is for illustration purposes only. In some embodiments, process 500 may be accomplished with one or more additional operations not described, and/or without one or more operations discussed. Additionally, the order of the operations of process 500 shown in FIG. 5 and described below is non-limiting. In some embodiments, one or more operations of process 500 may be performed to implement at least a portion of operation 430 as described in fig. 4.
At 510, the processing device 122 (e.g., the acquisition module 210) may acquire bone conduction audio data. In some embodiments, the bone conduction audio data may be raw audio data (e.g., the first audio data) collected by the bone conduction sensor while the user is speaking, as described elsewhere in this application (e.g., fig. 1 and its description). For example, the user's voice may be picked up by a bone conduction sensor (e.g., the bone conduction microphone 112) to generate an electrical signal (e.g., an analog signal or a digital signal), i.e., the bone conduction audio data. The bone conduction sensor may transmit the electrical signal to the server 120, the terminal 130, and/or the storage device 140 via the network 150. In some embodiments, the bone conduction audio data includes acoustic features and/or semantic information that may reflect the content of the user's speech. Exemplary acoustic features may include features associated with a duration, features associated with an energy, features associated with a fundamental frequency, features associated with a frequency spectrum, features associated with a phase spectrum, and the like, as described elsewhere in this application (e.g., fig. 4 and its description).
In 520, the processing device 122 (e.g., the acquisition module 210) may obtain the trained machine learning model. The trained machine learning model may be provided by training a preliminary machine learning model using multiple sets of training data. In some embodiments, the trained machine learning model may be used to process particular bone conduction audio data to obtain processed bone conduction audio data. The processed bone conduction audio data may also be referred to as reconstructed bone conduction audio data. The frequency components of the processed bone conduction audio data above a certain frequency threshold or frequency point (e.g., 800 Hz, 2000 Hz, 3000 Hz, 4000 Hz, etc.) may be increased relative to the frequency components of the particular bone conduction audio data above the frequency threshold or frequency point. The processed bone conduction audio data may be similar or identical to ideal air conduction audio data with little or no noise that is acquired by an air conduction sensor at the same time the bone conduction sensor acquires the particular bone conduction audio data, and the processed bone conduction audio data represents the same voice of the same user as the unprocessed particular bone conduction audio data. As used herein, the processed bone conduction audio data being similar or identical to the ideal air conduction audio data with no or little noise may refer to a similarity between an acoustic feature of the processed bone conduction audio data and an acoustic feature of the ideal air conduction audio data being greater than a certain threshold (e.g., 0.9, 0.8, 0.7, etc.). For example, in a noise-free environment, when the user speaks, bone conduction audio data and air conduction audio data are simultaneously collected by the bone conduction microphone 112 and the air conduction microphone 114, respectively. The bone conduction audio data is processed by the trained machine learning model to generate processed bone conduction audio data having the same or similar acoustic characteristics as the air conduction audio data collected by the corresponding air conduction microphone 114. In some embodiments, the processing device 122 may obtain the trained machine learning model from the terminal 130, the storage device 140, or any other storage device.
In some embodiments, the preliminary machine learning model may be constructed based on a deep learning model, a traditional machine learning model, or the like, or any combination thereof. The deep learning model may include a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, a long short-term memory network (LSTM) model, or the like, or any combination thereof. The traditional machine learning model may include a hidden Markov model (HMM), a multi-layer perceptron (MLP) model, or the like, or any combination thereof. In some embodiments, the preliminary machine learning model may include multiple layers, e.g., an input layer, multiple hidden layers, and an output layer. The multiple hidden layers may include one or more convolutional layers, one or more pooling layers, one or more batch normalization layers, one or more activation layers, one or more fully connected layers, a loss function layer, and the like. Each layer may include a plurality of nodes. In some embodiments, the preliminary machine learning model may be defined by at least two structural parameters and at least two learning parameters (or training parameters). The learning parameters may be changed (i.e., updated) by training the preliminary machine learning model using the sets of training data, whereas the structural parameters may be set and/or adjusted by a user before the preliminary machine learning model is trained. Exemplary structural parameters of the machine learning model may include the size of a layer kernel, the total number (or count) of layers, the number (or count) of nodes in each layer, the learning rate, the batch size, the step size, and the like. For example, if the preliminary machine learning model includes a long short-term memory model, the long short-term memory model may include one input layer having 2 nodes, four hidden layers each including 30 nodes, and one output layer having 2 nodes. The time-shift step of the long short-term memory model may be 65, and the learning rate may be 0.003. Exemplary learning parameters of the machine learning model may include a connection weight between two connected nodes, a bias vector associated with a node, and the like. The connection weight between two connected nodes may be configured to represent the proportion of the output value of one node that serves as an input value of the other connected node. The bias vector associated with a node may be configured to control the output value of the node deviating from the origin.
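As a hedged illustration only, the structural parameters mentioned in the example above could correspond to a model like the following PyTorch sketch; the feature dimensionality and the use of a linear output head are assumptions rather than part of the described method.

```python
import torch
import torch.nn as nn

class BoneToAirLSTM(nn.Module):
    """LSTM regressor mapping bone conduction features to equivalent air conduction
    features, loosely following the structural parameters mentioned above
    (2 input nodes, four hidden layers of 30 nodes, 2 output nodes)."""
    def __init__(self, n_in: int = 2, n_hidden: int = 30, n_layers: int = 4, n_out: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hidden, num_layers=n_layers, batch_first=True)
        self.head = nn.Linear(n_hidden, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time_steps, n_in)
        out, _ = self.lstm(x)
        return self.head(out)                              # (batch, time_steps, n_out)

model = BoneToAirLSTM()
features = torch.randn(8, 65, 2)          # 65 time steps, matching the example above
print(model(features).shape)              # torch.Size([8, 65, 2])
```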
In some embodiments, the preliminary machine learning model may be trained based on a machine learning model training algorithm using multiple sets of training data to determine the trained machine learning model. In some embodiments, one or more of the sets of training data may be acquired in a noise-free environment, for example, in an anechoic chamber. A set of training data may include particular bone conduction audio data and corresponding particular air conduction audio data. The particular bone conduction audio data and the corresponding particular air conduction audio data in a set of training data may be obtained from a particular user simultaneously by a bone conduction sensor (e.g., the bone conduction microphone 112) and an air conduction sensor (e.g., the air conduction microphone 114). In some embodiments, each of at least some of the sets of training data may include particular bone conduction audio data and corresponding reconstructed bone conduction audio data, which may be generated by reconstructing the particular bone conduction audio data using one or more reconstruction techniques as described elsewhere in this application. Exemplary machine learning model training algorithms may include a gradient descent algorithm, a Newton algorithm, a quasi-Newton algorithm, a Levenberg-Marquardt algorithm, a conjugate gradient algorithm, or the like, or combinations thereof. The trained machine learning model may be configured to provide a correspondence between bone conduction audio data (e.g., the first audio data) and reconstructed bone conduction audio data (e.g., equivalent air conduction audio data), and may reconstruct the bone conduction audio data based on the correspondence. In some embodiments, the bone conduction audio data in the sets of training data may be acquired by bone conduction sensors placed at the same part of a user's (e.g., a tester's) body (e.g., the area around the ear). In some embodiments, the part of the body at which the bone conduction sensor that acquired the bone conduction audio data used to train the machine learning model is located may be the same as the part of the body at which the bone conduction sensor that acquired the bone conduction audio data (e.g., the first audio data) to be reconstructed using the trained machine learning model is located. For example, the part of the body at which the bone conduction sensor that acquired the bone conduction audio data in each set of training data is located may be the same as the part of the body at which the bone conduction sensor that acquired the first audio data is located. As another example, if the bone conduction sensor that acquires the first audio data is located at the neck, the bone conduction sensor that acquires the bone conduction audio data used to train the machine learning model is also located at the neck.
The placement of the bone conduction sensors used to acquire the sets of training data at different body parts of a user (e.g., a tester) affects the correspondence between the bone conduction audio data (e.g., the first audio data) and the reconstructed bone conduction audio data (e.g., the equivalent air conduction audio data), and therefore affects the reconstructed bone conduction audio data generated based on the correspondence using the trained machine learning model. Sets of training data collected by bone conduction sensors located at different parts of a user's body may yield different correspondences between bone conduction audio data (e.g., the first audio data) and reconstructed bone conduction audio data (e.g., equivalent air conduction audio data). For example, multiple bone conduction sensors of the same configuration may be located at different parts of the body, such as the mastoid, the temple, the crown of the head, the external auditory canal, and the like. The multiple bone conduction sensors may simultaneously acquire bone conduction audio data generated while a user is speaking. A plurality of training sets may be formed based on the bone conduction audio data acquired by the multiple bone conduction sensors. Each of the plurality of training sets may include sets of training data collected by one of the multiple bone conduction sensors and an air conduction sensor. Each of the sets of training data may include bone conduction audio data and air conduction audio data representing the same speech. Each of the plurality of training sets may be used to train a machine learning model to obtain a trained machine learning model, so that a plurality of trained machine learning models may be obtained based on the plurality of training sets. The plurality of trained machine learning models may provide different correspondences between particular bone conduction audio data and reconstructed bone conduction audio data. For example, the same bone conduction audio data may be input into the plurality of trained machine learning models, respectively, to generate different reconstructed bone conduction audio data. In some embodiments, the bone conduction audio data (e.g., frequency response curves, signal strengths, acoustic characteristics, etc.) acquired by bone conduction sensors of different configurations may be different. Thus, the bone conduction sensor used to acquire the bone conduction audio data for training the machine learning model may have the same configuration as the bone conduction sensor used to acquire the bone conduction audio data (e.g., the first audio data) to be reconstructed using the trained machine learning model. In some embodiments, the collected bone conduction audio data (e.g., its frequency response curve) may differ when the bone conduction sensor applies different pressures, within a certain range (e.g., 0 N to 1 N, or 0 N to 0.8 N), to a part of the user's body. Thus, the pressure under which the bone conduction audio data used to train the machine learning model is collected may be the same as the pressure under which the bone conduction audio data (e.g., the first audio data) to be reconstructed using the trained machine learning model is collected.
In some embodiments, the trained machine learning model may be obtained by performing at least two iterations to update one or more learning parameters of the primary machine learning model. For each of at least two iterations, a particular set of training data may be input into the primary machine learning model. For example, specific bone conduction audio data of a specific training data set may be input into an input layer of the primary machine learning model, and specific air conduction audio data of a specific training data set may be input into an output layer of the primary machine learning model as a desired output of the primary machine learning model corresponding to the specific bone conduction audio data (i.e., input). The primary machine learning model may extract one or more acoustic features (e.g., duration features, amplitude features, fundamental frequency features, etc.) of particular bone conduction audio data and particular air conduction audio data in a particular training data set. Based on the extracted features, the primary machine learning model may determine a predicted output corresponding to particular bone conduction audio data (i.e., input). The predicted output corresponding to the particular bone conduction audio data is then compared to the desired output of the output layer (i.e., the input particular air conduction audio data) based on the cost function. The cost function of the primary machine learning model may be configured to evaluate a difference between an estimated value (e.g., a predicted output) and an actual value (e.g., a desired output or input of particular air conduction audio data) of the primary machine learning model. If the value of the cost function exceeds the threshold in the current iteration, the learning parameters of the primary machine learning model may be adjusted and updated such that the value of the cost function (i.e., the difference between the predicted output and the input specific air conduction audio data) is less than the threshold. Thus, in the next iteration, another set of training data may be input into the preliminary machine learning model to train the preliminary machine learning model as described above. Then, at least two iterations may be performed to update the learning parameters of the primary machine learning model until a termination condition is satisfied. The termination condition may indicate whether the primary machine learning model is sufficiently trained. For example, the termination condition may be satisfied if the value of the cost function associated with the primary machine learning model is minimal or less than a threshold (e.g., a constant). For another example, if the values of the cost function converge, the termination condition may be satisfied. The cost function may be considered converged if the value of the cost function changes by less than a threshold (e.g., a constant) in two or more successive iterations. As yet another example, the termination condition may be satisfied when a specified number of iterations are performed in the training process. A trained machine learning model may be determined based on the updated learning parameters. In some embodiments, the trained machine learning model may be sent to storage device 140/storage module 240 or any other storage device for storage.
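The iterative update of the learning parameters against a cost function can be sketched as a standard regression training loop; the random data, the model, the MSE cost, and the termination threshold below are placeholders, not the patent's specified choices.

```python
import torch
import torch.nn as nn

# Placeholder paired features: bone conduction input and air conduction desired output.
bone_feats = torch.randn(100, 65, 2)              # (samples, time steps, features)
air_feats = torch.randn(100, 65, 2)

model = nn.LSTM(2, 30, num_layers=4, batch_first=True)
head = nn.Linear(30, 2)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=0.003)    # learning rate from the example above
loss_fn = nn.MSELoss()                            # stand-in cost function

for iteration in range(50):
    optimizer.zero_grad()
    hidden, _ = model(bone_feats)
    predicted = head(hidden)                      # predicted output
    loss = loss_fn(predicted, air_feats)          # difference from the desired output
    loss.backward()
    optimizer.step()                              # update the learning parameters
    if loss.item() < 1e-3:                        # simple termination condition
        break
```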
In 530, the processing device 122 (e.g., the pre-processing module 220) may process the bone conduction audio data using the trained machine learning model to obtain reconstructed bone conduction audio data. In some embodiments, the processing device 122 may input the bone conduction audio data into the trained machine learning model, which may then output the processed bone conduction audio data. In some embodiments, the processing device 122 may extract acoustic features of the bone conduction audio data and input the extracted acoustic features into the trained machine learning model, which may output the processed bone conduction audio data. The frequency components of the processed bone conduction audio data above a frequency threshold or frequency point (e.g., 800 Hz, 2000 Hz, 3000 Hz, etc.) are increased relative to the frequency components of the unprocessed bone conduction audio data above the frequency threshold or frequency point. In some embodiments, the processing device 122 may send the processed bone conduction audio data to a client terminal (e.g., the terminal 130). The client terminal (e.g., the terminal 130) may convert the processed bone conduction audio data into voice and play the voice to the user.
It should be noted that the foregoing is provided for illustrative purposes only and is not intended to limit the scope of the present application. Various changes and modifications may be made by those skilled in the art based on the description of the present application. However, such changes and modifications do not depart from the scope of the present application.
Fig. 6 is a flow diagram of an exemplary process for reconstructing bone conduction audio data based on a harmonic correction model, shown in accordance with some embodiments of the present application. The operation of the process shown below is for illustration purposes only. In some embodiments, process 600 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of process 600 are illustrated in FIG. 6 and described below is non-limiting. In some embodiments, one or more operations of process 600 may be performed to implement at least a portion of operation 430 as described in connection with fig. 4.
In 610, the processing device 122 (e.g., the acquisition module 210) may acquire bone conduction audio data. In some embodiments, as described in connection with operation 410, the bone conduction audio data may be raw audio data (e.g., first audio data) collected by the bone conduction sensor while the user is speaking. For example, the user's voice may be picked up by a bone conduction sensor (e.g., bone conduction microphone 112) to generate an electrical signal (e.g., an analog signal or a digital signal) (i.e., bone conduction audio data). In some embodiments, the bone conduction audio data may include a plurality of waves having different frequencies and amplitudes. Bone conduction audio data in the frequency domain may be represented as a matrix comprising a plurality of elements. Each element of the plurality of elements may represent a frequency and an amplitude of the wave.
In 620, the processing device 122 (e.g., the pre-processing module 220) may determine a magnitude spectrum and a phase spectrum of the bone conduction audio data. In some embodiments, the processing device 122 may determine the magnitude spectrum and the phase spectrum of the bone conduction audio data in the frequency domain by performing a Fourier transform (FT) operation on the bone conduction audio data. For example, the processing device 122 may utilize a peak detection technique, including but not limited to a spectral envelope estimation vocoder (SEEVOC) algorithm, to detect the peaks of the waves in the bone conduction audio data in the frequency domain. The processing device 122 may determine the magnitude spectrum and the phase spectrum based on the peaks of the waves. For example, the amplitude of a wave is half the distance between its peak and trough.
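A minimal sketch of how the magnitude spectrum and phase spectrum of operation 620 might be computed, assuming the bone conduction audio data is available as a one-dimensional NumPy array; SEEVOC-style peak picking is omitted.

import numpy as np

def magnitude_and_phase(bone_audio: np.ndarray):
    """Magnitude spectrum and phase spectrum of a bone conduction frame via an FFT."""
    spectrum = np.fft.rfft(bone_audio)   # Fourier transform (FT) operation
    magnitude = np.abs(spectrum)         # amplitude of each frequency component
    phase = np.angle(spectrum)           # phase of each frequency component
    return magnitude, phase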
In 630, the processing device 122 (e.g., the pre-processing module 220) may obtain a harmonic correction model. The harmonic correction model may be configured to provide a relationship between the amplitude spectrum of particular air conduction audio data and the amplitude spectrum of particular bone conduction audio data corresponding to the particular air conduction audio data. The amplitude spectrum of the particular air conduction audio data corresponding to the particular bone conduction audio data may be determined based on the relationship and the amplitude spectrum of the particular bone conduction audio data. As used herein, the particular air conduction audio data may also be referred to as equivalent air conduction audio data or reconstructed bone conduction audio data corresponding to the particular bone conduction audio data.
In some embodiments, the harmonic correction model may be a default setting of the audio signal generation system 100. In some embodiments, the processing device 122 may retrieve the harmonic correction model from the storage device 140, the storage module 240, or any other storage device. In some embodiments, the harmonic correction model may be determined based on one or more sets of bone conduction audio data and corresponding air conduction audio data. The bone conduction audio data and the corresponding air conduction audio data in each set may be simultaneously acquired by a bone conduction sensor and an air conduction sensor while an operator (e.g., a tester) is speaking in a noise-free environment. The bone conduction sensor and the air conduction sensor may be the same as or different from the bone conduction sensor used to acquire the first audio data and the air conduction sensor used to acquire the second audio data. In some embodiments, the harmonic correction model may be determined according to operations A1-A3 based on the one or more sets of bone conduction audio data and corresponding air conduction audio data. In operation A1, the processing device 122 may determine a candidate correction matrix based on the amplitude spectrum of the bone conduction audio data in each set and the amplitude spectrum of the corresponding air conduction audio data in each set using a peak detection technique (e.g., a spectral envelope estimation vocoder (SEEVOC) algorithm).
In some embodiments, the part of the body where the bone conduction sensor that acquires the bone conduction audio data used to determine the harmonic correction model is located may be the same as the part of the body where the bone conduction sensor that acquires the bone conduction audio data to be reconstructed using the harmonic correction model is located. For example, the part of the body where the bone conduction sensor that acquires the bone conduction audio data used to determine the harmonic correction model is located may be the same as the part of the body where the bone conduction sensor that acquires the first audio data is located. As another example, if the part of the body where the bone conduction sensor that acquired the first audio data is located is the neck, the part of the body where the bone conduction sensor that acquires the bone conduction audio data used to determine the harmonic correction model is located is also the neck. Different harmonic correction models may be generated from sets of data collected by bone conduction sensors located at different parts of the user's body. For example, a first set of bone conduction audio data and corresponding air conduction audio data may be acquired by a bone conduction sensor and an air conduction sensor located at a first portion of the user's body while the user is speaking. A second set of bone conduction audio data and corresponding air conduction audio data may be acquired by a bone conduction sensor and an air conduction sensor located at a second portion of the user's body while the user is speaking. A first harmonic correction model may be determined based on the first set of bone conduction audio data and the corresponding air conduction audio data. A second harmonic correction model may be determined based on the second set of bone conduction audio data and the corresponding air conduction audio data. The first harmonic correction model and the second harmonic correction model are different. The first and second harmonic correction models provide different correspondences between the amplitude spectrum of particular air conduction audio data and the amplitude spectrum of particular bone conduction audio data corresponding to the particular air conduction audio data. The reconstructed bone conduction audio data obtained by reconstructing the same bone conduction audio data based on the first harmonic correction model and the second harmonic correction model are therefore different.
In 640, the processing device 122 (e.g., the pre-processing module 220) may correct the magnitude spectrum of the bone conduction audio data to obtain a corrected magnitude spectrum of the bone conduction audio data. In some embodiments, the harmonic correction model may include a correction matrix including a weight coefficient corresponding to each element in the magnitude spectrum of the bone conduction audio data (e.g., the first audio data described in fig. 4). As used herein, an element in the magnitude spectrum may refer to the amplitude of a wave (i.e., a frequency component). The processing device 122 may correct the magnitude spectrum of the bone conduction audio data (e.g., the first audio data described in fig. 4 or the normalized first audio data) by multiplying the correction matrix with the magnitude spectrum of the bone conduction audio data to obtain the corrected magnitude spectrum of the bone conduction audio data (e.g., the first audio data described in fig. 4).
In 650, the processing device 122 (e.g., the pre-processing module 220) may determine reconstructed bone conduction audio data based on the corrected magnitude spectrum and the phase spectrum of the bone conduction audio data. In some embodiments, the processing device 122 may perform an inverse fourier transform on the corrected magnitude spectrum and the phase spectrum of the bone conduction audio data to obtain reconstructed bone conduction audio data.
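The correction of operations 640 and 650 may be sketched as follows. The element-wise ratio used here to build the correction matrix is an assumption for illustration only (the application derives the correction matrix from paired data using peak detection techniques such as SEEVOC), and all function names are hypothetical.

import numpy as np

def build_correction_matrix(bone_mag: np.ndarray, air_mag: np.ndarray) -> np.ndarray:
    # Assumed form: one weight coefficient per element of the magnitude spectrum,
    # taken here as the element-wise ratio of paired air/bone magnitudes.
    return air_mag / np.maximum(bone_mag, 1e-12)

def harmonic_correct(bone_audio: np.ndarray, correction: np.ndarray) -> np.ndarray:
    spectrum = np.fft.rfft(bone_audio)
    corrected_mag = correction * np.abs(spectrum)   # corrected magnitude spectrum (operation 640)
    phase = np.angle(spectrum)                      # phase spectrum is kept unchanged
    corrected = corrected_mag * np.exp(1j * phase)
    return np.fft.irfft(corrected, n=len(bone_audio))  # inverse FT (operation 650)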
It should be noted that the foregoing is provided for illustrative purposes only and is not intended to limit the scope of the present application. Various changes and modifications will occur to those skilled in the art based on the description herein. However, such changes and modifications do not depart from the scope of the present application.
Fig. 7 is a flow diagram of an exemplary process for reconstructing bone conduction audio data based on sparse matrix techniques, shown in accordance with some embodiments of the present application. The operation of the process shown below is for illustration purposes only. In some embodiments, process 700 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of process 700 are illustrated in FIG. 7 and described below is non-limiting. In some embodiments, one or more operations of process 700 may be performed to implement at least a portion of operation 430 as described in connection with fig. 4.
In 710, the processing device 122 (e.g., the acquisition module 210) may acquire bone conduction audio data. In some embodiments, as described in connection with operation 410, the bone conduction audio data may be raw audio data (e.g., first audio data) collected by the bone conduction sensor while the user is speaking. For example, the user's voice may be picked up by a bone conduction sensor (e.g., bone conduction microphone 112) to generate an electrical signal (e.g., an analog signal or a digital signal) (i.e., bone conduction audio data). In some embodiments, the bone conduction audio data may include a plurality of waves having different frequencies and amplitudes. The bone conduction audio data in the frequency domain may be represented as a matrix X. The matrix X may be determined based on the dictionary matrix D and the sparse code matrix C. For example, the audio data may be determined according to equation (4):
X ≈ DC, (4)
At 720, the processing device 122 (e.g., the pre-processing module 220) may obtain a first transformation relationship for converting the dictionary matrix of the bone conduction audio data into a dictionary matrix of reconstructed bone conduction audio data corresponding to the bone conduction audio data. In some embodiments, the first transformation relationship may be a default setting of the audio signal generation system 100. In some embodiments, the processing device 122 may obtain the first transformation relationship from the storage device 140, the storage module 240, or any other storage device. In some embodiments, the first transformation relationship may be determined based on one or more sets of bone conduction audio data and corresponding air conduction audio data (i.e., speech samples). The bone conduction audio data and the corresponding air conduction audio data in each set may be acquired simultaneously by a bone conduction sensor and an air conduction sensor, respectively, in a noise-free environment while an operator (e.g., a tester) speaks. For example, the processing device 122 may determine a dictionary matrix for the bone conduction audio data and a corresponding dictionary matrix for the air conduction audio data in each set of data as described in operation 740. For each of the one or more sets of bone conduction audio data and corresponding air conduction audio data, the processing device 122 may divide the dictionary matrix for the air conduction audio data by the corresponding dictionary matrix for the bone conduction audio data to obtain a candidate first transformation relationship. In some embodiments, the processing device 122 may determine a plurality of candidate first transformation relationships based on a plurality of sets of bone conduction audio data and corresponding air conduction audio data. The processing device 122 may average the plurality of candidate first transformation relationships to obtain the first transformation relationship. In some embodiments, the processing device 122 may determine one of the plurality of candidate first transformation relationships as the first transformation relationship.
At 730, the processing device 122 (e.g., the pre-processing module 220) may obtain a second transformation relationship for converting the sparse code matrix of bone conduction audio data into a sparse code matrix of reconstructed bone conduction audio data corresponding to the bone conduction audio data. In some embodiments, the second transformation relationship may be a default setting of the audio signal generation system 100. In some embodiments, the processing device 122 may retrieve the second transformation relationship from the storage device 140, the storage module 240, or any other storage device. In some embodiments, the second transformation relationship may be determined based on one or more sets of bone conduction audio data and corresponding air conduction audio data. For example, the processing device 122 may determine a sparse code matrix for bone conduction audio data and a sparse code matrix for air conduction audio data corresponding to bone conduction audio data in each of the one or more sets of data according to operation 740. The processing device 122 may divide the sparse code matrix of the air conduction audio data by the corresponding sparse code matrix of the bone conduction audio data to obtain a candidate second transform relationship. In some embodiments, the processing device 122 may determine one or more candidate second transformation relationships based on one or more sets of bone conduction audio data and corresponding air conduction audio data. The processing device 122 may average the one or more candidate second transformation relationships to obtain a second transformation relationship. In some embodiments, the processing device 122 may determine one of the one or more candidate second transformation relationships as the second transformation relationship.
In some embodiments, the part of the body in which the bone conduction sensor acquiring the bone conduction audio data for determining the first transformation relationship (and/or the second transformation relationship) is located may be the same as the part of the body in which the bone conduction sensor acquiring the bone conduction audio data to be reconstructed using the first transformation relationship (and/or the second transformation relationship). For example, the portion of the body in which the bone conduction sensor acquiring the bone conduction audio data for determining the first transformation relationship (and/or the second transformation relationship) is located may be the same as the portion of the body in which the bone conduction sensor acquiring the first audio data is located. For another example, if the part of the body in which the bone conduction sensor that acquired the first audio data is located is the neck, the part of the body in which the bone conduction sensor that acquired the bone conduction audio data used to determine the first transformation relationship (and/or the second transformation relationship) is also the neck. Different bone conduction audio data collected by bone conduction sensors located at different parts of the user's body may generate different first transformation relationships (and/or second transformation relationships). Reconstructing the same bone conduction audio data based on different first transformation relationships (and/or second transformation relationships) may obtain different reconstructed bone conduction audio data.
In 740, the processing device 122 (e.g., the pre-processing module 220) may determine a dictionary matrix of reconstructed bone conduction audio data (e.g., the reconstructed first audio data described in fig. 4) using a first transformation relationship based on the dictionary matrix of bone conduction audio data (e.g., the first audio data or the normalized first audio data described in fig. 4). For example, the processing device 122 may multiply the first transformation relationship (e.g., in matrix form) with a dictionary matrix of bone conduction audio data (e.g., the first audio data described in fig. 4 or the normalized first audio data) to obtain a dictionary matrix of reconstructed bone conduction audio data (e.g., the reconstructed first audio data described in fig. 4). Processing device 122 may determine a dictionary matrix and/or a sparse code matrix for audio data (e.g., bone conduction audio data (e.g., first audio data), bone conduction audio data for speech samples, and/or air conduction audio data) by performing at least two iterations. Before performing at least two iterations, processing device 122 may initialize a dictionary matrix for audio data (e.g., first audio data) to obtain an initial dictionary matrix. For example, the processing device 122 may set each element in the initial dictionary matrix to 0 or 1. In each iteration, the processing device 122 may determine an estimated sparse code matrix for the audio data (e.g., the first audio data) using, for example, an Orthogonal Matching Pursuit (OMP) algorithm based on the audio data (e.g., the first audio data) and the initial dictionary matrix. The processing device 122 may determine an estimated dictionary matrix using, for example, a K-singular value decomposition (K-SVD) algorithm based on the audio data (e.g., the first audio data) and the estimated sparse code matrix. Processing device 122 may determine estimated audio data based on the estimated dictionary matrix and the estimated sparse code matrix according to equation (4). The processing device 122 may compare the estimated audio data with audio data (e.g., first audio data). If the difference between the estimated audio data generated in the current iteration and the audio data (e.g., the first audio data) exceeds a threshold, the processing device 122 may update the initial dictionary matrix using the estimated dictionary matrix generated in the current iteration. The processing device 122 may perform the next iteration based on the updated initial dictionary matrix until a difference between the estimated audio data generated in the current iteration and the audio data (e.g., the first audio data) is less than a threshold. If the difference between the estimated audio data and the audio data generated in the current iteration is less than the threshold, the processing device 122 may designate the estimated dictionary matrix and the estimated sparse code matrix generated in the current iteration as a dictionary matrix and/or a sparse code matrix for the audio data (e.g., the first audio data).
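As a rough sketch of the alternating sparse-coding/dictionary-update procedure described above, scikit-learn's dictionary-learning utilities may be used as a stand-in for a hand-written OMP/K-SVD loop; note that scikit-learn's orientation (X ≈ code · dictionary) is the transpose of equation (4), and the parameter values are illustrative only.

import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_dictionary_and_codes(frames: np.ndarray, n_atoms: int = 64, n_nonzero: int = 8):
    """frames: (n_frames, frame_length) matrix built from the audio data.

    Returns (dictionary, codes) such that frames ≈ codes @ dictionary, i.e. the
    transpose of X ≈ D C in equation (4).
    """
    learner = DictionaryLearning(
        n_components=n_atoms,
        transform_algorithm="omp",           # sparse coding step (OMP)
        transform_n_nonzero_coefs=n_nonzero,
    )
    codes = learner.fit_transform(frames)    # alternates sparse coding and dictionary updates
    return learner.components_, codes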
In 750, the processing device 122 (e.g., the pre-processing module 220) may determine a sparse code matrix of reconstructed bone conduction audio data (e.g., the reconstructed first audio data described in fig. 4) based on the sparse code matrix of bone conduction audio data (e.g., the first audio data or the normalized first audio data described in fig. 4) using a second transformation relationship. For example, the processing device 122 may multiply the second transform relationship (e.g., matrix) with a sparse code matrix of the bone conduction audio data to obtain a sparse code matrix of the reconstructed first audio data. A sparse code matrix for the first audio data may be determined as described in operation 740.
In 760, the processing device 122 (e.g., the pre-processing module 220) may determine the reconstructed bone conduction audio data (e.g., the reconstructed first audio data described in fig. 4) based on the dictionary matrix and the sparse code matrix of the reconstructed bone conduction audio data. For example, the processing device 122 may determine the reconstructed bone conduction audio data based on the dictionary matrix and the sparse code matrix determined in operations 740 and 750 according to equation (4).
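Under the assumption that the first and second transformation relationships are stored as matrices T1 and T2 with compatible shapes, operations 740-760 may be sketched as:

import numpy as np

def reconstruct_bone_audio(D_bone: np.ndarray, C_bone: np.ndarray,
                           T1: np.ndarray, T2: np.ndarray) -> np.ndarray:
    D_recon = T1 @ D_bone      # dictionary matrix of the reconstructed data (operation 740)
    C_recon = T2 @ C_bone      # sparse code matrix of the reconstructed data (operation 750)
    return D_recon @ C_recon   # X ≈ D C, equation (4) (operation 760)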
It should be noted that the foregoing is provided for illustrative purposes only and is not intended to limit the scope of the present application. Various changes and modifications will occur to those skilled in the art based on the description herein. However, such changes and modifications do not depart from the scope of the present application. For example, operations 720 and 730 may be integrated into a single operation.
Fig. 8 is a flow diagram of an exemplary process for generating audio data, shown in accordance with some embodiments of the present application. The operation of the process shown below is for illustration purposes only. In some embodiments, process 800 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of process 800 are illustrated in FIG. 8 and described below is non-limiting. In some embodiments, one or more operations of process 800 may be performed to implement at least a portion of operation 440 as described in connection with fig. 4.
In 810, the processing device 122 (e.g., the audio data generation module 230 or the frequency determination unit 310) may determine one or more frequency thresholds based at least in part on the bone conduction audio data and/or the air conduction audio data. Bone conduction audio data (e.g., first audio data or pre-processed first audio data) and air conduction audio data (e.g., second audio data or pre-processed second audio data) may be simultaneously collected by the bone conduction sensor and the air conduction sensor, respectively, while the user is speaking. More description of bone conduction audio data and air conduction audio data may be found elsewhere in this application (e.g., fig. 4 and its description).
As described herein, a frequency threshold may also be referred to as a frequency point. In some embodiments, the frequency threshold may be a frequency value of a frequency component in the bone conduction audio data and/or the air conduction audio data. In some embodiments, the frequency threshold may be different from the frequency values of the frequency components in the bone conduction audio data and/or the air conduction audio data. In some embodiments, the processing device 122 may determine the frequency threshold based on a frequency response curve associated with the bone conduction audio data. The frequency response curve associated with the bone conduction audio data may include frequency response values that vary as a function of frequency. In some embodiments, the processing device 122 may determine the frequency threshold based on the frequency response values of the frequency response curve associated with the bone conduction audio data. For example, the processing device 122 may determine, as a frequency threshold, the maximum frequency (e.g., 2000 Hz in the frequency response curve m shown in fig. 10) of a frequency range (e.g., 0-2000 Hz in the frequency response curve m shown in fig. 10) within which the corresponding frequency response values are less than a certain threshold (e.g., about 80 dB in the frequency response curve m shown in fig. 10). As another example, the processing device 122 may determine, as a frequency threshold, the minimum frequency (e.g., 4000 Hz in the frequency response curve m shown in fig. 10) of a frequency range within which the corresponding frequency response values are greater than a threshold (e.g., about 90 dB in the frequency response curve m shown in fig. 10). As yet another example, the processing device 122 may determine, as frequency thresholds, the minimum frequency and the maximum frequency of a frequency range within which the corresponding frequency response values fall within a certain range. As still another example, as shown in fig. 10, the processing device 122 may determine one or more frequency thresholds based on the frequency response curve "m" of the bone conduction audio data: the processing device 122 may determine a frequency range (e.g., 0-2000 Hz) within which the corresponding frequency response values are less than a certain threshold (e.g., 70 dB), and determine the maximum frequency in that frequency range as the frequency threshold. In some embodiments, the processing device 122 may determine one or more frequency thresholds based on the variation characteristics of the frequency response curve. For example, the processing device 122 may determine the maximum frequency and/or the minimum frequency of a frequency range in which the frequency response curve varies steadily as a frequency threshold. As another example, the processing device 122 may determine the maximum frequency and/or the minimum frequency of a frequency range in which the frequency response curve varies dramatically as a frequency threshold. For instance, the frequency response curve m varies substantially steadily in the frequency range less than 800 Hz and varies more dramatically in the frequency range greater than 800 Hz and less than 4000 Hz, so the processing device 122 may determine 800 Hz and 4000 Hz as frequency thresholds.
In some embodiments, the processing device 122 may reconstruct the bone conduction audio data using one or more reconstruction techniques described elsewhere in this application (e.g., fig. 4 and its description) to obtain reconstructed bone conduction audio data. The processing device 122 may determine a frequency response curve associated with the reconstructed bone conduction audio data. The processing device 122 may determine the frequency threshold based on a frequency response curve associated with the reconstructed bone conduction audio data in the same or similar manner as described above for determining the frequency threshold based on the frequency response curve of the bone conduction audio data.
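One possible reading of the rule "the maximum frequency of the range whose response values stay below a level", sketched with illustrative arrays and an assumed 80 dB level; the curve data and the level are placeholders, not values taken from fig. 10.

import numpy as np

def frequency_threshold(freqs: np.ndarray, response_db: np.ndarray, level_db: float = 80.0):
    """Maximum frequency of the contiguous low-frequency range whose response
    values are all below level_db (e.g., about 2000 Hz for curve m in fig. 10)."""
    below = response_db < level_db
    if not below[0]:
        return None                          # no low-frequency range below the level
    if (~below).any():
        last = int(np.argmax(~below)) - 1    # last index before the curve exceeds the level
    else:
        last = len(freqs) - 1
    return float(freqs[last])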
In some embodiments, the processing device 122 may determine one or more frequency thresholds based on a noise level associated with at least a portion of the air conduction audio data. The higher the noise level, the higher the frequency threshold (e.g., minimum frequency threshold) may be. The lower the noise level, the lower the frequency threshold (e.g., minimum frequency threshold) may be. In some embodiments, the noise level associated with the air conduction audio data may be represented by the amount or energy of noise included in the air conduction audio data. The greater the amount or energy of noise contained in the air conduction audio data, the greater the noise level. In some embodiments, the noise level may be represented by a signal-to-noise ratio of the air conduction audio data. The greater the signal-to-noise ratio, the lower the noise level. The greater the signal-to-noise ratio associated with the air conduction audio data, the smaller the threshold. For example, if the signal-to-noise ratio is 0dB, the frequency threshold may be 2000 Hz. If the signal-to-noise ratio is 20dB, the frequency threshold may be 4000 Hz. For example, the frequency threshold may be determined based on equation (5) as follows:
F_point = F1, if SNR < A1;
F_point = F2, if A1 ≤ SNR ≤ A2;
F_point = F3, if SNR > A2, (5)
where F_point represents the frequency threshold, SNR represents the signal-to-noise ratio of the air conduction audio data, F1, F2, and F3 are values in the range of 0-20 kHz satisfying F1 > F2 > F3, and A1 and A2 are constant values, for example, A1 may be 0 and A2 may be equal to 20. Further, with A1 = 0 and A2 = 20, the frequency threshold may be represented by equation (6):
F_point = F1, if SNR < 0;
F_point = F2, if 0 ≤ SNR ≤ 20;
F_point = F3, if SNR > 20. (6)
In some embodiments, the processing device 122 may determine the signal-to-noise ratio of the air conduction audio data according to equation (7) as follows:
SNR(n) = 10 · log10(E_s(n) / E_v(n)), (7)
where n refers to the nth speech frame in the air conduction audio data, E_s(n) refers to the energy of the pure audio data contained in the air conduction audio data, and E_v(n) refers to the energy of the noise data contained in the air conduction audio data. In some embodiments, the processing device 122 may determine the noise data in the air conduction audio data using a noise estimation algorithm, such as a minimum statistics (MS) algorithm, a minimum controlled recursive averaging (MCRA) algorithm, or the like. The processing device 122 may determine the pure audio data in the air conduction audio data based on the noise data in the air conduction audio data. The processing device 122 may then determine the energy of the pure audio data in the air conduction audio data and the energy of the noise data in the air conduction audio data. In some embodiments, the processing device 122 may determine the noise data in the air conduction audio data using the bone conduction sensor and the air conduction sensor. For example, the processing device 122 may obtain reference audio data acquired by the air conduction sensor during a period in which the bone conduction sensor acquires no signal, the period being close to the time at which the air conduction sensor acquires the air conduction audio data. As used herein, a time being close to another time may mean that the difference between the two times is less than a certain threshold (e.g., 10 ms, 100 ms, 1 second, 2 seconds, 3 seconds, 4 seconds, etc.). The reference audio data may be regarded as equivalent to the noise data in the air conduction audio data. The processing device 122 may then determine the pure audio data in the air conduction audio data based on the noise data (i.e., the reference audio data) in the air conduction audio data, and determine the signal-to-noise ratio associated with the air conduction audio data according to equation (7).
In some embodiments, the processing device 122 may extract the energy of the noise data in the air conduction audio data and determine the energy of the pure audio data based on the energy of the noise data and the total energy of the air conduction audio data. For example, the processing device 122 may subtract the energy of the noise data in the air conduction audio data from the total energy of the air conduction audio data to obtain the energy of the pure audio data in the air conduction audio data. The processing device 122 may determine the signal-to-noise ratio based on the energy of the pure audio data and the energy of the noise data according to equation (7).
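A frame-wise sketch of equation (7), assuming the noise energy of each frame has already been estimated (e.g., by an MS- or MCRA-style estimator, which is not shown):

import numpy as np

def frame_snr_db(frame: np.ndarray, noise_energy: float) -> float:
    """Signal-to-noise ratio of one air conduction speech frame, per equation (7)."""
    total_energy = float(np.sum(frame ** 2))
    speech_energy = max(total_energy - noise_energy, 1e-12)  # energy of the pure audio data
    return 10.0 * np.log10(speech_energy / max(noise_energy, 1e-12))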
In 820, the processing device 122 (e.g., the audio data generation module 230 or the weight determination unit 320) may divide the bone conduction audio data and the air conduction audio data into a plurality of segments, respectively, according to one or more frequency thresholds. In some embodiments, the bone conduction audio data and the air conduction audio data are time domain data and processing device 122 may perform a domain transform operation (e.g., an FT operation) on the bone conduction audio data and the air conduction audio data to convert the bone conduction audio data and the air conduction audio data to a frequency domain. In some embodiments, the bone conduction audio data and the air conduction audio data may be frequency domain data. The bone conduction audio data and the air conduction audio data in the frequency domain may each comprise a frequency spectrum. Bone conduction audio data in the frequency domain may also be referred to as bone conduction spectrum. Air conduction audio data in the frequency domain may also be referred to as air conduction spectrum. The processing device 122 may divide the bone conduction spectrum and the air conduction spectrum into a plurality of segments, respectively. Each segment of bone conduction audio data may correspond to a segment of air conduction audio data. As used herein, a segment of air conduction audio data corresponding to a segment of bone conduction audio data may mean that the two segments of bone conduction audio data and air conduction audio data are defined by one or two identical frequency thresholds. For example, if a particular segment of bone conduction audio data is defined by frequency points 2000Hz and 4000Hz, in other words, a particular segment of bone conduction audio data includes frequency components in the range of 2000Hz to 4000Hz, the segment of air conduction audio data to which the particular segment of bone conduction audio data corresponds may also be defined by frequency thresholds of 2000Hz and 4000 Hz. In other words, a section of the air conduction audio data corresponding to a section of the bone conduction audio data defined by 2000Hz to 4000Hz includes frequency components in the range of 2000Hz to 4000 Hz.
In some embodiments, the count or number of frequency thresholds may be one, and the processing device 122 may divide the bone conduction frequency spectrum and the air conduction frequency spectrum into two segments, respectively. For example, one of the two segments of the bone conduction frequency spectrum may include the portion of the bone conduction frequency spectrum having frequency components below the frequency threshold, and the other of the two segments may include the remaining portion of the bone conduction frequency spectrum having frequency components above the frequency threshold.
In 830, the processing device 122 (e.g., the audio data generation module 230 or the weight determination unit 320) may determine a weight for each of the plurality of segments of the bone conduction audio data and the air conduction audio data, respectively. In some embodiments, the weight of a particular segment of the bone conduction audio data and the weight of the corresponding particular segment of the air conduction audio data may satisfy a criterion such that their sum is equal to 1. For example, suppose the processing device 122 divides the bone conduction audio data and the air conduction audio data into two segments according to a single frequency threshold. The weight of the segment of the bone conduction audio data having frequency components below the single frequency threshold (also referred to as the low frequency portion of the bone conduction audio data) may be equal to 1, 0.9, 0.8, etc. The weight of the segment of the air conduction audio data having frequency components below the single frequency threshold (also referred to as the low frequency portion of the air conduction audio data) may then be equal to 0, 0.1, 0.2, etc., respectively, corresponding to the weight 1, 0.9, 0.8, etc. of the segment of the bone conduction audio data. Similarly, the weight of the other segment of the bone conduction audio data having frequency components above the single frequency threshold (also referred to as the high frequency portion of the bone conduction audio data) may be equal to 0, 0.1, 0.2, etc., and the weight of the other segment of the air conduction audio data having frequency components above the single frequency threshold (also referred to as the high frequency portion of the air conduction audio data) may then be equal to 1, 0.9, 0.8, etc., respectively, corresponding to the weight 0, 0.1, 0.2, etc. of the other segment of the bone conduction audio data.
In some embodiments, the processing device 122 may determine weights for the bone conduction audio data or different segments of the air conduction audio data based on a signal-to-noise ratio of the air conduction audio data. For example, the lower the signal-to-noise ratio of the air conduction audio data, the greater the weight of a particular segment of the bone conduction audio data may be, and the lower the weight of the corresponding particular segment of the air conduction audio data may be.
In 840, the processing device 122 (e.g., the audio data generation module 230 or the combination unit 330) may splice each of the plurality of segments of the bone conduction audio data and the air conduction audio data to generate spliced audio data. The spliced audio data may represent the user's speech with a higher fidelity than the bone conduction audio data and/or the air conduction audio data. Splicing the bone conduction audio data and the air conduction audio data may refer to selecting one or more portions of the frequency components of the air conduction audio data and one or more portions of the frequency components of the bone conduction audio data in the frequency domain according to the one or more frequency thresholds, and generating the audio data based on the selected portions of the bone conduction audio data and the selected portions of the air conduction audio data. As described herein, a frequency threshold may also be referred to as a frequency splicing point. In some embodiments, a selected portion of the bone conduction audio data and/or the air conduction audio data may include frequency components below a frequency threshold. In some embodiments, a selected portion of the bone conduction audio data and/or the air conduction audio data may include frequency components below one frequency threshold and above another frequency threshold. In some embodiments, a selected portion of the bone conduction audio data and/or the air conduction audio data may include frequency components above a frequency threshold.
In some embodiments, the processing device 122 may determine the spliced audio data according to equation (8) as follows:
S_spliced = a_m1·x_m1 + b_m1·y_m1 + a_m2·x_m2 + b_m2·y_m2 + ... + a_mN·x_mN + b_mN·y_mN, (8)
where S_spliced refers to the spliced audio data, (x_m1, x_m2, ..., x_mN) refers to the plurality of segments of the bone conduction audio data, each segment including frequency components within a frequency range defined by the one or more frequency thresholds, (y_m1, y_m2, ..., y_mN) refers to the plurality of segments of the air conduction audio data, each segment including frequency components within a frequency range defined by the one or more frequency thresholds, (a_m1, a_m2, ..., a_mN) refers to the weights of the plurality of segments of the bone conduction audio data, and (b_m1, b_m2, ..., b_mN) refers to the weights of the plurality of segments of the air conduction audio data. For example, x_m1 and y_m1 may refer to the frequency components below 800 Hz in the bone conduction audio data and the air conduction audio data, respectively. As another example, x_m2 and y_m2 may refer to the frequency components in the range of 800 Hz to 4000 Hz in the bone conduction audio data and the air conduction audio data, respectively. N may be a constant, e.g., 1, 2, 3, etc. Each a_mn (n = 1, 2, ..., N) may be a constant in the range of 0 to 1, and each b_mn (n = 1, 2, ..., N) may be a constant in the range of 0 to 1, with the sum of a_mn and b_mn equal to 1. In some embodiments, N may be equal to 2. The processing device 122 may divide each of the bone conduction audio data and the air conduction audio data into two segments according to a single frequency threshold. For example, the processing device 122 may determine the low frequency portion and the high frequency portion of the bone conduction audio data (or the air conduction audio data) based on the single frequency threshold. The low frequency portion of the bone conduction audio data (or the air conduction audio data) may include the frequency components of the bone conduction audio data (or the air conduction audio data) below the single frequency threshold, and the high frequency portion of the bone conduction audio data (or the air conduction audio data) may include the frequency components of the bone conduction audio data (or the air conduction audio data) above the single frequency threshold. In some embodiments, the processing device 122 may determine the low frequency portion and the high frequency portion of the bone conduction audio data (or the air conduction audio data) based on one or more filters. The one or more filters may include a low pass filter, a high pass filter, a band pass filter, etc., or any combination thereof.
In some embodiments, the processing device 122 may determine a first weight and a second weight for the low frequency portion of the bone conduction audio data and the high frequency portion of the bone conduction audio data, respectively, based at least in part on the single frequency threshold. The processing device 122 may determine a third weight and a fourth weight for the low frequency portion of the air conduction audio data and the high frequency portion of the air conduction audio data, respectively, based at least in part on the single frequency threshold. In some embodiments, the first weight, the second weight, the third weight, and the fourth weight may be determined based on the signal-to-noise ratio of the air conduction audio data. For example, if the signal-to-noise ratio of the air conduction audio data is greater than a threshold, the processing device 122 may determine that the first weight is less than the third weight, and/or that the second weight is greater than the fourth weight. As another example, the processing device 122 may determine a plurality of signal-to-noise ratio ranges, each corresponding to a fixed first weight, second weight, third weight, and fourth weight. The first weight and the second weight may be the same or different, and the third weight and the fourth weight may be the same or different. The sum of the first weight and the third weight is 1, and the sum of the second weight and the fourth weight is 1. The first weight, the second weight, the third weight, and/or the fourth weight may be constant values in the range of 0 to 1, e.g., 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0, etc. In some embodiments, the processing device 122 may determine the spliced audio data by weighting the low frequency portion and the high frequency portion of the bone conduction audio data and the low frequency portion and the high frequency portion of the air conduction audio data using the first weight, the second weight, the third weight, and the fourth weight, respectively. For example, the processing device 122 may determine the low frequency portion of the spliced audio data by weighted summing the low frequency portion of the bone conduction audio data and the low frequency portion of the air conduction audio data using the first weight and the third weight. The processing device 122 may determine the high frequency portion of the spliced audio data by weighted summing the high frequency portion of the bone conduction audio data and the high frequency portion of the air conduction audio data using the second weight and the fourth weight. The processing device 122 may combine the low frequency portion of the spliced audio data and the high frequency portion of the spliced audio data to obtain the spliced audio data.
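The single-threshold splicing of operations 820-840 may be sketched in the frequency domain as follows; the 2000 Hz threshold and the (1, 0, 0, 1) weights mirror the example above and are only one admissible choice, and the two signals are assumed to be time-aligned frames of equal length.

import numpy as np

def splice(bone: np.ndarray, air: np.ndarray, fs: float, f_point: float = 2000.0,
           w_bone_low: float = 1.0, w_air_low: float = 0.0,
           w_bone_high: float = 0.0, w_air_high: float = 1.0) -> np.ndarray:
    """Weighted splicing of bone and air conduction audio around one frequency threshold."""
    assert len(bone) == len(air), "frames are assumed time-aligned and of equal length"
    B, A = np.fft.rfft(bone), np.fft.rfft(air)
    freqs = np.fft.rfftfreq(len(bone), d=1.0 / fs)
    low = freqs < f_point
    spliced = np.where(low,
                       w_bone_low * B + w_air_low * A,     # below the frequency splicing point
                       w_bone_high * B + w_air_high * A)   # above the frequency splicing point
    return np.fft.irfft(spliced, n=len(bone))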
In some embodiments, the first weight of the low frequency portion of the bone conduction audio data may be equal to 1 and the second weight of the high frequency portion of the bone conduction audio data may be equal to 0. The third weight of the low frequency portion of the air conduction audio data may be equal to 0 and the fourth weight of the high frequency portion of the air conduction audio data may be equal to 1. In this case, the spliced audio data may be generated by splicing the low frequency portion of the bone conduction audio data and the high frequency portion of the air conduction audio data. In some embodiments, the audio data generated by splicing the bone conduction audio data and the air conduction audio data may differ according to the single frequency threshold. For example, figs. 16-20 are time-frequency diagrams of spliced audio data generated by splicing specific bone conduction audio data and specific air conduction audio data at different frequency splicing points (e.g., 2000 Hz, 3000 Hz, and 4000 Hz), according to some embodiments of the present application. The amounts of noise in the spliced audio data corresponding to figs. 16, 19, and 20 are different from each other. The higher the frequency splicing point, the less the amount of noise in the spliced audio data.
It should be noted that the foregoing is provided for illustrative purposes only and is not intended to limit the scope of the present application. Various changes and modifications will occur to those skilled in the art based on the description herein. However, such changes and modifications do not depart from the scope of the present application.
Fig. 9 is a flow diagram of an exemplary process for generating audio data, shown in accordance with some embodiments of the present application. The operation of the process shown below is for illustration purposes only. In some embodiments, process 900 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of process 900 are illustrated in FIG. 9 and described below is non-limiting. In some embodiments, one or more operations of procedure 900 may be performed to implement at least a portion of operation 440 as described in connection with fig. 4.
In 910, the processing device 122 (e.g., the audio data generation module 230 or the weight determination unit 320) may determine weights corresponding to bone conduction audio data based at least in part on at least one of the bone conduction audio data or the air conduction audio data. In some embodiments, the bone conduction audio data and the air conduction audio data may be simultaneously obtained by the bone conduction sensor and the air conduction sensor, respectively, when the user speaks. The air conduction audio data and the bone conduction audio data may represent a user's voice. More description of bone conduction audio data and air conduction audio data may be found in fig. 4 and its description.
In some embodiments, the processing device 122 may determine the weight of the bone conduction audio data based on the signal-to-noise ratio of the air conduction audio data. More description regarding determining the signal-to-noise ratio of the air conduction audio data may be found elsewhere in this application (e.g., fig. 8 and its description). The greater the signal-to-noise ratio of the air conduction audio data, the lower the weight of the bone conduction audio data. For example, the weight of the bone conduction audio data may be set to a value A if the signal-to-noise ratio of the air conduction audio data is greater than a predetermined threshold, and may be set to a value B if the signal-to-noise ratio is less than the predetermined threshold, where A < B. As another example, the processing device 122 may determine the weight of the bone conduction audio data according to equation (9) as follows:
W_bone = a1, if SNR < A1;
W_bone = a2, if A1 ≤ SNR ≤ A2;
W_bone = a3, if SNR > A2, (9)
where W_bone refers to the weight corresponding to the bone conduction audio data, SNR refers to the signal-to-noise ratio of the air conduction audio data, a1, a2, and a3 are values satisfying a1 > a2 > a3, and A1 and A2 are signal-to-noise ratio boundaries that may be default settings of the audio signal generation system 100. Further, the processing device 122 may determine at least two signal-to-noise ratio ranges, each signal-to-noise ratio range corresponding to a value of the weight of the bone conduction audio data, such as in equation (10), which assigns a fixed value of W_bone to each signal-to-noise ratio range.
In 920, the processing device 122 (e.g., the audio data generation module 230 or the weight determination unit 320) may determine a weight corresponding to the air conduction audio data based at least in part on at least one of the bone conduction audio data or the air conduction audio data. The method for determining the weight of the air conduction audio data may be similar to or the same as the method for determining the weight of the bone conduction audio data described in operation 910. For example, the processing device 122 may determine the weight of the air conduction audio data based on the signal-to-noise ratio of the air conduction audio data. More description regarding determining the signal-to-noise ratio of the air conduction audio data may be found elsewhere in this application (e.g., fig. 8 and its description). The greater the signal-to-noise ratio of the air conduction audio data, the higher the weight of the air conduction audio data. As another example, the weight of the air conduction audio data may be set to a value X if the signal-to-noise ratio of the air conduction audio data is greater than a predetermined threshold, and may be set to a value Y if the signal-to-noise ratio is less than the predetermined threshold, where X > Y. The weight of the bone conduction audio data and the weight of the air conduction audio data need to satisfy a certain criterion such that their sum is equal to 1. Accordingly, the processing device 122 may determine the weight of the air conduction audio data based on the weight of the bone conduction audio data. For example, the processing device 122 may determine the weight of the air conduction audio data as the difference between 1 and the weight of the bone conduction audio data.
At 930, the processing device 122 (e.g., the audio data generation module 230 or the combination unit 330) may determine the target audio data by weighted summation of the bone conduction audio data and the air conduction audio data using the weights of the bone conduction audio data and the weights of the air conduction audio data. The target audio data may represent the voice of the user, which is the same as the voice represented by the bone conduction audio data and the air conduction audio data. In some embodiments, processing device 122 may determine the target audio data according to equation (11) as follows:
S_out = a1 · S_air + b1 · S_bone, (11)
where S_air refers to the air conduction audio data, S_bone refers to the bone conduction audio data, a1 refers to the weight of the air conduction audio data, b1 refers to the weight of the bone conduction audio data, and S_out refers to the target audio data. The weights a1 and b1 satisfy the criterion that the sum of a1 and b1 is equal to 1. For example, the target audio data may be determined according to equation (12), which is obtained by substituting particular values of the weights a1 and b1 into equation (11).
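A sketch of the weighted fusion of process 900, with hypothetical signal-to-noise ratio break points standing in for the ranges of equations (9) and (10):

import numpy as np

def fuse(bone: np.ndarray, air: np.ndarray, snr_db: float) -> np.ndarray:
    """Target audio per equation (11): S_out = a1*S_air + b1*S_bone, with a1 + b1 = 1."""
    if snr_db < 0.0:            # hypothetical break points: lower SNR -> larger bone weight
        w_bone = 0.8
    elif snr_db < 20.0:
        w_bone = 0.5
    else:
        w_bone = 0.2
    w_air = 1.0 - w_bone
    return w_air * air + w_bone * bone  # bone and air assumed time-aligned, equal length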
in some embodiments, the processing device 122 may transmit the target audio data to a client terminal (e.g., terminal 130), the storage device 140, and/or any other storage device (not shown in the audio signal generation system 100) via the network 150.
Examples of the invention
The following examples are provided for illustrative purposes only and are not intended to limit the scope of the present application.
Example 1: Frequency response curves of bone conduction audio data, reconstructed bone conduction audio data, and corresponding air conduction audio data.
As shown in fig. 10, the curve "m" represents a frequency response curve of bone conduction audio data, and the curve "n" represents a frequency response curve of the air conduction audio data corresponding to the bone conduction audio data. The bone conduction audio data and the air conduction audio data represent the same speech of the user. The curve "m1" represents a frequency response curve of reconstructed bone conduction audio data generated by reconstructing the bone conduction audio data using a trained machine learning model according to process 500. As shown in fig. 10, the frequency response curve "m1" is closer to the frequency response curve "n" than the frequency response curve "m" is. In other words, the reconstructed bone conduction audio data is closer to the air conduction audio data than the bone conduction audio data is. Furthermore, the portion of the frequency response curve "m1" of the reconstructed bone conduction audio data below a frequency point (e.g., 2000 Hz) is particularly similar or close to the corresponding portion of the frequency response curve of the air conduction audio data.
Example 2: Frequency response curves of bone conduction audio data acquired by bone conduction sensors located at different parts of a user's body.
As shown in fig. 11, the curve "p" represents a frequency response curve of bone conduction audio data acquired by a first bone conduction sensor located at the neck of the user's body. The curve "b" represents a frequency response curve of bone conduction audio data acquired by a second bone conduction sensor located at the mastoid of the user's body. The curve "o" represents a frequency response curve of bone conduction audio data acquired by a third bone conduction sensor located in the ear canal (e.g., the external auditory canal) of the user's body. In some embodiments, the second and third bone conduction sensors have the same configuration as the first bone conduction sensor. The bone conduction audio data collected by the first, second, and third bone conduction sensors represent the same voice of the same user and are collected simultaneously by the first, second, and third bone conduction sensors. In some embodiments, the first, second, and third bone conduction sensors may have different configurations. The frequency response curves of bone conduction audio data acquired at the same part by bone conduction sensors with different configurations may be different.
As shown in fig. 11, the frequency response curve "p", the frequency response curve "b", and the frequency response curve "o" are different from each other. In other words, the bone conduction audio data collected by the first, second, and third bone conduction sensors differ according to the portions of the user's body where the first, second, and third bone conduction sensors are located. For example, the response values for frequency components below 800 Hz in the bone conduction audio data acquired by the first bone conduction sensor located at the neck of the user's body are greater than the response values for frequency components below 800 Hz in the bone conduction audio data acquired by the second bone conduction sensor located at the mastoid of the user's body. The frequency response curve may reflect the ability of a bone conduction sensor to convert acoustic energy into an electrical signal. According to the frequency response curves "p", "b", and "o", bone conduction sensors at different parts of the body have greater response values in the frequency range of 0 to about 5000 Hz than in the frequency range exceeding about 5000 Hz. The frequency response curves "p", "b", and "o" vary smoothly over the frequency range of 0 to about 2000 Hz and vary dramatically above 2000 Hz. That is, bone conduction sensors located at different areas of the user's body all have a strong ability to pick up low frequency signals or low frequency components (e.g., 0-2000 Hz, or 0-5000 Hz); in other words, the energy of the signals collected by the bone conduction sensors is mainly concentrated in the low frequency band.
Thus, as shown in fig. 11, a bone conduction device for collecting and/or playing audio signals may include a bone conduction sensor for collecting bone conduction audio signals, and the bone conduction sensor may be located at one or more portions or locations of the user's body through the structural design of the bone conduction device. When designing the structure of the bone conduction device, the area of the user's body where the bone conduction sensor is located may be determined based on one or more characteristics such as the frequency response curve, signal strength, user comfort, aesthetics, convenience, etc. For example, the bone conduction device may include a bone conduction sensor for acquiring a bone conduction audio signal, and when the user wears the bone conduction device, the bone conduction sensor may be in contact with the tragus, the ear canal, and/or other positions of the user, so that the signal strength of the audio signal collected by the bone conduction sensor is relatively high while the bone conduction device remains convenient and attractive to wear.
Example 3: exemplary frequency response curves of bone conduction audio data acquired by bone conduction sensors applying different pressures to the same area of a user's body.
In fig. 12, curve "L1" represents the frequency response curve of bone conduction audio data acquired with the bone conduction sensor applying a pressure F1 of 0N at the user's tragus. As used herein, the pressure exerted by the bone conduction sensor on a part of the user's body may also be referred to as the clamping force of the bone conduction sensor or the bone conduction device. Curve "L2" represents the frequency response curve of bone conduction audio data acquired with the bone conduction sensor applying a pressure F2 of 0.2N at the user's tragus. Curve "L3" represents the frequency response curve of bone conduction audio data acquired with the bone conduction sensor applying a pressure F3 of 0.4N at the user's tragus. Curve "L4" represents the frequency response curve of bone conduction audio data acquired with the bone conduction sensor applying a pressure F4 of 0.8N at the user's tragus. In fig. 12, the frequency response curves "L1" to "L4" are different from each other. In other words, the bone conduction audio data collected by the bone conduction sensor differs when different pressures are applied to the same part of the user's body.
The bone conduction audio data collected by the bone conduction sensor may differ when the bone conduction sensor applies different pressures to a part of the user's body. For example, the signal strength of the bone conduction audio data acquired by the bone conduction sensor may vary with the applied pressure. As the pressure increases from 0N to 0.8N, the signal strength first increases gradually, then the rate of increase declines, and the signal strength slowly approaches saturation. However, the greater the pressure exerted by the bone conduction sensor on the user's body part, the more uncomfortable the user feels when wearing the device. Thus, as shown in figs. 11 and 12, a bone conduction device for collecting and/or playing audio signals may include a bone conduction sensor for collecting bone conduction audio signals, and the sensor may be configured to be located at one or more parts or positions of the user's body and to apply a clamping force within a certain range to that part when the device is worn. When designing the structure of the bone conduction device, the part of the user's body where the bone conduction sensor is located and/or the clamping force applied to that part may be determined based on one or more factors such as the frequency response curve, signal strength, and user comfort. For example, the bone conduction device may include a bone conduction sensor for collecting bone conduction audio signals; when the device is worn by the user, the bone conduction sensor is in contact with the user's tragus, and the clamping force applied to the tragus is in the range of 0 to 0.8N, such as 0.2N, 0.4N, 0.6N, or 0.8N, so that the signal strength of the collected bone conduction signal is relatively high while the moderate clamping force keeps the user comfortable when wearing the bone conduction device.
Example 4: an exemplary time-frequency plot of spliced audio data.
Fig. 13 is a time-frequency plot of spliced audio data generated by splicing bone conduction audio data and air conduction audio data according to some embodiments of the present application. The bone conduction audio data and the air conduction audio data represent the same speech of the same user, and the air conduction audio data contains noise. Fig. 14 is a time-frequency plot of spliced audio data generated by splicing the bone conduction audio data and pre-processed air conduction audio data according to some embodiments of the present application; the pre-processed air conduction audio data is generated by denoising the air conduction audio data using a wiener filter. Fig. 15 is a time-frequency plot of spliced audio data generated from the bone conduction audio data and another set of pre-processed air conduction audio data according to some embodiments of the present application; this pre-processed air conduction audio data is generated by denoising the air conduction audio data using a spectral subtraction technique. The time-frequency plots of the spliced audio data shown in figs. 13-15 are all generated according to process 800 using the same 2000Hz frequency splicing point. As shown in figs. 13 to 15, the frequency components above 2000Hz in the spliced audio data of fig. 14 (e.g., region M) and fig. 15 (e.g., region N) are less noisy than the frequency components above 2000Hz in the spliced audio data of fig. 13 (e.g., region O), which indicates that spliced audio data generated from noise-reduced air conduction audio data has higher fidelity than spliced audio data generated from air conduction audio data that has not been noise-reduced. The frequency components above 2000Hz in the spliced audio data of fig. 14 differ from those of fig. 15 because different noise reduction techniques were applied to the air conduction audio data. As shown in figs. 14 and 15, the frequency components above 2000Hz in the spliced audio data of fig. 14 (e.g., region M) contain less noise than the frequency components above 2000Hz in the spliced audio data of fig. 15 (e.g., region N).
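The patent gives no code, but the denoise-then-splice flow behind figs. 14 and 15 can be sketched roughly as follows. The spectral-subtraction noise estimate (the first fraction of a second treated as noise-only), the STFT parameters, and the function names are assumptions made purely for illustration, not the patent's implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(air_audio, fs, noise_seconds=0.25, nperseg=512):
    """Crude spectral subtraction: treat the first noise_seconds as noise-only
    and subtract its average magnitude from every frame."""
    _, _, spec = stft(air_audio, fs=fs, nperseg=nperseg)
    hop = nperseg // 2  # default overlap of scipy.signal.stft
    noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_mag = np.abs(spec[:, :noise_frames]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    _, denoised = istft(mag * np.exp(1j * np.angle(spec)), fs=fs, nperseg=nperseg)
    return denoised

def splice(bone_audio, air_audio, fs, split_hz=2000.0):
    """Keep bone conduction content below split_hz and air conduction content above it."""
    n = min(len(bone_audio), len(air_audio))
    bone_spec = np.fft.rfft(bone_audio[:n])
    air_spec = np.fft.rfft(air_audio[:n])
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return np.fft.irfft(np.where(freqs < split_hz, bone_spec, air_spec), n=n)

# Hypothetical usage, with bone_audio and air_audio as equal-rate recordings:
# spliced = splice(bone_audio, spectral_subtract(air_audio, fs=16000), fs=16000)
```

Because only the band above the splicing point is taken from the air conduction signal, any residual noise in that band carries directly into the spliced result, which is why the noise-reduced inputs of figs. 14 and 15 yield cleaner regions M and N than the unprocessed input of fig. 13.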
Example 5: exemplary time-frequency plots of spliced audio data generated according to different frequency thresholds.
Fig. 16 is a time-frequency diagram of bone conduction audio data. Fig. 17 is a time-frequency diagram of the corresponding air conduction audio data. The bone conduction audio data (e.g., the first audio data described in fig. 4) and the air conduction audio data (e.g., the second audio data described in fig. 4) may be collected simultaneously by the bone conduction sensor and the air conduction sensor while the user is speaking. Figs. 18 through 20 are time-frequency plots of spliced audio data generated by splicing the bone conduction audio data and the air conduction audio data according to frequency thresholds (frequency splicing points) of 2000Hz, 3000Hz, and 4000Hz, respectively, according to some embodiments of the present application. Comparing the time-frequency diagrams of the spliced audio data shown in figs. 18 to 20 with the time-frequency diagram of the air conduction audio data shown in fig. 17, the noise in the spliced audio data of figs. 18, 19, and 20 is less than that in the air conduction audio data of fig. 17; the larger the frequency threshold, the less noise remains in the spliced audio data. Comparing the time-frequency diagrams of the spliced audio data shown in figs. 18 to 20 with the time-frequency diagram of the bone conduction audio data shown in fig. 16, the frequency components below 2000Hz, 3000Hz, and 4000Hz in figs. 18 to 20, respectively, are increased relative to the frequency components below 2000Hz, 3000Hz, and 4000Hz in fig. 16.
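As a companion sketch, again not from the patent, the comparison across thresholds in figs. 18 to 20 amounts to generating one spliced signal per candidate splicing point. Here complementary Butterworth low-pass/high-pass filters serve as a time-domain alternative to the FFT masking above; the filter order and the 16 kHz sample rate are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def splice_with_filters(bone_audio, air_audio, fs, split_hz, order=6):
    """Low-pass the bone conduction data and high-pass the air conduction data
    at split_hz, then sum the two bands."""
    n = min(len(bone_audio), len(air_audio))
    lowpass = butter(order, split_hz, btype="lowpass", fs=fs, output="sos")
    highpass = butter(order, split_hz, btype="highpass", fs=fs, output="sos")
    return (sosfiltfilt(lowpass, bone_audio[:n]) +
            sosfiltfilt(highpass, air_audio[:n]))

# Hypothetical usage: one spliced signal per frequency threshold.
# for split_hz in (2000.0, 3000.0, 4000.0):
#     spliced = splice_with_filters(bone_audio, air_audio, fs=16000, split_hz=split_hz)
```

Raising the threshold takes more of the spliced signal from the comparatively noise-free bone conduction data, which is consistent with figs. 18 to 20 showing less residual noise at larger thresholds.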
It should be noted that the above description of various embodiments is provided for illustrative purposes only and is not intended to limit the scope of the present application. Various changes and modifications may be made by those skilled in the art based on the description of the present application. However, such changes and modifications do not depart from the scope of the present application. Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present application. Other variations are also possible within the scope of the present application. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the present application can be viewed as being consistent with the teachings of the present application. Accordingly, the embodiments of the present application are not limited to only those embodiments explicitly described and depicted herein.

Claims (11)

1. An audio signal generation method, comprising:
acquiring first audio data acquired by a bone conduction sensor;
acquiring second audio data acquired by an air conduction sensor, wherein the first audio data and the second audio data represent voice of a user, and the first audio data and the second audio data are respectively composed of different frequency components;
dividing the first audio data and the second audio data into a plurality of segments according to one or more frequency thresholds, wherein each segment of the first audio data corresponds to one segment of the second audio data;
splicing, fusing, and/or combining each of the plurality of segments of the first audio data and the second audio data based on the weights to generate third audio data.
2. The method of claim 1, wherein target audio data representing user speech is determined based on the third audio data, the target audio data having a higher fidelity than the first audio data and the second audio data.
3. The method of claim 2, wherein a post-processing operation is performed on the third audio data to obtain the target audio data, the post-processing operation comprising a noise reduction operation, a domain transformation operation, or a combination thereof.
4. The method of claim 1, wherein the frequency threshold is determined by a process comprising:
determining a noise level associated with the second audio data;
determining at least one of the one or more frequency thresholds based on a noise level associated with the second audio data.
5. The method of claim 4, wherein the noise level associated with the second audio data is represented by a signal-to-noise ratio of the second audio data, and wherein the signal-to-noise ratio of the second audio data is determined by operations comprising:
determining an energy of noise in the second audio data using the bone conduction sensor and the air conduction sensor;
determining an energy of pure audio data in the second audio data based on an energy of the noise in the second audio data;
determining the signal-to-noise ratio based on an energy of the noise in the second audio data and an energy of the pure audio data in the second audio data.
6. The method of claim 4, wherein the greater the noise level associated with the second audio data, the greater at least one of the one or more frequency thresholds.
7. The method of claim 1, wherein at least one of the one or more frequency thresholds is determined based on a frequency response curve associated with the first audio data.
8. The method of claim 1, wherein dividing the first audio data and the second audio data into a plurality of segments according to one or more frequency thresholds comprises:
determining a low frequency portion of the first audio data, the low frequency portion comprising frequency components below a certain frequency threshold of the one or more frequency thresholds;
determining a high frequency portion of the second audio data, the high frequency portion including frequency components above the certain one of the one or more frequency thresholds.
9. An audio signal generation system, comprising:
at least one processor;
executable instructions executable by the at least one processor to cause the system to perform the audio signal generation method of any one of claims 1 to 8.
10. A system for audio signal generation, comprising:
an acquisition module, configured to acquire first audio data acquired by a bone conduction sensor and second audio data acquired by an air conduction sensor, where the first audio data and the second audio data represent voice of a user, and the first audio data and the second audio data are composed of different frequency components, respectively;
a weight determination unit configured to divide the first audio data and the second audio data into a plurality of segments according to one or more frequency thresholds, wherein each segment of the first audio data corresponds to one segment of the second audio data;
a combining unit configured to splice, fuse and/or combine each of the plurality of segments of the first audio data and the second audio data based on the weights to generate third audio data.
11. A non-transitory computer-readable medium storing computer instructions that, when executed, perform the audio signal generation method of any of claims 1-8.
CN202210237943.0A 2019-09-12 2019-09-12 Audio signal generation method and system, and non-transitory computer readable medium Pending CN114822565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210237943.0A CN114822565A (en) 2019-09-12 2019-09-12 Audio signal generation method and system, and non-transitory computer readable medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210237943.0A CN114822565A (en) 2019-09-12 2019-09-12 Audio signal generation method and system, and non-transitory computer readable medium
CN201910864002.8A CN112581970A (en) 2019-09-12 2019-09-12 System and method for audio signal generation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910864002.8A Division CN112581970A (en) 2019-09-12 2019-09-12 System and method for audio signal generation

Publications (1)

Publication Number Publication Date
CN114822565A true CN114822565A (en) 2022-07-29

Family

ID=75109581

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201910864002.8A Pending CN112581970A (en) 2019-09-12 2019-09-12 System and method for audio signal generation
CN202210237943.0A Pending CN114822565A (en) 2019-09-12 2019-09-12 Audio signal generation method and system, and non-transitory computer readable medium
CN202210239104.2A Pending CN114822566A (en) 2019-09-12 2019-09-12 Audio signal generation method and system, and non-transitory computer readable medium

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910864002.8A Pending CN112581970A (en) 2019-09-12 2019-09-12 System and method for audio signal generation

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210239104.2A Pending CN114822566A (en) 2019-09-12 2019-09-12 Audio signal generation method and system, and non-transitory computer readable medium

Country Status (1)

Country Link
CN (3) CN112581970A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731927A (en) * 2021-08-30 2023-03-03 华为技术有限公司 Voice wake-up method, apparatus, device, storage medium and program product

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0630490A (en) * 1992-05-12 1994-02-04 Katsuo Motoi Ear set type transceiver
JP3095214B2 (en) * 1996-06-28 2000-10-03 日本電信電話株式会社 Intercom equipment
JP2014096732A (en) * 2012-11-09 2014-05-22 Oki Electric Ind Co Ltd Voice collection device, and telephone set
JP6123503B2 (en) * 2013-06-07 2017-05-10 富士通株式会社 Audio correction apparatus, audio correction program, and audio correction method
CN109240639A (en) * 2018-08-30 2019-01-18 Oppo广东移动通信有限公司 Acquisition methods, device, storage medium and the terminal of audio data
CN109545193B (en) * 2018-12-18 2023-03-14 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109767783B (en) * 2019-02-15 2021-02-02 深圳市汇顶科技股份有限公司 Voice enhancement method, device, equipment and storage medium
CN109982179B (en) * 2019-04-19 2023-08-11 努比亚技术有限公司 Audio signal output method and device, wearable device and storage medium
CN110136731B (en) * 2019-05-13 2021-12-24 天津大学 Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
WO2021046796A1 (en) * 2019-09-12 2021-03-18 Shenzhen Voxtech Co., Ltd. Systems and methods for audio signal generation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095691A (en) * 2023-10-13 2023-11-21 荣耀终端有限公司 Voice data set construction method, electronic equipment and storage medium
CN117095691B (en) * 2023-10-13 2023-12-19 荣耀终端有限公司 Voice data set construction method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112581970A (en) 2021-03-30
CN114822566A (en) 2022-07-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination