WO2023211443A1 - Transformer-encoded speech extraction and enhancement - Google Patents

Transformer-encoded speech extraction and enhancement

Info

Publication number
WO2023211443A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
frequency components
audio frequency
time
subset
Prior art date
Application number
PCT/US2022/026671
Other languages
French (fr)
Inventor
Yi Zhang
Yuan Lin
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2022/026671
Publication of WO2023211443A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/06 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients

Definitions

  • This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for enhancing an audio signal by applying deep learning techniques in speech extraction.
  • Speech signals are oftentimes mixed with ambient noise that can easily distract a listener’s attention. This makes it difficult to discern the speech signals in crowded or noisy environments.
  • Deep learning techniques have demonstrated significant advantages over conventional signal processing methods for the purposes of improving perceptual quality and intelligibility of the speech signals.
  • Some existing solutions utilize convolutional neural networks (CNNs) or recurrent neural networks (RNNs) to extract speaker embeddings and isolate target speech content.
  • RNNs are a crucial component in speech enhancement systems and have recurrent connections that are essential to learn long-term dependencies and manage speech contexts properly. Nevertheless, the inherently sequential nature of RNNs impairs effective parallelization of computational tasks. This bottleneck is particularly evident when large datasets with long sequences are processed. It would be beneficial to have an effective and efficient mechanism to improve audio quality and reduce audio noise, while keeping a high utilization rate of computational resources and a low power consumption.
  • a transformer-based speech extraction network is applied to isolate a target speaker’s speech from competing speakers or ambient noise in a corrupted environment.
  • the transformer-based speech extraction network includes a dual-path transformer network, and is capable of working with additional domain information (e.g., visual feature) to provide multi-modal solutions for improving signal-to-noise performance.
  • Such a transformer-based speech extraction network eliminates recurrence in RNNs, applies a fully attention-based mechanism, and can avoid the inherently sequential nature of the RNNs completely.
  • When the transformer-based network attends to a sequence of audio data, direct connections are established among both adjacent and remote time frames of each sentence, allowing the transformer-based network to learn long-term dependencies of the audio data easily.
  • the transformer-based speech extraction network supports parallel computation of audio processing tasks, improves audio quality, and reduces audio noise effectively and efficiently.
  • an audio processing method is implemented at an electronic device having memory.
  • the method includes obtaining input audio data having a plurality of audio sentences in a temporal domain and converting the input audio data in the temporal domain to a plurality of audio frequency components in a spectral domain.
  • Each audio sentence includes a plurality of audio time frames.
  • Each time frame has a respective frame position in a respective audio sentence and corresponds to a subset of audio frequency components.
  • the method further includes, for each audio time frame and in accordance with the respective frame position, correlating the subset of audio frequency components of the audio time frame with a set of adjacent audio frequency components and a set of remote audio frequency components converted from the same respective audio sentence, and generating an audio feature including the correlated audio frequency components of each audio time frame.
  • the method further includes generating a plurality of output frequency components from the audio feature and decoding the plurality of output frequency components to output audio data in the temporal domain.
  • the method further includes, for each audio sentence, segmenting corresponding audio frequency components based on a plurality of time chunks, where each time chunk has a number of audio time frames.
  • correlation of each audio time frame further includes correlating the subset of audio frequency components of the respective audio time frame with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components and correlating the correlated subset of audio frequency components with the set of remote audio frequency components to generate the audio feature.
  • correlation of each audio time frame further includes correlating the subset of audio frequency components of the respective audio time frame with the set of remote audio frequency components to generate a correlated subset of audio frequency components and correlating the correlated subset of audio frequency components with the set of adjacent audio frequency components to generate the audio feature.
  • the method further includes repeating the correlation of each audio time frame successively for a plurality of iterations.
  • the method further includes converting the plurality of audio sentences in the temporal domain to a plurality of audio frequency components in a spectral domain.
  • a subset of the plurality of audio frequency components corresponds to a first audio time frame in the plurality of audio time frames.
  • the method further includes, in accordance with a first frame position of the first audio time frame, correlating the subset of audio frequency components corresponding to the first audio time frame with a set of adjacent audio frequency components and a set of remote audio frequency components converted from a first audio sentence including the first audio time frame.
  • the method further includes generating an audio feature of the first audio time frame including the correlated audio frequency components of the first audio time frame, generating a plurality of output frequency components based on at least the audio feature of the first audio time frame, and decoding the plurality of output frequency components to generate output audio data in the temporal domain.
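  • For orientation only, the following Python sketch (using PyTorch, which is an assumption of this illustration rather than part of the disclosure) outlines the described flow: convert the audio to frequency components, correlate the time frames of each sentence, generate output frequency components, and decode them back to the temporal domain. The names and parameters (enhance, correlate_frames, n_fft, hop) are hypothetical placeholders, and the correlation step is an all-pass stand-in.

    import torch

    def correlate_frames(spec: torch.Tensor) -> torch.Tensor:
        # Placeholder for the transformer-based correlation of adjacent and remote
        # time frames; here an all-pass mask so the sketch runs end to end.
        return torch.ones_like(spec.real)

    def enhance(sentences: torch.Tensor, n_fft: int = 512, hop: int = 256) -> torch.Tensor:
        """sentences: (B, samples) waveform, one row per audio sentence."""
        window = torch.hann_window(n_fft)
        # Temporal domain -> spectral domain: each column of `spec` is one audio time frame.
        spec = torch.stft(sentences, n_fft, hop, window=window, return_complex=True)  # (B, F, T)
        # Correlate each time frame with adjacent and remote frames of its sentence,
        # yielding output frequency components (here via a trivial mask).
        out_spec = correlate_frames(spec) * spec
        # Spectral domain -> temporal domain.
        return torch.istft(out_spec, n_fft, hop, window=window, length=sentences.shape[-1])

    if __name__ == "__main__":
        audio = torch.randn(5, 16000)        # 5 sentences of 1 s at 16 kHz
        print(enhance(audio).shape)          # torch.Size([5, 16000])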
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • FIG. 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5A is a block diagram of an example speech extraction network configured to process input audio data, in accordance with some embodiments
  • Figure 5B is a data structure of example input audio data, in accordance with some embodiments.
  • FIG. 6 is a block diagram of an example transformer module (e.g., an intrachunk transformer module, an interchunk transformer module) configured to correlate audio frequency components of a plurality of audio time frames, in accordance with some embodiments.
  • Figure 7 is a block diagram of a speech extraction network trained to process input audio data, in accordance with some embodiments.
  • Figure 8 is a flow diagram of an example audio processing method, in accordance with some embodiments.
  • Figure 9 is a flow diagram of an example audio processing method, in accordance with some embodiments.
  • Various embodiments of this application are directed to speech processing and enhancement.
  • transformers adopt an attention mechanism to correlate time frames of each audio sentence with each other in a spectral domain.
  • input audio data and an auxiliary speech are converted to audio frequency components using a short-time Fourier transform (STFT), while the correlated audio frequency components are converted to output audio data using an inverse STFT (ISTFT).
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network- connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be, or are preferably not, completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and providing the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models (e.g., a speech extraction model and a feature extraction model in Figures 5A and 7) are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102 A) associated with a client device 104 (e.g. the client device 104A and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • a pair of AR glasses 104D are communicatively coupled in the data processing environment 100.
  • the AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D.
  • 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model.
  • Visual content is optionally generated using a second data processing model.
  • Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D.
  • Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
  • FIG. 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 is applied to process content data; and
  • the data processing module 228 includes an audio processing module 229, and the data processing models 240 include a speech extraction model and a signature extraction model.
  • the audio processing module 229 is associated with one of the user applications 224 (e.g., a social media application, a conferencing application, an Internet phone service, and an audio recorder application) to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 is applied to implement an audio processing process 800 in Figure 8.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206, optionally, stores a subset of the modules and data structures identified above.
  • memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is an example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306 consistent with the type of the content data.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
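  • As a generic illustration of such a propagation function (not taken from the disclosure), a node output can be computed as a non-linear activation applied to the weighted combination of the node inputs; the weights, bias, and activation in the sketch below are arbitrary example values.

    import torch

    def node_output(inputs: torch.Tensor, weights: torch.Tensor, bias: float = 0.0) -> torch.Tensor:
        # Non-linear activation applied to the linear weighted combination of the
        # node inputs, plus an optional network bias term b (discussed further below).
        return torch.relu(torch.dot(weights, inputs) + bias)

    x = torch.tensor([0.5, -1.0, 2.0, 0.1])      # node inputs
    w = torch.tensor([0.2, 0.4, -0.3, 0.7])      # weights w1..w4 on the incoming links
    print(node_output(x, w, bias=0.05))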
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • the training process is a process for calibrating all of the weights for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
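  • A minimal training loop illustrating the forward-propagation/backward-propagation cycle and the loss criterion described above; the model, optimizer, data, and stopping threshold are toy placeholders chosen only to keep the sketch self-contained.

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # weights and biases per layer
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    inputs = torch.randn(64, 8)       # toy training data
    targets = torch.randn(64, 1)      # ground truth for each training item

    for step in range(200):
        # Forward propagation: apply the current weights (and bias terms) to the inputs.
        outputs = model(inputs)
        # Measure the margin of error of the output with a loss function.
        loss = loss_fn(outputs, targets)
        # Backward propagation: adjust the weights to decrease the error.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < 1e-3:        # stop once the loss criterion is satisfied
            break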
  • Figure 5A is a block diagram of an example speech extraction network 500 configured to process input audio data 502, in accordance with some embodiments
  • Figure 5B is a data structure 550 of example input audio data 502, in accordance with some embodiments.
  • the speech extraction network 500 is implemented by an audio processing module 229 of an electronic system 200, and configured to receive the input audio data 502 and an auxiliary speech 504 including an audio signal produced by a target speaker and extract output audio data 506 from the input audio data 502 with reference to the auxiliary speech 504.
  • the output audio data 506 has a first signal-to-noise ratio (SNR)
  • the input audio data 502 has a second SNR that is less than the first SNR.
  • the speech extraction network 500 includes a speech encoder 508, a speech extractor 510, a speech decoder 512, and a speaker embedder 514.
  • the speech encoder 508 is configured to obtain the input audio data 502 and auxiliary speech 504 in a temporal domain and convert the input audio data 502 and auxiliary speech 504 to a plurality of audio frequency components 516 and a plurality of auxiliary frequency components 518 in a spectral domain, respectively.
  • the speech extractor 510 is coupled to the speech encoder 508 and configured to correlate a subset of audio frequency components of each audio time frame with a set of adjacent audio frequency components and a set of remote audio frequency components converted from the same respective audio sentence of the input audio data 502 or auxiliary speech 504.
  • An audio feature 520 is generated and includes the correlated audio frequency components of each audio time frame.
  • the speech extractor 510 further converts the audio feature 520 to a plurality of output frequency components 522.
  • the speech decoder 512 is coupled to the speech extractor 510 and configured to decode the plurality of output frequency components 522 to the output audio data 506 in the temporal domain.
  • the speaker embedder 514 is coupled to both the speech encoder 508 and speech extractor 510 and configured to generate a signature feature 524 from the plurality of auxiliary frequency components 518 using a signature extraction model.
  • the speech extractor 510 combines the signature feature 524 with an input audio feature 526 including the plurality of audio frequency components 516 and correlates frequency components of the combined audio feature 528 on an audio time frame basis to generate the audio feature 520. As such, the speech extractor 510 processes the plurality of audio frequency components 516 with reference to the signature feature 524, thereby allowing the output audio data 506 to be generated with the first SNR better than the second SNR of the input audio data 502.
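  • The coupling of the speech encoder 508, speaker embedder 514, speech extractor 510, and speech decoder 512 can be pictured with the simplified Python sketch below. The module internals are stand-ins (e.g., a single linear layer instead of the harmonic-based embedding network and dual transformer module described later), so the fragment only illustrates how a signature feature is combined with the input audio feature and how a mask-based output is decoded; all sizes are assumptions.

    import torch
    from torch import nn

    class SpeechExtractionSketch(nn.Module):
        """Toy stand-in for the encoder 508 / embedder 514 / extractor 510 / decoder 512 wiring."""

        def __init__(self, n_fft: int = 512, hop: int = 256, emb: int = 64):
            super().__init__()
            self.n_fft, self.hop = n_fft, hop
            f = n_fft // 2 + 1
            self.embedder = nn.Linear(f, emb)             # signature feature from auxiliary components
            self.extractor = nn.Sequential(               # placeholder for the dual transformer module
                nn.Linear(f + emb, f), nn.Sigmoid()       # estimates a frequency mask
            )
            self.register_buffer("window", torch.hann_window(n_fft))

        def encode(self, wav: torch.Tensor) -> torch.Tensor:
            spec = torch.stft(wav, self.n_fft, self.hop, window=self.window, return_complex=True)
            return spec.transpose(1, 2)                   # (B, T, F)

        def forward(self, mixture: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
            mix_spec = self.encode(mixture)
            aux_spec = self.encode(auxiliary)
            signature = self.embedder(aux_spec.abs()).mean(dim=1, keepdim=True)        # (B, 1, emb)
            combined = torch.cat([mix_spec.abs(),
                                  signature.expand(-1, mix_spec.shape[1], -1)], dim=-1)
            mask = self.extractor(combined)               # (B, T, F)
            out_spec = (mask * mix_spec).transpose(1, 2)
            return torch.istft(out_spec, self.n_fft, self.hop, window=self.window,
                               length=mixture.shape[-1])

    net = SpeechExtractionSketch()
    print(net(torch.randn(2, 16000), torch.randn(2, 16000)).shape)   # torch.Size([2, 16000])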
  • the input audio data 502 includes a first number (B) of audio sentences 530, and each audio sentence 530 includes a plurality of audio time frames 532 having a respective second number (T) of audio time frames. Each time frame 532 has a respective frame position in a respective audio sentence and corresponds to a subset of audio frequency components 526.
  • Each audio sentence 530 includes a respective third number (S) of time chunks 534, and each time chunk 534 includes a subset of audio time frames 532 having a respective fourth number (K) of time frames.
  • the input audio data 502 includes 5 sentences 530.
  • the first sentence 530A of the input audio data 502 has 10 time chunks 534, and the first time chunk 534A of the first sentence 530A has 20 audio time frames 532 (e.g., 532A and 532B).
  • the auxiliary speech 504 also includes a plurality of audio sentences 530 provided by the target speaker, and each audio sentence of the auxiliary speech 504 is segmented into a plurality of time frames that are grouped into time chunks.
  • each of the third number of time chunks 534 does not overlap with any corresponding neighboring time chunk 534.
  • each time chunk 534 has a temporal overlap with any neighboring time chunk 534.
  • the time chunk 534A has a temporal overlap with its neighboring time chunk 534B, and the temporal overlap is 50% of the time chunk 534A or 534B.
  • the time chunk 534D has a respective temporal overlap with either one of its neighboring time chunks 534C and 534E, and the respective temporal overlap is 50% of the time chunk 534C or 534E.
  • each of the second number of audio time frames 532 does not overlap with any corresponding neighboring audio time frame 532.
  • each audio time frame 532 has a temporal overlap with any neighboring audio time frame 532.
  • the audio time frame 532E has a temporal overlap with its neighboring audio time frame 532F, and the temporal overlap is 50% of the audio time frame 532E or 532F.
  • the audio time frame 532B has a respective temporal overlap with either one of its neighboring audio time frames 532A and 532C, and the respective temporal overlap is 50% of the audio time frame 532A or 532C.
  • each of the second, third, and fourth numbers remains the same for all sentences 530.
  • corresponding temporal lengths of time chunks 534 and time frames 532 are not fixed, and are adjusted based on a length of the respective sentence 530 to get the fixed second, third, and fourth numbers of total audio time frames 532, time chunks 534, and time frames 532 per time chunk 534.
  • each of the second, third, and fourth numbers remains the same for all sentences, so are the temporal lengths of time chunks 534 and time frames 532.
  • Each sentence 530 is filled with a respective blank audio portion at an end of the respective sentence 530 to reach a predefined sentence length, such that the sentence 530 can be segmented into the same third number (S) of time chunks 534 and the same second number (T) of time frames 532 based on fixed temporal lengths of each time chunk 534 and each time frame 532.
  • the speech encoder 508 includes a Fourier transform or a time-to-frequency neural network.
  • the input audio data 502 in the temporal domain is converted to the plurality of audio frequency components 516 in the spectral domain using the Fourier transform or time-to-frequency neural network.
  • the auxiliary speech 504 in the temporal domain may be converted to the plurality of auxiliary frequency components 518 in the spectral domain using the Fourier transform or time-to-frequency neural network.
  • the audio and auxiliary frequency components 516 and 518 are organized into an audio spectrum matrix 516’ and an auxiliary spectrum matrix 518’, respectively.
  • Each of the audio and auxiliary spectrum matrices 516’ and 518’ has three dimensions corresponding to sentences, time frames, and frequencies, respectively.
  • each audio time frame 532 of the input audio data 502 is converted to a predefined number (F) of audio frequency components, and the audio spectrum matrix 516' has B×T×F audio frequency components.
  • Each audio time frame 532 of the auxiliary speech 504 is converted to an auxiliary number (Emb) of auxiliary frequency components 518, and the auxiliary spectrum matrix 518' has B×T×Emb auxiliary frequency components 518.
  • the plurality of audio frequency components 516 are further modified to a plurality of audio frequency components 526 by an input network 536 in the speech extractor 510.
  • the audio frequency components 526 have the same number (e.g., B×T×F) of components as the audio frequency components 516, and are organized into three dimensions corresponding to sentences, time frames, and frequencies, respectively.
  • the speaker embedder 514 extracts speech characteristics of the target speaker by converting the auxiliary frequency components 518 to the auxiliary frequency components 524, e.g., using a CNN layer, an RNN layer, or a harmonic-based speaker embedding network including a two-dimensional convolution layer 538 and a harmonic block 540.
  • the auxiliary frequency components 524 have the same number (e.g., B×T×Emb) of components as the auxiliary frequency components 518, and are organized into three dimensions corresponding to sentences, time frames, and frequencies, respectively. Further, in some embodiments, the speech extractor 510 combines the audio frequency components 526 and auxiliary frequency components 524 to form the combined audio feature 528. For example, the audio frequency components 526 and auxiliary frequency components 524 are concatenated to each other. In some embodiments, the combined audio feature 528 is segmented (538) to a pre-transformer audio feature 540 based on the plurality of time chunks 534 in each sentence 530 of the input audio data 502 and auxiliary speech 504.
  • each audio sentence 530 corresponds to T×F audio frequency components of the audio frequency components 526 combined in the audio feature 528.
  • the T×F audio frequency components are segmented according to the plurality of time chunks 534 in each audio sentence 530, e.g., segmented into S chunks each having K×F audio frequency components.
  • the audio frequency components 526 include B×T×F components in total, and are re-organized into a matrix that has B×S×K×F components in four dimensions corresponding to sentences, chunks, time frames in each time chunk, and frequencies, respectively.
  • the pre-transformer audio feature 540 includes the re-organized audio frequency components 526.
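  • Under the stated organization, the segmentation (538) amounts to reshaping the B×T×F components into chunks. The sketch below uses Tensor.unfold with a 50% chunk overlap (one of the options mentioned above) and blank end-padding; the concrete sizes are illustrative assumptions, not values from the disclosure.

    import torch

    def segment_into_chunks(feature: torch.Tensor, chunk_len: int) -> torch.Tensor:
        """feature: (B, T, F) audio frequency components; returns (B, S, K, F) with 50% overlap."""
        b, t, f = feature.shape
        assert t >= chunk_len
        hop = chunk_len // 2
        # Pad the end of each sentence with a blank portion so it divides into whole chunks.
        pad = (hop - (t - chunk_len) % hop) % hop
        feature = torch.nn.functional.pad(feature, (0, 0, 0, pad))
        # Unfold over the time-frame axis: S chunks of K frames each, stepping by 50%.
        chunks = feature.unfold(dimension=1, size=chunk_len, step=hop)   # (B, S, F, K)
        return chunks.permute(0, 1, 3, 2).contiguous()                   # (B, S, K, F)

    x = torch.randn(5, 200, 257)             # B=5 sentences, T=200 frames, F=257 components
    print(segment_into_chunks(x, 20).shape)  # torch.Size([5, 19, 20, 257])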
  • the speech extractor 510 further includes a dual transformer module 542 including one or more transformer stages.
  • the dual transformer module 542 includes a first transformer stage 542A that receives the pre-transformer audio feature 540 including the frequency components 524 and 526 that are combined and segmented.
  • the adjacent audio frequency components include a first set of audio frequency components 526 corresponding to other audio time frames 532 (e.g., 532B-532F) in the same time chunk 534 (e.g., 534A), and the remote audio frequency components include a second set of audio frequency components 526 distributed among one or more remaining time chunks 534 (e.g., 534B-534E) of the same audio sentence 530 (e.g., 530A).
  • the first transformer stage 542A correlates the subset of audio frequency components 526 of the audio time frame 532 (e.g., 532A) with the first set of audio frequency components using an intrachunk transformer module 544 and with the second set of audio frequency components converted from the same respective audio sentence 530 (e.g., 530A) using an interchunk transformer module 546. More specifically, in the first transformer stage 542A, the intrachunk transformer module 544 and interchunk transformer module 546 are applied on each time chunk 534 and across time chunks 534, respectively. The intrachunk transformer module 544 acts on each time chunk 534 independently, modeling short-term dependencies within the respective time chunk 534.
  • the interchunk transformer module 546 is applied to model transitions across different time chunks 534 and enable effective modeling of long-term dependencies across the different time chunks 534.
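  • One dual-path stage can be sketched as two self-attention passes over the (B, S, K, F) layout from the segmentation sketch above: an intrachunk pass over the K frames inside each chunk, then an interchunk pass over the S chunks at each intra-chunk position. The layer sizes below are illustrative assumptions rather than the disclosed configuration.

    import torch
    from torch import nn

    class DualPathStageSketch(nn.Module):
        def __init__(self, f: int, heads: int = 4):
            super().__init__()
            self.intra = nn.TransformerEncoderLayer(d_model=f, nhead=heads, batch_first=True)
            self.inter = nn.TransformerEncoderLayer(d_model=f, nhead=heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, s, k, f = x.shape
            # Intrachunk: attend among the K time frames of each chunk (short-term dependencies).
            y = self.intra(x.reshape(b * s, k, f)).reshape(b, s, k, f)
            # Interchunk: attend across the S chunks at each intra-chunk position (long-term dependencies).
            y = y.permute(0, 2, 1, 3).reshape(b * k, s, f)
            y = self.inter(y).reshape(b, k, s, f).permute(0, 2, 1, 3)
            return y.contiguous()

    stage = DualPathStageSketch(f=256)
    print(stage(torch.randn(5, 19, 20, 256)).shape)   # torch.Size([5, 19, 20, 256])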
  • the dual transformer module 542 includes a plurality of successive transformer stages 542 that are coupled in series with each other.
  • a second transformer stage 542B is coupled to an output of the first transformer stage 542A.
  • a first audio feature is outputted from the first transformer stage 542A and includes first correlated audio frequency components of each audio time frame 532.
  • the first audio feature is provided to the second transformer stage 542B and correlated using the intrachunk transformer module 544 and interchunk transformer module 546 sequentially.
  • the plurality of successive transformer stages 542 are coupled in series and correlate audio frequency components 526 of each audio time frame 532 repeatedly and successively for a plurality of iterations (e.g., for 2, 3, 4, ..., or 8 iterations).
  • the plurality of output frequency components 522 is generated by converting the audio feature 520 including the correlated audio frequency components to a frequency mask 548 and applying the frequency mask 548 on the plurality of audio frequency components 526 to form the plurality of output frequency components 522.
  • the audio feature 520 results from segmented mask estimation, and is recovered to an enhanced spectrum of the target speech (i.e., the output frequency components 522), e.g., by an overlap-and-add operation and multiplication with a corrupted spectrum.
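  • The recovery step can be sketched as the inverse of the segmentation: overlap-and-add the segmented mask estimate from (B, S, K, F) back to (B, T, F), then multiply it element-wise with the corrupted spectrum to obtain the enhanced spectrum. The shapes and 50% overlap mirror the earlier sketches and are assumptions of this illustration.

    import torch

    def overlap_and_add(chunks: torch.Tensor, hop: int, t: int) -> torch.Tensor:
        """chunks: (B, S, K, F) segmented mask estimate; returns (B, T, F)."""
        b, s, k, f = chunks.shape
        out = torch.zeros(b, (s - 1) * hop + k, f)
        norm = torch.zeros_like(out)
        for i in range(s):
            out[:, i * hop:i * hop + k] += chunks[:, i]
            norm[:, i * hop:i * hop + k] += 1.0
        return (out / norm.clamp(min=1.0))[:, :t]        # average overlaps, trim the blank padding

    mask_chunks = torch.sigmoid(torch.randn(5, 19, 20, 257))    # segmented mask estimation
    mask = overlap_and_add(mask_chunks, hop=10, t=200)          # (5, 200, 257)
    corrupted_spec = torch.randn(5, 200, 257, dtype=torch.cfloat)
    output_components = mask * corrupted_spec                   # enhanced spectrum of the target speech
    print(output_components.shape)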
  • the speech encoder 508, speech extractor 510, speech decoder 512, and speaker embedder 514 process the input audio data 502 jointly and with reference to an auxiliary speech 504 to enhance audio quality of the input audio data 502.
  • the speech encoder 508 is configured to estimate a representation for the input audio data 502. For example, a short-time Fourier transform (STFT) is applied to encode the input audio data 502 and auxiliary speech 504 in the temporal domain to the representation (e.g., frequency components 516 and 518) in the spectral domain.
  • a learnable convolution layer may be applied in the speech encoder 508 to encode the input audio data 502 or auxiliary speech 504.
  • the input audio data 502 includes a corrupted speech of the target speaker
  • the speech encoder 508 encodes the auxiliary and corrupted speeches in a weighted manner.
  • the speech extractor 510 estimates the frequency mask 548 to extract the output frequency components 522 including a target speech in the mixture, depending on information from the auxiliary speech 504.
  • the speech decoder 512 is configured to reconstruct the output audio data 506 in the temporal domain from the output frequency components 522.
  • an enhanced target speech magnitude spectrogram is used to reconstruct the time-domain enhanced speech via an inverse STFT or learnable transpose convolution layers.
  • the speaker embedder 514 is configured to convert auxiliary utterances into the speaker embedding 524, which contains voiceprint information of the target speaker and leads the speech extraction network 500 to direct attention to the target speaker.
  • FIG. 6 is a block diagram of an example transformer module 600 (e.g., an intrachunk transformer module 544, an interchunk transformer module 546) configured to correlate audio frequency components 610 of a plurality of audio time frames 532, in accordance with some embodiments.
  • the plurality of audio time frames 532 are optionally segmented from the input audio data 502 and auxiliary speech 504.
  • the transformer module 600 includes a positional encoder 602 and one or more transformer encoders 604.
  • Each audio time frame 532 corresponds to a subset of audio frequency components 610.
  • the subset of audio frequency components 610 are converted from an audio signal of the respective audio time frame 532 in the input audio data 502 or auxiliary speech 504 and may be further processed using the input network 536 or speaker embedder 514.
  • the positional encoder 602 associates the subset of audio frequency components 610 of each audio time frame 532 with a temporal position of the respective audio time frame 532, i.e., combines the subset of audio frequency components 610 of each audio time frame 532 with a respective weight determined from its temporal position in a corresponding sentence 530.
  • the one or more transformer encoders 604 includes a single transformer encoder 604.
  • the transformer encoder 604 includes a multihead attention module 606 configured to combine the subset of weighted audio frequency components 612 of each audio time frame 532 to generate a plurality of correlated frequency components 614 for the plurality of audio time frames 532.
  • the transformer encoder 604 further includes a feedforward module 608 configured to update the plurality of correlated frequency components 614 to a plurality of correlated frequency components 616, which are outputted from the transformer module 600.
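  • The transformer module 600 can be approximated with a sinusoidal positional encoding followed by stacked encoder blocks, each bundling multihead attention and a feedforward network; in the sketch below, nn.TransformerEncoderLayer plays the roles of modules 606 and 608 together, and the sizes are illustrative assumptions only.

    import math
    import torch
    from torch import nn

    class TransformerModuleSketch(nn.Module):
        def __init__(self, f: int = 256, heads: int = 4, layers: int = 2, max_len: int = 1000):
            super().__init__()
            # Positional encoder 602: weight each frame's components by its temporal position.
            pe = torch.zeros(max_len, f)
            pos = torch.arange(max_len).unsqueeze(1).float()
            div = torch.exp(torch.arange(0, f, 2).float() * (-math.log(10000.0) / f))
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            self.register_buffer("pe", pe)
            # Transformer encoders 604: multihead attention 606 + feedforward 608, coupled in series.
            layer = nn.TransformerEncoderLayer(d_model=f, nhead=heads, batch_first=True)
            self.encoders = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            """frames: (batch, T, F) audio frequency components per time frame."""
            weighted = frames + self.pe[: frames.shape[1]]       # weighted components 612
            return self.encoders(weighted)                       # correlated components 616

    module = TransformerModuleSketch()
    print(module(torch.randn(5, 20, 256)).shape)    # torch.Size([5, 20, 256])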
  • the one or more transformer encoders 604 include two or more transformer encoders that are coupled in series.
  • Each transformer encoder 604 includes at least a multihead attention module 606.
  • a first multihead attention module 606 receives the weighted audio frequency components 612 of each audio time frame 534 to generate a plurality of correlated frequency components 614 for the plurality of audio time frames 534.
  • An intermediate multihead attention module 606 receives the correlated frequency components 614 outputted by an immediately preceding multihead attention module 606 to generate corresponding correlated frequency components 614 for the plurality of audio time frames 534.
  • a last multihead attention module 606 receives the correlated frequency components 614 outputted by the immediately preceding multihead attention module 606 to generate the corresponding correlated frequency components 614 for the plurality of audio time frames 534 to be outputted by the transformer module 600.
  • each transformer encoder 604 includes a feedforward module 608 coupled to the multihead attention module 606.
  • a first transformer encoder 604 receives the weighted audio frequency components 612 of each audio time frame 534 to generate a plurality of correlated frequency components 616 for the plurality of audio time frames 534.
  • An intermediate transformer encoder 604 receives the correlated frequency components 616 outputted by an immediately preceding transformer encoder 604 to generate corresponding correlated frequency components 616 for the plurality of audio time frames 534.
  • a last transformer encoder 604 receives the correlated frequency components 616 outputted by the immediately preceding transformer encoder 604 to generate the corresponding correlated frequency components 616 for the plurality of audio time frames 534 to be outputted by the transformer module 600.
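  • A compact sketch of such a transformer module — a positional encoder followed by one or more transformer encoders coupled in series — is given below, assuming PyTorch's built-in encoder layer as a stand-in for the multihead attention and feedforward pair; the dimensions are placeholders.

```python
# Sketch of a transformer module: sinusoidal positional encoding plus N encoder
# layers (multihead attention + feedforward) applied in series. Sizes are
# placeholders, not the dimensions used by the described network.
import math
import torch
import torch.nn as nn

class TransformerModule(nn.Module):
    def __init__(self, dim=256, heads=8, num_layers=2, max_len=1000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)   # one weight pattern per temporal position
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                    # x: (batch, frames, dim)
        x = x + self.pe[: x.size(1)]         # weight each frame by its position
        return self.encoders(x)              # correlated frequency components
```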
  • the plurality of time chunks 534 include an ordered sequence of successive time chunks 534 in the same sentence 530, and each time chunk 534 includes a series of audio time frames 532.
  • Each audio time frame 532 of a time chunk 534 (e.g., time chunk 534A) has an intrachunk position in the time chunk 534 and corresponds to a subset of audio frequency components 610 (e.g., components 528).
  • the audio frequency components 610 of each audio time frame 532 (e.g., frame 532A) of the time chunk 534 (e.g., time chunk 534A) are weighted based on the intrachunk position of the respective audio time frame 532 (e.g., frame 532A) in the respective time chunk 534 (e.g., time chunk 534A).
  • each of the subset of audio frequency components 610 is weighted based on the same intrachunk position of the respective audio time frame 532 (e.g., frame 532A) in the respective time chunk 534 (e.g., time chunk 534A).
  • Corresponding adjacent audio frequency components 610 are weighted based on intrachunk positions of the other audio time frames 532 (e.g., frames 532B-532F) in the same respective time chunk 534.
  • the plurality of time chunks 534 includes an ordered sequence of successive time chunks (e.g., 534A-534E) in the same sentence 530, and each time chunk 534 includes a series of audio time frames 532 in a respective audio sentence 530.
  • a set of audio time frames 532 includes a single audio time frame 532 (e.g., a fifth audio time frame 532A in the time chunk 534A) from a respective one of the successive time chunks 534.
  • Each of the set of audio time frames 532 has an interchunk position (e.g., in the second time chunk 534B) in the respective audio sentence 530 and corresponds to a subset of audio frequency components 610 (e.g., components 528).
  • the subset of audio frequency components 610 of an audio time frame 532 (e.g., frame 532A) is weighted based on the interchunk position of the respective time chunk 534 (e.g., time chunk 534A) in the respective audio sentence 530.
  • the audio time frames 532 in this set of audio time frames 532 are distributed in different time chunks 534 of the same sentence 530, thereby being remote to one another.
  • Corresponding remote audio frequency components 610 are weighted based on interchunk positions of the other audio time frames 532A’ in this set of audio time frames 532. Stated another way, corresponding remote audio frequency components 610 are weighted based on interchunk positions of audio time frames 532A’ of one or more remaining time chunks 534 (e.g., time chunks 534B-534E) in the respective audio sentence 530.
  • a first time chunk 534A of a first audio sentence 530A has a first audio time frame 532A.
  • the first audio time frame 532A corresponds to a subset of first audio frequency components 610A.
  • the subset of first audio frequency components 610A of the first audio time frame 532A is correlated to (1) a set of adjacent frequency components including the subset of audio frequency components 610A of each of other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A and (2) a set of remote frequency components including the subset of audio frequency components 610B of a remaining audio time frame 532A’ in each of one or more remaining time chunks 534 (e.g., 534B-534E) in the first audio sentence 530A.
  • the subset of first audio frequency components 610A of the first audio time frame 532A is weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A.
  • the subset of audio frequency components of each of the other audio time frames 532 (e.g., 532B) is weighted based on an intrachunk position of each of the other audio time frames 532 in the first time chunk 534A.
  • the remaining audio time frame 532A’ is included in each of one or more remaining time chunks 534 (e.g., 534B-534E) in the first audio sentence 530A.
  • the remaining audio time frame 532A’ has a remaining position in each remaining time chunk 534 (e.g., 534B-534E), and the first audio time frame 532A has a first position in the first time chunk 534A.
  • the first position of the first audio time frame 532A is the same as the remaining position of the remaining audio time frame 532A’ in their respective time chunks.
  • the subset of first audio frequency components 610A of the first audio time frame 532A is weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530A.
  • the subset of audio frequency components 610B of the remaining audio time frame 532A’ in each of the one or more remaining time chunks 534 is weighted based on an interchunk position of each of the one or more remaining time chunks 534 (e.g., 534B-534E) in the first audio sentence 530A.
  • the intrachunk transformer module 544 of each transformer stage 542 is configured to correlate the subset of audio frequency components of the first audio time frame 532A with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components.
  • the interchunk transformer module 546 is coupled to the intrachunk transformer module 544 and configured to correlate the correlated subset of audio frequency components with the set of remote audio frequency components of the remaining audio time frames 532A’ in each time chunk 534B-534E to generate the audio feature 520.
  • the interchunk transformer module 546 of each transformer stage 542 is configured to correlate the subset of audio frequency components of the first audio time frame 532A with the set of remote audio frequency components to generate a correlated subset of audio frequency components.
  • the intrachunk transformer module 544 is coupled to the interchunk transformer module 546 and configured to correlate the correlated subset of audio frequency components with the set of adjacent audio frequency components of the other audio time frames 532 (e.g., 532B-532F) of the same time chunk 534A to generate the audio feature 520.
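  • One possible arrangement of these two modules over a chunked spectrogram is sketched below; the chunking, axis swapping, and layer sizes are assumptions, and the built-in encoder layer stands in for the intrachunk and interchunk transformer modules 544 and 546.

```python
# Sketch of one dual-path stage: an intrachunk pass correlates the frames inside
# each chunk (adjacent frames), then an interchunk pass correlates same-position
# frames across chunks (remote frames). Shapes and sizes are illustrative;
# intrachunk/interchunk positional encodings would be added as in the
# transformer-module sketch above.
import torch
import torch.nn as nn

class DualPathStage(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x):
        # x: (batch, chunks, frames_per_chunk, dim) for one audio sentence
        b, c, f, d = x.shape
        x = self.intra(x.reshape(b * c, f, d)).reshape(b, c, f, d)   # within each chunk
        x = x.transpose(1, 2)                                        # group equal intrachunk positions
        x = self.inter(x.reshape(b * f, c, d)).reshape(b, f, c, d)   # across chunks of the sentence
        return x.transpose(1, 2)                                     # back to (batch, chunks, frames, dim)
```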
  • FIG. 7 is a block diagram of a speech extraction network 700 trained to process input audio data 502, in accordance with some embodiments.
  • the speech extraction network 700 is optionally trained by a model training module 226 of a server 102 and/or a client device 104.
  • the speech extraction network 700 is trained at the server 102, and sent to the client device 104 to generate the output audio data 506 from the input audio data 502.
  • the speech extraction network 700 is trained at the server 102, which receives the input audio data 502 from the client device 104 and generates the output audio data 506 from the input audio data 502 to be returned to the client device 104.
  • the speech extraction network 700 is trained at the client device 104.
  • the client device 104 further generates the output audio data 506 from the input audio data 502.
  • the speech extraction network 700 obtains a signature feature 524 corresponding to a target sound of the input audio data 502.
  • the speech encoder 508 receives an auxiliary speech 504 of the target sound (e.g., voice of a target speaker) and generates a plurality of auxiliary frequency components 518.
  • the speech embedder 514 is configured to generate a signature feature 524 from the plurality of auxiliary frequency components 518 using a signature extraction model.
  • the auxiliary speech 504 includes a plurality of signature audio sentences.
  • the signature feature 524 includes a plurality of signature frequency components extracted from each of a plurality of signature time frames of the signature audio sentences.
  • the speech extractor 510 modifies the plurality of audio frequency components 526 based on the signature feature 524, e.g., combines the plurality of audio frequency components 526 and the plurality of signature frequency components of the signature feature 524.
  • the speech extractor 510 correlates a subset of audio frequency components of each audio time frame 532 with adjacent and remote audio frequency components within each of the audio sentences 530 in both the input audio data 502 and auxiliary speech 504.
  • the signature feature 524 is extracted using the signature extraction model.
  • the audio feature 520 is generated from the plurality of audio frequency components 516 in the spectral domain using a speech extraction model.
  • the signature extraction model and the speech extraction model are trained end-to-end based on a combination of a mean squared error (MSE) loss 702 and a cross-entropy loss 704.
  • MSE loss 702 indicates a difference between test audio output data and corresponding ground truth audio data.
  • the cross-entropy loss 704 indicates a quality of the signature feature 524.
  • a linear network 706 and a softmax network 708 are used to generate an audio feature mask 710 from an output of the speech embedder 514, and the audio feature mask 710 is converted to the cross-entropy loss 704.
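  • A sketch of this joint objective is given below, assuming the cross-entropy term comes from a speaker-classification head (linear layer plus softmax) attached to the speaker embedding; the loss weighting and layer sizes are assumptions.

```python
# Sketch of the end-to-end objective: MSE between enhanced and ground-truth
# audio, plus a cross-entropy term on a speaker-ID head over the embedding.
# The 0.5 weight and the head dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_speakers, emb_dim = 100, 256
speaker_head = nn.Linear(emb_dim, num_speakers)   # softmax is folded into cross_entropy

def joint_loss(enhanced, clean, speaker_embedding, speaker_id, ce_weight=0.5):
    mse = F.mse_loss(enhanced, clean)          # reconstruction quality
    logits = speaker_head(speaker_embedding)   # (batch, num_speakers)
    ce = F.cross_entropy(logits, speaker_id)   # quality of the embedding
    return mse + ce_weight * ce
```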
  • FIG. 8 is a flow diagram of an example audio processing method 800, in accordance with some embodiments.
  • the audio processing method 800 is described as being implemented by an electronic system 200 (e.g., including a mobile phone 104C).
  • Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic system 200 obtains (802) input audio data having a plurality of audio sentences 530 in a temporal domain and converts (804) the input audio data 502 in the temporal domain to a plurality of audio frequency components (e.g., audio frequency components 516 and 526) in a spectral domain.
  • Each audio sentence 530 includes (806) a plurality of audio time frames 532, and each time frame 532 has (808) a respective frame position in a respective audio sentence 530 and corresponds to a subset of audio frequency components.
  • For each audio time frame 532 and in accordance with the respective frame position (810), the electronic system 200 correlates (812) the subset of audio frequency components of the audio time frame 532 with a set of adjacent audio frequency components and a set of remote audio frequency components converted from the same respective audio sentence 530, thereby generating (814) an audio feature 520 including the correlated audio frequency components of each audio time frame 532. The electronic system 200 generates (816) a plurality of output frequency components 522 from the audio feature 520 and decodes (818) the plurality of output frequency components 522 to output audio data 506 in the temporal domain.
  • the electronic system 200 segments (820) corresponding audio frequency components based on a plurality of time chunks 534.
  • Each time chunk 534 has a number (X) of audio time frames 532.
  • the respective audio time frame 532 corresponds to a respective time chunk 534 in a corresponding audio sentence 530.
  • the set of adjacent audio frequency components include a first set of audio frequency components corresponding to other audio time frames 532B-532F in the respective time chunk 534A
  • the set of remote audio frequency components include a second set of audio frequency components distributed among one or more remaining time chunks 534 (e.g., 534B-534E) of the corresponding audio sentence 530.
  • each of the subset of audio frequency components of the respective audio time frame 532 (e.g., 532A) is weighted based on an intrachunk position of the respective audio time frame 532 in the respective time chunk 534 (e.g., 534A), and the set of adjacent audio frequency components are weighted based on intrachunk positions of the other audio time frames 532 (e.g., 532B-532F) in the respective time chunk 534 (e.g., 534A).
  • each of the subset of audio frequency components of the respective audio time frame 532 (e.g., 532A) is weighted based on an interchunk position of the respective time chunk 534 (e.g., 534A) in the respective audio sentence 530, and the set of remote audio frequency components are weighted based on interchunk positions of the one or more remaining time chunks 534 (e.g., 534B-534E) in the respective audio sentence 530.
  • a first audio time frame 532A corresponds to a subset of first audio frequency components
  • the subset of first audio frequency components of the first audio time frame 532A is correlated to (1) the subset of audio frequency components of each of other audio time frames 532B-532F in the first time chunk 534A and (2) the subset of audio frequency components of a remaining audio time frame 532A’ in each of one or more remaining time chunks 534B-534E in the respective audio sentence 530.
  • the subset of first audio frequency components of the first audio time frame 532A is weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A.
  • the subset of audio frequency components of each of the other audio time frames 532B-532F is weighted based on an intrachunk position of each of the other audio time frames 532B-532F in the first time chunk 534A.
  • the remaining audio time frame 532A’ has a remaining position in each remaining time chunk (e.g., 534B-534E).
  • the first audio time frame 532A has a first position in the first time chunk 534A, and the first position is the same as the remaining position.
  • the subset of first audio frequency components of the first audio time frame 532A is weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530
  • the subset of audio frequency components of the remaining audio time frame 532A’ in each of the one or more remaining time chunks is weighted based on an interchunk position of each of the one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530.
  • correlation of each audio time frame 532 further includes correlating (822) the subset of audio frequency components of the respective audio time frame 532 with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components and correlating (824) the correlated subset of audio frequency components with the set of remote audio frequency components to generate the audio feature 520.
  • correlation of each audio time frame 532 further includes correlating the subset of audio frequency components of the respective audio time frame 532 with the set of remote audio frequency components to generate a correlated subset of audio frequency components, and correlating the correlated subset of audio frequency components with the set of adjacent audio frequency components to generate the audio feature 520.
  • correlation of each audio time frame 532 of each sentence 530 is repeated successively for a plurality of iterations.
  • the plurality of output frequency components 522 are generated by converting the audio feature 520 including the correlated audio frequency components to a frequency mask 548 and applying the frequency mask 548 on the plurality of audio frequency components 526 to form the plurality of output frequency components 522.
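  • A minimal sketch of that masking step follows, assuming a sigmoid-bounded mask predicted from the audio feature by a linear layer; both the layer and the sizes are placeholders.

```python
# Sketch of converting the audio feature to a frequency mask and applying it to
# the input frequency components; the linear layer and sigmoid bound are assumed.
import torch
import torch.nn as nn

feature_dim, freq_bins = 256, 257
to_mask = nn.Sequential(nn.Linear(feature_dim, freq_bins), nn.Sigmoid())

def apply_mask(audio_feature, mixture_magnitude):
    # audio_feature: (frames, feature_dim); mixture_magnitude: (frames, freq_bins)
    mask = to_mask(audio_feature)        # frequency mask in [0, 1]
    return mask * mixture_magnitude      # output frequency components
```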
  • the electronic system 200 obtains a signature feature 524 corresponding to a target sound of the input audio data 502 and modifies the plurality of audio frequency components 526 based on the signature feature prior to correlating the subset of audio frequency components 526 of the audio time frame 532 with the adjacent and remote audio frequency components.
  • the signature feature 524 includes a plurality of signature frequency components extracted from each of a plurality of signature time frames of a plurality of signature audio sentences in the auxiliary speech 504. Further, in some embodiments, the signature feature 524 is extracted using a signature extraction model.
  • the audio feature 520 is generated from the plurality of audio frequency components 526 in the spectral domain using a speech extraction model.
  • the signature extraction model and the speech extraction model are trained end-to-end based on a combination of a mean squared error (MSE) loss and a cross-entropy loss.
  • the MSE loss indicates a difference between test audio output data and corresponding ground truth audio data
  • the cross-entropy loss indicates a quality of the signature feature.
  • the input audio data 502 in the temporal domain is converted to the plurality of audio frequency components 516 or 526 in the spectral domain using a Fourier transform or using a time-to-frequency neural network.
  • the output audio data 506 has a first signal-to-noise ratio (SNR), and the input audio data 502 has a second SNR that is less than the first SNR.
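  • As a concrete check of that relationship, the SNR of the input and output audio data can be compared against a clean reference where one is available (e.g., on test data); the routine below is a generic illustration, not part of the described network.

```python
# SNR in dB relative to a clean reference, used only to compare input and output
# audio data on material where the clean target speech is known.
import torch

def snr_db(estimate, clean, eps=1e-8):
    noise = estimate - clean
    return 10 * torch.log10(clean.pow(2).sum() / (noise.pow(2).sum() + eps))

# Expected per the item above: snr_db(output_audio, clean) > snr_db(input_audio, clean)
```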
  • the electronic system 200 executes one of a social media application, a conferencing application, an Internet phone service, and an audio recorder application for implementing the method 800. In other words, the audio processing method 800 can be applied to both audio and video communications, e.g., phone calls, video conferences, live video streaming, and similar applications. With an early enrollment (one or two sentences from the target speaker), which is accessible to phone users, the method 800 can work robustly in heavily corrupted scenarios. Furthermore, this method 800 can work together with other domain features (e.g., visual information) to enhance performance of an application involving audio or video communication.
  • Speech signals are oftentimes mixed with ambient noise that can easily distract a listener’s attention. This makes it difficult to discern the speech signals in crowded or noisy environments.
  • Deep learning techniques have demonstrated significant advantages over conventional signal processing methods for the purposes of improving perceptual quality and intelligibility of speech signals.
  • a speaker encoder includes a 3-layer long short-term memory (LSTM) network, which receives log-mel filter bank energies as input and outputs 256-dimensional speaker embeddings. Average pooling can be used after the last layer to convert frame-wise features to utterance-wise vectors.
  • An i-vector extractor has also been applied with a variability matrix. After being trained with features of 19 mel frequency cepstral coefficients (MFCCs), energy, and their first and second derivatives, the extractor can output a 60-dimensional i-vector of a target speaker. Further, a block structure consists of two CNNs with a kernel size of 1×1 and a 1-D max-pooling layer, and the 1-D max-pooling layer efficiently addresses a silent frame issue raised by average pooling. Particularly, the method 800 captures both short-term and long-term dependencies of a speech spectrum by applying transformer blocks on each time chunk 534 of a sentence 530 as well as across time chunks in speech extraction. These transformer blocks could eliminate recurrence, replace it with a fully attention-based mechanism, and implement computation tasks in parallel.
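  • A sketch of the 3-layer LSTM speaker encoder mentioned a few items above is shown below, assuming 40-band log-mel filter bank energies as input; average pooling converts the frame-wise outputs into a single 256-dimensional utterance-wise embedding, and the mel dimension is an assumption.

```python
# Sketch of a 3-layer LSTM speaker encoder: log-mel filter bank energies in,
# a 256-dimensional utterance-level speaker embedding out. The 40-bin mel
# input size is an assumption.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=40, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, num_layers=3, batch_first=True)

    def forward(self, log_mel):           # (batch, frames, n_mels)
        frames, _ = self.lstm(log_mel)    # frame-wise features from the last layer
        return frames.mean(dim=1)         # average pooling -> utterance-wise vector
```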
  • FIG. 9 is a flow diagram of another example audio processing method 900, in accordance with some embodiments.
  • the audio processing method 900 is described as being implemented by an electronic system 200 (e.g., including a mobile phone 104C).
  • Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic system 200 obtains (902) input audio data 502.
  • the input audio data 502 includes a plurality of audio sentences 530 in a temporal domain, and each of the plurality of audio sentences 530 includes a plurality of audio time frames 532 each having a frame position.
  • the electronic system 200 converts (904) the plurality of audio sentences 530 in the temporal domain to a plurality of audio frequency components in a spectral domain. A subset of the plurality of audio frequency components corresponds to a first audio time frame 532A in the plurality of audio time frames 532.
  • the electronic system correlates (906) the subset of audio frequency components corresponding to the first audio time frame 532A with a set of adjacent audio frequency components and a set of remote audio frequency components converted from a first audio sentence 530A including the first audio time frame 532A.
  • the electronic system generates (908) an audio feature 520 of the first audio time frame 532A including the correlated audio frequency components of the first audio time frame 532A, generates (910) a plurality of output frequency components 522 based on at least the audio feature of the first audio time frame 532A, and decodes (912) the plurality of output frequency components 522 to generate output audio data 506 in the temporal domain.
  • the electronic system 200 segments audio frequency components corresponding to the first audio sentence 530A into a plurality of time chunks 534.
  • Each of the plurality of time chunks 534 includes a number of audio time frames 532.
  • the first audio time frame 532A corresponds to a first time chunk 534A in the first audio sentence 530A.
  • the set of adjacent audio frequency components include a first set of audio frequency components corresponding to other audio time frames 532B-532F in the first time chunk 534A
  • the set of remote audio frequency components include a second set of audio frequency components corresponding to audio time frames among one or more remaining time chunks 534 (e.g., 534B-534E) of the first audio sentence 530A.
  • each of the subset of audio frequency components corresponding to the first audio time frame 532A is weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A, and the set of adjacent audio frequency components are weighted based on intrachunk positions of the other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A.
  • each of the subset of audio frequency components of the first audio time frame 532A is weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530A, and the set of remote audio frequency components are weighted based on interchunk positions of the one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530A.
  • the subset of audio frequency components of the first audio time frame 532A are correlated with a subset of audio frequency components corresponding to each of other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A and a subset of audio frequency components corresponding to a remaining audio time frame in each of one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530A.
  • the subset of audio frequency components of the first audio time frame 532A is weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A, and the subset of audio frequency components of each of the other audio time frames 532 are weighted based on an intrachunk position of each of the other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A.
  • the remaining audio time frame has a remaining position in each of the one or more remaining time chunks (e.g., 534B-534E), and the first audio time frame 532A has a first position in the first time chunk 534A, the first position being the same as the remaining position.
  • the subset of first audio frequency components of the first audio time frame 532A are weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530A
  • the subset of audio frequency components of the remaining audio time frame in each of the one or more remaining time chunks are weighted based on an interchunk position of each of the one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530A.
  • the electronic system 200 correlates the subset of audio frequency components corresponding to the first audio time frame 532 A with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components.
  • the electronic system 200 correlates the correlated subset of audio frequency components with the set of remote audio frequency components to generate the audio feature.
  • the electronic system 200 correlates the subset of audio frequency components corresponding to the first audio time frame 532A with the set of remote audio frequency components to generate a correlated subset of audio frequency components and correlates the correlated subset of audio frequency components with the set of adjacent audio frequency components to generate the audio feature.
  • the electronic system 200 repeats, successively for a plurality of iterations, correlating the subset of audio frequency components corresponding to the first audio time frame 532A with the set of adjacent audio frequency components and the set of remote audio frequency components.
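  • Repeating the correlation for several iterations amounts to stacking dual-path stages, as sketched below (reusing the DualPathStage class from the earlier sketch); the number of stages is a placeholder.

```python
# Sketch of successive correlation iterations: the same intrachunk/interchunk
# processing is applied several times by stacking dual-path stages.
import torch.nn as nn

class RepeatedCorrelation(nn.Module):
    def __init__(self, dim=256, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(DualPathStage(dim) for _ in range(num_stages))

    def forward(self, x):                 # (batch, chunks, frames, dim)
        for stage in self.stages:         # one correlation iteration per stage
            x = stage(x)
        return x
```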
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An electronic device obtains input audio data having a plurality of audio sentences in a temporal domain and converts the input audio data to a plurality of audio frequency components in a spectral domain. Each audio sentence includes a plurality of audio time frames each having a respective frame position. For a first audio time frame and in accordance with a first frame position, a subset of audio frequency components are correlated with a set of adjacent audio frequency components and a set of remote audio frequency components converted from a first audio sentence including the first audio time frame. An audio feature is generated to include the correlated audio frequency components of the first audio time frame. A plurality of output frequency components are generated from the audio feature and decoded to output audio data in the temporal domain.

Description

Transformer-Encoded Speech Extraction and Enhancement
TECHNICAL FIELD
[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for enhancing an audio signal by applying deep learning techniques in speech extraction.
BACKGROUND
[0002] Speech signals are oftentimes mixed with ambient noise that can easily distract a listener’s attention. This makes it difficult to discern the speech signals in crowded or noisy environments. Deep learning techniques have demonstrated significant advantages over conventional signal processing methods for the purposes of improving perceptual quality and intelligibility of the speech signals. Some existing solutions utilize convolutional neural networks (CNNs) or recursive neural networks (RNNs) to extract speaker embeddings and isolate target speech content. RNNs are a crucial component in speech enhancement systems and have recurrent connections that are essential to learn long-term dependencies and manage speech contexts properly. Nevertheless, an inherently sequential nature of RNNs impairs an effective parallelization of computational tasks. This bottleneck is particularly evident when large datasets with long sequences are processed. It would be beneficial to have an effective and efficient mechanism to improve audio quality and reduce audio noise, while keeping a high utilization rate of computational resources and a low power consumption.
SUMMARY
[0003] Various embodiments of this application are directed to enhancing audio quality by establishing correlations among audio frames of each sentence in a spectral domain. A transformer-based speech extraction network is applied to isolate a target speaker’s speech from competing speakers or ambient noise in a corrupted environment. The transformer-based speech extraction network includes a dual-path transformer network, and is capable of working with additional domain information (e.g., visual feature) to provide multi-modal solutions for improving signal-to-noise performance. Such a transformer-based speech extraction network eliminates recurrence in RNNs and applies a fully attention-based mechanism, and can avoid an inherently sequential nature of the RNNs completely. When the transformer-based network attends to a sequence of audio data, direct connections are established among both adjacent and remote time frames of each sentence to allow the transformer-based network to learn long-term dependencies of the audio data easily. By these means, the transformer-based speech extraction network supports parallel computation of audio processing tasks, improves audio quality, and reduces audio noise effectively and efficiently.
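As a generic illustration of such a fully attention-based mechanism (not the specific architecture described in this application), scaled dot-product attention over all time frames of a sentence lets every frame attend directly to both adjacent and remote frames, with no recurrent state carried step by step; the projection matrices below are placeholders.

```python
# Generic scaled dot-product attention over the T time frames of one sentence:
# every frame is connected directly to every other frame, near or remote.
import math
import torch

def attend(frames, w_q, w_k, w_v):
    # frames: (T, d) spectral features; w_q, w_k, w_v: (d, d) projection matrices
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))   # (T, T) frame-to-frame links
    return torch.softmax(scores, dim=-1) @ v                 # correlated frames
```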
[0004] In one aspect, an audio processing method is implemented at an electronic device having memory. The method includes obtaining input audio data having a plurality of audio sentences in a temporal domain and converting the input audio data in the temporal domain to a plurality of audio frequency components in a spectral domain. Each audio sentence includes a plurality of audio time frames. Each time frame has a respective frame position in a respective audio sentence and corresponds to a subset of audio frequency components. The method further includes for each audio time frame and in accordance with the respective frame position, correlating the subset of audio frequency components of the audio time frame with a set of adjacent audio frequency components and a set of remote audio frequency components converted from the same respective audio sentence and generating an audio feature including the correlated audio frequency components of each audio time frame. The method further includes generating a plurality of output frequency components from the audio feature and decoding the plurality of output frequency components to output audio data in the temporal domain.
[0005] In some embodiments, the method further includes for each audio sentence, segmenting corresponding audio frequency components based on a plurality of time chunks, and each time chunk has a number of audio time frames. In some embodiments, correlation of each audio time frame further includes correlating the subset of audio frequency components of the respective audio time frame with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components and correlating the correlated subset of audio frequency components with the set of remote audio frequency components to generate the audio feature. Alternatively, in some embodiments, correlation of each audio time frame further includes correlating the subset of audio frequency components of the respective audio time frame with the set of remote audio frequency components to generate a correlated subset of audio frequency components and correlating the correlated subset of audio frequency components with the set of adjacent audio frequency components to generate the audio feature. In some embodiments, the correlation of each audio time frame is repeated successively for a plurality of iterations.
[0006] In one aspect, an audio processing method is implemented by an electronic device having memory. The method includes obtaining input audio data. The input audio data includes a plurality of audio sentences in a temporal domain, and each of the plurality of audio sentences includes a plurality of audio time frames each having a frame position. The method further includes converting the plurality of audio sentences in the temporal domain to a plurality of audio frequency components in a spectral domain. A subset of the plurality of audio frequency components corresponds to a first audio time frame in the plurality of audio time frames. The method further includes in accordance with a first frame position of the first audio time frame, correlating the subset of audio frequency components corresponding to the first audio time frame with a set of adjacent audio frequency components and a set of remote audio frequency components converted from a first audio sentence including the first audio time frame. The method further includes generating an audio feature of the first audio time frame including the correlated audio frequency components of the first audio time frame, generating a plurality of output frequency components based on at least the audio feature of the first audio time frame, and decoding the plurality of output frequency components to generate output audio data in the temporal domain.
[0007] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0008] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0009] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0011] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0012] Figure 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.
[0013] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
[0014] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0015] Figure 5A is a block diagram of an example speech extraction network configured to process input audio data, in accordance with some embodiments, and Figure 5B is a data structure of example input audio data, in accordance with some embodiments.
[0016] Figure 6 is a block diagram of an example transformer module (e.g., an intrachunk transformer module, an interchunk transformer module) configured to correlate audio frequency components of a plurality of audio time frames, in accordance with some embodiments.
[0017] Figure 7 is a block diagram of a speech extraction network trained to process input audio data, in accordance with some embodiments.
[0018] Figure 8 is a flow diagram of an example audio processing method, in accordance with some embodiments.
[0019] Figure 9 is a flow diagram of an example audio processing method, in accordance with some embodiments.
[0020] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0021] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0022] Various embodiments of this application are directed to speech processing and enhancement. During target speech extraction, transformers adopt an attention mechanism to correlate time frames of each audio sentence with each other in a spectral domain. In some embodiments, input audio data and an auxiliary speech are converted to audio frequency components using a short-time Fourier transform (STFT), while the correlated audio frequency components are converted to output audio data using an inverse STFT (ISTFT). Such an audio processing method is applicable to both audio and video communications, e.g., phone calls, video conferences, live video streaming and similar applications. With an early enrollment (one or two sentences in the auxiliary speech of a target speaker), which is easily accessible to phone users, this audio processing method can work robustly in heavily corrupted scenarios. Furthermore, this method can work together with other domain features (e.g., visual information) to enhance audio performance.
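As a brief illustration of the temporal-to-spectral round trip referred to above, the following sketch checks that an STFT followed by an inverse STFT reconstructs the waveform up to numerical error; the FFT size, hop length, and 16 kHz duration are placeholders, not parameters prescribed by this application.

```python
# STFT / inverse-STFT round trip: spectral-domain processing can be inserted
# between the two transforms without losing the time-domain signal.
import torch

wave = torch.randn(16000)                       # one second of audio at 16 kHz
win = torch.hann_window(512)
spec = torch.stft(wave, n_fft=512, hop_length=128, window=win, return_complex=True)
recon = torch.istft(spec, n_fft=512, hop_length=128, window=win, length=wave.numel())
assert torch.allclose(wave, recon, atol=1e-4)   # reconstruction up to numerical error
```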
[0023] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0024] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
[0025] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0026] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models (e.g., a speech extraction model and a feature extraction model in Figures 5A and 7) are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
[0027] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
[0028] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
[0029] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
[0030] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
[0031] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
. Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
. Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
. User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
. Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
. Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
. One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
. Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
. Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 is applied to process content data; and
. One or more databases 230 for storing at least data including one or more of: o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104; o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history7 data, user preferences, and predefined account settings, o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name; o Training data 238 for training one or more data processing models 250; o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 240 includes an encoder-decoder network 500 (e.g., a U net) in Figure 5; and o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on client device 104.
[0032] In some embodiments, the data processing module 228 includes an audio processing module 229, and the data processing models 240 include a speech extraction model and a signature extraction model. The audio processing module 229 is associated with one of the user applications 224 (e.g., a social media application, a conferencing application, an Internet phone service, and an audio recorder application) to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 is applied to implement an audio processing process 800 in Figure 8.
[0033] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
[0034] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0035] Figure 3 is an example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and a client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
[0036] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, as is the data pre-processing module 308 applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.
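As a concrete illustration of paragraph [0036], the following is a minimal training-loop sketch, assuming a PyTorch model and a data loader yielding pairs of pre-processed training items and their ground truth; the names `model`, `loader`, and `loss_threshold`, and the choice of the Adam optimizer and MSE loss, are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, loss_threshold: float = 1e-3, max_epochs: int = 100):
    """Modify the data processing model until the monitored loss satisfies a loss criterion."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()  # loss comparing the output with the ground truth
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for inputs, ground_truth in loader:
            optimizer.zero_grad()
            outputs = model(inputs)                  # forward pass on pre-processed training data
            loss = criterion(outputs, ground_truth)  # monitored by the loss control module
            loss.backward()                          # adjust weights to reduce the loss
            optimizer.step()
            epoch_loss += loss.item()
        # stop once the loss criterion is satisfied (reduced below a threshold)
        if epoch_loss / len(loader) < loss_threshold:
            break
    return model
```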
[0037] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0038] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0039] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
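For illustration, a minimal sketch of a single node 420 is given below, reading the propagation function as a non-linear activation applied to the linear weighted combination of the node inputs plus the bias term b described in paragraph [0043]; the tanh activation and the specific input and weight values are illustrative assumptions.

```python
import numpy as np

def node_output(inputs: np.ndarray, weights: np.ndarray, bias: float = 0.0) -> float:
    """One node 420: weight the incoming node inputs, combine them, and apply an activation."""
    z = float(np.dot(weights, inputs)) + bias  # linear weighted combination w1*x1 + w2*x2 + ... + b
    return float(np.tanh(z))                   # non-linear activation (tanh chosen for illustration)

# Example: a node with three inputs arriving over links weighted by w1, w2, w3
y = node_output(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), bias=0.05)
```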
[0040] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
[0041] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.
[0042] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0043] The training process is a process for calibrating all of the weights for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
[0044] Figure 5A is a block diagram of an example speech extraction network 500 configured to process input audio data 502, in accordance with some embodiments, and Figure 5B is a data structure 550 of example input audio data 502, in accordance with some embodiments. The speech extraction network 500 is implemented by an audio processing module 229 of an electronic system 200, and configured to receive the input audio data 502 and an auxiliary speech 504 including an audio signal produced by a target speaker and extract output audio data 506 from the input audio data 502 with reference to the auxiliary speech 504. Compared with the input audio data 502, audio signals produced by other speakers different from the target speaker and ambient noise are reduced or suppressed in the output audio data 506. In some embodiments, the output audio data 506 has a first signal-to-noise ratio (SNR), and the input audio data 502 has a second SNR that is less than the first SNR.
[0045] The speech extraction network 500 includes a speech encoder 508, a speech extractor 510, a speech decoder 512, and a speaker embedder 514. The speech encoder 508 is configured to obtain the input audio data 502 and auxiliary speech 504 in a temporal domain and convert the input audio data 502 and auxiliary speech 504 to a plurality of audio frequency components 516 and a plurality of auxiliary frequency components 518 in a spectral domain, respectively. The speech extractor 510 is coupled to the speech encoder 508 and configured to correlate a subset of audio frequency components of each audio time frame with a set of adjacent audio frequency components and a set of remote audio frequency components converted from the same respective audio sentence of the input audio data 502 or auxiliary speech 504. An audio feature 520 is generated and includes the correlated audio frequency components of each audio time frame. The speech extractor 510 further converts the audio feature 520 to a plurality of output frequency components 522. The speech decoder 512 is coupled to the speech extractor 510 and configured to decode the plurality of output frequency components 522 to the output audio data 506 in the temporal domain. The speaker embedder 514 is coupled to both the speech encoder 508 and speech extractor 510 and configured to generate a signature feature 524 from the plurality of auxiliary frequency components 518 using a signature extraction model. The speech extractor 510 combines the signature feature 524 with an input audio feature 526 including the plurality of audio frequency components 516 and correlates frequency components of the combined audio feature 528 on an audio time frame basis to generate the audio feature 520. As such, the speech extractor 510 processes the plurality of audio frequency components 516 with reference to the signature feature 524, thereby allowing the output audio data 506 to be generated with the first SNR better than the second SNR of the input audio data 502.
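A high-level sketch of this data flow is given below, assuming each block of the speech extraction network 500 is available as a callable; the Python function names and the tensor-based interface are illustrative assumptions rather than the patent's implementation.

```python
import torch

def extract_target_speech(input_audio: torch.Tensor, auxiliary_speech: torch.Tensor,
                          speech_encoder, speaker_embedder, speech_extractor, speech_decoder):
    """Compose the four blocks of the speech extraction network 500."""
    # Speech encoder 508: temporal domain -> spectral domain
    audio_freq = speech_encoder(input_audio)        # audio frequency components 516
    aux_freq = speech_encoder(auxiliary_speech)     # auxiliary frequency components 518

    # Speaker embedder 514: signature feature 524 of the target speaker
    signature = speaker_embedder(aux_freq)

    # Speech extractor 510: correlate frequency components with reference to the signature
    output_freq = speech_extractor(audio_freq, signature)   # output frequency components 522

    # Speech decoder 512: spectral domain -> temporal domain
    return speech_decoder(output_freq)                      # output audio data 506
```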
[0046] Referring to Figure 5B, the input audio data 502 includes a first number (B) of audio sentences 530, and each audio sentence 530 includes a plurality of audio time frames 532 having a respective second number (T) of audio time frames. Each time frame 532 has a respective frame position in a respective audio sentence and corresponds to a subset of audio frequency components 526. Each audio sentence 530 includes a respective third number (S) of time chunks 534, and each time chunk 534 includes a subset of audio time frames 532 having a respective fourth number (K) of time frames. For each sentence 530, the second number is a product of the third number and the fourth number, i.e., T=SK. In an example, the input audio data 502 includes 5 sentences 530. The first sentence 530A of the input audio data 502 has 10 time chunks 534, and the first time chunk 534A of the first sentence 530A has 20 audio time frames 532 (e.g., 532A and 532B). Additionally, the auxiliary speech 504 also includes a plurality of audio sentences 530 provided by the target speaker, and each audio sentence of the auxiliary speech 504 is segmented to a plurality of time frames that are grouped into time chunks.
[0047] In some embodiments, each of the third number of time chunks 534 does not overlap with any corresponding neighboring time chunk 534. Alternatively, in some embodiments not shown, each time chunk 534 has a temporal overlap with any neighboring time chunk 534. For example, the time chunk 534A has a temporal overlap with its neighboring time chunk 534B, and the temporal overlap is 50% of the time chunk 534A or 534B. The time chunk 534D has a respective temporal overlap with either one of its neighboring time chunks 534C and 534E, and the respective temporal overlap is 50% of the time chunk 534C or 534E.
[0048] In some embodiments, each of the second number of audio time frames 532 does not overlap with any corresponding neighboring audio time frame 532. Alternatively, in some embodiments not shown, each audio time frame 532 has a temporal overlap with any neighboring audio time frame 532. For example, the audio time frame 532E has a temporal overlap with its neighboring audio time frame 532F, and the temporal overlap is 50% of the audio time frame 532E or 532F. The audio time frame 532B has a respective temporal overlap with either one of its neighboring audio time frames 532A and 532C, and the respective temporal overlap is 50% of the audio time frame 532A or 532C.
[0049] In some embodiments, each of the second, third, and fourth numbers remains the same for all sentences 530. For each sentence 530, corresponding temporal lengths of time chunks 534 and time frames 532 are not fixed, and are adjusted based on a length of the respective sentence 530 to get the fixed second, third, and fourth numbers of total audio time frames 532, time chunks 534, and time frames 532 per time chunk 534. Alternatively, in some embodiments, each of the second, third, and fourth numbers remains the same for all sentences, as are the temporal lengths of time chunks 534 and time frames 532. Each sentence 530 is filled with a respective blank audio portion at an end of the respective sentence 530 to reach a predefined sentence length, such that the sentence 530 can be segmented into the same third number (S) of time chunks 534 and the same second number (T) of time frames 532 based on fixed temporal lengths of each time chunk 534 and each time frame 532.
[0050] In some embodiments, the speech encoder 508 includes a Fourier transform or a time-to-frequency neural network. The input audio data 502 in the temporal domain is converted to the plurality of audio frequency components 516 in the spectral domain using the Fourier transform or time-to-frequency neural network. The auxiliary speech 504 in the temporal domain may be converted to the plurality of auxiliary frequency components 518 in the spectral domain using the Fourier transform or time-to-frequency neural network. In an example, the audio and auxiliary frequency components 516 and 518 are organized into an audio spectrum matrix 516’ and an auxiliary spectrum matrix 518’, respectively. Each of the audio and auxiliary spectrum matrices 516’ and 518’ has three dimensions corresponding to sentences, time frames, and frequencies, respectively. For example, each audio time frame 532 of the input audio data 502 is converted to a predefined number (F) of audio frequency components, and the audio spectrum matrix 516’ has B×T×F audio frequency components. Each audio time frame 532 of the auxiliary speech 504 is converted to an auxiliary number (Emb) of auxiliary frequency components 518, and the auxiliary spectrum matrix 518’ has B×T×Emb auxiliary frequency components 518.
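As one concrete realization of the Fourier-transform option in paragraph [0050], the sketch below converts a batch of B sentences into a B×T×F magnitude spectrum with a short-time Fourier transform; the FFT size, hop length, and window are illustrative assumptions.

```python
import torch

def to_spectrum(audio: torch.Tensor, n_fft: int = 512, hop_length: int = 256) -> torch.Tensor:
    """audio: (B, num_samples) waveform of B sentences -> (B, T, F) magnitude spectrum."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)   # (B, F, T) complex spectrum
    return spec.abs().transpose(1, 2)                        # (B, T, F) audio frequency components

# Example: B = 5 sentences of 16000 samples each; F = n_fft // 2 + 1 = 257
spectrum = to_spectrum(torch.randn(5, 16000))
```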
[0051] In some embodiments, the plurality of audio frequency components 516 are further modified to a plurality of audio frequency components 526 by an input network 536 in the speech extractor 510. The audio frequency components 526 have the same number (e.g., B×T×F) of components as the audio frequency components 516, and are organized into three dimensions corresponding to sentences, time frames, and frequencies, respectively. Conversely, the speaker embedder 514 extracts speech characteristics of the target speaker by converting the auxiliary frequency components 518 to the auxiliary frequency components 524, e.g., using a CNN layer, an RNN layer, or a harmonic based speaker embedding network including a two-dimensional convolution layer 538 and a harmonic block 540. The auxiliary frequency components 524 have the same number (e.g., B×T×Emb) of components as the auxiliary frequency components 518, and are organized into three dimensions corresponding to sentences, time frames, and frequencies, respectively. Further, in some embodiments, the speech extractor 510 combines the audio frequency components 526 and auxiliary frequency components 524 to form the combined audio feature 528. For example, the audio frequency components 526 and auxiliary frequency components 524 are concatenated to each other.
[0052] In some embodiments, the combined audio feature 528 is segmented (538) to a pre-transformer audio feature 540 based on the plurality of time chunks 534 in each sentence 530 of the input audio data 502 and auxiliary speech 504. Every two neighboring time chunks 534 of the same sentence 530 optionally have no overlaps or share a predefined overlapped portion (e.g., 20%, 50%). Specifically, each audio sentence 530 corresponds to T×F audio frequency components of the audio frequency components 526 combined in the audio feature 528. The T×F audio frequency components are segmented according to the plurality of time chunks 534 in each audio sentence 530, e.g., segmented to S chunks each having K×F audio frequency components. The audio frequency components 526 include B×T×F components in total, and are re-organized to a matrix that has B×S×K×F components in four dimensions corresponding to sentences, chunks, time frames in each time chunk, and frequencies, respectively. The pre-transformer audio feature 540 includes the re-organized audio frequency components 526.
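When the time chunks do not overlap, the segmentation in paragraph [0052] amounts to a reshape; the sketch below mirrors the patent's B, T, F, S, K dimensions and assumes T has already been padded to a multiple of K.

```python
import torch

def segment_into_chunks(features: torch.Tensor, K: int) -> torch.Tensor:
    """features: (B, T, F) combined audio feature -> (B, S, K, F) pre-transformer feature."""
    B, T, F = features.shape
    assert T % K == 0, "pad each sentence so that T is a multiple of K (T = S * K)"
    S = T // K
    return features.reshape(B, S, K, F)

# Example matching paragraph [0046]: 5 sentences, 10 chunks of 20 frames each (T = 200)
pre_transformer = segment_into_chunks(torch.randn(5, 200, 257), K=20)   # (5, 10, 20, 257)
```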
[0053] The speech extractor 510 further includes a dual transformer module 542 including one or more transformer stages. In some embodiments, the dual transformer module 542 includes a first transformer stage 542A that receives the pre-transformer audio feature 540 including the frequency components 524 and 526 that are combined and segmented.
Referring to Figure 5B, for each audio time frame 532 (e.g., 532A), the adjacent audio frequency components include a first set of audio frequency components 526 corresponding to other audio time frames 532 (e.g., 532B-532F) in the same time chunk 534 (e.g., 534A), and the remote audio frequency components include a second set of audio frequency components 526 distributed among one or more remaining time chunks 534 (e.g., 534B-534E) of the same audio sentence 530 (e.g., 530A). That said, for each audio time frame 532 (e.g., 532A), the first transformer stage 542A correlates the subset of audio frequency components 526 of the audio time frame 532 (e.g., 532A) with the first set of audio frequency components using an intrachunk transformer module 544 and with the second set of audio frequency components converted from the same respective audio sentence 530 (e.g., 530A) using an interchunk transformer module 546. More specifically, in the first transformer stage 542A, the intrachunk transformer module 544 and interchunk transformer module 546 are applied on each time chunk 534 and across time chunks 534, respectively. The intrachunk transformer module 544 acts on each time chunk 534 independently, modeling short-term dependencies within the respective time chunk 534. The interchunk transformer module 546 is applied to model transitions across different time chunks 534 and enable effective modelling of long-term dependencies across the different time chunks 534.
[0054] Alternatively, in some embodiments, the dual transformer module 542 includes a plurality of successive transformer stages 542 that are coupled in series with each other. For example, a second transformer stage 542B is coupled to an output of the first transformer stage 542A. A first audio feature is outputted from the first transformer stage 542A and includes first correlated audio frequency components of each audio time frame 532. The first audio feature is provided to the second transformer stage 542B and correlated using the intrachunk transformer module 544 and interchunk transformer module 546 sequentially. By these means, the plurality of successive transformer stages 542 are coupled in series and correlate audio frequency components 526 of each audio time frame 532 repeatedly and successively for a plurality of iterations (e.g., for 2, 3, 4, ..., or 8 iterations).
[0055] In some embodiments, the plurality of output frequency components 522 are generated by converting the audio feature 520 including the correlated audio frequency components to a frequency mask 548 and applying the frequency mask 548 on the plurality of audio frequency components 526 to form the plurality of output frequency components 522. Stated another way, the audio feature 520 results from segmented mask estimation, and is recovered to an enhanced spectrum of the target speech (i.e., the output frequency components 522), e.g., by an overlap-and-add operation and multiplication with a corrupted spectrum.
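A minimal sketch of one dual transformer stage 542 is given below, operating on a (B, S, K, F) pre-transformer feature: the intrachunk transformer attends over the K frames inside each chunk, and the interchunk transformer attends over the S chunks at each intrachunk position. The use of PyTorch's nn.TransformerEncoderLayer and the layer sizes are illustrative assumptions, not the patent's exact modules.

```python
import torch
import torch.nn as nn

class DualTransformerStage(nn.Module):
    def __init__(self, feature_dim: int, num_heads: int = 4):
        super().__init__()
        # intrachunk transformer module 544: short-term dependencies within a chunk
        self.intra = nn.TransformerEncoderLayer(d_model=feature_dim, nhead=num_heads,
                                                batch_first=True)
        # interchunk transformer module 546: long-term dependencies across chunks
        self.inter = nn.TransformerEncoderLayer(d_model=feature_dim, nhead=num_heads,
                                                batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, S, K, F = x.shape
        # attend over the K frames inside each chunk (adjacent frequency components)
        x = self.intra(x.reshape(B * S, K, F)).reshape(B, S, K, F)
        # attend over the S chunks at each intrachunk position (remote frequency components)
        x = x.permute(0, 2, 1, 3).reshape(B * K, S, F)
        x = self.inter(x).reshape(B, K, S, F).permute(0, 2, 1, 3)
        return x

# Stacking several stages applies the correlation repeatedly, as in paragraph [0054].
stage = DualTransformerStage(feature_dim=256)
out = stage(torch.randn(5, 10, 20, 256))   # (B, S, K, F)
```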
[0056] In accordance with the speech extraction network 500, the speech encoder 508, speech extractor 510, speech decoder 512, and speaker embedder 514 process the input audio data 502 jointly and with reference to an auxiliary speech 504 to enhance audio quality of the input audio data 502. The speech encoder 508 is configured to estimate a representation for the input audio data 502. For example, a short-time Fourier transform (STFT) is applied to encode the input audio data 502 and auxiliary speech 504 in the temporal domain to the representation (e.g., frequency components 516 and 518) in the spectral domain. In another example, a learnable convolution layer may be applied in the speech encoder 508 to encode the input audio data 502 or auxiliary speech 504. In some embodiments, the input audio data 502 includes a corrupted speech of the target speaker, and the speech encoder 508 encodes the auxiliary and corrupted speeches in a weighted manner. The speech extractor 510 estimates the frequency mask 548 to extract the output frequency components 522 including a target speech in the mixture, depending on information from the auxiliary speech 504. The speech decoder 512 is configured to reconstruct the output audio data 506 in the temporal domain from the output frequency components 522. In some embodiments, an enhanced target speech magnitude spectrogram is used to reconstruct the time-domain enhanced speech via an inverse STFT or learnable transpose convolution layers. The speaker embedder 514 is configured to convert auxiliary utterances into speaker embedding 524, which contains voiceprint information of the target speaker and leads the speech extraction network 500 to direct attention to the target speaker.
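The inverse-STFT option for the speech decoder 512 can be sketched as below, where the enhanced magnitude spectrogram is recombined with the phase of the corrupted input before the inverse transform; reusing the noisy phase, and the FFT size and hop length, are illustrative assumptions consistent with the magnitude-spectrogram reconstruction described above.

```python
import torch

def decode(enhanced_magnitude: torch.Tensor, noisy_complex_spec: torch.Tensor,
           n_fft: int = 512, hop_length: int = 256) -> torch.Tensor:
    """enhanced_magnitude: (B, T, F); noisy_complex_spec: (B, F, T) complex from torch.stft."""
    phase = torch.angle(noisy_complex_spec)                       # phase of the corrupted speech
    spec = torch.polar(enhanced_magnitude.transpose(1, 2), phase)  # (B, F, T) complex spectrum
    window = torch.hann_window(n_fft)
    # inverse STFT back to the temporal domain (output audio data)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=window)
```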
[0057] Figure 6 is a block diagram of an example transformer module 600 (e.g., an intrachunk transformer module 544 or an interchunk transformer module 546) configured to correlate audio frequency components 610 of a plurality of audio time frames 532, in accordance with some embodiments. The plurality of audio time frames 532 are optionally segmented from the input audio data 502 and auxiliary speech 504. The transformer module 600 includes a positional encoder 602 and one or more transformer encoders 604. Each audio time frame 532 corresponds to a subset of audio frequency components 610. The subset of audio frequency components 610 are converted from an audio signal of the respective audio time frame 532 in the input audio data 502 or auxiliary speech 504 and may be further processed using the input network 536 or speaker embedder 514. The positional encoder 602 associates the subset of audio frequency components 610 of each audio time frame 532 with a temporal position of the respective audio time frame 532, i.e., combines the subset of audio frequency components 610 of each audio time frame 532 with a respective weight determined from its temporal position in a corresponding sentence 530. In some embodiments, the one or more transformer encoders 604 includes a single transformer encoder 604. The transformer encoder 604 includes a multihead attention module 606 configured to combine the subset of weighted audio frequency components 612 of each audio time frame 532 to generate a plurality of correlated frequency components 614 for the plurality of audio time frames 532. In some embodiments, the transformer encoder 604 further includes a feedforward module 608 configured to update the plurality of correlated frequency components 614 to a plurality of correlated frequency components 616, which are outputted from the transformer module 600.
[0058] In some embodiments, the one or more transformer encoders 604 has a number of (i.e., two or more) transformer encoders that are coupled in series. Each transformer encoder 604 includes at least a multihead attention module 606. A first multihead attention module 606 receives the weighted audio frequency components 612 of each audio time frame 532 to generate a plurality of correlated frequency components 614 for the plurality of audio time frames 532. An intermediate multihead attention module 606 receives the correlated frequency components 614 outputted by an immediately preceding multihead attention module 606 to generate corresponding correlated frequency components 614 for the plurality of audio time frames 532. A last multihead attention module 606 receives the correlated frequency components 614 outputted by the immediately preceding multihead attention module 606 to generate the corresponding correlated frequency components 614 for the plurality of audio time frames 532 to be outputted by the transformer module 600.
Further, in some embodiments, each transformer encoder 604 includes a feedforward module 608 coupled to the multihead attention module 606. A first transformer encoder 604 receives the weighted audio frequency components 612 of each audio time frame 532 to generate a plurality of correlated frequency components 616 for the plurality of audio time frames 532. An intermediate transformer encoder 604 receives the correlated frequency components 616 outputted by an immediately preceding transformer encoder 604 to generate corresponding correlated frequency components 616 for the plurality of audio time frames 532. A last transformer encoder 604 receives the correlated frequency components 616 outputted by the immediately preceding transformer encoder 604 to generate the corresponding correlated frequency components 616 for the plurality of audio time frames 532 to be outputted by the transformer module 600.
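Below is a minimal sketch of the transformer module 600 of Figure 6: a positional encoder followed by one or more transformer encoders, each built from multihead attention and a feedforward module. The sinusoidal form of the positional encoding and the layer sizes are illustrative assumptions; the description above only requires that each frame's components be weighted by the frame's temporal position.

```python
import math
import torch
import torch.nn as nn

class TransformerModule(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, ff_dim: int = 1024, layers: int = 2):
        super().__init__()
        # each encoder layer bundles multihead attention 606 and a feedforward module 608
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   dim_feedforward=ff_dim, batch_first=True)
        self.encoders = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.dim = dim

    def positional_encoding(self, length: int) -> torch.Tensor:
        # weight each time frame by its temporal position (sinusoidal encoding, one row per frame)
        position = torch.arange(length).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, self.dim, 2).float() * (-math.log(10000.0) / self.dim))
        pe = torch.zeros(length, self.dim)
        pe[:, 0::2] = torch.sin(position * div)
        pe[:, 1::2] = torch.cos(position * div)
        return pe

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (batch, num_frames, dim) audio frequency components 610."""
        weighted = frames + self.positional_encoding(frames.size(1)).to(frames.device)  # components 612
        return self.encoders(weighted)                                                  # correlated components 616
```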
[0059] In some embodiments, referring to Figure 5B, the plurality of time chunks 534 include an ordered sequence of successive time chunks 534 in the same sentence 530, and each time chunk 534 includes a series of audio time frames 532. Each audio time frame 532 of a time chunk 534 (e.g., time chunk 534A) has an intrachunk position in the time chunk 534 and corresponds to a subset of audio frequency components 610 (e.g., components 528). The subset of audio frequency components 610 of each audio time frame 532 (e.g., frame 532A) of the time chunk 534 (e.g., time chunk 534A) are weighted based on the intrachunk position of the respective audio time frame 532 (e.g., frame 532A) in the respective time chunk 534 (e.g., time chunk 534A). Particularly, each of the subset of audio frequency components 610 is weighted based on the same intrachunk position of the respective audio time frame 532 (e.g., frame 532A) in the respective time chunk 534 (e.g., time chunk 534A). The time frames 532 (e.g., frames 532A-532F) in the same time chunk 534 (e.g., time chunk 534A) are adjacent to one another. Corresponding adjacent audio frequency components 610 are weighted based on intrachunk positions of the other audio time frames 532 (e.g., frames 532B-532F) in the same respective time chunk 534.
[0060] Alternatively, in some embodiments, the plurality of time chunks 534 includes an ordered sequence of successive time chunks (e.g., 534A-534E) in the same sentence 530, and each time chunk 534 includes a series of audio time frames 532 in a respective audio sentence 530. For example, a set of audio time frames 532 includes a single audio time frame 532 (e.g., a fifth audio time frame 532A in the time chunk 534A) from a respective one of the successive time chunks 534. Each of the set of audio time frames 532 has an interchunk position (e.g., in the second time chunk 534B) in the respective audio sentence 530 and corresponds to a subset of audio frequency components 610 (e.g., components 528). The subset of audio frequency components 610 of an audio time frame 532 (e.g., frame 532A) is weighted based on the interchunk position of the respective time chunk 534 (e.g., time chunk 534A) in the respective audio sentence 530. The audio time frames 532 in this set of audio time frames 532 are distributed in different time chunks 534 of the same sentence 530, thereby being remote to one another. Corresponding remote audio frequency components 610 are weighted based on interchunk positions of the other audio time frames 532A’ in this set of audio time frames 532. That said, corresponding remote audio frequency components 610 are weighted based on interchunk positions of audio time frames 532A’ of one or more remaining time chunks 534 (e.g., time chunks 534B-534E) in the respective audio sentence 530.
[0061] In an example, a first time chunk 534A of a first audio sentence 530A has a first audio time frame 532A. The first audio time frame 532A corresponds to a subset of first audio frequency components 610A. The subset of first audio frequency components 610A of the first audio time frame 532A is correlated to (1) a set of adjacent frequency components including the subset of audio frequency components 610A of each of other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A and (2) a set of remote frequency components including the subset of audio frequency components 610B of a remaining audio time frame 532A’ in each of one or more remaining time chunks 534 (e.g., 534B-534E) in the first audio sentence 530A. Further, the subset of first audio frequency components 610A of the first audio time frame 532A is weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A. The subset of audio frequency components of each of the other audio time frames 532 (e.g., 532B) is weighted based on an intrachunk position of each of the other audio time frames 532 in the first time chunk 534A.
[0062] In some embodiments, the remaining audio time frame 532A’ is included in each of one or more remaining time chunks 534 (e.g., 534B-534E) in the first audio sentence 530A. The remaining audio time frame 532A’ has a remaining position in each remaining time chunk 534 (e.g., 534B-534E), and the first audio time frame 532A has a first position in the first time chunk 534A. The first position of the first audio time frame 532A is the same as the remaining position of the remaining audio time frame 532A’ in their respective time chunks. Additionally, in some embodiments, the subset of first audio frequency components 610A of the first audio time frame 532A is weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530A. The subset of audio frequency components 610B of the remaining audio time frame 532A’ in each of the one or more remaining time chunks 534 (e.g., 534B-534E) is weighted based on an interchunk position of each of the one or more remaining time chunks 534 (e.g., 534B-534E) in the first audio sentence 530A.
[0063] In some embodiments, the intrachunk transformer module 544 of each transformer stage 542 is configured to correlate the subset of audio frequency components of the first audio time frame 532A with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components. In each transformer stage 542, the interchunk transformer module 546 is coupled to the intrachunk transformer module 544 and configured to correlate the correlated subset of audio frequency components with the set of remote audio frequency components of the remaining audio time frames 532A’ in each time chunk 534B-534E to generate the audio feature 520. Alternatively, in some embodiments, the interchunk transformer module 546 of each transformer stage 542 is configured to correlate the subset of audio frequency components of the first audio time frame 532A with the set of remote audio frequency components to generate a correlated subset of audio frequency components. In each transformer stage 542, the intrachunk transformer module 544 is coupled to the interchunk transformer module 546 and configured to correlate the correlated subset of audio frequency components with the set of adjacent audio frequency components of the other audio time frames 532 (e.g., 532B-532F) of the same time chunk 534A to generate the audio feature 520.
[0064] Figure 7 is a block diagram of a speech extraction network 700 trained to process input audio data 502, in accordance with some embodiments. The speech extraction network 700 is optionally trained by a model training module 226 of a server 102 and/or a client device 104. In some embodiments, the speech extraction network 700 is trained at the server 102, and sent to the client device 104 to generate the output audio data 506 from the input audio data 502. Alternatively, in some embodiments, the speech extraction network 700 is trained at the server 102, which receives the input audio data 502 from the client device 104 and generates the output audio data 506 from the input audio data 502 to be returned to the client device 104. Alternatively, in some embodiments, the speech extraction network 700 is trained at the client device 104. The client device 104 further generates the output audio data 506 from the input audio data 502.
[0065] The speech extraction network 700 obtains a signature feature 524 corresponding to a target sound of the input audio data 502. For example, the speech encoder 508 receives an auxiliary speech 504 of the target sound (e.g., voice of a target speaker) and generates a plurality of auxiliary frequency components 518. The speaker embedder 514 is configured to generate a signature feature 524 from the plurality of auxiliary frequency components 518 using a signature extraction model. The auxiliary speech 504 includes a plurality of signature audio sentences. The signature feature 524 includes a plurality of signature frequency components extracted from each of a plurality of signature time frames of the signature audio sentences. The speech extractor 510 modifies the plurality of audio frequency components 526 based on the signature feature 524, e.g., combines the plurality of audio frequency components 526 and the plurality of signature frequency components of the signature feature 524. The speech extractor 510 correlates a subset of audio frequency components of each audio time frame 532 with adjacent and remote audio frequency components within each of the audio sentences 530 in both the input audio data 502 and auxiliary speech 504. Additionally, in some embodiments, the signature feature 524 is extracted using the signature extraction model. The audio feature 520 is generated from the plurality of audio frequency components 516 in the spectral domain using a speech extraction model.
[0066] In some embodiments, the signature extraction model and the speech extraction model are trained end-to-end based on a combination of a mean squared error (MSE) loss 702 and a cross-entropy loss 704. The MSE loss 702 indicates a difference between test audio output data and corresponding ground truth audio data. The cross-entropy loss 704 indicates a quality of the signature feature 524. In an example, a linear network 706 and a softmax network 708 are used to generate an audio feature mask 710 from an output of the speaker embedder 514, and the audio feature mask 710 is converted to the cross-entropy loss 704.
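A sketch of this combined objective is given below, assuming the linear and softmax networks 706 and 708 act as a speaker-classification head whose cross-entropy measures the quality of the signature feature; the weighting factor `alpha` is an illustrative assumption, and PyTorch's CrossEntropyLoss applies the softmax internally to raw logits.

```python
import torch
import torch.nn as nn

mse_loss = nn.MSELoss()           # MSE loss 702 on the extracted audio
ce_loss = nn.CrossEntropyLoss()   # cross-entropy loss 704 on the speaker head (softmax applied internally)

def combined_loss(test_output: torch.Tensor, ground_truth: torch.Tensor,
                  speaker_logits: torch.Tensor, speaker_ids: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """speaker_logits: (B, num_speakers) from a linear head on the speaker embedder output."""
    reconstruction = mse_loss(test_output, ground_truth)        # difference from ground truth audio
    signature_quality = ce_loss(speaker_logits, speaker_ids)    # quality of the signature feature
    return reconstruction + alpha * signature_quality
```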
[0067] Figure 8 is a flow diagram of an example audio processing method 800, in accordance with some embodiments. For convenience, the audio processing method 800 is described as being implemented by an electronic system 200 (e.g., including a mobile phone 104C). Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.
[0068] The electronic system 200 obtains (802) input audio data having a plurality of audio sentences 530 in a temporal domain and converts (804) the input audio data 502 in the temporal domain to a plurality of audio frequency components (e.g., audio frequency components 516 and 526) in a spectral domain. Each audio sentence 530 includes (806) a plurality of audio time frames 532, and each time frame 532 has (808) a respective frame position in a respective audio sentence 530 and corresponds to a subset of audio frequency components. For each audio time frame 532 and in accordance with the respective frame position (810), the electronic system 200 correlates (812) the subset of audio frequency components of the audio time frame 532 with a set of adjacent audio frequency components and a set of remote audio frequency components converted from the same respective audio sentence 530, thereby generating (814) an audio feature 520 including the correlated audio frequency components of each audio time frame 532. The electronic system 200 generates (816) a plurality of output frequency components 522 from the audio feature 520 and decodes (818) the plurality of output frequency components 522 to output audio data 506 in the temporal domain.
[0069] In some embodiments, for each audio sentence 530, the electronic system 200 segments (820) corresponding audio frequency components based on a plurality of time chunks 534. Each time chunk 534 has a number (K) of audio time frames 532. Further, in some embodiments, for each audio time frame 532, the respective audio time frame 532 corresponds to a respective time chunk 534 in a corresponding audio sentence 530. Referring to Figure 5B, for a first time frame 532A of a first time chunk 534A, the set of adjacent audio frequency components include a first set of audio frequency components corresponding to other audio time frames 532B-532F in the respective time chunk 534A, and the set of remote audio frequency components include a second set of audio frequency components distributed among one or more remaining time chunks 534 (e.g., 534B-534E) of the corresponding audio sentence 530. Additionally, in some embodiments, for correlation with the adjacent audio frequency components, each of the subset of audio frequency components of the respective audio time frame 532 (e.g., 532A) is weighted based on an intrachunk position of the respective audio time frame 532 in the respective time chunk 534 (e.g., 534A), and the set of adjacent audio frequency components are weighted based on intrachunk positions of the other audio time frames 532 (e.g., 532B-532F) in the respective time chunk 534 (e.g., 534A). Further, in some embodiments, for correlation with the remote audio frequency components, each of the subset of audio frequency components of the respective audio time frame 532 (e.g., 532A) is weighted based on an interchunk position of the respective time chunk 534 (e.g., 534A) in the respective audio sentence 530, and the set of remote audio frequency components are weighted based on interchunk positions of the one or more remaining time chunks 534 (e.g., 534B-534E) in the respective audio sentence 530.
[0070] In some embodiments, for a first time chunk 534A of a first audio sentence 530A, a first audio time frame 532A corresponds to a subset of first audio frequency components, and the subset of first audio frequency components of the first audio time frame 532A is correlated to (1) the subset of audio frequency components of each of other audio time frames 532B-532F in the first time chunk 534A and (2) the subset of audio frequency components of a remaining audio time frame 532A’ in each of one or more remaining time chunks 534B-534E in the respective audio sentence 530. Further, in some embodiments, for correlation within the first time chunk 534A, the subset of first audio frequency components of the first audio time frame 532A is weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A, and the subset of audio frequency components of each of the other audio time frames 532B-532F is weighted based on an intrachunk position of each of the other audio time frames 532B-532F in the first time chunk 534A. Additionally, in some embodiments, the remaining audio time frame 532A’ has a remaining position in each remaining time chunk (e.g., 534B-534E). The first audio time frame 532A has a first position in the first time chunk 534A, and the first position is the same as the remaining position. Also, in some embodiments, for correlation with the one or more remaining time chunks, the subset of first audio frequency components of the first audio time frame 532A is weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530A, and the subset of audio frequency components of the remaining audio time frame 532A’ in each of the one or more remaining time chunks (e.g., 534B-534E) is weighted based on an interchunk position of each of the one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530A.
[0071] In some embodiments, correlation of each audio time frame 532 further includes correlating (822) the subset of audio frequency components of the respective audio time frame 532 with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components and correlating (824) the correlated subset of audio frequency components with the set of remote audio frequency components to generate the audio feature 520. Alternatively, in some embodiments, correlation of each audio time frame 532 further includes correlating the subset of audio frequency components of the respective audio time frame 532 with the set of remote audio frequency components to generate a correlated subset of audio frequency components, and correlating the correlated subset of audio frequency components with the set of adjacent audio frequency components to generate the audio feature 520.
[0072] In some embodiments, correlation of each audio time frame 532 of each sentence 530 is repeated successively for a plurality of iterations.
[0073] In some embodiments, the plurality of output frequency components 522 are generated by converting the audio feature 520 including the correlated audio frequency components to a frequency mask 548 and applying the frequency mask 548 on the plurality of audio frequency components 526 to form the plurality of output frequency components 522.
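A minimal sketch of this mask-and-apply step is shown below, assuming the chunked audio feature has already been merged back (e.g., by overlap-and-add) to the same (B, T, F) shape as the audio frequency components; the sigmoid mask head is an illustrative choice.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # maps the correlated audio feature to mask logits

    def forward(self, audio_feature: torch.Tensor, audio_freq: torch.Tensor) -> torch.Tensor:
        """audio_feature, audio_freq: (B, T, F) -> output frequency components (B, T, F)."""
        mask = torch.sigmoid(self.proj(audio_feature))   # frequency mask with values in [0, 1]
        return mask * audio_freq                         # apply the mask on the audio frequency components
```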
[0074] In some embodiments, the electronic system 200 obtains a signature feature 524 corresponding to a target sound of the input audio data 502 and modifies the plurality of audio frequency components 526 based on the signature feature prior to correlating the subset of audio frequency components 526 of the audio time frame 532 with the adjacent and remote audio frequency components. In an example, the signature feature 524 includes a plurality of signature frequency components extracted from each of a plurality of signature time frames of a plurality of signature audio sentences in the auxiliary speech 504. Further, in some embodiments, the signature feature 524 is extracted using a signature extraction model. The audio feature 520 is generated from the plurality of audio frequency components 526 in the spectral domain using a speech extraction model. The signature extraction model and the speech extraction model are trained end-to-end based on a combination of a mean squared error (MSE) loss and a cross-entropy loss. The MSE loss indicates a difference between test audio output data and corresponding ground truth audio data, and the cross-entropy loss indicates a quality of the signature feature.
[0075] In some embodiments, the input audio data 502 in the temporal domain is converted to the plurality of audio frequency components 516 or 526 in the spectral domain using a Fourier transform or using a time-to-frequency neural network.
[0076] In some embodiments, the output audio data 506 has a first signal-to-noise ratio (SNR), and the input audio data 502 has a second SNR that is less than the first SNR.
[0077] In some embodiments, the electronic system 200 executes one of a social media application, a conferencing application, an Internet phone service, and an audio recorder application for implementing the method 800. That said, the audio processing method 800 can be applied for both audio and video communications, e.g., phone calls, video conferences, live video streaming, and similar applications. With an early enrollment (1 or 2 sentences from the target speaker) which is accessible to phone users, the method 800 can work robustly in heavily corrupted scenarios. Furthermore, this method 800 can work together with other domain features (e.g., visual information) to enhance performance of an application involving audio or video communication.
[0078] Speech signals are oftentimes mixed with ambient noise that can easily distract a listener’s attention. This makes it difficult to discern the speech signals in crowded or noisy environments. Deep learning techniques have demonstrated significant advantages over conventional signal processing methods for the purposes of improving perceptual quality and intelligibility of speech signals. In some situations, convolutional neural networks (CNNs) or recursive neural networks (RNNs) are utilized to extract speaker embeddings and isolate target speech content. For example, a speaker encoder includes a 3-layer long short-term memory (LSTM) network, which receives log-mel filter bank energies as input and outputs 256-dimensional speaker embeddings. Average pooling can be used after the last layer to convert frame-wise features to utterance-wise vectors. An i-vector extractor has also been applied with a variability matrix. After being trained with features of 19 mel frequency cepstral coefficients (MFCCs), energy, and their first and second derivatives, the extractor can output a 60-dimensional i-vector of a target speaker. Further, a block structure consists of 2 CNNs with a kernel size of 1×1 and a 1-D max-pooling layer, and the 1-D max-pooling layer efficiently addresses a silent frame issue raised by average pooling. Particularly, the method 800 captures both short-term and long-term dependencies of a speech spectrum by applying transformer blocks on each time chunk 534 of a sentence 530 as well as across time chunks in speech extraction. These transformer blocks could eliminate recurrence, replace it with a fully attention-based mechanism, and implement computation tasks in parallel.
[0079] It should be understood that the particular order in which the operations in Figure 8 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to process audio data. Additionally, it should be noted that details of other processes described above with respect to Figures 5A-7 and 9 are also applicable in an analogous manner to method 800 described above with respect to Figure 8. For brevity, these details are not repeated here.
[0080] Figure 9 is a flow diagram of another example audio processing method 900, in accordance with some embodiments. For convenience, the audio processing method 900 is described as being implemented by an electronic system 200 (e.g., including a mobile phone 104C). Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 900 may be combined and/or the order of some operations may be changed.
[0081] The electronic system 200 obtains (902) input audio data 502. The input audio data 502 includes a plurality of audio sentences 530 in a temporal domain, and each of the plurality of audio sentences 530 includes a plurality of audio time frames 532 each having a frame position. The electronic system 200 converts (904) the plurality of audio sentences 530 in the temporal domain to a plurality of audio frequency components in a spectral domain. A subset of the plurality of audio frequency components corresponds to a first audio time frame 532A in the plurality of audio time frames 532. In accordance with a first frame position of the first audio time frame 532A, the electronic system correlates (906) the subset of audio frequency components corresponding to the first audio time frame 532A with a set of adjacent audio frequency components and a set of remote audio frequency components converted from a first audio sentence 530A including the first audio time frame 532A. The electronic system generates (908) an audio feature 520 of the first audio time frame 532A including the correlated audio frequency components of the first audio time frame 532A, generates (910) a plurality of output frequency components 522 based on at least the audio feature of the first audio time frame 532A, and decodes (912) the plurality of output frequency components 522 to generate output audio data 506 in the temporal domain.
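Steps 904 and 912 amount to a time-to-frequency transform and its inverse. A minimal sketch using the short-time Fourier transform (one option named in claim 16) is shown below; the FFT size, hop length, and Hann window are illustrative choices, not parameters specified by the disclosure.

```python
import torch

def to_spectral(wave, n_fft=512, hop=128):
    """Step 904: convert a time-domain sentence into complex frequency components."""
    window = torch.hann_window(n_fft)
    return torch.stft(wave, n_fft, hop_length=hop, window=window, return_complex=True)

def to_temporal(spec, n_fft=512, hop=128, length=None):
    """Step 912: decode output frequency components back into a waveform."""
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft, hop_length=hop, window=window, length=length)

wave = torch.randn(16000)                       # placeholder one-second sentence
spec = to_spectral(wave)                        # (freq bins, time frames)
recovered = to_temporal(spec, length=wave.shape[-1])
```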
[0082] In some embodiments, the electronic system 200 segments audio frequency components corresponding to the first audio sentence 530A into a plurality of time chunks 534. Each of the plurality of time chunks 534 includes a number of audio time frames 532. The first audio time frame 532A corresponds to a first time chunk 534A in the first audio sentence 530A. Further, in some embodiments, the set of adjacent audio frequency components include a first set of audio frequency components corresponding to other audio time frames 532B-532F in the first time chunk 534A, and the set of remote audio frequency components include a second set of audio frequency components corresponding to audio time frames among one or more remaining time chunks 534 (e.g., 534B-534E) of the first audio sentence 530A.
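The segmentation in paragraph [0082] can be pictured as regrouping the per-frame frequency components into fixed-length chunks of frames. A minimal PyTorch sketch follows; the chunk length of 64 frames and the zero-padding of the final chunk are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def segment_into_chunks(spec, chunk_frames):
    """Split per-frame frequency components of shape (freq, frames) into equal
    time chunks; rows of the result index intrachunk frame positions."""
    freq, frames = spec.shape
    pad = (-frames) % chunk_frames               # zero-pad the last chunk
    spec = F.pad(spec, (0, pad))
    return spec.reshape(freq, -1, chunk_frames).permute(1, 2, 0)

chunks = segment_into_chunks(torch.randn(257, 310), chunk_frames=64)
print(chunks.shape)                              # -> (5 chunks, 64 frames, 257 bins)
```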
[0083] Further, in some embodiments, when the subset of audio frequency components corresponding to the first audio time frame 532A are correlated with the set of adjacent audio frequency components and the set of remote audio frequency components, each of the subset of audio frequency components corresponding to the first audio time frame 532A is weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A, and the set of adjacent audio frequency components are weighted based on intrachunk positions of the other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A. Additionally, in some embodiments, each of the subset of audio frequency components of the first audio time frame 532A is weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530A, and the set of remote audio frequency components are weighted based on interchunk positions of the one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530A.
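One possible realization of the intrachunk and interchunk weighting in paragraph [0083] adds learned positional embeddings along both axes of the chunked representation before attention. This is a hypothetical sketch, not the specific weighting scheme of the disclosure; the embedding tables and their sizes are assumptions.

```python
import torch
import torch.nn as nn

class DualPositionalWeighting(nn.Module):
    """Hypothetical positional weights: one table indexed by a frame's intrachunk
    position, another by its chunk's interchunk position (cf. [0083])."""
    def __init__(self, max_frames_per_chunk, max_chunks, feat_dim):
        super().__init__()
        self.intra = nn.Embedding(max_frames_per_chunk, feat_dim)
        self.inter = nn.Embedding(max_chunks, feat_dim)

    def forward(self, chunks):                   # (num_chunks, chunk_frames, feat_dim)
        num_chunks, chunk_frames, _ = chunks.shape
        intra_pos = torch.arange(chunk_frames, device=chunks.device)
        inter_pos = torch.arange(num_chunks, device=chunks.device)
        return chunks + self.intra(intra_pos)[None, :, :] + self.inter(inter_pos)[:, None, :]

weighted = DualPositionalWeighting(64, 32, 257)(torch.randn(5, 64, 257))
```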
[0084] In some embodiments, the subset of audio frequency components of the first audio time frame 532A are correlated with a subset of audio frequency components corresponding to each of other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A and a subset of audio frequency components corresponding to a remaining audio time frame in each of one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530A. Further, in some embodiments, the subset of audio frequency components of the first audio time frame 532A are weighted based on an intrachunk position of the first audio time frame 532A in the first time chunk 534A, and the subset of audio frequency components of each of the other audio time frames 532 (e.g., 532B-532F) are weighted based on an intrachunk position of each of the other audio time frames 532 (e.g., 532B-532F) in the first time chunk 534A. Further, in some embodiments, the remaining audio time frame has a remaining position in each of the one or more remaining time chunks (e.g., 534B-534E), and the first audio time frame 532A has a first position in the first time chunk 534A, the first position being the same as the remaining position. Additionally, in some embodiments, the subset of audio frequency components of the first audio time frame 532A are weighted based on an interchunk position of the first time chunk 534A in the first audio sentence 530A, and the subset of audio frequency components of the remaining audio time frame in each of the one or more remaining time chunks (e.g., 534B-534E) are weighted based on an interchunk position of each of the one or more remaining time chunks (e.g., 534B-534E) in the first audio sentence 530A.
[0085] In some embodiments, the electronic system 200 correlates the subset of audio frequency components corresponding to the first audio time frame 532A with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components. The electronic system 200 correlates the correlated subset of audio frequency components with the set of remote audio frequency components to generate the audio feature. [0086] In some embodiments, the electronic system 200 correlates the subset of audio frequency components corresponding to the first audio time frame 532A with the set of remote audio frequency components to generate a correlated subset of audio frequency components and correlates the correlated subset of audio frequency components with the set of adjacent audio frequency components to generate the audio feature.
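Paragraphs [0085] and [0086] describe correlating a frame first with its adjacent (intra-chunk) frames and then with remote frames in other chunks, in either order. A minimal PyTorch sketch of one such pass built from standard transformer encoder layers is shown below; the feature dimension of 257 (e.g., the one-sided bins of a 512-point FFT) and the single attention head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """One correlation pass: self-attention within each chunk (adjacent frames),
    then self-attention across chunks at matching intrachunk positions (remote
    frames). Swapping the two stages gives the variant of [0086]."""
    def __init__(self, feat_dim=257, heads=1):
        super().__init__()
        self.intra_attn = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.inter_attn = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)

    def forward(self, chunks):                      # (num_chunks, chunk_frames, feat_dim)
        chunks = self.intra_attn(chunks)            # correlate with adjacent frames
        by_position = chunks.transpose(0, 1)        # (chunk_frames, num_chunks, feat_dim)
        by_position = self.inter_attn(by_position)  # correlate with remote frames
        return by_position.transpose(0, 1)

features = DualPathBlock()(torch.randn(5, 64, 257))   # same shape in and out
```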
[0087] In some embodiments, the electronic system 200 repeats, successively for a plurality of iterations, correlating the subset of audio frequency components corresponding to the first audio time frame 532A with the set of adjacent audio frequency components and the set of remote audio frequency components.
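The repeated correlation of paragraph [0087] can be sketched by stacking several of the DualPathBlock passes from the previous example; the depth of four is an arbitrary illustrative choice, not a value taken from the disclosure.

```python
import torch
import torch.nn as nn

# Successive correlation passes over the same chunked representation (cf. [0087]);
# reuses the DualPathBlock sketch defined above.
dual_path_stack = nn.Sequential(*[DualPathBlock(feat_dim=257) for _ in range(4)])
features = dual_path_stack(torch.randn(5, 64, 257))
```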
[0088] It should be understood that the particular order in which the operations in Figure 9 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to process audio data. Additionally, it should be noted that details of other processes described above with respect to Figures 5A-8 are also applicable in an analogous manner to the method 900 described above with respect to Figure 9. For brevity, these details are not repeated here.
[0089] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0090] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[0091] The foregoing description, for purposes of explanation, has been presented with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[0092] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. An audio processing method, implemented by an electronic device, the method comprising: obtaining input audio data, wherein the input audio data includes a plurality of audio sentences in a temporal domain, and each of the plurality of audio sentences includes a plurality of audio time frames each having a frame position; converting the plurality of audio sentences in the temporal domain to a plurality of audio frequency components in a spectral domain, wherein a subset of the plurality of audio frequency components corresponds to a first audio time frame in the plurality of audio time frames; in accordance with a first frame position of the first audio time frame, correlating the subset of audio frequency components corresponding to the first audio time frame with a set of adjacent audio frequency components and a set of remote audio frequency components converted from a first audio sentence including the first audio time frame; generating an audio feature of the first audio time frame, the audio feature including the correlated audio frequency components of the first audio time frame; generating a plurality of output frequency components based on at least the audio feature of the first audio time frame; and decoding the plurality of output frequency components to generate output audio data in the temporal domain.
2. The method of claim 1, further comprising: segmenting audio frequency components corresponding to the first audio sentence into a plurality of time chunks, each of the plurality of time chunks including a number of audio time frames, wherein the first audio time frame corresponds to a first time chunk in the first audio sentence.
3. The method of claim 2, wherein: the set of adjacent audio frequency components include a first set of audio frequency components corresponding to other audio time frames in the first time chunk; and the set of remote audio frequency components include a second set of audio frequency components corresponding to audio time frames among one or more remaining time chunks of the first audio sentence.
4. The method of claim 3, wherein correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components further comprises: weighting each of the subset of audio frequency components corresponding to the first audio time frame based on an intrachunk position of the first audio time frame in the first time chunk, and weighting the set of adjacent audio frequency components based on intrachunk positions of the other audio time frames in the first time chunk.
5. The method of claim 3 or 4, wherein correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components further comprises: weighting each of the subset of audio frequency components of the first audio time frame based on an interchunk position of the first time chunk in the first audio sentence; and weighting the set of remote audio frequency components based on interchunk positions of the one or more remaining time chunks in the first audio sentence.
6. The method of claim 2, wherein correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components further comprises: correlating the subset of audio frequency components of the first audio time frame with a subset of audio frequency components corresponding to each of other audio time frames in the first time chunk and a subset of audio frequency components corresponding to a remaining audio time frame in each of one or more remaining time chunks in the first audio sentence.
7. The method of claim 6, wherein correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components further comprises: weighting the subset of audio frequency components of the first audio time frame based on an intrachunk position of the first audio time frame in the first time chunk; and weighting the subset of audio frequency components of each of the other audio time frames based on an intrachunk position of each of the other audio time frames in the first time chunk.
8. The method of claim 6, wherein the remaining audio time frame has a remaining position in each of the one or more remaining time chunks, and the first audio time frame has a first position in the first time chunk, the first position being the same as the remaining position.
9. The method of claim 6, wherein correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components further comprises: weighting the subset of audio frequency components of the first audio time frame based on an interchunk position of the first time chunk in the first audio sentence, and weighting the subset of audio frequency components of the remaining audio time frame in each of the one or more remaining time chunks based on an interchunk position of each of the one or more remaining time chunks in the first audio sentence.
10. The method of any of the preceding claims, wherein correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components further comprises: correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components to generate a correlated subset of audio frequency components; and correlating the correlated subset of audio frequency components with the set of remote audio frequency components to generate the audio feature.
11. The method of any of claims 1-9, wherein correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components further comprises: correlating the subset of audio frequency components corresponding to the first audio time frame with the set of remote audio frequency components to generate a correlated subset of audio frequency components; and correlating the correlated subset of audio frequency components with the set of adjacent audio frequency components to generate the audio feature.
12. The method of any of the preceding claims, wherein generating an audio feature for the first audio time frame further comprises repeating, successively for a plurality of iterations, correlating the subset of audio frequency components corresponding to the first audio time frame with the set of adjacent audio frequency components and the set of remote audio frequency components.
13. The method of any of the preceding claims, wherein generating the plurality of output frequency components comprises: converting the audio feature including the correlated audio frequency components to a frequency mask, and applying the frequency mask on the plurality of audio frequency components to form the plurality of output frequency components.
14. The method of any of the preceding claims, further comprising: obtaining a signature feature corresponding to a target sound of the input audio data; and modifying the plurality of audio frequency components based on the signature feature prior to correlating the subset of audio frequency components corresponding to the first audio time frame with the sets of adjacent and remote audio frequency components.
15. The method of claim 14, wherein: the signature feature is extracted using a signature extraction model; the audio feature is generated from the plurality of audio frequency components in the spectral domain using a speech extraction model; and the signature extraction model and the speech extraction model are trained end-to-end based on a combination of a mean squared error (MSE) loss and a cross-entropy loss, wherein the MSE loss indicates a difference between test audio output data and corresponding ground truth audio data, and the cross-entropy loss indicates a quality of the signature feature.
16. The method of any of the preceding claims, wherein the input audio data in the temporal domain is converted to the plurality of audio frequency components in the spectral domain using a Fourier transform or using a time-to-frequency neural network.
17. The method of any of the preceding claims, further comprising: executing one of a social media application, a conferencing application, an Internet phone service, and an audio recorder application.
18. The method of any of the preceding claims, wherein the output audio data has a first signal-to-noise ratio (SNR), and the input audio data has a second SNR that is less than the first SNR.
19. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-18.
20. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-18.
PCT/US2022/026671 2022-04-28 2022-04-28 Transformer-encoded speech extraction and enhancement WO2023211443A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/026671 WO2023211443A1 (en) 2022-04-28 2022-04-28 Transformer-encoded speech extraction and enhancement

Publications (1)

Publication Number Publication Date
WO2023211443A1 true WO2023211443A1 (en) 2023-11-02

Family

ID=88519477

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040176961A1 (en) * 2002-12-23 2004-09-09 Samsung Electronics Co., Ltd. Method of encoding and/or decoding digital audio using time-frequency correlation and apparatus performing the method
US20160275964A1 * 2015-03-20 2016-09-22 Electronics And Telecommunications Research Institute Feature compensation apparatus and method for speech recognition in noisy environment
US20210120358A1 (en) * 2019-10-18 2021-04-22 Msg Entertainment Group, Llc Synthesizing audio of a venue
US20210375260A1 (en) * 2020-05-29 2021-12-02 TCL Research America Inc. Device and method for generating speech animation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22940453

Country of ref document: EP

Kind code of ref document: A1