WO2021135577A1 - Audio signal processing method and apparatus, electronic device, and storage medium - Google Patents

Audio signal processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2021135577A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
feature
model
network
student
Prior art date
Application number
PCT/CN2020/124132
Other languages
English (en)
French (fr)
Other versions
WO2021135577A9 (zh)
Inventor
王珺
林永业
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to EP20909391.3A (published as EP4006901A4)
Publication of WO2021135577A1
Publication of WO2021135577A9
Priority to US17/667,370 (published as US20220165288A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/088 - Non-supervised learning, e.g. competitive learning
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • G06N 5/041 - Abduction
    • G06N 7/00 - Computing arrangements based on specific mathematical models
    • G06N 7/01 - Probabilistic graphical models, e.g. probabilistic networks
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/00 - Speech recognition
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 - Training, enrolment or model building
    • G10L 17/06 - Decision making techniques; Pattern matching strategies
    • G10L 17/22 - Interactive procedures; Man-machine interfaces
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 - Processing in the frequency domain
    • G10L 21/0272 - Voice signal separating
    • G10L 21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 2021/02082 - Noise filtering, the noise being echo, reverberation of the speech
    • G10L 2021/02087 - Noise filtering, the noise being separate speech, e.g. cocktail party
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of signal processing technology, and in particular to an audio signal processing method, device, electronic equipment, and storage medium.
  • the "cocktail party problem" is a hot research topic: given a mixed audio signal (two or more speakers), how to separate the independence of each person who speaks at the same time in a cocktail party audio signal?
  • the solution to the above cocktail party problem is called voice separation technology.
  • speech separation is usually based on a deep model of supervised learning.
  • deep models based on supervised learning include DPCL (Deep Clustering), DANet (Deep Attractor Network), and ADANet (Anchored Deep Attractor Network, anchored deep attractor network), ODANet (Online Deep Attractor Network, online deep attractor network), etc.
  • supervised learning here means that, after labeled training data is obtained for a certain type of specific scene, a deep model for speech separation in the corresponding scene is trained.
  • however, the robustness and generalization of deep models based on supervised learning are poor, so the accuracy with which such models process audio signals outside the training scene is poor.
  • the embodiments of the present application provide an audio signal processing method, device, electronic equipment, and storage medium, which can improve the accuracy of the audio signal processing process.
  • the technical solutions are as follows:
  • an audio signal processing method which is applied to an electronic device, and the method includes:
  • the embedding process on the mixed audio signal to obtain the embedding feature of the mixed audio signal includes:
  • the mixed audio signal is mapped to the embedding space to obtain the embedding feature.
  • the performing generalized feature extraction on the embedded feature to obtain the generalized feature of the target component in the mixed audio signal includes:
  • the embedding processing on the mixed audio signal to obtain the embedding feature of the mixed audio signal includes:
  • the performing generalized feature extraction on the embedded feature to obtain the generalized feature of the target component in the mixed audio signal includes:
  • the embedded feature is input to an extraction network, and generalized feature extraction is performed on the embedded feature through the extraction network to obtain the generalized feature of the target component in the mixed audio signal.
  • in some embodiments, the extraction network is an autoregressive model, and inputting the embedded features into the extraction network and performing generalized feature extraction on the embedded features through the extraction network to obtain the generalized features of the target component in the mixed audio signal includes:
  • the embedded feature is input into the autoregressive model, and the embedded feature is recursively weighted through the autoregressive model to obtain the generalized feature of the target component.
  • the method further includes:
  • the teacher model and the student model are collaboratively and iteratively trained to obtain the coding network and the extraction network, wherein the student model includes a first coding network and a first extraction network, and the teacher model includes a second coding network and a second extraction network.
  • the output of the first coding network is used as the input of the first extraction network
  • the output of the second coding network is used as the input of the second extraction network.
  • the teacher model in each iteration is a weighted combination of the teacher model of the previous iteration and the student model of the current iteration.
  • the collaborative iterative training of the teacher model and the student model based on the unlabeled sample mixed signal to obtain the coding network and the extraction network includes:
  • in any iteration, the teacher model of the current iteration is obtained based on the student model of the current iteration and the teacher model of the previous iteration;
  • the coding network and the extraction network are acquired based on the student model or teacher model of this iterative process.
  • the obtaining the loss function value of the current iteration process based on at least one of the sample mixed signal, the teacher generalization feature, or the student generalization feature includes:
  • the training stop condition is that the mean square error does not decrease for a first target number of consecutive iterations; or,
  • the training stop condition is that the mean square error is less than or equal to the first target threshold and the mutual information value is greater than or equal to the second target threshold; or,
  • the training stop condition is that the number of iterations reaches the second target number (a minimal check of these alternative conditions is sketched after this list).
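  • For illustration only, the following Python sketch shows how the three alternative training stop conditions above could be checked; the parameter names (first_target_number, first_target_threshold, second_target_threshold, second_target_number) and their default values are hypothetical placeholders, not values taken from this application.

```python
def should_stop(mse_history, mutual_info, iteration,
                first_target_number=5, first_target_threshold=1e-3,
                second_target_threshold=0.9, second_target_number=10000):
    """Check the three alternative training stop conditions described above.

    mse_history: list of mean square error values, one per iteration.
    mutual_info: mutual information value of the current iteration.
    iteration:   index of the current iteration (1-based).
    All threshold/count parameters are hypothetical placeholders.
    """
    # Condition 1: MSE has not decreased for `first_target_number` consecutive iterations.
    if len(mse_history) > first_target_number:
        recent = mse_history[-(first_target_number + 1):]
        if all(later >= earlier for earlier, later in zip(recent, recent[1:])):
            return True
    # Condition 2: MSE small enough and mutual information large enough.
    if mse_history and mse_history[-1] <= first_target_threshold \
            and mutual_info >= second_target_threshold:
        return True
    # Condition 3: the iteration count reached the second target number.
    return iteration >= second_target_number
```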
  • obtaining the teacher model of this iteration process includes:
  • the acquisition of the coding network and the extraction network based on the student model or the teacher model of the current iteration process includes:
  • the second coding network and the second extraction network in the teacher model of this iterative process are respectively determined as the coding network and the extraction network.
  • the performing audio signal processing based on the generalization feature of the target component includes:
  • a response voice corresponding to the audio signal of the target object is generated, and the response voice is output.
  • an audio signal processing device which includes:
  • An embedding processing module configured to perform embedding processing on the mixed audio signal to obtain the embedding characteristics of the mixed audio signal
  • a feature extraction module configured to perform generalized feature extraction on the embedded feature to obtain a generalized feature of a target component in the mixed audio signal, where the target component corresponds to the audio signal of the target object in the mixed audio signal;
  • the signal processing module is used to perform audio signal processing based on the generalization characteristics of the target component.
  • the embedded processing module is configured to input the mixed audio signal into an encoding network, and perform embedding processing on the mixed audio signal through the encoding network to obtain the embedded characteristics of the mixed audio signal;
  • the feature extraction module is configured to input the embedded feature into an extraction network, and perform generalized feature extraction on the embedded feature through the extraction network to obtain the generalized feature of the target component in the mixed audio signal, where the target component corresponds to the audio signal of the target object in the mixed audio signal.
  • the embedded processing module is used to:
  • the mixed audio signal is mapped to the embedding space to obtain the embedding feature.
  • the feature extraction module is configured to perform recursive weighting processing on the embedded feature to obtain the generalized feature of the target component.
  • the extraction network is an autoregressive model
  • the feature extraction module is used to:
  • the embedded feature is input into the autoregressive model, and the embedded feature is recursively weighted through the autoregressive model to obtain the generalized feature of the target component.
  • the device further includes:
  • the training module is configured to perform collaborative iterative training on the teacher model and the student model based on the unlabeled sample mixed signal to obtain the coding network and the extraction network, wherein the student model includes a first coding network and a first extraction network, the teacher model includes a second coding network and a second extraction network, the output of the first coding network is used as the input of the first extraction network, and the output of the second coding network is used as the input of the second extraction network;
  • the teacher model in each iteration process is weighted by the teacher model in the previous iteration process and the student model in this iteration process.
  • the training module includes:
  • the first obtaining unit is used to obtain the teacher model of this iteration process based on the student model of this iteration process and the teacher model of the previous iteration process in any iteration process;
  • the output unit is configured to input the unlabeled sample mixed signal into the teacher model and the student model of this iterative process respectively, and respectively output the teacher generalization feature and the student generalization feature of the target component in the sample mixed signal;
  • the second acquiring unit is configured to acquire the loss function value of this iteration process based on at least one of the sample mixed signal, the teacher generalization feature, or the student generalization feature;
  • the parameter adjustment unit is configured to, if the loss function value does not meet the training stop condition, adjust the parameters of the student model to obtain the student model of the next iteration, and execute the next iteration based on the student model of the next iteration;
  • the third obtaining unit is configured to obtain the coding network and the extraction network based on the student model or the teacher model of this iteration process if the loss function value meets the training stop condition.
  • the second acquiring unit is configured to:
  • the training stop condition is that the mean square error does not decrease for a first target number of consecutive iterations; or,
  • the training stop condition is that the mean square error is less than or equal to the first target threshold and the mutual information value is greater than or equal to the second target threshold; or,
  • the training stop condition is that the number of iterations reaches the second target number.
  • the first acquiring unit is configured to:
  • the third acquiring unit is configured to:
  • the second coding network and the second extraction network in the teacher model of this iterative process are respectively determined as the coding network and the extraction network.
  • the signal processing module is used to:
  • a response voice corresponding to the audio signal of the target object is generated, and the response voice is output.
  • in one aspect, an electronic device includes one or more processors and one or more memories, where at least one piece of program code is stored in the one or more memories, and the at least one piece of program code is loaded and executed by the one or more processors to implement the operations performed by the audio signal processing method in any of the foregoing possible implementations.
  • in one aspect, a storage medium stores at least one piece of program code, and the at least one piece of program code is loaded and executed by a processor to implement the operations performed by the audio signal processing method in any of the above-mentioned possible implementations.
  • FIG. 1 is a schematic diagram of an implementation environment of an audio signal processing method provided by an embodiment of the present application
  • FIG. 2 is a flowchart of an audio signal processing method provided by an embodiment of the present application
  • FIG. 3 is a flowchart of a method for training a coding network and an extraction network provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a training method of a coding network and an extraction network provided by an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of an audio signal processing device provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • Fig. 7 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the term "at least one" refers to one or more, and the meaning of “multiple” refers to two or more than two, for example, multiple first positions refer to two or more first positions.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes audio processing technology, computer vision technology, natural language processing technology, and machine learning/deep learning.
  • Speech technology, also known as voice processing technology, includes speech separation technology, automatic speech recognition (ASR) technology, speech synthesis (Text To Speech, TTS, also known as text-to-speech) technology, and voiceprint recognition technology.
  • the embodiments of the present application relate to voice separation technology in the field of audio processing technology.
  • the following is a brief introduction to the voice separation technology:
  • the goal of speech separation is to separate the target speaker’s voice from background interference.
  • speech separation is one of the most basic audio processing tasks, with a wide range of applications, including hearing prostheses, mobile communication, robust automatic speech recognition, and speaker recognition.
  • The human auditory system can easily separate one person's voice from another's: even in a noisy sound environment like a cocktail party, the human ear can focus on listening to the content of a target speaker. Therefore, the problem of speech separation is often called the "cocktail party problem".
  • the audio signal collected by the microphone may include background interference such as noise, other speakers’ voices, reverberation, etc.
  • directly performing downstream tasks such as speech recognition and voiceprint verification on such a signal will greatly reduce the accuracy of those tasks. Therefore, adding speech separation technology to the voice front end can separate the target speaker's voice from other background interference and thereby improve the robustness of downstream tasks, making speech separation an indispensable link in modern audio processing systems.
  • speech separation tasks are divided into three categories: when the interference is a noise signal, it is called speech enhancement; when the interference is other speakers, it is called multi-speaker separation (Speaker Separation); When the interference is the reflected wave of the target speaker's own voice, it is called De-reverberation.
  • single-channel (monaural) speech separation is a very difficult problem in the industry, because compared with dual-channel or multi-channel input signals, single-channel input signals lack the spatial cues that can be used to locate the sound source.
  • the embodiments of the present application provide an audio processing method that is applicable not only to dual-channel or multi-channel speech separation scenarios but also to single-channel speech separation scenarios, and can improve the accuracy of the audio processing process in various scenarios, especially scenarios outside the training scene.
  • FIG. 1 is a schematic diagram of an implementation environment of an audio signal processing method provided by an embodiment of the present application.
  • the implementation environment includes a terminal 101 and a server 102, and both the terminal 101 and the server 102 are electronic devices.
  • the terminal 101 is used to collect audio signals; an audio signal collection component, such as a recording element like a microphone, is installed on the terminal 101, or the terminal 101 directly downloads an audio file and decodes the audio file to obtain the audio signal.
  • in some embodiments, an audio signal processing component is installed on the terminal 101, so that the terminal 101 independently implements the audio signal processing method provided in the embodiments of this application; for example, the processing component is a DSP (Digital Signal Processor).
  • subsequent audio processing tasks include but are not limited to at least one of speech recognition, voiceprint verification, text-to-speech conversion, smart voice assistant response, or smart speaker response; the embodiment of this application does not specifically limit the type of audio processing task.
  • in other embodiments, after the terminal 101 collects the mixed audio signal through the collection component, it sends the mixed audio signal to the server 102, and the server performs audio processing on the mixed audio signal, for example by running the program codes of the coding network and the extraction network provided in the embodiments of this application to extract the generalized features of the target component in the mixed audio signal and perform subsequent audio processing tasks based on those generalized features.
  • the subsequent audio processing tasks include but are not limited to at least one of speech recognition, voiceprint verification, text-to-speech conversion, smart voice assistant response, or smart speaker response; the embodiment of the present application does not specifically limit the type of audio processing task.
  • the terminal 101 and the server 102 are connected through a wired network or a wireless network.
  • the server 102 is used to process audio signals, and the server 102 includes at least one of a server, multiple servers, a cloud computing platform, or a virtualization center.
  • in some embodiments, the server 102 is responsible for the main calculation work and the terminal 101 for the secondary calculation work; or the server 102 is responsible for the secondary calculation work and the terminal 101 for the main calculation work; or the terminal 101 and the server 102 use a distributed computing architecture for collaborative computing.
  • terminal 101 generally refers to one of multiple terminals.
  • the device type of the terminal 101 includes but is not limited to at least one of a vehicle-mounted terminal, a TV, a smart phone, a smart speaker, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • the following description takes a smart phone as an example of the terminal.
  • the number of the aforementioned terminals 101 may be larger or smaller; for example, there is only one terminal 101, or there are dozens or hundreds of terminals 101, or more. The embodiment of the present application does not limit the number or device types of the terminals 101.
  • the target component corresponds to the audio signal of the terminal user in the mixed audio signal.
  • the vehicle-mounted terminal collects the mixed audio signal and, based on the audio processing method provided in the embodiments of this application, extracts the generalized features of the target component in the mixed audio signal.
  • in this way, the user's voice can be separated from the mixed audio signal and the user's clean audio signal can be extracted; the clean audio signal removes not only the noise interference but also the voice interference of other speakers.
  • based on the above clean audio signal, the vehicle-mounted terminal can accurately parse and respond to the user's voice commands, which improves the audio processing accuracy of the vehicle-mounted terminal, improves the intelligence of the intelligent driving system, and optimizes the user experience.
  • 5G (Fifth Generation wireless systems): the fifth generation of mobile communication systems.
  • the target component corresponds to the audio signal of the end user in the mixed audio signal.
  • in the playback environment of the smart speaker, the collected signal is usually accompanied by background music interference.
  • the smart speaker collects the interference-carrying mixed audio signal and, based on the audio processing method provided by the embodiments of the present application, extracts the generalized features of the target component in the mixed audio signal; the user's voice can then be separated from the mixed audio signal and the user's clean audio signal extracted.
  • the clean audio signal removes not only the background music interference but also the voice interference of other speakers. Based on the above clean audio signal, the smart speaker can accurately parse and respond to the user's voice commands, which improves the audio processing accuracy of the smart speaker and optimizes the user experience.
  • the target component corresponds to the audio signal of the terminal user in the mixed audio signal.
  • the environment in which the user uses the mobile phone is usually unpredictable and complex. There are also various types of interference. For traditional deep models based on supervised learning, it is obviously impractical to collect labeled training data covering various scenarios.
  • the smart phone collects the interference-carrying mixed audio signal and, based on the audio processing method provided by the embodiments of this application, extracts the generalized features of the target component in the mixed audio signal. Regardless of the scene, the user's voice can be separated from the mixed audio signal and the user's clean audio signal extracted.
  • the clean audio signal removes not only the noise interference but also the voice interference of other speakers.
  • the user's voice commands can therefore be parsed and responded to accurately. For example, after the user triggers the text-to-speech conversion instruction and enters a voice with noise interference, the smart phone extracts the user's clean audio signal and can accurately convert the user's voice into the corresponding text, which greatly improves the accuracy and precision of the conversion process, improves the audio processing accuracy of the smart phone, and optimizes the user experience.
  • the audio processing method can be applied to various downstream tasks of audio processing.
  • as a front end, i.e., a preprocessing step that performs speech separation and feature extraction on mixed audio signals, it has high availability and portability; in addition, it performs well on the more difficult cocktail party problem and on single-channel speech separation, as detailed below.
  • Fig. 2 is a flowchart of an audio signal processing method provided by an embodiment of the present application. Referring to FIG. 2, this embodiment is applied to the terminal 101 in the foregoing embodiment, or applied to the server 102, or applied to the interaction process between the terminal 101 and the server 102; in this embodiment, application to the terminal 101 is taken as an example for description. This embodiment includes the following steps:
  • Step 201: the terminal obtains a mixed audio signal.
  • the mixed audio signal includes the audio signal of the target object, and the target object is any object capable of making sounds, such as at least one of a natural person, an avatar, an intelligent customer service agent, an intelligent voice assistant, or an AI robot.
  • for example, the speaker with the highest energy in the mixed audio signal is determined as the target object; the embodiment of the present application does not specifically limit the type of the target object.
  • the mixed audio signal also includes at least one of the noise signal or the audio signal of other objects.
  • the other object refers to any object other than the target object.
  • the noise signal includes at least one of white noise, pink noise, brown noise, blue noise, or purple noise; the embodiment of the present application does not specifically limit the type of the noise signal.
  • in some embodiments, an application program is installed on the terminal; the operating system responds to the audio collection instruction and calls the recording interface to drive the audio signal collection component (such as a microphone) to collect the mixed audio signal.
  • the terminal also selects a segment of audio from locally pre-stored audio as the mixed audio signal.
  • the terminal also downloads an audio file from the cloud, and parses the audio file to obtain a mixed audio signal.
  • the embodiment of the present application does not specifically limit the manner of obtaining the mixed audio signal.
  • Step 202: the terminal inputs the mixed audio signal into the coding network, and maps the mixed audio signal to the embedding space through the coding network to obtain the embedding feature of the mixed audio signal.
  • the encoding network nonlinearly maps the input signal (the mixed audio signal) from a low-dimensional space to a high-dimensional embedding space; that is, the vector representation of the input signal in the embedding space is the above embedded feature.
  • the terminal inputs the mixed audio signal into the encoding network (encoder) and embeds the mixed audio signal through the encoding network to obtain the embedding feature of the mixed audio signal, which is equivalent to encoding the mixed audio signal once to obtain a high-dimensional embedded feature with stronger expressive ability, so that the subsequent extraction of the generalized features of the target component is more accurate.
  • Step 202 is a process of embedding the mixed audio signal to obtain the embedding characteristics of the mixed audio signal.
  • the embedding process of the terminal through the encoding network is taken as an example for description.
  • the terminal directly maps the mixed audio signal to the embedding space to obtain the embedding feature of the mixed audio signal.
  • the embedding process is implemented by mapping, that is, in step 202, the terminal maps the mixed audio signal to the embedding space to obtain the embedding feature.
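  • As a hedged illustration (the application does not fix a specific encoder architecture at this point), the following PyTorch sketch shows a coding network that nonlinearly maps per-frame features of the mixed audio signal into a higher-dimensional embedding space; the class name EncoderNetwork and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class EncoderNetwork(nn.Module):
    """Maps per-frame features of the mixed audio signal (e.g. STFT magnitudes)
    into a higher-dimensional embedding space. Sizes are illustrative only."""

    def __init__(self, feature_dim: int = 257, embedding_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feature_dim) -> embeddings: (batch, time, embedding_dim)
        return self.net(frames)

# Minimal usage: 8 frames of a 257-bin spectrogram mapped to 512-d embeddings.
embeddings = EncoderNetwork()(torch.randn(1, 8, 257))
print(embeddings.shape)  # torch.Size([1, 8, 512])
```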
  • Step 203: the terminal inputs the embedded feature into the autoregressive model, and performs recursive weighting processing on the embedded feature through the autoregressive model to obtain the generalized feature of the target component in the mixed audio signal, where the target component corresponds to the audio signal of the target object in the mixed audio signal.
  • the mixed audio signal is usually in the form of an audio data stream, that is, the mixed audio signal includes at least one audio frame, correspondingly, the embedded feature of the mixed audio signal includes the embedded feature of at least one audio frame.
  • the above-mentioned autoregressive model is an LSTM (Long Short-Term Memory) network.
  • the LSTM network includes an input layer, a hidden layer, and an output layer.
  • the hidden layer includes a hierarchical structure. A plurality of memory units, each memory unit corresponding to an embedded feature of an audio frame of the mixed audio signal in the input layer.
  • any memory unit in any layer of the LSTM network when the memory unit receives the embedded feature of the audio frame and the output feature of the previous memory unit in this layer, the embedded feature of the audio frame and the previous memory unit Weighted transformation is performed on the output characteristics of the memory unit to obtain the output characteristics of the memory unit, and the output characteristics of the memory unit are respectively output to the next memory unit in this layer and the memory unit at the corresponding position in the next layer.
  • Each memory in each layer The units perform the above operations, which is equivalent to performing recursive weighting processing in the entire LSTM network.
  • the terminal inputs the embedded features of multiple audio frames in the mixed audio signal into multiple memory units in the first layer, and the multiple memory units in the first layer have embedded features of the multiple audio frames.
  • the unit outputs the generalized features of the target components in multiple audio frames.
  • the aforementioned autoregressive model is also a BLSTM (Bidirectional Long Short-Term Memory) network.
  • the BLSTM network includes a forward LSTM and a backward LSTM, and the BLSTM network also includes an input layer, a hidden layer, and an output layer.
  • the hidden layer includes multiple memory units in a hierarchical structure, each memory unit corresponding to the embedded feature of one audio frame of the mixed audio signal in the input layer.
  • each memory unit must perform not only the weighting operation corresponding to the forward LSTM but also the weighting operation corresponding to the backward LSTM.
  • for any memory unit in any layer of the BLSTM network: on the one hand, when the memory unit receives the embedded feature of an audio frame and the output feature of the previous memory unit in this layer, it performs a weighted transformation on the embedded feature of the audio frame and the output feature of the previous memory unit to obtain the output feature of the memory unit, and outputs that output feature to the next memory unit in this layer and to the memory unit at the corresponding position in the next layer; on the other hand, when the memory unit receives the embedded feature of the audio frame and the output feature of the next memory unit in this layer, it performs a weighted transformation on the embedded feature of the audio frame and the output feature of the next memory unit to obtain the output feature of the memory unit, and outputs that output feature to the previous memory unit in this layer and to the memory unit at the corresponding position in the next layer.
  • each memory unit in each layer performs the above operations, which is equivalent to performing recursive weighting processing in the entire BLSTM network.
  • after the terminal inputs the embedded features of multiple audio frames in the mixed audio signal into the multiple memory units in the first layer, the features are recursively weighted layer by layer, and the multiple memory units in the last layer output the generalized features of the target component in the multiple audio frames (a minimal sketch of both the LSTM and BLSTM variants follows).
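  • The sketch below illustrates the recursive weighting described above for both the LSTM (causal) and BLSTM (non-causal) variants, using a standard multi-layer nn.LSTM with an optional bidirectional flag; it is an assumed minimal implementation, not the exact extraction network of this application.

```python
import torch
import torch.nn as nn

class ExtractionNetwork(nn.Module):
    """Autoregressive extraction network: recursively weights the per-frame
    embedded features to produce a generalized feature for each audio frame.
    Unidirectional LSTM for the causal case, BLSTM when bidirectional=True."""

    def __init__(self, embedding_dim: int = 512, hidden_dim: int = 256,
                 num_layers: int = 2, bidirectional: bool = False):
        super().__init__()
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.proj = nn.Linear(out_dim, embedding_dim)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, time, embedding_dim)
        outputs, _ = self.rnn(embeddings)   # recursive weighting over audio frames
        return self.proj(outputs)           # one generalized feature per frame

# Causal (LSTM) and non-causal (BLSTM) variants on the same embeddings.
emb = torch.randn(1, 8, 512)
c_causal = ExtractionNetwork(bidirectional=False)(emb)
c_blstm = ExtractionNetwork(bidirectional=True)(emb)
print(c_causal.shape, c_blstm.shape)  # torch.Size([1, 8, 512]) for both
```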
  • This step 203 is a process of recursively weighting the embedded features to obtain the generalized features of the target component.
  • the process of obtaining the generalized features is implemented by an extraction network, that is, this step 203 is a process of inputting the embedded features into an extraction network, and performing generalized feature extraction on the embedded features through the extraction network to obtain generalized features of target components in the mixed audio signal.
  • in this step 203, taking the extraction network as an autoregressive model as an example, the terminal inputs the embedded feature into the extraction network (abstractor), and performs generalized feature extraction on the embedded feature through the extraction network to obtain the generalized features of the target component in the mixed audio signal.
  • in other embodiments, the extraction network is at least one of a recurrent model, a summary function, a CNN (Convolutional Neural Network), a TDNN (Time Delay Neural Network), or a gated convolutional neural network, or a combination of multiple different types of networks; the embodiment of the present application does not specifically limit the structure of the extraction network.
  • Step 204: the terminal performs audio signal processing based on the generalized features of the target component.
  • Audio signal processing has different meanings in different task scenarios.
  • the terminal performs text-to-speech conversion on the audio signal of the target object based on the generalized characteristics of the target component, and outputs text information corresponding to the audio signal of the target object.
  • the terminal inputs the generalized features of the target component into the speech recognition model, and the audio signal of the target object in the mixed audio signal is translated into the corresponding text information through the speech recognition model.
  • the generalized feature is thus well suited to the text-to-speech conversion scenario and provides high audio signal processing accuracy.
  • the terminal performs voiceprint recognition on the audio signal of the target object based on the generalized characteristics of the target component, outputs the voiceprint recognition result corresponding to the audio signal of the target object, and then performs voiceprint payment based on the voiceprint recognition result .
  • for example, the terminal inputs the generalized features of the target component into the voiceprint recognition model, verifies through the voiceprint recognition model whether the audio signal of the target object in the mixed audio signal is the user's own voice, and determines the corresponding voiceprint recognition result; if the voiceprint recognition result is "it is my voice", the subsequent payment operation is performed, otherwise a payment failure message is returned.
  • the generalized feature can thus be well applied to the voiceprint payment scenario and provides higher audio signal processing accuracy.
  • in the intelligent voice interaction scenario, the terminal generates a response voice corresponding to the audio signal of the target object based on the generalized features of the target component, and outputs the response voice.
  • for example, the terminal inputs the generalized features of the target component into the question-and-answer model; after the question-and-answer model extracts the semantic information of the audio signal of the target object in the mixed audio signal, the corresponding response speech is generated based on the semantic information and output to the user. The generalized features can be well applied to intelligent voice interaction scenarios and provide high audio signal processing accuracy.
  • the above are only a few exemplary audio processing scenarios; the generalized features of the target component are well suited to various audio processing scenarios. Depending on the audio processing scenario, the downstream audio processing tasks differ and the manner of audio signal processing also differs; the embodiment of the present application does not specifically limit the manner of audio signal processing.
  • in the method provided by the embodiments of this application, the embedded feature of the mixed audio signal is obtained by embedding the mixed audio signal, and generalized feature extraction is performed on the embedded feature to extract the generalized feature of the target component in the mixed audio signal, where the target component corresponds to the audio signal of the target object in the mixed audio signal; audio signal processing is then performed based on the generalized feature of the target component.
  • because the generalized feature of the target component is not tied to the sound features of a certain type of specific scene, it has good generalization ability and expression ability; therefore, audio signal processing based on the generalized feature of the target component can be applied well to different scenarios, which improves the robustness and generalization of the audio signal processing process and improves the accuracy of audio signal processing.
  • the above training method is applied to the terminal 101 or the server 102 in the above implementation environment.
  • the application to the server 102 is taken as an example for description.
  • the server 102 trains the coding network and the extraction network; after training, the server 102 sends the trained coding network and extraction network to the terminal 101, so that the terminal 101 can execute the audio signal processing method in the above-mentioned embodiment.
  • the server first obtains the unlabeled sample mixture signal, and then, based on the unlabeled sample mixture signal, performs collaborative iterative training on the teacher model and the student model to obtain the coding network and the extraction network used in the foregoing embodiment.
  • the unlabeled sample mixed signal is the training data that has not undergone any labeling.
  • the sample mixed signal also includes the audio signal of the target object.
  • the target object is any object that can speak, such as at least one of a natural person, an avatar, an intelligent customer service agent, an intelligent voice assistant, or an AI robot; for example, the speaker with the highest energy in the sample mixed signal is determined as the target object.
  • the embodiment of the present application does not specifically limit the type of the target object.
  • the sample mixed signal also includes at least one of noise signals or audio signals of other objects, and the noise signal includes at least one of white noise, pink noise, brown noise, blue noise or purple noise, The embodiment of the present application does not specifically limit the type of the noise signal.
  • the process in which the server obtains the sample mixed signal is similar to the process in which the terminal obtains the mixed audio signal in step 201, and will not be repeated here. It should be noted that the server also automatically generates an unlabeled sample mixed signal based on the voice generation model, and completes the subsequent training process based on the generated sample mixed signal.
  • Fig. 3 is a flowchart of a coding network and extraction network training method provided by an embodiment of this application. Referring to Fig. 3, the embodiment of this application takes any iteration as an example to describe how the teacher model and the student model are collaboratively and iteratively trained. This embodiment includes the following steps:
  • the server obtains the teacher model of this iteration process based on the student model of this iteration process and the teacher model of the previous iteration process.
  • the student model includes a first coding network and a first extraction network
  • the teacher model includes a second coding network and a second extraction network
  • the output of the first coding network is used as the input of the first extraction network
  • the output of the second coding network is used as the input of the second extraction network.
  • the teacher model in each iteration process is weighted by the teacher model in the previous iteration process and the student model in the current iteration process.
  • the server obtains the teacher model of this iteration process by performing the following sub-steps:
  • the server multiplies the parameter set of the teacher model in the last iteration process by the first smoothing coefficient to obtain the first parameter set.
  • in some embodiments, the server multiplies the parameter sets of the second coding network and the second extraction network in the teacher model of the previous iteration by the first smoothing coefficient, respectively, to obtain the first parameter sets corresponding to the second coding network and the second extraction network.
  • for example, the server multiplies the parameter set θ_(l-1)' of the second coding network and the parameter set ψ_(l-1)' of the second extraction network in the teacher model of the (l-1)-th iteration by the first smoothing coefficient α, to obtain the first parameter set α·θ_(l-1)' corresponding to the second coding network and the first parameter set α·ψ_(l-1)' corresponding to the second extraction network.
  • the server multiplies the parameter set of the student model of this iteration by the second smoothing coefficient to obtain a second parameter set, where the sum of the first smoothing coefficient and the second smoothing coefficient is 1.
  • the student model of this iteration process is obtained by parameter adjustment based on the student model of the previous iteration process.
  • in some embodiments, the server multiplies the parameter sets of the first coding network and the first extraction network in the student model of this iteration by the second smoothing coefficient, respectively, to obtain the second parameter sets corresponding to the first coding network and the first extraction network.
  • for example, the server multiplies the parameter set θ_l of the first coding network and the parameter set ψ_l of the first extraction network in the student model of the l-th iteration by the second smoothing coefficient 1-α, respectively, to obtain the second parameter set (1-α)·θ_l corresponding to the first coding network and the second parameter set (1-α)·ψ_l corresponding to the first extraction network.
  • the server determines the sum of the first parameter set and the second parameter set as the parameter set of the teacher model of the current iteration process.
  • in some embodiments, the server determines the sum of the first parameter set of the second coding network of the teacher model in the previous iteration and the second parameter set of the first coding network of the student model in this iteration as the parameter set of the second coding network of the teacher model in this iteration; similarly, the server determines the sum of the first parameter set of the second extraction network of the teacher model in the previous iteration and the second parameter set of the first extraction network of the student model in this iteration as the parameter set of the second extraction network of the teacher model in this iteration.
  • for example, the server determines the sum of the first parameter set α·θ_(l-1)' of the second coding network in the (l-1)-th iteration and the second parameter set (1-α)·θ_l of the first coding network in the l-th iteration as the parameter set θ_l' of the second coding network in the teacher model of the l-th iteration; that is, the parameter set θ_l' of the second coding network in the teacher model of the l-th iteration is expressed by the following formula:
  • θ_l' = α·θ_(l-1)' + (1-α)·θ_l
  • similarly, the server determines the sum of the first parameter set α·ψ_(l-1)' of the second extraction network in the (l-1)-th iteration and the second parameter set (1-α)·ψ_l of the first extraction network in the l-th iteration as the parameter set ψ_l' of the second extraction network in the teacher model of the l-th iteration; that is, the parameter set ψ_l' of the second extraction network in the teacher model of the l-th iteration is expressed by the following formula:
  • ψ_l' = α·ψ_(l-1)' + (1-α)·ψ_l
  • the server updates the parameters of the teacher model of the previous iteration process to obtain the teacher model of the current iteration process.
  • after obtaining the parameter set θ_l' of the second coding network and the parameter set ψ_l' of the second extraction network in the teacher model of the l-th iteration, the server updates the parameter set θ_(l-1)' of the second coding network in the teacher model of the (l-1)-th iteration to θ_l', and updates the parameter set ψ_(l-1)' of the second extraction network in the teacher model of the (l-1)-th iteration to ψ_l', thereby obtaining the teacher model of the l-th iteration.
  • in the above process, the server updates the parameter sets of the second coding network and the second extraction network in the teacher model based on an exponential moving average (EMA) method. For example, in the first iteration, the teacher model and the student model are initialized (or pre-trained) separately, and the parameters of the teacher model and the student model are kept the same in the first iteration.
  • in the second iteration, the teacher model is equivalent to a weighted average of the parameter sets of the teacher model of the first iteration (which has the same parameters as the student model) and the student model of the second iteration.
  • by analogy, the final teacher model is essentially equivalent to a weighted average of the student models over multiple historical iterations, which better reflects the performance of the student model over those historical iterations and is conducive to collaboratively training a student model with better robustness (a minimal EMA update sketch follows).
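  • A minimal sketch of the exponential moving average (EMA) update described above, assuming both the teacher and the student are PyTorch modules whose parameters are iterated in the same order; the smoothing coefficient value 0.999 is illustrative, not a value specified by this application.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, alpha: float = 0.999):
    """teacher_param <- alpha * teacher_param + (1 - alpha) * student_param.

    Applied to all parameters, i.e. to both the (second) coding network and the
    (second) extraction network held by the teacher model.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```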
  • the server inputs the unlabeled sample mixed signal into the teacher model and the student model of this iteration process respectively, and respectively outputs the teacher generalization feature and the student generalization feature of the target component in the sample mixed signal.
  • in some embodiments, the server inputs the unlabeled sample mixed signal into the first coding network in the student model of this iteration, embeds the sample mixed signal through the first coding network of this iteration to obtain the student embedding feature of the sample mixed signal, inputs the student embedding feature of the sample mixed signal into the first extraction network in the student model of this iteration, performs generalized feature extraction on it through the first extraction network of this iteration, and outputs the student generalized feature of the target component in the sample mixed signal, where the target component corresponds to the audio signal of the target object in the sample mixed signal. The above process is similar to steps 202-203 in the foregoing embodiment and will not be repeated here.
  • the server inputs the unlabeled sample mixed signal into the second coding network in the teacher model of this iterative process, and passes through the second coding network of this iterative process Embedding the sample mixed signal to obtain the teacher embedding feature of the sample mixed signal, input the teacher embedding feature of the sample mixed signal into the second extraction network in the teacher model of this iteration process, and pass the second extraction network of this iteration process Perform generalized feature extraction on the sample mixed signal, and output the teacher generalized feature of the target component in the sample mixed signal.
  • the target component corresponds to the audio signal of the target object in the sample mixed signal.
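  • A compact sketch of the two forward passes described above is shown below; the module names (student_encoder, student_extractor, teacher_encoder, teacher_extractor) are illustrative assumptions.

```python
import torch

def forward_teacher_student(x, student_encoder, student_extractor,
                            teacher_encoder, teacher_extractor):
    """Run the unlabeled sample mixed signal x through both models and return
    the student and teacher generalization features of the target component."""
    v_student = student_encoder(x)                # student embedding feature
    c_student = student_extractor(v_student)      # student generalization feature
    with torch.no_grad():                         # the teacher is not updated by backprop
        v_teacher = teacher_encoder(x)            # teacher embedding feature
        c_teacher = teacher_extractor(v_teacher)  # teacher generalization feature
    return c_student, c_teacher
```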
  • the first coding network E_θ is equivalent to a nonlinear mapping of the sample mixed signal x: it maps the sample mixed signal x from the input domain to a high-dimensional embedding space and outputs the student embedding feature v of the sample mixed signal; that is, the
  • function of the first coding network E_θ is equivalent to the following mapping relationship: E_θ: x → v.
  • the feature representation of the sample mixed signal x is, for example, the STFT feature, at least one of the log mel spectrum feature or the MFCC (Mel Frequency Cepstrum Coefficient) feature, or a combination of the log mel spectrum feature and the MFCC feature.
  • v represents the student embedding feature of the sample mixed signal
  • p represents the feature obtained after the first extraction network A_ψ weights the student embedding feature v
  • c represents the student generalization feature of the target component in the sample mixed signal
  • the student generalization feature c here is the feature obtained after the recursive weighted transformation between the input feature v and the output feature p of the first extraction network A_ψ.
  • the first extraction network is an autoregressive model.
  • discrete student generalization features can be constructed in time series based on local student embedding features.
  • the constructed student generalization feature may be short-term or long-term, and the embodiments of this application do not specifically limit the time resolution of the student generalization feature.
  • the above-mentioned autoregressive model adopts LSTM network.
  • a causal system is also called a nonanticipative system, that is, a system whose output cannot appear before the input arrives; in other words, the output of the system at a certain
  • time depends only on the input of the system at that time and before, and has nothing to do with the input after that time.
  • performing the one-way recursive weighting process through the LSTM network therefore avoids ignoring the timing relationship between preceding and following audio frames.
  • the above autoregressive model uses the BLSTM network.
  • a non-causal system means that the output at the current moment depends not only on the current input but also on future inputs, so the two-way
  • recursive weighting process performed by the BLSTM network can consider not only the role of each historical audio frame before a given audio frame, but also the role of each future audio frame after it, so as to better preserve the context information between audio frames; the choice between the two is illustrated in the sketch below.
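  • As a hedged illustration of the two choices of autoregressive model discussed above, the following sketch builds either a unidirectional LSTM (causal) or a BLSTM (non-causal) extraction backbone; the layer sizes are placeholders, not values prescribed by this application.

```python
import torch.nn as nn

def build_extraction_rnn(embed_dim=40, hidden=600, causal=True):
    """causal=True  -> unidirectional LSTM: frame t only sees frames <= t.
    causal=False -> BLSTM: frame t also sees future frames (non-causal).
    Dimensions are placeholder assumptions."""
    return nn.LSTM(input_size=embed_dim, hidden_size=hidden, num_layers=2,
                   batch_first=True, bidirectional=not causal)
```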
  • the student generalized feature c is represented by the following formula:
  • c_t ∈ c represents the student generalization feature of the t-th audio frame
  • v_t ∈ v represents the student embedding feature of the t-th audio frame
  • p_t ∈ p represents the output feature of the first extraction network for the t-th audio frame
  • ⊙ represents the point-wise multiplication operation between features
  • t (t ≥ 1) represents the frame index
  • f represents the frequency band index.
  • the numerator and denominator in the above formula are respectively multiplied by a binary threshold matrix w, which can help to reduce the interference of low-energy noise in the sample mixed signal (equivalent to a high-pass filter).
  • w_t ∈ w represents the binary threshold matrix of the t-th audio frame, and the meanings of the other symbols are the same as those in the previous formula, so they are not repeated here.
  • the binary threshold matrix w_{t,f} is represented by the following formula (a sketch is also given below):
  • X represents the training set formed by the sample mixed signals; that is, if the energy value of the time-frequency point X_{t,f} with frame index t and frequency band index f in the training set is less than one percent of the maximum energy value of the sample mixed signals in the training set, the binary threshold matrix w_{t,f} is set to 0, so that the interference of the sample mixed signal X_{t,f} (low-energy noise) is ignored when calculating the student generalization feature; otherwise, the binary threshold matrix w_{t,f} is set to 1, and audio components other than low-energy noise are calculated as usual.
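  • A minimal numpy sketch of the energy-threshold mask described above, assuming X is a magnitude spectrogram of shape (frames, bands); the one-percent ratio follows the text, while the function and variable names are illustrative.

```python
import numpy as np

def binary_threshold_matrix(X, ratio=0.01):
    """w[t, f] = 1 where X[t, f] exceeds `ratio` of the maximum energy over the
    training-set spectrogram X, else 0, so low-energy noise is ignored."""
    threshold = ratio * np.max(X)
    return (X > threshold).astype(np.float32)
```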
  • a student generalization feature is constructed for each audio frame.
  • this discrete student generalization feature c_t is more suitable for tasks that require high-resolution information in the time domain, such as spectrum reconstruction for the target speaker.
  • the first extraction network can also use a summary function or a recurrent model, so that a global student generalization feature can be constructed from the local student embedding features through the summary function or recurrent model.
  • the embodiment of this application does not specifically limit the type of the first extraction network.
  • the student generalization feature c is represented by the following formula:
  • c, v, p, w, t, f are consistent with the meanings of the same symbols in the above formulas; for the sake of brevity, the dimension index subscripts of c, v, p, w are omitted and are not described in detail here.
  • the student generalization feature c given in the above formula represents a long-term stable, global, "slow" (referring to low time-domain resolution) abstract representation, which is more suitable for tasks that only need low time-domain resolution information, such as characterizing the generalization features of the hidden target speaker; both the frame-level and the global variant are illustrated in the sketch below.
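  • The formulas themselves are not reproduced in this text. The sketch below shows one plausible reading that is consistent with the symbol definitions above: the frame-level feature is a w-masked, p-weighted average of v over frequency, and the global feature additionally averages over time. This structure is an assumption for illustration, not a verbatim reproduction of the formulas of this application.

```python
import numpy as np

def student_generalization_features(v, p, w, eps=1e-8):
    """v: (T, F, D) student embedding features; p: (T, F) extraction-network
    weights; w: (T, F) binary threshold matrix.

    Assumed structure (illustrative only):
      c_t (local)  = sum_f(w * p * v) / sum_f(w * p)        per frame t
      c   (global) = sum_{t,f}(w * p * v) / sum_{t,f}(w * p)
    """
    wp = (w * p)[..., None]                                        # (T, F, 1)
    num = wp * v                                                   # weighted embeddings
    c_local = num.sum(axis=1) / (wp.sum(axis=1) + eps)             # (T, D)
    c_global = num.sum(axis=(0, 1)) / (wp.sum(axis=(0, 1)) + eps)  # (D,)
    return c_local, c_global
```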
  • the server obtains the loss function value of the current iteration process based on at least one of the sample mixed signal, the teacher's generalization feature, or the student's generalization feature.
  • traditional methods of calculating the loss function value for an explicit input signal include NCE (Noise Contrastive Estimation), DIM (Deep InfoMax, deep mutual information maximization), and so on.
  • the embodiment of the present application provides an estimator for the student model, and the estimator is used to calculate the loss function value of the first coding network and the first extraction network in each iteration.
  • the aforementioned loss function value includes at least one of the mean squared error (MSE) between the teacher generalization feature and the student generalization feature, or the mutual information (MI) value between the sample mixed signal and the student generalization feature.
  • the server obtains the loss function value of this iteration process by performing the following sub-steps:
  • the server obtains the mean square error between the generalization characteristics of the teacher and the generalization characteristics of the students in this iterative process.
  • the server obtains the mutual information value between the sample mixed signal and the student generalization feature of this iteration process.
  • the student model includes the first coding network E_θ, the first extraction network A_ψ, and the calculation module T_ω, where θ is the parameter set of the first coding network E_θ, ψ is the parameter set of the first extraction network A_ψ,
  • and ω is the parameter set of the calculation module T_ω.
  • the expression of the above mapping relationship shows that the calculation module T_ω takes the student embedding feature v and the student generalization feature c as input and outputs a mutual information value located in the output domain.
  • the calculation module T_ω is modeled as the following formula:
  • T_ω = D_ω ∘ g ∘ (E_θ, A_ψ)
  • g represents a function that combines the student embedding feature output by E_θ and the student generalization feature output by A_ψ
  • D_ω represents a function for calculating the mutual information value MI.
  • the training sample is an unlabeled interfered sample mixed signal.
  • a time-frequency point x of this type of sample mixed signal is considered to be a superposition of the time-frequency points of the audio signal of the target object and the time-frequency points of the interference signal.
  • the embodiment of the present application proposes a loss function called ImNICE (InfoMax Noise-Interference Contractive Estimation, noise-interference comparison estimation based on mutual information maximization) for implicit input signals.
  • x represents a time-frequency point of the input signal that is predicted as a positive sample by the student model, and
  • x obeys the above distribution P, i.e., p(x, c)
  • x′ represents a time-frequency point of the input signal that is predicted as a negative sample by the student model, and
  • x′ obeys the above proposal distribution,
  • that is, x′ is sampled from the proposal distribution
  • E_P represents the mathematical expectation over the distribution P
  • c = A_ψ(E_θ(x)) represents the student generalization feature obtained after the first coding network E_θ and the first extraction network A_ψ act on the input signal.
  • f_θ(x, c) = exp(T_ω(E_θ(x), c)) represents the mutual information value between the time-frequency point x predicted as a positive sample by the student model in the input signal and the student generalization feature c.
  • f_θ(x′, c) = exp(T_ω(E_θ(x′), c)) represents the mutual information value between the time-frequency point x′ predicted as a negative sample by the student model in the input signal and the student generalization feature c.
  • the ImNICE loss function value is equivalent to an average cross-entropy loss; specifically, assuming that there is a distribution p and another distribution q, the average cross-entropy loss between p and q is H(p, q) = -E_p[log q].
  • f_θ(x, c) can be deduced to be proportional to the probability density ratio p(x|c)/p(x); in other words, f_θ(x, c) is regarded as a probability density ratio, which can be used to estimate the mutual information value between the input sample mixed signal x and the student generalization feature c; an InfoNCE-style sketch of such a loss is given below.
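  • The following PyTorch-style sketch shows a contrastive loss of the kind the ImNICE description suggests: the score of one positive time-frequency point is contrasted against the scores of K negative points, and the average cross-entropy of identifying the positive is minimized. The function and argument names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_infomax_loss(score_pos, score_neg):
    """score_pos: (B,) scores T(E(x), c) for positive time-frequency points.
    score_neg: (B, K) scores T(E(x'), c) for K negative points per positive.
    Loss = -E[log(exp(s+) / (exp(s+) + sum_k exp(s-_k)))]."""
    logits = torch.cat([score_pos.unsqueeze(1), score_neg], dim=1)  # (B, 1 + K)
    targets = torch.zeros(logits.size(0), dtype=torch.long,
                          device=logits.device)                     # positive is class 0
    return F.cross_entropy(logits, targets)
```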
  • for an explicit input signal x, the mutual information value between x and the student generalization feature c can be calculated according to the definition formula of mutual information, which is as follows:
  • I(x; c) = Σ_{x,c} p(x, c)·log(p(x|c)/p(x))
  • I(x; c) represents the mutual information value between the explicit input signal x and the student generalization feature c
  • p(x) is the probability distribution that the explicit input signal x obeys, and p(x|c) is the conditional probability distribution of the explicit input signal x given the student generalization feature c
  • p(x, c) is the joint distribution between the explicit input signal x and the student generalization feature c. Since p(x) or p(x|c) can be obtained directly for an explicit input signal, the mutual information value can be calculated directly from the above definition.
  • in the embodiment of this application, however, the sample mixed signal is an implicit input signal (this is determined by the nature of unsupervised learning), so p(x) or p(x|c) cannot be obtained directly.
  • the ImNICE loss function value avoids the problem of having to obtain p(x) or p(x|c) when calculating
  • the mutual information value, and only requires the introduction of a statistical constraint p(x, c).
  • this statistical constraint p(x, c) is the joint distribution between the sample mixed signal x and the student generalization feature c, and
  • the joint distribution p(x, c) is predicted by the teacher model.
  • specifically, the second extraction network A_ψ′ of the teacher model performs the following operations:
  • the server takes an intermediate prediction value p calculated by the second extraction network A_ψ′ as an estimated value of the joint distribution p(x, c).
  • the server determines at least one of the mean square error or the mutual information value as the loss function value of the current iteration process.
  • by obtaining the mean square error, the server can ensure consistency-based learning between the teacher model and the student model; if the training stop condition is not met, the parameters of the student model are updated through the following step 304.
  • the mean square error is the loss function of a typical reconstruction task; consistency learning based on the mean square error can ensure, to a certain extent, stable consistency between the student generalization feature learned in the middle and the audio signal of the target object.
  • by obtaining the mutual information value, the server can provide a calculation module for the unsupervised training process, which is used to obtain the mutual information value between the sample mixed signal and the student generalization feature in the student model; specifically, the probability density ratio f_θ(x, c) and the statistical constraint p(x, c) are introduced to estimate the mutual information value of the student model.
  • the training goal is to minimize the mean square error and maximize the mutual information value.
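  • As a hedged illustration of combining the two objectives (minimize the MSE, maximize the MI), one could form a total loss such as the sketch below; the trade-off weight lam is an assumption, not a value given in this application.

```python
import torch.nn.functional as F

def total_loss(c_student, c_teacher, mi_estimate, lam=1.0):
    """Consistency MSE between student and teacher generalization features,
    minus a weighted MI estimate (maximizing MI = minimizing its negative)."""
    mse = F.mse_loss(c_student, c_teacher)
    return mse - lam * mi_estimate
```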
  • the server adjusts the parameters of the student model to obtain the student model of the next iteration process, and executes the next iteration process based on the student model of the next iteration process.
  • the stop training condition is that the mean square error does not decrease during a first target number of consecutive iteration processes; or, the stop training condition is that the mean square error is less than or equal to a first target threshold and the mutual information value is greater than or equal to a second target threshold; or, the stop training condition is that the number of iterations reaches a second target number.
  • after the server obtains the loss function value of the student model in this iteration process, it judges whether the loss function value of this iteration process satisfies the stop training condition. If the stop training condition is not satisfied, the server updates the student model for the next iteration process based on the above step 304, returns to perform the above steps 3011 to 3014 to obtain the teacher model of the next iteration process, and performs operations similar to the above step 302 and step 303 based on the teacher model and the student model of the next iteration process to complete the next round of iterative training, and so on; the details are not repeated here. After multiple iterations, when the loss function value of a certain iteration process satisfies the stop training condition, the following step 305 is executed; a schematic outline of this loop is sketched below.
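  • A schematic outline of the outer training loop with the stopping conditions described above; the patience count, thresholds, and maximum number of iterations are illustrative parameter names standing in for the "first/second target" values of this application, and step_fn is an assumed helper that performs one iteration (EMA teacher update, forward passes, student update) and returns that iteration's MSE and MI.

```python
def train_loop(step_fn, max_iters=100000, patience=10,
               mse_threshold=1e-3, mi_threshold=1.0):
    """Run iterations until one of the three stop conditions is met."""
    best_mse, no_improve = float("inf"), 0
    for it in range(1, max_iters + 1):                   # stop condition 3: iteration budget
        mse, mi = step_fn()
        no_improve = 0 if mse < best_mse else no_improve + 1
        best_mse = min(best_mse, mse)
        if no_improve >= patience:                       # stop condition 1: MSE stopped decreasing
            break
        if mse <= mse_threshold and mi >= mi_threshold:  # stop condition 2: thresholds met
            break
    return it
```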
  • the server obtains the coding network and the extraction network based on the student model or the teacher model of this iterative process.
  • the server obtains the coding network and the extraction network based on the student model of this iteration process, that is, the server respectively determines the first coding network and the first extraction network in the student model of this iteration process as The coding network and the extraction network.
  • the server also obtains the coding network and the extraction network based on the teacher model of this iteration process, that is, the server separately determines the second coding network and the second extraction network in the teacher model of this iteration process Is the coding network and the extraction network.
  • the server performs collaborative iterative training on the teacher model and the student model based on the unlabeled sample mixed signal to obtain the coding network and the extraction network.
  • since the teacher model in each iteration process is obtained by weighting the teacher model of the previous iteration process and
  • the student model of this iteration process,
  • and both have achieved the training goal when training stops, either the teacher model or the student model of this iteration process can be selected to provide the coding network and the extraction network;
  • the embodiment of this application does not specifically limit whether the final coding network and extraction network are acquired based on the teacher model or the student model.
  • Fig. 4 is a schematic diagram of a method for training a coding network and an extraction network provided by an embodiment of the present application.
  • a set of unlabeled sample mixed signals 410 (speech) and
  • a group of interference signals 411 (noise) are input into the first coding network 420 of the student model (or the second coding network of the teacher model) to obtain
  • the student embedding feature 412 (or the teacher embedding feature) of the sample mixed signal 410 and the interference signal 411; the first extraction network 421 of the student model (or the second extraction network of the teacher model) then performs recursive weighting on
  • the student embedding feature 412 (or teacher embedding feature) to obtain the student generalization feature 413 (or teacher generalization feature) of the sample mixed signal and the interference signal, and the student embedding feature and the student generalization feature are used to compute the loss function value 414.
  • the loss function value 414 includes at least one of the mean square error or the ImNICE loss function value (mutual information value).
  • a heat map 415 of the mutual information value is also drawn.
  • the time-frequency points in the light-colored area are more likely to belong to the speech of the target speaker, while
  • the time-frequency points in the dark area are more likely to be noise or interference; that is to say, as the color goes from light to dark in the heat map, the probability that the time-frequency points at the corresponding positions are noise increases gradually, which makes it convenient to visually observe the thermal distribution obeyed by each time-frequency point.
  • x is used to represent the input signal of the first coding network 420 (that is, the sample mixed signal 410 and the interference signal 411), and
  • v is used to represent the output signal of the first coding network 420 (that is, the student embedding feature 412).
  • the input signal of the first extraction network 421 is also v
  • c is used to represent the output signal of the first extraction network 421 (that is, the student generalization feature 413)
  • the input signal of the calculation module 422 includes v and c, and R is used to represent the loss function value 414 output by the calculation module 422.
  • the sampling rate is set to 16KHz
  • the STFT window length is set to 25ms
  • the STFT window shift is set to 10ms
  • the number of STFT frequency bands is set to 257 .
  • the initial learning rate is 0.0001
  • the weight reduction coefficient of the learning rate is 0.8.
  • the number of nodes in the output layer of the first coding network is set to 40, and the number of randomly down-sampled frames for each training corpus is 32.
  • the number of negative samples corresponding to each positive sample is 63, and the judgment threshold of the prediction probability p(x, c) of a positive sample is 0.5.
  • the first coding network is a 4-layer BLSTM network,
  • each hidden layer has 600 nodes, and
  • the output layer is a fully connected layer, which maps the 600-dimensional hidden-layer
  • vector (output feature) output by the last hidden layer into a 275*40-dimensional high-dimensional embedding space v to obtain a 275*40-dimensional embedded feature.
  • the 275*40-dimensional embedded feature is input to the first extraction network, and the first extraction network contains a fully connected layer and a 2-layer BLSTM network; a configuration sketch is given below.
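  • A hedged PyTorch sketch of a coding network and extraction network in the spirit of the description above (4-layer BLSTM with 600 hidden units per direction, a fully connected output layer producing a 40-dimensional embedding per frequency band, and an extraction network made of a fully connected layer plus a 2-layer BLSTM). The exact wiring and the use of 257 STFT bands as the input size are assumptions for illustration.

```python
import torch.nn as nn

class CodingNetwork(nn.Module):
    """4-layer BLSTM followed by a fully connected layer that maps each frame
    to an (n_bands * embed_dim)-dimensional embedding."""
    def __init__(self, n_bands=257, embed_dim=40, hidden=600):
        super().__init__()
        self.blstm = nn.LSTM(n_bands, hidden, num_layers=4,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_bands * embed_dim)

    def forward(self, x):            # x: (batch, frames, n_bands)
        h, _ = self.blstm(x)
        return self.fc(h)            # (batch, frames, n_bands * embed_dim)

class ExtractionNetwork(nn.Module):
    """Fully connected layer followed by a 2-layer BLSTM, as described above."""
    def __init__(self, in_dim=257 * 40, hidden=600):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.blstm = nn.LSTM(hidden, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)

    def forward(self, v):            # v: (batch, frames, in_dim)
        h, _ = self.blstm(self.fc(v))
        return h
```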
  • T_ω(v, c) represents the calculation module, where
  • v represents the embedded feature,
  • v^T represents the transposed vector of the embedded feature,
  • W represents the weighting matrix, and
  • c represents the generalization feature; a hedged sketch of such a bilinear scoring module is given below.
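  • Given the symbols above (v, its transpose, a weighting matrix, and c), a natural reading of the calculation module is a bilinear score T_ω(v, c) = v^T·W·c. The sketch below implements that reading and should be taken as an assumption rather than the literal formula of this application.

```python
import torch
import torch.nn as nn

class BilinearCritic(nn.Module):
    """Assumed bilinear calculation module: T(v, c) = v^T W c."""
    def __init__(self, embed_dim, context_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(embed_dim, context_dim) * 0.01)

    def forward(self, v, c):
        # v: (batch, embed_dim), c: (batch, context_dim) -> (batch,) scores
        return torch.einsum('bd,de,be->b', v, self.W, c)
```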
  • the hyperparameter selection and the model structure are only an exemplary description.
  • the number of levels of the BLSTM network in the first coding network or the first extraction network is adjusted and changed according to the requirements of complexity and performance.
  • the network structure of the first coding network or the first extraction network can also be adjusted and changed, for example by using at least one of an LSTM network, a CNN, a TDNN, or a gated CNN.
  • the network structure of the first coding network or the first extraction network is also expanded or simplified.
  • the method provided by the embodiment of this application performs collaborative iterative training on the teacher model and the student model based on the unlabeled sample mixed signal to obtain the coding network and the extraction network.
  • the teacher model in each iteration process is obtained by weighting
  • the teacher model of the previous iteration process and the student model of this iteration process.
  • through the collaborative iterative training and consistency learning of the teacher model and the student model, a generalizable hidden signal representation,
  • that is, the generalization feature of the target component, can be effectively learned from the unlabeled, interfered sample mixed signal, which can be applied to a variety of industrial application scenarios and helps to improve the accuracy of the audio processing process.
  • labeled training data (referring to training samples containing clean audio signals of the target object) often only cover a small part of the application scenario, and a large amount of data is unlabeled
  • a novel unsupervised loss function and a novel training method based on unsupervised learning are proposed;
  • the training method can exploit a large amount of unlabeled training data without manually labeling the unlabeled training data, which saves labor costs and improves the efficiency of obtaining training data.
  • supervised learning that only relies on labeled data has the problems of poor robustness and poor generalization.
  • a speech representation learned using only supervised learning for a certain type of interfering speech environment is often not applicable to another type of interfering speech environment.
  • the unsupervised system can extract the generalized features of the target component.
  • the above-mentioned generalization features are not extracted for a certain type of interference; rather, they are
  • features with high robustness and generalizability extracted, without annotation data, in intricate noise environments, and can therefore be applied to most audio processing scenarios.
  • DANet needs the embeddings (embedding vector) of the database as input during the training phase, so there is a problem of mismatch between the embeddings centers between training and testing.
  • ADANet alleviates the above-mentioned embedding center mismatch problem by introducing the PIT (Permutation Invariant Training) method; the PIT method determines the correct output permutation by calculating the lowest value of the selected objective function among all possible input permutations.
  • the PIT method will naturally bring a lot of computational complexity, resulting in a large increase in the computational cost when extracting features.
  • ODANet estimates an abstract representation for each audio frame, and the estimated abstract representation is used to calculate the mask of subsequent audio frames, and so on.
  • ODANet tends to cause unstable target speaker tracking and mask estimation.
  • in order to improve the stability of its performance, it is necessary to introduce an additional, expert-defined dynamic weighting function, and the context window length also needs to be carefully adjusted and selected.
  • in the embodiment of this application, no additional PIT processing is required, so a small calculation cost can be ensured; no speaker tracking mechanism is required; and no expert-defined processing and adjustment are required, so the training cost of the coding network and the extraction network can be greatly reduced.
  • moreover, based on unlabeled training data, the generalization features of the hidden target component (usually the target speaker) can be learned automatically, and audio processing based on the above generalization features can effectively solve the cocktail party problem; it also performs well on the more difficult single-channel speech separation task, can be applied to various industrial scenarios, and has high audio processing accuracy.
  • Fig. 5 is a schematic structural diagram of an audio signal processing device provided by an embodiment of the present application. Please refer to Fig. 5, the device includes:
  • the embedding processing module 501 is configured to perform embedding processing on the mixed audio signal to obtain the embedding characteristics of the mixed audio signal;
  • the feature extraction module 502 is configured to perform generalized feature extraction on the embedded feature to obtain the generalized feature of the target component in the mixed audio signal, where the target component corresponds to the audio signal of the target object in the mixed audio signal;
  • the signal processing module 503 is configured to perform audio signal processing based on the generalization feature of the target component.
  • the device provided by the embodiment of the present application obtains the embedding feature of the mixed audio signal by performing embedding processing on the mixed audio signal, and performs generalized feature extraction on the embedding feature, so that the generalization feature of the target component in the mixed audio signal can be extracted,
  • where the target component corresponds to the audio signal of the target object in the mixed audio signal, and audio signal processing is performed based on the generalization feature of the target component. Since the generalization feature of the target component is not directed at a certain type of sound feature in a specific scene, it has good generalization ability and expressive ability; therefore, when audio signal processing is performed based on the generalization feature of the target component, it can be well adapted to different scenarios, which improves the robustness and generalization of the audio signal processing process and improves the accuracy of audio signal processing.
  • the embedding processing module 501 is configured to input the mixed audio signal into an encoding network, and perform embedding processing on the mixed audio signal through the encoding network to obtain the embedding characteristics of the mixed audio signal;
  • the feature extraction module 502 is configured to input the embedded feature into an extraction network, and perform generalized feature extraction on the embedded feature through the extraction network to obtain the generalized feature of the target component in the mixed audio signal, and the target component corresponds to the mixed audio The audio signal of the target object in the signal.
  • the embedded processing module 501 is used to:
  • the mixed audio signal is mapped to the embedding space to obtain the embedding feature.
  • the feature extraction module 502 is configured to perform recursive weighting processing on the embedded feature to obtain the generalized feature of the target component.
  • the extraction network is an autoregressive model
  • the feature extraction module 502 is used to:
  • the embedded feature is input into the autoregressive model, and the embedded feature is recursively weighted through the autoregressive model to obtain the generalized feature of the target component.
  • the device further includes:
  • the training module is used to perform collaborative iterative training on the teacher model and the student model based on the unlabeled sample mixed signal to obtain the coding network and the extraction network.
  • the student model includes a first coding network and a first extraction network.
  • the teacher model includes a second coding network and a second extraction network. The output of the first coding network is used as the input of the first extraction network, and the output of the second coding network is used as the input of the second extraction network.
  • the teacher model in each iteration process is obtained by weighting the teacher model of the previous iteration process and the student model of this iteration process.
  • the training module includes:
  • the first obtaining unit is used to obtain the teacher model of this iteration process based on the student model of this iteration process and the teacher model of the previous iteration process in any iteration process;
  • the output unit is used to input the unlabeled sample mixed signal into the teacher model and the student model of this iteration process respectively, and respectively output the teacher generalization feature and the student generalization feature of the target component in the sample mixed signal;
  • the second acquiring unit is configured to acquire the loss function value of this iteration process based on at least one of the sample mixed signal, the teacher generalization feature, or the student generalization feature;
  • the parameter adjustment unit is used to adjust the parameters of the student model if the loss function value does not meet the training stop condition to obtain the student model of the next iteration process, and execute the next iteration process based on the student model of the next iteration process;
  • the third acquiring unit is used for acquiring the coding network and the extraction network based on the student model or teacher model of this iteration process if the loss function value meets the training stop condition.
  • the second acquiring unit is used to:
  • the stop training condition is that the mean square error does not decrease during the iteration process of the first target number of consecutive times; or,
  • the training stop condition is that the mean square error is less than or equal to the first target threshold and the mutual information value is greater than or equal to the second target threshold; or,
  • the training stop condition is that the number of iterations reaches the second target number.
  • the first acquiring unit is used to:
  • the third acquiring unit is used to:
  • the second coding network and the second extraction network in the teacher model of this iterative process are respectively determined as the coding network and the extraction network.
  • the signal processing module 503 is used to:
  • a response voice corresponding to the audio signal of the target object is generated, and the response voice is output.
  • when the audio signal processing device provided in the above embodiment processes audio signals,
  • the division into the above functional modules is only used as an example;
  • in practical applications, the above functions can be allocated to different functional modules as needed, i.e.,
  • the internal structure of the electronic device is divided into different functional modules to complete all or part of the functions described above.
  • the audio signal processing device provided in the foregoing embodiment and the audio signal processing method embodiment belong to the same concept, and the specific implementation process is detailed in the audio signal processing method embodiment, which will not be repeated here.
  • the electronic device involved in the embodiment of the present application is a terminal.
  • FIG. 6 is a schematic structural diagram of a terminal provided in an embodiment of the present application. Please refer to FIG. 6.
  • the terminal 600 is, for example: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • the terminal 600 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 600 includes a processor 601 and a memory 602.
  • the processor 601 includes one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 601 adopts at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array, Programmable Logic Array).
  • the processor 601 also includes a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 601 is integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is responsible for rendering and drawing content that needs to be displayed on the display screen.
  • the processor 601 further includes an AI (Artificial Intelligence) processor, and the AI processor is used to process computing operations related to machine learning.
  • AI Artificial Intelligence
  • the memory 602 includes one or more computer-readable storage media, which are non-transitory.
  • the memory 602 may also include a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 602 is used to store at least one program code, and the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the mixed audio signal is mapped to the embedding space to obtain the embedding feature.
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the embedded feature is input to an extraction network, and generalized feature extraction is performed on the embedded feature through the extraction network to obtain the generalized feature of the target component in the mixed audio signal.
  • the extraction network is an autoregressive model
  • the at least one program code is used to be executed by the processor 601 to implement the following steps: input the embedded features into the autoregressive model, and pass the autoregressive The model performs recursive weighting processing on the embedded feature to obtain the generalized feature of the target component.
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the teacher model and the student model are collaboratively iteratively trained to obtain the coding network and the extraction network, wherein the student model includes a first coding network and a first extraction network, and the teacher The model includes a second coding network and a second extraction network.
  • the output of the first coding network is used as the input of the first extraction network
  • the output of the second coding network is used as the input of the second extraction network.
  • the teacher model in the iterative process is weighted by the teacher model of the previous iterative process and the student model of the current iterative process.
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • any iteration process based on the student model of this iteration process and the teacher model of the previous iteration process, obtain the teacher model of this iteration process;
  • the coding network and the extraction network are acquired based on the student model or teacher model of this iterative process.
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the stop training condition is that the mean square error does not decrease during the successive iterations of the first target number of times; or,
  • the training stop condition is that the mean square error is less than or equal to the first target threshold and the mutual information value is greater than or equal to the second target threshold; or,
  • the training stop condition is that the number of iterations reaches the second target number.
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • the second coding network and the second extraction network in the teacher model of this iterative process are respectively determined as the coding network and the extraction network.
  • the at least one program code is used to be executed by the processor 601 to implement the following steps:
  • a response voice corresponding to the audio signal of the target object is generated, and the response voice is output.
  • the terminal 600 may optionally further include: a peripheral device interface 603 and at least one peripheral device.
  • the processor 601, the memory 602, and the peripheral device interface 603 are connected by a bus or signal line.
  • Each peripheral device is connected to the peripheral device interface 603 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 604, a touch display screen 605, a camera component 606, an audio circuit 607, a positioning component 608, and a power supply 609.
  • the peripheral device interface 603 can be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 601 and the memory 602.
  • in some embodiments, the processor 601, the memory 602, and the peripheral device interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral device interface 603 are implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 604 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 604 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on.
  • the radio frequency circuit 604 communicates with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity, wireless fidelity) networks.
  • the radio frequency circuit 604 further includes a circuit related to NFC (Near Field Communication), which is not limited in this application.
  • the display screen 605 is used to display a UI (User Interface, user interface).
  • the UI includes graphics, text, icons, videos, and any combination of them.
  • the display screen 605 also has the ability to collect touch signals on or above the surface of the display screen 605.
  • the touch signal is input to the processor 601 as a control signal for processing.
  • the display screen 605 is also used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • in some embodiments, there is one display screen 605, which is provided on the front panel of the terminal 600; in other embodiments, there are at least two display screens 605, which are respectively provided on different surfaces of the terminal 600 or adopt a folding design;
  • the display screen 605 is a flexible display screen, which is arranged on the curved surface or the folding surface of the terminal 600.
  • the display screen 605 is also set as a non-rectangular irregular pattern, that is, a special-shaped screen.
  • the display screen 605 is made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the camera assembly 606 is used to capture images or videos.
  • the camera assembly 606 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • the camera assembly 606 also includes a flash.
  • the flash is a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, which is used for light compensation under different color temperatures.
  • the audio circuit 607 includes a microphone and a speaker.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 601 for processing, or input to the radio frequency circuit 604 to implement voice communication. For the purpose of stereo collection or noise reduction, there are multiple microphones, which are respectively set in different parts of the terminal 600.
  • the microphone is also an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert the electrical signal from the processor 601 or the radio frequency circuit 604 into sound waves.
  • the speaker is a traditional thin-film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 607 also includes a headphone jack.
  • the positioning component 608 is used to locate the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service, location-based service).
  • the positioning component 608 is a positioning component based on the GPS (Global Positioning System) of the United States, the Beidou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 609 is used to supply power to various components in the terminal 600.
  • the power source 609 is alternating current, direct current, disposable batteries, or rechargeable batteries.
  • the rechargeable battery supports wired charging or wireless charging.
  • the rechargeable battery is also used to support fast charging technology.
  • the terminal 600 further includes one or more sensors 610.
  • the one or more sensors 610 include, but are not limited to: an acceleration sensor 611, a gyroscope sensor 612, a pressure sensor 613, a fingerprint sensor 614, an optical sensor 615, and a proximity sensor 616.
  • the acceleration sensor 611 detects the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 600. For example, the acceleration sensor 611 is used to detect the components of gravitational acceleration on three coordinate axes.
  • the processor 601 controls the touch screen 605 to display the user interface in a horizontal view or a vertical view according to the gravity acceleration signal collected by the acceleration sensor 611.
  • the acceleration sensor 611 is also used for the collection of game or user motion data.
  • the gyroscope sensor 612 detects the body direction and rotation angle of the terminal 600, and the gyroscope sensor 612 and the acceleration sensor 611 cooperate to collect the user's 3D actions on the terminal 600.
  • the processor 601 implements the following functions according to the data collected by the gyroscope sensor 612: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 613 is arranged on the side frame of the terminal 600 and/or the lower layer of the touch screen 605.
  • the processor 601 performs left and right hand recognition or quick operation according to the holding signal collected by the pressure sensor 613.
  • the processor 601 controls the operability controls on the UI interface according to the user's pressure operation on the touch display screen 605.
  • the operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
  • the fingerprint sensor 614 is used to collect the user's fingerprint.
  • the processor 601 can identify the user's identity based on the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 can identify the user's identity based on the collected fingerprints. When it is recognized that the user's identity is a trusted identity, the processor 601 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings.
  • the fingerprint sensor 614 is provided on the front, back or side of the terminal 600. When a physical button or a manufacturer logo is provided on the terminal 600, the fingerprint sensor 614 is integrated with the physical button or the manufacturer logo.
  • the optical sensor 615 is used to collect the ambient light intensity.
  • the processor 601 controls the display brightness of the touch screen 605 according to the ambient light intensity collected by the optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 605 is decreased.
  • the processor 601 also dynamically adjusts the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
  • the proximity sensor 616 also called a distance sensor, is usually arranged on the front panel of the terminal 600.
  • the proximity sensor 616 is used to collect the distance between the user and the front of the terminal 600.
  • when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 gradually decreases, the processor 601 controls the touch display screen 605 to switch from the bright-screen state to the off-screen state; when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 gradually increases, the processor 601 controls the touch display screen 605 to switch from the off-screen state to the bright-screen state.
  • the structure shown in FIG. 6 does not constitute a limitation on the terminal 600, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • the electronic device involved in the embodiment of the present application is a server.
  • FIG. 7 is a schematic structural diagram of a server provided by an embodiment of the present application. Please refer to FIG. 7.
  • the server 700 may vary greatly due to differences in configuration or performance, and includes one or more processors (Central Processing Units, CPU) 701 and one or more memories 702, where at least one piece of program code is stored in the memory 702, and the at least one piece of program code is loaded and executed by
  • the processor 701 to implement the following steps:
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the mixed audio signal is mapped to the embedding space to obtain the embedding feature.
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the embedded feature is input to an extraction network, and generalized feature extraction is performed on the embedded feature through the extraction network to obtain the generalized feature of the target component in the mixed audio signal.
  • the extraction network is an autoregressive model
  • the at least one program code is used to be executed by the processor 701 to implement the following steps: input the embedded features into the autoregressive model, and pass the autoregressive model.
  • the model performs recursive weighting processing on the embedded feature to obtain the generalized feature of the target component.
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the teacher model and the student model are collaboratively iteratively trained to obtain the coding network and the extraction network, wherein the student model includes a first coding network and a first extraction network, and the teacher The model includes a second coding network and a second extraction network.
  • the output of the first coding network is used as the input of the first extraction network
  • the output of the second coding network is used as the input of the second extraction network.
  • the teacher model in the iterative process is weighted by the teacher model of the previous iterative process and the student model of the current iterative process.
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • any iteration process based on the student model of this iteration process and the teacher model of the previous iteration process, obtain the teacher model of this iteration process;
  • the coding network and the extraction network are acquired based on the student model or teacher model of this iterative process.
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the stop training condition is that the mean square error does not decrease during the successive iterations of the first target number of times; or,
  • the training stop condition is that the mean square error is less than or equal to the first target threshold and the mutual information value is greater than or equal to the second target threshold; or,
  • the training stop condition is that the number of iterations reaches the second target number.
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • the second coding network and the second extraction network in the teacher model of this iterative process are respectively determined as the coding network and the extraction network.
  • the at least one program code is used to be executed by the processor 701 to implement the following steps:
  • a response voice corresponding to the audio signal of the target object is generated, and the response voice is output.
  • the server 700 also has components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output.
  • the server 700 also includes other components for implementing device functions, which will not be described in detail here.
  • a computer-readable storage medium such as a memory including at least one piece of program code, and the above-mentioned at least one piece of program code can be executed by a processor in an electronic device to complete the following steps:
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the mixed audio signal is mapped to the embedding space to obtain the embedding feature.
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the embedded feature is input to an extraction network, and generalized feature extraction is performed on the embedded feature through the extraction network to obtain the generalized feature of the target component in the mixed audio signal.
  • the extraction network is an autoregressive model
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps: input the embedded features into the autoregressive model, and pass the The autoregressive model performs recursive weighting processing on the embedded feature to obtain the generalized feature of the target component.
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the teacher model and the student model are collaboratively iteratively trained to obtain the coding network and the extraction network, wherein the student model includes a first coding network and a first extraction network, and the teacher The model includes a second coding network and a second extraction network.
  • the output of the first coding network is used as the input of the first extraction network
  • the output of the second coding network is used as the input of the second extraction network.
  • the teacher model in the iterative process is weighted by the teacher model of the previous iterative process and the student model of the current iterative process.
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • any iteration process based on the student model of this iteration process and the teacher model of the previous iteration process, obtain the teacher model of this iteration process;
  • the coding network and the extraction network are acquired based on the student model or teacher model of this iterative process.
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the stop training condition is that the mean square error does not decrease during the successive iterations of the first target number of times; or,
  • the training stop condition is that the mean square error is less than or equal to the first target threshold and the mutual information value is greater than or equal to the second target threshold; or,
  • the training stop condition is that the number of iterations reaches the second target number.
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • the second coding network and the second extraction network in the teacher model of this iterative process are respectively determined as the coding network and the extraction network.
  • the at least one program code is used to be executed by a processor in an electronic device to implement the following steps:
  • a response voice corresponding to the audio signal of the target object is generated, and the response voice is output.
  • the computer-readable storage medium is, for example, a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Image Analysis (AREA)

Abstract

An audio signal processing method and apparatus, an electronic device, and a storage medium, belonging to the technical field of signal processing. Embedding processing is performed on a mixed audio signal to obtain an embedded feature of the mixed audio signal, and generalized feature extraction is performed on the embedded feature, so that a generalized feature of a target component in the mixed audio signal can be extracted. Because the generalized feature of the target component has good generalization ability and expressive ability and adapts well to different scenarios, the robustness and generalization of the audio signal processing process are improved, and the accuracy of audio signal processing is improved.

Description

音频信号处理方法、装置、电子设备及存储介质
本申请要求于2020年01月02日提交的申请号为2020100016363、发明名称为“音频信号处理方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及信号处理技术领域,特别涉及一种音频信号处理方法、装置、电子设备及存储介质。
背景技术
在信号处理领域中,“鸡尾酒会问题”是一个热门研究课题:在给定混合音频信号(说话人为两人或两人以上)的情况下,如何分离出鸡尾酒会中同时说话的每个人的独立音频信号?针对上述鸡尾酒会问题的解决方案称为语音分离技术。目前,通常是基于有监督学习的深度模型来进行语音分离,例如,基于有监督学习的深度模型包括DPCL(Deep Clustering,深度聚类网络)、DANet(Deep Attractor Network,深度吸引子网络)、ADANet(Anchored Deep Attractor Network,锚定深度吸引子网络)、ODANet(Online Deep Attractor Network,在线深度吸引子网络)等。
在上述过程中,有监督学习是指在获取标注后的训练数据之后,针对某一类特定场景训练出在对应场景下进行语音分离的深度模型。在实际应用中针对训练时没有标注过的音频信号,基于有监督学习的深度模型的鲁棒性和泛化性较差,导致在训练场景以外的情况下,基于有监督学习的深度模型在处理音频信号时准确性较差。
发明内容
本申请实施例提供了一种音频信号处理方法、装置、电子设备及存储介质,能够提升音频信号处理过程的准确性。技术方案如下:
一方面,提供了一种音频信号处理方法,应用于电子设备,该方法包括:
对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所述目标分量对应于所述混合音频信号中目标对象的音频信号;
基于所述目标分量的泛化特征进行音频信号处理。
在一种可能实施方式中,所述对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征包括:
将所述混合音频信号映射至嵌入空间,得到所述嵌入特征。
在一种可能实施方式中,所述对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征包括:
对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一种可能实施方式中,所述对混合音频信号进行嵌入处理,得到所述混合音频信号的 嵌入特征包括:
将混合音频信号输入编码网络,通过所述编码网络对所述混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
所述对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征包括:
将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征。
在一种可能实施方式中,所述萃取网络为自回归模型,所述将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征包括:
将所述嵌入特征输入所述自回归模型,通过所述自回归模型对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一种可能实施方式中,所述方法还包括:
基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络,其中,所述学生模型包括第一编码网络和第一萃取网络,所述教师模型包括第二编码网络和第二萃取网络,所述第一编码网络的输出作为所述第一萃取网络的输入,所述第二编码网络的输出作为所述第二萃取网络的输入,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。
在一种可能实施方式中,所述基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络包括:
在任一次迭代过程中,基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型;
将所述未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出所述样本混合信号中目标分量的教师泛化特征以及学生泛化特征;
基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值;
若所述损失函数值不符合停止训练条件,对所述学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于所述下一次迭代过程的学生模型执行下一次迭代过程;
若所述损失函数值符合所述停止训练条件,基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络。
在一种可能实施方式中,所述基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值包括:
获取所述教师泛化特征以及所述学生泛化特征之间的均方误差;
获取所述样本混合信号与所述学生泛化特征之间的互信息值;
将所述均方误差或者所述互信息值中至少一项确定为本次迭代过程的损失函数值。
在一种可能实施方式中,所述停止训练条件为在连续第一目标次数的迭代过程中所述均方误差没有减小;或,
所述停止训练条件为所述均方误差小于或等于第一目标阈值且所述互信息值大于或等于第二目标阈值;或,
所述停止训练条件为迭代次数到达第二目标次数。
在一种可能实施方式中,所述基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型包括:
将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集;
将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,所述第一平滑系数与所述第二平滑系数相加所得的数值为1;
将所述第一参数集与所述第二参数集之和确定为本次迭代过程的教师模型的参数集;
基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
在一种可能实施方式中,所述基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络包括:
分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为所述编码网络和所述萃取网络;或,
分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为所述编码网络和所述萃取网络。
在一种可能实施方式中,所述基于所述目标分量的泛化特征进行音频信号处理包括:
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行文语转换,输出所述目标对象的音频信号对应的文本信息;或,
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行声纹识别,输出所述目标对象的音频信号对应的声纹识别结果;或,
基于所述目标分量的泛化特征,生成所述目标对象的音频信号对应的应答语音,输出所述应答语音。
一方面,提供了一种音频信号处理装置,该装置包括:
嵌入处理模块,用于对所述混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
特征提取模块,用于对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所述目标分量对应于所述混合音频信号中目标对象的音频信号;
信号处理模块,用于基于所述目标分量的泛化特征进行音频信号处理。
在一种可能实施方式中,嵌入处理模块,用于将混合音频信号输入编码网络,通过所述编码网络对所述混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
特征提取模块,用于将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所述目标分量对应于所述混合音频信号中目标对象的音频信号。
在一种可能实施方式中,所述嵌入处理模块用于:
将所述混合音频信号映射至嵌入空间,得到所述嵌入特征。
在一种可能实施方式中,特征提取模块用于对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一种可能实施方式中,所述萃取网络为自回归模型,所述特征提取模块用于:
将所述嵌入特征输入所述自回归模型,通过所述自回归模型对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一种可能实施方式中,所述装置还包括:
训练模块,用于基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络,其中,所述学生模型包括第一编码网络和第一萃取网络,所述教师模型包括第二编码网络和第二萃取网络,所述第一编码网络的输出作为所述第一萃取网络的输入,所述第二编码网络的输出作为所述第二萃取网络的输入,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。
在一种可能实施方式中,所述训练模块包括:
第一获取单元,用于在任一次迭代过程中,基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型;
输出单元,用于将所述未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出所述样本混合信号中目标分量的教师泛化特征以及学生泛化特征;
第二获取单元,用于基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值;
参数调整单元,用于若所述损失函数值不符合停止训练条件,对所述学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于所述下一次迭代过程的学生模型执行下一次迭代过程;
第三获取单元,用于若所述损失函数值符合所述停止训练条件,基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络。
在一种可能实施方式中,所述第二获取单元用于:
获取所述教师泛化特征以及所述学生泛化特征之间的均方误差;
获取所述样本混合信号与所述学生泛化特征之间的互信息值;
将所述均方误差或者所述互信息值中至少一项确定为本次迭代过程的损失函数值。
在一种可能实施方式中,所述停止训练条件为在连续第一目标次数的迭代过程中所述均方误差没有减小;或,
所述停止训练条件为所述均方误差小于或等于第一目标阈值且所述互信息值大于或等于第二目标阈值;或,
所述停止训练条件为迭代次数到达第二目标次数。
在一种可能实施方式中,所述第一获取单元用于:
将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集;
将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,所述第一平滑系数与所述第二平滑系数相加所得的数值为1;
将所述第一参数集与所述第二参数集之和确定为本次迭代过程的教师模型的参数集;
基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
在一种可能实施方式中,所述第三获取单元用于:
分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为所述编码网络和所述萃取网络;或,
分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为所述编码网络和所述萃取网络。
在一种可能实施方式中,所述信号处理模块用于:
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行文语转换,输出所述目 标对象的音频信号对应的文本信息;或,
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行声纹识别,输出所述目标对象的音频信号对应的声纹识别结果;或,
基于所述目标分量的泛化特征,生成所述目标对象的音频信号对应的应答语音,输出所述应答语音。
一方面,提供了一种电子设备,该电子设备包括一个或多个处理器和一个或多个存储器,该一个或多个存储器中存储有至少一条程序代码,该至少一条程序代码由该一个或多个处理器加载并执行以实现如上述任一种可能实现方式的音频信号处理方法所执行的操作。
一方面,提供了一种存储介质,该存储介质中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行以实现如上述任一种可能实现方式的音频信号处理方法所执行的操作。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还根据这些附图获得其他的附图。
图1是本申请实施例提供的一种音频信号处理方法的实施环境示意图;
图2是本申请实施例提供的一种音频信号处理方法的流程图;
图3是本申请实施例提供的一种编码网络及萃取网络的训练方法的流程图;
图4是本申请实施例提供的一种编码网络及萃取网络的训练方法的原理性示意图;
图5是本申请实施例提供的一种音频信号处理装置的结构示意图;
图6是本申请实施例提供的一种终端的结构示意图;
图7是本申请实施例提供的一种服务器的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。
本申请中术语“至少一个”是指一个或多个,“多个”的含义是指两个或两个以上,例如,多个第一位置是指两个或两个以上的第一位置。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数 据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括音频处理技术、计算机视觉技术、自然语言处理技术以及机器学习/深度学习等几大方向。
让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中音频处理技术(Speech Technology,也称语音处理技术)成为未来最被看好的人机交互方式之一,具体包括语音分离技术、自动语音识别技术(Automatic Speech Recognition,ASR)、语音合成技术(Text To Speech,TTS,也称文语转换技术)以及声纹识别技术等。
随着AI技术的发展,音频处理技术在多个领域展开了研究和应用,例如常见的智能音箱、智能语音助手、车载或电视盒子上的语音前端处理、ASR、语音识别产品、声纹识别产品等,相信随着AI技术的发展,音频处理技术将在更多的领域得到应用,发挥越来越重要的价值。
本申请实施例涉及音频处理技术领域内的语音分离技术,下面对语音分离技术进行简介:
语音分离的目标是将目标说话人的声音从背景干扰中分离出来,在音频信号处理中,语音分离属于最基本的任务类型之一,应用范围很广泛,包括听力假体、移动通信、鲁棒的自动语音识别以及说话人识别等。人类听觉系统能够轻易地将一个人的声音和另一个人的声音分离开来,即使在鸡尾酒会那样嘈杂的声音环境中,人耳也有能力专注于听某一个目标说话人的说话内容,因此,语音分离问题通常也被称为“鸡尾酒会问题”(cocktail party problem)。
由于麦克风采集到的音频信号中可能包括噪声、其他说话人的声音、混响等背景干扰,若不做语音分离,直接进行语音识别、声纹验证等下游任务,会大大降低下游任务的准确率,因此,在语音前端加上语音分离技术,能够将目标说话人的声音和其他背景干扰分离开来,从而能够提升下游任务的鲁棒性,使得语音分离技术逐渐成为现代音频处理系统中不可或缺的一环。
在一些实施例中,根据背景干扰的不同,语音分离任务分为三类:当干扰为噪声信号时,称为语音增强(Speech Enhancement);当干扰为其他说话人时,称为多说话人分离(Speaker Separation);当干扰为目标说话人自身声音的反射波时,称为解混响(De-reverberation)。
尽管基于有监督学习的深度模型在语音分离任务中取得了一定的成功,但根据广泛报道,如果在应用中遭遇到训练时没有标注过的噪声类型干扰,语音分离系统的准确性显著下降。
此外,研究表明只有一只耳朵功能正常的人更容易被干扰的声音分散注意力,同理,单通道(单耳的)语音分离在业内是非常困难的一个问题,因为相对于双声道或者多声道的输入信号而言,单通道的输入信号缺失了可用于定位声源的空间线索。
有鉴于此,本申请实施例提供一种音频处理方法,不仅能够适用于双声道或多声道的语音分离场景,而且能够适用于单通道的语音分离场景,同时还能够在(尤其是训练场景之外的)各类场景下提升音频处理过程的准确性。
图1是本申请实施例提供的一种音频信号处理方法的实施环境示意图。参见图1,在该实施环境中包括终端101和服务器102,终端101和服务器102均为电子设备。
在一些实施例中,终端101用于采集音频信号,在终端101上安装有音频信号的采集组件,例如麦克风等录音元件,或者,终端101还直接下载一段音频文件,将该音频文件进行解码得到音频信号。
在一些实施例中,终端101上安装有音频信号的处理组件,使得终端101独立实现本身 实施例提供的音频信号处理方法,例如,该处理组件是一个DSP(Digital Signal Processing,数字信号处理器),在DSP上运行本申请实施例提供的编码网络及萃取网络的程序代码,以提取采集组件所采集到的混合音频信号中目标分量的泛化特征,基于目标分量的泛化特征执行后续的音频处理任务,后续的音频处理任务包括但不限于:语音识别、声纹验证、文语转换、智能语音助理应答或者智能音箱应答中至少一项,本申请实施例不对音频处理任务的类型进行具体限定。
在一些实施例中,终端101在通过采集组件采集到混合音频信号之后,还将该混合音频信号发送至服务器102,由服务器对该混合音频信号进行音频处理,比如,在服务器上运行本申请实施例提供的编码网络及萃取网络的程序代码,以提取混合音频信号中目标分量的泛化特征,基于目标分量的泛化特征执行后续的音频处理任务,后续的音频处理任务包括但不限于:语音识别、声纹验证、文语转换、智能语音助理应答或者智能音箱应答中至少一项,本申请实施例不对音频处理任务的类型进行具体限定。
在一些实施例中,终端101和服务器102通过有线网络或无线网络相连。
服务器102用于处理音频信号,服务器102包括一台服务器、多台服务器、云计算平台或者虚拟化中心中的至少一种。可选地,服务器102承担主要计算工作,终端101承担次要计算工作;或者,服务器102承担次要计算工作,终端101承担主要计算工作;或者,终端101和服务器102两者之间采用分布式计算架构进行协同计算。
可选地,终端101泛指多个终端中的一个,终端101的设备类型包括但不限于:车载终端、电视机、智能手机、智能音箱、平板电脑、电子书阅读器、MP3(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)播放器、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机或者台式计算机中的至少一种。以下实施例,以终端包括智能手机来进行举例说明。
本领域技术人员知晓,上述终端101的数量更多或更少。比如上述终端101仅为一个,或者上述终端101为几十个或几百个,或者更多数量。本申请实施例对终端101的数量和设备类型不加以限定。
在一个示例性场景中,以终端101为车载终端为例,假设目标分量对应于混合音频信号中终端用户的音频信号,车载终端采集混合音频信号,基于本申请实施例提供的音频处理方法,提取到混合音频信号中目标分量的泛化特征之后,能够将用户的语音从混合音频信号中分离出来,萃取出用户的干净音频信号,在干净音频信号中不但去除了噪声干扰,而且还去除了其他说话人的声音干扰,基于上述干净音频信号,能够对用户的语音指令进行准确地解析和响应,提升了车载终端的音频处理准确率,提升了智能驾驶系统的智能性,优化了用户体验,在未来的5G(5th Generation wireless systems,第五代移动通信系统)时代,随着车联网的全面普及,将具有重要的应用价值。
在一个示例性场景中,以终端101为智能音箱为例,假设目标分量对应于混合音频信号中终端用户的音频信号,在智能音箱的播放环境中通常伴随着背景音乐干扰,智能音箱采集到携带干扰的混合音频信号,基于本申请实施例提供的音频处理方法,提取到混合音频信号中目标分量的泛化特征之后,能够将用户的语音从混合音频信号中分离出来,萃取出用户的干净音频信号,在干净音频信号中不但去除了背景音乐干扰,而且还去除了其他说话人的声音干扰,基于上述干净音频信号,能够对用户的语音指令进行准确地解析和响应,提升了智 能音箱的音频处理准确率,优化了用户体验。
在一个示例性场景中,以终端101为智能手机为例,假设目标分量对应于混合音频信号中终端用户的音频信号,用户使用手机的环境通常是不可预测、复杂多变的,那么环境中携带的干扰类型也是多种多样的,针对传统的基于有监督学习的深度模型而言,若要收集覆盖各类场景的携带标注的训练数据显然是不切实际的,而在本申请实施例中,智能手机采集到携带干扰的混合音频信号,基于本申请实施例提供的音频处理方法,提取到混合音频信号中目标分量的泛化特征,不管在何种场景下,均能够将用户的语音从混合音频信号中分离出来,萃取出用户的干净音频信号,在干净音频信号中不但去除了背景音乐干扰,而且还去除了其他说话人的声音干扰,基于上述干净音频信号,能够对用户的语音指令进行准确地解析和响应,比如,用户在触发文语转换指令之后,录入了一段携带噪声干扰的语音,智能手机萃取到用户的干净音频信号之后,能够准确地将用户的语音转化为对应的文本,大大提升文语转换过程的准确性、精确性,提升了智能手机的音频处理准确率,优化了用户体验。
在上述各个场景均为本申请实施例所涉及的音频处理方法的示例性场景,不应构成对该音频处理方法的应用场景的限制,该音频处理方法可应用于各类音频处理的下游任务的前端,作为一个针对混合音频信号进行语音分离以及特征提取的预处理步骤,具有高可用性、可迁移性和可移植性,此外,针对较为困难的鸡尾酒会问题以及单通道语音分离问题,均具有良好的表现,下面进行详述。
图2是本申请实施例提供的一种音频信号处理方法的流程图。参见图2,该实施例应用于上述实施例中的终端101,或者应用于服务器102,或者应用于终端101与服务器102之间的交互过程,在本实施例中以应用于终端101为例进行说明,该实施例包括下述步骤:
201、终端获取混合音频信号。
其中,该混合音频信号中包括目标对象的音频信号,该目标对象是任何能够发声的客体,比如自然人、虚拟形象、智能客服、智能语音助手或者AI机器人中至少一项,例如,将混合音频信号中能量最大的说话人确定为目标对象,本申请实施例不对目标对象的类型进行具体限定。除了目标对象的音频信号之外,混合音频信号中还包括噪声信号或者其他对象的音频信号中至少一项,其他对象是指除了目标对象之外的任一对象,噪声信号包括白噪声、粉红噪声、褐色噪声、蓝噪声或者紫噪声中至少一项,本申请实施例不对噪声信号的类型进行具体限定。
在一些实施例中,在上述过程中,终端上安装有应用程序,用户在应用程序中触发音频采集指令之后,操作系统响应于音频采集指令,调用录音接口,驱动音频信号的采集组件(比如麦克风)以音频流的形式采集混合音频信号。在另一些实施例中,终端也从本地预存的音频中选择一段音频作为混合音频信号。在另一些实施例中,终端还从云端下载音频文件,对该音频文件进行解析得到混合音频信号,本申请实施例不对混合音频信号的获取方式进行具体限定。
202、终端将混合音频信号输入编码网络,通过该编码网络将该混合音频信号映射至嵌入空间,得到该混合音频信号的嵌入特征。
在上述过程中,由编码网络将输入信号(混合音频信号)从低维空间非线性地映射至高维的嵌入空间(embedding space),也即是说,输入信号在嵌入空间的向量表示即为上述嵌入特征。
上述步骤202中,终端将混合音频信号输入编码网络(encoder),通过编码网络对该混合音频信号进行嵌入(embedding)处理,得到混合音频信号的嵌入特征,相当于对混合音频信号进行了一次编码,得到表达能力更强的高维嵌入特征,使得后续提取目标分量的泛化特征时具有更高的准确性。
该步骤202为对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征的过程,该过程中以终端通过编码网络实现嵌入处理过程为例进行了说明,在另一些实施例中,该步骤202中,终端直接将混合音频信号映射至嵌入空间,得到该混合音频信号的嵌入特征。
在一些实施例中,嵌入处理过程通过映射实现,也即是,步骤202中,终端将所述混合音频信号映射至嵌入空间,得到所述嵌入特征。
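As a reading aid, the following is a minimal sketch of such an encoder: a stack of BLSTM layers followed by a linear projection that maps each time-frequency frame of the mixture spectrogram into a D-dimensional embedding space. The use of PyTorch, the class name, and all layer sizes are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn

class EmbeddingEncoder(nn.Module):
    """Maps a mixture spectrogram (B, T, F) to embeddings v of shape (B, T, F, D)."""

    def __init__(self, num_bins=257, embed_dim=40, hidden=600, num_layers=4):
        super().__init__()
        self.blstm = nn.LSTM(num_bins, hidden, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_bins * embed_dim)
        self.num_bins, self.embed_dim = num_bins, embed_dim

    def forward(self, spec):                  # spec: (B, T, F) magnitude STFT
        h, _ = self.blstm(spec)               # (B, T, 2 * hidden)
        v = self.proj(h)                      # (B, T, F * D)
        return v.view(spec.size(0), spec.size(1), self.num_bins, self.embed_dim)
```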
203、终端将该嵌入特征输入自回归模型,通过该自回归模型对该嵌入特征进行递归加权处理,得到该混合音频信号中目标分量的泛化特征,该目标分量对应于该混合音频信号中目标对象的音频信号。
需要说明的是,由于混合音频信号通常为音频数据流的形式,也即是说,混合音频信号包括至少一个音频帧,那么相应地,混合音频信号的嵌入特征包括至少一个音频帧的嵌入特征。
在一些实施例中,上述自回归模型是一个LSTM(Long Short-Term Memory,长短期记忆)网络,在LSTM网络中包括输入层、隐藏层和输出层,在隐藏层中包括具有分层结构的多个记忆单元,每个记忆单元对应于输入层中混合音频信号的一个音频帧的嵌入特征。
对LSTM网络的任一层中任一个记忆单元,当该记忆单元接收到该音频帧的嵌入特征和本层内上一个记忆单元的输出特征时,对该音频帧的嵌入特征以及上一个记忆单元的输出特征进行加权变换,得到该记忆单元的输出特征,将该记忆单元的输出特征分别输出至本层内下一个记忆单元以及下一层内对应位置的记忆单元,每层内的每个记忆单元均执行上述操作,相当于在整个LSTM网络中执行了递归加权处理。
在上述基础上,终端将混合音频信号中多个音频帧的嵌入特征分别输入到第一层内的多个记忆单元,由第一层内的多个记忆单元对该多个音频帧的嵌入特征进行单向的递归加权变换,得到多个音频帧的中间特征,将该多个音频帧的中间特征分别输入到第二层的多个记忆单元,以此类推,直到最后一层的多个记忆单元输出多个音频帧中目标分量的泛化特征。
在一些实施例中,上述自回归模型还是一个BLSTM(Bidirectional Long Short-Term Memory,双向长短期记忆)网络,BLSTM网络中包括一个前向LSTM和一个后向LSTM,在BLSTM网络中也包括输入层、隐藏层和输出层,在隐藏层中包括分层结构的多个记忆单元,每个记忆单元对应于输入层中混合音频信号的一个音频帧的嵌入特征,但与LSTM不同的是,BLSTM中每个记忆单元不仅要执行前向LSTM对应的加权操作,还要执行后向LSTM对应的加权操作。
对BLSTM网络的任一层中任一个记忆单元,一方面,当该记忆单元接收到该音频帧的嵌入特征和本层内上一个记忆单元的输出特征时,对该音频帧的嵌入特征以及上一个记忆单元的输出特征进行加权变换,得到该记忆单元的输出特征,将该记忆单元的输出特征分别输出至本层内下一个记忆单元以及下一层内对应位置的记忆单元;另一方面,当该记忆单元接收到该音频帧的嵌入特征和本层内下一个记忆单元的输出特征时,对该音频帧的嵌入特征以及下一个记忆单元的输出特征进行加权变换,得到该记忆单元的输出特征,将该记忆单元的输出特征分别输出至本层内上一个记忆单元以及下一层内对应位置的记忆单元。每层内的每个记忆单元均执行上述操作,相当于在整个BLSTM网络中执行了递归加权处理。
在上述基础上,终端将混合音频信号中多个音频帧的嵌入特征分别输入到第一层内的多个记忆单元,由第一层内的多个记忆单元对该多个音频帧的嵌入特征进行双向(包括前向和后向)的递归加权变换,得到多个音频帧的中间特征,将该多个音频帧的中间特征分别输入到第二层的多个记忆单元,以此类推,直到最后一层的多个记忆单元输出多个音频帧中目标分量的泛化特征。
该步骤203为对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征的过程,在一些实施例中,该得到泛化特征的过程通过萃取网络实现,也即是,该步骤203为将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征的过程。在上述步骤203中,以萃取网络为自回归模型为例,说明了终端将该嵌入特征输入萃取网络(abstractor),通过该萃取网络对该嵌入特征进行泛化特征提取,得到该混合音频信号中目标分量的泛化特征,这些泛化特征相当于目标对象的说话语音的一个抽象表征(abstract feature),而并非是针对某一类型的干扰或者某一类型的下游任务而训练出来的特定特征,泛化特征能够在通常场景下具有良好的表达能力,使得基于泛化特征执行的音频信号处理的准确性得到普遍地提升。
在一些实施例中,萃取网络还是复发性(recurrent)模型、摘要函数、CNN(Convolutional Neural Networks,卷积神经网络)、TDNN(Time Delay Neural Network,时延神经网络)或者闸控卷积神经网络中至少一项,或者多个不同类型网络的组合,本申请实施例不对萃取网络的结构进行具体限定。
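To make the recursive weighting concrete, here is a hedged sketch of a BLSTM-based abstractor that turns the embeddings v into per-bin weights p and a per-frame generalized feature; the class name, layer sizes, and the small normalization constant are assumptions for illustration only, and a unidirectional LSTM can be substituted for causal systems.

```python
import torch
import torch.nn as nn

class Abstractor(nn.Module):
    """Recursively weights embeddings v (B, T, F, D) into generalized features."""

    def __init__(self, num_bins=257, embed_dim=40, hidden=600, num_layers=2):
        super().__init__()
        self.fc_in = nn.Linear(num_bins * embed_dim, hidden)
        self.blstm = nn.LSTM(hidden, hidden, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.fc_out = nn.Linear(2 * hidden, num_bins)   # per-bin weights p

    def forward(self, v):                               # v: (B, T, F, D)
        B, T, F, D = v.shape
        h = torch.relu(self.fc_in(v.reshape(B, T, F * D)))
        h, _ = self.blstm(h)                            # bidirectional recursion over frames
        p = torch.sigmoid(self.fc_out(h))               # (B, T, F)
        # per-frame generalized feature: c_t = sum_f v_{t,f} * p_{t,f} / sum_f p_{t,f}
        c = (v * p.unsqueeze(-1)).sum(dim=2) / (p.sum(dim=2, keepdim=True) + 1e-8)
        return p, c                                     # c: (B, T, D)
```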
204、终端基于该目标分量的泛化特征进行音频信号处理。
在不同的任务场景下音频信号处理具有不同的含义,下面给出几个示例性说明:
在文语转换场景中,终端基于目标分量的泛化特征,对目标对象的音频信号进行文语转换,输出目标对象的音频信号对应的文本信息。可选地,在进行文语转换时,终端将目标分量的泛化特征输入至语音识别模型,通过语音识别模型将混合音频信号中目标对象的音频信号翻译为对应的文本信息,泛化特征能够良好地适用于文语转换场景,具有较高的音频信号处理准确性。
在声纹支付场景中,终端基于目标分量的泛化特征,对目标对象的音频信号进行声纹识别,输出目标对象的音频信号对应的声纹识别结果,进而基于声纹识别结果进行声纹支付。可选地,在进行声纹识别时,终端将目标分量的泛化特征输入至声纹识别模型,通过声纹识别模型验证混合音频信号中目标对象的音频信号是否为本人的声音,确定对应的声纹识别结果,若验证出声纹识别结果为“是本人的声音”之后,执行后续的支付操作,否则返回支付失败信息,泛化特征能够良好地适用于声纹支付场景,具有较高的音频信号处理准确性。
在智能语音交互场景中,终端基于目标分量的泛化特征,生成目标对象的音频信号对应的应答语音,输出该应答语音。可选地,在进行语音合成时,终端将目标分量的泛化特征输入至问答模型,通过问答模型提取混合音频信号中目标对象的音频信号的语义信息之后,基于该语义信息生成对应的应答语音,向用户输出该应答语音,泛化特征能够良好地适用于智能语音交互场景,具有较高的音频信号处理准确性。
以上仅为几种示例性的音频处理场景,而目标分量的泛化特征良好地适用于各类音频处理场景,根据音频处理场景的不同,下游的音频处理任务也不尽相同,音频信号处理的方式也就不尽相同,本申请实施例不对音频信号处理的方式进行具体限定。
上述所有可选技术方案,采用任意结合形成本申请的可选实施例,在此不再一一赘述。
本申请实施例提供的方法,通过对该混合音频信号进行嵌入处理,得到该混合音频信号的嵌入特征,对该嵌入特征进行泛化特征提取,能够提取得到该混合音频信号中目标分量的泛化特征,该目标分量对应于该混合音频信号中目标对象的音频信号,基于该目标分量的泛化特征进行音频信号处理,由于目标分量的泛化特征并非是针对某一类特定场景下的声音特征,具有较好的泛化能力和表达能力,因此基于目标分量的泛化特征进行音频信号处理时,能够良好地适用于不同的场景,提升了音频信号处理过程的鲁棒性和泛化性,提升了音频信号处理的准确性。
在上述实施例中,介绍了如何对混合音频信号进行目标分量的泛化特征提取,并基于目标分量的泛化特征来进行音频处理,也即是说,在上述实施例中终端能够从夹杂各类干扰的混合音频信号中,针对目标对象的音频信号(通常是目标说话人的声音)提取出鲁棒的、通用的表征(目标分量的泛化特征)。在本申请实施例中,将对如何获取上述音频信号处理方法中使用的编码网络以及萃取网络进行说明,提供一种基于无监督学习的编码网络及萃取网络的训练方法。
上述训练方法应用于上述实施环境中的终端101或者服务器102,在本实施例中以应用于服务器102为例进行说明,可选地,服务器102在训练得到编码网络和萃取网络之后,将训练好的编码网络及萃取网络发送至终端101,使得终端101执行上述实施例中的音频信号处理方法。
在训练过程中,服务器先获取未标注的样本混合信号,再基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到上述实施例中所使用的编码网络以及萃取网络。
其中,未标注的样本混合信号也即是未经过任何标注的训练数据,该样本混合信号中也包括目标对象的音频信号,该目标对象是任何能够发声的客体,比如自然人、虚拟形象、智能客服、智能语音助手或者AI机器人中至少一项,例如,将混合音频信号中能量最大的说话人确定为目标对象,本申请实施例不对目标对象的类型进行具体限定。除了目标对象的音频信号之外,样本混合信号中还包括噪声信号或者其他对象的音频信号中至少一项,噪声信号包括白噪声、粉红噪声、褐色噪声、蓝噪声或者紫噪声中至少一项,本申请实施例不对噪声信号的类型进行具体限定。
服务器获取样本混合信号的过程,与上述步骤201中终端获取混合音频信号的过程类似,这里不做赘述。需要说明的是,服务器还基于语音生成模型自动地生成一段未标注的样本混合信号,基于生成的样本混合信号完成后续的训练流程。
假设用χ表示训练集，χ中存在一组有标注的训练样本 $\{X^{(1)},\ldots,X^{(L)}\in\chi\}$，一组未标注的训练样本 $\{X^{(L+1)},\ldots,X^{(L+U)}\in\chi\}$，以及一组背景干扰和噪声样本 $\{X^{(L+U+1)},\ldots,X^{(L+U+N)}\in\chi\}$，每个训练样本（或噪声样本）是由输入空间的一组时频点 $\{x = X_{t,f}\}_{t=1,\ldots,T;\,f=1,\ldots,F}$ 构成的，X表示训练样本，t表示帧索引，f表示频带索引，T表示训练样本所包含的音频帧个数，F表示训练样本所包含的频带个数。
在本申请实施例提供的基于无监督学习的训练方法中,训练集中缺少有标注的训练样本,也即是说,L=0,U≥1,N≥0。
图3是本申请实施例提供的一种编码网络及萃取网络的训练方法的流程图,请参考图3,在本申请实施例中,以任一次迭代过程为例,对教师模型和学生模型如何进行协同迭代训练进行说明,该实施例包括下述步骤:
301、在任一次迭代过程中,服务器基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型。
其中,该学生模型包括第一编码网络和第一萃取网络,该教师模型包括第二编码网络和第二萃取网络,该第一编码网络的输出作为该第一萃取网络的输入,该第二编码网络的输出作为该第二萃取网络的输入。
可选地,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。在上述步骤301中,服务器通过执行下述几个子步骤来获取本次迭代过程的教师模型:
3011、服务器将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集。
在上述过程中,服务器将上一次迭代过程的教师模型中第二编码网络以及第二萃取网络的参数集分别与第一平滑系数相乘,得到第二编码网络以及第二萃取网络各自对应的第一参数集。
在一个示例中,假设第二编码网络的参数集用θ′表示,第二萃取网络的参数集用ψ′表示,第一平滑系数用α表示,本次迭代过程为第l(l≥2)次迭代过程,上一次迭代过程为第l-1次迭代过程,那么服务器将第l-1次迭代过程中所采用的教师模型中第二编码网络的参数集θ l-1′以及第二萃取网络的参数集ψ l-1′分别与第一平滑系数α相乘,即可得到第二编码网络对应的第一参数集αθ l-1′以及第二萃取网络对应的第一参数集αψ l-1′。
3012、服务器将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,该第一平滑系数与该第二平滑系数相加所得的数值为1。
其中,本次迭代过程的学生模型是基于上一次迭代过程的学生模型进行参数调整而得到的。
在上述过程中,服务器将本次迭代过程的学生模型中第一编码网络以及第一萃取网络的参数集分别与第二平滑系数相乘,得到第一编码网络以及第一萃取网络各自对应的第二参数集。
基于上述示例,假设第一编码网络的参数集用θ表示,第一萃取网络的参数集用ψ表示,由于第一平滑系数与第二平滑系数相加所得的数值为1,那么第二平滑系数用1-α来表示,服务器将第l次迭代过程所采用的学生模型中第一编码网络的参数集θ l以及第一萃取网络的参数集ψ l分别与第二平滑系数1-α相乘,即可得到第一编码网络对应的第二参数集(1-α)θ l以及第一萃取网络对应的第二参数集(1-α)ψ l
3013、服务器将该第一参数集与该第二参数集之和确定为本次迭代过程的教师模型的参数集。
在上述过程中,服务器将上一次迭代过程中教师模型的第二编码网络的第一参数集以及本次迭代过程中学生模型的第一编码网络的第二参数集之和确定为本次迭代过程的教师模型的第二编码网络的参数集,同理,将上一次迭代过程中教师模型的第二萃取网络的第一参数 集以及本次迭代过程中学生模型的第一萃取网络的第二参数集之和确定为本次迭代过程的教师模型的第二萃取网络的参数集。
基于上述示例，服务器将第l-1次迭代过程中第二编码网络的第一参数集 $\alpha\theta_{l-1}'$ 与第l次迭代过程中第一编码网络的第二参数集 $(1-\alpha)\theta_l$ 之和确定为第l次迭代过程的教师模型中第二编码网络的参数集 $\theta_l'$，也即是说，第l次迭代过程的教师模型中第二编码网络的参数集 $\theta_l'$ 使用下述公式进行表示：
$$\theta_l' = \alpha\,\theta_{l-1}' + (1-\alpha)\,\theta_l$$
基于上述示例，服务器将第l-1次迭代过程中第二萃取网络的第一参数集 $\alpha\psi_{l-1}'$ 与第l次迭代过程中第一萃取网络的第二参数集 $(1-\alpha)\psi_l$ 之和确定为第l次迭代过程的教师模型中第二萃取网络的参数集 $\psi_l'$，也即是说，第l次迭代过程的教师模型中第二萃取网络的参数集 $\psi_l'$ 使用下述公式进行表示：
$$\psi_l' = \alpha\,\psi_{l-1}' + (1-\alpha)\,\psi_l$$
3014、服务器基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
在上述过程中,在获取到第l次迭代过程的教师模型中第二编码网络的参数集θ l′以及第二萃取网络的参数集ψ l′之后,服务器将第l-1次迭代过程的教师模型中第二编码网络的参数集θ l-1′更新为上述θ l′,将第l-1次迭代过程的教师模型中第二萃取网络的参数集ψ l-1′更新为上述ψ l′,从而得到第l次迭代过程的教师模型。
上述步骤3011-步骤3014,服务器能够基于一种指数移动平均(Exponential Moving Average,EMA)的方法来分别更新教师模型中第二编码网络以及第二萃取网络的参数集,比如,在第一次迭代过程中分别对教师模型和学生模型进行初始化(或者预训练),保持教师模型和学生模型在第一次迭代过程中参数相同,接下来在第二次迭代过程中教师模型相当于第一次迭代过程中教师模型(与学生模型参数相同)与第二次迭代过程中学生模型的参数集的加权平均,随着学生模型与教师模型的一次次迭代,可知最终教师模型在本质上相当于多次历史迭代过程中学生模型的加权平均,基于这种EMA方法获取的教师模型能够较好地反映出多次历史迭代过程中学生模型的性能,有利于协同训练出具有更好的鲁棒性的学生模型。
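A compact sketch of this exponential-moving-average update, applied parameter-wise to the teacher's encoder and abstractor; the smoothing coefficient value 0.999 is an assumed placeholder, not a value stated in the text.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """theta'_l = alpha * theta'_{l-1} + (1 - alpha) * theta_l, applied to every parameter."""
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(alpha).add_(p_student, alpha=1.0 - alpha)
```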
302、服务器将未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出该样本混合信号中目标分量的教师泛化特征以及学生泛化特征。
在上述步骤302中,对本次迭代过程的学生模型而言,服务器将未标注的样本混合信号输入本次迭代过程的学生模型中的第一编码网络,通过本次迭代过程的第一编码网络对样本混合信号进行嵌入处理,得到样本混合信号的学生嵌入特征,将样本混合信号的学生嵌入特征输入本次迭代过程的学生模型中的第一萃取网络,通过本次迭代过程的第一萃取网络对样本混合信号进行泛化特征提取,输出样本混合信号中目标分量的学生泛化特征,该目标分量对应于样本混合信号中目标对象的音频信号,上述过程与上述实施例中步骤202-步骤203类似,这里不做赘述。
在上述步骤302中,对本次迭代过程的教师模型而言,服务器将未标注的样本混合信号输入本次迭代过程的教师模型中的第二编码网络,通过本次迭代过程的第二编码网络对样本 混合信号进行嵌入处理,得到样本混合信号的教师嵌入特征,将样本混合信号的教师嵌入特征输入本次迭代过程的教师模型中的第二萃取网络,通过本次迭代过程的第二萃取网络对样本混合信号进行泛化特征提取,输出样本混合信号中目标分量的教师泛化特征,该目标分量对应于样本混合信号中目标对象的音频信号,上述过程与上述实施例中步骤202-203类似,这里不做赘述。
在一个示例中,假设用x表示样本混合信号,用E θ表示本次迭代过程中学生模型的第一编码网络(encoder),其中θ表示第一编码网络的参数集,那么第一编码网络E θ相当于对样本混合信号x作了一次非线性映射,将样本混合信号x从输入域映射到一个高维的嵌入(embedding)空间,输出样本混合信号的学生嵌入特征v,也即是说,第一编码网络E θ的作用相当于下述映射关系:
$$E_\theta: \mathbb{R}^{T \times F} \to \mathbb{R}^{T \times F \times D},\quad x \mapsto v$$
在上述映射关系中，$x \in \mathbb{R}^{T \times F}$ 表示单通道的样本混合信号的短时傅里叶谱（Short-Time Fourier Transform，STFT），T表示输入的样本混合信号的音频帧个数，F表示STFT的频带个数，$\mathbb{R}^{T \times F}$ 表示第一编码网络 $E_\theta$ 的输入域，D表示嵌入空间的维度，$\mathbb{R}^{T \times F \times D}$ 表示第一编码网络 $E_\theta$ 的输出域（也即是嵌入空间），也即是说，第一编码网络 $E_\theta$ 表示一个连续可微的参数函数，能够将样本混合信号x从输入域 $\mathbb{R}^{T \times F}$ 映射至嵌入空间 $\mathbb{R}^{T \times F \times D}$。
在一些实施例中,上述样本混合信号x的STFT特征是对数梅尔谱特征或者MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)特征中至少一项,或者,也是对数梅尔谱特征与MFCC特征之间的组合,或者,还包括自回归模型的后验预测得分、梅尔频谱特征或者其他因素的特征,本申请对STFT特征的类型不进行具体限定。
基于上述示例,假设用A ψ表示本次迭代过程中学生模型的第一萃取网络(abstractor),其中ψ表示第一萃取网络的参数集,那么第一萃取网络A ψ的作用相当于下述映射关系:
$$A_\psi: v \mapsto p,\quad v \times p \mapsto c$$
在上述映射关系中，v表示样本混合信号的学生嵌入特征，p表示第一萃取网络 $A_\psi$ 对学生嵌入特征v进行加权处理之后所得的特征，c表示样本混合信号中目标分量的学生泛化特征，此时的学生泛化特征c是由第一萃取网络 $A_\psi$ 的输入特征v与输出特征p之间进行递归加权变换之后所得的特征，此外，$\mathbb{R}^{T \times D}$ 表示第一萃取网络 $A_\psi$ 的输出域，T、F、D、$\mathbb{R}^{T \times F \times D}$ 的含义与上述示例相同，这里不做赘述。
在一些实施例中,第一萃取网络为自回归模型,从而通过自回归模型,能够基于本地的学生嵌入特征按时序地构建离散的学生泛化特征,此时构建出的学生泛化特征可能是短时的,也是长时的,本申请实施例不对学生泛化特征的时间分辨率进行具体限定。
可选地,在因果系统(causal system)中上述自回归模型采用LSTM网络,因果系统又称非超前系统(nonanticipative system),即输出不可能在输入到达之前出现的系统,也就是说系统某一时刻的输出,只取决于系统该时刻以及该时刻之前的输入,而与该时刻之后的输入无关,此时通过LSTM网络进行单向的递归加权处理,能够避免忽略掉前后音频帧之间在时序上因果关系。
可选地,在非因果系统(noncausal system)中上述自回归模型采用BLSTM网络,非因 果系统是指当前时刻的输出不仅取决于当前的输入,还取决于将来的输入,因此通过BLSTM网络进行双向的递归加权处理,不仅能够考虑到各个音频帧之前的各个历史音频帧的作用,而且还能够考虑到各个音频帧之后的各个未来音频帧的作用,从而能够较好地保留各个音频帧之间的上下文信息(context)。
在上述情况中,假设给定了预测值p(也即是上述示例中第一萃取网络的输出特征),那么学生泛化特征c采用下述公式进行表示:
$$c_t = \frac{\sum_{f} v_{t,f} \odot p_{t,f}}{\sum_{f} p_{t,f}}$$
在上述公式中,c t∈c表示第t个音频帧的学生泛化特征,v t∈v表示第t个音频帧的学生嵌入特征,p t∈p表示第一萃取网络针对第t个音频帧所输出的预测值,⊙表示特征之间的点乘操作,t(t≥1)表示帧索引,f表示频带索引。
在一些实施例中,还将上述公式中的分子和分母分别乘以一个二值阈值矩阵w,能够有助于减轻样本混合信号中低能量噪声的干扰(相当于一个高通滤波器),此时学生泛化特征c采用下述公式进行表示:
$$c_t = \frac{\sum_{f} w_{t,f}\, v_{t,f} \odot p_{t,f}}{\sum_{f} w_{t,f}\, p_{t,f}}$$
在上述公式中，$w_t \in w$ 表示第t个音频帧的二值阈值矩阵，且 $w_{t,f} \in \{0,1\}$，
其余各符号的含义与上一个公式中各符号的含义相同,这里不做赘述。
其中,对帧索引为t且频带索引为f的二值阈值矩阵w t,f而言,该二值阈值矩阵w t,f采用下述公式进行表示:
$$w_{t,f} = \begin{cases} 0, & X_{t,f} < \max(X)/100 \\ 1, & \text{否则} \end{cases}$$
在上述公式中,X表示样本混合信号构成的训练集,也即是说,若训练集中帧索引为t且频带索引为f的样本混合信号X t,f的能量值小于训练集中样本混合信号最大能量值的百分之一,那么将二值阈值矩阵w t,f置为0,从而在计算学生泛化特征时,忽略掉样本混合信号X t,f(低能量噪声)的干扰,否则,将二值阈值矩阵w t,f置为1,对于低能量噪声之外的音频分量进行照常计算。
在上述过程中,针对每个音频帧都构建各自的学生泛化特征,这种离散的学生泛化特征c t更适用于一些需要高时域分辨率信息的任务,比如针对目标说话人进行频谱重建。
在另一些实施例中,第一萃取网络还采用一种摘要函数或者一种复发性(recurrent)模型,从而通过摘要函数或者复发性模型,能够基于本地的学生嵌入特征构建出全局的学生泛化特征,本申请实施例不对第一萃取网络的类型进行具体限定。
在上述情况中,假设给定了预测值p(也即是上述示例中第一萃取网络的输出特征),那么学生泛化特征c采用下述公式进行表示:
$$c = \frac{\sum_{t,f} w\,(v \odot p)}{\sum_{t,f} w\, p}$$
其中,c、v、p、w、t、f均与上述各个公式中相同符号的含义一致,并且出于简洁的考虑,省略了c、v、p、w的维度索引下标,这里不做赘述。
上述公式中给出学生泛化特征c代表了一种跨越长时稳定的、全局的、“慢”(指低时域分辨率)的抽象表征,更加适用于一些仅需要低时域分辨率信息的任务,比如用于概括隐藏的目标说话人的特征。
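The two formulas above can be computed together in a few lines. The sketch below is a hedged illustration only: the tensor shapes, the epsilon constant, and the batched handling are assumptions. It derives the binary threshold matrix w from the mixture magnitudes and returns both the per-frame and the global generalized features.

```python
import torch

def generalized_features(v, p, spec, eps=1e-8):
    """v: (B, T, F, D) embeddings; p: (B, T, F) predicted weights; spec: (B, T, F) mixture magnitudes."""
    # binary threshold matrix: zero out bins whose energy is below 1% of the maximum energy
    w = (spec >= spec.amax(dim=(1, 2), keepdim=True) / 100.0).float()      # (B, T, F)
    wp = (w * p).unsqueeze(-1)                                             # (B, T, F, 1)
    c_frame = (v * wp).sum(dim=2) / (wp.sum(dim=2) + eps)                  # per-frame c_t: (B, T, D)
    c_global = (v * wp).sum(dim=(1, 2)) / (wp.sum(dim=(1, 2)) + eps)       # global c: (B, D)
    return c_frame, c_global
```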
303、服务器基于该样本混合信号、该教师泛化特征或者该学生泛化特征中至少一项,获取本次迭代过程的损失函数值。
由于在训练过程中采用的样本混合信号是未经过标注的,此时无法直接观察到隐藏在样本混合信号中的目标对象的音频信号,也即是说训练过程中采用隐式输入信号,那么传统的针对显式输入信号来计算损失函数值的方法将不再适用,其中,传统的针对显式输入信号计算损失函数值的方法包括NCE(Noise Contrastive Estimation,噪声对比估计)、DIM(Deep InfoMax,深度互信息最大化)等。
有鉴于此,本申请实施例针对学生模型提供一种计算模块(estimator),该计算模块用于计算第一编码网络与第一萃取网络在每次迭代过程的损失函数值。
可选地,上述损失函数值包括教师泛化特征与学生泛化特征之间的均方误差(Mean Squared Error,MSE)或者样本混合信号与学生泛化特征之间的互信息值(Mutual Information,MI)中至少一项。
在上述步骤303中,服务器通过执行下述几个子步骤来获取本次迭代过程的损失函数值:
3031、服务器获取本次迭代过程的教师泛化特征以及学生泛化特征之间的均方误差。
在上述过程中,教师泛化特征与学生泛化特征之间的均方误差MSE采用下述公式进行表示:
$$\mathcal{L}_{\mathrm{MSE}}(x) = \sum_{t,f}\left(\mathrm{sigmoid}\left(c_t'^{\top} v'_{t,f}\right) - \mathrm{sigmoid}\left(c_t^{\top} v_{t,f}\right)\right)^2$$
在上述公式中，$\mathcal{L}_{\mathrm{MSE}}(x)$ 表示教师泛化特征与学生泛化特征之间的均方误差MSE，t表示帧索引，f表示频带索引，x表示样本混合信号，sigmoid表示激活函数，$c_t'^{\top}$ 表示教师泛化特征 $c_t'$ 的转置向量，$v'_{t,f}$ 表示教师嵌入特征，$c_t^{\top}$ 表示学生泛化特征 $c_t$ 的转置向量，$v_{t,f}$ 表示学生嵌入特征。
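A sketch of this consistency term, assuming per-frame generalized features of shape (B, T, D) and embeddings of shape (B, T, F, D); detaching the teacher branch so that only the student receives gradients is an implementation choice assumed here, not something stated in the text.

```python
import torch

def consistency_mse(c_student, v_student, c_teacher, v_teacher):
    """Sum over (t, f) of (sigmoid(c'_t^T v'_{t,f}) - sigmoid(c_t^T v_{t,f}))^2, averaged over the batch."""
    act_s = torch.sigmoid((v_student * c_student.unsqueeze(2)).sum(dim=-1))   # (B, T, F)
    act_t = torch.sigmoid((v_teacher * c_teacher.unsqueeze(2)).sum(dim=-1))   # (B, T, F)
    return ((act_t.detach() - act_s) ** 2).sum(dim=(1, 2)).mean()
```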
3032、服务器获取样本混合信号与本次迭代过程的学生泛化特征之间的互信息值。
在上述过程中,假设学生模型包括第一编码网络E θ、第一萃取网络A ψ以及计算模块T ω,其中,θ为第一编码网络E θ的参数集,ψ为第一萃取网络A ψ的参数集,ω为计算模块T ω的参数集,此时,整个学生模型的参数集表示为Θ={θ,ψ,ω}。
在上述步骤302已经介绍了第一编码网络E θ以及第一萃取网络A ψ所等价的映射关系,这里不做赘述,此处将对计算模块T ω的等价映射关系进行介绍,表达式如下:
$$T_\omega: (v, c) \mapsto \mathbb{R}$$
上述映射关系的表达式表明，计算模块 $T_\omega$ 以学生嵌入特征v以及学生泛化特征c为输入，输出一个位于 $\mathbb{R}$ 输出域内的互信息值。
针对上述映射关系，将计算模块 $T_\omega$ 建模为如下公式：
$$T_\omega = D_\omega \circ g \circ (E_\theta, A_\psi)$$
在上述公式中,g表示将E θ输出的学生嵌入特征与A ψ输出的学生泛化特征联合在一起的函数,D ω表示计算互信息值MI的函数。
在本申请实施例中,训练样本为未标注的受干扰的样本混合信号,这类样本混合信号的时频点x认为是由目标对象的音频信号的时频点x和干扰信号的时频点x′的线性混合,也即是说,x=x+x′,样本混合信号所服从的分布为P≈p(x,c),其中p为第一萃取网络根据样本混合信号x以及学生泛化特征c所确定的一个预测值。除此之外,训练样本中还包括干扰信号(纯干扰或者背景噪声),即x=x′,干扰信号所服从的提议分布(proposal distribution)为
$\tilde{P}$。
在这种情况下,本申请实施例针对隐式输入信号提出一种简称为ImNICE(InfoMax Noise-Interference Contractive Estimation,基于互信息最大化的噪声-干扰对比估计)的损失函数,此时样本混合信号与学生泛化特征之间的互信息值MI(也即是ImNICE损失函数值)采用下述公式进行表示:
$$\mathcal{L}_{\mathrm{ImNICE}}(\Theta) = -\,\mathbb{E}_{P}\left[\log\frac{f_\Theta(x,c)}{f_\Theta(x,c) + \sum_{x' \sim \tilde{P}} f_\Theta(x',c)}\right]$$
在上述公式中,
$\mathcal{L}_{\mathrm{ImNICE}}(\Theta)$ 表示样本混合信号与学生泛化特征之间ImNICE损失函数值，Θ={θ,ψ,ω}表示整个学生模型的参数集，x表示输入信号中被学生模型预测为正样本的时频点，x服从上述分布P≈p(x,c)，x′表示输入信号中被学生模型预测为负样本的时频点，x′服从上述提议分布 $\tilde{P}$，也即是说，x′表示取自提议分布 $\tilde{P}$ 的负样本时频点（对应于噪声或干扰信号），$\mathbb{E}_{P}$ 表示分布P的数学期望，$\mathbb{E}_{\tilde{P}}$ 表示提议分布 $\tilde{P}$ 的数学期望，c~$A_\psi(E_\theta(x))$ 表示第一编码网络 $E_\theta$ 以及第一萃取网络 $A_\psi$ 在作用于输入信号之后所得的学生泛化特征，此外，$f_\Theta(x,c)=\exp(T_\omega(E_\theta(x),c))$ 代表输入信号中被学生模型预测为正样本的时频点x与学生泛化特征c之间的互信息值，同理，$f_\Theta(x',c)=\exp(T_\omega(E_\theta(x'),c))$ 代表输入信号中被学生模型预测为负样本的时频点x′与学生泛化特征c之间的互信息值。
需要说明的是,上述针对ImNICE损失函数值的定义相当于一种平均交叉熵损失,具体地,假设存在一个分布p和另一个分布q,那么p和q之间的平均交叉熵损失为:
$$H(p,q) = -\sum p \log q$$
此时基于信息论的相关知识,推理出f Θ(x,c)的最优解为
$$f_\Theta(x,c) \propto \frac{p(x|c)}{p(x)}$$
也即是说，$f_\Theta(x,c)$ 视为是一种概率密度比值，这个概率密度比值能够用于估计输入的样本混合信号x与学生泛化特征c之间的互信息值。
针对传统的显式输入信号而言,根据互信息值的定义式来计算显式输入信号x与学生泛化特征c之间的互信息值,该定义式如下:
$$I(x;c) = \sum_{x,c} p(x,c)\,\log\frac{p(x|c)}{p(x)}$$
在上述过程中,I(x;c)表示显式输入信号x与学生泛化特征c之间的互信息值,p(x)为显式输入信号x服从的概率分布,p(x|c)为显式输入信号x在具备学生泛化特征c时的条件概率分布,p(x,c)为显式输入信号x与学生泛化特征c之间的联合分布。由于显式输入信号能够直接获取到p(x)或者p(x|c),从而能够直接依据定义式来计算互信息值。
在本申请实施例中,由于输入的样本混合信号中不能直接观察到目标对象的音频信号,也即是说,样本混合信号是一种隐式输入信号(这是由无监督学习的性质而决定的),那么在计算互信息值的时候,就无法像传统的显式输入信号那样,通过获取p(x)或者p(x|c)来计算互信息值,但是,基于本申请实施例引入的ImNICE损失函数值,避免了获取p(x)或者p(x|c),而是通过获取f Θ(x,c)来计算互信息值,由于f Θ(x,c)正比于p(x|c)与p(x)之间的概率密度比值,因此f Θ(x,c)能够表征互信息值,从而解决了在无监督学习中无法计算隐式输入信号与学生泛化特征之间的互信息值的问题。
需要说明的是,由于在上述ImNICE损失函数值中还引入了额外的统计约束P≈p(x,c),这个统计约束p(x,c)为样本混合信号x与学生泛化特征c之间的联合分布,p(x,c)由教师模型来进行预测,在每次迭代过程中,教师模型的第二萃取网络A ψ′执行下述操作:
$$A_{\psi'}: v \mapsto p,\quad v \times p \mapsto c$$
服务器取第二萃取网络A ψ′计算得到的一个中间预测值p作为联合分布p(x,c)的估计值。
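Given bilinear scores T_omega for one positive time-frequency point and K sampled noise or interference points, a hedged InfoNCE-style reading of the ImNICE objective can be sketched as follows; the exact normalization and sampling scheme of the published loss may differ from this assumption.

```python
import torch

def imnice_loss(scores_pos, scores_neg):
    """scores_pos: (N,) scores T_omega(E_theta(x), c) for positive bins;
    scores_neg: (N, K) scores for K negative bins drawn from the proposal distribution."""
    f_pos = scores_pos.exp()                     # f_Theta(x, c) for positives
    f_neg = scores_neg.exp().sum(dim=1)          # sum of f_Theta(x', c) over negatives
    return -torch.log(f_pos / (f_pos + f_neg)).mean()
```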
3033、服务器将该均方误差或者该互信息值中至少一项确定为本次迭代过程的损失函数值。
在上述过程中,服务器通过获取均方误差,能够保证教师模型和学生模型之间的一致性学习(consistency-based learning),若不符合停止训练条件,通过下述步骤304更新学生模型的第一编码网络以及第一萃取网络的参数集,均方误差是典型的重建任务的损失函数,基于均方误差来进行一致性学习,能够在一定程度上保证中间学习到的学生泛化特征相对于目标对象的音频信号之间的稳定一致性。
在上述过程中,服务器通过获取互信息值,能够针对无监督学习的训练流程提供计算模块,用于获取学生模型中样本混合信号与学生泛化特征之间的互信息值,具体地,通过引入概率密度比值f Θ(x,c)以及统计约束p(x,c)来估算学生模型的互信息值,训练目标是最小 化均方误差且最大化互信息值。
304、若该损失函数值不符合停止训练条件,服务器对该学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于该下一次迭代过程的学生模型执行下一次迭代过程。
可选地,该停止训练条件为在连续第一目标次数的迭代过程中该均方误差没有减小;或,该停止训练条件为该均方误差小于或等于第一目标阈值且该互信息值大于或等于第二目标阈值;或,该停止训练条件为迭代次数到达第二目标次数。
服务器获取学生模型在本次迭代过程的损失函数值之后,判断本次迭代过程的损失函数值是否满足停止训练条件,若不符合停止训练条件,基于上述步骤304更新得到下一次迭代过程的学生模型,进而返回执行上述步骤3011-步骤3014,获取下一次迭代过程的教师模型,基于下一次迭代过程的教师模型和学生模型执行与上述步骤302-步骤303类似的操作,从而完成下一次的迭代训练,以此类推,这里不做赘述,在经历过多次迭代之后,直到某一次迭代过程的损失函数值满足停止训练条件,执行下述步骤305。
305、若该损失函数值符合停止训练条件,服务器基于本次迭代过程的学生模型或教师模型,获取编码网络和萃取网络。
在一些实施例中,服务器基于本次迭代过程的学生模型获取编码网络和萃取网络,也即是说,服务器分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为该编码网络和该萃取网络。
在一些实施例中,服务器还基于本次迭代过程的教师模型获取编码网络和萃取网络,也即是说,服务器分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为该编码网络和该萃取网络。
在上述过程中,服务器基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到该编码网络以及该萃取网络,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。随着教师模型与学生模型的协同迭代训练和一致性学习,能够保证损失函数中均方误差趋于最小化且互信息值趋于最大化,若达到了停止训练条件,说明满足了预先设定的训练目标,不管是本次迭代过程的教师模型还是学生模型,均能够取为编码网络和萃取网络,本申请实施例不对基于教师模型还是学生模型来获取最终的编码网络及萃取网络进行具体限定。
图4是本申请实施例提供的一种编码网络及萃取网络的训练方法的原理性示意图,请参考图4,在训练集中设置一组未标注(unlabeled)的样本混合信号410(speech)以及一组干扰信号411(noises),通过学生模型的第一编码网络420(或者教师模型的第二编码网络),分别将样本混合信号410以及干扰信号411映射至高维的嵌入空间(embedding space),得到样本混合信号410以及干扰信号411的学生嵌入特征412(或者教师嵌入特征),通过学生模型的第一萃取网络421(或者教师模型的第二萃取网络),分别对样本混合信号410以及干扰信号411的学生嵌入特征412(或者教师嵌入特征)进行递归加权处理,得到样本混合信号以及干扰信号的学生泛化特征413(或者教师泛化特征),基于样本混合信号以及干扰信号的学生嵌入特征412以及学生泛化特征413,能够通过计算模块422获取本次迭代过程的损失函数值414(unsupervised loss,也即是无监督损失函数值),该损失函数值414包括均方误差或者ImNICE损失函数值(互信息值)中至少一项,在一些实施例中,针对计算模块获取的互信息值,还绘制出互信息值的热力图415,在热力图415中浅色区域的时频点属于目标说话人的语音的概率较大,深色区域的时频点属于噪声或干扰的概率越大,也即是说,在 热力图中随着颜色从浅到深,代表对应位置的时频点属于噪声的概率逐渐增大,便于直观地观察各个时频点服从的热力分布。
其中,为了简约的表示各个网络的输入与输出,采用χ表示第一编码网络420的输入信号(也即是样本混合信号410以及干扰信号411),采用v表示第一编码网络420的输出信号(也即是学生嵌入特征412),当然,第一萃取网络421的输入信号也为v,采用c表示第一萃取网络421的输出信号(也即是学生泛化特征413),计算模块422的输入信号包括v以及c,采用R表示计算模块422所输出的损失函数值414。
在一个示例性场景中,在获取训练集中样本混合信号的STFT谱时,将采样率设置为16KHz,将STFT窗长设置为25ms,将STFT窗移设置为10ms,将STFT频带个数设置为257。在针对学生模型以及教师模型进行训练优化时,设置批处理数据的大小为32,初始学习率为0.0001,学习率的权重下降系数为0.8,此外,若模型的MSE(均方误差)损失连续3次迭代过程都没有改善时,认为训练达到收敛并停止训练。
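With the settings quoted above (16 kHz sampling, 25 ms window, 10 ms hop, 257 bands), the STFT front end can be sketched as below; choosing n_fft = 512 is an assumption that yields 257 frequency bins, and the Hann window is likewise an illustrative choice.

```python
import torch

def stft_features(wav, sr=16000, win_ms=25, hop_ms=10, n_fft=512):
    """wav: (B, L) waveform batch -> (B, T, F) magnitude spectrogram with F = n_fft // 2 + 1 = 257."""
    win_length = sr * win_ms // 1000                 # 400 samples
    hop_length = sr * hop_ms // 1000                 # 160 samples
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop_length, win_length=win_length,
                      window=torch.hann_window(win_length), return_complex=True)
    return spec.abs().transpose(1, 2)                # (B, T, F)
```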
在一个示例性场景中,针对学生模型的第一编码网络,将第一编码网络的输出层节点数设置为40,每段训练语料随机降采样帧数为32,计算ImNICE损失函数值时每个正样本对应的负样本个数为63,正样本预测概率p(x,c)的判定阈值为0.5。
在一个示例性场景中,第一编码网络为4层BLSTM网络,每个隐藏层(隐层)节点数为600,输出层为一个全连接层,能够将最后一个隐层输出的600维的隐向量(输出特征)映射到一个275*40维的高维嵌入空间v,得到一个275*40维的嵌入特征,将该275*40维的嵌入特征输入第一萃取网络,该第一萃取网络中包含一个全连接层和一个2层BLSTM网络,通过全连接层能够将275*40维的嵌入特征(本质上也是一个隐向量)映射到600维,将600维的隐向量输入到2层BLSTM网络中,其中每个隐层节点数为600,最终输出泛化特征,在计算模块中采用一个简单的加权矩阵(比如二值阈值矩阵)
$\omega$
用于计算向量之间的内积：$T_\omega(v,c) = v^{\top}\omega\,c$，其中，$T_\omega(v,c)$ 表示计算模块，v表示嵌入特征，$v^{\top}$ 表示嵌入特征的转置向量，ω表示加权矩阵，c表示泛化特征。
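A sketch of the estimator T_omega as a single learnable bilinear form over a D-dimensional embedding and a D-dimensional per-frame generalized feature; the class name, the 40-dimensional default, and the initialization scale are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BilinearEstimator(nn.Module):
    """T_omega(v, c) = v^T * omega * c, evaluated for every time-frequency bin."""

    def __init__(self, embed_dim=40):
        super().__init__()
        self.omega = nn.Parameter(0.01 * torch.randn(embed_dim, embed_dim))

    def forward(self, v, c):        # v: (B, T, F, D), c: (B, T, D)
        return torch.einsum('btfd,de,bte->btf', v, self.omega, c)   # (B, T, F) scores
```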
在上述过程中,超参数选取以及模型结构仅为一种示例性说明,在一些实施例中,根据复杂度和性能的需求,调整改变第一编码网络或第一萃取网络中BLSTM网络的层级数目,或者,还调整改变第一编码网络或第一萃取网络的网络结构,比如采用LSTM网络、CNN、TDNN或者闸控CNN中至少一项,此外,根据场景对模型内存占用的限制以及检测准确率的要求,还对第一编码网络或者第一萃取网络的网络结构进行拓展或者简化。
上述所有可选技术方案,采用任意结合形成本申请的可选实施例,在此不再一一赘述。
本申请实施例提供的方法,基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到该编码网络以及该萃取网络,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得,随着教师模型与学生模型的协同迭代训练和一致性学习,能够从未标注的、有干扰的样本混合信号中,有效地学习到鲁棒的、可泛化的隐藏信号表征(也即是目标分量的泛化特征),从而能够适用于各种各样的工业应用场景,有助于提升音频处理过程的准确性。
进一步地,当训练的数据场景和真实的测试场景差异越明显(也即是越不匹配)时,无监督系统所提取的泛化特征就具有越明显的优势,另外,在工业应用场景中往往存在大量的未标注数据,这些数据直接作为无监督系统的训练样本,而无需送去进行人工标注,避免了 针对训练数据进行标注的人力成本,也即是说,无监督系统能够挖掘利用更多的训练数据。
在大多数采用语音增强、语音分离的工业应用中,带标注的训练数据(指包含目标对象的干净音频信号的训练样本)往往只能覆盖很小一部分的应用场景,大量的数据是无标注的,在传统的有监督系统中,需要对无标注的数据进行人工标注,耗费较高的人力成本,在本申请实施例中,提出了一种新颖的无监督损失函数,以及基于无监督学习的训练方法,能够开发海量的未标注训练数据,不必对未标注的训练数据进行人工标注,节约了人力成本,且提升了训练数据的获取效率。
此外,仅仅依靠有标注数据的监督学习存在鲁棒性差、泛化性差的问题,比如,一个仅仅采用监督学习针对某一类有干扰的说话环境中学习到的语音表征,往往不能适用于另一类干扰的背景噪声环境,而在本申请实施例中,无监督系统能够提取到目标分量的泛化特征,上述泛化特征并非是针对某一类干扰而进行提取的,而是在错综复杂的无标注数据中提取到的具有高鲁棒性、可泛化性的特征,能够适用于大多数的音频处理场景。
相较于传统的DANet、ADANet以及ODANet而言,首先,DANet在训练阶段需要数据库的embeddings(嵌入向量)分配作为输入,因此存在着训练-测试之间的embeddings中心不匹配的问题,其次,ADANet中通过引入PIT(Permutation Invariant Training,排列不变式训练方法)方法来缓解上述embeddings中心不匹配的问题,PIT方法通过计算所有可能的输入排列中所选目标函数的最低值来确定正确的输出排列,然而在遍布全排列的过程中,PIT方法自然会带来大量的计算复杂度,导致提取特征时的计算代价大量增加,最后,ODANet中针对每个音频帧估计一个抽象表征,基于该估计的抽象表征来计算未来时刻音频帧的掩膜(mask),以此类推,然而,ODANet易于导致不稳定的目标说话人追踪以及mask估计,为了提升性能的稳定性,还需要额外引入专家定义的动态加权函数,并且还需要对上下文窗长进行仔细调整和选择。
而在本申请实施例中,不需要进行额外的PIT处理,因此能够保证较小的计算代价,不需要引入说话人追踪机制也无需进行专家定义处理和调节,因此能够大大节约编码网络和萃取网络的训练成本,而且基于无标注的训练数据,能够自动学习到隐藏的目标分量(通常是目标说话人)的泛化特征,基于上述泛化特征来进行音频处理,能够有效地解决鸡尾酒会问题,针对较为困难的单通道语音分离任务也具有良好的表现,能够适用于各类工业场景,具有较高的音频处理准确性。
图5是本申请实施例提供的一种音频信号处理装置的结构示意图,请参考图5,该装置包括:
嵌入处理模块501,用于对混合音频信号进行嵌入处理,得到该混合音频信号的嵌入特征;
特征提取模块502,用于对该嵌入特征进行泛化特征提取,得到该混合音频信号中目标分量的泛化特征,该目标分量对应于该混合音频信号中目标对象的音频信号;
信号处理模块503,用于基于该目标分量的泛化特征进行音频信号处理。
本申请实施例提供的装置,通过对混合音频信号进行嵌入处理,得到该混合音频信号的嵌入特征,对该嵌入特征进行泛化特征提取,能够提取得到该混合音频信号中目标分量的泛化特征,该目标分量对应于该混合音频信号中目标对象的音频信号,基于该目标分量的泛化特征进行音频信号处理,由于目标分量的泛化特征并非是针对某一类特定场景下的声音特征, 具有较好的泛化能力和表达能力,因此基于目标分量的泛化特征进行音频信号处理时,能够良好地适用于不同的场景,提升了音频信号处理过程的鲁棒性和泛化性,提升了音频信号处理的准确性。
在一种可能实施方式中,嵌入处理模块501,用于将混合音频信号输入编码网络,通过该编码网络对该混合音频信号进行嵌入处理,得到该混合音频信号的嵌入特征;
特征提取模块502,用于将该嵌入特征输入萃取网络,通过该萃取网络对该嵌入特征进行泛化特征提取,得到该混合音频信号中目标分量的泛化特征,该目标分量对应于该混合音频信号中目标对象的音频信号。
在一种可能实施方式中,该嵌入处理模块501用于:
将该混合音频信号映射至嵌入空间,得到该嵌入特征。
在一种可能实施方式中,该特征提取模块502用于对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一种可能实施方式中,该萃取网络为自回归模型,该特征提取模块502用于:
将该嵌入特征输入该自回归模型,通过该自回归模型对该嵌入特征进行递归加权处理,得到该目标分量的泛化特征。
在一种可能实施方式中,基于图5的装置组成,该装置还包括:
训练模块,用于基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到该编码网络以及该萃取网络,其中,该学生模型包括第一编码网络和第一萃取网络,该教师模型包括第二编码网络和第二萃取网络,该第一编码网络的输出作为该第一萃取网络的输入,该第二编码网络的输出作为该第二萃取网络的输入,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。
在一种可能实施方式中,基于图5的装置组成,该训练模块包括:
第一获取单元,用于在任一次迭代过程中,基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型;
输出单元,用于将该未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出该样本混合信号中目标分量的教师泛化特征以及学生泛化特征;
第二获取单元,用于基于该样本混合信号、该教师泛化特征或者该学生泛化特征中至少一项,获取本次迭代过程的损失函数值;
参数调整单元,用于若该损失函数值不符合停止训练条件,对该学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于该下一次迭代过程的学生模型执行下一次迭代过程;
第三获取单元,用于若该损失函数值符合该停止训练条件,基于本次迭代过程的学生模型或教师模型,获取该编码网络和该萃取网络。
在一种可能实施方式中,该第二获取单元用于:
获取该教师泛化特征以及该学生泛化特征之间的均方误差;
获取该样本混合信号与该学生泛化特征之间的互信息值;
将该均方误差或者该互信息值中至少一项确定为本次迭代过程的损失函数值。
在一种可能实施方式中,该停止训练条件为在连续第一目标次数的迭代过程中该均方误差没有减小;或,
该停止训练条件为该均方误差小于或等于第一目标阈值且该互信息值大于或等于第二目 标阈值;或,
该停止训练条件为迭代次数到达第二目标次数。
在一种可能实施方式中,该第一获取单元用于:
将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集;
将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,该第一平滑系数与该第二平滑系数相加所得的数值为1;
将该第一参数集与该第二参数集之和确定为本次迭代过程的教师模型的参数集;
基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
在一种可能实施方式中,该第三获取单元用于:
分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为该编码网络和该萃取网络;或,
分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为该编码网络和该萃取网络。
在一种可能实施方式中,该信号处理模块503用于:
基于该目标分量的泛化特征,对该目标对象的音频信号进行文语转换,输出该目标对象的音频信号对应的文本信息;或,
基于该目标分量的泛化特征,对该目标对象的音频信号进行声纹识别,输出该目标对象的音频信号对应的声纹识别结果;或,
基于该目标分量的泛化特征,生成该目标对象的音频信号对应的应答语音,输出该应答语音。
上述所有可选技术方案,采用任意结合形成本申请的可选实施例,在此不再一一赘述。
需要说明的是:上述实施例提供的音频信号处理装置在处理音频信号时,仅以上述各功能模块的划分进行举例说明,应用中,根据需要而将上述功能分配由不同的功能模块完成,即将电子设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的音频信号处理装置与音频信号处理方法实施例属于同一构思,其具体实现过程详见音频信号处理方法实施例,这里不再赘述。
在一些实施例中,本申请实施例所涉及的电子设备是一种终端,图6是本申请实施例提供的一种终端的结构示意图,请参考图6,该终端600是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端600还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端600包括有:处理器601和存储器602。
处理器601包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器601采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器601也包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处 理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器601在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器601还包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器602包括一个或多个计算机可读存储介质,该计算机可读存储介质是非暂态的。存储器602还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器602中的非暂态的计算机可读存储介质用于存储至少一个程序代码,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所述目标分量对应于所述混合音频信号中目标对象的音频信号;
基于所述目标分量的泛化特征进行音频信号处理。
在一些实施例中,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
将所述混合音频信号映射至嵌入空间,得到所述嵌入特征。
在一些实施例中,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一些实施例中,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
将混合音频信号输入编码网络,通过所述编码网络对所述混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
在一些实施例中,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征。
在一些实施例中,所述萃取网络为自回归模型,该至少一个程序代码用于被处理器601所执行以实现如下步骤:将所述嵌入特征输入所述自回归模型,通过所述自回归模型对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一些实施例中,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络,其中,所述学生模型包括第一编码网络和第一萃取网络,所述教师模型包括第二编码网络和第二萃取网络,所述第一编码网络的输出作为所述第一萃取网络的输入,所述第二编码网络的输出作为所述第二萃取网络的输入,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。
在一些实施例中,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
在任一次迭代过程中,基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型;
将所述未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出所述样本混合信号中目标分量的教师泛化特征以及学生泛化特征;
基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值;
若所述损失函数值不符合停止训练条件,对所述学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于所述下一次迭代过程的学生模型执行下一次迭代过程;
若所述损失函数值符合所述停止训练条件,基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络。
在一些实施例中,该至少一个程序代码用于被处理器601所执行以实现如下步骤:
获取所述教师泛化特征以及所述学生泛化特征之间的均方误差;
获取所述样本混合信号与所述学生泛化特征之间的互信息值;
将所述均方误差或者所述互信息值中至少一项确定为本次迭代过程的损失函数值。
在一些实施例中,所述停止训练条件为在连续第一目标次数的迭代过程中所述均方误差没有减小;或,
所述停止训练条件为所述均方误差小于或等于第一目标阈值且所述互信息值大于或等于第二目标阈值;或,
所述停止训练条件为迭代次数到达第二目标次数。
该至少一个程序代码用于被处理器601所执行以实现如下步骤:
将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集;
将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,所述第一平滑系数与所述第二平滑系数相加所得的数值为1;
将所述第一参数集与所述第二参数集之和确定为本次迭代过程的教师模型的参数集;
基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
该至少一个程序代码用于被处理器601所执行以实现如下步骤:
分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为所述编码网络和所述萃取网络;或,
分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为所述编码网络和所述萃取网络。
该至少一个程序代码用于被处理器601所执行以实现如下步骤:
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行文语转换,输出所述目标对象的音频信号对应的文本信息;或,
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行声纹识别,输出所述目标对象的音频信号对应的声纹识别结果;或,
基于所述目标分量的泛化特征,生成所述目标对象的音频信号对应的应答语音,输出所述应答语音。
在一些实施例中,终端600还可选包括有:外围设备接口603和至少一个外围设备。处理器601、存储器602和外围设备接口603之间通过总线或信号线相连。各个外围设备通过总线、信号线或电路板与外围设备接口603相连。具体地,外围设备包括:射频电路604、触摸显示屏605、摄像头组件606、音频电路607、定位组件608和电源609中的至少一种。
外围设备接口603可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器601和存储器602。在一些实施例中,处理器601、存储器602和外围设备接口603被集成在同一芯片或电路板上;在一些其他实施例中,处理器601、存储器602和外围设备接口603中的任意一个或两个在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路604用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射 频电路604通过电磁信号与通信网络以及其他通信设备进行通信。射频电路604将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路604包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。射频电路604通过至少一种无线通信协议来与其它终端进行通信。该无线通信协议包括但不限于:城域网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路604还包括NFC(Near Field Communication,近距离无线通信)有关的电路,本申请对此不加以限定。
显示屏605用于显示UI(User Interface,用户界面)。该UI包括图形、文本、图标、视频及其它们的任意组合。当显示屏605是触摸显示屏时,显示屏605还具有采集在显示屏605的表面或表面上方的触摸信号的能力。该触摸信号作为控制信号输入至处理器601进行处理。此时,显示屏605还用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏605为一个,设置终端600的前面板;在另一些实施例中,显示屏605为至少两个,分别设置在终端600的不同表面或呈折叠设计;在再一些实施例中,显示屏605是柔性显示屏,设置在终端600的弯曲表面上或折叠面上。甚至,显示屏605还设置成非矩形的不规则图形,也即异形屏。显示屏605采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头组件606用于采集图像或视频。可选地,摄像头组件606包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件606还包括闪光灯。闪光灯是单色温闪光灯,也是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,用于不同色温下的光线补偿。
音频电路607包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器601进行处理,或者输入至射频电路604以实现语音通信。出于立体声采集或降噪的目的,麦克风为多个,分别设置在终端600的不同部位。麦克风还是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器601或射频电路604的电信号转换为声波。扬声器是传统的薄膜扬声器,也是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅将电信号转换为人类可听见的声波,也将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路607还包括耳机插孔。
定位组件608用于定位终端600的当前地理位置,以实现导航或LBS(Location Based Service,基于位置的服务)。定位组件608是基于美国的GPS(Global Positioning System,全球定位系统)、中国的北斗系统、俄罗斯的格雷纳斯系统或欧盟的伽利略系统的定位组件。
电源609用于为终端600中的各个组件进行供电。电源609是交流电、直流电、一次性电池或可充电电池。当电源609包括可充电电池时,该可充电电池支持有线充电或无线充电。该可充电电池还用于支持快充技术。
在一些实施例中,终端600还包括有一个或多个传感器610。该一个或多个传感器610包括但不限于:加速度传感器611、陀螺仪传感器612、压力传感器613、指纹传感器614、光学传感器615以及接近传感器616。
加速度传感器611检测以终端600建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器611用于检测重力加速度在三个坐标轴上的分量。处理器601根据加速度传感器611采集的重力加速度信号,控制触摸显示屏605以横向视图或纵向视图进行用户界面的显示。加速度传感器611还用于游戏或者用户的运动数据的采集。
陀螺仪传感器612检测终端600的机体方向及转动角度,陀螺仪传感器612与加速度传感器611协同采集用户对终端600的3D动作。处理器601根据陀螺仪传感器612采集的数据,实现如下功能:动作感应(比如根据用户的倾斜操作来改变UI)、拍摄时的图像稳定、游戏控制以及惯性导航。
压力传感器613设置在终端600的侧边框和/或触摸显示屏605的下层。当压力传感器613设置在终端600的侧边框时,检测用户对终端600的握持信号,由处理器601根据压力传感器613采集的握持信号进行左右手识别或快捷操作。当压力传感器613设置在触摸显示屏605的下层时,由处理器601根据用户对触摸显示屏605的压力操作,实现对UI界面上的可操作性控件进行控制。可操作性控件包括按钮控件、滚动条控件、图标控件、菜单控件中的至少一种。
指纹传感器614用于采集用户的指纹,由处理器601根据指纹传感器614采集到的指纹识别用户的身份,或者,由指纹传感器614根据采集到的指纹识别用户的身份。在识别出用户的身份为可信身份时,由处理器601授权该用户执行相关的敏感操作,该敏感操作包括解锁屏幕、查看加密信息、下载软件、支付及更改设置等。指纹传感器614被设置终端600的正面、背面或侧面。当终端600上设置有物理按键或厂商Logo时,指纹传感器614与物理按键或厂商Logo集成在一起。
光学传感器615用于采集环境光强度。在一个实施例中,处理器601根据光学传感器615采集的环境光强度,控制触摸显示屏605的显示亮度。具体地,当环境光强度较高时,调高触摸显示屏605的显示亮度;当环境光强度较低时,调低触摸显示屏605的显示亮度。在另一个实施例中,处理器601还根据光学传感器615采集的环境光强度,动态调整摄像头组件606的拍摄参数。
接近传感器616,也称距离传感器,通常设置在终端600的前面板。接近传感器616用于采集用户与终端600的正面之间的距离。在一个实施例中,当接近传感器616检测到用户与终端600的正面之间的距离逐渐变小时,由处理器601控制触摸显示屏605从亮屏状态切换为息屏状态;当接近传感器616检测到用户与终端600的正面之间的距离逐渐变大时,由处理器601控制触摸显示屏605从息屏状态切换为亮屏状态。
本领域技术人员理解,图6中示出的结构并不构成对终端600的限定,包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
在一些实施例中,本申请实施例所涉及的电子设备是一种服务器,图7是本申请实施例提供的一种服务器的结构示意图,请参考图7,该服务器700可因配置或性能不同而产生比较大的差异,包括一个或一个以上处理器(Central Processing Units,CPU)701和一个或一个以上的存储器702,其中,该存储器702中存储有至少一条程序代码,该至少一条程序代码由该处理器701加载并执行以实现如下步骤:
对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所 述目标分量对应于所述混合音频信号中目标对象的音频信号;
基于所述目标分量的泛化特征进行音频信号处理。
在一些实施例中,该至少一个程序代码用于被处理器701所执行以实现如下步骤:
将所述混合音频信号映射至嵌入空间,得到所述嵌入特征。
在一些实施例中,该至少一个程序代码用于被处理器701所执行以实现如下步骤:
对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一些实施例中,该至少一个程序代码用于被处理器701所执行以实现如下步骤:
将混合音频信号输入编码网络,通过所述编码网络对所述混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
在一些实施例中,该至少一个程序代码用于被处理器701所执行以实现如下步骤:
将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征。
在一些实施例中,所述萃取网络为自回归模型,该至少一个程序代码用于被处理器701所执行以实现如下步骤:将所述嵌入特征输入所述自回归模型,通过所述自回归模型对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一些实施例中,该至少一个程序代码用于被处理器701所执行以实现如下步骤:
基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络,其中,所述学生模型包括第一编码网络和第一萃取网络,所述教师模型包括第二编码网络和第二萃取网络,所述第一编码网络的输出作为所述第一萃取网络的输入,所述第二编码网络的输出作为所述第二萃取网络的输入,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。
在一些实施例中,该至少一个程序代码用于被处理器701所执行以实现如下步骤:
在任一次迭代过程中,基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型;
将所述未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出所述样本混合信号中目标分量的教师泛化特征以及学生泛化特征;
基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值;
若所述损失函数值不符合停止训练条件,对所述学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于所述下一次迭代过程的学生模型执行下一次迭代过程;
若所述损失函数值符合所述停止训练条件,基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络。
在一些实施例中,该至少一个程序代码用于被处理器701所执行以实现如下步骤:
获取所述教师泛化特征以及所述学生泛化特征之间的均方误差;
获取所述样本混合信号与所述学生泛化特征之间的互信息值;
将所述均方误差或者所述互信息值中至少一项确定为本次迭代过程的损失函数值。
在一些实施例中,所述停止训练条件为在连续第一目标次数的迭代过程中所述均方误差没有减小;或,
所述停止训练条件为所述均方误差小于或等于第一目标阈值且所述互信息值大于或等于第二目标阈值;或,
所述停止训练条件为迭代次数到达第二目标次数。
该至少一个程序代码用于被处理器701所执行以实现如下步骤:
将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集;
将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,所述第一平滑系数与所述第二平滑系数相加所得的数值为1;
将所述第一参数集与所述第二参数集之和确定为本次迭代过程的教师模型的参数集;
基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
该至少一个程序代码用于被处理器701所执行以实现如下步骤:
分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为所述编码网络和所述萃取网络;或,
分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为所述编码网络和所述萃取网络。
该至少一个程序代码用于被处理器701所执行以实现如下步骤:
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行文语转换,输出所述目标对象的音频信号对应的文本信息;或,
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行声纹识别,输出所述目标对象的音频信号对应的声纹识别结果;或,
基于所述目标分量的泛化特征,生成所述目标对象的音频信号对应的应答语音,输出所述应答语音。
当然,该服务器700还具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器700还包括其他用于实现设备功能的部件,在此不做赘述。
在示例性实施例中,还提供了一种计算机可读存储介质,例如包括至少一条程序代码的存储器,上述至少一条程序代码可由电子设备中的处理器执行以完成如下步骤:
对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所述目标分量对应于所述混合音频信号中目标对象的音频信号;
基于所述目标分量的泛化特征进行音频信号处理。
在一些实施例中,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
将所述混合音频信号映射至嵌入空间,得到所述嵌入特征。
在一些实施例中,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一些实施例中,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
将混合音频信号输入编码网络,通过所述编码网络对所述混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
在一些实施例中,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步 骤:
将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征。
在一些实施例中,所述萃取网络为自回归模型,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:将所述嵌入特征输入所述自回归模型,通过所述自回归模型对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
在一些实施例中,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络,其中,所述学生模型包括第一编码网络和第一萃取网络,所述教师模型包括第二编码网络和第二萃取网络,所述第一编码网络的输出作为所述第一萃取网络的输入,所述第二编码网络的输出作为所述第二萃取网络的输入,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。
在一些实施例中,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
在任一次迭代过程中,基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型;
将所述未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出所述样本混合信号中目标分量的教师泛化特征以及学生泛化特征;
基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值;
若所述损失函数值不符合停止训练条件,对所述学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于所述下一次迭代过程的学生模型执行下一次迭代过程;
若所述损失函数值符合所述停止训练条件,基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络。
在一些实施例中,该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
获取所述教师泛化特征以及所述学生泛化特征之间的均方误差;
获取所述样本混合信号与所述学生泛化特征之间的互信息值;
将所述均方误差或者所述互信息值中至少一项确定为本次迭代过程的损失函数值。
在一些实施例中,所述停止训练条件为在连续第一目标次数的迭代过程中所述均方误差没有减小;或,
所述停止训练条件为所述均方误差小于或等于第一目标阈值且所述互信息值大于或等于第二目标阈值;或,
所述停止训练条件为迭代次数到达第二目标次数。
该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集;
将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,所述第一平滑系数与所述第二平滑系数相加所得的数值为1;
将所述第一参数集与所述第二参数集之和确定为本次迭代过程的教师模型的参数集;
基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为所述编码网络和所述萃取网络;或,
分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为所述编码网络和所述萃取网络。
该至少一个程序代码用于被电子设备中的处理器所执行以实现如下步骤:
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行文语转换,输出所述目标对象的音频信号对应的文本信息;或,
基于所述目标分量的泛化特征,对所述目标对象的音频信号进行声纹识别,输出所述目标对象的音频信号对应的声纹识别结果;或,
基于所述目标分量的泛化特征,生成所述目标对象的音频信号对应的应答语音,输出所述应答语音。
例如,该计算机可读存储介质是ROM(Read-Only Memory,只读存储器)、RAM(Random-Access Memory,随机存取存储器)、CD-ROM(Compact Disc Read-Only Memory,只读光盘)、磁带、软盘和光数据存储设备等。
本领域普通技术人员理解实现上述实施例的全部或部分步骤通过硬件来完成,也通过程序来指令相关的硬件完成,该程序存储于一种计算机可读存储介质中,上述提到的存储介质是只读存储器,磁盘或光盘等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (15)

  1. 一种音频信号处理方法,其特征在于,应用于电子设备,所述方法包括:
    对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
    对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所述目标分量对应于所述混合音频信号中目标对象的音频信号;
    基于所述目标分量的泛化特征进行音频信号处理。
  2. 根据权利要求1所述的方法,其特征在于,所述对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征包括:
    将所述混合音频信号映射至嵌入空间,得到所述嵌入特征。
  3. 根据权利要求1所述的方法,其特征在于,所述对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征包括:
    对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
  4. 根据权利要求1所述的方法,其特征在于,所述对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征包括:
    将混合音频信号输入编码网络,通过所述编码网络对所述混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
    所述对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征包括:
    将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征。
  5. 根据权利要求4所述的方法,其特征在于,所述萃取网络为自回归模型,所述将所述嵌入特征输入萃取网络,通过所述萃取网络对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征包括:
    将所述嵌入特征输入所述自回归模型,通过所述自回归模型对所述嵌入特征进行递归加权处理,得到所述目标分量的泛化特征。
  6. 根据权利要求4-5任一项所述的方法,其特征在于,所述方法还包括:
    基于未标注的样本混合信号,对教师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络,其中,所述学生模型包括第一编码网络和第一萃取网络,所述教师模型包括第二编码网络和第二萃取网络,所述第一编码网络的输出作为所述第一萃取网络的输入,所述第二编码网络的输出作为所述第二萃取网络的输入,每次迭代过程中的教师模型由上一次迭代过程的教师模型以及本次迭代过程的学生模型进行加权所得。
  7. 根据权利要求6所述的方法,其特征在于,所述基于未标注的样本混合信号,对教 师模型和学生模型进行协同迭代训练,得到所述编码网络以及所述萃取网络包括:
    在任一次迭代过程中,基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型;
    将所述未标注的样本混合信号分别输入本次迭代过程的教师模型和学生模型,分别输出所述样本混合信号中目标分量的教师泛化特征以及学生泛化特征;
    基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值;
    若所述损失函数值不符合停止训练条件,对所述学生模型的参数进行调整,得到下一次迭代过程的学生模型,基于所述下一次迭代过程的学生模型执行下一次迭代过程;
    若所述损失函数值符合所述停止训练条件,基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络。
  8. 根据权利要求7所述的方法,其特征在于,所述基于所述样本混合信号、所述教师泛化特征或者所述学生泛化特征中至少一项,获取本次迭代过程的损失函数值包括:
    获取所述教师泛化特征以及所述学生泛化特征之间的均方误差;
    获取所述样本混合信号与所述学生泛化特征之间的互信息值;
    将所述均方误差或者所述互信息值中至少一项确定为本次迭代过程的损失函数值。
  9. 根据权利要求8所述的方法,其特征在于,所述停止训练条件为在连续第一目标次数的迭代过程中所述均方误差没有减小;或,
    所述停止训练条件为所述均方误差小于或等于第一目标阈值且所述互信息值大于或等于第二目标阈值;或,
    所述停止训练条件为迭代次数到达第二目标次数。
  10. 根据权利要求7所述的方法,其特征在于,所述基于本次迭代过程的学生模型以及上一次迭代过程的教师模型,获取本次迭代过程的教师模型包括:
    将上一次迭代过程的教师模型的参数集与第一平滑系数相乘,得到第一参数集;
    将本次迭代过程的学生模型与第二平滑系数相乘,得到第二参数集,其中,所述第一平滑系数与所述第二平滑系数相加所得的数值为1;
    将所述第一参数集与所述第二参数集之和确定为本次迭代过程的教师模型的参数集;
    基于本次迭代过程的教师模型的参数集,对上一次迭代过程的教师模型进行参数更新,得到本次迭代过程的教师模型。
  11. 根据权利要求7所述的方法,其特征在于,所述基于本次迭代过程的学生模型或教师模型,获取所述编码网络和所述萃取网络包括:
    分别将本次迭代过程的学生模型中第一编码网络和第一萃取网络,确定为所述编码网络和所述萃取网络;或,
    分别将本次迭代过程的教师模型中第二编码网络和第二萃取网络,确定为所述编码网络和所述萃取网络。
  12. 根据权利要求1所述的方法,其特征在于,所述基于所述目标分量的泛化特征进行音频信号处理包括:
    基于所述目标分量的泛化特征,对所述目标对象的音频信号进行文语转换,输出所述目标对象的音频信号对应的文本信息;或,
    基于所述目标分量的泛化特征,对所述目标对象的音频信号进行声纹识别,输出所述目标对象的音频信号对应的声纹识别结果;或,
    基于所述目标分量的泛化特征,生成所述目标对象的音频信号对应的应答语音,输出所述应答语音。
  13. 一种音频信号处理装置,其特征在于,所述装置包括:
    嵌入处理模块,用于对混合音频信号进行嵌入处理,得到所述混合音频信号的嵌入特征;
    特征提取模块,用于对所述嵌入特征进行泛化特征提取,得到所述混合音频信号中目标分量的泛化特征,所述目标分量对应于所述混合音频信号中目标对象的音频信号;
    信号处理模块,用于基于所述目标分量的泛化特征进行音频信号处理。
  14. 一种电子设备,其特征在于,所述电子设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条程序代码,所述至少一条程序代码由所述一个或多个处理器加载并执行以实现如权利要求1至权利要求12任一项所述的音频信号处理方法所执行的操作。
  15. 一种存储介质,其特征在于,所述存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现如权利要求1至权利要求12任一项所述的音频信号处理方法所执行的操作。
PCT/CN2020/124132 2020-01-02 2020-10-27 音频信号处理方法、装置、电子设备及存储介质 WO2021135577A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20909391.3A EP4006901A4 (en) 2020-01-02 2020-10-27 METHOD AND APPARATUS FOR PROCESSING AUDIO SIGNAL, ELECTRONIC DEVICE, AND STORAGE MEDIA
US17/667,370 US20220165288A1 (en) 2020-01-02 2022-02-08 Audio signal processing method and apparatus, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010001636.3 2020-01-02
CN202010001636.3A CN111179961B (zh) 2020-01-02 2020-01-02 音频信号处理方法、装置、电子设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/667,370 Continuation US20220165288A1 (en) 2020-01-02 2022-02-08 Audio signal processing method and apparatus, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
WO2021135577A1 true WO2021135577A1 (zh) 2021-07-08
WO2021135577A9 WO2021135577A9 (zh) 2021-09-30

Family

ID=70652567

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124132 WO2021135577A1 (zh) 2020-01-02 2020-10-27 音频信号处理方法、装置、电子设备及存储介质

Country Status (4)

Country Link
US (1) US20220165288A1 (zh)
EP (1) EP4006901A4 (zh)
CN (1) CN111179961B (zh)
WO (1) WO2021135577A1 (zh)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179961B (zh) * 2020-01-02 2022-10-25 腾讯科技(深圳)有限公司 音频信号处理方法、装置、电子设备及存储介质
CN112071330B (zh) * 2020-09-16 2022-09-20 腾讯科技(深圳)有限公司 一种音频数据处理方法、设备以及计算机可读存储介质
CN112420057B (zh) * 2020-10-26 2022-05-03 四川长虹电器股份有限公司 基于距离编码的声纹识别方法、装置、设备及存储介质
CN112562726B (zh) * 2020-10-27 2022-05-27 昆明理工大学 一种基于mfcc相似矩阵的语音音乐分离方法
CN112331223A (zh) * 2020-11-09 2021-02-05 合肥名阳信息技术有限公司 一种给配音添加背景音乐的方法
CN112447183A (zh) * 2020-11-16 2021-03-05 北京达佳互联信息技术有限公司 音频处理模型的训练、音频去噪方法、装置及电子设备
CN112863521B (zh) * 2020-12-24 2022-07-05 哈尔滨理工大学 一种基于互信息估计的说话人识别方法
CN113539231B (zh) * 2020-12-30 2024-06-18 腾讯科技(深圳)有限公司 音频处理方法、声码器、装置、设备及存储介质
CN112957013B (zh) * 2021-02-05 2022-11-11 江西国科美信医疗科技有限公司 一种动态生命体征信号采集系统、监测装置及设备
CN113177602B (zh) * 2021-05-11 2023-05-26 上海交通大学 图像分类方法、装置、电子设备和存储介质
CN113380262B (zh) * 2021-05-13 2022-10-18 重庆邮电大学 一种基于注意力机制与扰动感知的声音分离方法
CN113488023B (zh) * 2021-07-07 2022-06-14 合肥讯飞数码科技有限公司 一种语种识别模型构建方法、语种识别方法
CN113380268A (zh) * 2021-08-12 2021-09-10 北京世纪好未来教育科技有限公司 模型训练的方法、装置和语音信号的处理方法、装置
CN113707123B (zh) * 2021-08-17 2023-10-20 慧言科技(天津)有限公司 一种语音合成方法及装置
CN115356694B (zh) * 2022-08-26 2023-08-22 哈尔滨工业大学(威海) 高频地波雷达抗冲击干扰方法、系统、计算机设备及应用
CN116258730B (zh) * 2023-05-16 2023-08-11 先进计算与关键软件(信创)海河实验室 一种基于一致性损失函数的半监督医学图像分割方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6832194B1 (en) * 2000-10-26 2004-12-14 Sensory, Incorporated Audio recognition peripheral system
US20110313953A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Automated Classification Pipeline Tuning Under Mobile Device Resource Constraints
CN108922518A (zh) * 2018-07-18 2018-11-30 苏州思必驰信息科技有限公司 语音数据扩增方法和系统
CN110288979A (zh) * 2018-10-25 2019-09-27 腾讯科技(深圳)有限公司 一种语音识别方法及装置
CN110459237A (zh) * 2019-04-12 2019-11-15 腾讯科技(深圳)有限公司 语音分离方法、语音识别方法及相关设备
CN110459240A (zh) * 2019-08-12 2019-11-15 新疆大学 基于卷积神经网络和深度聚类的多说话人语音分离方法
CN111179961A (zh) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 音频信号处理方法、装置、电子设备及存储介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180358003A1 (en) * 2017-06-09 2018-12-13 Qualcomm Incorporated Methods and apparatus for improving speech communication and speech interface quality using neural networks
US10811000B2 (en) * 2018-04-13 2020-10-20 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for recognizing simultaneous speech by multiple speakers
US10529349B2 (en) * 2018-04-16 2020-01-07 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for end-to-end speech separation with unfolded iterative phase reconstruction
CN108960407B (zh) * 2018-06-05 2019-07-23 出门问问信息科技有限公司 递归神经网路语言模型训练方法、装置、设备及介质
US11416741B2 (en) * 2018-06-08 2022-08-16 International Business Machines Corporation Teacher and student learning for constructing mixed-domain model
US10699700B2 (en) * 2018-07-31 2020-06-30 Tencent Technology (Shenzhen) Company Limited Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks
US10585989B1 (en) * 2018-09-07 2020-03-10 International Business Machines Corporation Machine-learning based detection and classification of personally identifiable information
US20200152330A1 (en) * 2018-11-13 2020-05-14 CurieAI, Inc. Scalable Personalized Treatment Recommendation
CN109523994A (zh) * 2018-11-13 2019-03-26 四川大学 一种基于胶囊神经网络的多任务语音分类方法
CN109637546B (zh) * 2018-12-29 2021-02-12 苏州思必驰信息科技有限公司 知识蒸馏方法和装置
CN110619887B (zh) * 2019-09-25 2020-07-10 电子科技大学 一种基于卷积神经网络的多说话人语音分离方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6832194B1 (en) * 2000-10-26 2004-12-14 Sensory, Incorporated Audio recognition peripheral system
US20110313953A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Automated Classification Pipeline Tuning Under Mobile Device Resource Constraints
CN108922518A (zh) * 2018-07-18 2018-11-30 苏州思必驰信息科技有限公司 语音数据扩增方法和系统
CN110288979A (zh) * 2018-10-25 2019-09-27 腾讯科技(深圳)有限公司 一种语音识别方法及装置
CN110459237A (zh) * 2019-04-12 2019-11-15 腾讯科技(深圳)有限公司 语音分离方法、语音识别方法及相关设备
CN110459240A (zh) * 2019-08-12 2019-11-15 新疆大学 基于卷积神经网络和深度聚类的多说话人语音分离方法
CN111179961A (zh) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 音频信号处理方法、装置、电子设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4006901A4

Also Published As

Publication number Publication date
US20220165288A1 (en) 2022-05-26
EP4006901A1 (en) 2022-06-01
CN111179961A (zh) 2020-05-19
EP4006901A4 (en) 2022-11-16
CN111179961B (zh) 2022-10-25
WO2021135577A9 (zh) 2021-09-30

Similar Documents

Publication Publication Date Title
WO2021135577A1 (zh) 音频信号处理方法、装置、电子设备及存储介质
WO2021135628A1 (zh) 语音信号的处理方法、语音分离方法
CN110364144B (zh) 一种语音识别模型训练方法及装置
CN111063342B (zh) 语音识别方法、装置、计算机设备及存储介质
CN111680123B (zh) 对话模型的训练方法、装置、计算机设备及存储介质
US20240105159A1 (en) Speech processing method and related device
CN111696570B (zh) 语音信号处理方法、装置、设备及存储介质
CN111863020B (zh) 语音信号处理方法、装置、设备及存储介质
CN112069309A (zh) 信息获取方法、装置、计算机设备及存储介质
CN113763933B (zh) 语音识别方法、语音识别模型的训练方法、装置和设备
CN110114765B (zh) 通过共享话语的上下文执行翻译的电子设备及其操作方法
CN111581958A (zh) 对话状态确定方法、装置、计算机设备及存储介质
CN115148197A (zh) 语音唤醒方法、装置、存储介质及系统
CN113822076A (zh) 文本生成方法、装置、计算机设备及存储介质
CN112384974A (zh) 电子装置和用于提供或获得用于训练电子装置的数据的方法
CN113129867A (zh) 语音识别模型的训练方法、语音识别方法、装置和设备
CN113761888A (zh) 文本翻译方法、装置、计算机设备及存储介质
US11315553B2 (en) Electronic device and method for providing or obtaining data for training thereof
WO2022227507A1 (zh) 唤醒程度识别模型训练方法及语音唤醒程度获取方法
CN113948060A (zh) 一种网络训练方法、数据处理方法及相关设备
CN115039169A (zh) 一种语音指令识别方法、电子设备以及非瞬态计算机可读存储介质
CN113409770A (zh) 发音特征处理方法、装置、服务器及介质
CN111341307A (zh) 语音识别方法、装置、电子设备及存储介质
CN116956814A (zh) 标点预测方法、装置、设备及存储介质
WO2021147417A1 (zh) 语音识别方法、装置、计算机设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20909391

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020909391

Country of ref document: EP

Effective date: 20220224

NENP Non-entry into the national phase

Ref country code: DE