WO2021135628A1 - Speech signal processing method and speech separation method - Google Patents

Speech signal processing method and speech separation method

Info

Publication number
WO2021135628A1
WO2021135628A1 (PCT/CN2020/126475)
Authority
WO
WIPO (PCT)
Prior art keywords
model
speech signal
student
teacher
signal
Prior art date
Application number
PCT/CN2020/126475
Other languages
English (en)
French (fr)
Inventor
王珺
林永业
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to EP20908929.1A priority Critical patent/EP3992965A4/en
Publication of WO2021135628A1 publication Critical patent/WO2021135628A1/zh
Priority to US17/674,677 priority patent/US20220172737A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Definitions

  • This application relates to the field of voice technology, and in particular to a voice signal processing method, voice separation method, device, computer equipment, and storage medium.
  • the embodiments of the present disclosure provide a voice signal processing method, voice separation method, device, computer equipment, and storage medium.
  • the technical scheme is as follows:
  • a method for processing a voice signal which is executed by a computer device, and includes:
  • inputting the mixed speech signal into the student model and the teacher model respectively, the mixed speech signal being marked with a clean speech signal used to generate the mixed speech signal, and the model parameters of the teacher model being configured based on the model parameters of the student model;
  • the model parameters of the student model and the teacher model are adjusted to obtain a voice separation model.
  • determining the accuracy information includes any one of the following:
  • determining the consistency information includes any one of the following:
  • determining the consistency information includes:
  • the consistency information is determined based on the short-time time-varying abstract features of the first clean speech signal and the short-time time-varying abstract features of the second clean speech signal.
  • determining the consistency information includes:
  • the consistency information is determined based on the weighted value of the third consistency information and the fourth consistency information.
  • adjusting the model parameters of the student model and the teacher model includes: adopting an exponential moving average method, determining the model parameters of the teacher model based on the model parameters of the student model, and using the determined model parameters to configure the teacher model.
  • the method further includes:
  • One iteration process corresponds to one accuracy information and one consistency information;
  • Obtaining the voice separation model includes:
  • the student model determined by the iterative process satisfying the stop training condition is output as the speech separation model.
  • the student model and the teacher model adopt a permutation invariant training (PIT) method for signal separation; or, the student model and the teacher model adopt a saliency-oriented selection mechanism for signal separation.
  • a voice separation method which is executed by a computer device, and includes:
  • the speech separation model is obtained through collaborative iterative training of the student model and the teacher model based on the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • the clean voice signal in the sound signal is predicted through the speech separation model, and the clean voice signal of the sound signal is output.
  • the loss function of the iterative process is constructed based on the accuracy information between the output of the student model and the training input of the student model, and the consistency information between the output of the student model and the output of the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the first accuracy information between the first clean speech signal output by the student model and the clean speech signal in the mixed speech signal; the second accuracy information between the first interference signal output by the student model and the interference signal in the mixed speech signal; the first consistency information between the first clean speech signal output by the student model and the second clean speech signal output by the teacher model; and the second consistency information between the first interference signal output by the student model and the second interference signal output by the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the short-time time-varying abstract features output by the student model and the short-time time-varying abstract features output by the teacher model, as well as the short-time time-varying abstract features output by the student model and the long-term stable abstract features output by the teacher model.
  • a voice signal processing device including:
  • the training module is used to input the mixed speech signal into the student model and the teacher model respectively, the mixed speech signal is marked with a clean speech signal used to generate the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • the accuracy determination module is used to determine accuracy information based on the signal output by the student model and the clean speech signal marked in the mixed speech signal input to the model, and the accuracy information is used to indicate the separation accuracy of the student model;
  • the consistency determination module is used to determine consistency information based on the signal output by the student model and the signal output by the teacher model, and the consistency information is used to indicate the degree of consistency between the separation abilities of the student model and the teacher model;
  • the adjustment module is used to adjust the model parameters of the student model and the teacher model based on multiple accuracy information and multiple consistency information to obtain a voice separation model.
  • the accuracy determination module is used to perform any of the following steps:
  • the consistency determination module is used to perform any of the following steps:
  • the consistency determination module is configured to determine the consistency information based on the short-time time-varying abstract features of the first clean speech signal and the short-time time-varying abstract features of the second clean speech signal.
  • the consistency determination module is used to:
  • the consistency information is determined based on the weighted value of the third consistency information and the fourth consistency information.
  • the adjustment module is used to adopt an exponential moving average method to determine the model parameters of the teacher model based on the model parameters of the student model, and to use the determined model parameters to configure the teacher model.
  • the device further includes an iterative acquisition module, which is used to iteratively execute the step of inputting the mixed voice signal into the student model and the teacher model respectively, and to acquire multiple pieces of the accuracy information and multiple pieces of the consistency information.
  • the iterative process corresponds to one accuracy information and one consistency information;
  • the iterative acquisition module is further configured to output the student model determined by the iterative process that satisfies the stop training condition as the speech separation model in response to satisfying the stop training condition.
  • the student model and the teacher model adopt a permutation invariant training (PIT) method for signal separation; or, the student model and the teacher model adopt a saliency-oriented selection mechanism for signal separation.
  • a voice separation device including:
  • the signal acquisition module is used to acquire the sound signal to be separated
  • the input module is used to input the sound signal into a speech separation model, the speech separation model is obtained through collaborative iterative training of the student model and the teacher model based on the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • the prediction module is used to predict the clean voice signal in the voice signal through the voice separation model, and output the clean voice signal of the voice signal.
  • the loss function of the iterative process is constructed based on the accuracy information between the output of the student model and the training input of the student model, and the consistency information between the output of the student model and the output of the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the first accuracy information between the first clean speech signal output by the student model and the clean speech signal in the mixed speech signal; the second accuracy information between the first interference signal output by the student model and the interference signal in the mixed speech signal; the first consistency information between the first clean speech signal output by the student model and the second clean speech signal output by the teacher model; and the second consistency information between the first interference signal output by the student model and the second interference signal output by the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the short-time time-varying abstract features output by the student model and the short-time time-varying abstract features output by the teacher model, as well as the short-time time-varying abstract features output by the student model and the long-term stable abstract features output by the teacher model.
  • In one aspect, a computer device is provided, including one or more processors and one or more memories, and at least one computer program is stored in the one or more memories; the at least one computer program is loaded and executed by the one or more processors to implement the voice signal processing method or the voice separation method as in any of the foregoing possible implementation manners.
  • In one aspect, a computer-readable storage medium is provided, and at least one computer program is stored in the computer-readable storage medium; the at least one computer program is loaded and executed by a processor to implement the voice signal processing method or the voice separation method as described in any of the foregoing possible implementation manners.
  • a computer program product or computer program includes one or more pieces of program code, and the one or more pieces of program code are stored in a computer-readable storage medium.
  • One or more processors of the computer device can read the one or more pieces of program code from the computer-readable storage medium, and the one or more processors execute the one or more pieces of program code, so that the computer device can execute the voice signal processing method or the voice separation method of any of the foregoing possible implementation manners.
  • FIG. 1 is a schematic diagram of an implementation environment of a training method for a speech separation model provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the principle of a method for training a voice separation model provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a method for training a voice separation model provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a process of processing a mixed voice signal by a student model provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of the internal structure of the student model provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a voice separation method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a training device for a voice separation model provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a voice separation processing device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • Artificial Intelligence (AI) uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Speech Technology includes Automatic Speech Recognition (ASR), Text To Speech (TTS), and voiceprint recognition technology. Enabling computers to be able to listen, see, speak, and feel is the future development direction of human-computer interaction, among which voice has become one of the most promising human-computer interaction methods in the future.
  • Natural language processing (Nature Language Processing, NLP) is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language people use daily, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
  • FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment includes a terminal 110 and a server 140.
  • the terminal 110 is connected to the server 140 through a wireless network or a wired network.
  • the device types of the terminal 110 include at least one of smart phones, tablet computers, smart speakers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and in-vehicle computers.
  • the terminal 110 installs and runs an application program supporting the voice separation technology.
  • the application program may be a voice assistant application program, and the voice assistant application program may also have functions such as data recording, audio and video playback, translation, and data query.
  • the terminal 110 is a terminal used by a user, and a user account is logged in an application program running in the terminal 110.
  • the server 140 includes at least one of one server, multiple servers, a cloud computing platform, or a virtualization center.
  • the server 140 is used to provide background services for applications that support voice separation.
  • the server 140 is responsible for primary voice separation processing, and the terminal 110 is responsible for secondary voice separation processing; or, the server 140 is responsible for secondary voice separation processing, and the terminal 110 is responsible for primary voice separation processing; or, the server 140 or the terminal 110 can separately undertake the work of speech separation processing.
  • the server 140 includes: an access server, a voice server, and a database.
  • the access server is used to provide the terminal 110 with access services.
  • the voice server is used to provide background services related to voice separation processing.
  • the database may include a voice information database, a user information database, etc., and different services provided by the server may correspond to different databases.
  • the terminal 110 may generally refer to one of multiple terminals, and this embodiment only uses the terminal 110 as an example for illustration.
  • the number of the aforementioned terminals may be more or less.
  • the foregoing terminal may be only one, or the foregoing terminal may be dozens or hundreds, or a greater number.
  • the foregoing implementation environment may also include other terminals.
  • the embodiments of the present application do not limit the number of terminals and device types.
  • the above voice separation method can be applied to in-vehicle terminals, TV boxes, voice recognition products, voiceprint recognition products, smart voice assistants, smart speakers and other products. It can be applied to the front end of the above products, or it can be implemented through the interaction between the terminal and the server.
  • the vehicle-mounted terminal can collect voice signals, perform voice separation on the voice signals, perform voice recognition based on the separated clean voice signals, and perform corresponding driving control or processing procedures based on the recognized voice content information.
  • the terminal can collect voice signals and send them to the server; the server performs voice separation on the voice signals, then performs voice recognition on the separated clean voice signals, and performs recording or other follow-up processing based on the recognized voice content information.
  • the above-mentioned voice recognition method can be applied to products such as in-vehicle terminals, TV boxes, voice recognition products, smart speakers, etc., can be applied to the front end of the above-mentioned products, or can be implemented through interaction between the front end and the server.
  • the vehicle-mounted terminal can collect voice signals, perform voice separation on the voice signals, perform voice recognition based on the separated clean voice signals, and perform corresponding driving control or processing procedures based on the recognized voice content information.
  • the vehicle-mounted terminal can also send a voice signal to a back-end server connected to the vehicle-mounted terminal, and the back-end server performs voice separation and voice recognition on the received voice signal to obtain voice content corresponding to the voice signal.
  • the background server may respond to the voice content corresponding to the voice signal and send the voice content or corresponding feedback information to the vehicle-mounted terminal, and the vehicle-mounted terminal executes the corresponding driving control or processing process based on the acquired voice content or feedback information, such as opening or closing the sunroof, turning the navigation system on or off, and turning the lighting on or off.
  • the generation process of the mixed speech signal can be expressed by the following formula (1): X = x + e, where
  • x represents the time-frequency point of the clean voice signal
  • e represents the time-frequency point of the interference signal
  • X represents the time-frequency point of the mixed voice signal
  • When the clean voice signal in the mixed voice signal is labeled, a set of labeled training samples {X^(1), ..., X^(L)} can be obtained; when the clean voice signal in the mixed voice signal is not labeled, a set of unlabeled training samples {X^(L+1), ..., X^(L+U)} can be obtained.
  • T represents the number of input frames
  • F represents the number of STFT frequency bands.
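  • As an illustration of formula (1) and of how the labeled and unlabeled sample sets described above might be assembled, the following Python sketch mixes a clean waveform with an interference waveform and computes an STFT magnitude feature matrix of shape T×F. The function name, the Hann window, and the default STFT settings (512-point FFT giving 257 bands, 25 ms window, 10 ms shift at 16 kHz, matching the example configuration given later) are assumptions for illustration, not details taken from the patent.

```python
import numpy as np

def make_training_sample(clean_wave, interference_wave, labeled=True,
                         n_fft=512, hop=160, win=400):
    """Build one sample per formula (1): X = x + e (hypothetical helper)."""
    mixture = clean_wave + interference_wave            # time-domain mixing

    def stft_mag(wave):
        # Frame, window, and FFT each frame; keep F = n_fft // 2 + 1 bands.
        frames = []
        for start in range(0, len(wave) - win + 1, hop):
            frame = wave[start:start + win] * np.hanning(win)
            frames.append(np.abs(np.fft.rfft(frame, n_fft)))
        return np.stack(frames)                          # shape (T, F)

    X = stft_mag(mixture)
    if labeled:
        return X, stft_mag(clean_wave)    # labeled set {X^(1), ..., X^(L)}
    return X, None                        # unlabeled set {X^(L+1), ..., X^(L+U)}
```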
  • Figure 2 is a schematic diagram of the principle of a training method for a speech separation model provided by an embodiment of the application. See Figure 2.
  • the network structure used in this training includes a student model and a teacher model.
  • the model parameters of the teacher model are configured based on the parameters of the student model.
  • the model parameters of the teacher model are also adjusted synchronously based on the adjusted student model.
  • a model training method with batch overlap is realized.
  • the computer device inputs the mixed speech signal as a training sample into the student model and the teacher model respectively.
  • the student model outputs the first clean speech signal and the first interference signal
  • the teacher model outputs the second clean speech signal and the second interference signal.
  • the above step 301 takes a single iterative process as an example to show a possible implementation manner in which a computer device inputs a mixed speech signal into a student model and a teacher model respectively, where the mixed speech signal is marked with a clean speech signal used to generate the mixed speech signal, and the mixed speech signal also includes interference signals other than the clean speech signal.
  • the student model processes the mixed speech signal to output the first clean speech signal and the first interference signal
  • the teacher model processes the mixed speech signal to output the second clean speech signal and the second interference signal.
  • the computer device determines accuracy information of the iterative process based on the first clean speech signal output by the student model and the clean speech signal used to generate the mixed speech signal, where the accuracy information is used to indicate the accuracy of separation of the student model.
  • the above step 302 is also a possible implementation manner for the computer device to determine the accuracy information based on the signal output by the student model and the clean speech signal marked in the mixed speech signal. Since the signal output by the student model includes the first clean speech signal and the first interference signal, in addition to the method of determining the accuracy information provided in step 302, the computer device can also determine the accuracy information based on the first interference signal output by the student model and the interference signal in the mixed speech signal, or combine the above two possible implementation manners and weight the accuracy information obtained by the two implementation manners to obtain the final accuracy information. The embodiments of this application do not specifically limit the method of obtaining the accuracy information.
  • the computer device determines the consistency information of the iterative process based on the first clean speech signal output by the student model and the second clean speech signal output by the teacher model, and the consistency information is used to indicate the degree of consistency between the separation abilities of the student model and the teacher model.
  • the above step 303 is also a possible implementation manner for the computer device to determine the consistency information based on the signal output by the student model and the signal output by the teacher model. Since the signal output by the student model includes the first clean speech signal and the first interference signal, and the signal output by the teacher model includes the second clean speech signal and the second interference signal, in addition to the method of determining the consistency information provided in step 303 above, the computer device can also determine the consistency information based on the first interference signal output by the student model and the second interference signal output by the teacher model, or combine the above two possible implementations and weight the consistency information obtained by the two implementations to obtain the final consistency information. The embodiment of the present application does not specifically limit the method of obtaining the consistency information.
  • the computer device adjusts the model parameters of the student model and the teacher model based on the accuracy information and consistency information determined in each iteration process until the stop training condition is met, and the student model determined by the iterative process that satisfies the stop training condition is output as the speech separation model.
  • the above step 304 is also a possible implementation manner in which the computer device adjusts the model parameters of the student model and the teacher model based on a plurality of accuracy information and a plurality of consistency information to obtain a speech separation model, wherein one iterative process corresponds to one piece of accuracy information and one piece of consistency information.
  • In response to satisfying the stop training condition, the computer device outputs the student model determined by the iterative process that satisfies the stop training condition as the speech separation model, or it can also output the teacher model determined by the iterative process that satisfies the stop training condition as the speech separation model.
  • the loss function value is determined based on the accuracy information and consistency information determined in this iterative process
  • the model parameters of the student model are adjusted based on the loss function value
  • the model parameters of the teacher model are adjusted based on the adjusted model parameters of the student model. Based on the adjusted models, iterative training is continued until the stop training condition is met, and the trained student model is used as the speech separation model.
  • the training of the above-mentioned student model can actually be understood as a supervised learning process, and the training of the teacher model can be understood as a semi-supervised learning process.
  • the teacher model enables the student model to achieve a better convergence state during the entire training process, so that the trained voice separation model has stronger separation ability, better accuracy and consistency.
  • the accuracy of the separation results of the student model and the consistency between the results obtained by the teacher model and the student model can be used to improve the performance of the trained speech separation model: while improving separation accuracy, it can also maintain the stability of separation, which greatly improves the separation ability of the trained speech separation model.
  • the teacher model smooths the training of the student model.
  • the model parameters of the teacher model change with the model parameters of the student model during each iteration, and the loss function is constructed by taking into account the output between the teacher model and the student model.
  • the model parameter configuration method of the above-mentioned teacher model in each iteration process can be as follows: the exponential moving average (EMA) method is adopted, the model parameters of the teacher model are determined based on the model parameters of the student model, and the teacher model is configured using the determined model parameters.
  • the above configuration process can be regarded as a smoothing process of model parameters.
  • θ′_l = α·θ′_{l-1} + (1-α)·θ_l, where α is the smoothing coefficient of the parameter, l is the number of iterations (a positive integer greater than 1), and θ and θ′ are the parameters of the encoder in the student model and the teacher model, respectively.
  • ψ′_l = α·ψ′_{l-1} + (1-α)·ψ_l, where α is the smoothing coefficient of the parameter, l is the number of iterations (a positive integer greater than 1), and ψ and ψ′ are the parameters of the abstract feature extractor in the student model and the teacher model, respectively.
  • The above parameter calculation methods are only a few examples of configuring the model parameters of the teacher model based on the model parameters of the student model. The calculation method can also adopt other methods, and the model parameters can also cover other parameter types. The embodiments of the present application do not limit this.
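  • The exponential moving average configuration of the teacher parameters can be sketched as below: it is a minimal illustration of the update θ′_l = α·θ′_{l-1} + (1-α)·θ_l applied to every parameter of the teacher (encoder, abstract feature extractor, or any other type). The dictionary representation and the value α = 0.999 are assumptions, not values given in this application.

```python
def ema_update(teacher_params, student_params, alpha=0.999):
    """Smooth teacher parameters toward the current student parameters.

    teacher_params / student_params: dicts mapping parameter names
    (e.g. encoder or abstract-feature-extractor weights) to arrays.
    alpha is the smoothing coefficient of the exponential moving average.
    """
    for name, student_value in student_params.items():
        teacher_params[name] = (alpha * teacher_params[name]
                                + (1.0 - alpha) * student_value)
    return teacher_params
```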
  • the mixed speech signal as a training sample is input into the student model and the teacher model respectively, and through the model processing, the student model outputs the first clean speech signal and the first interference signal, and the teacher model outputs the second clean speech signal and the second interference signal.
  • FIG. 4 is a schematic diagram of a flow of processing mixed speech signals by a student model provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a structure for implementing the foregoing model. Referring to FIG. 4, the flow specifically includes the following steps.
  • the computer device maps the mixed speech signal to a high-dimensional vector space to obtain an embedding matrix corresponding to the mixed speech signal.
  • This step 401 is the process of performing feature conversion on the mixed speech signal.
  • the mixed speech signal can be converted into the form of model input.
  • the computer device performs framing and windowing on the mixed speech signal, applies a Fast Fourier Transform (FFT) to each frame to convert the time domain signal into a frequency domain signal, and arranges the obtained frequency domain signals in time sequence to obtain a feature matrix representing the mixed speech signal.
  • the features of the mixed speech signal can be short-time Fourier transform (STFT) spectrogram features, logarithmic Mel spectrum features, Mel-Frequency Cepstral Coefficient (MFCC) features, prediction scores output by a preceding Convolutional Neural Network (CNN), features of other factors, or a combination of various features, which is not limited in the embodiment of the present application.
  • the above step 401 can be implemented by the encoder 501 in FIG. 5, and the processing process of the encoder is explained by taking the converted feature as the short-time Fourier transform spectrogram as an example: the mixed speech signal is input to the encoder, the encoder obtains the feature matrix of the short-time Fourier transform spectrogram of the mixed speech signal, then maps the feature matrix to a high-dimensional vector space, and outputs the embedding matrix corresponding to the mixed speech signal.
  • The feature matrix obtained by the encoder after processing the mixed speech signal can be denoted as X, where T and F are respectively the number of frames and the number of frequency bands of the mixed speech signal input to the encoder. The process by which the encoder maps the feature matrix to a high-dimensional vector space and outputs the embedding matrix ν corresponding to the mixed speech signal can be expressed as E_θ: X → ν, where θ is the model parameter of the encoder.
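  • A possible shape of the encoder E_θ is sketched below in PyTorch: a Bi-LSTM stack followed by a projection that maps each frame to an F×D embedding, using the example sizes quoted later in this application (4 Bi-LSTM layers, 600-dimensional hidden vectors, F = 257, D = 40). The class itself, the split of the 600 hidden units into two directions, and the layer names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Illustrative encoder E_theta: (batch, T, F) spectrogram -> (batch, T, F, D)."""

    def __init__(self, n_freq=257, embed_dim=40, hidden=300, layers=4):
        super().__init__()
        # bidirectional => 2 * hidden = 600-dimensional hidden vectors
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_freq * embed_dim)
        self.n_freq, self.embed_dim = n_freq, embed_dim

    def forward(self, spectrogram):          # (batch, T, F)
        h, _ = self.blstm(spectrogram)       # (batch, T, 600)
        v = self.proj(h)                     # (batch, T, F * D)
        return v.view(v.size(0), v.size(1), self.n_freq, self.embed_dim)
```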
  • the computer device extracts abstract features from the embedded matrix corresponding to the mixed speech signal.
  • This step 402 is a process of feature extraction, and the extracted features can be used to characterize the mixed speech signal and provide a basis for subsequent speech signal reconstruction.
  • the abstract feature extractor 502 in Figure 5 can be an autoregressive model.
  • In a causal system, a Long Short-Term Memory (LSTM) model is used; in a non-causal system, a Bi-directional Long Short-Term Memory (Bi-LSTM) model can be adopted. Short-term or long-term abstract features are extracted in time series based on the embedding matrix corresponding to the mixed speech signal.
  • the embodiment of the present application does not limit the specific model structure of the abstract feature extractor and the types of abstract features extracted.
  • In formula (4), c_t ∈ C represents the short-time time-varying abstract features, ν represents the embedding matrix, p ∈ P represents the weight, ⊙ represents the element-wise dot product, and t and f respectively represent the frame index and the frequency band index of the short-time Fourier transform spectrogram.
  • In some embodiments, the feature matrix can be normalized, with elements less than a certain threshold set to 0 and the other elements set to 1. The computer device can multiply formula (4) by this binary threshold matrix, which helps to reduce the impact of low-energy noise on the abstract feature extraction process. The calculation is as shown in formula (5), where w ∈ R^{TF} represents the binary threshold matrix.
  • the above-mentioned abstract feature extractor extracts the abstract feature c from the embedding matrix ν, which can be simply expressed as A_ψ: ν → c, where ψ is the model parameter of the abstract feature extractor.
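  • The weighted pooling described around formulas (4) and (5) can be illustrated as follows: the short-time abstract feature of each frame is an element-wise weighted combination of the embedding matrix, optionally gated by the binary threshold matrix w that zeros out low-energy time-frequency bins. The exact pooling and the -40 dB threshold used to build w are assumptions that only follow the variable definitions above; the formulas in this application may differ in detail.

```python
import torch

def short_time_abstract_feature(v, p, mixture_mag, threshold_db=-40.0):
    """Sketch of formulas (4)-(5): weighted pooling of embeddings per frame.

    v: (T, F, D) embedding matrix, p: (T, F) weights,
    mixture_mag: (T, F) mixture magnitude used to build the threshold matrix w.
    """
    # Binary threshold matrix: 1 where a bin is within threshold_db of the
    # loudest bin, 0 otherwise, so low-energy noise is excluded.
    log_mag = 20.0 * torch.log10(mixture_mag + 1e-8)
    w = (log_mag > log_mag.max() + threshold_db).float()
    gated = (w * p).unsqueeze(-1) * v        # (T, F, D) element-wise products
    c = gated.sum(dim=1)                     # pool over frequency -> (T, D)
    return c
```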
  • the computer device performs signal reconstruction based on the extracted abstract features, the input mixed speech signal, and the output of the encoder to obtain the first clean speech signal.
  • Performing speech signal reconstruction based on the above input can obtain a new set of speech signals, which provides a basis for the following speech signal comparison and calculation of training loss.
  • the speech signal output by the student model is named the first clean speech signal.
  • This step can be implemented by the signal reconstruction module 503 in FIG. 5.
  • the signal reconstruction module 503 can use any signal reconstruction algorithm to reconstruct the speech signal according to the extracted abstract features, the input mixed speech signal, and the embedding matrix, so as to output the first clean speech signal and the first interference signal; the output first clean speech signal and first interference signal can be used to calculate the loss function value of this iteration and to train the model through back propagation.
  • the encoder can adopt a 4-layer Bi-LSTM structure in which each hidden layer has 600 nodes, which can map a 600-dimensional hidden vector to a 257*40-dimensional high-dimensional vector space, and the number of nodes in the output layer is 40.
  • the encoder uses 16KHz sampling rate, 25ms window length, 10ms window shift, and 257 frequency bands to process the mixed speech signal.
  • Each training corpus is randomly downsampled to 32 frames.
  • the abstract feature extractor connected to the encoder can include a fully connected layer that can map 257*40-dimensional hidden vectors to 600-dimensional.
  • the signal reconstruction module can be a 2-layer Bi-LSTM structure, with 600 nodes in each hidden layer.
  • According to the complexity of the actual application and the performance requirements, more layers can be added to at least one of the aforementioned encoder, abstract feature extractor, and signal reconstruction module, or its model type can be changed. The embodiment of the application does not specifically limit the model type and topology of the above structure, which can be replaced with other effective model structures, such as long short-term memory networks, convolutional neural networks, time delay networks, gated convolutional neural networks, etc., as well as models that combine various network structures.
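  • Under the example sizes above, the three modules could be assembled as in the following sketch. The mask-based signal reconstruction at the end is only one of the many reconstruction algorithms the text allows, and the class name, the sigmoid mask, and the exact wiring of the fully connected layer are assumptions for illustration; the sketch reuses the Encoder sketch given earlier.

```python
import torch
import torch.nn as nn

class StudentSeparator(nn.Module):
    """Hypothetical assembly: encoder + abstract-feature FC + 2-layer Bi-LSTM
    reconstruction module, with a sigmoid mask as one possible reconstruction."""

    def __init__(self, n_freq=257, embed_dim=40):
        super().__init__()
        self.encoder = Encoder(n_freq, embed_dim)          # sketched earlier
        self.embed_to_hidden = nn.Linear(n_freq * embed_dim, 600)
        self.reconstructor = nn.LSTM(600, 300, num_layers=2,
                                     bidirectional=True, batch_first=True)
        self.mask_out = nn.Linear(600, n_freq)

    def forward(self, spectrogram):                        # (B, T, F)
        v = self.encoder(spectrogram)                      # (B, T, F, D)
        h = self.embed_to_hidden(v.flatten(2))             # (B, T, 600)
        r, _ = self.reconstructor(h)                       # (B, T, 600)
        mask = torch.sigmoid(self.mask_out(r))             # (B, T, F)
        clean_est = mask * spectrogram                     # first clean speech signal
        interference_est = (1.0 - mask) * spectrogram      # first interference signal
        return clean_est, interference_est, v
```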
  • the model structure and processing flow of the teacher model and the student model can be the same.
  • the teacher model can also be slightly more complicated; its structure is used to extract features with different time-domain characteristics, so as to perform signal reconstruction based on the features with different time-domain characteristics, and further perform loss function value calculation and back-propagation model training based on the reconstructed result.
  • In the corresponding formula, c′_L ∈ C′ represents the long-term stable abstract features, ν′ represents the high-dimensional embedding matrix, p′ ∈ P′ represents the weight, ⊙ represents the element-wise dot product, t and f respectively represent the frame index and the frequency band index of the short-time Fourier transform spectrogram, and w represents the binary threshold matrix shown in formula (6). In some embodiments, the binary threshold matrix may not be multiplied in; this is not limited.
  • Such abstract features with low time-domain resolution, that is, long-term stable abstract features, are suitable for generalizing hidden speaker features, while abstract features with high time-domain resolution, that is, short-time time-varying abstract features, are more suitable for tasks that require high time-domain resolution, such as speaker spectrum reconstruction.
  • the first type is supervised training whose training objective is aimed at improving accuracy;
  • the second type is consistency learning between the teacher model and the student model.
  • the specific process can include any of the following:
  • In the first implementation manner, the accuracy information of the iterative process is determined based on the first clean speech signal output by the student model and the clean speech signal marked in the mixed speech signal;
  • In the second implementation manner, the accuracy information of the iterative process is determined based on the first interference signal output by the student model and the interference signal other than the clean speech signal marked in the mixed speech signal;
  • In the third implementation manner, the first accuracy information of the iterative process is determined based on the first clean speech signal output by the student model and the clean speech signal marked in the mixed speech signal; the second accuracy information of the iterative process is determined based on the first interference signal output by the student model and the interference signal other than the clean speech signal marked in the mixed speech signal; and the accuracy information of the iterative process is determined according to the first accuracy information and the second accuracy information.
  • the first clean speech signal may be, for example, the speech signal with the largest energy shown in formula (8), or may be a speech signal determined based on the PIT algorithm of formula (9); of course, it may also be a speech signal determined based on other methods, which is not limited in the embodiment of the present application.
  • the above accuracy information is used to determine the difference between the separated signal and the reference signal.
  • the accuracy information may be the mean square error (Mean-Square Error, MSE) between the signal spectra, or it may be a scale-invariant signal-to-noise ratio (Scale-Invariant Signal-to-Noise Ratio, SI-SNR) objective function, which is not specifically limited in the embodiment of the present application.
  • formula (8) can be used to calculate the mean square error between the first clean speech signal with the largest energy and the marked clean speech signal:
  • x represents a marked clean voice signal
  • X represents a mixed voice signal
  • c represents an abstract feature
  • v represents an embedding matrix
  • t and f represent the frame index and the frequency band index of the short-time Fourier transform spectrogram, respectively.
  • x represents a marked clean voice signal
  • X represents a mixed voice signal
  • e represents an interference signal
  • c represents an abstract feature
  • v represents an embedding matrix
  • t and f represent the frame index and the frequency band index of the short-time Fourier transform spectrogram, respectively.
  • the above three implementation manners can be understood as a method of constructing a loss function, that is, what type of input and output are used to construct the loss function, so that the model can be backpropagated training based on the loss function.
  • the above loss function is a reconstruction type objective function.
  • the supervised discriminative learning model using this objective function can ensure that the learned representation encodes the target speaker's speech information to a certain extent, so that supervised discriminative learning enables the student model to effectively estimate short-time time-varying abstract features.
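  • A minimal sketch of the accuracy objective described above is shown below: the MSE of formula (8) computed against the labeled clean and interference spectra, with an optional permutation check in the spirit of the PIT variant of formula (9). SI-SNR could be substituted, as the text notes; the function name and signature are assumptions.

```python
import torch

def accuracy_loss(student_clean, student_interf, clean_ref, interf_ref,
                  use_pit=False):
    """MSE accuracy term (formula (8) style), optionally with a PIT-style
    choice of the lower-error output/reference assignment (formula (9) style)."""
    mse = lambda a, b: torch.mean((a - b) ** 2)
    direct = mse(student_clean, clean_ref) + mse(student_interf, interf_ref)
    if not use_pit:
        return direct
    swapped = mse(student_clean, interf_ref) + mse(student_interf, clean_ref)
    return torch.minimum(direct, swapped)
```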
  • In the first implementation manner, the consistency information of the iterative process is determined based on the first clean speech signal output by the student model and the second clean speech signal output by the teacher model.
  • In the second implementation manner, the consistency information of the iterative process is determined based on the first interference signal output by the student model and the second interference signal output by the teacher model.
  • In the third implementation manner, the first consistency information of the iterative process is determined based on the first clean speech signal output by the student model and the second clean speech signal output by the teacher model; the second consistency information of the iterative process is determined based on the first interference signal output by the student model and the second interference signal output by the teacher model; and the consistency information of the iterative process is determined according to the first consistency information and the second consistency information.
  • the first clean speech signal may be, for example, the speech signal with the largest energy shown in formula (8), or may be a speech signal determined based on the PIT algorithm of formula (9); of course, it may also be a speech signal determined based on other methods, which is not limited in the embodiment of the present application.
  • consistency information is used to indicate the gap between the target speaker's spectrum estimated by the teacher model and the target speaker's spectrum estimated by the student model.
  • the consistency information may be the MSE between the signal spectra, or it may be SI-SNR, which is not specifically limited in the embodiment of the present application.
  • the above three implementation manners can be understood as a method of constructing a loss function, that is, what type of input and output are used to construct the loss function, so that the model can be backpropagated training based on the loss function.
  • the loss function constructed here is used to calculate the gap between the target speaker's spectrum estimated by the teacher model and the target speaker's spectrum estimated by the student model.
  • the teacher model can produce two types of features, one being short-time time-varying abstract features and the other being long-term stable abstract features, and the consistency information can be determined based on these two types of features. Based on the short-time time-varying abstract features of the first clean speech signal and the short-time time-varying abstract features of the second clean speech signal output by the teacher model, the third consistency information of the iterative process is determined; based on the short-time time-varying abstract features of the first clean speech signal and the long-term stable abstract features of the second clean speech signal output by the teacher model, the fourth consistency information of the iterative process is determined. From these, the final consistency information of the iterative process is constructed.
  • When constructing the loss function, it can be constructed based only on the short-time abstract features of the student model and the teacher model, or based on the short-time abstract features of the student model and the teacher model together with the long-term stable abstract features of the teacher model.
  • X represents the mixed speech signal
  • c t and c t ′ represent the short-term abstract features predicted by the student model and the teacher model
  • v and ⁇ ′ represent the embedded matrix of the student model and the teacher model, respectively
  • t and f represent the frame index and the frequency band index of the short-time Fourier transform spectrogram, respectively.
  • X represents the mixed speech signal
  • c L ′ represents the long-term stable abstract feature predicted by the teacher model
  • c represents the short-term time-varying abstract feature predicted by the student model
  • v and ⁇ ′ represent the embedded matrix of the student model and the teacher model, respectively
  • t and f respectively represent the frame index and the frequency band index of the short-time Fourier transform spectrogram.
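  • The consistency objective can be sketched as below: an MSE between the spectrum the student reconstructs from its short-time time-varying abstract features and the spectrum the teacher reconstructs (formula (10) style), optionally adding a term against the teacher's long-term stable abstract features (formula (11) style). No labels are required for this term; stopping gradients through the teacher is an assumption consistent with configuring the teacher by moving averages rather than back propagation.

```python
import torch

def consistency_loss(student_clean, teacher_clean, teacher_clean_long=None):
    """Consistency term between student and teacher estimates (no labels)."""
    mse = lambda a, b: torch.mean((a - b) ** 2)
    # Teacher outputs are detached: the teacher is configured via EMA,
    # not trained by back propagation.
    loss = mse(student_clean, teacher_clean.detach())
    if teacher_clean_long is not None:
        loss = loss + mse(student_clean, teacher_clean_long.detach())
    return loss
```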
  • The model parameters of the student model and the teacher model are adjusted until the stop training condition is satisfied, and the student model determined by the iterative process that satisfies the stop training condition is output as the speech separation model.
  • the above process is to separately explain the construction of the loss function whose training target is accuracy and the loss function whose training target is the consistency between models.
  • In the actual training process, a joint loss function that expresses both the accuracy information and the consistency information can be constructed.
  • The model parameters of the student model and the teacher model can be adjusted based on the third consistency information and the accuracy information determined in each iteration process; that is, the joint loss function can be expressed by the following formula (12): one term is the loss function whose training target is accuracy, the other term is the loss function whose training target is consistency, which can specifically be a loss function based on short-time time-varying abstract features, λ is a weighting factor, and λ can be continuously optimized during the neural network iteration process until the optimal value is matched.
  • Alternatively, when the model parameters are adjusted, the model parameters of the student model and the teacher model may be adjusted based on the accuracy information and the weighted value of the third consistency information and the fourth consistency information determined in each iteration process; that is, the joint loss function can be expressed by the following formula (13):
  • In formula (13), one term is the loss function whose training objective is accuracy, and λ1 and λ2 are weighting factors; λ1 and λ2 can be continuously optimized in the iterative process of the neural network until the optimal values are matched.
  • the above-mentioned conditions for stopping training may be conditions such as the number of iterations reaching the target number and the loss function becoming stable, which are not limited in the embodiment of this application.
  • the batch data size is set to 32
  • the initial learning rate is 0.0001
  • the decay coefficient of the learning rate is 0.8.
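  • One possible training loop combining the pieces above with the quoted hyperparameters (batch size 32, initial learning rate 0.0001, learning-rate decay coefficient 0.8) is sketched here. The choice of the Adam optimizer, the per-epoch exponential decay schedule, the EMA smoothing value 0.999, and the number of epochs are assumptions; only the student is updated by back propagation while the teacher follows via the moving-average update.

```python
import torch

def train(student, teacher, loader, epochs=100):
    """Hypothetical loop: loader yields (mixture, clean_ref, interference_ref)
    batches of size 32; reuses accuracy_loss, consistency_loss, joint_loss."""
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.8)
    for _ in range(epochs):
        for mixture, clean_ref, interf_ref in loader:
            s_clean, s_interf, _ = student(mixture)
            with torch.no_grad():
                t_clean, _, _ = teacher(mixture)
            loss = joint_loss(
                accuracy_loss(s_clean, s_interf, clean_ref, interf_ref),
                consistency_loss(s_clean, t_clean))
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Teacher parameters follow the student via exponential moving average.
            with torch.no_grad():
                for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                    t_p.mul_(0.999).add_(s_p, alpha=0.001)
        sched.step()
    return student   # the trained student is used as the speech separation model
```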
  • the training method provided by the embodiments of the present application can automatically learn the characteristics of a stable hidden target speaker without additional PIT processing, speaker tracking mechanism, or processing and adjustment defined by experts.
  • the consistency-based training used in this application does not require labeled information, and can mine unsupervised information in massive unlabeled data to help improve the robustness and versatility of the system.
  • the embodiments of this application have been tested to fully verify the effectiveness of the speech separation model trained based on the consistency of the student-teacher model.
  • the separation performance of the embodiment of this application is excellent in terms of perceptual evaluation of speech quality, short-time objective intelligibility, signal-to-distortion ratio, and other indicators, as well as in terms of stability.
  • an embodiment of the present application also provides a voice separation method.
  • the method may include:
  • the computer device obtains a sound signal to be separated.
  • the computer device inputs the sound signal into the speech separation model, which is obtained through collaborative iterative training of the student model and the teacher model based on the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model.
  • the computer device predicts the clean voice signal in the voice signal through the voice separation model, and outputs the clean voice signal of the voice signal.
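  • Applying the trained speech separation model at inference time can be as simple as the following sketch; the shapes and the three-tuple output follow the earlier StudentSeparator sketch, which is itself an assumption rather than this application's exact structure.

```python
import torch

def separate(speech_separation_model, mixture_spectrogram):
    """Predict the clean speech signal for a sound signal to be separated."""
    speech_separation_model.eval()
    with torch.no_grad():
        clean_est, _, _ = speech_separation_model(mixture_spectrogram.unsqueeze(0))
    return clean_est.squeeze(0)   # predicted clean speech spectrogram
```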
  • the loss function of the iterative process is constructed based on the accuracy information between the output of the student model and the training input of the student model, and the consistency information between the output of the student model and the output of the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the first accuracy information between the first clean speech signal output by the student model and the clean speech signal in the mixed speech signal; the second accuracy information between the first interference signal output by the student model and the interference signal in the mixed speech signal; the first consistency information between the first clean speech signal output by the student model and the second clean speech signal output by the teacher model; and the second consistency information between the first interference signal output by the student model and the second interference signal output by the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the short-time and time-varying abstract features output by the student model and the short-time and time-varying abstract features output by the teacher model as well as the short-time and time-varying abstract features output by the student model and the long-term stable abstract features output by the teacher model.
  • The model training process and the speech separation process can be executed by different computer devices.
  • After the model training is completed, the trained model can be provided to the front-end or application-side computer device to perform the speech separation task, and the speech separation task can be a subtask for separating speech within tasks such as speech recognition.
  • The signal obtained by the separation can also be used in specific processing procedures such as speech recognition, which is not limited in the embodiments of this application.
  • Fig. 7 is a schematic structural diagram of a training device for a speech separation model provided by an embodiment of the present disclosure.
  • the device includes:
  • the training module 701 is used to, in any iteration process, input the mixed speech signal serving as a training sample into the student model and the teacher model respectively, where the mixed speech signal is labeled with a clean speech signal used to generate the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • that is, the training module 701 is used to input the mixed speech signal into the student model and the teacher model respectively, the mixed speech signal is labeled with the clean speech signal used to generate the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • the accuracy determination module 702 is configured to determine the accuracy information of the iteration process based on the signal output by the student model and the clean speech signal labeled in the mixed speech signal input to the model, where the accuracy information is used to indicate the separation accuracy of the student model;
  • that is, the accuracy determination module 702 is used to determine the accuracy information based on the signal output by the student model and the clean speech signal labeled in the mixed speech signal, where the accuracy information is used to indicate the separation accuracy of the student model;
  • the consistency determination module 703 is used to determine the consistency information of the iteration process based on the signal output by the student model and the signal output by the teacher model, where the consistency information is used to indicate the degree of consistency between the separation abilities of the student model and the teacher model;
  • that is, the consistency determination module 703 is used to determine the consistency information based on the signal output by the student model and the signal output by the teacher model, where the consistency information is used to indicate the degree of consistency between the separation abilities of the student model and the teacher model;
  • the adjustment module 704 is used to adjust the model parameters of the student model and the teacher model based on the accuracy information and consistency information determined in each iteration process until the condition for stopping training is met, and to output the student model determined by the iteration process that satisfies the stop-training condition as the speech separation model;
  • that is, the adjustment module 704 is configured to adjust the model parameters of the student model and the teacher model based on multiple pieces of accuracy information and multiple pieces of consistency information to obtain the speech separation model.
  • the accuracy determining module 702 is configured to perform any of the following steps:
  • the consistency determining module 703 is configured to perform any of the following steps:
  • the adjustment module 704 is used to adopt an exponential moving average method to determine the model parameters of the teacher model based on the model parameters of the student model, and to configure the teacher model using the determined model parameters of the teacher model.
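  • A minimal sketch of this exponential moving average configuration, following the form θ′ = α·θ′ + (1 − α)·θ used for the teacher parameters; the value of the smoothing coefficient α is an assumption of the sketch.

        import torch

        @torch.no_grad()
        def update_teacher(teacher_model, student_model, alpha=0.999):
            # After each training step, copy a smoothed (exponential moving
            # average) version of the student parameters into the teacher.
            for t_param, s_param in zip(teacher_model.parameters(),
                                        student_model.parameters()):
                t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)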
  • the consistency determination module 703 is used to determine the third consistency information of the iteration process (that is, to determine the consistency information) based on the short-time time-varying abstract features of the first clean speech signal and the short-time time-varying abstract features of the second clean speech signal output by the teacher model.
  • the consistency determination module 703 is used to:
  • the consistency information is determined based on the weighted value of the third consistency information and the fourth consistency information.
  • the device also includes an iterative acquisition module, which is used to repeatedly and iteratively input the mixed speech signal into the student model and the teacher model to acquire multiple pieces of accuracy information and multiple pieces of consistency information, where one iteration process corresponds to one piece of accuracy information and one piece of consistency information;
  • the iterative acquisition module is further configured to output the student model determined by the iterative process that satisfies the stop training condition as the speech separation model in response to satisfying the stop training condition.
  • the student model and the teacher model adopt a PIT (permutation invariant training) method for signal separation; or, the student model and the teacher model adopt a salience-based selection mechanism for signal separation.
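  • As a sketch of the permutation invariant training (PIT) option: the loss is evaluated for every pairing of estimated signals with reference signals and the minimum over permutations is kept. The mean-squared-error formulation below is an assumption used only for illustration.

        from itertools import permutations
        import torch
        import torch.nn.functional as F

        def pit_mse(estimates, references):
            # estimates, references: lists of tensors, one per source.
            losses = []
            for perm in permutations(range(len(references))):
                loss = sum(F.mse_loss(estimates[i], references[j])
                           for i, j in enumerate(perm))
                losses.append(loss)
            # Keep the loss of the best permutation (permutation invariant).
            return torch.stack(losses).min()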
  • When the training device for the speech separation model provided in the above embodiment trains the speech separation model, the division into the above functional modules is only used as an example for illustration.
  • In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • In addition, the training device for the speech separation model provided in the foregoing embodiment and the training method embodiment of the speech separation model belong to the same concept; for the specific implementation process, please refer to the method embodiment, which will not be repeated here.
  • FIG. 8 is a schematic structural diagram of a voice separation device provided by an embodiment of the present application. Referring to Figure 8, the device includes:
  • the signal acquisition module 801 is used to acquire the sound signal to be separated
  • the input module 802 is configured to input the sound signal into a speech separation model, the speech separation model is obtained based on the mixed speech signal and the collaborative iterative training of the student model and the teacher model, and the model parameters of the teacher model are based on the model parameter configuration of the student model;
  • the prediction module 803 is configured to predict the clean voice signal in the sound signal through the voice separation model, and output the clean voice signal of the sound signal.
  • the loss function of the iterative process is constructed based on the accuracy information between the output of the student model and the training input of the student model, and the consistency information between the output of the student model and the output of the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the first accuracy information between the first clean speech signal output by the student model and the clean speech signal in the mixed speech signal, the second accuracy information between the first interference signal output by the student model and the interference signal in the mixed speech signal, the first consistency information between the first clean speech signal output by the student model and the second clean speech signal output by the teacher model, and the second consistency information between the first interference signal output by the student model and the second interference signal output by the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the short-time and time-varying abstract features output by the student model and the short-time and time-varying abstract features output by the teacher model as well as the short-time and time-varying abstract features output by the student model and the long-term stable abstract features output by the teacher model.
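  • For illustration only, the two feature-level consistency terms named above might be computed as sketched below; using mean-squared error between the abstract features, stopping gradients through the teacher, and broadcasting the long-term feature over frames are assumptions of the sketch.

        import torch.nn.functional as F

        def consistency_terms(student_short, teacher_short, teacher_long):
            # student_short / teacher_short: short-time time-varying abstract
            # features, e.g. one vector per frame; teacher_long: a long-term
            # stable abstract feature broadcast over the frame dimension.
            third = F.mse_loss(student_short, teacher_short.detach())
            fourth = F.mse_loss(student_short,
                                teacher_long.detach().expand_as(student_short))
            return third, fourth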
  • When the voice separation device provided in the above embodiment performs voice separation, the division into the above functional modules is only used as an example for illustration.
  • In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the voice separation device provided in the foregoing embodiment and the voice separation method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • the computer device involved in the embodiment of the present application includes one or more processors and one or more memories, and at least one computer program is stored in the one or more memories, and the at least one computer program The program is loaded by the one or more processors and performs the following operations:
  • Input the mixed speech signal into the student model and the teacher model respectively, where the mixed speech signal is labeled with a clean speech signal used to generate the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • Determine accuracy information based on the signal output by the student model and the clean speech signal labeled in the mixed speech signal, the accuracy information being used to indicate the separation accuracy of the student model; determine consistency information based on the signal output by the student model and the signal output by the teacher model, the consistency information being used to indicate the degree of consistency between the separation abilities of the student model and the teacher model;
  • Based on multiple pieces of accuracy information and multiple pieces of consistency information, the model parameters of the student model and the teacher model are adjusted to obtain a speech separation model.
  • the at least one computer program is loaded by the one or more processors and executes any one of the following operations:
  • the at least one computer program is loaded by the one or more processors and executes any one of the following operations:
  • the at least one computer program is loaded by the one or more processors and performs the following operations:
  • the consistency information is determined based on the short-time and time-varying abstract characteristics of the first clean speech signal and the short-time and time-varying abstract characteristics of the second clean speech signal.
  • the at least one computer program is loaded by the one or more processors and performs the following operations:
  • the consistency information is determined based on the weighted value of the third consistency information and the fourth consistency information.
  • the at least one computer program is loaded by the one or more processors and performs the following operations:
  • Using an exponential moving average method, the model parameters of the teacher model are determined based on the model parameters of the student model, and the teacher model is configured using the determined model parameters of the teacher model.
  • the at least one computer program is loaded by the one or more processors and performs the following operations:
  • Repeatedly and iteratively input the mixed speech signal into the student model and the teacher model to obtain multiple pieces of accuracy information and multiple pieces of consistency information, where one iteration process corresponds to one piece of accuracy information and one piece of consistency information;
  • the at least one computer program is also loaded by the one or more processors and performs the following operations:
  • In response to the stop-training condition being met, the student model determined by the iteration process that satisfies the stop-training condition is output as the speech separation model.
  • the student model and the teacher model adopt a permutation invariant training (PIT) method for signal separation; or, the student model and the teacher model adopt a salience-based selection mechanism for signal separation.
  • the computer device involved in the embodiment of the present application includes one or more processors and one or more memories, and at least one computer program is stored in the one or more memories.
  • the computer program is loaded by the one or more processors and performs the following operations:
  • Acquire the sound signal to be separated, and input the sound signal into a speech separation model, where the speech separation model is obtained based on the mixed speech signal and the collaborative iterative training of the student model and the teacher model, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • the clean voice signal in the voice signal is predicted, and the clean voice signal of the voice signal is output.
  • the loss function of the iterative process is constructed based on the accuracy information between the output of the student model and the training input of the student model, and the consistency information between the output of the student model and the output of the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the first accuracy information between the first clean speech signal output by the student model and the clean speech signal in the mixed speech signal, the second accuracy information between the first interference signal output by the student model and the interference signal in the mixed speech signal, the first consistency information between the first clean speech signal output by the student model and the second clean speech signal output by the teacher model, and the second consistency information between the first interference signal output by the student model and the second interference signal output by the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the short-time and time-varying abstract features output by the student model and the short-time and time-varying abstract features output by the teacher model as well as the short-time and time-varying abstract features output by the student model and the long-term stable abstract features output by the teacher model.
  • the computer device provided in the embodiment of the present application can be implemented as a server.
  • FIG. 9 is a schematic diagram of the structure of a server provided in the embodiment of the present application.
  • the server 900 may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPU) 901 and one or more memories 902, where at least one computer program is stored in the one or more memories 902, and the at least one computer program is loaded and executed by the one or more processors 901 to implement the voice signal processing method (that is, the training method of the speech separation model) or the voice separation method provided in the foregoing embodiments.
  • the server 900 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output.
  • the server 900 may also include other components for implementing device functions, which will not be repeated here.
  • FIG. 10 is a schematic structural diagram of a terminal provided in an embodiment of the present application.
  • the terminal may be used to execute the terminal-side method in the foregoing embodiment.
  • the terminal 1000 can be: a smart phone, an intelligent voice assistant, a smart speaker, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • the terminal 1000 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 1000 includes: one or more processors 1001 and one or more memories 1002.
  • the processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 1001 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 1001 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 1001 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used to render and draw content that needs to be displayed on the display screen.
  • the processor 1001 may further include an AI (Artificial Intelligence) processor, and the AI processor is used to process computing operations related to machine learning.
  • AI Artificial Intelligence
  • the memory 1002 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 1002 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 1002 is used to store at least one instruction, and the at least one instruction is used to be executed by the processor 1001 to implement the speech separation provided by the method embodiment of the present application. Method or training method of speech separation model.
  • the terminal 1000 optionally further includes: a peripheral device interface 1003 and at least one peripheral device.
  • the processor 1001, the memory 1002, and the peripheral device interface 1003 may be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1003 through a bus, a signal line, or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1004, a display screen 1005, a camera component 1006, an audio circuit 1007, a positioning component 1008, and a power supply 1009.
  • the peripheral device interface 1003 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 1001 and the memory 1002.
  • In some embodiments, the processor 1001, the memory 1002, and the peripheral device interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral device interface 1003 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 1004 is used for receiving and transmitting RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 1004 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 1004 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 1004 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on.
  • the radio frequency circuit 1004 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity, wireless fidelity) networks.
  • the radio frequency circuit 1004 may also include a circuit related to NFC (Near Field Communication), which is not limited in this application.
  • the display screen 1005 is used to display a UI (User Interface, user interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the display screen 1005 also has the ability to collect touch signals on or above the surface of the display screen 1005.
  • the touch signal can be input to the processor 1001 as a control signal for processing.
  • the display screen 1005 may also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • In some embodiments, there may be one display screen 1005, which is provided on the front panel of the terminal 1000; in other embodiments, there may be at least two display screens 1005, which are respectively arranged on different surfaces of the terminal 1000 or adopt a folded design; in still other embodiments, the display screen 1005 may be a flexible display screen disposed on a curved surface or a folding surface of the terminal 1000. Furthermore, the display screen 1005 can also be set in a non-rectangular irregular shape, that is, a special-shaped screen.
  • the display screen 1005 may be made of materials such as LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode, organic light-emitting diode).
  • the camera assembly 1006 is used to capture images or videos.
  • the camera assembly 1006 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • the camera assembly 1006 may also include a flash.
  • the flash can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, which can be used for light compensation under different color temperatures.
  • the audio circuit 1007 may include a microphone and a speaker.
  • the microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals and input to the processor 1001 for processing, or input to the radio frequency circuit 1004 to implement voice communication. For the purpose of stereo collection or noise reduction, there may be multiple microphones, which are respectively set in different parts of the terminal 1000.
  • the microphone can also be an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 1001 or the radio frequency circuit 1004 into sound waves.
  • the speaker can be a traditional thin-film speaker or a piezoelectric ceramic speaker.
  • When the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 1007 may also include a headphone jack.
  • the positioning component 1008 is used to locate the current geographic location of the terminal 1000 to implement navigation or LBS (Location Based Service, location-based service).
  • the positioning component 1008 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 1009 is used to supply power to various components in the terminal 1000.
  • the power source 1009 may be alternating current, direct current, disposable batteries, or rechargeable batteries.
  • the rechargeable battery may support wired charging or wireless charging.
  • the rechargeable battery can also be used to support fast charging technology.
  • the terminal 1000 further includes one or more sensors 1010.
  • the one or more sensors 1010 include, but are not limited to: an acceleration sensor 1011, a gyroscope sensor 1012, a pressure sensor 1013, a fingerprint sensor 1014, an optical sensor 1015, and a proximity sensor 1016.
  • the acceleration sensor 1011 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 1000.
  • the acceleration sensor 1011 can be used to detect the components of gravitational acceleration on three coordinate axes.
  • the processor 1001 may control the display screen 1005 to display the user interface in a horizontal view or a vertical view according to the gravity acceleration signal collected by the acceleration sensor 1011.
  • the acceleration sensor 1011 may also be used for the collection of game or user motion data.
  • the gyroscope sensor 1012 can detect the body direction and rotation angle of the terminal 1000, and the gyroscope sensor 1012 can cooperate with the acceleration sensor 1011 to collect the user's 3D actions on the terminal 1000. Based on the data collected by the gyroscope sensor 1012, the processor 1001 can implement the following functions: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 1013 may be disposed on the side frame of the terminal 1000 and/or the lower layer of the display screen 1005.
  • the processor 1001 performs left and right hand recognition or quick operation according to the holding signal collected by the pressure sensor 1013.
  • the processor 1001 controls the operability controls on the UI interface according to the user's pressure operation on the display screen 1005.
  • the operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
  • the fingerprint sensor 1014 is used to collect the user's fingerprint.
  • the processor 1001 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user's identity according to the collected fingerprint.
  • the processor 1001 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings.
  • the fingerprint sensor 1014 may be provided on the front, back or side of the terminal 1000. When a physical button or a manufacturer logo is provided on the terminal 1000, the fingerprint sensor 1014 can be integrated with the physical button or the manufacturer logo.
  • the optical sensor 1015 is used to collect the ambient light intensity.
  • the processor 1001 may control the display brightness of the display screen 1005 according to the intensity of the ambient light collected by the optical sensor 1015. Optionally, when the ambient light intensity is high, the display brightness of the display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the display screen 1005 is decreased. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the ambient light intensity collected by the optical sensor 1015.
  • the proximity sensor 1016, also called a distance sensor, is usually arranged on the front panel of the terminal 1000.
  • the proximity sensor 1016 is used to collect the distance between the user and the front of the terminal 1000.
  • When the proximity sensor 1016 detects that the distance between the user and the front of the terminal 1000 gradually becomes smaller, the processor 1001 controls the display screen 1005 to switch from the bright-screen state to the off-screen state; when the proximity sensor 1016 detects that the distance between the user and the front of the terminal 1000 gradually increases, the processor 1001 controls the display screen 1005 to switch from the off-screen state to the bright-screen state.
  • FIG. 10 does not constitute a limitation on the terminal 1000, and may include more or fewer components than shown in the figure, or combine certain components, or adopt different component arrangements.
  • a computer-readable storage medium such as a memory including a computer program, which can be executed by a processor to complete the speech separation method or the training method of the speech separation model in the aforementioned embodiment.
  • the computer-readable storage medium may be Read-Only Memory (ROM), Random Access Memory (RAM), Compact Disc Read-Only Memory (CD-ROM), Magnetic tapes, floppy disks and optical data storage devices, etc.
  • At least one computer program stored in the computer-readable storage medium is loaded by the processor and performs the following operations:
  • Input the mixed speech signal into the student model and the teacher model respectively, where the mixed speech signal is labeled with a clean speech signal used to generate the mixed speech signal, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • Determine accuracy information based on the signal output by the student model and the clean speech signal labeled in the mixed speech signal, the accuracy information being used to indicate the separation accuracy of the student model; determine consistency information based on the signal output by the student model and the signal output by the teacher model, the consistency information being used to indicate the degree of consistency between the separation abilities of the student model and the teacher model;
  • Based on multiple pieces of accuracy information and multiple pieces of consistency information, the model parameters of the student model and the teacher model are adjusted to obtain a speech separation model.
  • the at least one computer program is loaded by the processor and executes any one of the following operations:
  • the at least one computer program is loaded by the processor and executes any one of the following operations:
  • the at least one computer program is loaded by the processor and performs the following operations:
  • the consistency information is determined based on the short-time and time-varying abstract characteristics of the first clean speech signal and the short-time and time-varying abstract characteristics of the second clean speech signal.
  • the at least one computer program is loaded by the processor and performs the following operations:
  • the consistency information is determined based on the weighted value of the third consistency information and the fourth consistency information.
  • the at least one computer program is loaded by the processor and performs the following operations: using an exponential moving average method, determining the model parameters of the teacher model based on the model parameters of the student model, and using the determined teacher model The model parameters of the model configure the teacher model.
  • the at least one computer program is loaded by the processor and performs the following operations:
  • Repeatedly and iteratively input the mixed speech signal into the student model and the teacher model to obtain multiple pieces of accuracy information and multiple pieces of consistency information, where one iteration process corresponds to one piece of accuracy information and one piece of consistency information;
  • the at least one computer program is also loaded by the processor and performs the following operations:
  • In response to the stop-training condition being met, the student model determined by the iteration process that satisfies the stop-training condition is output as the speech separation model.
  • the student model and the teacher model adopt a permutation invariant training (PIT) method for signal separation; or, the student model and the teacher model adopt a salience-based selection mechanism for signal separation.
  • At least one computer program stored in the computer-readable storage medium is loaded by the processor and performs the following operations:
  • Acquire the sound signal to be separated, and input the sound signal into a speech separation model, where the speech separation model is obtained based on the mixed speech signal and the collaborative iterative training of the student model and the teacher model, and the model parameters of the teacher model are configured based on the model parameters of the student model;
  • the clean voice signal in the voice signal is predicted, and the clean voice signal of the voice signal is output.
  • the loss function of the iterative process is constructed based on the accuracy information between the output of the student model and the training input of the student model, and the consistency information between the output of the student model and the output of the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the first accuracy information between the first clean speech signal output by the student model and the clean speech signal in the mixed speech signal, the second accuracy information between the first interference signal output by the student model and the interference signal in the mixed speech signal, the first consistency information between the first clean speech signal output by the student model and the second clean speech signal output by the teacher model, and the second consistency information between the first interference signal output by the student model and the second interference signal output by the teacher model.
  • the loss function of the iterative process is constructed based on the following information:
  • the short-time and time-varying abstract features output by the student model and the short-time and time-varying abstract features output by the teacher model as well as the short-time and time-varying abstract features output by the student model and the long-term stable abstract features output by the teacher model.
  • An embodiment of the present application also provides a computer program product or computer program, which includes one or more pieces of program code stored in a computer-readable storage medium.
  • One or more processors of the computer device can read the one or more pieces of program code from the computer-readable storage medium and execute them, so that the computer device can execute the voice signal processing method or voice separation method involved in each of the above embodiments.

Abstract

A speech signal processing method, a speech separation method, an apparatus, a computer device and a storage medium, belonging to the field of speech technology. During training, based on the accuracy of the separation results of the student model and the consistency between the results separated by the teacher model and by the student model, the teacher model can play a smoothing role in the training of the student model, so that the separation accuracy of the trained speech separation model is improved while the stability of the separation is maintained, which improves the separation capability of the trained speech separation model.

Description

语音信号的处理方法、语音分离方法
本申请要求于2020年01月02日提交的申请号为202010003201.2、发明名称为“语音分离模型的训练方法、语音分离方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音技术领域,特别涉及一种语音信号的处理方法、语音分离方法、装置、计算机设备及存储介质。
背景技术
随着人工智能技术和电子设备的发展,语音已经成为人类与电子设备进行交互的重要方式之一。然而,由于干扰声源的存在,电子设备在复杂开放环境下语音识别的识别精度远没有达到令人满意的程度,原因在于难以将目标语音和干扰声源进行精准分离。现阶段,开发一种在复杂可变的输入环境中具有较强泛化性和鲁棒性的语音分离方法仍然是一项极具挑战性的任务。
发明内容
本公开实施例提供了一种语音信号的处理方法、语音分离方法、装置、计算机设备及存储介质。该技术方案如下:
一方面,提供了一种语音信号的处理方法,由计算机设备执行,包括:
将混合语音信号分别输入学生模型和教师模型,该混合语音信号标注有用于生成该混合语音信号的干净语音信号,该教师模型的模型参数基于该学生模型的模型参数配置;
基于该学生模型输出的信号和该混合语音信号中标注的该干净语音信号,确定准确性信息,该准确性信息用于表示该学生模型的分离准确程度;
基于该学生模型输出的信号和教师模型输出的信号,确定一致性信息,该一致性信息用于表示该学生模型和该教师模型的分离能力的一致程度;
基于多个准确性信息和多个一致性信息,调整该学生模型和该教师模型的模型参数,以获取语音分离模型。
在一种可能实现方式中,基于该学生模型输出的信号和该混合语音信号中标注的该干净语音信号,确定准确性信息包括下述任一项:
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定该准确性信息;
基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定该准确性信息;
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定第一准确性信息;基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定第二准确性信息;根据该第一准确性信息和该第二准确性信息,确定该准确性信息。
在一种可能实现方式中,基于该学生模型输出的信号和教师模型输出的信号,确定一致性信息包括下述任一项:
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该一致性信息;
基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定该一致性信息;
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定第一一致性信息,基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定第二一致性信息,根据该第一一致性信息和该第二一致性信息,确定该一致性信息。
在一种可能实现方式中,基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该一致性信息包括:
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定该一致性信息。
在一种可能实现方式中,基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该一致性信息包括:
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定第三一致性信息;
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的长时稳定抽象特征,确定第四一致性信息;
基于该第三一致性信息和该第四一致性信息的加权值,确定该一致性信息。
在一种可能实现方式中,调整该学生模型和该教师模型的模型参数包括:采用指数移动平均的方法,基于该学生模型的模型参数确定该教师模型的模型参数,采用确定好的该教师模型的模型参数对该教师模型进行配置。
在一种可能实现方式中,该方法还包括:
迭代多次执行将混合语音信号分别输入学生模型和教师模型,获取多个该准确性信息和多个该一致性信息,一次迭代过程对应于一个准确性信息和一个一致性信息;
获取语音分离模型包括:
响应于满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为该语音分离模型。
在一种可能实现方式中,该学生模型和该教师模型采用排列不变式训练PIT方式进行信号分离;或,该学生模型和该教师模型采用突出导向选择机制进行信号分离。
一方面,提供了一种语音分离方法,由计算机设备执行,包括:
获取待分离的声音信号;
将该声音信号输入语音分离模型,该语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,该教师模型的模型参数基于该学生模型的模型参数配置;
通过该语音分离模型,对该声音信号中的干净语音信号进行预测,输出该声音信号的干净语音信号。
在一种可能实现方式中,该迭代过程的损失函数基于该学生模型的输出和该学生模型的训练输入之间的准确性信息、该学生模型的输出和该教师模型的输出之间的一致性信息构建。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的第一准确性信息、该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的第二准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的第一一致性信息、该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号之间的第二一致性信息。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征;或,
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征,以及,该学生模型输出的短时时变抽象特征和教师模型输出的长时稳定抽象特征。
一方面,提供了一种语音信号的处理装置,包括:
训练模块,用于将混合语音信号分别输入学生模型和教师模型,该混合语音信号标注有用于生成该混合语音信号的干净语音信号,该教师模型的模型参数基于该学生模型的模型参数配置;
准确性确定模块,用于基于该学生模型输出的信号和输入模型的混合语音信号中标注的该干净语音信号,确定准确性信息,该准确性信息用于表示该学生模型的分离准确程度;
一致性确定模块,用于基于该学生模型输出的信号和教师模型输出的信号,确定一致性信息,该准确性信息用于表示该学生模型和该教师模型的分离能力的一致程度;
调整模块,用于基于多个准确性信息和多个一致性信息,调整该学生模型和该教师模型的模型参数,以获取语音分离模型。
在一种可能实现方式中,准确性确定模块,用于执行下述任一步骤:
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定该准确性信息;
基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定该准确性信息;
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定第一准确性信息;基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定第二准确性信息;根据该第一准确性信息和该第二准确性信息,确定该准确性信息。
在一种可能实现方式中,一致性确定模块,用于执行下述任一步骤:
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该一致性信息;
基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定该一致性信息;
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定 第一一致性信息,基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定第二一致性信息,根据该第一一致性信息和该第二一致性信息,确定该一致性信息。
在一种可能实现方式中,一致性确定模块,用于基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定该一致性信息。
在一种可能实现方式中,一致性确定模块,用于:
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定第三一致性信息;
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的长时稳定抽象特征,确定第四一致性信息;
基于该第三一致性信息和该第四一致性信息的加权值,确定该一致性信息。
在一种可能实现方式中,该调整模块,用于采用指数移动平均的方法,基于该学生模型的模型参数确定该教师模型的模型参数,采用确定好的该教师模型的模型参数对该教师模型进行配置。
在一种可能实现方式中,该装置还包括迭代获取模块,用于迭代多次执行将混合语音信号分别输入学生模型和教师模型,获取多个该准确性信息和多个该一致性信息,一次迭代过程对应于一个准确性信息和一个一致性信息;
该迭代获取模块,还用于响应于满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为该语音分离模型。
在一种可能实现方式中,该学生模型和该教师模型采用排列不变式训练PIT方式进行信号分离;或,该学生模型和该教师模型采用突出导向选择机制进行信号分离。
一方面,提供了一种语音分离装置,包括:
信号获取模块,用于获取待分离的声音信号;
输入模块,用于将该声音信号输入语音分离模型,该语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,该教师模型的模型参数基于该学生模型的模型参数配置;
预测模块,用于通过该语音分离模型,对该声音信号中的干净语音信号进行预测,输出该声音信号的干净语音信号。
在一种可能实现方式中,该迭代过程的损失函数基于该学生模型的输出和该学生模型的训练输入之间的准确性信息、该学生模型的输出和该教师模型的输出之间的一致性信息构建。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的第一准确性信息、该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的第二准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的第一一致性信息、该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号之 间的第二一致性信息。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征;或,
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征,以及,该学生模型输出的短时时变抽象特征和教师模型输出的长时稳定抽象特征。
一方面,提供了一种计算机设备,该计算机设备包括一个或多个处理器和一个或多个存储器,该一个或多个存储器中存储有至少一条计算机程序,该至少一条计算机程序由该一个或多个处理器加载并执行以实现如上述任一种可能实施方式的语音信号的处理方法或语音分离方法。
一方面,提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条计算机程序,该至少一条计算机程序由处理器加载并执行以实现如上述任一种可能实施方式的语音信号的处理方法或语音分离方法。
一方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或该计算机程序包括一条或多条程序代码,该一条或多条程序代码存储在计算机可读存储介质中。计算机设备的一个或多个处理器能够从计算机可读存储介质中读取该一条或多条程序代码,该一个或多个处理器执行该一条或多条程序代码,使得计算机设备能够执行上述任一种可能实施方式的语音信号的处理方法或语音分离方法。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种语音分离模型的训练方法的实施环境的示意图;
图2是本申请实施例提供的一种语音分离模型的训练方法的原理示意图;
图3是本申请实施例提供的一种语音分离模型的训练方法的流程示意图;
图4是本申请实施例提供的一种学生模型处理混合语音信号的流程示意图;
图5是本申请实施例提供的学生模型内部的一种结构示意图;
图6是本申请实施例提供的一种语音分离方法的流程图;
图7是本申请实施例提供的一种语音分离模型的训练装置的结构示意图;
图8是本申请实施例提供的一种语音分离处理装置的结构示意图;
图9是本申请实施例提供的一种服务器的结构示意图;
图10是本申请实施例提供的一种终端的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
为了便于理解本申请实施例的技术过程,下面对本申请实施例所涉及的一些名词进行解释:
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技 术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
语音技术(Speech Technology)的关键技术有自动语音识别技术(Automatic Speech Recognition,ASR)和语音合成技术(Text To Speech,TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。
自然语言处理(Nature Language processing,NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习等技术。
近年来,监督学习的引入在解决语音分离方面取得了一些进展。但监督学习需要手工收集带标注的高质量训练样本,这个过程耗时耗力并且效率低下,此外,要让有标注的训练样本覆盖所有类型的实际应用场景亦是不切实际的。
有鉴于此,图1是本申请实施例提供的一种实施环境的示意图,参见图1,该实施环境中包括终端110和服务器140。终端110通过无线网络或有线网络与服务器140相连。
可选地,终端110的设备类型包括智能手机、平板电脑、智能音箱、电子书阅读器、MP3(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)播放器、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机、台式计算机及车载计算机等中的至少一种。终端110安装和运行有支持语音分离技术的应用程序。该应用程序可以是语音助手类应用程序,该语音助手类的应用程序还可以具有数据记录、音视频播放、翻译、数据查询等功能。示例性的,终端110是用户使用的终端,终端110中运行的应用程序内登录有用户账号。
可选地,服务器140包括一台服务器、多台服务器、云计算平台或者虚拟化中心中的至少一种。服务器140用于为支持语音分离的应用程序提供后台服务。可选地,服务器140承担主要语音分离处理工作,终端110承担次要语音分离处理工作;或者,服务器140承担次要语音分离处理工作,终端110承担主要语音分离处理工作;或者,服务器140或终端110分别可以单独承担语音分离处理工作。
可选地,服务器140包括:接入服务器、语音服务器和数据库。接入服务器用于为终端 110提供接入服务。语音服务器用于提供语音分离处理有关的后台服务。数据库可以包括语音信息数据库以及用户信息数据库等,基于服务器所提供的不同服务可以对应于不同数据库。语音服务器可以是一台或多台,当语音服务器是多台时,存在至少两台语音服务器用于提供不同的服务,和/或,存在至少两台语音服务器用于提供相同的服务,比如以负载均衡方式提供同一种服务,本申请实施例对此不加以限定。
终端110可以泛指多个终端中的一个,本实施例仅以终端110来举例说明。
本领域技术人员可以知晓,上述终端的数量可以更多或更少。比如上述终端可以仅为一个,或者上述终端为几十个或几百个,或者更多数量,此时上述实施环境中还包括其他终端。本申请实施例对终端的数量和设备类型不加以限定。
上述语音分离方法可以应用于车载终端、电视盒子、语音识别产品、声纹识别产品、智能语音助手以及智能音箱等产品,可以应用于上述产品前端,也可以通过终端和服务器之间的交互来实现。
以车载终端为例,车载终端可以采集语音信号,对语音信号进行语音分离,基于分离得到的干净语音信号进行语音识别,基于识别得到的语音内容信息来执行对应的驾驶控制或者处理过程。以自动语音识别产品为例,终端可以采集语音信号,将语音信号发送至服务器,由服务器对语音信号进行语音分离,再对分离得到的干净语音信号进行语音识别,基于识别得到的语音内容信息来进行记录或者后续的其他相应处理。
上述语音识别方法可以应用于车载终端、电视盒子、语音识别产品以及智能音箱等产品,可以应用于上述产品前端,也可以通过前端和服务器之间的交互来实现。
以车载终端为例,车载终端可以采集语音信号,对语音信号进行语音分离,基于分离得到的干净语音信号进行语音识别,基于识别得到的语音内容信息来执行对应的驾驶控制或者处理过程。车载终端还可以将语音信号发送至与车载终端连接的后台服务器,由后台服务器对接收到的语音信号进行语音分离和语音识别,得到与语音信号对应的语音内容。后台服务器可以响应于语音信号对应的语音内容,将语音内容或者对应的反馈信息发送至车载终端,车载终端基于获取到的语音内容或者反馈信息来执行对应的驾驶控制或者处理过程,例如开启或关闭天窗,开启或关闭导航系统以及开启或关闭照明灯光等操作。
需要说明的是,本申请实施例提供的语音分离方法可以应用于各种基于语音功能的产品中,上述描述仅仅是为了便于理解而进行的,并不能对本申请实施例造成不当限定。
在正式开始训练模型之前,可以先进行训练样本的生成,把干净语音信号和干扰信号进行混合,生成混合语音信号,将这类混合语音信号作为训练样本,并且对混合语音信号中的干净语音信号进行标注,以便后续进行损失函数的计算来实现模型训练。
混合语音信号的生成过程即可用下述公式(1)来表示:
X=x+e          (1)
其中,x表示干净语音信号的时频点,e表示干扰信号的时频点,X表示混合语音信号的时频点。
通过对混合语音信号中的干净语音信号进行标注,可以得到一组有标注的训练样本{X (1),...,X (L)},对混合语音信号中的干净语音信号不进行标注可以得到一组未标注的训练样本{X (L+1),...,X (L+U)}。
每个训练样本都由输入空间的一组时频点构成,也即是,{x=X t,f}t=1...,T;f=1...,F, 在一些实施例中,以混合语音信号的时频点采用短时傅立叶谱(Short-time Fourier Transform,STFT)表示为例,则T表示输入帧的个数,F表示STFT频带个数。
图2是本申请实施例提供的一种语音分离模型的训练方法的原理示意图,参见图2,该训练所采用的网络结构包括学生模型和教师模型,在模型初始状态下,教师模型的模型参数基于学生模型的参数进行配置,在每次迭代过程中,在基于损失函数对学生模型的模型参数进行调整时,也相应基于调整后的学生模型来对教师模型的模型参数进行同步的调整,从而实现了一种分批交迭的模型训练方法。下面基于上述图2所示的原理示意图,再结合图3所示的方法流程图,对该语音分离模型的训练过程进行简要说明,参见图2和3所示的训练流程图,在训练过程中,可以包括下述步骤:
301、在任一次迭代过程中,计算机设备将作为训练样本的混合语音信号分别输入学生模型和教师模型,通过模型处理,学生模型输出第一干净语音信号和第一干扰信号,教师模型输出第二干净语音信号和第二干扰信号。
上述步骤301以单次迭代过程为例,示出了计算机设备将混合语音信号分别输入学生模型和教师模型的一种可能实施方式,其中,该混合语音信号标注有用于生成该混合语音信号的干净语音信号,该混合语音信号还包括除了该干净语音信号之外的干扰信号。可选地,学生模型对该混合语音信号进行处理,输出第一干净语音信号和第一干扰信号,教师模型对该混合语音信号进行处理,输出第二干净语音信号和第二干扰信号。
302、计算机设备基于学生模型输出的第一干净语音信号和用于生成混合语音信号的干净语音信号,确定该迭代过程的准确性信息,该准确性信息用于表示该学生模型的分离准确程度。
上述步骤302也即是计算机设备基于学生模型输出的信号和该混合语音信号中标注的该干净语音信号,确定准确性信息的一种可能实施方式。由于学生模型输出的信号包括第一干净语音信号和第一干扰信号,除了基于上述步骤302提供的确定准确性信息的方式之外,计算机设备还能够基于学生模型输出的第一干扰信号和混合语音信号中的干扰信号,确定该准确性信息,或者,将综合上述两种可能实施方式,并对两种实施方式所得的准确性信息进行加权,以获取最终的准确性信息,本申请实施例不对准确性信息的获取方式进行具体限定。
303、计算机设备基于学生模型输出的第一干净语音信号和教师模型输出的第二干净语音信号,确定该迭代过程的一致性信息,该一致性信息用于表示该学生模型和该教师模型的分离能力的一致程度。
上述步骤303也即是计算机设备基于学生模型输出的信号和该教师模型输出的信号,确定一致性信息的一种可能实施方式。由于学生模型输出的信号包括第一干净语音信号和第一干扰信号,教师模型输出的信号包括第二干净语音信号和第二干扰信号,除了基于上述步骤303提供的确定一致性信息的方式之外,计算机设备还能够基于学生模型输出的第一干扰信号和教师模型输出的第二干扰信号,确定该一致性信息,或者,将综合上述两种可能实施方式,并对两种实施方式所得的一致性信息进行加权,以获取最终的一致性信息,本申请实施例不对一致性信息的获取方式进行具体限定。
304、计算机设备基于每次迭代过程所确定的准确性信息和一致性信息,对该学生模型和该教师模型的模型参数进行调整,直到满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为语音分离模型。
上述步骤304也即是计算机设备基于多个准确性信息和多个一致性信息,调整该学生模 型和该教师模型的模型参数,以获取语音分离模型的一种可能实施方式,其中,一次迭代过程对应于一个准确性信息和一个一致性信息。通过迭代多次执行上述步骤301-303,也即迭代多次执行将混合语音信号分别输入学生模型和教师模型,能够获取到多个准确性信息和多个一致性信息,可选地,在对教师模型和学生模型的模型参数进行迭代调整的过程中,计算机设备响应于满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为语音分离模型,或者,还可以将满足该停止训练条件的迭代过程所确定的教师模型输出为语音分离模型。
对于一次迭代过程来说,基于本次迭代过程所确定的准确性信息和一致性信息,确定损失函数值,基于损失函数值对学生模型的模型参数进行调整,基于调整后的模型参数,对教师模型的模型参数进行调整,基于调整后的模型,继续进行迭代训练,直到满足停止训练条件,将训练得到的学生模型作为语音分离模型。
上述学生模型的训练实际上可以理解为一种监督学习过程,而教师模型的训练可以理解为一种半监督学习过程,教师模型在整个训练过程中使得学生模型能够达到更好的收敛状态,使得训练得到的语音分离模型的分离能力更强,准确性和一致性更好。
通过本申请实施例提供的技术方案,在训练过程中,能够基于学生模型的分离结果的准确性、教师模型和学生模型分离得到的结果之间的一致性,从而提升训练得到的语音分离模型的分离准确性的同时,还能够保持分离的稳定性,大大提高了训练的语音分离模型的分离能力。
而教师模型对学生模型的训练进行平滑,是通过教师模型在每次迭代过程中模型参数随学生模型的模型参数变化而变化以及损失函数的构建过程中考虑到了教师模型和学生模型之间输出的一致性来进行,可选地,上述教师模型在每次迭代过程中的模型参数配置方式可以如下:采用指数移动平均(Exponential Moving Average,EMA)的方法,基于所述学生模型的模型参数确定所述教师模型的模型参数,采用确定好的所述教师模型的模型参数对所述教师模型进行配置。上述配置过程可以看做是一种对模型参数的平滑过程。
以教师模型中的编码器参数为例,在任一次迭代过程中,该教师模型的编码器参数的计算方法如下式(2)所示:
θ l′=αθ l-1′+(1-α)θ l             (2)
其中,α是参数的平滑系数,l是迭代次数,l为大于1的正整数,θ、θ′分别是学生模型、教师模型中编码器的参数。
以教师模型中的抽象特征提取器参数为例,在任一次迭代过程中,该教师模型的抽象特征提取器参数的计算方法如下式(3)所示:
ψ l′=αψ l-1′+(1-α)ψ l         (3)
其中,α是参数的平滑系数,l是迭代次数,l为大于1的正整数,ψ、ψ′分别是学生模型、教师模型中抽象特征提取器的参数。
需要说明的是,上述参数计算方式仅为基于学生模型的模型参数对教师模型的模型参数进行配置的几种示例,其计算方式还可以采用其他方式,其模型参数也可以涵盖其他参数类型,本申请实施例对此不做限定。
下面基于上述步骤对模型训练过程中模型内部处理流程进行示例性说明。
在任一次迭代过程中,将作为训练样本的混合语音信号分别输入学生模型和教师模型,通过模型处理,该学生模型输出第一干净语音信号和第一干扰信号,该教师模型输出第二干 净语音信号和第二干扰信号。
其中,学生模型和教师模型可以采用相同的模型架构,也即是,该两个模型的处理流程可以同理,因此,下面先基于学生模型的模型架构和处理流程进行介绍。图4是本申请实施例提供的一种学生模型处理混合语音信号的流程示意图,图5是实现上述模型内部的一种结构示意图,参见图4,该流程具体包括以下步骤。
401、计算机设备把混合语音信号映射到一个高维向量空间,得到该混合语音信号对应的嵌入矩阵。
该步骤401为对混合语音信号进行特征转换的过程,可以将该混合语音信号转换为模型输入的形式,在一种可能实现方式中,计算机设备对混合语音信号进行分帧加窗,对每一帧做快速傅里叶变换(Fast Fourier Transform,FFT),把时域信号转为频域信号,将得到的频域信号按时序排列起来即可得到表示混合语音信号的特征矩阵,将该特征矩阵映射到一个高维向量空间,即可得到混合语音信号对应的嵌入矩阵。
其中,混合语音信号的特征可以是短时傅里叶变化声谱特征、对数梅尔谱特征、梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCC)特征、或者上一个卷积神经网络(Convolutional Neural Networks,CNN)后预测得分,也可以是其他因素的特征,以及各种特征之间的组合,本申请实施例对此不做限定。
上述步骤401可以通过图5中的编码器501实现,现以所转换的特征为短时傅里叶变化声谱为例对编码器的处理过程做出说明:
将混合语音信号输入编码器,编码器获取混合语音信号的短时傅里叶变化声谱的特征矩阵,再将该特征矩阵映射到一个高维向量空间,输出混合语音信号对应的嵌入矩阵。例如,可以用
Figure PCTCN2020126475-appb-000001
表示编码器对混合语音信号处理后得到的特征矩阵(T、F分别输入编码器的混合语音信号的帧的个数、频带个数),则编码器将其映射到高维空间向量并输出混合语音信号的嵌入矩阵ν的过程可以表示为E θ
Figure PCTCN2020126475-appb-000002
其中,θ为编码器的模型参数。
402、计算机设备从混合语音信号对应的嵌入矩阵中提取抽象特征。
该步骤402是特征提取的过程,所提取到的特征可以用于表征该混合语音信号,为后续的语音信号重建提供基础。
此步骤可以通过图5中的抽象特征提取器502实现,抽象特征提取器可以是一个自回归模型,例如,在因果系统中采用长短时记忆网络(Long Short Term Memory Networks,LSTM)模型,在非因果系统中采用双向长短时记忆网络(Bi-directional Long Short-Term Memory,Bi-LSTM)模型,基于混合语音信号对应的嵌入矩阵从中按时序地提取短时或长时抽象特征,也可以采用一种复发性(recurrent)模型或一种摘要函数,基于嵌入矩阵提取全局的抽象特征。本申请实施例对抽象特征提取器的具体模型结构和提取到的抽象特征的种类不做限定。
现以自回归模型为例,对抽象特征提取器的处理过程做出说明:
在一种可能实现方式中,给定一个权重P,特征提取的计算公式如下:
Figure PCTCN2020126475-appb-000003
其中,c t∈c,表示短时时变抽象特征,υ∈ν,表示嵌入矩阵,p∈P,表示权重,⊙表示元素点乘,t、f分别表示短时傅里叶变化声谱的帧索引、频带索引。
在一种可能实现方式中,还可以对上述特征提取所得到的特征进行整形,去除取值小于一定阈值的矩阵元素,从而消除低能量噪声对特征提取的影响。例如,在本申请实施例中, 可以对特征矩阵进行归一化,将小于一定阈值的元素置为0,将其他元素置为1,举例说明,计算机设备可以对式(4)乘以一个二值阈值矩阵,这样有助于减轻低能量噪声对抽象特征提取过程的影响,此时,计算公式如下式(5):
Figure PCTCN2020126475-appb-000004
其中,w∈R TF,表示该二值阈值矩阵:
Figure PCTCN2020126475-appb-000005
上述抽象特征提取器从嵌入矩阵ν中提取抽象特征c的过程可以简约表示为A ψ
Figure PCTCN2020126475-appb-000006
其中ψ为抽象特征提取器的模型参数。
403、该计算机设备基于提取到的抽象特征、输入的混合语音信号以及编码器的输出,进行信号重建,得到第一干净语音信号。
基于上述输入来进行语音信号重建,可以得到一组新的语音信号,为下述语音信号对比,计算训练损失提供基础。为方便表示,在此将学生模型输出的语音信号命名为第一干净语音信号。
此步骤可以通过图5中的信号重建模块503实现,信号重建模块503可根据提取到的抽象特征、干净语音信号和嵌入矩阵的特征,采用任一种信号重建算法进行语音信号重建,以输出第一干净语音信号和第一干扰信号,所输出的第一干净语音信号和第一干扰信号可以用于计算本次迭代的损失函数值,并通过反向传播来训练模型。
在一种示例性结构中,编码器可以采用4层的Bi-LSTM结构,每个隐层结点数为600,能将600维的隐向量映射到257*40维的高维向量空间,输出层结点数为40,该编码器采用16KHz采样率,25ms窗长,10ms窗移,257个频带个数的参数设置对混合语音信号进行处理,每段训练语料随机降采样帧数为32。该编码器所连接的抽象特征提取器可以包含一个全连接层,能够将257*40维的隐向量映射到600维。而信号重建模块可以是一个2层的Bi-LSTM结构,每个隐层结点数为600。
上述编码器、抽象特征提取器以及信号重建模块可以根据实际应用的复杂程度和对性能的要求,对编码器、抽象特征提取器和信号重建模块中至少一个增加更多层级或改变其模型类型,本申请实施例不具体限定上述结构的模型类型和拓扑结构,其可以替换为其它各种有效的新型的模型结构,例如,长短时记忆网络,卷积神经网络、时延网络、闸控卷积神经网络等,以及各种网络结构相结合的模型。
上述实施例内容仅介绍了学生模型的模型结构和处理流程,而在本申请实施例中,教师模型与学生模型的模型架构和处理流程可以同理,当然,教师模型还可以采用稍微复杂一些的结构,用以提取不同时域特性的特征,从而基于该时域特性不同的特征,来进行信号重建,进一步基于重建出的结果来进行损失函数值的计算以及反向传播的模型训练。
例如,对于学生模型来说,可以基于上式(5)所示的方法来提取在时域上分辨率较高的抽象特征,也即是短时时变抽象特征,对于教师模型,也可以采用同理的过程来提取短时时变抽象特征,而在一种可能实现方式中,对于教师模型来说,可以在进行特征提取时,还可以提取在时域上分辨率较低的抽象特征,为了便于表述,称其为长时稳定抽象特征,该特征可以用下式(7)表示:
Figure PCTCN2020126475-appb-000007
其中,c′ L∈c′,表示长时稳定抽象特征,υ′∈ν′,表示高维嵌入矩阵,p′∈P′,表示权重,⊙表示元素点乘,t、f分别表示短时傅里叶变化声谱的帧索引、频带索引,w表示式(6)所示的二值阀值矩阵,当然,在该实施例中,也可以不乘以上述二值阈值矩阵,本申请对此不做限定。
这类在时域上分辨率较低的抽象特征,也即是长时稳定抽象特征适用于概括隐藏的说话人特征,而在时域上分辨率较高的抽象特征,也即是短时时变抽象特征,更适合与需要高时域分辨率的任务,例如,说话人的频谱重建等。
在训练学生模型的模型参数过程中,综合采用两类训练目标,第一类是旨在提高准确性的训练目标的有监督训练,第二类是教师模型和学生模型之间的一致性学习。
对于提高准确性的训练目标来说,需要基于所述学生模型输出的信号和所述混合语音信号中标注的所述干净语音信号,确定所述迭代过程的准确性信息,而该确定准确性信息的具体过程可以包括下述任一项:
第一种实现方式、基于所述学生模型输出的第一干净语音信号和所述混合语音信号中标注的所述干净语音信号,确定所述迭代过程的准确性信息。
第二种实现方式、基于所述学生模型输出的第一干扰信号和所述混合语音信号中标注的所述干净语音信号以外的干扰信号,确定所述迭代过程的准确性信息;
第三种实现方式、基于所述学生模型输出的第一干净语音信号和所述混合语音信号中标注的所述干净语音信号,确定所述迭代过程的第一准确性信息;基于所述学生模型输出的第一干扰信号和所述混合语音信号中标注的所述干净语音信号以外的干扰信号,确定所述迭代过程的第二准确性信息,根据所述第一准确性信息和所述第二准确性信息,确定所述迭代过程的准确性信息。
其中,该第一干净语音信号可以是例如公式(8)中所示的能量最大的语音信号,还可以是基于例如公式(9)的PIT算法所确定的语音信号,当然,还可以是基于其他方式所确定的语音信号,本申请实施例对此不做限定。
需要说明的是,上述准确性信息用于确定分离出的信号和作为参考的信号之间的差距,例如,该准确性信息可以是信号的频谱之间的均方误差(Mean-Square Error,MSE),也可以是比例不变信噪比(Scale Invariant Signal to Noise Ratio,SI-SNR)目标函数,本申请实施例对此不做具体限定。
例如,以采用最直观的突出导向(salience-based)选择机制下的准确性计算为例,可以采用下述式(8)来计算能量最大的第一干净语音信号与已标注的干净语音信号之间的均方误差:
Figure PCTCN2020126475-appb-000008
其中,x表示带标记的干净语音信号,X表示混合语音信号,c表示抽象特征,v表示嵌入矩阵,t、f分别表示短时傅里叶变化声谱的帧索引、频带索引。
又例如,以采用排列不变式训练方法(Permutation Invariant Training,PIT)的准确性计 算为例,则可以采用下式(9)来计算所有可能的第一干净语音信号与已标注的干净语音信号以及所有可能的第一干扰信号和已标注的干扰信号之间MSE:
Figure PCTCN2020126475-appb-000009
其中,x表示带标记的干净语音信号,X表示混合语音信号,e表示干扰信号,c表示抽象特征,v表示嵌入矩阵,t、f分别表示短时傅里叶变化声谱的帧索引、频带索引。
上述三种实现方式可以理解为一种对损失函数的构建方法,也即是,通过哪类输入输出来构建该损失函数,从而能够基于损失函数对模型进行反向传播的训练。而上述损失函数是以重建类型的目标函数,利用该目标函数的有监督的鉴别学习模型能够一定程度上保证学习到的表征对目标说话人语音信息的编码,使得通过结合语音分离任务的有监督的区分学习,能够使学生模型有效地估计出一个短时时变的抽象特征。
对于教师模型和学生模型之间的一致性学习来说,需要基于所述学生模型输出的信号和教师模型输出的信号,确定所述迭代过程的一致性信息,而该确定一致性信息的具体过程可以包括下述任一项:
第一种实现方式、基于所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号,确定所述迭代过程的一致性信息。
第二种实现方式、基于所述学生模型输出的第一干扰信号和所述教师模型输出的第二干扰信号,确定所述迭代过程的一致性信息。
第三种实现方式、基于所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号,确定所述迭代过程的第一一致性信息,基于所述学生模型输出的第一干扰信号和所述教师模型输出的第二干扰信号,确定所述迭代过程的第二一致性信息,根据所述第一一致性信息和所述第二一致性信息,确定所述迭代过程的一致性信息。
其中,该第一干净语音信号可以是例如公式(8)中所示的能量最大的语音信号,还可以是基于例如公式(9)的PIT算法所确定的语音信号,当然,还可以是基于其他方式所确定的语音信号,本申请实施例对此不做限定。
需要说明的是,上述一致性信息用于表示教师模型所估计的目标说话人频谱和学生模型所估计的目标说话人频谱之间的差距,例如,该一致性信息可以是信号的频谱之间的MSE,也可以是SI-SNR,本申请实施例对此不做具体限定。
上述三种实现方式可以理解为一种对损失函数的构建方法,也即是,通过哪类输入输出来构建该损失函数,从而能够基于损失函数对模型进行反向传播的训练。而此处所构建的损失函数是用于计算教师模型所估计的目标说话人频谱和学生模型所估计的目标说话人频谱之间的差距。
而对于教师模型来说,如上述实施例内容所涉及的,教师模型可以具有两类特征,一类是短时时变抽象特征,一类是长时稳定抽象特征,可以基于这两类特征来确定一致性信息,基于所述第一干净语音信号的短时时变抽象特征和教师模型输出的第二干净语音信号的短时时变抽象特征,确定所述迭代过程的第三一致性信息;基于所述第一干净语音信号的短时时变抽象特征和教师模型输出的第二干净语音信号的长时稳定抽象特征,确定所述迭代过程的第四一致性信息。可选地,基于该第三一致性信息和该第四一致性信息的加权值,构建所述迭代过程最终的一致性信息。
相应地,在构建损失函数时,可以仅基于学生模型和教师模型的短时时变抽象特征来构 建,还可以是基于学生模型和教师模型的短时时变抽象特征以及教师模型的长时稳定抽象特征来构建。
例如,基于学生模型和教师模型的短时时变抽象特征来构建损失函数时,可以采用如下式(10):
Figure PCTCN2020126475-appb-000010
其中,X表示混合语音信号,c t、c t′分别表示学生模型、教师模型预测出的短时抽象特征,v、ν′分别表示学生模型、教师模型的嵌入矩阵,t、f分别表示短时傅里叶变化声谱的帧索引、频带索引。
例如,基于学生模型和教师模型的短时时变抽象特征以及教师模型的长时稳定抽象特征来构建损失函数时,可以采用如下式(11):
Figure PCTCN2020126475-appb-000011
其中,X表示混合语音信号,c L′表示教师模型预测出的长时稳定抽象特征c表示学生模型预测出的短时时变抽象特征,v、ν′分别表示学生模型、教师模型的嵌入矩阵,t、f分别表示短时傅里叶变化声谱的帧索引、频带索引。
对于整个模型训练来说,需要结合准确性和一致性来进行,在每次迭代过程中,基于该次迭代过程所确定的准确性信息和一致性信息,对所述学生模型和所述教师模型的模型参数进行调整,直到满足停止训练条件,将满足所述停止训练条件的迭代过程所确定的学生模型输出为语音分离模型。上述过程是分别对训练目标为准确性的损失函数以及训练目标为模型之间的一致性的损失函数的构建分别进行的说明,而要结合上述准确性信息和一致性信息进行训练,则需要建立能够表达该准确性信息和一致性信息的联合损失函数。
在一种可能的实现方式中,在进行模型参数调整时,可以基于每次迭代过程所确定的所述第三一致性信息以及准确性信息,对所述学生模型和所述教师模型的模型参数进行调整,也即是,联合损失函数可以采用下式(12)来表示:
Figure PCTCN2020126475-appb-000012
其中,
Figure PCTCN2020126475-appb-000013
表示训练目标为准确性的损失函数,
Figure PCTCN2020126475-appb-000014
表示训练目标为一致性的损失函数,具体可以为基于短时时变抽象特征的损失函数,λ为权重因子,λ可以是在神经网络迭代过程中不断优化,直至匹配到最优值。
在一种可能的实现方式中,在进行模型参数调整时,可以基于每次迭代过程所确定的所述第三一致性信息和所述第四一致性信息的加权值以及准确性信息,对所述学生模型和所述教师模型的模型参数进行调整。也即是,该联合损失函数可以采用下式(13)来表示:
Figure PCTCN2020126475-appb-000015
其中,
Figure PCTCN2020126475-appb-000016
训练目标为准确性的损失函数,
Figure PCTCN2020126475-appb-000017
表示基于短时时变抽象特征和长时稳定抽象特征的损失函数,λ 1、λ 2为权重因子。
其中,λ 1、λ 2可以是在神经网络迭代过程中不断优化,直至匹配到最优值。
需要说明的是,上述停止训练条件可以是迭代次数达到目标次数、损失函数趋于平稳等 条件,本申请实施例对此不做限定,例如,在模型训练过程中,若设置批处理数据的大小为32,初始学习率为0.0001,学习率的权重下降系数为0.8,则当模型的损失函数值连续3次迭代都没有改善时,认为训练达到收敛并结束训练。
本申请实施例提供的训练方法,能够自动学习到稳定的隐藏目标说话人的特征,无需额外的PIT处理、说话人追踪机制或者由专家定义的处理和调节等。另一方面,本申请中用到的基于一致性的训练不需要标注信息,可以挖掘海量未标注数据中的无监督信息,来帮助提高系统的鲁棒性和通用性。并且,本申请实施例经过试验,充分验证了基于学生-教师模型的一致性所训练的语音分离模型的有效性,在多种干扰环境多种信噪比的条件下,包括0dB-20dB的音乐背景声干扰、其他说话人干扰以及背景噪声干扰等条件下,该申请实施例的分离性能,在语音质量感知评估、短时客观可懂度以及信号失真比等指标以及稳定性方面,均表现优异。
基于上述训练所得到的语音分离模型,本申请实施例还提供了一种语音分离方法,参见图6所示的语音分离方法的流程图,该方法可以包括:
601、计算机设备获取待分离的声音信号。
602、计算机设备将该声音信号输入语音分离模型,该语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,该教师模型的模型参数基于该学生模型的模型参数配置。
603、计算机设备通过该语音分离模型,对该声音信号中的干净语音信号进行预测,输出该声音信号的干净语音信号。
在一种可能实现方式中,该迭代过程的损失函数基于该学生模型的输出和该学生模型的训练输入之间的准确性信息、该学生模型的输出和该教师模型的输出之间的一致性信息构建。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的第一准确性信息、该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的第二准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的第一一致性信息、该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号之间的第二一致性信息。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征;或,
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征,以及,该学生模型输出的短时时变抽象特征和教师模型输出的长时稳定抽象特征。
需要说明的是,上述模型训练过程和该语音分离过程可以分别由不同计算机设备执行,在模型训练完成后,可以提供至前端或者应用侧的计算机设备来进行语音分离任务,而该语音分离任务可以是语音识别等任务中的一个用于对语音进行分离的子任务,在完成语音分离 后,分离所得到的信号还可以用于进行语音识别等具体的处理过程中,本申请实施例对此不做限定。
图7是本公开实施例提供的一种语音分离模型的训练装置的结构示意图。参见图7,该装置包括:
训练模块701,用于在任一次迭代过程中,将作为训练样本的混合语音信号分别输入学生模型和教师模型,该混合语音信号标注有用于生成该混合语音信号的干净语音信号,该教师模型的模型参数基于该学生模型的模型参数配置;
也即是说,该训练模块701,用于将混合语音信号分别输入学生模型和教师模型,该混合语音信号标注有用于生成该混合语音信号的干净语音信号,该教师模型的模型参数基于该学生模型的模型参数配置;
准确性确定模块702,用于基于该学生模型输出的信号和输入模型的混合语音信号中标注的该干净语音信号,确定该迭代过程的准确性信息,该准确性信息用于表示该学生模型的分离准确程度;
也即是说,该准确性确定模块702,用于基于该学生模型输出的信号和该该混合语音信号中标注的干净语音信号,确定准确性信息,该准确性信息用于表示该学生模型的分离准确程度;
一致性确定模块703,用于基于该学生模型输出的信号和教师模型输出的信号,确定该迭代过程的一致性信息,该准确性信息用于表示该学生模型和该教师模型的分离能力的一致程度;
也即是说,该一致性确定模块703,用于基于该学生模型输出的信号和教师模型输出的信号,确定一致性信息,该准确性信息用于表示该学生模型和该教师模型的分离能力的一致程度;
调整模块704,用于基于每次迭代过程所确定的准确性信息和一致性信息,对该学生模型和该教师模型的模型参数进行调整,直到满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为语音分离模型;
也即是说,该调整模块704,用于基于多个准确性信息和多个一致性信息,调整该学生模型和该教师模型的模型参数,以获取语音分离模型。
在一种可能实现方式中,准确性确定模块702,用于执行下述任一步骤:
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定该迭代过程的准确性信息;
基于该学生模型输出的第一干扰信号和该混合语音信号中标注的该干净语音信号以外的干扰信号,确定该迭代过程的准确性信息;
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定该迭代过程的第一准确性信息;基于该学生模型输出的第一干扰信号和该混合语音信号中标注的该干净语音信号以外的干扰信号,确定该迭代过程的第二准确性信息,根据该第一准确性信息和该第二准确性信息,确定该迭代过程的准确性信息。
在一种可能实现方式中,一致性确定模块703,用于执行下述任一步骤:
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该迭代过程的一致性信息;
基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定该迭代过 程的一致性信息;
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该迭代过程的第一一致性信息,基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定该迭代过程的第二一致性信息,根据该第一一致性信息和该第二一致性信息,确定该迭代过程的一致性信息。
在一种可能实现方式中,该调整模块704,用于采用指数移动平均的方法,基于该学生模型的模型参数确定该教师模型的模型参数,采用确定好的该教师模型的模型参数对该教师模型进行配置。
在一种可能实现方式中,一致性确定模块703,用于基于该第一干净语音信号的短时时变抽象特征和教师模型输出的第二干净语音信号的短时时变抽象特征,确定该迭代过程的第三一致性信息(也即确定该一致性信息)。
在一种可能实现方式中,一致性确定模块703,用于:
基于该第一干净语音信号的短时时变抽象特征和教师模型输出的第二干净语音信号的短时时变抽象特征,确定该迭代过程的第三一致性信息;
基于该第一干净语音信号的短时时变抽象特征和教师模型输出的第二干净语音信号的长时稳定抽象特征,确定该迭代过程的第四一致性信息;
基于该第三一致性信息和该第四一致性信息的加权值,确定该一致性信息。
在一种可能实现方式中,基于图7的装置组成,该装置还包括迭代获取模块,用于迭代多次执行将混合语音信号分别输入学生模型和教师模型,获取多个该准确性信息和多个该一致性信息,一次迭代过程对应于一个准确性信息和一个一致性信息;
该迭代获取模块,还用于响应于满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为该语音分离模型。
在一种可能实现方式中,该学生模型和该教师模型采用PIT方式进行信号分离;或,该学生模型和该教师模型采用突出导向选择机制进行信号分离。
需要说明的是:上述实施例提供的语音分离模型的训练装置在进行语音分离模型的训练时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的语音分离模型的训练装置与语音分离模型的训练方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图8是本申请实施例提供的一种语音分离装置的结构示意图。参见图8,该装置包括:
信号获取模块801,用于获取待分离的声音信号;
输入模块802,用于将该声音信号输入语音分离模型,该语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,该教师模型的模型参数基于该学生模型的模型参数配置;
预测模块803,用于通过该语音分离模型,对该声音信号中的干净语音信号进行预测,输出该声音信号的干净语音信号。
在一种可能实现方式中,该迭代过程的损失函数基于该学生模型的输出和该学生模型的训练输入之间的准确性信息、该学生模型的输出和该教师模型的输出之间的一致性信息构建。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的准确性 信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的第一准确性信息、该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的第二准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的第一一致性信息、该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号之间的第二一致性信息。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征;或,
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征,以及,该学生模型输出的短时时变抽象特征和教师模型输出的长时稳定抽象特征。
需要说明的是:上述实施例提供的语音分离装置在进行语音分离时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的语音分离装置与语音分离方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
在一个示例性实施例中,本申请实施例所涉及的计算机设备,包括一个或多个处理器和一个或多个存储器,该一个或多个存储器中存储有至少一条计算机程序,该至少一条计算机程序由该一个或多个处理器加载并执行如下操作:
将混合语音信号分别输入学生模型和教师模型,该混合语音信号标注有用于生成该混合语音信号的干净语音信号,该教师模型的模型参数基于该学生模型的模型参数配置;
基于该学生模型输出的信号和该混合语音信号中标注的该干净语音信号,确定准确性信息,该准确性信息用于表示该学生模型的分离准确程度;
基于该学生模型输出的信号和教师模型输出的信号,确定一致性信息,该一致性信息用于表示该学生模型和该教师模型的分离能力的一致程度;
基于多个准确性信息和多个一致性信息,调整该学生模型和该教师模型的模型参数,以获取语音分离模型。
在一种可能实现方式中,该至少一条计算机程序由该一个或多个处理器加载并执行下述任一项操作:
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定该准确性信息;
基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定该准确性信息;
基于该学生模型输出的第一干净语音信号和该干净语音信号,确定第一准确性信息;基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定第二准确性信息;根据该第一准确性信息和该第二准确性信息,确定该准确性信息。
在一种可能实现方式中，该至少一条计算机程序由该一个或多个处理器加载并执行下述任一项操作：
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该一致性信息;
基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定该一致性信息;
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定第一一致性信息,基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定第二一致性信息,根据该第一一致性信息和该第二一致性信息,确定该一致性信息。
在一种可能实现方式中，该至少一条计算机程序由该一个或多个处理器加载并执行如下操作：
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定该一致性信息。
在一种可能实现方式中，该至少一条计算机程序由该一个或多个处理器加载并执行如下操作：
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定第三一致性信息;
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的长时稳定抽象特征,确定第四一致性信息;
基于该第三一致性信息和该第四一致性信息的加权值,确定该一致性信息。
在一种可能实现方式中，该至少一条计算机程序由该一个或多个处理器加载并执行如下操作：
采用指数移动平均的方法,基于该学生模型的模型参数确定该教师模型的模型参数,采用确定好的该教师模型的模型参数对该教师模型进行配置。
在一种可能实现方式中，该至少一条计算机程序由该一个或多个处理器加载并执行如下操作：
迭代多次执行将混合语音信号分别输入学生模型和教师模型,获取多个该准确性信息和多个该一致性信息,一次迭代过程对应于一个准确性信息和一个一致性信息;
该至少一条计算机程序还由该一个或多个处理器加载并执行如下操作：
响应于满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为该语音分离模型。
在一种可能实现方式中,该学生模型和该教师模型采用排列不变式训练PIT方式进行信号分离;或,该学生模型和该教师模型采用突出导向选择机制进行信号分离。
在另一个示例性实施例中,本申请实施例所涉及的计算机设备,包括一个或多个处理器和一个或多个存储器,该一个或多个存储器中存储有至少一条计算机程序,该至少一条计算机程序由该一个或多个处理器加载并执行如下操作:
获取待分离的声音信号;
将该声音信号输入语音分离模型,该语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,该教师模型的模型参数基于该学生模型的模型参数配置;
通过该语音分离模型,对该声音信号中的干净语音信号进行预测,输出该声音信号的干净语音信号。
在一种可能实现方式中,该迭代过程的损失函数基于该学生模型的输出和该学生模型的训练输入之间的准确性信息、该学生模型的输出和该教师模型的输出之间的一致性信息构建。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的第一准确性信息、该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的第二准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的第一一致性信息、该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号之间的第二一致性信息。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征;或,
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征,以及,该学生模型输出的短时时变抽象特征和教师模型输出的长时稳定抽象特征。
对于本申请实施例提供的计算机设备，可以实现为一服务器，图9是本申请实施例提供的一种服务器的结构示意图，该服务器900可因配置或性能不同而产生比较大的差异，可以包括一个或多个处理器（central processing units，CPU）901和一个或多个存储器902，其中，所述一个或多个存储器902中存储有至少一条计算机程序，所述至少一条计算机程序由所述一个或多个处理器901加载并执行以实现上述各个实施例提供的语音信号的处理方法（也即语音分离模型的训练方法）或语音分离方法。当然，该服务器900还可以具有有线或无线网络接口、键盘以及输入输出接口等部件，以便进行输入输出，该服务器900还可以包括其他用于实现设备功能的部件，在此不做赘述。
对于本申请实施例提供的计算机设备,可以实现为一终端,图10是本申请实施例提供的一种终端的结构示意图,该终端可以用于执行上述实施例中终端侧的方法。该终端1000可以是:智能手机、智能语音助手、智能音箱、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端1000还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端1000包括有:一个或多个处理器1001和一个或多个存储器1002。
处理器1001可以包括一个或多个处理核心，比如4核心处理器、8核心处理器等。处理器1001可以采用DSP（Digital Signal Processing，数字信号处理）、FPGA（Field-Programmable Gate Array，现场可编程门阵列）、PLA（Programmable Logic Array，可编程逻辑阵列）中的至少一种硬件形式来实现。处理器1001也可以包括主处理器和协处理器，主处理器是用于对在唤醒状态下的数据进行处理的处理器，也称CPU（Central Processing Unit，中央处理器）；协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中，处理器1001可以集成有GPU（Graphics Processing Unit，图像处理器），GPU用于负责显示屏所需要显示的内容的渲染和绘制。在一些实施例中，处理器1001还可以包括AI（Artificial Intelligence，人工智能）处理器，该AI处理器用于处理有关机器学习的计算操作。
存储器1002可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1002还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1002中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器1001所执行以实现本申请中方法实施例提供的语音分离方法或语音分离模型的训练方法。
在一些实施例中,终端1000还可选包括有:外围设备接口1003和至少一个外围设备。处理器1001、存储器1002和外围设备接口1003之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1003相连。可选地,外围设备包括:射频电路1004、显示屏1005、摄像头组件1006、音频电路1007、定位组件1008和电源1009中的至少一种。
外围设备接口1003可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器1001和存储器1002。在一些实施例中,处理器1001、存储器1002和外围设备接口1003被集成在同一芯片或电路板上;在一些其他实施例中,处理器1001、存储器1002和外围设备接口1003中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路1004用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路1004通过电磁信号与通信网络以及其他通信设备进行通信。射频电路1004将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路1004包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。射频电路1004可以通过至少一种无线通信协议来与其它终端进行通信。该无线通信协议包括但不限于:城域网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路1004还可以包括NFC(Near Field Communication,近距离无线通信)有关的电路,本申请对此不加以限定。
显示屏1005用于显示UI（User Interface，用户界面）。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏1005是触摸显示屏时，显示屏1005还具有采集在显示屏1005的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器1001进行处理。此时，显示屏1005还可以用于提供虚拟按钮和/或虚拟键盘，也称软按钮和/或软键盘。在一些实施例中，显示屏1005可以为一个，设置在终端1000的前面板；在另一些实施例中，显示屏1005可以为至少两个，分别设置在终端1000的不同表面或呈折叠设计；在再一些实施例中，显示屏1005可以是柔性显示屏，设置在终端1000的弯曲表面上或折叠面上。甚至，显示屏1005还可以设置成非矩形的不规则图形，也即异形屏。显示屏1005可以采用LCD（Liquid Crystal Display，液晶显示屏）、OLED（Organic Light-Emitting Diode，有机发光二极管）等材质制备。
摄像头组件1006用于采集图像或视频。可选地，摄像头组件1006包括前置摄像头和后置摄像头。通常，前置摄像头设置在终端的前面板，后置摄像头设置在终端的背面。在一些实施例中，后置摄像头为至少两个，分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种，以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR（Virtual Reality，虚拟现实）拍摄功能或者其它融合拍摄功能。在一些实施例中，摄像头组件1006还可以包括闪光灯。闪光灯可以是单色温闪光灯，也可以是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合，可以用于不同色温下的光线补偿。
音频电路1007可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器1001进行处理,或者输入至射频电路1004以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在终端1000的不同部位。麦克风还可以是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器1001或射频电路1004的电信号转换为声波。扬声器可以是传统的薄膜扬声器,也可以是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅可以将电信号转换为人类可听见的声波,也可以将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路1007还可以包括耳机插孔。
定位组件1008用于定位终端1000的当前地理位置，以实现导航或LBS（Location Based Service，基于位置的服务）。定位组件1008可以是基于美国的GPS（Global Positioning System，全球定位系统）、中国的北斗系统、俄罗斯的格洛纳斯系统或欧盟的伽利略系统的定位组件。
电源1009用于为终端1000中的各个组件进行供电。电源1009可以是交流电、直流电、一次性电池或可充电电池。当电源1009包括可充电电池时,该可充电电池可以支持有线充电或无线充电。该可充电电池还可以用于支持快充技术。
在一些实施例中,终端1000还包括有一个或多个传感器1010。该一个或多个传感器1010包括但不限于:加速度传感器1011、陀螺仪传感器1012、压力传感器1013、指纹传感器1014、光学传感器1015以及接近传感器1016。
加速度传感器1011可以检测以终端1000建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器1011可以用于检测重力加速度在三个坐标轴上的分量。处理器1001可以根据加速度传感器1011采集的重力加速度信号,控制显示屏1005以横向视图或纵向视图进行用户界面的显示。加速度传感器1011还可以用于游戏或者用户的运动数据的采集。
陀螺仪传感器1012可以检测终端1000的机体方向及转动角度,陀螺仪传感器1012可以与加速度传感器1011协同采集用户对终端1000的3D动作。处理器1001根据陀螺仪传感器1012采集的数据,可以实现如下功能:动作感应(比如根据用户的倾斜操作来改变UI)、拍摄时的图像稳定、游戏控制以及惯性导航。
压力传感器1013可以设置在终端1000的侧边框和/或显示屏1005的下层。当压力传感器1013设置在终端1000的侧边框时,可以检测用户对终端1000的握持信号,由处理器1001根据压力传感器1013采集的握持信号进行左右手识别或快捷操作。当压力传感器1013设置在显示屏1005的下层时,由处理器1001根据用户对显示屏1005的压力操作,实现对UI界面上的可操作性控件进行控制。可操作性控件包括按钮控件、滚动条控件、图标控件、菜单控件中的至少一种。
指纹传感器1014用于采集用户的指纹，由处理器1001根据指纹传感器1014采集到的指纹识别用户的身份，或者，由指纹传感器1014根据采集到的指纹识别用户的身份。在识别出用户的身份为可信身份时，由处理器1001授权该用户执行相关的敏感操作，该敏感操作包括解锁屏幕、查看加密信息、下载软件、支付及更改设置等。指纹传感器1014可以被设置在终端1000的正面、背面或侧面。当终端1000上设置有物理按键或厂商Logo时，指纹传感器1014可以与物理按键或厂商Logo集成在一起。
光学传感器1015用于采集环境光强度。在一个实施例中,处理器1001可以根据光学传感器1015采集的环境光强度,控制显示屏1005的显示亮度。可选地,当环境光强度较高时,调高显示屏1005的显示亮度;当环境光强度较低时,调低显示屏1005的显示亮度。在另一个实施例中,处理器1001还可以根据光学传感器1015采集的环境光强度,动态调整摄像头组件1006的拍摄参数。
接近传感器1016,也称距离传感器,通常设置在终端1000的前面板。接近传感器1016用于采集用户与终端1000的正面之间的距离。在一个实施例中,当接近传感器1016检测到用户与终端1000的正面之间的距离逐渐变小时,由处理器1001控制显示屏1005从亮屏状态切换为息屏状态;当接近传感器1016检测到用户与终端1000的正面之间的距离逐渐变大时,由处理器1001控制显示屏1005从息屏状态切换为亮屏状态。
本领域技术人员可以理解,图10中示出的结构并不构成对终端1000的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
在示例性实施例中,还提供了一种计算机可读存储介质,例如包括计算机程序的存储器,上述计算机程序可由处理器执行以完成上述实施例中的语音分离方法或语音分离模型的训练方法。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在一个示例性实施例中,该计算机可读存储介质中存储的至少一条计算机程序由处理器加载并执行如下操作:
将混合语音信号分别输入学生模型和教师模型,该混合语音信号标注有用于生成该混合语音信号的干净语音信号,该教师模型的模型参数基于该学生模型的模型参数配置;
基于该学生模型输出的信号和该混合语音信号中标注的该干净语音信号,确定准确性信息,该准确性信息用于表示该学生模型的分离准确程度;
基于该学生模型输出的信号和教师模型输出的信号,确定一致性信息,该一致性信息用于表示该学生模型和该教师模型的分离能力的一致程度;
基于多个准确性信息和多个一致性信息,调整该学生模型和该教师模型的模型参数,以获取语音分离模型。
在一种可能实现方式中,该至少一条计算机程序由处理器加载并执行下述任一项操作:
基于该学生模型输出的第一干净语音信号和该混合语音信号中标注的该干净语音信号,确定该准确性信息;
基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定该准确性信息;
基于该学生模型输出的第一干净语音信号和该干净语音信号,确定第一准确性信息;基于该学生模型输出的第一干扰信号和该混合语音信号中除了该干净语音信号以外的干扰信号,确定第二准确性信息;根据该第一准确性信息和该第二准确性信息,确定该准确性信息。
在一种可能实现方式中,该至少一条计算机程序由处理器加载并执行下述任一项操作:
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定该一致性信息;
基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号，确定该一致性信息；
基于该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号,确定第一一致性信息,基于该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号,确定第二一致性信息,根据该第一一致性信息和该第二一致性信息,确定该一致性信息。
在一种可能实现方式中,该至少一条计算机程序由处理器加载并执行如下操作:
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定该一致性信息。
在一种可能实现方式中,该至少一条计算机程序由处理器加载并执行如下操作:
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的短时时变抽象特征,确定第三一致性信息;
基于该第一干净语音信号的短时时变抽象特征和该第二干净语音信号的长时稳定抽象特征,确定第四一致性信息;
基于该第三一致性信息和该第四一致性信息的加权值,确定该一致性信息。
在一种可能实现方式中,该至少一条计算机程序由处理器加载并执行如下操作:采用指数移动平均的方法,基于该学生模型的模型参数确定该教师模型的模型参数,采用确定好的该教师模型的模型参数对该教师模型进行配置。
在一种可能实现方式中,该至少一条计算机程序由处理器加载并执行如下操作:
迭代多次执行将混合语音信号分别输入学生模型和教师模型,获取多个该准确性信息和多个该一致性信息,一次迭代过程对应于一个准确性信息和一个一致性信息;
在一种可能实现方式中,该至少一条计算机程序还由处理器加载并执行如下操作:
响应于满足停止训练条件,将满足该停止训练条件的迭代过程所确定的学生模型输出为该语音分离模型。
在一种可能实现方式中,该学生模型和该教师模型采用排列不变式训练PIT方式进行信号分离;或,该学生模型和该教师模型采用突出导向选择机制进行信号分离。
在另一个示例性实施例中,该计算机可读存储介质中存储的至少一条计算机程序由处理器加载并执行如下操作:
获取待分离的声音信号;
将该声音信号输入语音分离模型,该语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,该教师模型的模型参数基于该学生模型的模型参数配置;
通过该语音分离模型,对该声音信号中的干净语音信号进行预测,输出该声音信号的干净语音信号。
在一种可能实现方式中,该迭代过程的损失函数基于该学生模型的输出和该学生模型的训练输入之间的准确性信息、该学生模型的输出和该教师模型的输出之间的一致性信息构建。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的一致性信息;
或,该学生模型输出的第一干净语音信号和该混合语音信号中的干净语音信号之间的第一准确性信息、该学生模型输出的第一干扰信号和该混合语音信号中的干扰信号之间的第二准确性信息、该学生模型输出的第一干净语音信号和该教师模型输出的第二干净语音信号之间的第一一致性信息、该学生模型输出的第一干扰信号和该教师模型输出的第二干扰信号之间的第二一致性信息。
在一种可能实现方式中,该迭代过程的损失函数基于下述信息构建:
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征;或,
该学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征,以及,该学生模型输出的短时时变抽象特征和教师模型输出的长时稳定抽象特征。
示意性地,本申请实施例还提供一种计算机程序产品或计算机程序,该计算机程序产品或该计算机程序包括一条或多条程序代码,该一条或多条程序代码存储在计算机可读存储介质中。计算机设备的一个或多个处理器能够从计算机可读存储介质中读取该一条或多条程序代码,该一个或多个处理器执行该一条或多条程序代码,使得计算机设备能够执行上述各个实施例中涉及的语音信号的处理方法或语音分离方法。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
上述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (18)

  1. 一种语音信号的处理方法,由计算机设备执行,其中,包括:
    将混合语音信号分别输入学生模型和教师模型,所述混合语音信号标注有用于生成所述混合语音信号的干净语音信号,所述教师模型的模型参数基于所述学生模型的模型参数配置;
    基于所述学生模型输出的信号和所述混合语音信号中标注的所述干净语音信号,确定准确性信息,所述准确性信息用于表示所述学生模型的分离准确程度;
    基于所述学生模型输出的信号和所述教师模型输出的信号,确定一致性信息,所述一致性信息用于表示所述学生模型和所述教师模型的分离能力的一致程度;
    基于多个所述准确性信息和多个所述一致性信息,调整所述学生模型和所述教师模型的模型参数,以获取语音分离模型。
  2. 根据权利要求1所述的方法,其中,所述基于所述学生模型输出的信号和所述混合语音信号中标注的所述干净语音信号,确定准确性信息包括下述任一项:
    基于所述学生模型输出的第一干净语音信号和所述混合语音信号中标注的所述干净语音信号,确定所述准确性信息;
    基于所述学生模型输出的第一干扰信号和所述混合语音信号中除了所述干净语音信号以外的干扰信号,确定所述准确性信息;以及
    基于所述学生模型输出的第一干净语音信号和所述混合语音信号中标注的所述干净语音信号,确定第一准确性信息;基于所述学生模型输出的第一干扰信号和所述混合语音信号中除了所述干净语音信号以外的干扰信号,确定第二准确性信息;根据所述第一准确性信息和所述第二准确性信息,确定所述准确性信息。
  3. 根据权利要求1所述的方法,其中,所述基于所述学生模型输出的信号和所述教师模型输出的信号,确定一致性信息包括下述任一项:
    基于所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号,确定所述一致性信息;
    基于所述学生模型输出的第一干扰信号和所述教师模型输出的第二干扰信号,确定所述一致性信息;以及
    基于所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号,确定第一一致性信息,基于所述学生模型输出的第一干扰信号和所述教师模型输出的第二干扰信号,确定第二一致性信息,根据所述第一一致性信息和所述第二一致性信息,确定所述一致性信息。
  4. 根据权利要求3所述的方法,其中,所述基于所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号,确定所述一致性信息包括:
    基于所述第一干净语音信号的短时时变抽象特征和所述第二干净语音信号的短时时变抽象特征,确定所述一致性信息。
  5. 根据权利要求3所述的方法,其中,所述基于所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号,确定所述一致性信息包括:
    基于所述第一干净语音信号的短时时变抽象特征和所述第二干净语音信号的短时时变抽象特征,确定第三一致性信息;
    基于所述第一干净语音信号的短时时变抽象特征和所述第二干净语音信号的长时稳定抽象特征,确定第四一致性信息;
    基于所述第三一致性信息和所述第四一致性信息的加权值,确定所述一致性信息。
  6. 根据权利要求1所述的方法,其中,所述调整所述学生模型和所述教师模型的模型参数包括:
    采用指数移动平均的方法,基于所述学生模型的模型参数确定所述教师模型的模型参数,采用确定好的所述教师模型的模型参数对所述教师模型进行配置。
  7. 根据权利要求1至6中任一项所述的方法,其中,所述方法还包括:
    迭代多次执行将混合语音信号分别输入学生模型和教师模型,获取多个所述准确性信息和多个所述一致性信息,一次迭代过程对应于一个准确性信息和一个一致性信息;
    所述获取语音分离模型包括:
    响应于满足停止训练条件,将满足所述停止训练条件的迭代过程所确定的学生模型输出为所述语音分离模型。
  8. 根据权利要求1所述的方法,其中,所述学生模型和所述教师模型采用排列不变式训练PIT方式进行信号分离;或,所述学生模型和所述教师模型采用突出导向选择机制进行信号分离。
  9. 一种语音分离方法,由计算机设备执行,其中,包括:
    获取待分离的声音信号;
    将所述声音信号输入语音分离模型,所述语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,所述教师模型的模型参数基于所述学生模型的模型参数配置;
    通过所述语音分离模型,对所述声音信号中的干净语音信号进行预测,输出所述声音信号的干净语音信号。
  10. 根据权利要求9所述的方法,其中,所述迭代过程的损失函数基于所述学生模型的输出和所述学生模型的训练输入之间的准确性信息、所述学生模型的输出和所述教师模型的输出之间的一致性信息构建。
  11. 根据权利要求10所述的方法,其中,所述迭代过程的损失函数基于下述信息构建:
    所述学生模型输出的第一干净语音信号和所述混合语音信号中的干净语音信号之间的准确性信息、所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号之间的一致性信息;
    或,所述学生模型输出的第一干扰信号和所述混合语音信号中的干扰信号之间的准确性信息、所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号之间的一致性信息;
    或,所述学生模型输出的第一干净语音信号和所述混合语音信号中的干净语音信号之间的第一准确性信息、所述学生模型输出的第一干扰信号和所述混合语音信号中的干扰信号之间的第二准确性信息、所述学生模型输出的第一干净语音信号和所述教师模型输出的第二干净语音信号之间的第一一致性信息、所述学生模型输出的第一干扰信号和所述教师模型输出的第二干扰信号之间的第二一致性信息。
  12. 根据权利要求10所述的方法,其中,所述迭代过程的损失函数基于下述信息构建:
    所述学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征;或,
    所述学生模型输出的短时时变抽象特征和教师模型输出的短时时变抽象特征,以及,所述学生模型输出的短时时变抽象特征和教师模型输出的长时稳定抽象特征。
  13. 一种语音信号的处理装置,其中,包括:
    训练模块,用于将混合语音信号分别输入学生模型和教师模型,所述混合语音信号标注有用于生成所述混合语音信号的干净语音信号,所述教师模型的模型参数基于所述学生模型的模型参数配置;
    准确性确定模块,用于基于所述学生模型输出的信号和输入模型的混合语音信号中标注的所述干净语音信号,确定准确性信息,所述准确性信息用于表示所述学生模型的分离准确程度;
    一致性确定模块,用于基于所述学生模型输出的信号和所述教师模型输出的信号,确定一致性信息,所述一致性信息用于表示所述学生模型和所述教师模型的分离能力的一致程度;
    调整模块,用于基于多个所述准确性信息和多个所述一致性信息,调整所述学生模型和所述教师模型的模型参数,以获取语音分离模型。
  14. 一种语音分离装置,其中,包括:
    信号获取模块,用于获取待分离的声音信号;
    输入模块,用于将所述声音信号输入语音分离模型,所述语音分离模型基于混合语音信号以及学生模型和教师模型协同迭代训练得到,所述教师模型的模型参数基于所述学生模型的模型参数配置;
    预测模块,用于通过所述语音分离模型,对所述声音信号中的干净语音信号进行预测,输出所述声音信号的干净语音信号。
  15. 一种计算机设备,其中,所述计算机设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条计算机程序,所述至少一条计算机程序由所述一个或多个处理器加载并执行如权利要求1至权利要求8任一项所述的语音信号的处理方法。
  16. 一种计算机设备，其中，所述计算机设备包括一个或多个处理器和一个或多个存储器，所述一个或多个存储器中存储有至少一条计算机程序，所述至少一条计算机程序由所述一个或多个处理器加载并执行如权利要求9至权利要求12任一项所述的语音分离方法。
  17. 一种计算机可读存储介质,其中,所述计算机可读存储介质中存储有至少一条计算机程序,所述至少一条计算机程序由一个或多个处理器加载并执行如权利要求1至权利要求8任一项所述的语音信号的处理方法。
  18. 一种计算机可读存储介质,其中,所述计算机可读存储介质中存储有至少一条计算机程序,所述至少一条计算机程序由一个或多个处理器加载并执行如权利要求9至权利要求12任一项所述的语音分离方法。
PCT/CN2020/126475 2020-01-02 2020-11-04 语音信号的处理方法、语音分离方法 WO2021135628A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20908929.1A EP3992965A4 (en) 2020-01-02 2020-11-04 METHODS FOR SPEECH SIGNAL PROCESSING AND VOICE SEPARATION METHODS
US17/674,677 US20220172737A1 (en) 2020-01-02 2022-02-17 Speech signal processing method and speech separation method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010003201.2 2020-01-02
CN202010003201.2A CN111179962B (zh) 2020-01-02 2020-01-02 语音分离模型的训练方法、语音分离方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/674,677 Continuation US20220172737A1 (en) 2020-01-02 2022-02-17 Speech signal processing method and speech separation method

Publications (1)

Publication Number Publication Date
WO2021135628A1 true WO2021135628A1 (zh) 2021-07-08

Family

ID=70652590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/126475 WO2021135628A1 (zh) 2020-01-02 2020-11-04 语音信号的处理方法、语音分离方法

Country Status (4)

Country Link
US (1) US20220172737A1 (zh)
EP (1) EP3992965A4 (zh)
CN (1) CN111179962B (zh)
WO (1) WO2021135628A1 (zh)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200128938A (ko) 2019-05-07 2020-11-17 삼성전자주식회사 모델 학습 방법 및 장치
CN111179962B (zh) * 2020-01-02 2022-09-27 腾讯科技(深圳)有限公司 语音分离模型的训练方法、语音分离方法及装置
CN111611808B (zh) * 2020-05-22 2023-08-01 北京百度网讯科技有限公司 用于生成自然语言模型的方法和装置
CN112562726B (zh) * 2020-10-27 2022-05-27 昆明理工大学 一种基于mfcc相似矩阵的语音音乐分离方法
CN112309375B (zh) * 2020-10-28 2024-02-23 平安科技(深圳)有限公司 语音识别模型的训练测试方法、装置、设备及存储介质
CN113380268A (zh) * 2021-08-12 2021-09-10 北京世纪好未来教育科技有限公司 模型训练的方法、装置和语音信号的处理方法、装置
CN113707123B (zh) * 2021-08-17 2023-10-20 慧言科技(天津)有限公司 一种语音合成方法及装置
CN113724740B (zh) * 2021-08-30 2024-03-08 中国科学院声学研究所 音频事件检测模型训练方法及装置
CN115132183B (zh) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 音频识别模型的训练方法、装置、设备、介质及程序产品
CN116403599B (zh) * 2023-06-07 2023-08-15 中国海洋大学 一种高效的语音分离方法及其模型搭建方法


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI223792B (en) * 2003-04-04 2004-11-11 Penpower Technology Ltd Speech model training method applied in speech recognition
US10147442B1 (en) * 2015-09-29 2018-12-04 Amazon Technologies, Inc. Robust neural network acoustic model with side task prediction of reference signals
CN108615533B (zh) * 2018-03-28 2021-08-03 天津大学 一种基于深度学习的高性能语音增强方法
CN108962229B (zh) * 2018-07-26 2020-11-13 汕头大学 一种基于单通道、无监督式的目标说话人语音提取方法
CN110600017B (zh) * 2019-09-12 2022-03-04 腾讯科技(深圳)有限公司 语音处理模型的训练方法、语音识别方法、系统及装置
CN111341341B (zh) * 2020-02-11 2021-08-17 腾讯科技(深圳)有限公司 音频分离网络的训练方法、音频分离方法、装置及介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6832194B1 (en) * 2000-10-26 2004-12-14 Sensory, Incorporated Audio recognition peripheral system
US20110313953A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Automated Classification Pipeline Tuning Under Mobile Device Resource Constraints
CN110459240A (zh) * 2019-08-12 2019-11-15 新疆大学 基于卷积神经网络和深度聚类的多说话人语音分离方法
CN110390950A (zh) * 2019-08-17 2019-10-29 杭州派尼澳电子科技有限公司 一种基于生成对抗网络的端到端语音增强方法
CN111179962A (zh) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 语音分离模型的训练方法、语音分离方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3992965A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117577124A (zh) * 2024-01-12 2024-02-20 京东城市(北京)数字科技有限公司 基于知识蒸馏的音频降噪模型的训练方法、装置及设备
CN117577124B (zh) * 2024-01-12 2024-04-16 京东城市(北京)数字科技有限公司 基于知识蒸馏的音频降噪模型的训练方法、装置及设备

Also Published As

Publication number Publication date
EP3992965A4 (en) 2022-09-07
CN111179962A (zh) 2020-05-19
CN111179962B (zh) 2022-09-27
US20220172737A1 (en) 2022-06-02
EP3992965A1 (en) 2022-05-04

Similar Documents

Publication Publication Date Title
WO2021135628A1 (zh) 语音信号的处理方法、语音分离方法
WO2021135577A9 (zh) 音频信号处理方法、装置、电子设备及存储介质
CN110288978B (zh) 一种语音识别模型训练方法及装置
CN108615526B (zh) 语音信号中关键词的检测方法、装置、终端及存储介质
CN111063342B (zh) 语音识别方法、装置、计算机设备及存储介质
CN113763933B (zh) 语音识别方法、语音识别模型的训练方法、装置和设备
CN110047468B (zh) 语音识别方法、装置及存储介质
WO2021114847A1 (zh) 网络通话方法、装置、计算机设备及存储介质
CN111696570B (zh) 语音信号处理方法、装置、设备及存储介质
KR102369083B1 (ko) 음성 데이터 처리 방법 및 이를 지원하는 전자 장치
US20240105159A1 (en) Speech processing method and related device
CN111863020B (zh) 语音信号处理方法、装置、设备及存储介质
WO2021238599A1 (zh) 对话模型的训练方法、装置、计算机设备及存储介质
CN111986691B (zh) 音频处理方法、装置、计算机设备及存储介质
CN113763532B (zh) 基于三维虚拟对象的人机交互方法、装置、设备及介质
CN111581958A (zh) 对话状态确定方法、装置、计算机设备及存储介质
CN113409770A (zh) 发音特征处理方法、装置、服务器及介质
CN111341307A (zh) 语音识别方法、装置、电子设备及存储介质
CN110990549A (zh) 获取答案的方法、装置、电子设备及存储介质
CN115168643B (zh) 音频处理方法、装置、设备及计算机可读存储介质
CN116956814A (zh) 标点预测方法、装置、设备及存储介质
US20220223142A1 (en) Speech recognition method and apparatus, computer device, and computer-readable storage medium
CN115116437A (zh) 语音识别方法、装置、计算机设备、存储介质及产品
CN109829067B (zh) 音频数据处理方法、装置、电子设备及存储介质
CN111737415A (zh) 实体关系抽取方法、实体关系学习模型的获取方法及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20908929

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020908929

Country of ref document: EP

Effective date: 20220131

NENP Non-entry into the national phase

Ref country code: DE