WO2021203880A1 - Speech enhancement method, method for training a neural network, and related device - Google Patents

Speech enhancement method, method for training a neural network, and related device - Download PDF

Info

Publication number
WO2021203880A1
WO2021203880A1 (PCT/CN2021/079047; CN2021079047W)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
enhanced
speech
signal
image
Prior art date
Application number
PCT/CN2021/079047
Other languages
English (en)
French (fr)
Inventor
王午芃
邢超
陈晓
孙凤宇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021203880A1 publication Critical patent/WO2021203880A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method for speech enhancement, a method for training a neural network, and related equipment.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Speech recognition refers to a technology that recognizes corresponding text content from speech waveforms, and is one of the important technologies in the field of artificial intelligence.
  • Speech enhancement, usually also called speech noise reduction, is a very important technology in this field.
  • Speech enhancement can eliminate high-frequency noise, low-frequency noise, white noise, and various other noises in the speech signal, thereby improving the effect of speech recognition. Therefore, how to improve the effect of speech enhancement is a problem that needs to be solved urgently.
  • The embodiments of the present application provide a speech enhancement method that applies image information in the speech enhancement process. Even in relatively noisy environments, it can improve speech enhancement performance and the listening experience.
  • A first aspect of the present application provides a speech enhancement method, which may include: acquiring speech to be enhanced and a reference image, where the speech to be enhanced and the reference image are data acquired at the same moment.
  • A first neural network outputs a first enhanced signal of the speech to be enhanced; the first neural network is a neural network obtained by training on mixed data of speech and noise with a first mask as the training target.
  • A second neural network outputs a masking function of the reference image; the masking function indicates whether the frequency band energy corresponding to the reference image is less than a preset value, and frequency band energy less than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise. The second neural network is a neural network obtained by training, with a second mask as the training target, on images that include lip features and correspond to the sound source of the speech used by the first neural network.
  • A second enhanced signal of the speech to be enhanced is determined according to the calculation result of the first enhanced signal and the masking function. It can be seen from the first aspect that the first neural network is used to output the first enhanced signal of the speech to be enhanced, and the second neural network is used to model the association between image information and speech information, so that the masking function of the reference image output by the second neural network can indicate whether the speech to be enhanced corresponding to the reference image is noise or speech.
  • The reference image is an image that corresponds to the sound source of the speech to be enhanced and may include lip features.
  • Determining the second enhanced signal of the speech to be enhanced according to the calculation result of the first enhanced signal and the masking function may include: using the first enhanced signal and the masking function as input data of a third neural network, and determining the second enhanced signal according to a weight value output by the third neural network. The weight value indicates the output ratios, in the second enhanced signal, of the first enhanced signal and of a correction signal, where the correction signal is the calculation result of the masking function and the first enhanced signal. The third neural network is a neural network obtained by training on the output data of the first neural network and the output data of the second neural network with the first mask as the training target.
  • The method may further include: determining whether the reference image includes face information or lip information. When the reference image does not include face information or lip information, the weight value indicates that the output ratio of the correction signal in the second enhanced signal is 0 and the output ratio of the first enhanced signal is 100%.
  • The correction signal is the product of the first enhanced signal and the masking function. Specifically, the correction signal may be determined from the products of M signal-to-noise ratios and the masking function at a first moment, where M is a positive integer: the first enhanced signal output by the first neural network at the first moment may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment. A sketch of this fusion is given below.
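  • A minimal sketch of how such a band-wise fusion could be computed, assuming the per-band first enhanced signal, the image-derived masking value, and the fusion weight have already been produced by the three networks (all names, shapes, and the exact combination form are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def fuse_enhanced_signal(first_enhanced, mask_value, weight):
    """Illustrative band-wise fusion of an audio-only enhanced signal with an
    image-derived masking function.

    first_enhanced: shape (M,), the M per-band values (e.g. SNR-like gains)
                    output by the first (audio) neural network at one moment.
    mask_value:     scalar in [0, 1] output by the second (image) neural network
                    at the same moment; ~0 means the band energy is below the
                    preset value, i.e. the corresponding speech band is noise.
    weight:         scalar in [0, 1] output by the third (fusion) neural network,
                    interpreted here as the output ratio of the correction signal.
    """
    # Correction signal: product of the M per-band values and the masking function.
    correction = first_enhanced * mask_value
    # Weighted mixture of the first enhanced signal and the correction signal.
    return (1.0 - weight) * first_enhanced + weight * correction

# When no face/lip information is detected, the weight is forced to 0,
# so the output is 100% the first enhanced signal.
first_enhanced = np.array([0.9, 0.2, 0.7, 0.1])
print(fuse_enhanced_signal(first_enhanced, mask_value=0.0, weight=0.0))
```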
  • The speech to be enhanced may include a first acoustic feature frame, and the moment corresponding to the first acoustic feature frame is indicated by a first time index. The reference image may include a first image frame, and the first image frame is input data of the second neural network. Outputting the masking function of the reference image by the second neural network may include: the second neural network outputs, at the first moment, the masking function corresponding to the first image frame, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame (see the sketch after this item).
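  • As a hedged illustration of this frame-rate alignment, the following sketch maps an acoustic-feature frame index to the image frame covering the same moment; the 100 Hz acoustic frame rate and 25 Hz video frame rate are assumptions for the example, not values from the patent:

```python
def aligned_image_frame_index(audio_frame_index: int,
                              audio_frame_rate: float = 100.0,
                              image_frame_rate: float = 25.0) -> int:
    """Map an acoustic-feature frame index (the first time index) to the image
    frame whose masking function is used at that moment (illustrative only)."""
    ratio = audio_frame_rate / image_frame_rate   # e.g. 100 / 25 = 4
    return int(audio_frame_index // ratio)        # 4 acoustic frames share one image frame

# Acoustic frames 0..3 reuse image frame 0, frames 4..7 reuse image frame 1, and so on.
print([aligned_image_frame_index(t) for t in range(8)])   # [0, 0, 0, 0, 1, 1, 1, 1]
```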
  • The method may further include: performing feature transformation on the speech to be enhanced to obtain the frequency-domain features of the speech to be enhanced.
  • the method may further include: performing feature inverse transformation on the second enhanced signal to obtain enhanced speech.
  • performing feature transformation on the voice to be enhanced may include: performing a short-time Fourier transform STFT on the voice to be enhanced.
  • Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform ISTFT on the second enhanced signal.
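  • A minimal sketch of this feature-transform pair using SciPy's STFT/ISTFT (the sampling rate, window length, and hop are illustrative choices; the enhancement step in the middle is only a placeholder):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                  # assumed sampling rate
x = np.random.randn(fs)                     # 1 s of placeholder mono speech to be enhanced

# Feature transform: STFT yields the frequency-domain features per acoustic frame.
f, t, X = stft(x, fs=fs, nperseg=512, noverlap=384)
features = np.abs(X)                        # e.g. magnitudes fed to the first neural network

# ... the networks would produce the second enhanced signal here; identity as placeholder ...
X_enhanced = X

# Inverse feature transform: ISTFT turns the second enhanced signal back into a waveform.
_, x_enhanced = istft(X_enhanced, fs=fs, nperseg=512, noverlap=384)
```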
  • The method may further include: sampling the reference image so that the frame rate of the image frames included in the reference image is a preset frame rate.
  • The lip features are obtained by feature extraction on a face image, and the face image is obtained by face detection on the reference image.
  • The frequency band energy of the reference image is represented by an activation function, so that the value of the activation function approximates the IBM, thereby obtaining the second neural network; a sketch of this objective is given below.
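  • A minimal sketch of what training an activation-function value to approximate the IBM could look like, assuming a sigmoid output per frequency band and a binary cross-entropy loss; the network body, feature dimensions, and batch are placeholders, not the patent's architecture:

```python
import torch
import torch.nn as nn

num_bands = 64          # assumed number of frequency bands per frame
lip_feature_dim = 128   # assumed dimensionality of the extracted lip features

# Placeholder second neural network: lip features -> per-band sigmoid activations in [0, 1].
image_net = nn.Sequential(
    nn.Linear(lip_feature_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_bands),
    nn.Sigmoid(),       # activation whose value is trained to approximate the IBM
)

lip_features = torch.randn(8, lip_feature_dim)               # a batch of lip-feature vectors
ibm_target = torch.randint(0, 2, (8, num_bands)).float()     # IBM labels: 1 = speech band, 0 = noise band

pred = image_net(lip_features)
loss = nn.functional.binary_cross_entropy(pred, ibm_target)  # pushes the activation toward the IBM
loss.backward()
```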
  • The speech to be enhanced is acquired through a single audio channel.
  • The first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM); common definitions of both are sketched below.
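  • For reference, one common way of computing an ideal ratio mask and an ideal binary mask from the magnitude spectra of clean speech and noise; the exact definitions (power-based ratio, 0 dB threshold) are conventional assumptions rather than the patent's:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """IRM: floating-point mask in [0, 1] per time-frequency bin."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

def ideal_binary_mask(speech_mag, noise_mag, snr_threshold_db=0.0, eps=1e-8):
    """IBM: 1 where the local SNR exceeds the threshold, 0 otherwise."""
    snr_db = 10.0 * np.log10((speech_mag ** 2 + eps) / (noise_mag ** 2 + eps))
    return (snr_db > snr_threshold_db).astype(np.float32)
```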
  • A second aspect of the present application provides a method for training a neural network used for speech enhancement. The method may include: obtaining training data, where the training data may include mixed data of speech and noise and images that include lip features and correspond to the sound source of the speech.
  • A first neural network is obtained by training on the mixed data, and the trained first neural network is used to output a first enhanced signal of the speech to be enhanced. A second neural network is obtained by training on the images, and the trained second neural network is used to output a masking function of a reference image. The masking function indicates whether the frequency band energy of the reference image is less than a preset value; frequency band energy less than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise. The calculation result of the first enhanced signal and the masking function is used to determine a second enhanced signal of the speech to be enhanced.
  • The reference image is an image that corresponds to the sound source of the speech to be enhanced and may include lip features.
  • In the second aspect, determining the second enhanced signal of the speech to be enhanced according to the calculation result of the first enhanced signal and the masking function may include: using the first enhanced signal and the masking function as input data of a third neural network, and determining the second enhanced signal according to a weight value output by the third neural network. The weight value indicates the output ratios, in the second enhanced signal, of the first enhanced signal and of a correction signal, where the correction signal is the calculation result of the masking function and the first enhanced signal. The third neural network is a neural network obtained by training on the output data of the first neural network and the output data of the second neural network with the first mask as the training target; a sketch of such a training step is given below.
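  • A minimal sketch of one such training step, assuming the already trained first and second networks supply the per-frame inputs and the IRM supplies the target; the fusion architecture, shapes, loss, and optimizer are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_bands = 64   # assumed number of frequency bands per acoustic frame

# Placeholder third (fusion) neural network: takes the first enhanced signal and the
# masking function for one frame and outputs a per-band fusion weight in [0, 1].
fusion_net = nn.Sequential(
    nn.Linear(2 * num_bands, 128),
    nn.ReLU(),
    nn.Linear(128, num_bands),
    nn.Sigmoid(),
)
optimizer = torch.optim.Adam(fusion_net.parameters(), lr=1e-3)

def training_step(first_enhanced, mask_fn, irm_target):
    """All arguments are tensors of shape (batch, num_bands): the outputs of the
    frozen first and second networks, and the IRM computed from the training mix."""
    weight = fusion_net(torch.cat([first_enhanced, mask_fn], dim=-1))
    correction = first_enhanced * mask_fn                        # correction signal
    second_enhanced = (1 - weight) * first_enhanced + weight * correction
    loss = nn.functional.mse_loss(second_enhanced, irm_target)   # first mask (IRM) as target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```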
  • The method may further include: determining whether the image includes face information or lip information. When the image does not include face information or lip information, the weight value indicates that the output ratio of the correction signal in the second enhanced signal is 0 and the output ratio of the first enhanced signal is 100%.
  • The correction signal is the product of the first enhanced signal and the masking function. Specifically, the correction signal may be determined from the products of M signal-to-noise ratios and the masking function at a first moment, where M is a positive integer: the first enhanced signal output by the first neural network at the first moment may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
  • The speech to be enhanced may include a first acoustic feature frame, and the moment corresponding to the first acoustic feature frame is indicated by a first time index. The image may include a first image frame, and the first image frame is input data of the second neural network. Outputting the masking function of the image by the second neural network may include: the second neural network outputs, at the first moment, the masking function corresponding to the first image frame, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
  • The method may further include: performing feature transformation on the speech to be enhanced to obtain the frequency-domain features of the speech to be enhanced.
  • the method may further include: performing feature inverse transformation on the second enhanced signal to obtain enhanced speech.
  • performing feature transformation on the voice to be enhanced may include: performing short-time Fourier transform STFT on the voice to be enhanced.
  • Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform ISTFT on the second enhanced signal.
  • The method may further include: sampling the image so that the frame rate of the image frames included in the image is a preset frame rate.
  • The lip features are obtained by feature extraction on a face image, and the face image is obtained by face detection on the image.
  • The frequency band energy of the image is represented by an activation function, so that the value of the activation function approximates the IBM, thereby obtaining the second neural network.
  • The speech to be enhanced is acquired through a single audio channel.
  • The first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM).
  • A third aspect of the present application provides a speech enhancement device, which includes: an acquisition module configured to acquire speech to be enhanced and a reference image, where the speech to be enhanced and the reference image are data acquired at the same moment.
  • An audio processing module is configured to output the first enhanced signal of the speech to be enhanced according to a first neural network; the first neural network is a neural network obtained by training on mixed data of speech and noise with a first mask as the training target.
  • An image processing module is configured to output a masking function of the reference image according to a second neural network; the masking function indicates whether the frequency band energy corresponding to the reference image is less than a preset value, and frequency band energy less than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise. The second neural network is a neural network obtained by training, with a second mask as the training target, on images that include lip features and correspond to the sound source of the speech used by the first neural network.
  • An integrated processing module is configured to determine the second enhanced signal of the speech to be enhanced according to the calculation result of the first enhanced signal and the masking function.
  • The reference image is an image that includes lip features and corresponds to the sound source of the speech to be enhanced.
  • The integrated processing module is specifically configured to: use the first enhanced signal and the masking function as the input data of a third neural network, and determine the second enhanced signal according to a weight value output by the third neural network. The weight value indicates the output ratios, in the second enhanced signal, of the first enhanced signal and of a correction signal, where the correction signal is the calculation result of the masking function and the first enhanced signal. The third neural network is a neural network obtained by training on the output data of the first neural network and the output data of the second neural network with the first mask as the training target.
  • The device further includes a feature extraction module configured to determine whether the reference image includes face information or lip information. When the reference image does not include face information or lip information, the weight value indicates that the output ratio of the correction signal in the second enhanced signal is 0 and the output ratio of the first enhanced signal is 100%.
  • The correction signal is the product of the first enhanced signal and the masking function. Specifically, the correction signal is determined from the products of M signal-to-noise ratios and the masking function at a first moment, where M is a positive integer: the first enhanced signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
  • The speech to be enhanced includes a first acoustic feature frame, and the moment corresponding to the first acoustic feature frame is indicated by a first time index. The reference image includes a first image frame, and the first image frame is input data of the second neural network. The image processing module is specifically configured to output, according to the second neural network, the masking function corresponding to the first image frame at a first moment, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
  • performing feature transformation on the voice to be enhanced may include: performing short-time Fourier transform STFT on the voice to be enhanced.
  • Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform ISTFT on the second enhanced signal.
  • The feature extraction module is also configured to sample the reference image so that the frame rate of the image frames included in the reference image is a preset frame rate.
  • The lip features are obtained by feature extraction on a face image, and the face image is obtained by face detection on the reference image.
  • The frequency band energy of the reference image is represented by an activation function, so that the value of the activation function approximates the IBM, thereby obtaining the second neural network.
  • The speech to be enhanced is acquired through a single audio channel.
  • The first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM).
  • A fourth aspect of the present application provides a device for training a neural network, where the neural network is used for speech enhancement. The device includes: an acquisition module configured to acquire training data, where the training data includes mixed data of speech and noise and images that include lip features and correspond to the sound source of the speech.
  • An audio processing module is configured to train on the mixed data, with an ideal ratio mask (IRM) as the training target, to obtain a first neural network; the trained first neural network is used to output a first enhanced signal of the speech to be enhanced.
  • An image processing module is configured to train on the images, with an ideal binary mask (IBM) as the training target, to obtain a second neural network. The trained second neural network is used to output a masking function of a reference image; the masking function indicates whether the frequency band energy of the reference image is less than a preset value, and frequency band energy less than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise. The calculation result of the first enhanced signal and the masking function is used to determine a second enhanced signal of the speech to be enhanced.
  • The reference image is an image that includes lip features and corresponds to the sound source of the speech to be enhanced.
  • In a second possible implementation manner, the device further includes an integrated processing module. The integrated processing module is configured to use the first enhanced signal and the masking function as the input data of a third neural network, and determine the second enhanced signal according to a weight value output by the third neural network. The weight value indicates the output ratios, in the second enhanced signal, of the first enhanced signal and of a correction signal, where the correction signal is the calculation result of the masking function and the first enhanced signal. The third neural network is a neural network obtained by training on the output data of the first neural network and the output data of the second neural network with the first mask as the training target.
  • The device further includes a feature extraction module. The feature extraction module is configured to determine whether the image includes face information or lip information. When the image does not include face information or lip information, the weight value indicates that the output ratio of the correction signal in the second enhanced signal is 0 and the output ratio of the first enhanced signal is 100%.
  • The correction signal is the product of the first enhanced signal and the masking function. Specifically, the correction signal is determined from the products of M signal-to-noise ratios and the masking function at a first moment, where M is a positive integer: the first enhanced signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
  • The speech to be enhanced includes a first acoustic feature frame, and the moment corresponding to the first acoustic feature frame is indicated by a first time index. The image includes a first image frame, and the first image frame is input data of the second neural network. The image processing module is specifically configured to output, according to the second neural network, the masking function corresponding to the first image frame at a first moment, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
  • performing feature transformation on the voice to be enhanced may include: performing short-time Fourier transform STFT on the voice to be enhanced.
  • Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform ISTFT on the second enhanced signal.
  • The feature extraction module is also configured to sample the reference image so that the frame rate of the image frames included in the reference image is a preset frame rate.
  • The lip features are obtained by feature extraction on a face image, and the face image is obtained by face detection on the reference image.
  • The frequency band energy of the reference image is represented by an activation function, so that the value of the activation function approximates the IBM, thereby obtaining the second neural network.
  • The speech to be enhanced is acquired through a single audio channel.
  • The first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM).
  • A fifth aspect of the present application provides a speech enhancement device, including: a memory for storing a program; and a processor configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to execute the method described in the first aspect or any one of the possible implementation manners of the first aspect.
  • A sixth aspect of the present application provides a device for training a neural network, including: a memory for storing a program; and a processor configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to execute the method described in the second aspect or any one of the possible implementation manners of the second aspect.
  • A seventh aspect of the present application provides a computer storage medium, where the computer storage medium stores program code, and the program code includes instructions for executing the method described in the first aspect or any one of the possible implementation manners of the first aspect.
  • An eighth aspect of the present application provides a computer storage medium, where the computer storage medium stores program code, and the program code includes instructions for executing the method described in the second aspect or any one of the possible implementation manners of the second aspect.
  • The first neural network is used to output the first enhanced signal of the speech to be enhanced, and the second neural network is used to model the association between image information and speech information, so that the masking function of the reference image output by the second neural network can indicate whether the speech to be enhanced corresponding to the reference image is noise or speech.
  • FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided by an embodiment of this application.
  • FIG. 2 is a system architecture provided by this application.
  • FIG. 3 is a schematic structural diagram of a convolutional neural network provided by an embodiment of this application.
  • FIG. 4 is a schematic structural diagram of a convolutional neural network provided by an embodiment of this application.
  • FIG. 5 is a hardware structure of a chip provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of a system architecture provided by an embodiment of the application.
  • FIG. 7 is a schematic flowchart of a voice enhancement method provided by an embodiment of this application.
  • FIG. 8 is a schematic diagram of an application scenario of a solution provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of an application scenario of a solution provided by an embodiment of the application.
  • FIG. 10 is a schematic diagram of an application scenario of a solution provided by an embodiment of this application.
  • FIG. 11 is a schematic diagram of an application scenario of a solution provided by an embodiment of this application.
  • FIG. 12 is a schematic diagram of time sequence alignment provided by an embodiment of this application.
  • FIG. 13 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application.
  • FIG. 14 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application.
  • FIG. 16 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application.
  • FIG. 17 is a schematic structural diagram of a speech enhancement device provided by an embodiment of this application.
  • FIG. 18 is a schematic structural diagram of a device for training a neural network provided by an embodiment of the application.
  • FIG. 19 is a schematic structural diagram of another speech enhancement device provided by an embodiment of this application.
  • FIG. 20 is a schematic structural diagram of another apparatus for training a neural network provided by an embodiment of the application.
  • The naming or numbering of steps in this application does not mean that the steps in the method flow must be executed in the time or logical order indicated by that naming or numbering. The named or numbered process steps can change their execution order according to the technical purpose to be achieved, as long as the same or similar technical effects can be achieved.
  • The division of modules presented in this application is a logical division. In actual applications, there may be other divisions; for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be implemented through some ports, and the indirect coupling or communication connection between modules may be in electrical or other similar forms; none of this is limited in this application.
  • Modules or sub-modules described as separate components may or may not be physically separated, and may or may not be physical modules, or may be distributed across multiple circuit modules; some or all of them may be selected according to actual needs to achieve the purpose of the solutions of this application.
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of the artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • Intelligent Information Chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensing process of "data-information-knowledge-wisdom".
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • the infrastructure can communicate with the outside through sensors, and the computing power of the infrastructure can be provided by smart chips.
  • The smart chip here can be a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), or a hardware acceleration chip such as an application-specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
  • the basic platform of infrastructure can include distributed computing framework and network related platform guarantee and support, and can include cloud storage and computing, interconnection network, etc.
  • data can be obtained through sensors and external communication, and then these data can be provided to the smart chip in the distributed computing system provided by the basic platform for calculation.
  • the data in the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • the above-mentioned data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other processing methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • some general capabilities can be formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image Recognition and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. It is an encapsulation of the overall solution of artificial intelligence, productizing intelligent information decision-making and realizing landing applications. Its application fields mainly include: intelligent manufacturing, intelligent transportation, Smart home, smart medical, smart security, autonomous driving, safe city, smart terminal, etc.
  • the embodiments of this application can be applied in many fields of artificial intelligence, for example, smart manufacturing, smart transportation, smart home, smart medical, smart security, automatic driving, safe cities, and other fields.
  • the embodiments of the present application can be specifically applied in the fields of speech enhancement and speech recognition that require the use of (deep) neural networks.
  • a neural network can be composed of neural units.
  • A neural unit can refer to an operation unit that takes xs and an intercept of 1 as inputs; the output of the operation unit can be h = f(∑s Ws·xs + b), where s = 1, 2, ..., n and n is a natural number greater than 1.
  • Ws is the weight of xs
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
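  • Written out in conventional notation, with the sigmoid shown explicitly as one common choice of f (a standard formulation, not reproduced verbatim from the patent):

```latex
h_{W,b}(x) \;=\; f\!\left(W^{\mathsf T}x\right) \;=\; f\!\left(\sum_{s=1}^{n} W_s x_s + b\right),
\qquad
f(z) \;=\; \sigma(z) \;=\; \frac{1}{1+e^{-z}}
```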
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • Taking the loss function as an example: the higher the output value (loss) of the loss function, the greater the difference between the network's prediction and the desired target value, so training the deep neural network becomes a process of reducing this loss as much as possible.
  • The neural network can use an error back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward-propagating the input signal to the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • The back-propagation algorithm is a back-propagation process dominated by the error loss, and aims to obtain optimal parameters of the neural network model, such as the weight matrices.
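  • A toy illustration of this training loop for a single linear neuron, using a squared-error loss and plain gradient descent, with autograd standing in for the back-propagation step (nothing here is specific to the patent):

```python
import torch

# One linear neuron y = w.x + b trained so that the error loss keeps shrinking.
w = torch.randn(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
x = torch.randn(100, 3)
target = x @ torch.tensor([0.5, -1.0, 2.0]) + 0.3    # values we ideally want to predict

for step in range(200):
    pred = x @ w + b                       # forward pass: propagate the input to the output
    loss = ((pred - target) ** 2).mean()   # error loss between prediction and target
    loss.backward()                        # back-propagate the error loss information
    with torch.no_grad():                  # update the parameters so the error loss converges
        w -= 0.05 * w.grad
        b -= 0.05 * b.grad
        w.grad.zero_()
        b.grad.zero_()
```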
  • an embodiment of the present application provides a system architecture 100.
  • a data collection device 160 is used to collect training data.
  • the data collection device 160 stores the training data in the database 130, and the training device 120 trains to obtain the target model/rule 101 based on the training data maintained in the database 130.
  • The training device 120 processes the input raw data and compares the output data with the original data, until the difference between the data output by the training device 120 and the original data is less than a certain threshold, thereby completing the training of the target model/rule 101.
  • the above-mentioned target model/rule 101 can be used to implement the speech enhancement method in the embodiment of the present application, and the above-mentioned training device can be used to implement the method for training a neural network provided in the embodiment of the present application.
  • the target model/rule 101 in the embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 may not all come from the collection of the data collection device 160, and may also be received from other devices.
  • The training device 120 does not necessarily perform the training of the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • The target model/rule 101 obtained by training with the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 2, which can be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and can also be a server or a cloud.
  • the execution device 110 is configured with an input/output (input/output, I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data in this embodiment of the present application may include: a to-be-processed image input by the client device.
  • the preprocessing module 113 and the preprocessing module 114 are used to perform preprocessing according to the input data (such as the image to be processed) received by the I/O interface 112.
  • The preprocessing module 113 and the preprocessing module 114 may not be provided, or there may be only one preprocessing module, and the calculation module 111 is used directly to process the input data.
  • the execution device 110 may call data, codes, etc. in the data storage system 150 for corresponding processing .
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result to the client device 140 to provide it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete The above tasks provide users with the desired results.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 112.
  • the client device 140 can automatically send input data to the I/O interface 112. If the client device 140 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 140.
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data and store it in the database 130 as shown in the figure.
  • Alternatively, the I/O interface 112 may directly store, as new sample data, the input data input to the I/O interface 112 and the output result output by the I/O interface 112 into the database 130, as shown in the figure.
  • FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 2, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the target model/rule 101 is obtained by training according to the training device 120.
  • The target model/rule 101 may be the neural network in the embodiments of this application.
  • The neural network provided in the embodiments of the present application can be a CNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), and so on.
  • CNN is a very common neural network
  • the structure of CNN will be introduced in detail below in conjunction with Figure 3.
  • A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms.
  • As a deep learning architecture, CNN is a feed-forward artificial neural network, in which each neuron can respond to the image input into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (the pooling layer is optional), and a neural network layer 230.
  • the input layer 210 can obtain the image to be processed, and pass the obtained image to be processed to the convolutional layer/pooling layer 220 and the subsequent neural network layer 230 for processing, and the processing result of the image can be obtained.
  • the convolutional layer/pooling layer 220 may include layers 221-226, for example: in an implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, and layer 223 is a convolutional layer. Layers, 224 is the pooling layer, 225 is the convolutional layer, and 226 is the pooling layer; in another implementation, 221 and 222 are the convolutional layers, 223 is the pooling layer, and 224 and 225 are the convolutional layers. Layer, 226 is the pooling layer. That is, the output of the convolutional layer can be used as the input of the subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator is essentially a weight matrix, and this weight matrix is usually pre-defined. In the process of performing convolution on an image, the weight matrix usually slides along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple homogeneous matrices, are applied.
  • The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to eliminate unwanted noise in the image.
  • the multiple weight matrices have the same size (row ⁇ column), the size of the convolution feature maps extracted by the multiple weight matrices of the same size are also the same, and then the multiple extracted convolution feature maps of the same size are merged to form The output of the convolution operation.
  • In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
  • The initial convolutional layers (such as 221) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by the later convolutional layers (for example, 226) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
  • The layers 221-226 illustrated by 220 in FIG. 3 can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • The sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of the average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
  • the convolutional neural network 200 After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not enough to output the required output information. Because as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one or a group of required classes of output. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 3) and an output layer 240. The parameters contained in the multiple hidden layers can be based on specific task types. The relevant training data of the, for example, the task type can include image recognition, image classification, image super-resolution reconstruction and so on.
  • After the multiple hidden layers in the neural network layer 230, that is, as the final layer of the entire convolutional neural network 200, comes the output layer 240.
  • the output layer 240 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
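  • A compact PyTorch sketch of the layer arrangement described above (input layer 210, alternating convolutional and pooling layers 221-226, hidden layers, output layer 240 with a cross-entropy-style loss); the channel counts, input size, and class count are illustrative placeholders:

```python
import torch
import torch.nn as nn

cnn_200 = nn.Sequential(
    # 220: convolutional layer / pooling layer block (221-226)
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # 221 convolutional layer
    nn.MaxPool2d(2),                                          # 222 pooling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # 223 convolutional layer
    nn.MaxPool2d(2),                                          # 224 pooling layer
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # 225 convolutional layer
    nn.MaxPool2d(2),                                          # 226 pooling layer
    # 230: neural network layer with hidden layers 231..23n and output layer 240
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),                    # hidden layer
    nn.Linear(128, 10),                                       # output layer (10 classes assumed)
)

image = torch.randn(1, 3, 64, 64)   # 210: image passed in through the input layer
logits = cnn_200(image)
loss = nn.functional.cross_entropy(logits, torch.tensor([3]))  # loss similar to categorical cross-entropy
```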
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.
  • CNN convolutional neural network
  • In FIG. 4, multiple convolutional layers/pooling layers within the convolutional layer/pooling layer 220 are parallel, and the separately extracted features are all input to the neural network layer 230 for processing.
  • The convolutional neural networks shown in FIG. 3 and FIG. 4 are only two examples of possible convolutional neural networks used in the speech enhancement method and the model training method of the embodiments of this application. In specific applications, the convolutional neural network used in the speech enhancement method and the model training method may also exist in the form of other network models.
  • FIG. 5 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor.
  • the chip may be set in the execution device 110 as shown in FIG. 2 to complete the calculation work of the calculation module 111.
  • the chip can also be set in the training device 120 as shown in FIG. 2 to complete the training work of the training device 120 and output the target model/rule 101.
  • the algorithms of each layer in the convolutional neural network as shown in FIG. 3 or FIG. 4 can be implemented in the chip as shown in FIG. 5.
  • the neural network processor NPU is mounted on a main central processing unit (central processing unit, CPU, host CPU) as a coprocessor, and the main CPU distributes tasks.
  • the core part of the NPU is the arithmetic circuit 303.
  • the controller 304 controls the arithmetic circuit 303 to extract data from the memory (weight memory or input memory) and perform calculations.
  • the arithmetic circuit 303 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
  • The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 302 and caches it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 301, performs matrix operations with matrix B, and stores the partial or final result of the obtained matrix in the accumulator 308.
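  • A small NumPy stand-in for the accumulation pattern described here (the per-slice outer-product accumulation loosely mimics how partial results build up in the accumulator; the tiling of a real systolic array is not modeled):

```python
import numpy as np

A = np.random.randn(8, 16)    # input data (matrix A) fetched from the input memory
B = np.random.randn(16, 4)    # weights (matrix B) cached from the weight memory
C = np.zeros((8, 4))          # accumulator for partial / final results

# Accumulate partial products over the inner dimension, one slice at a time.
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)  # the accumulated partial results equal the full matrix product
```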
  • the vector calculation unit 307 can perform further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 307 can be used for network calculations in the non-convolutional/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 307 can store the processed output vector in the unified buffer 306.
  • the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 307 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 303, for example for use in a subsequent layer in a neural network.
  • the unified memory 306 is used to store input data and output data.
  • A direct memory access controller (DMAC) 305 is used to transfer input data in the external memory to the input memory 301 and/or the unified memory 306, to store weight data from the external memory into the weight memory 302, and to store data from the unified memory 306 into the external memory.
  • the bus interface unit (BIU) 310 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 309 through the bus.
  • An instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304;
  • the controller 304 is used to call the instructions cached in the instruction fetch memory 309 to control the working process of the computing accelerator.
  • The unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are all on-chip memories. The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • The operations of each layer in the convolutional neural network shown in FIG. 3 or FIG. 4 can be executed by the arithmetic circuit 303 or the vector calculation unit 307.
  • an embodiment of the present application provides a system architecture.
  • the system architecture includes a local device 401, a local device 402, an execution device 210 and a data storage system 150, where the local device 401 and the local device 402 are connected to the execution device 210 through a communication network.
  • the execution device 210 may be implemented by one or more servers.
  • the execution device 210 can be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 210 may be arranged on one physical site or distributed on multiple physical sites.
  • the execution device 210 may use the data in the data storage system 150 or call the program code in the data storage system 150 to implement the speech enhancement method or the neural network training method of the embodiment of the present application.
  • a target neural network can be built, and the target neural network can be used for speech enhancement or speech recognition processing and so on.
  • the user can operate respective user devices (for example, the local device 401 and the local device 402) to interact with the execution device 210.
  • Each local device can represent any computing device, such as personal computers, computer workstations, smart phones, tablets, smart cameras, smart cars or other types of cellular phones, media consumption devices, wearable devices, set-top boxes, game consoles, etc.
  • the local device of each user can interact with the execution device 210 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
• the local device 401 and the local device 402 obtain the relevant parameters of the target neural network from the execution device 210, deploy the target neural network on the local device 401 and the local device 402, and use the target neural network for voice enhancement or speech recognition, and so on.
  • the target neural network can be directly deployed on the execution device 210.
• the execution device 210 obtains the data to be processed from the local device 401 and the local device 402, and performs speech enhancement or other types of voice processing according to the target neural network.
  • the above-mentioned execution device 210 may also be referred to as a cloud device. At this time, the execution device 210 is generally deployed in the cloud.
  • the execution device 110 in FIG. 2 introduced above can execute the voice enhancement method of the embodiment of this application, and the training device 120 in FIG. 4 introduced above can execute the steps of the method for training a neural network in the embodiment of this application.
  • the CNN model shown in FIG. 5 and FIG. 6 and the chip shown in FIG. 5 can also be used to execute each step of the speech enhancement method and the method of training the model in the embodiments of the present application.
  • the speech enhancement method and the method of training a model of the embodiment of the present application will be described in detail below in conjunction with the accompanying drawings.
• FIG. 7 is a schematic flowchart of a voice enhancement method provided by an embodiment of this application.
  • a voice enhancement method provided by an embodiment of the present application may include the following steps:
  • the voice to be enhanced can be acquired through a multi-channel microphone array, or the voice to be enhanced can be acquired through a single audio channel (hereinafter referred to as mono).
• In mono speech enhancement, only time domain and frequency domain information are used, while microphone array speech enhancement uses not only time domain and frequency domain information but also spatial domain information. Since time domain and frequency domain information play the leading role in separating audio sources, while spatial information only plays an auxiliary role, the to-be-enhanced speech of the solution provided in this application can be obtained through a single audio channel (mono).
  • Mono voice enhancement has relatively low hardware cost requirements, can form a universal solution, and is widely used in various products.
• However, a complex environment limits the effect of a monophonic acoustic probability model, so the task of monophonic speech enhancement is more difficult.
  • the solution provided by this application can provide visual information for the acoustic model to enhance the effect of the speech noise reduction model.
• With the popularization of 5th generation mobile networks (5G), video calls and cameras are used more and more widely in smart homes, so the image-assisted monophonic speech enhancement method provided in this application will be widely used in the near future.
• the reference image involved in the technical solution provided in this application can be obtained by a device that can record images or videos, such as a camera or a video camera.
  • an example of obtaining the voice to be enhanced and the reference image will be described with reference to several typical scenarios that this application may be applicable to. It should be noted that the several typical scenarios introduced below are only examples of possible applicable scenarios of the solution provided in this application, and do not represent all scenarios to which the solution provided in this application can be applied.
  • Scene 1 Video and voice call
• FIG. 8 is a schematic diagram of an applicable scenario of a solution provided by an embodiment of this application.
  • device A and device B are establishing a video and voice call.
  • the device A and the device B can be a mobile phone, a tablet, a notebook computer or a smart wearable device.
  • the sound acquired by device A is the voice to be enhanced.
• the voice to be enhanced may include the voice of the user of device A and the noise of the surrounding environment.
  • the image obtained by device A is a reference image.
  • the reference image at this time may be an image of the area where the camera lens of device A is aimed.
• For example, the user of device A points the camera at his face (it should be noted that, when this application does not emphasize the difference between the camera lens and the camera, they express the same meaning and both represent a device that records images or videos); the reference image at this time is the face of the user of device A. Alternatively, if the user of device A did not point the camera at himself during the video and voice call but at the surrounding environment, the reference image at this time is the surrounding environment.
• FIG. 8 is a schematic diagram of an applicable scenario of another solution provided by this application. Taking device A as an example, suppose that device A adopts the solution provided in this application; in the process of establishing a video and voice call with device B, text prompts can be displayed in the window of the video dialogue.
• For example, the text "Aim the camera at the face, the voice effect will be better", "Please aim the camera at the face", or "Voice enhancement in progress, please aim the camera at your face", and so on, is displayed in the video window.
• If device A detects that the user has pointed the camera at the face, it will not display the prompt.
• Otherwise, a text prompt will be displayed in the video window, for example, "Aim the camera at the human face, the voice effect will be better", or "Please aim the camera at the human face", and so on.
• FIG. 9 is a schematic diagram of another applicable scenario provided by an embodiment of this application.
• coordinating the work of multiple parties through meetings is a relatively important means.
  • the recording of the content of each speaker during the meeting and the collation of the meeting minutes have become basic requirements.
  • recording equipment such as a voice recorder will record the whole process first, and then manually organize the recording content to form the meeting record after the meeting.
  • these methods are inefficient due to the need for manual intervention.
• Applying speech recognition technology to the conference system brings convenience to the arrangement of conference records.
• For example, the speech content of the participants is recorded through the recording equipment, and speech recognition software recognizes the speech content of the participants, which can further form the meeting record, greatly improving the efficiency of arranging meeting minutes.
  • the solution provided in this application can be applied to the scene of recording a meeting to further improve the effect of speech recognition.
• Suppose A is speaking in the meeting, and an image can be obtained synchronously while A is speaking.
  • the content of A’s speech is the voice to be enhanced.
  • the voice to be enhanced may include A’s pure voice and other noises generated in the meeting.
  • the image taken simultaneously is the reference image.
• Exemplarily, it is the face image of A.
• In some actual situations, the photographer may not shoot A's face during the whole process of A speaking; in that case, other non-face images obtained while A is speaking can also be regarded as reference images in this solution.
• the speech content of at least one of the three persons A, B, and C can be selected to be enhanced.
• If the content of A’s speech is selected to be enhanced, the face image of A can be simultaneously captured during the process of A’s speech.
• At this time, the content of A’s speech is the voice to be enhanced, and the voice to be enhanced may include A’s pure voice and other noises generated in the conference (for example, the other noises can be B’s speech content or C’s speech content); the face image of A taken simultaneously at this time is the reference image.
• If you choose to enhance the content of B’s speech, you can take B’s face image synchronously while B is speaking.
• B’s speech content is the voice to be enhanced, and the voice to be enhanced may include B’s pure voice and other noises generated in the conference (for example, the other noises can be A’s speech content or C’s speech content); the face image of B taken simultaneously at this time is a reference image.
• Similarly, the content of C’s speech is the voice to be enhanced, and the voice to be enhanced can include C’s pure voice and other noises generated in the conference (for example, the other noises may be A’s speech content or B’s speech content); the face image of C taken simultaneously at this time is a reference image.
• If you choose to enhance the speech content of A and B, the speech content of A and B is the speech to be enhanced.
  • the speech to be enhanced may include the pure speech of A, the pure speech of B, and other noises generated in the conference (for example, the other noises may be the content of speech of C).
  • the facial images of A and B taken simultaneously are reference images.
• If you choose to enhance the speech content of B and C, the speech content of B and C is the speech to be enhanced, and the speech to be enhanced may include B's pure voice, C's pure voice, and other noises generated in the conference (for example, other noises may be A's speech content).
  • the facial images of B and C taken simultaneously are reference images.
• If you choose to enhance the speech content of A and C, you can simultaneously take the face images of A and C during the speech of A and C.
• the speech content of A and C is the speech to be enhanced, and the speech to be enhanced may include the pure speech of A, the pure speech of C, and other noises generated in the meeting (for example, other noises may be B's speech content); the face images of A and C taken simultaneously at this time are reference images.
• If you choose to enhance the speech content of A, B, and C, the speech to be enhanced can include the pure speech of A, the pure speech of B, the pure speech of C, and other noises generated in the meeting (such as the sounds of participants other than A, B, and C, or other environmental noise); the face images of A, B, and C taken simultaneously at this time are reference images.
  • the wearable device referred to in this scenario refers to a portable device that can be worn directly on the body or integrated into the user's clothes or accessories.
  • wearable devices can be smart watches, smart bracelets, smart glasses, and so on.
  • Input methods and semantic understanding based on voice recognition are widely used in wearable devices.
• However, touch is still the main way of communication between people and these devices, because the screens of these devices are generally small, and the communication between people and these devices is mainly based on simple and direct tasks.
  • Voice will inevitably become the next-generation information portal for these devices, which can also liberate people's fingers and make the communication between people and these devices more convenient and natural.
  • these devices are usually used by users in a more complex acoustic environment. There are various sudden noise interferences around.
  • the communication between people and mobile phones and wearable devices usually occurs on the street or in the shopping mall.
  • the complex noise environment usually reduces the recognition rate of speech significantly.
  • the decline in recognition rate means that these devices cannot accurately understand the user's instructions, which will greatly reduce the user's experience.
  • the solution provided in this application can also be applied to a voice interaction scenario with a wearable device.
• When the wearable device acquires the user's voice instructions, it can simultaneously acquire the user's face image.
• With the solution provided in this application, the user's voice instructions can be voice-enhanced, so that the wearable device can better recognize the user's instructions and respond to them.
  • the user's voice command can be regarded as the voice to be enhanced, and the synchronously acquired face image can be regarded as the reference image.
• In the solution provided in this application, visual information, such as the reference image, is introduced in the process of speech enhancement, so that even in an environment with very noisy background noise, very good speech enhancement and speech recognition effects can be achieved.
• Smart home (home automation) uses the residence as a platform and, by means of integrated wiring technology, network communication technology, security technology, automatic control technology, and audio and video technology, integrates facilities related to home life to build an efficient management system for residential facilities and family schedule affairs, improving home safety, convenience, comfort, and artistry, and realizing an environmentally friendly and energy-saving living environment.
  • smart homes can include smart lighting systems, smart curtains, smart TVs, smart air conditioners, and so on.
• The case where the user issues a voice control instruction to the smart home may specifically include the user directly issuing the voice control instruction to the smart home, or the user issuing the voice control instruction to the smart home through other devices, for example, remotely sending voice control commands to the smart home through a mobile phone.
  • the image of the preset area can be obtained through the smart home or other devices.
  • the mobile phone can obtain the image captured at this time.
  • the voice control command issued by the user is the voice to be enhanced, and the image captured simultaneously is the reference image.
• For example, a voice prompt can be issued to prompt the user to point the camera at the face, such as "Voice enhancement is in progress, please aim the camera at the face", and so on.
  • the first neural network is a neural network obtained by training on mixed data of speech and noise with an ideal ratio mask (IRM) as the training target.
  • Time-frequency masking is a common goal of speech separation.
  • Common time-frequency masking includes ideal binary masking and ideal floating value masking. They can significantly improve the intelligibility and perceptual quality of separated speech.
• time-domain waveforms of speech can be synthesized through inverse transform technology. Exemplarily, a definition of ideal floating value masking in the Fourier transform domain is given below:
• IRM(t,f) = Ps(t,f) / (Ps(t,f) + Pn(t,f))
• where Ys(t,f) is the short-time Fourier transform coefficient of the pure speech in the mixed data, Yn(t,f) is the short-time Fourier transform coefficient of the noise in the mixed data, Ps(t,f) is the energy density corresponding to Ys(t,f), and Pn(t,f) is the energy density corresponding to Yn(t,f).
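• Exemplarily, a minimal sketch of computing the ideal floating value mask from the clean-speech and noise STFT coefficients may look as follows (the function and variable names are illustrative assumptions only):

```python
import numpy as np

def ideal_ratio_mask(Ys, Yn):
    """Compute the IRM from clean-speech and noise STFT coefficients.

    Ys, Yn: complex arrays of shape (T, D) holding the short-time Fourier
    transform coefficients of the pure speech and of the noise.
    Returns a real-valued mask in [0, 1] with the same shape.
    """
    Ps = np.abs(Ys) ** 2            # energy density corresponding to Ys(t, f)
    Pn = np.abs(Yn) ** 2            # energy density corresponding to Yn(t, f)
    return Ps / (Ps + Pn + 1e-12)   # small constant avoids division by zero
```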
• The definition of the ideal floating value mask in the Fourier transform domain is given above. It should be noted that, after learning the solution provided by this application, those skilled in the art can easily think of other speech separation targets that can also be used as the training target of the first neural network, for example, short-time Fourier transform masking or implicit time-frequency masking. In other words, any training target used in the prior art by a neural network that separates mixed data of speech and noise, such that the signal-to-noise ratio of the output signal of the neural network can be obtained at each time, can be adopted in the solutions provided in this application.
• the aforementioned voice may refer to a pure voice or a clean voice, that is, a voice that is not polluted by any noise.
  • the mixed data of speech and noise refers to noisy speech, that is, speech obtained by adding a preset distribution of noise to the clean speech.
  • the clean speech and the noisy speech are used as the speech to be trained.
• multiple noise-added speeches corresponding to the clean speech can be obtained by adding various noises of different distributions to the clean speech. For example: adding noise of the first distribution to clean speech 1 to get noisy speech 1, adding noise of the second distribution to clean speech 1 to get noisy speech 2, adding noise of the third distribution to clean speech 1 to get noisy speech 3, and so on.
  • multiple data pairs of clean speech and noisy speech can be obtained, for example: ⁇ clean speech 1, noisy speech 1 ⁇ , ⁇ clean speech 1, noisy speech 2 ⁇ , ⁇ clean speech 1, plus noisy voice 3 ⁇ and so on.
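• As an illustrative sketch of how such {clean speech, noisy speech} training pairs could be constructed at a chosen signal-to-noise ratio (the mixing function below is an assumption, not mandated by this application):

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix a noise recording into clean speech at a target SNR in dB."""
    noise = np.resize(noise, clean.shape)            # repeat/truncate to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean_1 = np.random.randn(16000)                     # stand-in for clean speech 1
noises = [np.random.randn(16000) for _ in range(3)]  # three differently distributed noises
# data pairs {clean speech 1, noisy speech 1}, {clean speech 1, noisy speech 2}, ...
pairs = [(clean_1, add_noise(clean_1, n, snr_db=5)) for n in noises]
```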
• the final trained neural network model is equivalent to the first neural network in the embodiment of this application.
  • the speech to be enhanced is converted into a two-dimensional time-frequency signal, which is input to the first neural network to obtain the first enhanced signal of the speech to be enhanced.
  • the short-time-fourier-transform (STFT) method can be used to perform time-frequency conversion on the voice signal to be enhanced to obtain the two-dimensional time-frequency signal of the voice to be enhanced.
• In the time domain, the speech to be enhanced can be expressed as y(t) = x(t) + n(t), where y(t) represents the time domain signal of the speech to be enhanced at time t, x(t) represents the time domain signal of the clean speech at time t, and n(t) represents the time domain signal of the noise at time t.
• Correspondingly, the STFT transformation of the speech to be enhanced can be expressed as Y(t,d) = X(t,d) + N(t,d), where Y(t,d) represents the frequency domain signal of the speech to be enhanced in the t-th acoustic feature frame and the d-th frequency band, X(t,d) represents the representation of the clean speech in the frequency domain signal of the t-th acoustic feature frame and the d-th frequency band, N(t,d) represents the representation of the noise in the frequency domain signal of the t-th acoustic feature frame and the d-th frequency band, and T and D respectively represent the total number of acoustic feature frames and the total number of frequency bands in the signal to be enhanced.
  • the method of performing feature transformation on the speech signal is not limited to the STFT method, and other methods, such as Gabor transformation and Wigner-Ville distribution, can also be used in some other implementation manners.
• Any manner in the prior art of performing feature transformation on the voice signal to obtain the two-dimensional time-frequency signal of the voice signal may be adopted in the embodiments of the present application.
  • the frequency domain features after feature transformation can also be normalized.
• For example, the frequency domain feature can be normalized by subtracting the mean and dividing by the standard deviation to obtain the normalized frequency domain feature, as illustrated in the sketch below.
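• Exemplarily, the feature transformation and normalization described above could be sketched as follows, assuming a 16 kHz sampling rate with 25 ms windows and a 10 ms hop (these concrete parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                              # assumed sampling rate
y = np.random.randn(fs)                 # stand-in for the speech to be enhanced y(t)

# two-dimensional time-frequency signal Y(t, d): 25 ms windows, 10 ms hop
f, t, Y = stft(y, fs=fs, nperseg=400, noverlap=240)

# normalized log-magnitude features used as the input of the first neural network
feat = np.log(np.abs(Y) + 1e-8)
feat = (feat - feat.mean()) / (feat.std() + 1e-8)
```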
• the normalized frequency domain feature can be used as the input of the first neural network to obtain the first enhanced signal. Taking a long short-term memory network (LSTM) as an example, this can be expressed by the following formula:
• LSTM(g(a_j)) ≈ Ps(a_clean, j) / (Ps(a_clean, j) + Ps(a_noise, j))
• the right side of the above equation is the training target IRM, which has been introduced above: Ps(a_clean, j) represents the energy spectrum (also called energy density) of the clean signal at time j, and Ps(a_noise, j) represents the energy spectrum of the noise signal at time j.
• the left side of the above equation represents the approximation of the training target through the neural network: a_j represents the input of the neural network, which can be a frequency domain feature, and g() represents a functional relationship, here the normalization of the input of the neural network by subtracting the mean, dividing by the standard deviation, and then taking the logarithm.
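• A minimal sketch of such a time-series model trained to approximate the IRM could look as follows (the framework, layer sizes, and the two-layer LSTM are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MaskLSTM(nn.Module):
    """Time-series model mapping normalized features g(a_j) to a per-band mask."""
    def __init__(self, n_bands, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_bands, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_bands)

    def forward(self, feats):                  # feats: (batch, frames, n_bands)
        h, _ = self.lstm(feats)
        return torch.sigmoid(self.proj(h))     # mask values in [0, 1]

model = MaskLSTM(n_bands=201)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

feats = torch.randn(8, 100, 201)               # normalized noisy-speech features
irm = torch.rand(8, 100, 201)                  # IRM training target in [0, 1]
optimizer.zero_grad()
loss = loss_fn(model(feats), irm)              # approximate the IRM
loss.backward()
optimizer.step()
```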
  • the first neural network of the present application can be any kind of time series model, that is, it can provide corresponding output at each time step to ensure the real-time nature of the model.
• After the first neural network is trained, its weights can be frozen, that is, the weight parameters of the first neural network are kept unchanged, so that training the second neural network or other neural networks will not affect the performance of the first neural network. This ensures that, when there is no visual model or the reference image does not include face information or lip information, the model can still produce an enhanced signal according to the output of the first neural network, which guarantees the robustness of the model.
  • the masking function indicates whether the frequency band energy of the reference image is less than a preset value.
  • the frequency band energy is less than the preset value indicating that the speech to be enhanced corresponding to the reference image is noise, and the frequency band energy is not less than the preset value indicating that the speech to be enhanced corresponding to the reference image is clean speech.
• the second neural network is a neural network obtained, with the ideal binary mask (IBM) as the training target, by training on images including lip features corresponding to the sound source of the voice used by the first neural network.
• This weak reference method converts the original fine distribution into a rough distribution through binarization, so as to facilitate fitting by the image model, and this rough distribution characterizes whether the mouth shape corresponds to the pronunciation of a certain set of frequency bands.
• This application establishes, through the second neural network, the mapping relationship between the frequency band energy of the image and the frequency band energy of the voice; specifically, a relationship is established between the energy of each frequency band of the image frame at each time and the energy of each frequency band of the acoustic feature frame at each time.
  • the training objectives of the second neural network and the data used in the training are described below.
  • the training target IBM of the second neural network is a symbolic function, and its definition is explained below by the following expression.
• the dist function is the energy distribution function of the clean signal over the frequency bands at each time, as explained below.
  • j refers to the time j, or the time when the duration of the j-th frame ends.
  • Each frame may include multiple frequency bands, such as k frequency bands, where k refers to the kth frequency band of the pure speech at time j, and k is a positive integer.
  • the number of frequency bands included in each time can be preset, for example, one time can be set to include 4 frequency bands, or one time can include 5 frequency bands, which is not limited in the embodiment of the present application.
• Ps(a_j^k) refers to the energy spectrum of the k-th frequency band of the clean signal at time j. Therefore, dist(a_j) characterizes the distribution of audio energy in the k frequency bands corresponding to time j.
• the threshold is a preset threshold; in a specific implementation, the threshold can generally be 10^-5. If dist(a_j) − threshold ≥ 0, that is, dist(a_j) is not less than the threshold, then dist(a_j) is considered to be voice-dominated (or it cannot be determined whether it is voice-dominated or noise-dominated), and the corresponding function value is set to 1. If dist(a_j) − threshold < 0, that is, dist(a_j) is less than the threshold, then dist(a_j) is considered to be noise-dominated, and its corresponding function value is set to 0.
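• Exemplarily, the binarization described above could be sketched as follows; note that normalizing each frame's band energies by the frame energy is an assumption about the exact form of dist, which is described above only qualitatively:

```python
import numpy as np

def ideal_binary_mask(Ps_clean, threshold=1e-5):
    """Binary training target derived from the clean-speech energy spectrum.

    Ps_clean: array of shape (T, K), the energy spectrum Ps(a_j^k) of the clean
    signal, with K frequency bands per time step.
    """
    # dist(a_j): share of the frame energy falling into each frequency band (assumed form)
    dist = Ps_clean / (Ps_clean.sum(axis=1, keepdims=True) + 1e-12)
    # 1 where the band is voice-dominated (dist - threshold >= 0), otherwise 0
    return (dist - threshold >= 0).astype(np.float32)
```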
  • the training data of the second neural network is an image including lip features corresponding to the sound source of the voice used by the first neural network.
• For example, 500 sentences from sources such as mainstream newspapers and magazines can be selected, covering as many utterances as possible, and then 100 different people are selected to read them aloud as the clean speech signal (that is, the clean speech to which the simulated noise is added).
• the training data of the second neural network may include face images of the 100 different people, mouth images of the 100 different people, or images of the 100 different people that include the face, such as images of the upper body.
• the training data of the second neural network does not only include images with lip features corresponding to the sound source of the voice used by the first neural network; it may also include some image data that does not contain lip features or does not include face images.
  • v stands for training data.
  • the training data has been introduced above, and will not be repeated here.
• sigmoid is defined as sigmoid(x) = 1/(1 + e^(-x)). Sigmoid is an activation function, through which the energy of each frequency band of the image at each moment is expressed, and the value of the sigmoid output is approximated to the value of dist(a_j) − threshold through the neural network, such as the LSTM used in the above formula.
  • f() represents the feature extraction function. It should be noted that the sigmoid here is only for illustrative purposes, and other activation functions may also be adopted in the embodiment of the present application to approximate the training target.
• the image frames processed by the second neural network may be aligned with the acoustic feature frames of the first neural network in time sequence. Through the alignment of the time series, it can be ensured that, in the subsequent process, the data output by the second neural network at a given moment corresponds to the data output by the first neural network at the same moment. For example, suppose there is a video that includes 1 image frame and 4 acoustic feature frames. The multiple relationship between the number of image frames and acoustic feature frames can be determined by re-sampling the video according to preset frame rates; for example, the image data included in the video is resampled according to an image frame rate of 40 frames/s, and the audio data included in the video is resampled according to an acoustic feature frame rate of 10 frames/s.
  • the 1-frame image frame and the 4-frame acoustic feature frame are aligned in time.
  • the duration of the image frame of 1 frame is aligned with the duration of the acoustic feature frame of 4 frames.
  • the first neural network processes the 4 frames of acoustic feature frames
  • the second neural network processes the image frames of 1 frame
• the image frames processed by the second neural network are aligned with the acoustic feature frames of the first neural network in time series.
• the purpose is to ensure that, during processing by the first neural network and the second neural network and after the processing is completed, the 4 acoustic feature frames and the 1 image frame are still aligned in time.
• 4 image frames corresponding to the 4 acoustic feature frames can be obtained, and the masking function corresponding to each of the 4 image frames is output.
• the speech to be enhanced includes a first acoustic feature frame, the moment corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, and the first image frame is the input data of the second neural network. Outputting the masking function of the image according to the second neural network includes: outputting, according to the second neural network, the masking function corresponding to the first image frame at the first moment, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame, so that the first moment is the moment corresponding to the first acoustic feature frame.
  • m represents a multiple, which is determined according to the ratio of the frame rate of the first acoustic characteristic frame to the frame rate of the first image frame.
  • the frame rate of the first acoustic feature frame is 10 frames/s
  • the frame rate of the first image frame is 40 frames/s
  • the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame is 1/4 (10/40)
  • m takes 4 in the above formula.
  • the frame rate of the first acoustic feature frame is 25 frames/s
  • the frame rate of the first image frame is 50 frames/s
• the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame is 1/2 (25/50), and then m takes 2 in the above formula.
  • m is taken as 4 in the following, and further description is made in conjunction with FIG. 12.
  • FIG. 12 shows a schematic diagram of time sequence alignment provided by an embodiment of this application. As shown in FIG. 12, the white squares in the figure represent the input image frames of the second neural network. As shown in FIG. 12, 4 input image frames are shown.
• the duration of 1 input image frame is the same as the duration of 4 acoustic feature frames; that is, when m is 4, after the time series alignment processing of the second neural network, each input image frame corresponds to 4 processed image frames, and the duration of each of the 4 processed image frames is the same as the duration of an acoustic feature frame.
• the black boxes represent the image frames after the time alignment processing of the second neural network; the second neural network will output the masking function of the aligned image frames. As shown in FIG. 12, there are a total of 16 time-aligned image frames, and the masking functions corresponding to the 16 time-aligned image frames will be output.
• each of the 16 image frames is aligned in time with an acoustic feature frame: the 1 image frame represented by a white box is aligned in time with 4 acoustic feature frames, while 1 image frame represented by a black box is aligned in time with 1 acoustic feature frame.
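• Exemplarily, this time alignment can be realized by repeating each image frame m times so that every acoustic feature frame has a corresponding image frame, for example:

```python
import numpy as np

m = 4  # multiple derived from the frame-rate ratio in the example above

image_feats = np.random.randn(4, 128)       # 4 input image frames (white boxes)
acoustic_feats = np.random.randn(16, 201)   # 16 acoustic feature frames

# repeat every image frame m times so that each acoustic feature frame has a
# time-aligned image frame (black boxes in FIG. 12)
aligned_image_feats = np.repeat(image_feats, m, axis=0)
assert aligned_image_feats.shape[0] == acoustic_feats.shape[0]
```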
  • the reference image is input to the second neural network during speech enhancement to obtain the masking function of the reference image.
  • some preprocessing can be performed on the reference image, and the preprocessed reference image can be input to the second neural network.
  • the reference image can also be sampled to a specified image frame rate.
• Face feature extraction can be performed on the reference image to obtain a face image, and the face feature extraction can be performed by a face feature extraction algorithm.
  • Facial feature extraction algorithms include recognition algorithms based on facial feature points, recognition algorithms based on the entire face image, and recognition algorithms based on templates. For example, it may be face detection based on a face feature point detection algorithm. Facial feature extraction can also be performed through neural networks.
  • Face feature extraction can be performed through a convolutional neural network model, such as face detection based on a multi-task convolutional neural network.
  • the face image extracted by the face feature can be used as the input of the second neural network.
  • the second neural network can also perform further processing on the face image, for example, it can extract the image frames corresponding to the movement features of the human mouth, and perform time sequence alignment processing on the image frames corresponding to the movement features of the mouth.
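• A rough sketch of the face-detection preprocessing could look as follows; it uses an OpenCV Haar cascade and a crude lower-face crop as stand-ins for the face feature point or multi-task convolutional neural network detectors mentioned above (an assumption, not the specific algorithm of this application):

```python
import cv2
import numpy as np

# Haar-cascade face detector shipped with OpenCV (used here instead of the
# face feature point or multi-task CNN detectors mentioned in the text)
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for one reference image frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if len(faces) > 0:
    x, y, w, h = faces[0]
    face = frame[y:y + h, x:x + w]                # face image for the second neural network
    mouth_region = face[h // 2:, :]               # crude crop of the lower face as the lip region
else:
    mouth_region = None                           # no face found: fall back to the audio-only path
```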
  • the first enhanced signal may be output through the first neural network
• the masking function of the reference image may be output through the second neural network. Since the second neural network establishes the mapping relationship between the frequency band energy of the image and the frequency band energy of the speech, the masking function can indicate whether the frequency band energy of the reference image is less than the preset value: a frequency band energy less than the preset value indicates that the speech to be enhanced corresponding to the reference image is noise, and a frequency band energy not less than the preset value indicates that the speech to be enhanced corresponding to the reference image is clean speech.
• the second enhanced signal of the speech to be enhanced, determined from the calculation result of the first enhanced signal and the masking function, is better than the first enhanced signal; that is, compared with a solution that performs speech enhancement only through a single neural network, the voice enhancement effect is better. For example, suppose that for the first frequency band included in the audio to be enhanced at a certain moment, the first neural network outputs the signal-to-noise ratio of the first frequency band as A, and assume A represents that the first neural network determines the first frequency band to be voice-dominated; the second neural network outputs the frequency band energy of the first frequency band as B, where B is less than the preset value, that is, B represents that the second neural network determines the first frequency band to be noise-dominated. Mathematical operations can then be performed on A and B, for example one or several of summation, multiplication, or squaring, to obtain the operation result of A and B.
• the result of the operation can determine the proportion of A and B in the output second enhanced signal.
• the principle of the operation between the first enhanced signal and the masking function is that the actual meaning of the masking function is to measure whether a certain frequency band has enough energy. If, for a certain frequency band such as the first frequency band, the output value of the second neural network (video side) is small while the output value of the first neural network is large, it means that the shape of the person's mouth does not make a corresponding sound, so the audio and the video are inconsistent for that frequency band. Similarly, if, for a certain frequency band such as the first frequency band, the output value of the second neural network (video side) is large while the output value of the first neural network is small, the audio and the video are also inconsistent.
  • the above inconsistent part will be scaled to a smaller value, while the consistent part will remain unchanged, and a new output second enhanced signal after fusion will be obtained.
  • the energy of the frequency band with inconsistent pronunciation or audio and video will be compressed to a smaller value.
  • the first neural network is used to output the first enhanced signal of the speech to be enhanced
• the second neural network is used to model the association relationship between image information and voice information, so that the masking function of the reference image output by the second neural network can indicate whether the speech to be enhanced corresponding to the reference image is noise or speech.
  • the embodiment corresponding to FIG. 7 above introduced that the second enhanced signal of the speech to be enhanced can be determined according to the calculation result of the first enhanced signal and the masking function.
  • a preferred solution is given below.
  • the second enhanced signal of the speech to be enhanced is determined through the third neural network. Specifically, the second enhanced signal is determined according to the weight output by the third neural network. The weight value indicates the output ratio of the first enhanced signal and the modified signal in the second enhanced signal, and the modified signal is the calculation result of the masking function and the first enhanced signal.
  • the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network with IRM as the training target.
• FIG. 13 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application.
  • another voice enhancement method provided by an embodiment of the present application may include the following steps:
  • Step 1301 can be understood with reference to step 701 in the embodiment corresponding to FIG. 7, and details are not repeated here.
  • Step 1302 can be understood with reference to step 702 in the embodiment corresponding to FIG. 7, and details are not repeated here.
  • Step 1303 can be understood with reference to step 703 in the embodiment corresponding to FIG. 7, and details are not repeated here.
  • it may further include: determining whether the reference image includes face information. If it is determined that the reference image includes face information, the masking function of the reference image is output according to the second neural network.
  • the first enhanced signal and the masking function are used as the input data of the third neural network, and the second enhanced signal is determined according to the weight value output by the third neural network.
  • the weight value indicates the output ratio of the first enhanced signal and the modified signal in the second enhanced signal, and the modified signal is the calculation result of the masking function and the first enhanced signal.
  • the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network with IRM as the training target.
• Specifically, the third neural network is trained on the output data of the first neural network and the output data of the second neural network, that is, on the multiple sets of first enhanced signals output by the first neural network during the training process and the multiple sets of masking functions output by the second neural network during the training process.
• Since the second neural network aligns the image frames with the acoustic feature frames of the first neural network in time series, the outputs of the first neural network and the second neural network received by the third neural network at the same moment are time-aligned data.
  • the third neural network can train the operation results of the first enhanced signal and the masking function.
  • the mathematical operation between the first enhanced signal and the masking function has been introduced above, and the details will not be repeated here. This application does not limit the type of the third neural network.
• Exemplarily, the third neural network is an LSTM, the mathematical operation between the first enhanced signal and the masking function is a multiplication operation, and the third neural network is trained on the output data of the first neural network and the output data of the second neural network to output the weight (gate).
• the reference image may include face information, specifically, an image including face information at the sound source of the voice to be enhanced. In some scenes, the reference image may also be irrelevant to face information, for example, the reference image may not be the corresponding image at the sound source.
  • the training data of the second neural network of the present application includes not only the corresponding image including lip features at the sound source of the voice used by the first neural network, but also some image data that does not include lip features or does not include human faces. Image data.
  • the second enhanced signal can be expressed by the following formula, where IRM' represents the second enhanced signal:
• IRM′ = gate × (IBM × IRM) + (1 − gate) × IRM
• Because the output of the second neural network is not completely accurate, it may cause a part of the first enhanced signal to be scaled incorrectly; therefore, a third neural network is added which, through the weight, retains the confident part, while the unsure part is filled by the first enhanced signal.
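• Exemplarily, the fusion formula IRM′ = gate × (IBM × IRM) + (1 − gate) × IRM can be sketched as follows (the tensor shapes and example values are illustrative only):

```python
import torch

def fuse(irm, ibm, gate):
    """Second enhanced signal: IRM' = gate * (IBM * IRM) + (1 - gate) * IRM."""
    corrected = ibm * irm                    # correction signal (masking function x first enhanced signal)
    return gate * corrected + (1.0 - gate) * irm

irm = torch.tensor([[0.8, 0.5, 0.1, 0.6]])   # first enhanced signal for one acoustic frame
ibm = torch.tensor([[1.0, 1.0, 0.0, 1.0]])   # masking function from the second neural network
gate = torch.full((1, 4), 0.7)               # weight output by the third neural network
irm_prime = fuse(irm, ibm, gate)             # second enhanced signal
```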
• the correction signal is determined according to the product of M signal-to-noise ratios and the masking function at the first moment, where M is a positive integer, the first enhanced signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to a signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
  • the speech to be enhanced at the first moment includes a frame of acoustic characteristics, and the frame of acoustic characteristics includes 4 frequency bands.
  • the first moment can be any moment corresponding to the voice to be enhanced.
• the first moment including 4 frequency bands is for illustrative purposes only; how many frequency bands are included at each moment can be preset, for example, a moment can be set to include 4 frequency bands or 5 frequency bands, which is not limited in the embodiment of the present application. Assume that the signal-to-noise ratios corresponding to the 4 frequency bands are 0.8, 0.5, 0.1, and 0.6, respectively.
  • the second neural network will output the masking function of the 4 frequency bands corresponding to the reference image at the first moment.
  • the second neural network aligns the image frame with the acoustic feature frame of the first neural network in time series, which will not be repeated here.
• Assume the masking function output by the second neural network for the 4 frequency bands at the first moment is 1, 1, 0, and 1, respectively. Then the modified signal includes 4 frequency bands, and the energy of each frequency band is 0.8 (1×0.8), 0.5 (1×0.5), 0 (0×0.1), and 0.6 (1×0.6), respectively.
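• The worked example above can be reproduced as follows (the masking function values 1, 1, 0, 1 are inferred from the products shown above):

```python
import numpy as np

snr = np.array([0.8, 0.5, 0.1, 0.6])    # first enhanced signal: SNRs of the 4 frequency bands
mask = np.array([1.0, 1.0, 0.0, 1.0])   # masking function output at the first moment
corrected = snr * mask                  # -> [0.8, 0.5, 0.0, 0.6]
```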
  • the solution provided by the present application can support streaming decoding, and the theoretical limit is the duration of a unit acoustic feature frame. Taking the duration of a unit acoustic feature frame of 10 ms as an example, with the solution provided in this application, the theoretical upper bound of the time delay of the second enhanced speech output is 10 ms.
• Each time the third neural network receives the first enhanced signal corresponding to one acoustic feature frame, it can process the first enhanced signal and the corresponding masking function at the same moment to output the second enhanced signal at that moment. Since the speech to be enhanced can be processed frame by frame, the second enhanced signal can be played frame by frame.
• Since the voice to be enhanced can be processed frame by frame in units of acoustic feature frames, the corresponding second neural network also outputs the masking function according to the time corresponding to the acoustic feature frame, so the third neural network can output the second enhanced signal in units of acoustic feature frames; therefore, in the solution provided in this application, the upper bound of the theoretical delay is the duration of a unit acoustic feature frame.
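• A streaming sketch of this frame-by-frame processing may look as follows, where net1 and net3 stand for the first and third neural networks (placeholder callables, not concrete implementations):

```python
def stream_enhance(acoustic_frames, image_masks, net1, net3):
    """Process one acoustic feature frame (e.g. 10 ms) at a time, so the output
    delay is bounded by the duration of a unit acoustic feature frame."""
    for frame, mask in zip(acoustic_frames, image_masks):
        first_enhanced = net1(frame)          # first neural network, one frame
        yield net3(first_enhanced, mask)      # second enhanced signal for the same frame
```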
  • FIG. 15 is a schematic flowchart of another voice enhancement method provided by an embodiment of this application.
• Suppose there is a video that includes the voice to be enhanced and the reference image.
  • the frequency domain feature is input to the first neural network.
  • the segment of speech to be enhanced is sampled as 3 segments of audio.
  • each segment of audio includes 4 acoustic feature frames, that is, the input of the first neural network in FIG. 15.
• After the second neural network performs time alignment processing on the 1 image frame, it can output 4 image frames corresponding to the 4 acoustic feature frames, that is, the output of the second neural network in FIG. 15.
• the first enhanced signal corresponding to the 4 acoustic feature frames output by the first neural network and the masking functions corresponding to the 4 image frames output by the second neural network can be input to the third neural network in turn, and the third neural network will output the second enhanced signal corresponding to the 4 acoustic feature frames, which is the output of the third neural network in FIG. 15. Inverse feature transformation is performed on the second enhanced signal to obtain the time-domain enhanced signal of the speech to be enhanced.
  • the first enhanced signal and the masking function can be used as the input data of the third neural network, and the second enhanced signal can be determined according to the weight output by the third neural network.
• After the third neural network is trained, during speech enhancement, the method may further include performing feature inverse transformation on the result output by the third neural network to obtain a time domain signal.
• For example, the frequency domain features obtained after the short-time Fourier transform of the speech to be enhanced are the input of the first neural network, and the second enhanced signal output by the third neural network can then be subjected to an inverse short-time Fourier transform (inverse short-time-fourier-transform, ISTFT) to obtain the time domain signal.
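• Exemplarily, applying the second enhanced signal to the noisy time-frequency representation and inverting it back to the time domain could be sketched as follows (the window parameters are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
y = np.random.randn(fs)                               # stand-in for the noisy speech
f, t, Y = stft(y, fs=fs, nperseg=400, noverlap=240)   # forward feature transform

mask = np.random.rand(*Y.shape)                       # stand-in for the second enhanced signal
_, y_enhanced = istft(mask * Y, fs=fs, nperseg=400, noverlap=240)  # time-domain enhanced speech
```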
• As described above, the training data of the second neural network may also include some image data that does not include lip features or data that does not include face images. It should be noted that, in some specific implementations, the training data of the second neural network may also include only image data with lip features or data with face images. In some specific implementations, it can first be determined whether the reference image includes face information or lip information: if the reference image does not include face information or lip information, the enhanced signal of the speech to be enhanced is output only by the first neural network; when the reference image includes face information or lip information, the enhanced signal of the speech to be enhanced is output according to the first neural network, the second neural network, and the third neural network. The following describes with reference to FIG.
• the system first determines whether the reference image includes face information or lip information. If it does not include face information or lip information, the system determines the enhanced signal of the speech to be enhanced according to the first enhanced signal output by the first neural network, that is, the second enhanced signal is the first enhanced signal. If the system determines that the reference image includes face information or lip information, it determines the second enhanced signal through the third neural network according to the masking function output by the second neural network and the first enhanced signal output by the first neural network. How to determine the second enhanced signal according to the third neural network has been described in detail above and will not be repeated here.
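• The decision branch described above can be sketched as follows (the function and argument names are illustrative):

```python
def enhance(first_enhanced, reference_has_face, masking_fn=None, net3=None):
    """If the reference image has no face/lip information, fall back to the
    audio-only first enhanced signal; otherwise fuse through the third network."""
    if not reference_has_face:
        return first_enhanced                 # second enhanced signal = first enhanced signal
    return net3(first_enhanced, masking_fn)   # visually assisted fusion
```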
  • the process of the voice enhancement method provided by the embodiment of the present application includes two parts: an "application” process and a “training” process.
  • the application process provided by this application is introduced above, and a speech enhancement method is specifically introduced.
  • the training process provided by this application is introduced below, and a method for training a neural network is specifically introduced.
  • This application provides a method for training a neural network, which is used for speech enhancement.
  • the method may include: obtaining training data.
• the training data may include mixed data of voice and noise and an image, including lip features, corresponding to the sound source of the voice. Taking the ideal floating value masking IRM as the training target, the first neural network is obtained by training on the mixed data, and the trained first neural network is used to output the first enhanced signal of the speech to be enhanced.
• With the ideal binary masking IBM as the training target, the image is trained to obtain the second neural network.
  • the trained second neural network is used to output the masking function of the reference image.
  • the masking function indicates whether the frequency band energy of the reference image is less than the preset value. If the energy of the frequency band is less than the preset value, it indicates that the speech frequency band to be enhanced corresponding to the reference image is noise, and the calculation result of the first enhanced signal and the masking function is used to determine the second enhanced signal of the speech to be enhanced.
  • the reference image is an image corresponding to the sound source of the speech to be enhanced that may include lip features.
• the operation result of the first enhanced signal and the masking function being used to determine the second enhanced signal of the speech to be enhanced may include: using the first enhanced signal and the masking function as the input data of the third neural network, and determining the second enhanced signal according to the weight value output by the third neural network.
  • the weight value indicates the output ratio of the first enhancement signal and the correction signal in the second enhancement signal.
  • the correction signal is the calculation result of the masking function and the first enhancement signal.
• the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network with the first mask as the training target.
  • the method may further include: determining whether the image may include face information or lip information. When the image does not include face information or lip information, the weight value indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is 100%.
  • the modified signal is the product of the first enhanced signal and the masking function.
• the correction signal is determined according to the product operation result of M signal-to-noise ratios and the masking function at the first moment, where M is a positive integer, the first enhanced signal output by the first neural network at the first moment may include M frequency bands, each of the M frequency bands corresponds to a signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
• the speech to be enhanced may include a first acoustic feature frame, the moment corresponding to the first acoustic feature frame is indicated by a first time index, the image may include a first image frame, and the first image frame is the input data of the second neural network. Outputting the masking function of the image according to the second neural network may include: outputting, according to the second neural network, the masking function corresponding to the first image frame at the first moment, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
  • the method may further include: performing feature transformation on the speech to be enhanced to obtain the frequency domain characteristics of the speech to be enhanced.
  • the method may further include: performing feature inverse transformation on the second enhanced signal to obtain enhanced speech.
  • performing feature transformation on the voice to be enhanced may include: performing short-time Fourier transform STFT on the voice to be enhanced.
  • Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform ISTFT on the second enhanced signal.
  • the method may further include: sampling the image so that the frame rate of the image frame included in the image is a preset frame rate.
  • the lip features are obtained by feature extraction on a face image
  • the face image is obtained by face detection on an image.
  • the frequency band energy of the image is represented by the activation function, and the value of the activation function is approximated to IBM to obtain the second neural network.
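• Exemplarily, approximating the activation-function output of the image branch to the IBM labels could be sketched as follows (the framework, feature dimension, and the binary cross-entropy loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

class VideoMaskNet(nn.Module):
    """Maps lip-region image features to per-band sigmoid outputs trained
    to approximate the IBM labels."""
    def __init__(self, feat_dim, n_bands, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_bands)

    def forward(self, v):                     # v: (batch, frames, feat_dim)
        h, _ = self.lstm(v)
        return torch.sigmoid(self.proj(h))    # frequency band energy of the image

net2 = VideoMaskNet(feat_dim=128, n_bands=4)
optimizer = torch.optim.Adam(net2.parameters(), lr=1e-3)
bce = nn.BCELoss()

v = torch.randn(8, 16, 128)                    # time-aligned image-frame features
ibm = torch.randint(0, 2, (8, 16, 4)).float()  # IBM training target
optimizer.zero_grad()
loss = bce(net2(v), ibm)                       # approximate the sigmoid output to IBM
loss.backward()
optimizer.step()
```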
  • the voice to be enhanced is obtained through a single audio channel.
  • the first mask is an ideal floating value masking IRM
  • the second mask is an ideal binary masking IBM.
• the experimental data set uses the Grid data set as the pure speech corpus: there are 32 groups of speakers with 1,000 utterances each, and the total of 32,000 utterances is divided into a training set of 27,000 (30 groups of speakers, 900 per group), a Seen test set of 3,000 (30 groups of speakers, 100 per group), and an Unseen test set of 2,000 (2 groups of speakers, 1,000 per group).
  • the CHiME background data set is divided into a training noise set and a normal environment test noise set according to 8:2, and Audioset Human noise is used as a human sound environment test set.
  • the main baselines for comparison are the acoustic model (AO), the Visual Speech Enhancement (VSE) model, and the Looking to Listen (L2L) model.
  • the experiment is mainly evaluated by PESQ score.
• Experimental data confirms that the solution provided by this application can use visual information to improve the speech enhancement task under signal-to-noise ratios from -5 dB to 20 dB.
  • FIG. 17 is a schematic structural diagram of a speech enhancement device provided by an embodiment of this application.
  • the device for voice enhancement includes: an acquisition module 1701, configured to acquire a voice to be enhanced and a reference image, where the voice to be enhanced and the reference image are data acquired at the same time.
  • the audio processing module 1702 is configured to output the first enhanced signal of the speech to be enhanced according to the first neural network.
  • the first neural network is a neural network obtained by training the mixed data of speech and noise with the first mask as the training target .
  • the image processing module 1703 is configured to output the masking function of the reference image according to the second neural network. The masking function indicates whether the frequency band energy corresponding to the reference image is less than a preset value.
• the frequency band energy less than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise.
  • the second neural network uses the second mask as the training target, and is a neural network obtained by training the image including lip features corresponding to the sound source of the voice used by the first neural network.
  • the integrated processing module 1704 is configured to determine the second enhanced signal of the speech to be enhanced according to the calculation result of the first enhanced signal and the masking function.
  • the reference image is an image including lip features corresponding to the sound source of the speech to be enhanced.
  • the integrated processing module 1704 is specifically configured to: use the first enhanced signal and the masking function as the input data of the third neural network, and determine the second enhanced signal according to the weight output by the third neural network.
• the weight value indicates the output ratio of the first enhanced signal and the modified signal in the second enhanced signal.
  • the modified signal is the calculation result of the masking function and the first enhanced signal.
  • the third neural network is based on the first mask as the training target. A neural network obtained by training the output data of the second neural network and the output data of the second neural network.
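To make the role of the weight concrete, here is a minimal numerical sketch of the fusion described above: the corrected signal is the product of the masking function and the first enhanced signal, and the weight output by the third neural network blends it with the first enhanced signal. The gate value and the per-band numbers are illustrative only; a gate of 0 reproduces the audio-only output.

```python
import numpy as np

def fuse(first_enhanced, masking_function, gate):
    """Second enhanced signal = gate * (mask * first) + (1 - gate) * first.
    When gate == 0 the output falls back to the audio-only estimate."""
    corrected = masking_function * first_enhanced      # corrected signal
    return gate * corrected + (1.0 - gate) * first_enhanced

# one acoustic frame with M = 4 frequency bands
first_enhanced   = np.array([0.8, 0.5, 0.1, 0.6])      # per-band SNR estimates
masking_function = np.array([1.0, 1.0, 0.0, 1.0])      # from the image branch
print(fuse(first_enhanced, masking_function, gate=0.9))
```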
  • the device further includes a feature extraction module, which is used to determine whether the reference image includes face information or lip information. When the reference image does not include face information or lip information, the weight indicates that the output proportion of the corrected signal in the second enhanced signal is 0 and the output proportion of the first enhanced signal is 100%.
  • the corrected signal is the product of the first enhanced signal and the masking function.
  • the corrected signal is determined according to the product of M signal-to-noise ratios and the masking function at a first moment, where M is a positive integer, the first enhanced signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
  • the speech to be enhanced includes a first acoustic feature frame, the time corresponding to the first acoustic feature frame is indicated by a first time index, the reference image includes a first image frame, and the first image frame is the input data of the second neural network.
  • the image processing module 1703 is specifically configured to output, according to the second neural network, the masking function corresponding to the first image frame at a first time.
  • the first time is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame, as illustrated in the sketch below.
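The frame-rate alignment can be sketched as follows: each image-frame masking function is repeated by the ratio of the acoustic feature frame rate to the image frame rate, so that every acoustic feature frame has a masking function at its own time index. The concrete frame rates (100 acoustic frames per second, 25 image frames per second) are assumed example values.

```python
import numpy as np

def align_masks(image_masks, acoustic_fps=100, image_fps=25):
    """Repeat each image-frame masking function so that one row exists per
    acoustic feature frame; the multiple is acoustic_fps / image_fps (= 4 here)."""
    multiple = acoustic_fps // image_fps
    return np.repeat(image_masks, multiple, axis=0)

# 4 image frames, each carrying one masking value per frequency band
image_masks = np.array([[1, 0, 1, 1],
                        [0, 0, 1, 1],
                        [1, 1, 0, 0],
                        [1, 1, 1, 0]])
aligned = align_masks(image_masks)   # shape (16, 4): one row per acoustic frame
```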
  • performing feature transformation on the voice to be enhanced may include: performing short-time Fourier transform STFT on the voice to be enhanced.
  • Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform ISTFT on the second enhanced signal.
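A brief sketch of the feature transform and its inverse, here using SciPy's STFT/ISTFT; the window length, hop, and the all-ones mask are placeholder assumptions, not parameters stated in the application.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
speech = np.random.randn(fs)                 # 1 s of placeholder audio

# feature transform: complex spectrogram of the speech to be enhanced
f, t, spec = stft(speech, fs=fs, nperseg=512, noverlap=384)

mask = np.ones_like(spec, dtype=float)       # stand-in for the enhancement mask
enhanced_spec = spec * mask                  # frequency-domain enhanced signal

# inverse feature transform back to a time-domain waveform
_, enhanced = istft(enhanced_spec, fs=fs, nperseg=512, noverlap=384)
```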
  • the feature extraction module is further configured to sample the reference image so that the frame rate of the image frames included in the reference image is a preset frame rate.
  • the lip feature is obtained by feature extraction on a face image
  • the face image is obtained by face detection on a reference image
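As one possible illustration of this step, the sketch below uses OpenCV's bundled Haar cascade to detect a face and then crops the lower half of the face region as a rough lip patch; the detector, the crop heuristic, and the file name are assumptions rather than the specific face-detection method used in the embodiment.

```python
# Assumes OpenCV is installed (pip install opencv-python); "frame.jpg" is a placeholder.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

frame = cv2.imread('frame.jpg')
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if len(faces) == 0:
    print('no face: fall back to the audio-only enhanced signal')
else:
    x, y, w, h = faces[0]
    face_img = frame[y:y + h, x:x + w]       # face image obtained by face detection
    lip_patch = face_img[h // 2:, :]         # crude lip-region crop for feature extraction
```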
  • the frequency band energy of the reference image is represented by an activation function, and the second neural network is obtained by making the value of the activation function approximate the IBM.
  • the voice to be enhanced is obtained through a single audio channel.
  • the first mask is an ideal ratio mask (IRM).
  • the second mask is an ideal binary mask (IBM).
  • FIG. 18 is a schematic structural diagram of a device for training a neural network provided by an embodiment of the application.
  • the neural network is used for speech enhancement.
  • the device includes: an acquisition module 1801 for acquiring training data.
  • the training data includes mixed data of speech and noise and images, corresponding to the sound source of the speech, that include lip features.
  • the audio processing module 1802 is configured to train on the mixed data with the ideal ratio mask (IRM) as the training target to obtain a first neural network, and the trained first neural network is used to output the first enhanced signal of the speech to be enhanced.
  • the image processing module 1803 is configured to train on the images with the ideal binary mask (IBM) as the training target to obtain the second neural network.
  • the trained second neural network is used to output the masking function of the reference image; the masking function indicates whether the frequency band energy of the reference image is less than a preset value, a frequency band energy less than the preset value indicates that the corresponding frequency band of the speech to be enhanced is noise, and the result of an operation on the first enhanced signal and the masking function is used to determine the second enhanced signal of the speech to be enhanced.
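A hedged sketch of how the image branch could be trained toward the IBM target: a sigmoid activation represents the per-band energy of each image frame and is pushed toward the 0/1 IBM labels with a binary cross-entropy loss. The network shape, feature dimensions, band count, and optimizer settings are illustrative assumptions, not values from the embodiment.

```python
import torch
import torch.nn as nn

class LipToIBM(nn.Module):
    """Maps a sequence of lip-feature vectors to per-band mask values in (0, 1)."""
    def __init__(self, feat_dim=128, hidden=256, bands=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, bands)

    def forward(self, lip_feats):                  # (batch, frames, feat_dim)
        h, _ = self.lstm(lip_feats)
        return torch.sigmoid(self.head(h))         # activation represents band energy

model = LipToIBM()
criterion = nn.BCELoss()                           # pushes the sigmoid output toward the 0/1 IBM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

lip_feats = torch.randn(8, 20, 128)                # dummy batch: 8 clips, 20 frames
ibm_target = torch.randint(0, 2, (8, 20, 4)).float()

optimizer.zero_grad()
loss = criterion(model(lip_feats), ibm_target)
loss.backward()
optimizer.step()
```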
  • the reference image is an image including lip features corresponding to the sound source of the speech to be enhanced.
  • the device further includes an integrated processing module 1804, configured to use the first enhanced signal and the masking function as the input data of the third neural network and to determine the second enhanced signal according to the weight output by the third neural network.
  • the weight indicates the output proportions of the first enhanced signal and the corrected signal in the second enhanced signal.
  • the corrected signal is the result of an operation on the masking function and the first enhanced signal.
  • the third neural network is a neural network obtained by training, with the first mask as the training target, on the output data of the first neural network and the output data of the second neural network.
  • the device further includes a feature extraction module.
  • the feature extraction module is used to determine whether the image includes face information or lip information. When the image does not include face information or lip information, the weight indicates that the output proportion of the corrected signal in the second enhanced signal is 0 and the output proportion of the first enhanced signal is 100%.
  • the corrected signal is the product of the first enhanced signal and the masking function.
  • the corrected signal is determined according to the product of M signal-to-noise ratios and the masking function at a first moment, where M is a positive integer, the first enhanced signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
  • the speech to be enhanced includes a first acoustic feature frame, the time corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, and the first image frame is the input data of the second neural network.
  • the image processing module 1803 is specifically configured to output, according to the second neural network, the masking function corresponding to the first image frame at a first moment; the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
  • performing feature transformation on the voice to be enhanced may include: performing short-time Fourier transform STFT on the voice to be enhanced.
  • Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform ISTFT on the second enhanced signal.
  • the feature extraction module is further configured to sample the reference image so that the frame rate of the image frames included in the reference image is a preset frame rate.
  • the lip feature is obtained by feature extraction on a face image
  • the face image is obtained by face detection on a reference image
  • the frequency band energy of the reference image is represented by an activation function, and the second neural network is obtained by making the value of the activation function approximate the IBM.
  • the voice to be enhanced is obtained through a single audio channel.
  • the first mask is an ideal ratio mask (IRM).
  • the second mask is an ideal binary mask (IBM).
  • FIG. 19 is a schematic structural diagram of another voice enhancement device provided by an embodiment of this application.
  • FIG. 19 is a schematic block diagram of a speech enhancement device according to an embodiment of the present application.
  • the speech enhancement apparatus shown in FIG. 19 includes a memory 1901, a processor 1902, a communication interface 1903, and a bus 1904, where the memory 1901, the processor 1902, and the communication interface 1903 are communicatively connected to each other through the bus 1904.
  • the aforementioned communication interface 1903 is equivalent to the image acquisition module 901 in the speech enhancement device, and the aforementioned processor 1902 is equivalent to the feature extraction module 902 and the detection module 903 in the speech enhancement device.
  • the modules of the speech enhancement apparatus are described in detail below.
  • the memory 1901 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 1901 may store a program.
  • the processor 1902 and the communication interface 1903 are used to execute each step of the speech enhancement method in the embodiment of the present application.
  • the communication interface 1903 may obtain the image to be detected from a memory or other devices, and then the processor 1902 performs voice enhancement on the image to be detected.
  • the processor 1902 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the functions required by the modules in the speech enhancement apparatus of the embodiment of the present application (for example, the processor 1902 can implement the functions to be executed by the feature extraction module 902 and the detection module 903 in the speech enhancement apparatus), or to execute the speech enhancement method in the embodiment of the present application.
  • the processor 1902 may also be an integrated circuit chip with signal processing capabilities.
  • each step of the voice enhancement method in the embodiment of the present application can be completed by the integrated logic circuit of hardware in the processor 1902 or instructions in the form of software.
  • the above-mentioned processor 1902 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the aforementioned general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1901; the processor 1902 reads the information in the memory 1901 and, in combination with its hardware, completes the functions required by the modules included in the speech enhancement apparatus of the embodiment of the present application, or performs the speech enhancement method of the method embodiment of the present application.
  • the communication interface 1903 uses a transceiving apparatus, such as but not limited to a transceiver, to implement communication between the apparatus and other devices or a communication network.
  • the image to be processed can be acquired through the communication interface 1903.
  • the bus 1904 may include a path for transferring information between the components of the apparatus (for example, the memory 1901, the processor 1902, and the communication interface 1903).
  • FIG. 20 is a schematic structural diagram of another apparatus for training a neural network provided by an embodiment of the application.
  • FIG. 20 is a schematic diagram of the hardware structure of an apparatus for training a neural network according to an embodiment of the present application. Similar to the above apparatus, the apparatus shown in FIG. 20 includes a memory 2001, a processor 2002, a communication interface 2003, and a bus 2004, where the memory 2001, the processor 2002, and the communication interface 2003 are communicatively connected to each other through the bus 2004.
  • the memory 2001 may store a program.
  • the processor 2002 is configured to execute each step of the neural network training method of the embodiment of the present application.
  • the processor 2002 may adopt a general CPU, a microprocessor, an ASIC, a GPU or one or more integrated circuits to execute related programs to implement the neural network training method of the embodiment of the present application.
  • the processor 2002 may also be an integrated circuit chip with signal processing capability.
  • each step of the neural network training method of the embodiment of the present application can be completed by the integrated logic circuit of the hardware in the processor 2002 or the instructions in the form of software.
  • the neural network is trained by the training neural network device shown in FIG. 20, and the neural network obtained by training can be used to execute the method of the embodiment of the present application.
  • the device shown in FIG. 20 can obtain training data and the neural network to be trained from the outside through the communication interface 2003, and then the processor trains the neural network to be trained according to the training data.
  • although the above apparatuses show only a memory, a processor, and a communication interface, in a specific implementation process those skilled in the art should understand that the apparatuses may also include other components necessary for normal operation. In addition, according to specific needs, those skilled in the art should understand that the apparatuses may also include hardware components that implement other additional functions. Furthermore, those skilled in the art should understand that the apparatuses may include only the components necessary to implement the embodiments of the present application, and need not include all the components shown in FIG. 19 and FIG. 20.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the modules is only a division of logical functions, and there may be other division manners in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or modules, and may be in electrical, mechanical or other forms.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed to multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • if the function is implemented in the form of a software function module and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part that contributes to the prior art, or a part of the technical solution may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium that can store program code.

Abstract

A speech enhancement method, relating to the field of artificial intelligence, includes: acquiring a speech to be enhanced and a reference image (701), where the speech to be enhanced and the reference image are data acquired at the same time; outputting a first enhanced signal of the speech to be enhanced according to a first neural network (702); outputting a masking function of the reference image according to a second neural network (703), where the masking function indicates whether the frequency band energy corresponding to the reference image is less than a preset value, and a frequency band energy less than the preset value indicates that the corresponding frequency band of the speech to be enhanced is noise; and determining a second enhanced signal of the speech to be enhanced according to the result of an operation on the first enhanced signal and the masking function (704). With the provided technical solution, image information can be applied to the speech enhancement process, so that the speech enhancement capability and the listening experience can be well improved even in relatively noisy environments.

Description

一种语音增强方法、训练神经网络的方法以及相关设备
本申请要求于2020年4月10日提交中国专利局、申请号为202010281044.1、申请名称为“一种语音增强方法、训练神经网络的方法以及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,具体涉及一种语音增强方法、训练神经网络的方法以及相关设备。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
语音识别(automatic speech recognition,ASR)是指一种从语音波形中识别出对应的文字内容的技术,是人工智能领域的重要技术之一。在语音识别系统中,语音增强技术是非常重要的一项技术,通常也称为语音降噪技术。通过语音增强技术可以消除语音信号中的高频噪声、低频噪声、白噪声以及各种其他噪声,从而提高语音识别的效果。因此,如何提高语音增强效果,亟待解决。
发明内容
本申请实施例提供一种语音增强方法,可以将图像信息应用于语音增强的过程中,在一些相对嘈杂的环境中,也可以很好的提升语音增强的能力,提升听感。
为达到上述目的,本申请实施例提供如下技术方案:
本申请第一方面提供一种语音增强方法,可以包括:获取待增强语音和参考图像,待增强语音和参考图像为同时获取的数据。根据第一神经网络输出待增强语音的第一增强信号,第一神经网络是以第一掩码mask为训练目标,对语音和噪声的混合数据进行训练得到的神经网络。根据第二神经网络输出参考图像的掩蔽函数,掩蔽函数指示参考图像对应的频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音的频段为噪声,第二神经网络是以第二掩码mask为训练目标,对第一神经网络采用的语音的声源处对应的可以包括唇部特征的图像进行训练得到的神经网络。根据第一增强信号和掩蔽函数的运算结果确定待增强语音的第二增强信号。由第一方面可知,利用第一神经网络输出待增强语音的第一增强信号,利用第二神经网络对图像信息和语音信息的关联关系进行建模,使第二神经网络输出的参考图像的掩蔽函数可以指示该参考图像对应的待增强语音为噪声或者语音。通过本申请提供的技术方案,可以将图像信息应用于语音增强的过程中,在一些相对嘈杂的环境中,也可以很好的提升语音增强的能力,提升听感。
可选地,结合上述第一方面,在第一种可能的实现方式中,参考图像为待增强语音的声源处对应的可以包括唇部特征的图像。
可选地,结合上述第一方面或第一方面第一种可能的实现方式,在第二种可能的实现方式中,根据第一增强信号和掩蔽函数的运算结果确定待增强语音的第二增强信号,可以包括:以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号,权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果,第三神经网络是以第一mask为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
可选地,结合上述第一方面第二种可能的实现方式,在第三种可能的实现方式中,该方法还可以包括:确定参考图像是否可以包括人脸信息或者唇部信息。参考图像不包括人脸信息或者唇部信息时,权值指示第二增强信号中修正信号的输出比例为0,第一增强信号的输出比例为百分之百。
可选地,结合上述第一方面第二种或第一方面第三种可能的实现方式,在第四种可能的实现方式中,修正信号是第一增强信号和掩蔽函数的乘积运算结果。
可选地,结合上述第一方面第四种可能的实现方式,在第五种可能的实现方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号可以包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。
可选地,结合上述第一方面或第一方面第一种至第五种可能的实现方式,在第六种可能的实现方式,待增强语音可以包括第一声学特征帧,第一声学特征帧对应的时刻由第一时间索引指示,参考图像可以包括第一图像帧,第一图像帧为第二神经网络的输入数据,根据第二神经网络输出参考图像的掩蔽函数,可以包括:根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数,第一时刻由第一时间索引的倍数指示,倍数根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定。
可选地,结合上述第一方面或第一方面第一种至第六种可能的实现方式,在第七种可能的实现方式,该方法还可以包括:对待增强语音进行特征变换,以得到待增强语音的频域特征。方法还可以包括:对第二增强信号进行特征反变换,以得到增强语音。
可选地,结合上述第一方面第七种可能的实现方式,在第八种可能的实现方式,对待增强语音进行特征变换,可以包括:对待增强语音进行短时傅里叶变换STFT。对第二增强信号进行特征反变换,可以包括:对第二增强信号进行逆短时傅里叶变换ISTFT。
可选地,结合上述第一方面第一种至第八种可能的实现方式,在第九种可能的实现方式,该方法还可以包括对参考图像进行采样,使参考图像可以包括的图像帧的帧率为预设的帧率。
可选地,结合上述第一方面或第一方面第一种至第八种可能的实现方式,在第十种可能的实现方式,唇部特征通过对人脸图进行特征抽取获得,人脸图为对参考图像进行人脸检测获得。
可选地,结合上述第一方面或第一方面第一种至第十种可能的实现方式,在第十一种 可能的实现方式,参考图像的频段能量由激活函数表示,使激活函数的取值逼近IBM,以得到第二神经网络。
可选地,结合上述第一方面或第一方面第一种至第十一种可能的实现方式,在第十二种可能的实现方式,待增强语音通过单个音频通道获取。
可选地,结合上述第一方面或第一方面第一种至第十二种可能的实现方式,在第十三种可能的实现方式,第一mask是理想浮值掩蔽IRM,第二mask是理想二值掩蔽IBM。
本申请第二方面提供一种训练神经网络的方法,该神经网络用于语音增强,该方法可以包括:获取训练数据,训练数据可以包括语音和噪声的混合数据以及语音的声源处对应的可以包括唇部特征的图像。以理想浮值掩蔽IRM为训练目标,对混合数据进行训练得到第一神经网络,训练好的第一神经网络用于输出待增强语音的第一增强信号。以理想二值掩蔽IBM为训练目标,对图像进行训练得到第二神经网络,训练好的第二神经网络用于输出参考图像的掩蔽函数,掩蔽函数指示参考图像的频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音频段为噪声,第一增强信号和掩蔽函数的运算结果用于确定待增强语音的第二增强信号。
可选地,结合上述第二方面,在第一种可能的实现方式中,参考图像为待增强语音的声源处对应的可以包括唇部特征的图像。
可选地,结合上述第二方面或第二方面第一种可能的实现方式,在第二种可能的实现方式中,第一增强信号和掩蔽函数的运算结果用于确定待增强语音的第二增强信号,可以包括:以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号,权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果,第三神经网络是以第一mask为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
可选地,结合上述第二方面第二种可能的实现方式,在第三种可能的实现方式中,方法还可以包括:确定图像是否可以包括人脸信息或者唇部信息。图像不包括人脸信息或者唇部信息时,权值指示第二增强信号中修正信号的输出比例为0,第一增强信号的输出比例为百分之百。
可选地,结合上述第二方面第二种或第二方面第三种可能的实现方式,在第四种可能的实现方式中,修正信号是第一增强信号和掩蔽函数的乘积运算结果。
可选地,结合上述第二方面第四种可能的实现方式,在第五种可能的实现方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号可以包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。
可选地,结合上述第二方面或第二方面第一种至第五种可能的实现方式,在第六种可能的实现方式,待增强语音可以包括第一声学特征帧,第一声学特征帧对应的时刻由第一时间索引指示,图像可以包括第一图像帧,第一图像帧为第二神经网络的输入数据,根据第二神经网络输出图像的掩蔽函数,可以包括:根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数,第一时刻由第一时间索引的倍数指示,倍数根据第一声学特征帧的 帧率与第一图像帧的帧率的比值确定。
可选地,结合上述第二方面或第二方面第一种至第六种可能的实现方式,在第七种可能的实现方式,该方法还可以包括:对待增强语音进行特征变换,以得到待增强语音的频域特征。该方法还可以包括:对第二增强信号进行特征反变换,以得到增强语音。
可选地,结合上述第二方面第七种可能的实现方式,在第八种可能的实现方式,对待增强语音进行特征变换,可以包括:对待增强语音进行短时傅里叶变换STFT。对第二增强信号进行特征反变换,可以包括:对第二增强信号进行逆短时傅里叶变换ISTFT。
可选地,结合上述第二方面第一种至第八种可能的实现方式,在第九种可能的实现方式,该方法还可以包括:对图像进行采样,使图像可以包括的图像帧的帧率为预设的帧率。
可选地,结合上述第二方面或第二方面第一种至第八种可能的实现方式,在第十种可能的实现方式,唇部特征通过对人脸图进行特征抽取获得,人脸图为对图像进行人脸检测获得。
可选地,结合上述第二方面或第二方面第一种至第十种可能的实现方式,在第十一种可能的实现方式,图像的频段能量由激活函数表示,使激活函数的取值逼近IBM,以得到第二神经网络。
可选地,结合上述第二方面或第二方面第一种至第十一种可能的实现方式,在第十二种可能的实现方式,待增强语音通过单个音频通道获取。
可选地,结合上述第二方面或第二方面第一种至第十二种可能的实现方式,在第十三种可能的实现方式,第一mask是理想浮值掩蔽IRM,第二mask是理想二值掩蔽IBM。
本申请第三方面提供一种语音增强装置,其特征在于,包括:获取模块,用于获取待增强语音和参考图像,所述待增强语音和所述参考图像为同时获取的数据。音频处理模块,用于根据第一神经网络输出待增强语音的第一增强信号,第一神经网络是以第一掩码mask为训练目标,对语音和噪声的混合数据进行训练得到的神经网络。图像处理模块,用于根据第二神经网络输出参考图像的掩蔽函数,掩蔽函数指示参考图像对应的频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音的频段为噪声,第二神经网络是以第二掩码mask为训练目标,对第一神经网络采用的语音的声源处对应的包括唇部特征的图像进行训练得到的神经网络。综合处理模块,用于根据第一增强信号和掩蔽函数的运算结果确定待增强语音的第二增强信号。
可选地,结合上述第三方面,在第一种可能的实现方式中,参考图像为待增强语音的声源处对应的包括唇部特征的图像。
可选地,结合上述第三方面或第三方面第一种可能的实现方式,在第二种可能的实现方式中,综合处理模块,具体用于:以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号,权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果,第三神经网络是以第一mask为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
可选地,结合上述第三方面第二种可能的实现方式,在第三种可能的实现方式中,装 置还包括:特征提取模块,特征提取模块,用于确定参考图像是否包括人脸信息或者唇部信息。参考图像不包括人脸信息或者唇部信息时,权值指示第二增强信号中修正信号的输出比例为0,第一增强信号的输出比例为百分之百。
可选地,结合上述第三方面第二种或第三方面第三种可能的实现方式,在第四种可能的实现方式中,修正信号是第一增强信号和掩蔽函数的乘积运算结果。
可选地,结合上述第三方面第四种可能的实现方式,在第五种可能的实现方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。
可选地,结合上述第三方面或第三方面第一种至第五种可能的实现方式,在第六种可能的实现方式,待增强语音包括第一声学特征帧,第一声学特征帧对应的时刻由第一时间索引指示,参考图像包括第一图像帧,第一图像帧为第二神经网络的输入数据,图像处理模块,具体用于:根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数,第一时刻由第一时间索引的倍数指示,倍数根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定。
可选地,结合上述第三方面第七种可能的实现方式,在第八种可能的实现方式,对待增强语音进行特征变换,可以包括:对待增强语音进行短时傅里叶变换STFT。对第二增强信号进行特征反变换,可以包括:对第二增强信号进行逆短时傅里叶变换ISTFT。
可选地,结合上述第三方面第一种至第八种可能的实现方式,在第九种可能的实现方式,特征提取模块,还用于对参考图像进行采样,使参考图像可以包括的图像帧的帧率为预设的帧率。
可选地,结合上述第三方面或第三方面第一种至第八种可能的实现方式,在第十种可能的实现方式,唇部特征通过对人脸图进行特征抽取获得,人脸图为对参考图像进行人脸检测获得。
可选地,结合上述第三方面或第三方面第一种至第十种可能的实现方式,在第十一种可能的实现方式,参考图像的频段能量由激活函数表示,使激活函数的取值逼近IBM,以得到第二神经网络。
可选地,结合上述第三方面或第三方面第一种至第十一种可能的实现方式,在第十二种可能的实现方式,待增强语音通过单个音频通道获取。
可选地,结合上述第三方面或第三方面第一种至第十二种可能的实现方式,在第十三种可能的实现方式,第一mask是理想浮值掩蔽IRM,第二mask是理想二值掩蔽IBM。
本申请第四方面提供一种训练神经网络的装置,神经网络用于语音增强,装置包括:获取模块,用于获取训练数据,训练数据包括语音和噪声的混合数据以及语音的声源处对应的包括唇部特征的图像。音频处理模块,用于以理想浮值掩蔽IRM为训练目标,对混合数据进行训练得到第一神经网络,训练好的第一神经网络用于输出待增强语音的第一增强信号。图像处理模块,用于以理想二值掩蔽IBM为训练目标,对图像进行训练得到第二神经网络,训练好的第二神经网络用于输出参考图像的掩蔽函数,掩蔽函数指示参考图像的 频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音频段为噪声,第一增强信号和掩蔽函数的运算结果用于确定待增强语音的第二增强信号。
可选地,结合上述第四方面,在第一种可能的实现方式中,参考图像为待增强语音的声源处对应的包括唇部特征的图像。
可选地,结合上述第四方面或第四方面第一种可能的实现方式,在第二种可能的实现方式中,还包括:综合处理模块。
综合处理模块,用于以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号,权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果,第三神经网络是以第一mask为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
可选地,结合上述第四方面第二种可能的实现方式,在第三种可能的实现方式中,装置还包括:特征特征提取模块,
特征特征提取模块,用于确定图像是否包括人脸信息或者唇部信息。图像不包括人脸信息或者唇部信息时,权值指示第二增强信号中修正信号的输出比例为0,第一增强信号的输出比例为百分之百。
可选地,结合上述第四方面第二种或第四方面第三种可能的实现方式,在第四种可能的实现方式中,修正信号是第一增强信号和掩蔽函数的乘积运算结果。
可选地,结合上述第四方面第四种可能的实现方式,在第五种可能的实现方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。
可选地,结合上述第四方面或第四方面第一种至第五种可能的实现方式,在第六种可能的实现方式待增强语音包括第一声学特征帧,第一声学特征帧对应的时刻由第一时间索引指示,图像包括第一图像帧,第一图像帧为第二神经网络的输入数据,图像处理模块,具体用于:根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数,第一时刻由第一时间索引的倍数指示,倍数根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定。
可选地,结合上述第四方面第七种可能的实现方式,在第八种可能的实现方式,对待增强语音进行特征变换,可以包括:对待增强语音进行短时傅里叶变换STFT。对第二增强信号进行特征反变换,可以包括:对第二增强信号进行逆短时傅里叶变换ISTFT。
可选地,结合上述第四方面第一种至第八种可能的实现方式,在第九种可能的实现方式,特征提取模块,还用于对参考图像进行采样,使参考图像可以包括的图像帧的帧率为预设的帧率。
可选地,结合上述第四方面或第四方面第一种至第八种可能的实现方式,在第十种可能的实现方式,唇部特征通过对人脸图进行特征抽取获得,人脸图为对参考图像进行人脸检测获得。
可选地,结合上述第四方面或第四方面第一种至第十种可能的实现方式,在第十一种 可能的实现方式,参考图像的频段能量由激活函数表示,使激活函数的取值逼近IBM,以得到第二神经网络。
可选地,结合上述第四方面或第四方面第一种至第十一种可能的实现方式,在第十二种可能的实现方式,待增强语音通过单个音频通道获取。
可选地,结合上述第四方面或第四方面第一种至第十二种可能的实现方式,在第十三种可能的实现方式,第一mask是理想浮值掩蔽IRM,第二mask是理想二值掩蔽IBM。
本申请第五方面提供一种语音增强装置,其特征在于,包括:存储器,用于存储程序。处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行如第一方面或第一方面任意一种可能的实现方式所描的方法。
本申请第六方面提供一种训练神经网络的装置,其特征在于,包括:存储器,用于存储程序。处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行如第二方面或第二方面任意一种可能的实现方式所描的方法。
本申请第七方面提供一种计算机存储介质,其特征在于,所述计算机存储介质存储有程序代码,所述程序代码包括用于执行如第一方面或第一方面任意一种可能的实现方式所描的方法。
本申请第八方面提供一种计算机存储介质,其特征在于,所述计算机存储介质存储有程序代码,所述程序代码包括用于执行如第二方面或第二方面任意一种可能的实现方式所描的方法。
通过本申请实施例提供的方案,利用第一神经网络输出待增强语音的第一增强信号,利用第二神经网络对图像信息和语音信息的关联关系进行建模,使第二神经网络输出的参考图像的掩蔽函数可以指示该参考图像对应的待增强语音为噪声或者语音。通过本申请提供的技术方案,可以将图像信息应用于语音增强的过程中,在一些相对嘈杂的环境中,也可以很好的提升语音增强的能力,提升听感。
附图说明
图1为本申请实施例提供的一种人工智能主体框架示意图;
图2为本申请提供的一种系统架构;
图3为本申请实施例提供的一种卷积神经网络的结构示意图;
图4为本申请实施例提供的一种卷积神经网络的结构示意图;
图5为本申请实施例提供的一种芯片的硬件结构;
图6为本申请实施例提供的一种系统架构示意图;
图7为本申请实施例提供的一种语音增强方法的流程示意图;
图8为本申请实施例提供的一种方案的适用场景的示意图;
图9为本申请实施例提供的一种方案的适用场景的示意图;
图10为本申请实施例提供的一种方案的适用场景的示意图;
图11为本申请实施例提供的一种方案的适用场景的示意图;
图12为本申请实施例提供的一种关于时间序列对齐的示意图;
图13为本申请实施例提供的另一种语音增强方法的流程示意图;
图14为本申请实施例提供的另一种语音增强方法的流程示意图;
图15为本申请实施例提供的另一种语音增强方法的流程示意图;
图16为本申请实施例提供的另一种语音增强方法的流程示意图;
图17为本申请实施例提供的一种语音增强装置的结构示意图;
图18为本申请实施例提供的一种训练神经网络的装置的结构示意图;
图19为本申请实施例提供的另一种语音增强装置的结构示意图;
图20为本申请实施例提供的另一种训练神经网络的装置的结构示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。本申请中所出现的模块的划分,是一种逻辑上的划分,实际应用中实现时可以有另外的划分方式,例如多个模块可以结合成或集成在另一个系统中,或一些特征可以忽略,或不执行,另外,所显示的或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些端口,模块之间的间接耦合或通信连接可以是电性或其他类似的形式,本申请中均不作限定。并且,作为分离部件说明的模块或子模块可以是也可以不是物理上的分离,可以是也可以不是物理模块,或者可以分布到多个电路模块中,可以根据实际的需要选择其中的部分或全部模块来实现本申请方案的目的。
为了更好的理解本申请提供的方案可以适用的领域以及场景,在对本申请提供的技术方案进行具体的介绍之前,首先对人工智能主体框架、本申请提供的方案适用的系统架构以及神经网络的相关知识进行介绍。
图1示出一种人工智能主体框架示意图,该主体框架描述了人工智能系统总体工作流程,适用于通用的人工智能领域需求。
下面从“智能信息链”(水平轴)和“信息技术(information technology,IT)价值链”(垂直轴)两个维度对上述人工智能主题框架进行详细的阐述。
“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。
“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施:
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。
基础设施可以通过传感器与外部沟通,基础设施的计算能力可以由智能芯片提供。
这里的智能芯片可以是中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processingunit,GPU)、专门应用的集成电路(application specific integrated circuit,ASIC)以及现场可编程门阵列(field programmable gate array,FPGA)等硬件加速芯片。
基础设施的基础平台可以包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。
例如,对于基础设施来说,可以通过传感器和外部沟通获取数据,然后将这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据:
基础设施的上一层的数据用于表示人工智能领域的数据来源。该数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理:
上述数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等处理方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力:
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用:
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市,智能终端等。
本申请实施例可以应用在人工智能中的很多领域,例如,智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市等领域。
具体地,本申请实施例可以具体应用在语音增强、语音识别需要使用(深度)神经网络 的领域。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以xs和截距1为输入的运算单元,该运算单元的输出可以为:
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(3)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
如图2所示,本申请实施例提供了一种系统架构100。在图2中,数据采集设备160用于采集训练数据。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述,训练设备120对输入的原始数据进行处理,将输出的数据与原始数据进行对比,直到训练设备120输出的数据与原始数据的差值小于一定的阈值,从而完成目标模型/规则101的训练。
上述目标模型/规则101能够用于实现本申请实施例的语音增强方法,上述训练设备可以用于实现本申请实施例提供的训练神经网络的方法。本申请实施例中的目标模型/规则101具体可以为神经网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图2所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)AR/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在图2中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:客户设备输入的待处理图像。
预处理模块113和预处理模块114用于根据I/O接口112接收到的输入数据(如待处理图像)进行预处理,在本申请实施例中,也可以没有预处理模块113和预处理模块114(也可以只有其中的一个预处理模块),而直接采用计算模块111对输入数据进行处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图2中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图2仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图2中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图2所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是本申请中的神经网络,具体的,本申请实施例提供的神经网络可以CNN,深度卷积神经网络(deep convolutional neural networks,DCNN),循环神经网络(recurrent neural network,RNNS)等等。
由于CNN是一种非常常见的神经网络,下面结合图3重点对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
本申请实施例的语音增强方法和训练模型的方法具体采用的神经网络的结构可以如图3所示。在图3中,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。其中,输入层210可以获取待处理图像,并将获取到的待处理图像交由卷积层/池化层220以及后面的神经网络层230进行处理,可以得到图像的处理结果。下面对图3中的CNN 200中内部的层结构进行详细的介绍。
卷积层/池化层220:
卷积层:
如图3所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的卷积特征图的尺寸也相同,再将提取到的多个尺寸相同的卷积特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图3中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层隐含层(如图3所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图3由210至240方向的传播为前向传播)完成,反向传播(如图3由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
本申请实施例的语音增强方法和训练模型的方法具体采用的神经网络的结构可以如图4所示。在图4中,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。与图3相比,图4中的卷积层/池化层220中的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层230进行处理。
需要说明的是,图3和图4所示的卷积神经网络仅作为一种本申请实施例的语音增强 方法和训练模型的方法的两种可能的卷积神经网络的示例,在具体的应用中,本申请实施例的语音增强方法和训练模型的方法所采用的卷积神经网络还可以以其他网络模型的形式存在。
图5为本申请实施例提供的一种芯片的硬件结构,该芯片包括神经网络处理器。该芯片可以被设置在如图2所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图2所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图3或图4所示的卷积神经网络中各层的算法均可在如图5所示的芯片中得以实现。
神经网络处理器NPU作为协处理器挂载到主中央处理器(centralprocessing unit,CPU,host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路303,控制器304控制运算电路303提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路303内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器302中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器301中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)308中。
向量计算单元307可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元307可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现种,向量计算单元能307将经处理的输出的向量存储到统一缓存器306。例如,向量计算单元307可以将非线性函数应用到运算电路303的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元307生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路303的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器306用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器305(direct memory accesscontroller,DMAC)将外部存储器中的输入数据搬运到输入存储器301和/或统一存储器306、将外部存储器中的权重数据存入权重存储器302,以及将统一存储器306中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)310,用于通过总线实现主CPU、DMAC和取指存储器309之间进行交互。
与控制器304连接的取指存储器(instruction fetch buffer)309,用于存储控制器304使用的指令;
控制器304,用于调用取指存储器309中缓存的指令,实现控制该运算加速器的工作 过程。
一般地,统一存储器306,输入存储器301,权重存储器302以及取指存储器309均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random accessmemory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图2所示的卷积神经网络中各层的运算可以由运算电路303或向量计算单元307执行。
如图6所示,本申请实施例提供了一种系统架构。该系统架构包括本地设备401、本地设备402以及执行设备210和数据存储系统150,其中,本地设备401和本地设备402通过通信网络与执行设备210连接。
执行设备210可以由一个或多个服务器实现。可选的,执行设备210可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备210可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备210可以使用数据存储系统150中的数据,或者调用数据存储系统150中的程序代码来实现本申请实施例的语音增强方法或者训练神经网络的方法。
通过上述过程执行设备210能够搭建成一个目标神经网络,该目标神经网络可以用于语音增强或者语音识别处理等等。
用户可以操作各自的用户设备(例如本地设备401和本地设备402)与执行设备210进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备210进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备401、本地设备402从执行设备210获取到目标神经网络的相关参数,将目标神经网络部署在本地设备401、本地设备402上,利用该目标神经网络进行语音增强或者语音识别等等。
在另一种实现中,执行设备210上可以直接部署目标神经网络,执行设备210通过从本地设备401和本地设备402获取待处理图像,并根据目标神经网络对待增强语音进行语音增强或者其他类型的语音处理。
上述执行设备210也可以称为云端设备,此时执行设备210一般部署在云端。
上文中介绍的图2中的执行设备110能够执行本申请实施例的语音增强方法,上文中介绍的图4中的训练设备120能够执行本申请实施例的训练神经网络的方法的各个步骤,图5和图6所示的CNN模型和图5所示的芯片也可以用于执行本申请实施例的语音增强方法和训练模型的方法的各个步骤。下面结合附图对本申请实施例的语音增强方法和训练模型的方法进行详细的介绍。
如图7所示,为本申请实施例提供的一种语音增强方法的流程示意图。
如图7所示,本申请实施例提供的一种语音增强方法可以包括如下步骤:
701、获取待增强语音和参考图像。
本申请可以通过多声道的麦克风阵列获取待增强语音也可以通过单个音频通道(以下简称为单声道)获取待增强语音。
通过单声道语音增强只利用了时域和频域的信息,而麦克风阵列语音增强不仅利用了时域和频域的信息,还利用了空域的信息。由于时域和频域信息在音源分离中起主导作用,而空域信息只是起到辅助作用,所以本申请提供的方案的待增强语音可以通过单声道的麦克风阵列获取。
需要说明的是,通过单个音频通道获取待增强语音为本申请实施例提供的一个更为优选的方案。单声道语音增强对硬件成本要求相对低,可以形成通用的解决方案,并且广泛应用到各个产品中。但是复杂的环境会限制单声道的声学概率模型的效果,单声道语音增强的任务更为困难。而本申请提供的方案可以为声学模型提供视觉信息来增强语音降噪模型的效果。随着第五代移动通信技术(5th generation mobile networks或5th generation wireless systems、5th-Generation,5G)的发展,视频通话和摄像头在5G智能家居中使用越来越广泛,因此本申请提供的可以基于单声道的语音增强方法会在不远的将来大范围应用。
本申请提供的技术方案中涉及的参考图像可以通过相机、摄像机等可以记录影像或者图像的设备获取。下面结合本申请可能适用的几个典型的场景,对获取待增强语音和参考图像进行举例说明。需要说明的是,下面介绍的几个典型的场景只是对本申请提供的方案可能的适用场景的举例说明,并不代表本申请提供的方案可以适用的全部场景。
场景一:视频语音通话
如图8所示,为本申请实施例提供的一种方案的适用场景的示意图。如图8中的a所示,设备A与设备B正在建立视频语音通话。其中,设备A和设备B可以是手机、平板、笔记本电脑或者智能穿戴设备。假设设备A采用了本申请提供的方案,则在设备A和设备B建立视频语音通过的过程中,设备A获取到的声音为待增强语音,此时的待增强语音可能包括设备A的用户的语音以及周围环境的噪声。设备A获取到的图像为参考图像,此时的参考图像可以是设备A的相机镜头对准的区域的图像,比如设备A的用户将摄像头对准了自己的脸(需要说明的是,本申请中的相机镜头和摄像头在不强调二者区别之时,表达相同的意思,都是表示记录影像或图像的器件),则此时参考图像为设备A的用户的人脸。或者设备A的用户在视频语音通过的过程中,没有将摄像头对准自己,而是对准了周围的环境,则此时参考图像为周围的环境。
由于本申请提供的技术方案可以结合图像信息对语音增强,具体的,需要结合人脸的图像信息对语音进行增强,所以在摄像头对准人脸的时候将会有更好的语音的增强效果。为了方便用户可以更好的感受到本申请提供的方案带来的良好的语音增强效果。在一种具体的场景中,可以提示用户将摄像头对准人脸,将获得更好的语音增强效果。如图8中的b所示,为本申请提供的另一种方案的适用场景的示意图。以A设备为例,假设设备A采 用了本申请提供的方案,在与设备B建立视频语音通过的过程中,可以在视频对话的窗口显示文字提示。比如图8中的b所示的,在视频的过程中,在视频窗口显示文字“将摄像头对准人脸,语音效果会更好”,或者“请将摄像头对准人脸”或者“正在进行语音增强,请将摄像头对准脸部”等等。或者如图8中的c所示,在视频的过程中,如果设备A检测到用户已经将摄像头对准了人脸则不进行提示,当检测到在视频的过程中,设备A的用户没有将摄像头对准人脸,而是对准了环境时,在视频窗口显示文字提示,比如可以显示“将摄像头对准人脸,语音效果会更好”,或者“请将摄像头对准人脸”等等。需要说明的是,当用户了解了这一功能后,可以选择关闭文字提示,即用户了解了视频语音通过过程中,将摄像头对准人脸,可以有更好的语音增强效果后,用户可以主动关掉文字提示的功能,或者可以预先设定,采用了本方案的设备只在第一次视频语音通过的过程显示文字提示。
场景二:会议录音
如图9所示,为本申请实施例提供的另一种适用场景的示意图。目前,为了提高工作效率,通过会议协调多方人士的工作是比较重要的手段。为了能够回溯会议内容,在会议过程中对每个发言人发言内容的记录以及会议记录的整理成为了基本要求。当前记录发言人的发言和整理会议记录可以采用多种方式,比如:秘书的人工速记。或者录音笔等录音设备先全程录音,会后人工整理录音内容形成会议记录等。但是这些方式均因为需要人工介入而导致效率较低。
语音识别技术引用到会议系统给会议记录的整理带来的便捷,比如:在会议系统中,通过录音设备录制与会者的发言内容,以及语音识别软件识别与会者的发言内容,进一步可以形成会议记录,这大大提高了会议记录的整理的效率。本申请提供的方案可以应用到对会议录音这一场景中,进一步提升语音识别的效果。在这一场景中,假设会议上A正在发言,则可以录制A的发言内容,在录制A的发言内容的同时,同步获取图像。此时A的发言内容为待增强语音,该待增强语音可以包括A的纯语音以及会议中产生的其他噪声,此时同步拍摄的图像为参考图像,在一个优选的实施方式中,该参考图像为A的人脸图像。在一些实际情况中,拍摄者有可能并未在A发言的过程中,全程拍摄A的人脸,则在A发言的过程中,获取到的其他非人脸图像也可以看做是本方案中的参考图像。
在另一种场景中,假设会议上正在发言的有A,B,C三人,可以选择对A,B,C三人中的至少一个人的发言内容进行增强。举例说明,当选择对A的发言内容进行增强时,可以在A发言的过程中,同步拍摄A的人脸图像,此时,A的发言内容为待增强语音,该待增强语音可以包括A的纯语音以及会议中产生的其他噪声(比如其他噪声可以是B的发言内容或者C的发言内容),此时同步拍摄的A的人脸图像为参考图像。当选择对B的发言内容进行增强时,可以在B发言的过程中,同步拍摄B的人脸图像,此时,B的发言内容为待增强语音,该待增强语音可以包括B的纯语音以及会议中产生的其他噪声(比如其他噪声可以是A的发言内容或者C的发言内容),此时同步拍摄的B的人脸图像为参考图像。当选择对C的发言内容进行增强时,可以在C发言的过程中,同步拍摄C的人脸图像,此时,C的发言内容为待增强语音,该待增强语音可以包括C的纯语音以及会议中产生的其他噪声(比如其他噪声可以是A的发言内容或者B的发言内容),此时同步拍摄的C的人脸图像为 参考图像。或者,当选择对A和B的发言内容进行增强时,可以在A和B发言的过程中,同步拍摄A和B的人脸图像,此时,A和B的发言内容为待增强语音,该待增强语音可以包括A的纯语音和B的纯语音以及会议中产生的其他噪声(比如其他噪声可以是C的发言内容),此时同步拍摄的A和B的人脸图像为参考图像。当选择对B和C的发言内容进行增强时,可以在B和C发言的过程中,同步拍摄和B和C的人脸图像此时,B和C的发言内容为待增强语音,该待增强语音可以包括B的纯语音和C的纯语音以及会议中产生的其他噪声(比如其他噪声可以是A的发言内容),此时同步拍摄的B和C的人脸图像为参考图像。当选择对A和C的发言内容进行增强时,可以在A和C发言的过程中,同步拍摄A和C的人脸图像,A和C的发言内容为待增强语音,该待增强语音可以包括A的纯语音和C的纯语音以及会议中产生的其他噪声(比如其他噪声可以是B的发言内容),此时同步拍摄的A和C的人脸图像为参考图像。或者,当选择对A和B以及C的发言内容进行增强时,可以在A和B以及C发言的过程中,同步拍摄A和B以及C的人脸图像,此时,A和B以及C的发言内容为待增强语音,该待增强语音可以包括A的纯语音和B的纯语音以及C的纯语音以及会议中产生的其他噪声(比如除ABC之外的其他与会人发出的声音或者其他环境噪声),此时同步拍摄的A和B以及C的人脸图像为参考图像。
场景三:与可穿戴设备的语音交互
本场景所指的可穿戴设备是指可以直接穿在身上,或是整合到用户的衣服或配件的一种便携式设备。比如,可穿戴设备可以是智能手表,智能手环,智能眼镜等等。基于语音识别的输入法和语义理解被大幅应用于可穿戴设备中,虽然触控目前仍然是人和它们之间通信的主要方式,但是由于这些设备的屏幕普遍较小,且人和它们之间的交流都是以简单直接的任务为主,语音必然成为这些设备的下一代信息入口,以此也能解放人的手指,使得人与这些设备之间的通信更为便捷自然。但是,这些设备通常都在比较复杂的声学环境中被用户所用,周围有各种突发噪声的干扰,比如人和手机以及穿戴设备之间的交流通常会发生在大街上或商场里,这些场景里都有非常嘈杂的背景噪音,复杂的噪声环境通常让语音的识别率显著下降,识别率的下降意味着这些设备无法准确理解用户的指令,这就会大幅降低用户的体验。本申请提供的方案也可以应用于与可穿戴设备的语音交互场景中。如图10所示,可穿戴设备在获取用户的语音指令时,可以同步获取用户的人脸图像,根据本申请提供的方案,对用户的语音指令进行语音增强,进而可以使可穿戴设备可以更好的识别用户的指令,做出对应用户的指令的响应。在这一场景中,可以将用户的语音指令看做待增强语音,将同步获取的人脸图像看做参考图像,通过本申请提供的方案,在语音增强的过程中引入视觉信息,如参考图像,使在有非常嘈杂的背景噪声的环境中,也有很好的语音增强以及语音识别的效果。
场景四:与智能家居的语音交互
智能家居(smart home,home automation)是以住宅为平台,利用综合布线技术、网络通信技术、安全防范技术、自动控制技术、音视频技术将家居生活有关的设施集成,构建高效的住宅设施与家庭日程事务的管理系统,提升家居安全性、便利性、舒适性、艺术性,并实现环保节能的居住环境。比如,智能家居可以包括智能照明系统,智能窗帘,智 能电视,智能空调等等。如图11所示,当用户对智能家居发出语音控制指令时,具体的可以包括用户直接对智能家居发出语音控制指令,或者用户通过其他设备对智能家居发出语音控制指令,比如通过手机等设备,远程对智能家居发出语音控制指令。此时可以通过智能家居或者其他设备获取预设区域的图像。比如当用户通过手机对智能家居发出语音控制指令时,手机可以获取此时拍摄到的图像,在这种场景中,用户发出的语音控制指令为待增强语音,同步拍摄到的图像为参考图像。在一个具体的实施场景中,当预设区域没有检测到人脸时,可以发出语音提示用户将摄像头对准人脸,比如发出提示“正在进行语音增强,请将摄像头对准脸部”等等。
702、根据第一神经网络输出待增强语音的第一增强信号。
第一神经网络是以理想浮值掩蔽(ideal ratio mask,IRM)为训练目标,对语音和噪声的混合数据进行训练得到的神经网络。
时频掩蔽是语音分离的常用目标,常见的时频掩蔽有理想的二值掩蔽和理想浮值掩蔽,它们能显著地提高分离语音的可懂度和感知质量,一旦估计出了时频掩蔽目标,不考虑相位信息,通过逆变换技术即可合成语音的时域波形。示例性的,下面给出一种傅里叶变换域的理想浮值掩蔽的定义:
$$\mathrm{IRM}(t,f)=\frac{P_s(t,f)}{P_s(t,f)+P_n(t,f)}$$
其中,Ys(t,f)是混合数据中纯净语音的短时傅里叶变换系数,Yn(t,f)是混合数据中噪声的短时傅里叶变换系数,Ps(t,f)是Ys(t,f)对应的能量密度,Pn(t,f)是Yn(t,f)对应的能量密度。
上面给出了傅里叶变换域的理想浮值掩蔽的定义,需要说明的是,本领域的技术人员在获知了本申请提供的方案后,容易联想到还可以采用其他的语音分离的目标作为第一神经网络的训练目标。比如还可以采用短时傅里叶变换掩蔽,隐式时频掩蔽等等作为第一神经网络的训练目标。换句话说,现有技术中,语音和噪声的混合数据,经过某个神经网络进行语音分离后,可以得到该神经网络的输出信号在任意一个时刻的信噪比,则该神经网络采用的训练目标,本申请提供的方案都可以采用。
上述语音可以是指纯净语音或者干净语音,是指未保护任何噪声的语音。语音和噪声的混合数据是指加噪语音,即向该干净语音中添加预设分布的噪声得到的语音。本实施例中将干净语音和加噪语音作为待训练的语音。
具体的,在生成加噪语音时,可以通过向干净语音中添加各种不同分布的噪声得到该干净语音对应的多个加噪语音。例如:向干净语音1中添加第一分布的噪声得到加噪语音1,向干净语音2中添加第二分布的噪声得到加噪语音2,向干净语音1中添加第三分布的噪声得到加噪语音3,依次类推。经过上述加噪过程,可以得到多个干净语音和加噪语音的数据对,例如:{干净语音1,加噪语音1},{干净语音1,加噪语音2},{干净语音1,加噪语音3}等等。
实际训练过程中,可以先获取多个干净语音,并且向每个干净语音中添加多种不同分 布的噪声,从而得到海量的{干净语音,加噪语音}的数据对。将这些数据对作为待训练的语音。例如:可以选取主流报刊媒体等500个语句,尽可能包含所有的发声,再选取100位不同的人进行朗读,作为干净语音信号(即模拟的含噪语音对应的干净语音)。然后再选取公共场景、交通、工作场景、咖啡厅等18中生活常见噪音,与干净语音信号进行交叉合成,得到带噪音的语音信号(相当于模拟的含噪语音)。干净语音信号与带噪音的语音信号一一匹配作为标记好的数据。将这些数据随机打乱,并选取其中80%作为训练集进行神经网络模型训练,另外20%作为验证集用于验证神经网络模型的结果,最后训练好的神经网络模型即相当于本申请实施例中的第一神经网络。
第一神经网络训练完成后,在语音增强时,将待增强语音转换成二维时频信号,输入到第一神经网络,得到该待增强语音的第一增强信号。
可以采用短时傅立叶变换(short-time-fourier-transform,STFT)的方式对待增强语音信号进行时频转换,以得到待增强语音的二维时频信号。需要说明的是,本申请有时也将时频转换称为特征变换,在不强调二者的区别之时,二者表示相同的意思,本申请有时也将二维时频信号称为频域特征,在不强调二者的区别之时,二者表示相同的意思。下面对此进行举例说明,假设待增强语音的表达式如下:
y(t)=x(t)+n(t)
其中,y(t)表示t时刻待增强语音的时域信号,x(t)表示t时刻干净语音的时域信号,n(t)表示t时刻噪声的时域信号。对待增强语音进行STFT变换,可以表示如下:
Y(t,d)=X(t,d)+N(t,d)t-1,2,...,T;d=1,2,...,D
其中,Y(t,d)表示待增强语音在第t声学特征帧和第d频带的频域信号的表示,X(t,d)表示干净语音在第t声学特征帧和第d频带的频域信号的表示,N(t,d)表示噪声在第t声学特征帧和第d频带的频域信号的表示。T和D分别表示待增强信号总共有多少声学特征帧和总频带数。
需要说明的是,对语音信号进行特征变换的方式不止限于STFT的方式,在一些其他的实施方式中也可以采用其他方式,例如Gabor变换和Wigner-Ville分布等方式。现有技术中关于对于语音信号进行特征变换得到语音信号的二维时频信号的方式,本申请实施例均可以采用。在一个具体的实施方式中,为了加速神经网络的收敛速度和收敛性,还可以对特征变换后的频域特征进行规范化处理。比如,可以对频域特征进行减均值除以标准差的运算,以得到规范化后的频域特征。在一个具体的实施方式中,可以将经过规范化后的频域特征作为第一神经网络的输入,以得到第一增强信号,以长短期记忆网络(long short-term memory,LSTM)为例,可以通过如下公式表示:
$$\mathrm{LSTM}\big(g(a_j)\big)\approx\frac{P_s(a_{clean},j)}{P_s(a_{clean},j)+P_s(a_{noise},j)}$$
其中,上述等式的右边为训练目标IRM,上文已经对此进行了介绍。在本公式中,Ps(aclean,j)代表干净信号在j时刻的能量谱(也可以称为能量密度),Ps(anoise,j)代表噪声信号在j时刻的能量谱。上述等式的左边表示通过神经网络对训练目标的近似。a j 代表神经网络的输入,在本实施方式中,可以是频域特征,g()代表一个函数关系,比如这里可以是对神经网络的输入进行减均值除以标准差的规范化而后做对数变换的函数关系。
需要说明的是,上述LSTM仅仅是为了举例说明,本申请的第一神经网络可以是任意一种时序模型,即可以在每一个时间步提供对应的输出,确保模型的实时性。第一神经网络训练完毕后,可以将权重冻结,即保持第一神经网络的权重参数不变,使第二神经网络或者其他神经网络不会影响到第一神经网络的性能,确保在缺乏视觉模态(即参考图像不包括人脸信息或者唇部信息)的情况下的模型能按照第一神经网络的输出,保证模型的健壮性。
703、根据第二神经网络输出参考图像的掩蔽函数。
掩蔽函数指示参考图像的频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音为噪声,频段能量不小于预设值表示参考图像对应的待增强语音为干净语音。第二神经网络是以理想二值掩蔽(ideal binary mask,IBM)为训练目标,对第一神经网络采用的语音的声源处对应的包括唇部特征的图像进行训练得到的神经网络。
从生理学的角度出发,可以认为不同人说出同样话语的音量,音色等是不同的,导致每一个音的发音时频谱有差异,但它们的能量分布是相同的。发音的能量分布可以作为原始音频对说话人和音量等因素做规范化后的结果,这也是从音频的共振峰可以推测音节的原因。因此我们对干净信号的能量分布做建模,用人嘴的图像拟合这种能量分布。事实上,人嘴图像直接拟合上述的能量分布是很困难的,人的发音不只是通过嘴型来确定,而是通过口腔内部共振腔的形状及舌部的位置等因素确定,但人嘴的图像并不能准确反映这些因素,导致同一段嘴型的视频可以对应不同的发音,即不能一一映射。因此我们设计了这种弱相关(weak reference)的方式,将原有的精细的分布通过二值化的方式转化成粗糙的分布,以便于图像端去拟合。而这种粗糙分布刻画的是嘴型是否会对应某一组频段的发音状况。本申请要通过第二神经网络建立图像的频段能量和语音的频段能量的映射关系,具体的要建立每个时刻的图像帧的每个频段的能量和每个时刻的声学特征帧的每个频段的能量之间的关联关系。
下面分别对第二神经网络的训练目标以及训练用到的数据进行说明。
第二神经网络的训练目标IBM为一种符号函数,下面通过如下表达式对其定义进行说明。
$$\mathrm{IBM}(a_j)=\begin{cases}1,&\mathrm{dist}(a_j)-\mathrm{threshold}\ge 0\\0,&\mathrm{dist}(a_j)-\mathrm{threshold}<0\end{cases}$$
其中,dist函数为能量分布函数,其定义如下:
Figure PCTCN2021079047-appb-000004
其中,j是指在j时刻,或者是第j帧的持续时长结束的时刻。每一帧可以包括多个频段,比如包括k个频段,k是指j时刻纯净语音的第k个频段,k为正整数。每个时刻包括多少个频段可以预先设定,比如可以设定一个时刻包括4个频段,或者一个时刻包括5个频段,本申请实施例对此并不做限定。P s(a kj)是指干净信号在j时刻第k个频段的能量 谱。因此dist(aj)表征的是在j时刻对应的k个频段上音频能量的分布。threshold为预先设定的阈值,在一个具体的实施方式中,threshold一般可取10 -5。如果dist(aj)和threshold的差值大于等于0,即dist(aj)大于threshold,则认为dist(aj)是语音主导或者无法判断dist(aj)是语音主导还是噪声主导,将其对应的函数值设定为1。如果dist(aj)和threshold的差值小于0,即dist(aj)小于threshold,则认为dist(aj)是噪音主导,将其对应的函数值设定为0。
第二神经网络的训练数据为第一神经网络采用的语音的声源处对应的包括唇部特征的图像。比如,上述在步骤702中提到,可以选取主流报刊媒体等500个语句,尽可能包含所有的发声,再选取100位不同的人进行朗读,作为干净语音信号(即模拟的含噪语音对应的干净语音),则第二神经网络的训练数据可以包括该100位不同的人的人脸图像,或者包括该100位不同的人的人嘴图像,或者包括该100位不同的人的包括人脸的图像,比如上半身的图像。需要说明的是,第二神经网络的训练数据并不是只包括第一神经网络采用的语音的声源处对应的包括唇部特征的图像,第二神经网络的训练数据还可以包括一些不包含唇部特征的图像数据或者不包括人脸图像的数据。
下面结合以下公式进行具体的解释说明。
Figure PCTCN2021079047-appb-000005
v代表训练数据,上面已经对训练数据进行了介绍,此处不再重复赘述。sigmoid定义为
$$\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}}$$
sigmoid是一种激活函数,通过该激活函数表示图像的每个时刻每个频段的能量,通过神经网络使sigmoid的值逼近dist(aj)-threshold的取值,比如上述公式中用到的LSTM。f()代表特征提取函数。需要说明的是,这里的sigmoid只是为了举例说明,本申请实施方式中还可以采取其他的激活函数去逼近训练目标。
此外,在一个具体的实施方式中,可以使第二神经网络的处理的图像帧与第一神经网络的声学特征帧进行时间序列的对齐。通过时间序列的对齐,可以保证在后续流程中,同一时刻处理的第二神经网络输出的数据与第一神经网络输出的数据是对应的。举例说明,假设有一段视频,该段视频中包括1帧的图像帧和4帧的声学特征帧。这里的图像帧和声学帧的数目的倍数关系可以通过对该段视频按照预设的帧率进行重采样确定,比如按照图像帧的帧率为40帧/s对该段视频包括的图像数据进行重采样,按照声学特征帧的帧率为10帧/s对该段视频包括的音频数据进行重采样。在这段视频中,该1帧的图像帧与4帧的声学特征帧在时间上是对齐的。换句话说,该1帧的图像帧的持续时长与该4帧的声学特征帧的持续时长是对齐的。在本方案中,第一神经网络对该4帧的声学特征帧进行处理,第二神经网络对该1帧的图像帧进行处理,对第二神经网络的处理的图像帧与第一神经网络的声学特征帧进行时间序列的对齐,在这个例子中,是为了使第一神经网络和第二神经网络在处理过程中,以及处理完成后,该4帧声学特征帧与该1帧图像帧在时间上仍然是对齐的。不仅如此,通过本申请提供的方案,通过第二神经网络对该1帧图像帧进行时间 对齐处理后,可以得到与该4帧声学特征帧分别对应的4帧图像帧,并输出该4帧图像帧对应的掩蔽函数。下面对本申请实施例给出的一种时间序列对齐的方式进行具体的介绍。
在一个具体的实施方式中,待增强语音包括第一声学特征帧,第一声学特征帧对应的时刻由第一时间索引指示,图像包括第一图像帧,第一图像帧为第二神经网络的输入数据,根据第二神经网络输出图像的掩蔽函数,包括:根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数,第一时刻由第一时间索引的倍数指示,倍数根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定,以使第一时刻为第一声学特征帧对应的时刻。举例说明,上述公式中,m代表倍数,根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定。比如第一声学特征帧的帧率为10帧/s,第一图像帧的帧率为40帧/s,则第一声学特征帧的帧率与第一图像帧的帧率的比值为1/4(10/40),则上述公式中m取4。再比如第一声学特征帧的帧率为25帧/s,第一图像帧的帧率为50帧/s,则第一声学特征帧的帧率与第一图像帧的帧率的比值为1/2(25/50),则上述公式中m取2。为了更清楚的解释时间队列对齐,下面以m取4,结合图12进行进一步的说明。图12所示,为本申请实施例提供的一种关于时间序列对齐的示意图。如图12所示,图中的白色方框代表第二神经网络的输入的图像帧,如图12所示,示出了4帧输入的图像帧。假设输入的1帧图像帧持续时间与4帧声学特征帧持续时长相同,即m取4时,经过第二神经网络的时间序列对齐的处理后,该输入的一帧图像帧对应4帧处理后的图像帧,该4帧处理后的图像帧的每一帧的持续时长与声学帧持续时长相同。如图12所示,黑色方框代表经过第二神经网络时间对齐处理后的图像帧,第二神经网络会输出对齐处理后的图像帧的掩蔽函数,如图12所示,共包括16个时间对齐处理后的图像帧,则会输出与该16个时间对齐处理后的图像帧对应的掩蔽函数。该16个图像帧分别与一个声学特征帧在时间上是对齐的,换句话说,白色方框代表的1个图像帧与4个声学特征帧在时间上是对齐的,黑色方框代表的1个图像帧与1个声学特征帧在时间上是对齐的。
第二神经网络训练完成后,在语音增强时,将参考图像输入到第二神经网络,得到该参考图像的掩蔽函数。在实际执行的过程中,可以对参考图像做一些预处理,将预处理后的参考图像输入到第二神经网络,比如还可以将参考图像采样到制定的图像帧率。还可以对参考图像进行人脸特征提取,以得到人脸图,人脸特征提取可以通过人脸特征提取算法进行。人脸特征提取算法包括基于人脸特征点的识别算法、基于整幅人脸图像的识别算法、基于模板的识别算法等。比如,可以是基于人脸特征点检测算法的人脸检测。人脸特征提取也可以通过神经网络进行。可以通过卷积神经网络模型进行人脸特征的提取,比如基于多任务卷积神经网络的人脸检测等。可以将经过人脸特征提取的人脸图作为第二神经网络的输入。第二神经网络还可以对人脸图进行进一步的处理,比如可以提取人嘴部的运动特征对应的图像帧,对这些人嘴部的运动特征对应的图像帧进行时间序列对齐的处理。
704、根据第一增强信号和掩蔽函数的运算结果确定待增强语音的第二增强信号。
本实施例可以通过第一神经网络输出第一增强信号,通过第二神经网络输出参考图像的掩蔽函数。由于第二神经网络建立图像的频段能量和语音的频段能量的映射关系,掩蔽函数可以指示参考图像的频段能量是否小于预设值,频段能量小于预设值表示参考图像对 应的待增强语音为噪声,频段能量不小于预设值表示参考图像对应的待增强语音为干净语音。通过第一增强信号和掩蔽函数的运算结果确定的待增强语音的第二增强信号,相比于第一增强信号,即相比于只通过单一的神经网络进行语音增强的方案,可以获得更好的语音增强效果。举例说明,假设对于某一时刻的待增强音频包括的第一频段,第一神经网络输出该第一频段的信噪比为A,假设A代表第一神经网络确定该第一频段为语音主导,第二神经网络输出该第一频段的频段能量为B,B小于预设值,即假设B代表第二神经网络确定该第一频段为噪音主导,通过A和B进行数学运算,比如可以对A和B进行加和,乘积,或者平方中的一种或者几种运算,得到A和B之间的运算结果,通过该运算结果可以确定A和B在最后输出的第二增强信号中的占比。具体的,第一增强信号和掩蔽函数的运算的原理在于掩蔽函数的实际意义是衡量某一频段是否有足够的能量。当第一神经网络输出的第一增强信号与第二神经网络输出的掩蔽函数指示不一致性时,会反应为:
第二神经网络输出的值小而第一神经网络输出的值大,对应第一神经网络(音频端)认为某个频段(比如第一频段)有能量构成发音,而第二神经网络(视频端)认为人的口型并不能发出对应的声音;
第二神经网络输出的值大而第一神经网络输出的值小,对应第一神经网络(音频端)认为某个频段(比如第一频段)没有能量构成发音,而第二神经网络(视频端)认为人的口型正在发出某种可能的声音;
通过第一增强信号和掩蔽函数的运算的操作方式会将以上不一致的部分缩放到一个较小的值,而一致的部分则会保持不变,得到融合后的新输出第二增强信号,其中不发音或音视频不一致的的频段能量都会被压缩到一个较小的值。
由图7对应的实施例可知,利用第一神经网络输出待增强语音的第一增强信号,利用第二神经网络对图像信息和语音信息的关联关系进行建模,使第二神经网络输出的参考图像的掩蔽函数可以指示该参考图像对应的待增强语音为噪声或者语音。通过本申请提供的技术方案,可以将图像信息应用于语音增强的过程中,在一些相对嘈杂的环境中,也可以很好的提升语音增强的能力,提升听感。
上面图7对应的实施例中介绍了可以根据第一增强信号和掩蔽函数的运算结果确定待增强语音的第二增强信号。下面给出一种优选的方案,通过第三神经网络确定待增强语音的第二增强信号,具体的,根据第三神经网络输出的权值确定第二增强信号。权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果。第三神经网络是以IRM为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
如图13所示,为本申请实施例提供的另一种语音增强方法的流程示意图。
如图13所示,本申请实施例提供的另一种语音增强方法可以包括如下步骤:
1301、获取待增强语音和参考图像。
步骤1301可以参照图7对应的实施例中的步骤701进行理解,此处不再重复赘述。
1302、根据第一神经网络输出待增强语音的第一增强信号。
步骤1302可以参照图7对应的实施例中的步骤702进行理解,此处不再重复赘述。
1303、根据第二神经网络输出参考图像的掩蔽函数。
步骤1303可以参照图7对应的实施例中的步骤703进行理解,此处不再重复赘述。
在一个具体的实施方式中,还可以包括:确定参考图像是否包括人脸信息。若确定参考图像包括人脸信息,则根据第二神经网络输出参考图像的掩蔽函数。
1304、根据第三神经网络输出的权值确定第二增强信号。
以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号。权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果。第三神经网络是以IRM为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
第三神经网络对第一神经网络的输出数据以及第二神经网络的输出数据进行训练,具体的,对第一神经网络在训练过程中输出的多组第一增强信号以及第二神经网络在训练过程中输出的多组掩蔽函数进行训练。由于在步骤1302中,第二神经网络对图像帧与第一神经网络的声学特征帧进行时间序列的对齐,所以第三神经网络在同一时刻接收到的第一神经网络的输出以及第二神经网络的输出是时间对齐后的数据。第三神经网络可以对第一增强信号以及掩蔽函数的运算结果进行训练,关于第一增强信号以及掩蔽函数之间的数学运算已经在上文进行了介绍,这里不再重复赘述。本申请并不限制第三神经网络的类型,示例性的,第三神经网络为LSTM,第一增强信号和掩蔽函数之间的数学运算为乘法运算时,第三神经网络对第一神经网络的输出数据以及第二神经网络的输出数据进行训练,以输出权值(gate),可以通过如下公式表示:
gate=LSTM(IBM×IRM)
上文步骤701中提到了几种本方案可能适用的具体场景,其中参考图像可能包括人脸信息,具体的,是待增强语音的声源处的包括人脸信息的图像。在一些场景中,参考图像也可能与人脸信息无关,比如,参考图像可能与声源处对应的图像无关。本申请第二神经网络的训练数据中既包括了第一神经网络采用的语音的声源处对应的包括唇部特征的图像,还可以包括一些不包含唇部特征的图像数据或者不包括人脸图像的数据。所以在不同的场景中,是否要结合第二神经网络的输出对语音进行增强,以及如果要结合第二神经网络的输出对语音进行增强,第二神经网络的输出以及第一神经网络的输出在最终输出的第二增强信号中的占比是多少,这些问题通过第三神经网络输出的权值确定。示例性性,以第一增强信号和掩蔽函数之间的数学运算为乘法运算为例,第二增强信号可以通过下面的公式表示,其中IRM’代表第二增强信号:
IRM′=gate×(IBM×IRM)+(1-gate)×IRM
由于第二神经网络的输出并不是完全准确的,可能导致错误的将一部分的第一增强信号缩放,因此我们添加了第三神经网络网络,通过权值,保留确信的部分,而不确信的部分由第一增强信号填补。这种设计方案也确保了当检测不到视觉模态(即检测不到参考图像中包括人脸信号或者唇部信息)的情况下,可以通过将权值置为0,使得IRM’=IRM,即第二增强信号即为第一增强信号,保证了本申请提供的方案可以在不同情况下都有良好的语音增强的性能。
在一个具体的实施方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。下面结合图14对这一过程举例说明。如图14所示,为本申请实施例提供的另一种语音增强方法的流程示意图。如图14所示,给出了一段待增强语音的频率的分布曲线,如图14所示,第一时刻的待增强语音包括一帧声学特征帧,该一帧声学特征帧包括4个频段,需要说明的是,第一时刻可以是待增强语音对应的任意一个时刻,第一时刻包括4个频段仅仅是为了举例说明,每个时刻包括多少个频段可以预先设定,比如可以设定一个时刻包括4个频段,或者一个时刻包括5个频段,本申请实施例对此并不做限定。假设该4个频段对应的信噪比分别为0.8,0.5,0.1以及0.6。第二神经网络在第一时刻会输出参考图像对应的4个频段的掩蔽函数,这是因为第二神经网络对图像帧与第一神经网络的声学特征帧进行时间序列的对齐,这里不再重复赘述。假设该4个频段对应的掩蔽函数分别为1,1,0以及1。则修正信号包括4个频段,每个频段的能量分别为0.8(1x0.8),0.5(1x0.5),0(0x0.1),0.6(1x0.6)。
通过本申请提供的这种实施方式,使本申请提供的方案可以支持流式解码,理论上界为单位声学特征帧的持续时间。以单位声学特征帧的持续时长为10ms为例,则通过本申请提供的方案,输出的第二增强语音的时延的理论上界为10ms。因为第二神经网络是按照声学特征帧对应的时刻输出掩蔽函数(具体的可以参照上面关于时间序列对齐的描述进行理解,这里不再重复赘述),所以第三神经网络接收到一帧声学特征帧对应的第一增强信号,就可以对该第一增强信号,以及同一时刻对应的掩蔽函数进行处理,输出该时刻的第二增强信号。由于可以逐帧对待增强语音进行处理,所以可以逐帧播放第二增强信号。换句话说,由于可以以声学特征帧为单位,一帧一帧对待增强语音进行处理,相应的第二神经网络也是按照声学特征帧对应的时刻输出掩蔽函数,所以第三神经网络可以以声学特征帧为单位输出第二增强信号,所以本申请提供的方案,理论时延上界为单位声学特征帧的持续时长。
为了更好的理解本申请提供的方案,下面结合图15进行描述。
图15为本申请实施例提供的另一种语音增强方法的流程示意图。假设有一段视频,该段视频包括待增强语音以及参考图像。对该待增强语音进行特征变换得到该待增强语音对应的频域特征后,将该频域特征输入到第一神经网络。如图15所示,假设该段待增强语音被采样为3段音频,每一段音频经过特征变换后,包括4帧声学特征帧,即图15中的第一神经网络的输入。假设按照预设的图像帧的帧率与声学特征帧的帧率的比值对参考图像进行重采样,确定每4帧声学特征帧对应1帧图像帧。第二神经网络对该1帧图像帧进行时间对齐处理后,可以输出与该4帧声学特征帧对应的4帧图像帧,即图15中的第二神经网络的输出。可以依次将第一神经网络输出的该4帧声学特征帧对应的第一增强信号,以及第二神经网络输出的4帧图像帧对应的掩蔽函数输入至第三神经网络,第三神经网络会输出该4帧声学特征帧对应的第二增强信号,即图15中的第三神经网络的输出。再对该第二增强信号进行特征反变换,即可得到该待增强语音的时域增强信号。
第三神经网络训练好后,在语音增强时,可以以所述第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号。
在一个具体的实施方式中,第三神经网络训练后,在语音增强时,还可以包括对第三神经网络输出的结果进行特征反变换,以得到时域信号。比如待增强语音通过短时傅里叶变换后得到的频域特征为第一神经网络的输入,则可以对第三神经网络出书的第二增强信号进行逆短时傅里叶变换(inverse short-time-fourier-transform,ISTFT),以得到时域信号。
由图7和图15对应的实施例可知,第二神经网络的训练数据中还可以包括一些不包含唇部特征的图像数据或者不包括人脸图像的数据。需要说明的是,在一些具体的实施方式中,第二神经网络的训练数据中也可以只包括包含唇部特征的图像数据或者包括人脸图像的数据。在一些具体的实施方式中,可以先判断参考图像中是否包括人脸信息或者唇部信息,如果参考图像中不包括人脸信息或者唇部信息,则只根据第一神经网络输出待增强语音的增强信号,参考图像中包括人脸信息或者唇部信息时,则根据第一神经网络、第二神经网络以及第三神经网络输出待增强语音的增强信号。下面结合图16进行说明,图16为本申请实施例提供的另一种语音增强方法的流程示意图。系统先判断参考图像中是否包括人脸信息或者唇部信息,如果没有包括人脸信息或者唇部信息则根据第一神经网络输出的第一增强信号确定待增强语音的增强信号,即第二增强信号即为第一增强信号。如果系统判断参考图像中包括人脸信息或者唇部信息,则根据第二神经网络输出的掩码函数以及第一神经网络输出的第一增强信号,通过第三神经网络确定第二增强信号,具体如何根据第三神经网络确定第二增强信号,上文已经进行了详细的描述,这里不再重复赘述。
本申请实施例提供的语音增强方法的流程包括“应用”流程和“训练”流程两部分。以上对本申请提供的应用流程进行了介绍,具体的对一种语音增强方法进行了介绍,下面对本申请提供的训练流程进行介绍,具体的介绍一种训练神经网络的方法。
本申请提供一种训练神经网络的方法,该神经网络用于语音增强,该方法可以包括:获取训练数据,训练数据可以包括语音和噪声的混合数据以及语音的声源处对应的可以包括唇部特征的图像。以理想浮值掩蔽IRM为训练目标,对混合数据进行训练得到第一神经网络,训练好的第一神经网络用于输出待增强语音的第一增强信号。以理想二值掩蔽IBM为训练目标,对图像进行训练得到第二神经网络,训练好的第二神经网络用于输出参考图像的掩蔽函数,掩蔽函数指示参考图像的频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音频段为噪声,第一增强信号和掩蔽函数的运算结果用于确定待增强语音的第二增强信号。
在一个具体的实施方式中,参考图像为待增强语音的声源处对应的可以包括唇部特征的图像。
在一个具体的实施方式中,第一增强信号和掩蔽函数的运算结果用于确定待增强语音的第二增强信号,可以包括:以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号,权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果,第三神经网络 是以第一mask为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
在一个具体的实施方式中,方法还可以包括:确定图像是否可以包括人脸信息或者唇部信息。图像不包括人脸信息或者唇部信息时,权值指示第二增强信号中修正信号的输出比例为0,第一增强信号的输出比例为百分之百。
在一个具体的实施方式中,修正信号是第一增强信号和掩蔽函数的乘积运算结果。
在一个具体的实施方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号可以包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。
在一个具体的实施方式中,待增强语音可以包括第一声学特征帧,第一声学特征帧对应的时刻由第一时间索引指示,图像可以包括第一图像帧,第一图像帧为第二神经网络的输入数据,根据第二神经网络输出图像的掩蔽函数,可以包括:根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数,第一时刻由第一时间索引的倍数指示,倍数根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定。
在一个具体的实施方式中,该方法还可以包括:对待增强语音进行特征变换,以得到待增强语音的频域特征。该方法还可以包括:对第二增强信号进行特征反变换,以得到增强语音。
在一个具体的实施方式中,对待增强语音进行特征变换,可以包括:对待增强语音进行短时傅里叶变换STFT。对第二增强信号进行特征反变换,可以包括:对第二增强信号进行逆短时傅里叶变换ISTFT。
在一个具体的实施方式中,该方法还可以包括:对图像进行采样,使图像可以包括的图像帧的帧率为预设的帧率。
在一个具体的实施方式中,唇部特征通过对人脸图进行特征抽取获得,人脸图为对图像进行人脸检测获得。
在一个具体的实施方式中,图像的频段能量由激活函数表示,使激活函数的取值逼近IBM,以得到第二神经网络。
在一个具体的实施方式中,待增强语音通过单个音频通道获取。
在一个具体的实施方式中,第一mask是理想浮值掩蔽IRM,第二mask是理想二值掩蔽IBM。
实验数据集采用Grid数据集作为纯净语音语料,32组说话人每人1000条,共32000条语料被分为训练集27000条(30组说话人,每组900条),Seentest测试集3000条(30组说话人,每组100条)和Unseentest测试集2000条(2组说话人,每组1000条)。CHiME background数据集按8:2分为训练噪声集和普通环境测试噪声集,Audioset Human noise作为人声环境测试集。主要对比的基线是声学模型(AO),Visual Speech Enhancement(VSE)模型和Looking to Listen(L2L)模型。实验主要由PESQ评分作为评估方式。通过实验数据证实,本申请提供的方案能够利用视觉信息对语音增强任务在-5到20dB上有全面提升。
上文结合附图对本申请实施例的语音增强方法和神经网络训练方法进行了详细的描述,下面对本申请实施例的相关装置进行详细的介绍。应理解,相关装置能够执行本申请实施例的语音增强方法以及神经网络训练的各个步骤,下面在介绍相关装置时适当省略重复的描述。
图17为本申请实施例提供的一种语音增强装置的结构示意图;
在一个具体的实施方式中,该一种语音增强装置,包括:获取模块1701,用于获取待增强语音和参考图像,所述待增强语音和所述参考图像为同时获取的数据。音频处理模块1702,用于根据第一神经网络输出待增强语音的第一增强信号,第一神经网络是以第一掩码mask为训练目标,对语音和噪声的混合数据进行训练得到的神经网络。图像处理模块1703,用于根据第二神经网络输出参考图像的掩蔽函数,掩蔽函数指示参考图像对应的频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音的频段为噪声,第二神经网络是以第二掩码mask为训练目标,对第一神经网络采用的语音的声源处对应的包括唇部特征的图像进行训练得到的神经网络。综合处理模块1704,用于根据第一增强信号和掩蔽函数的运算结果确定待增强语音的第二增强信号。
在一个具体的实施方式中,参考图像为待增强语音的声源处对应的包括唇部特征的图像。
在一个具体的实施方式中,综合处理模块1704,具体用于:以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号,权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果,第三神经网络是以第一mask为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
在一个具体的实施方式中,装置还包括:特征提取模块,特征提取模块,用于确定参考图像是否包括人脸信息或者唇部信息。参考图像不包括人脸信息或者唇部信息时,权值指示第二增强信号中修正信号的输出比例为0,第一增强信号的输出比例为百分之百。
在一个具体的实施方式中,修正信号是第一增强信号和掩蔽函数的乘积运算结果。
在一个具体的实施方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。
在一个具体的实施方式中,待增强语音包括第一声学特征帧,第一声学特征帧对应的时刻由第一时间索引指示,参考图像包括第一图像帧,第一图像帧为第二神经网络的输入数据,图像处理模块1703,具体用于:根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数,第一时刻由第一时间索引的倍数指示,倍数根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定。
在一个具体的实施方式中,对待增强语音进行特征变换,可以包括:对待增强语音进行短时傅里叶变换STFT。对第二增强信号进行特征反变换,可以包括:对第二增强信号进行逆短时傅里叶变换ISTFT。
在一个具体的实施方式中,特征提取模块,还用于对参考图像进行采样,使参考图像可以包括的图像帧的帧率为预设的帧率。
在一个具体的实施方式中,唇部特征通过对人脸图进行特征抽取获得,人脸图为对参考图像进行人脸检测获得。
在一个具体的实施方式中,参考图像的频段能量由激活函数表示,使激活函数的取值逼近IBM,以得到第二神经网络。
在一个具体的实施方式中,待增强语音通过单个音频通道获取。
在一个具体的实施方式中,第一mask是理想浮值掩蔽IRM,第二mask是理想二值掩蔽IBM。
图18为本申请实施例提供的一种训练神经网络的装置的结构示意图。
本申请提供一种训练神经网络的装置,神经网络用于语音增强,装置包括:获取模块1801,用于获取训练数据,训练数据包括语音和噪声的混合数据以及语音的声源处对应的包括唇部特征的图像。音频处理模块1802,用于以理想浮值掩蔽IRM为训练目标,对混合数据进行训练得到第一神经网络,训练好的第一神经网络用于输出待增强语音的第一增强信号。图像处理模块1803,用于以理想二值掩蔽IBM为训练目标,对图像进行训练得到第二神经网络,训练好的第二神经网络用于输出参考图像的掩蔽函数,掩蔽函数指示参考图像的频段能量是否小于预设值,频段能量小于预设值表示参考图像对应的待增强语音频段为噪声,第一增强信号和掩蔽函数的运算结果用于确定待增强语音的第二增强信号。
在一个具体的实施方式中,参考图像为待增强语音的声源处对应的包括唇部特征的图像。
在一个具体的实施方式中,还包括:综合处理模块1804,综合处理模块1804,用于以第一增强信号以及掩蔽函数作为第三神经网络的输入数据,根据第三神经网络输出的权值确定第二增强信号,权值指示第二增强信号中第一增强信号和修正信号的输出比例,修正信号是掩蔽函数和第一增强信号的运算结果,第三神经网络是以第一mask为训练目标,对第一神经网络的输出数据以及第二神经网络的输出数据进行训练得到的神经网络。
在一个具体的实施方式中,装置还包括:特征特征提取模块,
特征特征提取模块,用于确定图像是否包括人脸信息或者唇部信息。图像不包括人脸信息或者唇部信息时,权值指示第二增强信号中修正信号的输出比例为0,第一增强信号的输出比例为百分之百。
在一个具体的实施方式中,修正信号是第一增强信号和掩蔽函数的乘积运算结果。
在一个具体的实施方式中,修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,M为正整数,第一时刻第一神经网络输出的第一增强信号包括M个频段,M个频段中的每一个频段对应一个信噪比,第一时刻的掩蔽函数为第二神经网络在第一时刻输出的掩蔽函数。
在一个具体的实施方式中，待增强语音包括第一声学特征帧，第一声学特征帧对应的时刻由第一时间索引指示，图像包括第一图像帧，第一图像帧为第二神经网络的输入数据，图像处理模块1803，具体用于：根据第二神经网络输出第一图像帧在第一时刻对应的掩蔽函数，第一时刻由第一时间索引的倍数指示，倍数根据第一声学特征帧的帧率与第一图像帧的帧率的比值确定。
在一个具体的实施方式中,对待增强语音进行特征变换,可以包括:对待增强语音进行短时傅里叶变换STFT。对第二增强信号进行特征反变换,可以包括:对第二增强信号进行逆短时傅里叶变换ISTFT。
在一个具体的实施方式中，特征提取模块，还用于对参考图像进行采样，使参考图像包括的图像帧的帧率为预设的帧率。
在一个具体的实施方式中,唇部特征通过对人脸图进行特征抽取获得,人脸图为对参考图像进行人脸检测获得。
在一个具体的实施方式中,参考图像的频段能量由激活函数表示,使激活函数的取值逼近IBM,以得到第二神经网络。
在一个具体的实施方式中,待增强语音通过单个音频通道获取。
在一个具体的实施方式中,第一mask是理想浮值掩蔽IRM,第二mask是理想二值掩蔽IBM。
图19为本申请实施例提供的另一种语音增强装置的结构示意图。
图19是本申请实施例的语音增强装置的示意性框图。图19所示的语音增强装置包括存储器1901、处理器1902、通信接口1903以及总线1904。其中，存储器1901、处理器1902、通信接口1903通过总线1904实现彼此之间的通信连接。
上述通信接口1903相当于语音增强装置中的获取模块1701，上述处理器1902相当于语音增强装置中的音频处理模块1702、图像处理模块1703和综合处理模块1704。下面对语音增强装置中的各个模块进行详细的介绍。
存储器1901可以是只读存储器(read only memory，ROM)，静态存储设备，动态存储设备或者随机存取存储器(random access memory，RAM)。存储器1901可以存储程序，当存储器1901中存储的程序被处理器1902执行时，处理器1902和通信接口1903用于执行本申请实施例的语音增强方法的各个步骤。具体地，通信接口1903可以从存储器或者其他设备中获取待增强语音和参考图像，然后由处理器1902对该待增强语音进行语音增强。
处理器1902可以采用通用的中央处理器(central processing unit，CPU)，微处理器，应用专用集成电路(application specific integrated circuit，ASIC)，图形处理器(graphics processing unit，GPU)或者一个或多个集成电路，用于执行相关程序，以实现本申请实施例的语音增强装置中的模块所需执行的功能(例如，处理器1902可以实现上述语音增强装置中的音频处理模块1702、图像处理模块1703和综合处理模块1704所需执行的功能)，或者执行本申请实施例的语音增强方法。
处理器1902还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请实施例的语音增强方法的各个步骤可以通过处理器1902中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器1902还可以是通用处理器、数字信号处理器(digital signal processor，DSP)、ASIC、现成可编程门阵列(field programmable gate array，FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。上述通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1901，处理器1902读取存储器1901中的信息，结合其硬件完成本申请实施例的语音增强装置中包括的模块所需执行的功能，或者执行本申请方法实施例的语音增强方法。
通信接口1903使用例如但不限于收发器一类的收发装置，来实现装置与其他设备或通信网络之间的通信。例如，可以通过通信接口1903获取待增强语音和参考图像。
总线1904可包括在装置各个部件（例如，存储器1901、处理器1902、通信接口1903）之间传送信息的通路。
图20为本申请实施例提供的另一种训练神经网络的装置的结构示意图。
图20是本申请实施例的训练神经网络装置的硬件结构示意图。与上述装置类似,图20所示的训练神经网络装置包括存储器2001、处理器2002、通信接口2003以及总线2004。其中,存储器2001、处理器2002、通信接口2003通过总线2004实现彼此之间的通信连接。
存储器2001可以存储程序,当存储器2001中存储的程序被处理器2002执行时,处理器2002用于执行本申请实施例的神经网络的训练方法的各个步骤。
处理器2002可以采用通用的CPU,微处理器,ASIC,GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的神经网络的训练方法。
处理器2002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请实施例的神经网络的训练方法的各个步骤可以通过处理器2002中的硬件的集成逻辑电路或者软件形式的指令完成。
应理解,通过图20所示的训练神经网络装置对神经网络进行训练,训练得到的神经网络就可以用于执行本申请实施例的方法。
具体地,图20所示的装置可以通过通信接口2003从外界获取训练数据以及待训练的神经网络,然后由处理器根据训练数据对待训练的神经网络进行训练。
应注意，尽管上述装置仅仅示出了存储器、处理器、通信接口，但是在具体实现过程中，本领域的技术人员应当理解，上述装置还可以包括实现正常运行所必须的其他器件。同时，根据具体需要，本领域的技术人员应当理解，上述装置还可包括实现其他附加功能的硬件器件。此外，本领域的技术人员应当理解，上述装置也可仅仅包括实现本申请实施例所必须的器件，而不必包括图19和图20中所示的全部器件。
本领域普通技术人员可以意识到，结合本文中所公开的实施例描述的各示例的模块及算法步骤，能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个模块或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或模块的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理模块,即可以位于一个地方,或者也可以分布到多个网络模块上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。
所述功能如果以软件功能模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (46)

  1. 一种语音增强方法,其特征在于,包括:
    获取待增强语音和参考图像,所述待增强语音和所述参考图像为同时获取的数据;
    根据第一神经网络输出所述待增强语音的第一增强信号,所述第一神经网络是以第一掩码mask为训练目标,对语音和噪声的混合数据进行训练得到的神经网络;
    根据第二神经网络输出所述参考图像的掩蔽函数,所述掩蔽函数指示所述参考图像对应的频段能量是否小于预设值,所述频段能量小于所述预设值表示所述参考图像对应的所述待增强语音的频段为噪声,所述第二神经网络是以第二掩码mask为训练目标,对所述第一神经网络采用的所述语音的声源处对应的包括唇部特征的图像进行训练得到的神经网络;
    根据所述第一增强信号和所述掩蔽函数的运算结果确定所述待增强语音的第二增强信号。
  2. 根据权利要求1所述的语音增强方法,其特征在于,所述参考图像为所述待增强语音的声源处对应的包括唇部特征的图像。
  3. 根据权利要求1或2所述的语音增强方法,其特征在于,所述根据所述第一增强信号和所述掩蔽函数的运算结果确定所述待增强语音的第二增强信号,包括:
    以所述第一增强信号以及所述掩蔽函数作为第三神经网络的输入数据,根据所述第三神经网络输出的权值确定所述第二增强信号,所述权值指示所述第二增强信号中所述第一增强信号和修正信号的输出比例,所述修正信号是所述掩蔽函数和所述第一增强信号的运算结果,所述第三神经网络是以所述第一mask为训练目标,对所述第一神经网络的输出数据以及所述第二神经网络的输出数据进行训练得到的神经网络。
  4. 根据权利要求3所述的语音增强方法,其特征在于,所述方法还包括:
    确定所述参考图像是否包括人脸信息或者唇部信息;
    所述参考图像不包括所述人脸信息或者唇部信息时,所述权值指示所述第二增强信号中所述修正信号的输出比例为0,所述第一增强信号的输出比例为百分之百。
  5. 根据权利要求3或4所述的语音增强方法,其特征在于,所述修正信号是所述第一增强信号和所述掩蔽函数的乘积运算结果。
  6. 根据权利要求5所述的语音增强方法,其特征在于,所述修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,所述M为正整数,所述第一时刻所述第一神经网络输出的所述第一增强信号包括M个频段,所述M个频段中的每一个频段对应一个信噪比,所述第一时刻的掩蔽函数为所述第二神经网络在所述第一时刻输出的所述掩蔽函数。
  7. 根据权利要求1至6任一项所述的语音增强方法,其特征在于,所述待增强语音包括第一声学特征帧,所述第一声学特征帧对应的时刻由第一时间索引指示,所述参考图像包括第一图像帧,所述第一图像帧为所述第二神经网络的输入数据,所述根据第二神经网络输出所述参考图像的掩蔽函数,包括:
    根据所述第二神经网络输出所述第一图像帧在第一时刻对应的掩蔽函数,所述第一时刻由所述第一时间索引的倍数指示,所述倍数根据所述第一声学特征帧的帧率与所述第一图像帧的帧率的比值确定。
  8. 根据权利要求1至7任一项所述的语音增强方法,其特征在于,所述方法还包括:
    对所述待增强语音进行特征变换,以得到所述待增强语音的频域特征;
    所述方法还包括:
    对所述第二增强信号进行特征反变换,以得到增强语音。
  9. 根据权利要求8所述的语音增强方法,其特征在于,
    所述对所述待增强语音进行特征变换,包括:
    对所述待增强语音进行短时傅里叶变换STFT;
    所述对所述第二增强信号进行特征反变换,包括:
    对所述第二增强信号进行逆短时傅里叶变换ISTFT。
  10. 根据权利要求1至9任一项所述的语音增强方法,其特征在于,所述方法还包括:
    对所述参考图像进行采样,使所述参考图像包括的图像帧的帧率为预设的帧率。
  11. 根据权利要求1至10任一项所述的语音增强方法,其特征在于,所述唇部特征通过对人脸图进行特征抽取获得,所述人脸图为对所述参考图像进行人脸检测获得。
  12. 根据权利要求1至11任一项所述的语音增强方法,其特征在于,所述参考图像的频段能量由激活函数表示,使所述激活函数的取值逼近所述IBM,以得到所述第二神经网络。
  13. 根据权利要求1至12任一项所述的语音增强方法,其特征在于,所述待增强语音通过单个音频通道获取。
  14. 根据权利要求1至13任一项所述的语音增强方法,其特征在于,所述第一mask是理想浮值掩蔽IRM,所述第二mask是理想二值掩蔽IBM。
  15. 一种训练神经网络的方法,其特征在于,所述神经网络用于语音增强,所述方法包括:
    获取训练数据,所述训练数据包括语音和噪声的混合数据以及所述语音的声源处对应的包括唇部特征的图像;
    以理想浮值掩蔽IRM为训练目标,对所述混合数据进行训练得到第一神经网络,训练好的所述第一神经网络用于输出待增强语音的第一增强信号;
    以理想二值掩蔽IBM为训练目标,对所述图像进行训练得到第二神经网络,训练好的所述第二神经网络用于输出参考图像的掩蔽函数,所述掩蔽函数指示所述参考图像的频段能量是否小于预设值,所述频段能量小于所述预设值表示所述参考图像对应的所述待增强语音频段为噪声,所述第一增强信号和所述掩蔽函数的运算结果用于确定所述待增强语音的第二增强信号。
  16. 根据权利要求15所述的训练神经网络的方法,其特征在于,所述参考图像为所述待增强语音的声源处对应的包括唇部特征的图像。
  17. 根据权利要求15或16所述的训练神经网络的方法,其特征在于,所述第一增强信号和所述掩蔽函数的运算结果用于确定所述待增强语音的第二增强信号,包括:
    以所述第一增强信号以及所述掩蔽函数作为第三神经网络的输入数据，根据所述第三神经网络输出的权值确定所述第二增强信号，所述权值指示所述第二增强信号中所述第一增强信号和修正信号的输出比例，所述修正信号是所述掩蔽函数和所述第一增强信号的运算结果，所述第三神经网络是以所述第一mask为训练目标，对所述第一神经网络的输出数据以及所述第二神经网络的输出数据进行训练得到的神经网络。
  18. 根据权利要求17所述的训练神经网络的方法,其特征在于,所述方法还包括:
    确定所述图像是否包括人脸信息或者唇部信息;
    所述图像不包括所述人脸信息或者唇部信息时,所述权值指示所述第二增强信号中所述修正信号的输出比例为0,所述第一增强信号的输出比例为百分之百。
  19. 根据权利要求17或18所述的训练神经网络的方法,其特征在于,所述修正信号是所述第一增强信号和所述掩蔽函数的乘积运算结果。
  20. 根据权利要求19所述的训练神经网络的方法,其特征在于,所述修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,所述M为正整数,所述第一时刻所述第一神经网络输出的所述第一增强信号包括M个频段,所述M个频段中的每一个频段对应一个信噪比,所述第一时刻的掩蔽函数为所述第二神经网络在所述第一时刻输出的所述掩蔽函数。
  21. 根据权利要求15至20任一项所述的训练神经网络的方法,其特征在于,所述待增强语音包括第一声学特征帧,所述第一声学特征帧对应的时刻由第一时间索引指示,所述图像包括第一图像帧,所述第一图像帧为所述第二神经网络的输入数据,所述根据第二神经网络输出所述图像的掩蔽函数,包括:
    根据所述第二神经网络输出所述第一图像帧在第一时刻对应的掩蔽函数,所述第一时刻由所述第一时间索引的倍数指示,所述倍数根据所述第一声学特征帧的帧率与所述第一图像帧的帧率的比值确定。
  22. 根据权利要求15至21任一项所述的训练神经网络的方法,其特征在于,所述方法还包括:
    对所述待增强语音进行特征变换,以得到所述待增强语音的频域特征;
    所述方法还包括:
    对所述第二增强信号进行特征反变换,以得到增强语音。
  23. 根据权利要求22所述的训练神经网络的方法,其特征在于,
    所述对所述待增强语音进行特征变换,包括:
    对所述待增强语音进行短时傅里叶变换STFT;
    所述对所述第二增强信号进行特征反变换,包括:
    对所述第二增强信号进行逆短时傅里叶变换ISTFT。
  24. 根据权利要求15至23任一项所述的训练神经网络的方法,其特征在于,所述方法还包括:
    对所述图像进行采样,使所述图像包括的图像帧的帧率为预设的帧率。
  25. 根据权利要求15至24任一项所述的训练神经网络的方法,其特征在于,所述唇部特征通过对人脸图进行特征抽取获得,所述人脸图为对所述图像进行人脸检测获得。
  26. 根据权利要求15至25任一项所述的训练神经网络的方法，其特征在于，所述图像的频段能量由激活函数表示，使所述激活函数的取值逼近所述IBM，以得到所述第二神经网络。
  27. 根据权利要求15至26任一项所述的训练神经网络的方法,其特征在于,所述待增强语音通过单个音频通道获取。
  28. 根据权利要求15至27任一项所述的训练神经网络的方法,其特征在于,所述第一mask是理想浮值掩蔽IRM,所述第二mask是理想二值掩蔽IBM。
  29. 一种语音增强装置,其特征在于,包括:
    获取模块,用于获取待增强语音和参考图像,所述待增强语音和所述参考图像为同时获取的数据;
    音频处理模块,用于根据第一神经网络输出所述待增强语音的第一增强信号,所述第一神经网络是以第一掩码mask为训练目标,对语音和噪声的混合数据进行训练得到的神经网络;
    图像处理模块,用于根据第二神经网络输出所述参考图像的掩蔽函数,所述掩蔽函数指示所述参考图像对应的频段能量是否小于预设值,所述频段能量小于所述预设值表示所述参考图像对应的所述待增强语音的频段为噪声,所述第二神经网络是以第二掩码mask为训练目标,对所述第一神经网络采用的所述语音的声源处对应的包括唇部特征的图像进行训练得到的神经网络;
    综合处理模块,用于根据所述第一增强信号和所述掩蔽函数的运算结果确定所述待增强语音的第二增强信号。
  30. 根据权利要求29所述的语音增强装置,其特征在于,所述参考图像为所述待增强语音的声源处对应的包括唇部特征的图像。
  31. 根据权利要求29或30所述的语音增强装置,其特征在于,所述综合处理模块,具体用于:
    以所述第一增强信号以及所述掩蔽函数作为第三神经网络的输入数据,根据所述第三神经网络输出的权值确定所述第二增强信号,所述权值指示所述第二增强信号中所述第一增强信号和修正信号的输出比例,所述修正信号是所述掩蔽函数和所述第一增强信号的运算结果,所述第三神经网络是以所述第一mask为训练目标,对所述第一神经网络的输出数据以及所述第二神经网络的输出数据进行训练得到的神经网络。
  32. 根据权利要求31所述的语音增强装置,其特征在于,所述装置还包括:特征提取模块,
    所述特征提取模块,用于确定所述参考图像是否包括人脸信息或者唇部信息;所述参考图像不包括所述人脸信息或者唇部信息时,所述权值指示所述第二增强信号中所述修正信号的输出比例为0,所述第一增强信号的输出比例为百分之百。
  33. 根据权利要求31或32所述的语音增强装置,其特征在于,所述修正信号是所述第一增强信号和所述掩蔽函数的乘积运算结果。
  34. 根据权利要求33所述的语音增强装置，其特征在于，所述修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定，所述M为正整数，所述第一时刻所述第一神经网络输出的所述第一增强信号包括M个频段，所述M个频段中的每一个频段对应一个信噪比，所述第一时刻的掩蔽函数为所述第二神经网络在所述第一时刻输出的所述掩蔽函数。
  35. 根据权利要求29至34任一项所述的语音增强装置,其特征在于,所述待增强语音包括第一声学特征帧,所述第一声学特征帧对应的时刻由第一时间索引指示,所述参考图像包括第一图像帧,所述第一图像帧为所述第二神经网络的输入数据,所述图像处理模块,具体用于:
    根据所述第二神经网络输出所述第一图像帧在第一时刻对应的掩蔽函数,所述第一时刻由所述第一时间索引的倍数指示,所述倍数根据所述第一声学特征帧的帧率与所述第一图像帧的帧率的比值确定。
  36. 一种训练神经网络的装置,其特征在于,所述神经网络用于语音增强,所述装置包括:
    获取模块,用于获取训练数据,所述训练数据包括语音和噪声的混合数据以及所述语音的声源处对应的包括唇部特征的图像;
    音频处理模块,用于以理想浮值掩蔽IRM为训练目标,对所述混合数据进行训练得到第一神经网络,训练好的所述第一神经网络用于输出待增强语音的第一增强信号;
    图像处理模块,用于以理想二值掩蔽IBM为训练目标,对所述图像进行训练得到第二神经网络,训练好的所述第二神经网络用于输出参考图像的掩蔽函数,所述掩蔽函数指示所述参考图像的频段能量是否小于预设值,所述频段能量小于所述预设值表示所述参考图像对应的所述待增强语音频段为噪声,所述第一增强信号和所述掩蔽函数的运算结果用于确定所述待增强语音的第二增强信号。
  37. 根据权利要求36所述的训练神经网络的装置,其特征在于,所述参考图像为所述待增强语音的声源处对应的包括唇部特征的图像。
  38. 根据权利要求36或37所述的训练神经网络的装置,其特征在于,还包括:综合处理模块,
    所述综合处理模块,用于以所述第一增强信号以及所述掩蔽函数作为第三神经网络的输入数据,根据所述第三神经网络输出的权值确定所述第二增强信号,所述权值指示所述第二增强信号中所述第一增强信号和修正信号的输出比例,所述修正信号是所述掩蔽函数和所述第一增强信号的运算结果,所述第三神经网络是以所述第一mask为训练目标,对所述第一神经网络的输出数据以及所述第二神经网络的输出数据进行训练得到的神经网络。
  39. 根据权利要求38所述的训练神经网络的装置，其特征在于，所述装置还包括：特征提取模块，
    所述特征提取模块，用于确定所述图像是否包括人脸信息或者唇部信息；
    所述图像不包括所述人脸信息或者唇部信息时,所述权值指示所述第二增强信号中所述修正信号的输出比例为0,所述第一增强信号的输出比例为百分之百。
  40. 根据权利要求38或39所述的训练神经网络的装置,其特征在于,所述修正信号是所述第一增强信号和所述掩蔽函数的乘积运算结果。
  41. 根据权利要求40所述的训练神经网络的装置,其特征在于,所述修正信号根据M个信噪比和第一时刻的掩蔽函数的乘积运算结果确定,所述M为正整数,所述第一时刻所述第一神经网络输出的所述第一增强信号包括M个频段,所述M个频段中的每一个频段对应一个信噪比,所述第一时刻的掩蔽函数为所述第二神经网络在所述第一时刻输出的所述掩蔽函数。
  42. 根据权利要求36至41任一项所述的训练神经网络的装置,其特征在于,所述待增强语音包括第一声学特征帧,所述第一声学特征帧对应的时刻由第一时间索引指示,所述图像包括第一图像帧,所述第一图像帧为所述第二神经网络的输入数据,所述图像处理模块,具体用于:
    根据所述第二神经网络输出所述第一图像帧在第一时刻对应的掩蔽函数,所述第一时刻由所述第一时间索引的倍数指示,所述倍数根据所述第一声学特征帧的帧率与所述第一图像帧的帧率的比值确定。
  43. 一种语音增强装置,其特征在于,包括:
    存储器,用于存储程序;
    处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行如权利要求1-14中任一项所述的方法。
  44. 一种训练神经网络的装置,其特征在于,包括:
    存储器,用于存储程序;
    处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行如权利要求15-28中任一项所述的方法。
  45. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有程序代码,所述程序代码包括用于执行如权利要求1-14中任一项所述的方法中的步骤的指令。
  46. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有程序代码,所述程序代码包括用于执行如权利要求15-28中任一项所述的方法中的步骤的指令。