CN113516990B - Voice enhancement method, neural network training method and related equipment - Google Patents

Voice enhancement method, neural network training method and related equipment

Info

Publication number
CN113516990B
CN113516990B (application CN202010281044.1A)
Authority
CN
China
Prior art keywords
neural network
voice
image
enhancement
enhanced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010281044.1A
Other languages
Chinese (zh)
Other versions
CN113516990A (en)
Inventor
王午芃
邢超
陈晓
孙凤宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010281044.1A
Priority to PCT/CN2021/079047 (published as WO2021203880A1)
Publication of CN113516990A
Application granted
Publication of CN113516990B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a voice enhancement method, which relates to the field of artificial intelligence and comprises the following steps: acquiring voice to be enhanced and a reference image, where the voice to be enhanced and the reference image are data collected at the same time; outputting a first enhancement signal of the voice to be enhanced according to a first neural network; outputting a masking function of the reference image according to a second neural network, where the masking function indicates whether the frequency-band energy corresponding to the reference image is less than a preset value, and energy below the preset value indicates that the corresponding frequency band of the voice to be enhanced is noise; and determining a second enhancement signal of the voice to be enhanced according to the result of an operation on the first enhancement signal and the masking function. With the technical solution provided by the application, image information can be applied to the voice enhancement process, which markedly improves voice enhancement capability and the listening experience in some relatively noisy environments.

Description

Voice enhancement method, neural network training method and related equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to a voice enhancement method and a method for training a neural network.
Background
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, basic AI theory, and the like.
Automatic speech recognition (ASR) refers to a technique for recognizing the corresponding text content from a speech waveform, and is one of the important technologies in the field of artificial intelligence. In speech recognition systems, speech enhancement, also commonly referred to as speech noise reduction, is a very important technique. Speech enhancement can remove high-frequency noise, low-frequency noise, white noise, and various other kinds of noise from the speech signal, thereby improving the speech recognition effect. Therefore, how to improve the speech enhancement effect is a problem that needs to be solved.
Disclosure of Invention
The embodiment of the application provides a voice enhancement method, which can apply image information to the voice enhancement process and can markedly improve voice enhancement capability and the listening experience in some relatively noisy environments.
In order to achieve the above purpose, the embodiment of the present application provides the following technical solutions:
The first aspect of the present application provides a voice enhancement method, which may include: acquiring voice to be enhanced and a reference image, where the voice to be enhanced and the reference image are data collected at the same time; outputting a first enhancement signal of the voice to be enhanced according to a first neural network, where the first neural network is a neural network obtained by training on mixed data of voice and noise with a first mask as the training target; outputting a masking function of the reference image according to a second neural network, where the masking function indicates whether the frequency-band energy corresponding to the reference image is less than a preset value, energy below the preset value indicating that the corresponding frequency band of the voice to be enhanced is noise, and the second neural network is a neural network obtained by training, with a second mask as the training target, on images that include lip features and correspond to the sound source of the voice used for training the first neural network; and determining a second enhancement signal of the voice to be enhanced according to the result of an operation on the first enhancement signal and the masking function. According to the first aspect, the first neural network outputs the first enhancement signal of the voice to be enhanced, and the second neural network models the association between the image information and the voice information, so that the masking function of the reference image output by the second neural network can indicate whether the voice to be enhanced corresponding to the reference image is noise or speech. With the technical solution provided by the application, image information can be applied to the voice enhancement process, which markedly improves voice enhancement capability and the listening experience in some relatively noisy environments.
Optionally, with reference to the first aspect, in a first possible implementation manner, the reference image is an image that may include a lip feature corresponding to a sound source of the speech to be enhanced.
Optionally, with reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner, determining, according to the first enhancement signal and an operation result of the masking function, a second enhancement signal of the speech to be enhanced may include: the first enhancement signal and the masking function are used as input data of a third neural network, the second enhancement signal is determined according to a weight value output by the third neural network, the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking the first mask as a training target.
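For a concrete sense of how this fusion can be realized, the following is a minimal NumPy sketch that combines the first enhancement signal, the masking function, and the weight produced by the third neural network. The array shapes, the per-frame weight, and the function name are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def fuse_enhancement(first_enh, mask, weight):
    """Hedged sketch: combine the audio-only first enhancement signal with the
    image-derived masking function using the weight output by the third network.

    first_enh : (T, F) array, first enhancement signal per time-frequency bin (assumed shape)
    mask      : (T, F) array in [0, 1], masking function from the image branch
                (values near 0 mark bands whose energy is below the preset value, i.e. noise)
    weight    : scalar or (T, 1) array in [0, 1], output proportion of the correction signal
    """
    correction = first_enh * mask                      # correction signal: product of the two
    return weight * correction + (1.0 - weight) * first_enh
```

When the reference image contains no face or lip information, a weight of 0 reduces this to the audio-only first enhancement signal, which matches the fallback behaviour described in the following implementation manner.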
Optionally, with reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the method may further include: it is determined whether the reference image may include face information or lip information. When the reference image does not include face information or lip information, the weight indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is one hundred percent.
Optionally, with reference to the second or the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation between the first enhancement signal and the masking function.
Optionally, with reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first moment, M is a positive integer, the first enhancement signal output by the first neural network at the first moment may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
Optionally, with reference to the first aspect or the first to fifth possible implementation manners of the first aspect, in a sixth possible implementation manner, the voice to be enhanced may include a first acoustic feature frame, the time corresponding to the first acoustic feature frame is indicated by a first time index, the reference image may include a first image frame, and the first image frame is input data of the second neural network. Outputting the masking function of the reference image according to the second neural network may include: outputting, according to the second neural network, the masking function corresponding to the first image frame at a first moment, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
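As a rough illustration of the frame-rate alignment described above, the sketch below maps an acoustic feature frame index to the image frame whose masking function applies at that moment. The 100 Hz acoustic frame rate and 25 fps video rate are assumptions for the example only.

```python
def mask_for_acoustic_frame(t_acoustic, image_masks, acoustic_fps=100, image_fps=25):
    """Hedged sketch: pick the masking function of the image frame covering
    acoustic feature frame t_acoustic. Frame rates are illustrative."""
    ratio = acoustic_fps // image_fps            # each image frame spans `ratio` acoustic frames
    image_index = min(t_acoustic // ratio, len(image_masks) - 1)
    return image_masks[image_index]              # held until the next image frame arrives
```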
Optionally, with reference to the first aspect or the first to sixth possible implementation manners of the first aspect, in a seventh possible implementation manner, the method may further include: and carrying out feature transformation on the voice to be enhanced so as to obtain the frequency domain features of the voice to be enhanced. The method may further comprise: and performing characteristic inverse transformation on the second enhancement signal to obtain enhanced voice.
Optionally, with reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner, performing the feature transformation on the voice to be enhanced may include: performing a short-time Fourier transform (STFT) on the voice to be enhanced. Performing the inverse feature transformation on the second enhancement signal may include: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.
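A minimal PyTorch sketch of this feature transform and its inverse is shown below; the FFT size, hop length, and window choice are illustrative assumptions.

```python
import torch

def to_frequency_domain(waveform, n_fft=512, hop_length=256):
    """Feature transform: STFT of the voice to be enhanced (parameters are assumed)."""
    window = torch.hann_window(n_fft)
    return torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)

def to_time_domain(enhanced_spectrum, n_fft=512, hop_length=256, length=None):
    """Inverse feature transform: ISTFT of the second enhancement signal."""
    window = torch.hann_window(n_fft)
    return torch.istft(enhanced_spectrum, n_fft=n_fft, hop_length=hop_length,
                       window=window, length=length)
```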
Optionally, with reference to the first to eighth possible implementation manners of the first aspect, in a ninth possible implementation manner, the method may further include sampling the reference image so that a frame rate of an image frame that may be included in the reference image is a preset frame rate.
Optionally, with reference to the first aspect or the first to eighth possible implementation manners of the first aspect, in a tenth possible implementation manner, the lip feature is obtained by feature extraction of a face image, where the face image is obtained by face detection of a reference image.
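The patent does not name a particular face detector; purely as an illustration, the sketch below uses OpenCV's bundled Haar cascade for face detection and crops the lower third of the face box as a rough lip region for feature extraction.

```python
import cv2

# Assumption: OpenCV's Haar cascade stands in for whatever face detector is used.
_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_region(frame_bgr):
    """Hedged sketch: face detection on a reference-image frame, then a rough
    lip crop (lower third of the detected face box)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                  # no face/lip information in this frame
    x, y, w, h = faces[0]
    return frame_bgr[y + 2 * h // 3 : y + h, x : x + w]
```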
Optionally, with reference to the first aspect or the first to tenth possible implementation manners of the first aspect, in an eleventh possible implementation manner, the band energy of the reference image is represented by an activation function, and a value of the activation function is approximated to IBM, so as to obtain the second neural network.
Optionally, with reference to the first aspect or the first to eleventh possible implementation manners of the first aspect, in a twelfth possible implementation manner, the voice to be enhanced is acquired through a single audio channel.
Optionally, with reference to the first aspect or the first to twelfth possible implementation manners of the first aspect, in a thirteenth possible implementation manner, the first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM).
A second aspect of the application provides a method of training a neural network for speech enhancement. The method may include: acquiring training data, where the training data may include mixed data of voice and noise, together with corresponding images of the sound source of the voice that may include lip features; training on the mixed data with the ideal ratio mask (IRM) as the training target to obtain a first neural network, where the trained first neural network is used to output a first enhancement signal of the voice to be enhanced; and training on the images with the ideal binary mask (IBM) as the training target to obtain a second neural network, where the trained second neural network is used to output a masking function of a reference image, the masking function indicates whether the frequency-band energy corresponding to the reference image is less than a preset value, energy below the preset value indicating that the corresponding frequency band of the voice to be enhanced is noise, and the result of an operation on the first enhancement signal and the masking function is used to determine a second enhancement signal of the voice to be enhanced.
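The IRM and IBM training targets can be computed from the clean-speech and noise components of the mixed training data. The formulas below are common definitions used for illustration; the patent does not spell out the exact expressions here.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag):
    """Common IRM definition (assumed): energy ratio of speech to speech-plus-noise."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + 1e-12))

def ideal_binary_mask(speech_mag, noise_mag, snr_threshold_db=0.0):
    """Common IBM definition (assumed): 1 where the band is speech-dominated,
    0 where its energy falls below the threshold (treated as noise)."""
    snr_db = 10.0 * np.log10((speech_mag ** 2 + 1e-12) / (noise_mag ** 2 + 1e-12))
    return (snr_db > snr_threshold_db).astype(np.float32)
```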
Optionally, with reference to the second aspect, in a first possible implementation manner, the reference image is an image that may include a lip feature corresponding to a sound source of the speech to be enhanced.
Optionally, with reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner, the determining, by using an operation result of the first enhancement signal and the masking function, a second enhancement signal of the speech to be enhanced may include: the first enhancement signal and the masking function are used as input data of a third neural network, the second enhancement signal is determined according to a weight value output by the third neural network, the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking the first mask as a training target.
Optionally, with reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, the method may further include: it is determined whether the image may include face information or lip information. When the image does not include face information or lip information, the weight indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is one hundred percent.
Optionally, with reference to the second or the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation between the first enhancement signal and the masking function.
Optionally, with reference to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner, the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first moment, M is a positive integer, the first enhancement signal output by the first neural network at the first moment may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
Optionally, with reference to the second aspect or the first to fifth possible implementation manners of the second aspect, in a sixth possible implementation manner, the voice to be enhanced may include a first acoustic feature frame, the time corresponding to the first acoustic feature frame is indicated by a first time index, the image may include a first image frame, and the first image frame is input data of the second neural network. Outputting the masking function of the image according to the second neural network may include: outputting, according to the second neural network, the masking function corresponding to the first image frame at a first moment, where the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
Optionally, with reference to the second aspect or the first to sixth possible implementation manners of the second aspect, in a seventh possible implementation manner, the method may further include: and carrying out feature transformation on the voice to be enhanced so as to obtain the frequency domain features of the voice to be enhanced. The method may further comprise: and performing characteristic inverse transformation on the second enhancement signal to obtain enhanced voice.
Optionally, with reference to the seventh possible implementation manner of the second aspect, in an eighth possible implementation manner, performing the feature transformation on the voice to be enhanced may include: performing a short-time Fourier transform (STFT) on the voice to be enhanced. Performing the inverse feature transformation on the second enhancement signal may include: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.
Optionally, with reference to the first to eighth possible implementation manners of the second aspect, in a ninth possible implementation manner, the method may further include: and sampling the image to enable the frame rate of the image frames which can be included in the image to be a preset frame rate.
Optionally, with reference to the second aspect or the first to eighth possible implementation manners of the second aspect, in a tenth possible implementation manner, the lip feature is obtained by feature extraction of a face map, where the face map is obtained by face detection of an image.
Optionally, with reference to the second aspect or the first to tenth possible implementation manners of the second aspect, in an eleventh possible implementation manner, the frequency-band energy corresponding to the image is represented by an activation function, and the value of the activation function is made to approximate the IBM, so as to obtain the second neural network.
Optionally, with reference to the second aspect or the first to eleventh possible implementation manners of the second aspect, in a twelfth possible implementation manner, the voice to be enhanced is acquired through a single audio channel.
Optionally, with reference to the second aspect or the first to twelfth possible implementation manners of the second aspect, in a thirteenth possible implementation manner, the first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM).
A third aspect of the present application provides a speech enhancement apparatus, which includes: an acquisition module, configured to acquire voice to be enhanced and a reference image, where the voice to be enhanced and the reference image are data collected at the same time; an audio processing module, configured to output a first enhancement signal of the voice to be enhanced according to a first neural network, where the first neural network is a neural network obtained by training on mixed data of voice and noise with a first mask as the training target; an image processing module, configured to output a masking function of the reference image according to a second neural network, where the masking function indicates whether the frequency-band energy corresponding to the reference image is less than a preset value, energy below the preset value indicating that the corresponding frequency band of the voice to be enhanced is noise, and the second neural network is a neural network obtained by training, with a second mask as the training target, on images that include lip features and correspond to the sound source of the voice used for training the first neural network; and an integrated processing module, configured to determine a second enhancement signal of the voice to be enhanced according to the result of an operation on the first enhancement signal and the masking function.
Optionally, with reference to the third aspect, in a first possible implementation manner, the reference image is an image including lip features corresponding to a sound source of the speech to be enhanced.
Optionally, with reference to the third aspect or the first possible implementation manner of the third aspect, in a second possible implementation manner, the integrated processing module is specifically configured to: the first enhancement signal and the masking function are used as input data of a third neural network, the second enhancement signal is determined according to a weight value output by the third neural network, the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking the first mask as a training target.
Optionally, with reference to the second possible implementation manner of the third aspect, in a third possible implementation manner, the apparatus further includes: the feature extraction module is used for determining whether the reference image comprises face information or lip information. When the reference image does not include face information or lip information, the weight indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is one hundred percent.
Optionally, with reference to the second or the third possible implementation manner of the third aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation between the first enhancement signal and the masking function.
Optionally, with reference to the fourth possible implementation manner of the third aspect, in a fifth possible implementation manner, the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first moment, M is a positive integer, the first enhancement signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
Optionally, with reference to the third aspect or the first to fifth possible implementation manners of the third aspect, in a sixth possible implementation manner, the voice to be enhanced includes a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the reference image includes a first image frame, and the first image frame is input data of the second neural network, and the image processing module is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
Optionally, with reference to the seventh possible implementation manner of the third aspect, in an eighth possible implementation manner, performing the feature transformation on the voice to be enhanced may include: performing a short-time Fourier transform (STFT) on the voice to be enhanced. Performing the inverse feature transformation on the second enhancement signal may include: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.
Optionally, with reference to the first to eighth possible implementation manners of the third aspect, in a ninth possible implementation manner, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that may be included in the reference image is a preset frame rate.
Optionally, with reference to the third aspect or the first to eighth possible implementation manners of the third aspect, in a tenth possible implementation manner, the lip feature is obtained by feature extraction of a face map, where the face map is obtained by face detection of a reference image.
Optionally, with reference to the third aspect or the first to tenth possible implementation manners of the third aspect, in an eleventh possible implementation manner, the frequency-band energy corresponding to the reference image is represented by an activation function, and the value of the activation function is made to approximate the IBM, so as to obtain the second neural network.
Optionally, with reference to the third aspect or the first to eleventh possible implementation manners of the third aspect, in a twelfth possible implementation manner, the voice to be enhanced is acquired through a single audio channel.
Optionally, with reference to the third aspect or the first to twelfth possible implementation manners of the third aspect, in a thirteenth possible implementation manner, the first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM).
A fourth aspect of the present application provides an apparatus for training a neural network for speech enhancement. The apparatus includes: an acquisition module, configured to acquire training data, where the training data includes mixed data of voice and noise, together with corresponding images of the sound source of the voice that include lip features; an audio processing module, configured to train on the mixed data with the ideal ratio mask (IRM) as the training target to obtain a first neural network, where the trained first neural network is used to output a first enhancement signal of the voice to be enhanced; and an image processing module, configured to train on the images with the ideal binary mask (IBM) as the training target to obtain a second neural network, where the trained second neural network is used to output a masking function of a reference image, the masking function indicates whether the frequency-band energy corresponding to the reference image is less than a preset value, energy below the preset value indicating that the corresponding frequency band of the voice to be enhanced is noise, and the result of an operation on the first enhancement signal and the masking function is used to determine a second enhancement signal of the voice to be enhanced.
Optionally, with reference to the fourth aspect, in a first possible implementation manner, the reference image is an image including lip features corresponding to a sound source of the speech to be enhanced.
Optionally, with reference to the fourth aspect or the first possible implementation manner of the fourth aspect, in a second possible implementation manner, the apparatus further includes a comprehensive processing module.
The comprehensive processing module is used for determining a second enhancement signal according to a weight value output by the third neural network by taking the first enhancement signal and the masking function as input data of the third neural network, wherein the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking the first mask as a training target.
Optionally, with reference to the second possible implementation manner of the fourth aspect, in a third possible implementation manner, the apparatus further includes a feature extraction module.
The feature extraction module is configured to determine whether the image includes face information or lip information. When the image does not include face information or lip information, the weight indicates that the output proportion of the correction signal in the second enhancement signal is 0 and the output proportion of the first enhancement signal is one hundred percent.
Optionally, with reference to the second or the third possible implementation manner of the fourth aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation between the first enhancement signal and the masking function.
Optionally, with reference to the fourth possible implementation manner of the fourth aspect, in a fifth possible implementation manner, the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first moment, M is a positive integer, the first enhancement signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment.
Optionally, with reference to the fourth aspect or the first to fifth possible implementation manners of the fourth aspect, in a sixth possible implementation manner, the speech to be enhanced includes a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, the first image frame is input data of the second neural network, and the image processing module is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
Optionally, with reference to the seventh possible implementation manner of the fourth aspect, in an eighth possible implementation manner, performing the feature transformation on the voice to be enhanced may include: performing a short-time Fourier transform (STFT) on the voice to be enhanced. Performing the inverse feature transformation on the second enhancement signal may include: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.
Optionally, with reference to the first to eighth possible implementation manners of the fourth aspect, in a ninth possible implementation manner, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that may be included in the reference image is a preset frame rate.
Optionally, with reference to the fourth aspect or the first to eighth possible implementation manners of the fourth aspect, in a tenth possible implementation manner, the lip feature is obtained by feature extraction of a face map, where the face map is obtained by face detection of a reference image.
Optionally, with reference to the fourth aspect or the first to tenth possible implementation manners of the fourth aspect, in an eleventh possible implementation manner, the frequency-band energy corresponding to the reference image is represented by an activation function, and the value of the activation function is made to approximate the IBM, so as to obtain the second neural network.
Optionally, with reference to the fourth aspect or the first to eleventh possible implementation manners of the fourth aspect, in a twelfth possible implementation manner, the voice to be enhanced is acquired through a single audio channel.
Optionally, with reference to the fourth aspect or the first to twelfth possible implementation manners of the fourth aspect, in a thirteenth possible implementation manner, the first mask is an ideal ratio mask (IRM), and the second mask is an ideal binary mask (IBM).
A fifth aspect of the present application provides a speech enhancement apparatus, comprising: and a memory for storing a program. A processor for executing the program stored in the memory, the processor being configured to perform the method as described in the first aspect or any one of the possible implementations of the first aspect when the program stored in the memory is executed.
A sixth aspect of the present application provides an apparatus for training a neural network, comprising: and a memory for storing a program. A processor for executing the program stored in the memory, the processor being adapted to perform the method as described in the second aspect or any one of the possible implementations of the second aspect when the program stored in the memory is executed.
A seventh aspect of the present application provides a computer storage medium storing program code comprising instructions for performing the method as described in the first aspect or any one of the possible implementations of the first aspect.
An eighth aspect of the present application provides a computer storage medium storing program code comprising instructions for performing the method as described in the second aspect or any one of the possible implementations of the second aspect.
According to the solution provided by the embodiments of the application, the first neural network outputs the first enhancement signal of the voice to be enhanced, and the second neural network models the association between the image information and the voice information, so that the masking function of the reference image output by the second neural network can indicate whether the voice to be enhanced corresponding to the reference image is noise or speech. With the technical solution provided by the application, image information can be applied to the voice enhancement process, which markedly improves voice enhancement capability and the listening experience in some relatively noisy environments.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main body framework according to an embodiment of the present application;
FIG. 2 is a system architecture according to the present application;
FIG. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
FIG. 5 is a hardware structure of a chip according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a voice enhancement method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an application scenario of a solution provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of an application scenario of a solution provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of an application scenario of a solution provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of an application scenario of a solution provided in an embodiment of the present application;
FIG. 12 is a schematic diagram of time series alignment according to an embodiment of the present application;
FIG. 13 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;
FIG. 14 is a flowchart of another speech enhancement method according to an embodiment of the present application;
FIG. 15 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;
FIG. 16 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a voice enhancement device according to an embodiment of the present application;
FIG. 18 is a schematic structural diagram of an apparatus for training a neural network according to an embodiment of the present application;
FIG. 19 is a schematic diagram of another speech enhancement apparatus according to an embodiment of the present application;
FIG. 20 is a schematic structural diagram of another apparatus for training a neural network according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will now be described with reference to the accompanying drawings. It is evident that the described embodiments are only some, rather than all, of the embodiments of the present application. As those of ordinary skill in the art will appreciate, as technology develops and new scenarios emerge, the technical solutions provided by the embodiments of the application are also applicable to similar technical problems.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules that are expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps in the present application does not mean that the steps in the method flow must be executed according to the time/logic sequence indicated by the naming or numbering, and the execution sequence of the steps in the flow that are named or numbered may be changed according to the technical purpose to be achieved, so long as the same or similar technical effects can be achieved. The division of the modules in the present application is a logical division, and may be implemented in another manner in practical applications, for example, a plurality of modules may be combined or integrated in another system, or some features may be omitted or not implemented, and further, coupling or direct coupling or communication connection between the modules shown or discussed may be through some ports, and indirect coupling or communication connection between the modules may be electrical or other similar manners, which are not limited in the present application. The modules or sub-modules described as separate components may be physically separated or not, or may be distributed in a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the present application.
In order to better understand the applicable field and scene of the scheme provided by the application, before the technical scheme provided by the application is specifically introduced, the artificial intelligent main body framework, the system architecture applicable to the scheme provided by the application and the related knowledge of the neural network are introduced.
FIG. 1 illustrates a schematic diagram of an artificial intelligence framework that describes the overall workflow of an artificial intelligence system, applicable to general artificial intelligence field requirements.
The above-described artificial intelligence topic framework is described in detail below from two dimensions, the "Smart information chain" (horizontal axis) and the "information technology (information technology, IT) value chain" (vertical axis).
The "intelligent information chain" reflects a list of processes from the acquisition of data to the processing. For example, there may be general procedures of intelligent information awareness, intelligent information representation and formation, intelligent reasoning, intelligent decision making, intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" gel process.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry from the underlying infrastructure of personal intelligence, information (provisioning and processing technology implementation), to the industrial ecological process of the system.
(1) Infrastructure:
The infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the base platform.
The infrastructure may communicate with the outside through sensors, and the computing power of the infrastructure may be provided by the smart chip.
The smart chip may be a hardware acceleration chip such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
The basic platform of the infrastructure can comprise a distributed computing framework, network and other relevant platform guarantees and supports, and can comprise cloud storage, computing, interconnection network and the like.
For example, for an infrastructure, data may be obtained through sensor and external communication and then provided to a smart chip in a distributed computing system provided by the base platform for computation.
(2) Data:
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to internet of things data of traditional equipment, wherein the data comprise service data of an existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) And (3) data processing:
such data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities:
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application:
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decision-making into products and achieving practical deployment. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, automated driving, safe city, intelligent terminals, and the like.
The embodiment of the application can be applied to various fields in artificial intelligence, such as intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe city and the like.
In particular, the embodiment of the application can be applied to the fields of voice enhancement and voice recognition, wherein the (deep) neural network is required to be used.
Since embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, the following description will first discuss the terms and concepts related to neural networks that may be involved in embodiments of the present application.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes xs and an intercept of 1 as inputs, and the output of the operation unit may be:
$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right)$
where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be an area composed of several neural units.
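As a tiny numerical illustration of the neural unit above (not taken from the patent), the following computes the weighted sum plus bias and applies a sigmoid activation; the input values are made up.

```python
import numpy as np

def neural_unit(xs, ws, b):
    """Single neural unit: weighted sum of the inputs xs plus the bias b,
    passed through a sigmoid activation f."""
    z = np.dot(ws, xs) + b
    return 1.0 / (1.0 + np.exp(-z))

# Example with made-up numbers: three inputs, three weights, one bias.
print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.2]), b=0.3))
```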
(2) Loss function
When training a deep neural network, because the output of the network is expected to be as close as possible to the value that is actually wanted, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
(3) Back propagation algorithm
During training, a neural network can use the back propagation (BP) algorithm to correct the parameters of the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters of the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation process dominated by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
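The loss function and back propagation described above can be combined into a single training step. The PyTorch sketch below is illustrative only: the one-layer model, the MSE objective, and the mask-shaped dummy tensors are assumptions, not the patent's networks.

```python
import torch

# Assumed toy setup: a one-layer mask estimator trained with MSE.
model = torch.nn.Sequential(torch.nn.Linear(257, 257), torch.nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

noisy_features = torch.rand(8, 257)          # dummy noisy-speech features
target_mask = torch.rand(8, 257)             # dummy mask targets (e.g. IRM)

optimizer.zero_grad()
predicted_mask = model(noisy_features)       # forward pass
loss = loss_fn(predicted_mask, target_mask)  # larger loss = larger difference
loss.backward()                              # back-propagate the error loss
optimizer.step()                             # adjust the weights to reduce the loss
```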
As shown in fig. 2, an embodiment of the present application provides a system architecture 100. In fig. 2, a data acquisition device 160 is used to acquire training data.
After the training data is collected, the data collection device 160 stores the training data in the database 130 and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130.
The training device 120 obtains the target model/rule 101 based on the training data, and the training device 120 processes the input raw data and compares the output data with the raw data until the difference between the output data of the training device 120 and the raw data is less than a certain threshold, thereby completing the training of the target model/rule 101.
The above-mentioned target model/rule 101 can be used to implement the speech enhancement method according to the embodiment of the present application, and the above-mentioned training device can be used to implement the method for training the neural network according to the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may be specifically a neural network. In practical applications, the training data maintained in the database 130 is not necessarily collected by the data collecting device 160, but may be received from other devices. It should be noted that the training device 120 is not necessarily completely based on the training data maintained by the database 130 to perform training of the target model/rule 101, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 101 obtained by training with the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in FIG. 2. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In FIG. 2, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140. In the embodiment of the present application, the input data may include: an image to be processed input by the client device.
The preprocessing module 113 and the preprocessing module 114 are used for preprocessing according to the input data (such as an image to be processed) received by the I/O interface 112, and in the embodiment of the present application, the preprocessing module 113 and the preprocessing module 114 (or only one of the preprocessing modules) may be omitted, and the computing module 111 may be directly used for processing the input data.
In preprocessing input data by the execution device 110, or in performing processing related to computation or the like by the computation module 111 of the execution device 110, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 2, the user may manually give input data that may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 2, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 2, the training device 120 trains to obtain the target model/rule 101, where the target model/rule 101 may be the neural network in the embodiment of the present application, and specifically, the neural network provided in the embodiment of the present application may be a CNN, a deep convolutional neural network (deep convolutional neural networks, DCNN), a recurrent neural network (recurrent neural network, RNN), or the like.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 3. As described in the basic concept introduction above, the convolutional neural network is a deep neural network with a convolutional structure and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to an image input thereto.
The structure of the neural network specifically adopted by the voice enhancement method and the model training method according to the embodiment of the application can be shown in fig. 3. In fig. 3, convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230. The input layer 210 may acquire an image to be processed, and process the acquired image to be processed by the convolution layer/pooling layer 220 and the following neural network layer 230, so as to obtain a processing result of the image. The internal layer structure of the CNN 200 of fig. 3 is described in detail below.
Convolution layer/pooling layer 220:
convolution layer:
The convolutional layer/pooling layer 220 shown in fig. 3 may include, as examples, layers 221-226. For example: in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The internal principle of operation of one convolution layer will be described below using the convolution layer 221 as an example.
The convolutional layer 221 may include a plurality of convolution operators, also known as kernels, which act in image processing as filters that extract specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is typically predefined. During a convolution operation on an image, the weight matrix is typically processed on the input image in the horizontal direction, one pixel after another (or two pixels after two pixels, and so on, depending on the value of the stride), to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), i.e., multiple matrices of the same shape, are applied. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" described above. Different weight matrices may be used to extract different features in the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a particular color of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on. The multiple weight matrices have the same size (rows × columns), the convolution feature maps extracted by weight matrices of the same size also have the same size, and the extracted convolution feature maps of the same size are combined to form the output of the convolution operation.
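As a minimal sketch only (the input size, kernel count, and stride below are assumptions for illustration, not values from the patent), the following Python code slides several weight matrices of the same size over an input image and stacks their outputs along the depth dimension:

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    # image: (H, W, C_in); kernels: (N, kH, kW, C_in) -- the depth of each kernel
    # matches the depth of the input image, as described above.
    n, kh, kw, _ = kernels.shape
    h, w, _ = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w, n))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            for k in range(n):
                out[i, j, k] = np.sum(patch * kernels[k])  # one value per kernel
    return out  # the depth dimension equals the number of weight matrices

image = np.random.rand(32, 32, 3)      # toy input image
kernels = np.random.rand(8, 3, 3, 3)   # 8 weight matrices of the same size
features = conv2d(image, kernels)
print(features.shape)                  # (30, 30, 8)
```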
The weight values in the weight matrices are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can be used for extracting information from an input image, so that the convolutional neural network 200 can perform correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (e.g., 221) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 200 increases, features extracted by the later convolutional layers (e.g., 226) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 221-226 illustrated by 220 in fig. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may calculate the pixel values in the image over a particular range to produce an average value as the result of average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
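A minimal sketch of the two pooling operators described above (the 2x2 window size is an assumption for illustration):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    # each output pixel summarizes the corresponding size x size sub-region
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = x[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(feature_map, mode="max"))   # 2x2 output, maximum of each sub-region
print(pool2d(feature_map, mode="avg"))   # 2x2 output, average of each sub-region
```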
Neural network layer 230:
After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the desired output information. Because, as previously described, the convolution/pooling layer 220 will only extract features and reduce the parameters imposed by the input image. However, in order to generate the final output information (the required class information or other relevant information), convolutional neural network 200 needs to utilize neural network layer 230 to generate the output of the required number of classes or a set of classes. Thus, multiple hidden layers (231, 232 to 23n as shown in fig. 3) may be included in the neural network layer 230, and the output layer 240, where parameters included in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the multiple hidden layers in the neural network layer 230, the final layer of the overall convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross-entropy loss, which is specifically used for calculating the prediction error. Once the forward propagation of the overall convolutional neural network 200 (e.g., propagation from 210 to 240 in fig. 3) is completed, the backward propagation (e.g., propagation from 240 to 210 in fig. 3) begins to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the desired result.
The structure of the neural network specifically adopted by the voice enhancement method and the model training method according to the embodiment of the present application may be as shown in fig. 4. In fig. 4, a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230. In contrast to fig. 3, the multiple convolutional layers/pooling layers 220 in fig. 4 are in parallel, and the separately extracted features are all input to the neural network layer 230 for processing.
It should be noted that the convolutional neural network shown in fig. 3 and fig. 4 is only an example of two possible convolutional neural networks used in the speech enhancement method and the model training method according to the embodiments of the present application, and in specific applications, the convolutional neural network used in the speech enhancement method and the model training method according to the embodiments of the present application may also exist in the form of other network models.
Fig. 5 is a hardware structure of a chip according to an embodiment of the present application, where the chip includes a neural network processor. The chip may be provided in an execution device 110 as shown in fig. 2 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 2 for completing the training work of the training device 120 and outputting the target model/rule 101. The algorithms of the layers in the convolutional neural network as shown in fig. 3 or fig. 4 may be implemented in a chip as shown in fig. 5.
The neural network processor NPU is mounted as a coprocessor on a host central processing unit (central processing unit, CPU, host CPU), and tasks are allocated by the host CPU. The core part of the NPU is the arithmetic circuit 303, and the controller 304 controls the arithmetic circuit 303 to extract data from a memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuitry 303 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 302 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 301 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 308.
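A plain software analogue of this computation (the matrix sizes are arbitrary assumptions; this is not the systolic dataflow of the circuit itself), showing partial results being accumulated into C the way the accumulator 308 holds them:

```python
import numpy as np

A = np.random.rand(4, 3)   # input matrix A (taken from the input memory)
B = np.random.rand(3, 5)   # weight matrix B (taken from the weight memory)

C = np.zeros((4, 5))       # accumulator for partial results
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])   # accumulate one partial product at a time
assert np.allclose(C, A @ B)          # final result equals the full matrix product
```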
The vector calculation unit 307 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 307 may be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 307 can store the vector of processed outputs to the unified buffer 306. For example, the vector calculation unit 307 may apply a nonlinear function to an output of the operation circuit 303, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 307 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 303, for example for use in subsequent layers in a neural network.
The unified memory 306 is used for storing input data and output data.
The storage unit access controller (direct memory access controller, DMAC) 305 transfers the input data in the external memory to the input memory 301 and/or the unified memory 306, stores the weight data in the external memory into the weight memory 302, and stores the data in the unified memory 306 into the external memory.
A bus interface unit (bus interface unit, BIU) 310 is used for implementing interaction between the host CPU, the DMAC, and the instruction fetch memory 309 via the bus.
An instruction fetch memory (instruction fetch buffer) 309 connected to the controller 304, for storing instructions used by the controller 304;
the controller 304 is configured to invoke an instruction cached in the instruction fetch memory 309, so as to control a working process of the operation accelerator.
Typically, the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are on-chip (On-Chip) memories, and the external memory is a memory external to the NPU. The external memory may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM for short), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
The operations of the layers in the convolutional neural network shown in fig. 3 or fig. 4 may be performed by the arithmetic circuit 303 or the vector calculation unit 307.
As shown in fig. 6, an embodiment of the present application provides a system architecture. The system architecture includes a local device 401, a local device 402, and an execution device 210 and data storage system 150, wherein the local device 401 and the local device 402 are connected to the execution device 210 through a communication network.
The execution device 210 may be implemented by one or more servers. Alternatively, the execution device 210 may be used with other computing devices, such as: data storage, routers, load balancers, etc. The execution device 210 may be disposed on one physical site or distributed across multiple physical sites. The execution device 210 may use data in the data storage system 150 or invoke program code in the data storage system 150 to implement the speech enhancement method or the method of training a neural network of embodiments of the present application.
Through the above-described process, the execution device 210 can obtain a target neural network, and the target neural network can be used for voice enhancement, voice recognition processing, or the like.
The user may operate respective user devices (e.g., local device 401 and local device 402) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, computer workstation, smart phone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set top box, game console, etc.
The local device of each user may interact with the performing device 210 through a communication network of any communication mechanism/communication standard, which may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
In one implementation, the local device 401, the local device 402 obtains relevant parameters of the target neural network from the execution device 210, deploys the target neural network on the local device 401, the local device 402, and uses the target neural network for voice enhancement or voice recognition, and so on.
In another implementation, the target neural network may be deployed directly on the execution device 210, where the execution device 210 performs voice enhancement or other type of voice processing on the voice to be enhanced according to the target neural network by acquiring the image to be processed from the local device 401 and the local device 402.
The execution device 210 may also be referred to as a cloud device, where the execution device 210 is typically deployed in the cloud.
The execution device 110 in fig. 2 described above is capable of executing the voice enhancement method of the embodiment of the present application, the training device 120 in fig. 2 is capable of executing the steps of the method for training a neural network of the embodiment of the present application, and the CNN models shown in fig. 3 and fig. 4 and the chip shown in fig. 5 may also be used to execute the steps of the voice enhancement method and the model training method of the embodiment of the present application. The voice enhancement method and the model training method according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 7 is a schematic flow chart of a voice enhancement method according to an embodiment of the present application.
As shown in fig. 7, a voice enhancement method provided by an embodiment of the present application may include the following steps:
701. and acquiring the voice to be enhanced and the reference image.
In the present application, the voice to be enhanced may be acquired through a multi-channel microphone array, or may be acquired through a single audio channel (hereinafter referred to as a single channel).
Mono voice enhancement utilizes only time-domain and frequency-domain information, whereas microphone-array voice enhancement utilizes not only time-domain and frequency-domain information but also spatial-domain information. Because the time-domain and frequency-domain information plays the leading role in sound source separation while the spatial-domain information plays only an auxiliary role, the voice to be enhanced in the scheme provided by the present application can be acquired through a single-channel microphone.
It should be noted that obtaining the voice to be enhanced through a single audio channel is the more preferable scheme provided by the embodiment of the present application. Mono voice enhancement requires relatively low hardware cost, can form a general solution, and is widely used in various products. However, a complex environment can limit the effectiveness of a mono acoustic probability model, so the task of mono speech enhancement is more difficult. The scheme provided by the present application can provide visual information for the acoustic model to enhance the effect of the voice noise reduction model. With the development of the fifth generation mobile communication technology (5th generation mobile networks or 5th generation wireless systems, 5G), video calls and cameras are increasingly used in 5G smart homes, so the single-channel voice enhancement method provided by the present application can be widely applied in the near future.
The reference image related in the technical scheme provided by the application can be acquired by equipment such as a camera, a video camera and the like which can record images or pictures. The acquisition of the speech and reference images to be enhanced is illustrated below in connection with several exemplary scenarios to which the present application may be applied. It should be noted that the following description of several exemplary scenarios is merely illustrative of possible applicable scenarios of the solution provided by the present application, and does not represent all the scenarios applicable to the solution provided by the present application.
Scene one: video voice call
Fig. 8 is a schematic diagram of an application scenario of a solution provided in an embodiment of the present application. As shown by a in fig. 8, device A is establishing a video voice call with device B. Device A and device B may each be a mobile phone, a tablet, a notebook computer, or an intelligent wearable device. Assuming that device A adopts the scheme provided by the present application, during the video voice call established between device A and device B, the voice acquired by device A is the voice to be enhanced, and the voice to be enhanced may include the voice of the user of device A and the noise of the surrounding environment. The image acquired by device A is the reference image, where the reference image may be an image of the area at which the camera lens of device A is aimed. For example, when the user of device A aims the camera at his or her face (it should be noted that when the difference between the camera lens and the camera is not emphasized in the present application, both express the same meaning, namely a device for recording images or pictures), the reference image is the face of the user of device A. Alternatively, if the user of device A does not aim the camera at himself or herself during the video voice call but at the surrounding environment, the reference image is the surrounding environment.
The technical scheme provided by the present application enhances voice in combination with image information, and in particular with image information of a human face, so a better voice enhancement effect can be achieved when the camera is aimed at a face. To help the user better perceive the voice enhancement effect brought by the scheme provided by the present application, in a specific scenario the user may be prompted that aiming the camera at the face will yield a better voice enhancement effect. As shown by b in fig. 8, which is a schematic diagram of another applicable scenario of the scheme provided by the present application, taking device A as an example and assuming that device A adopts the scheme provided by the present application, a text prompt may be displayed in the window of the video dialogue during the video voice call established with device B. For example, as shown by b in fig. 8, during the video call, text such as "Aim the camera at your face for a better voice effect", "Please aim the camera at your face", or "Voice enhancement in progress, please aim the camera at your face" is displayed in the video window. Alternatively, as shown by c in fig. 8, during the video call, if device A detects that the user has aimed the camera at the face, no prompt is made; when it is detected that the user of device A has not aimed the camera at the face but at the environment, a text prompt is displayed in the video window, for example, "Aim the camera at your face for a better voice effect" or "Please aim the camera at your face". It should be noted that, after the user knows this function, the user can choose to turn off the text prompt. That is, once the user knows that aiming the camera at the face during a video voice call yields a better voice enhancement effect, the user can actively turn off the text prompt function; it may also be preset that a device adopting this scheme displays the text prompt only during the first video voice call.
Scene II: conference recording
Fig. 9 is a schematic diagram of another applicable scenario provided in an embodiment of the present application. Currently, in order to improve working efficiency, coordinating the work of multiple people through a conference is an important means. In order to be able to trace back the conference content, recording the content of each speaker during the conference and organizing the conference records become essential requirements. At present, recording the speakers' speech content and organizing the conference records may take a variety of forms, such as manual shorthand by a secretary, or recording the whole conference first with recording equipment such as a recording pen and then manually collating the recorded content to form the conference record. However, these approaches are inefficient because of the need for manual intervention.
The voice recognition technology is introduced to the conference system to bring convenience to the arrangement of conference records, such as: in the conference system, the speaking content of the participants is recorded through the recording equipment, and the speaking content of the participants is identified by the voice identification software, so that conference records can be further formed, and the arrangement efficiency of the conference records is greatly improved. The scheme provided by the application can be applied to the scene of recording the conference, and further improves the effect of voice recognition. In this scenario, assuming that a is speaking at the conference, the speaking content of a may be recorded, and the images may be synchronously acquired while recording the speaking content of a. The speaking content of a is the voice to be enhanced, which may include pure voice of a and other noise generated in the conference, and the synchronously shot image is a reference image, and in a preferred embodiment, the reference image is a face image of a. In some practical cases, the photographer may not photograph the face of a in the whole course of speaking a, and then other obtained non-face images in the speaking a can be regarded as reference images in the scheme.
In another scenario, assuming that three persons a, B, C are speaking at the conference, the speaking content of at least one of the three persons a, B, C may be optionally enhanced. For example, when the speech content of a is selected to be enhanced, the face image of a may be synchronously captured during the speech of a, where the speech content of a is to-be-enhanced speech, the to-be-enhanced speech may include pure speech of a and other noise generated in the conference (such as other noise may be the speech content of B or the speech content of C), and the face image of a synchronously captured at this time is a reference image. When the speaking content of B is selected to be enhanced, the face image of B may be synchronously shot in the speaking process of B, where the speaking content of B is to-be-enhanced voice, the to-be-enhanced voice may include pure voice of B and other noise generated in the conference (such as other noise may be the speaking content of a or the speaking content of C), and the face image of B synchronously shot at this time is a reference image. When the speaking content of C is selected to be enhanced, the face image of C may be synchronously shot in the speaking process of C, where the speaking content of C is to-be-enhanced voice, the to-be-enhanced voice may include pure voice of C and other noise generated in the conference (such as other noise may be the speaking content of a or the speaking content of B), and the face image of C synchronously shot at this time is a reference image. Or when the speaking contents of A and B are selected to be enhanced, the face images of A and B can be synchronously shot in the speaking process of A and B, at this time, the speaking contents of A and B are voices to be enhanced, the voices to be enhanced can comprise pure voices of A and pure voices of B and other noise (such as the speaking contents of C) generated in the conference, and at this time, the face images of A and B synchronously shot are reference images. When the speaking contents of B and C are selected to be enhanced, in the process of speaking B and C, the speaking contents of B and C are to be enhanced, and the to-be-enhanced voice may include pure voice of B and pure voice of C and other noise (such as other noise may be speaking contents of a) generated in the conference, and at this time, the face images of B and C that are synchronously photographed are reference images. When the speaking contents of a and C are selected to be enhanced, in the speaking process of a and C, face images of a and C may be synchronously shot, where the speaking contents of a and C are to-be-enhanced voices, and the to-be-enhanced voices may include pure voices of a and pure voices of C and other noises (such as other noises may be speaking contents of B) generated in the conference, and at this time, the face images of a and C synchronously shot are reference images. Or when the speaking contents of a, B and C are selected to be enhanced, face images of a, B and C can be synchronously shot in the speaking process of a, B and C, at this time, the speaking contents of a, B and C are voices to be enhanced, the voices to be enhanced can include pure voices of a and B, pure voices of C and other noises (such as sounds made by other conferees except ABC or other environmental noises) generated in the conference, and at this time, the face images of a, B and C synchronously shot are reference images.
Scene III: voice interaction with wearable device
The wearable device referred to in this scenario is a portable device that may be worn directly on the body or integrated into the clothing or accessories of the user. For example, the wearable device may be a smart watch, a smart bracelet, smart glasses, or the like. Input methods and semantic understanding based on voice recognition are widely applied to wearable devices. Although touch control is still the main mode of communication between people and wearable devices at present, since the screens of wearable devices are generally small and the communication between people and wearable devices is mainly based on simple and direct tasks, voice will inevitably become the next-generation information entrance of wearable devices, which can free people's fingers and make communication with wearable devices more convenient and natural. However, these devices are usually used in relatively complex acoustic environments with interference from various sudden noises; for example, communication between people and mobile phones or wearable devices usually happens on the street or in a shopping mall, where the background noise is very loud. A complex noise environment usually causes a significant reduction in the speech recognition rate, and a reduced recognition rate means that these devices cannot accurately understand the user's instructions, which greatly degrades the user experience. The scheme provided by the present application can also be applied to the scenario of voice interaction with a wearable device. As shown in fig. 10, when a voice command of the user is acquired by the wearable device, a face image of the user can be acquired synchronously, and according to the scheme provided by the present application, the voice command of the user is enhanced, so that the wearable device can better recognize the user's command and make a corresponding response. In this scenario, the voice command of the user can be regarded as the voice to be enhanced, and the synchronously acquired face image is regarded as the reference image. Because the scheme provided by the present application introduces visual information, such as the reference image, into the voice enhancement process, it achieves good voice enhancement and voice recognition effects in environments with very loud background noise.
Scene four: voice interaction with smart home
The smart home (home automation) takes a home as a platform, integrates facilities related to home life by utilizing a comprehensive wiring technology, a network communication technology, a security technology, an automatic control technology and an audio-video technology, builds an efficient management system of home facilities and family schedule matters, improves home safety, convenience, comfort and artistry, and realizes an environment-friendly and energy-saving living environment. For example, smart homes may include smart lighting systems, smart curtains, smart televisions, smart air conditioners, and so forth. As shown in fig. 11, when the user sends a voice control instruction to the smart home, the user may specifically send a voice control instruction to the smart home directly, or send a voice control instruction to the smart home through other devices, such as a mobile phone, and send a voice control instruction to the smart home remotely. At this time, the image of the preset area can be acquired through the smart home or other devices. For example, when a user sends a voice control instruction to the smart home through the mobile phone, the mobile phone can acquire an image shot at the moment, and in the scene, the voice control instruction sent by the user is voice to be enhanced, and the synchronously shot image is a reference image. In a specific implementation scenario, when no face is detected in the preset area, a voice prompt may be sent to prompt the user to aim the camera at the face, for example, to send a prompt of "voice enhancement is being performed, please aim the camera at the face", etc.
702. And outputting a first enhancement signal of the voice to be enhanced according to the first neural network.
The first neural network is a neural network obtained by training on mixed data of voice and noise with an ideal ratio mask (ideal ratio mask, IRM) as the training target.
Time-frequency masking is a common target for voice separation. Common time-frequency masks include the ideal binary mask and the ideal ratio mask, both of which can obviously improve the intelligibility and perceptual quality of the separated voice. Once the time-frequency masking target has been estimated, the time-domain waveform of the voice can be synthesized through an inverse transformation technique without considering phase information. By way of example, a definition of the ideal ratio mask in the Fourier transform domain is given below:
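A commonly used form of the ideal ratio mask, consistent with the quantities explained below, is the following (the square-root exponent is a common convention and is assumed here):

IRM(t, f) = ( Ps(t, f) / ( Ps(t, f) + Pn(t, f) ) )^(1/2)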
Where Ys (t, f) is a short-time fourier transform coefficient of clean speech in the mixed data, yn (t, f) is a short-time fourier transform coefficient of noise in the mixed data, ps (t, f) is an energy density corresponding to Ys (t, f), and Pn (t, f) is an energy density corresponding to Yn (t, f).
The definition of the ideal float masking of the fourier transform domain is given above, and it should be noted that, after knowing the solution provided by the present application, one skilled in the art can easily think that other targets of speech separation can also be used as training targets of the first neural network. For example, short-time fourier transform masking, implicit time-frequency masking, etc. may also be employed as the training target for the first neural network. In other words, in the prior art, after voice separation is performed on mixed data of voice and noise through a certain neural network, the signal to noise ratio of an output signal of the neural network at any moment can be obtained, and the training target adopted by the neural network can be adopted by the scheme provided by the application.
The above-mentioned voice may be referred to as clean voice or pure voice, meaning voice without any noise. The mixed data of voice and noise refers to noise-added voice, that is, voice obtained by adding noise with a preset distribution to clean voice. In this embodiment, the clean voice and the noise-added voice are used as the voice to be trained.
Specifically, when generating the noise-added voice, a plurality of noise-added voices corresponding to one clean voice can be obtained by adding noises with different distributions to the clean voice. For example: adding noise with a first distribution to clean voice 1 to obtain noisy voice 1, adding noise with a second distribution to clean voice 1 to obtain noisy voice 2, adding noise with a third distribution to clean voice 1 to obtain noisy voice 3, and so on. Through the above noise-adding process, a plurality of data pairs of clean voice and noise-added voice can be obtained, for example: {clean voice 1, noisy voice 1}, {clean voice 1, noisy voice 2}, {clean voice 1, noisy voice 3}, and so on.
In the actual training process, a plurality of clean voices can be acquired first, and noises with different distributions are added to each clean voice, so that a large number of {clean voice, noise-added voice} data pairs are obtained. These data pairs are used as the voice to be trained. For example, 500 sentences from mainstream newspaper media, covering as many pronunciations as possible, can be selected, and 100 different people are chosen to read them as the clean voice signals (i.e., the simulated clean voice corresponding to the noisy voice). Then 18 types of common living noise, such as noise from public scenes, traffic, working scenes, and coffee shops, are selected and cross-synthesized with the clean voice signals to obtain noisy voice signals (equivalent to simulated noise-containing voice). The clean voice signals are matched one by one with the noisy voice signals to serve as labeled data. The data are randomly shuffled, 80% of the data are selected as the training set to train the neural network model, and the other 20% are used as the validation set to verify the result of the neural network model. The finally trained neural network model is equivalent to the first neural network in the embodiment of the present application.
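A minimal sketch of this data-preparation procedure (the signal lengths, signal-to-noise ratio, and placeholder arrays are assumptions for illustration, not values from the patent):

```python
import random
import numpy as np

def mix(clean, noise, snr_db=5.0):
    # scale the noise so that the mixture has the requested signal-to-noise ratio
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def build_pairs(clean_signals, noise_signals):
    pairs = []
    for clean in clean_signals:
        for noise in noise_signals:            # cross-synthesize clean x noise
            pairs.append((clean, mix(clean, noise)))
    random.shuffle(pairs)                       # randomly shuffle the labeled data
    split = int(0.8 * len(pairs))
    return pairs[:split], pairs[split:]         # 80% training set, 20% validation set

clean_signals = [np.random.randn(16000) for _ in range(3)]   # placeholder audio
noise_signals = [np.random.randn(16000) for _ in range(2)]
train_set, val_set = build_pairs(clean_signals, noise_signals)
print(len(train_set), len(val_set))
```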
After the training of the first neural network is finished, when the voice is enhanced, the voice to be enhanced is converted into a two-dimensional time-frequency signal, and the two-dimensional time-frequency signal is input into the first neural network to obtain a first enhanced signal of the voice to be enhanced.
A short-time Fourier transform (short-time Fourier transform, STFT) may be used to perform time-frequency conversion on the voice signal to be enhanced, so as to obtain the two-dimensional time-frequency signal of the voice to be enhanced. The present application also refers to time-frequency conversion as feature transformation; when the difference between the two is not emphasized, both express the same meaning. The present application also refers to the two-dimensional time-frequency signal as the frequency-domain feature; when the difference between the two is not emphasized, both express the same meaning. This is illustrated below, assuming the expression of the voice to be enhanced is as follows:
y(t)=x(t)+n(t)
Wherein y (t) represents a time domain signal of the voice to be enhanced at the time t, x (t) represents a time domain signal of the clean voice at the time t, and n (t) represents a time domain signal of the noise at the time t. STFT transformation is performed on the voice to be enhanced, and the STFT transformation can be expressed as follows:
Y(t,d) = X(t,d) + N(t,d), t = 1,2,...,T; d = 1,2,...,D
Wherein Y(t,d) represents the frequency-domain signal of the speech to be enhanced in the t-th acoustic feature frame and the d-th frequency band, X(t,d) represents the frequency-domain signal of the clean speech in the t-th acoustic feature frame and the d-th frequency band, and N(t,d) represents the frequency-domain signal of the noise in the t-th acoustic feature frame and the d-th frequency band. T and D represent the total number of acoustic feature frames and the total number of frequency bands of the signal to be enhanced, respectively.
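A minimal sketch of the time-frequency conversion (the use of scipy, the sampling rate, and the frame parameters are assumptions for illustration; the patent does not name a library):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                              # assumed sampling rate in Hz
y = np.random.randn(fs)                 # placeholder for y(t), 1 second of audio
f, t, Y = stft(y, fs=fs, nperseg=512, noverlap=256)
print(Y.shape)                          # (D frequency bands, T acoustic feature frames)
```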
The feature transformation of the voice signal is not limited to the STFT; other methods, such as the Gabor transform and the Wigner-Ville distribution, may be used in other embodiments. Any manner in the prior art of performing feature transformation on a voice signal to obtain its two-dimensional time-frequency signal may be adopted in the embodiment of the present application. In a specific embodiment, in order to accelerate the convergence of the neural network, the frequency-domain features obtained after the feature transformation may be normalized. For example, the mean may be subtracted from the frequency-domain features and the result divided by the standard deviation to obtain normalized frequency-domain features. In a specific embodiment, the normalized frequency-domain features may be used as the input of the first neural network to obtain the first enhancement signal, which may be represented by the following formula; the first neural network may be, for example, a long short-term memory (long short-term memory, LSTM) network:
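A plausible form of this formula, consistent with the explanation in the next paragraph, is (the exact notation of the original may differ):

LSTM( g(α_j) ) ≈ Ps(a_clean, j) / ( Ps(a_clean, j) + Ps(a_noise, j) )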
The right side of the above formula, which has been described above, is the training target IRM. In this formula, Ps(a_clean, j) represents the energy spectrum (which may also be referred to as the energy density) of the clean signal at time j, and Ps(a_noise, j) represents the energy spectrum of the noise signal at time j. The left side of the formula represents the approximation of the training target by the neural network. α_j represents the input of the neural network, which in this embodiment may be the frequency-domain feature, and g() represents a functional relationship; for example, here it may be the functional relationship in which the input of the neural network is normalized by subtracting the mean and dividing by the standard deviation and then logarithmically transformed.
It should be noted that the LSTM is only an example; the first neural network of the present application may be any time-sequence model, that is, a model that can provide a corresponding output at each time step, so as to ensure the real-time performance of the model. After the first neural network is trained, its weights can be frozen, that is, the weight parameters of the first neural network are kept unchanged, so that the second neural network or other neural networks cannot affect the performance of the first neural network. This ensures that, in the absence of the visual modality (that is, when the reference image does not include face information or lip information), the model can still produce an output according to the first neural network, which guarantees the robustness of the model.
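A minimal sketch of freezing the first neural network's weights (PyTorch and the layer sizes are assumptions; the patent does not name a framework):

```python
import torch
import torch.nn as nn

# assumed sizes, for illustration only
first_net = nn.LSTM(input_size=257, hidden_size=256, batch_first=True)
for param in first_net.parameters():
    param.requires_grad = False          # freeze: the weights stay unchanged
first_net.eval()

# only the second network's parameters are handed to the optimizer,
# so training it cannot affect the first network's performance
second_net = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
optimizer = torch.optim.Adam(second_net.parameters(), lr=1e-3)
```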
703. And outputting a masking function of the reference image according to the second neural network.
The masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the frequency band energy being smaller than the preset value indicates that the voice to be enhanced corresponding to the reference image is noise, and the frequency band energy being not smaller than the preset value indicates that the voice to be enhanced corresponding to the reference image is clean voice. The second neural network is a neural network obtained by training the image including the lip feature corresponding to the sound source of the voice used by the first neural network with the ideal binary mask (ideal binary mask, IBM) as a training target.
From a physiological point of view, the volume, timbre, etc. of the same utterance made by different persons are different, so the spectrum of each person's pronunciation differs, but their energy distributions are the same. In other words, the energy distribution of a pronunciation can be normalized with respect to factors of the original audio such as the speaker and the volume, which is why syllables can be inferred from the formants of the audio. We therefore model the energy distribution of the clean signal and fit this energy distribution with the image of the human mouth. In fact, it is difficult to directly fit the above energy distribution to the image of the mouth: a person's pronunciation is determined not by the mouth shape but by the shape of the resonant cavity in the mouth and the position of the tongue, and the image of the mouth cannot accurately reflect these factors, so that videos of the same mouth shape can correspond to different pronunciations, that is, they cannot be mapped one to one. Therefore, we have devised this weak-correlation (weak reference) approach, transforming the original fine distribution into a coarse distribution by binarization so as to facilitate fitting at the image end. The coarse distribution characterizes whether the mouth shape corresponds to the sounding status of a certain group of frequency bands. The present application establishes a mapping relationship between the frequency band energy of the image and the frequency band energy of the voice through the second neural network, and specifically establishes an association between the energy of each frequency band of the image frame at each moment and the energy of each frequency band of the acoustic feature frame at each moment.
The training targets of the second neural network and the data used for training are described below, respectively.
The training objective IBM of the second neural network is a sign function, whose definition is explained below by the following expression.
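A plausible form of this expression, consistent with the explanation given below, is (the per-moment indexing is an assumption here):

IBM_j = 1, if dist(a_j) − threshold ≥ 0
IBM_j = 0, if dist(a_j) − threshold < 0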
Wherein the dist function is an energy distribution function, whose definition is explained as follows:
Where j refers to time j, that is, the moment at which the duration of the j-th frame ends. Each frame may include a plurality of frequency bands, for example k frequency bands, where a_k^j denotes the k-th frequency band of the clean voice at time j and k is a positive integer. The number of frequency bands included at each moment may be preset; for example, one moment may include 4 frequency bands, or one moment may include 5 frequency bands, which is not limited in the embodiment of the present application. P_S(a_k^j) refers to the energy spectrum of the k-th frequency band of the clean signal at time j. dist(a_j) thus characterizes the distribution of the audio energy over the k frequency bands corresponding to time j. threshold is a predetermined threshold; in one embodiment, threshold is typically 10^-5. If the difference between dist(a_j) and threshold is greater than or equal to 0, that is, dist(a_j) is not smaller than threshold, dist(a_j) is considered to be voice dominant, or it cannot be judged whether dist(a_j) is voice dominant or noise dominant, and the corresponding function value is set to 1. If the difference between dist(a_j) and threshold is less than 0, that is, dist(a_j) is smaller than threshold, dist(a_j) is considered to be noise dominant, and its corresponding function value is set to 0.
The training data of the second neural network is a corresponding image including lip features at the source of the speech employed by the first neural network. For example, in step 702, 500 sentences such as the main journal media may be selected to include all the utterances as much as possible, and then 100 different persons are selected to read as clean voice signals (i.e. the simulated clean voice corresponding to the noisy voice), and the training data of the second neural network may include face images of the 100 different persons, or mouth images of the 100 different persons, or images including faces of the 100 different persons, such as images of the upper body. It should be noted that, the training data of the second neural network is not only an image including lip features corresponding to the sound source of the voice adopted by the first neural network, but also some image data not including lip features or data not including a face image.
The fitting at the image end is explained in detail below with reference to the following formula.
V represents the training data, which has been described above and is not repeated here. sigmoid is defined as sigmoid(x) = 1 / (1 + e^(-x)). Sigmoid is an activation function; the energy of each frequency band of the image at each moment is represented by it, and a neural network, such as the LSTM used in the above formula, makes the value of the sigmoid output approach the value of dist(a_j) − threshold. f() represents a feature extraction function. It should be noted that the sigmoid is only an example; other activation functions may also be adopted to approach the training target in the embodiment of the present application.
Furthermore, in one particular embodiment, the processed image frames of the second neural network may be time-sequentially aligned with the acoustic feature frames of the first neural network. By means of the alignment of the time sequences, it can be ensured that in a subsequent flow, the data output by the second neural network processed at the same time corresponds to the data output by the first neural network. By way of example, assume that there is a segment of video that includes 1 frame of image frames and 4 frames of acoustic feature frames. The multiple relation between the number of the image frames and the number of the acoustic frames can be determined by resampling the video segment according to a preset frame rate, for example, resampling the image data included in the video segment according to the frame rate of the image frames being 40 frames/s, and resampling the audio data included in the video segment according to the frame rate of the acoustic feature frames being 10 frames/s. In this video, the 1-frame image frame is aligned in time with the 4-frame acoustic feature frame. In other words, the duration of the 1-frame image frame is aligned with the duration of the 4-frame acoustic feature frame. In this scenario, the first neural network processes the 4 frames of acoustic feature frames, the second neural network processes the 1 frame of image frames, and the processed image frames of the second neural network are aligned in time series with the acoustic feature frames of the first neural network, in this example, so that the 4 frames of acoustic feature frames remain aligned in time with the 1 frame of image frames during processing by the first and second neural networks, and after processing is complete. Furthermore, according to the scheme provided by the application, after the time alignment processing is carried out on the 1-frame image frame through the second neural network, 4-frame image frames corresponding to the 4-frame acoustic feature frames can be obtained, and the masking function corresponding to the 4-frame image frames is output. The following describes a time series alignment manner in detail.
In a specific embodiment, the speech to be enhanced includes a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, the first image frame is input data of a second neural network, and the masking function for outputting the image according to the second neural network includes: outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame, so that the first moment is the moment corresponding to the first acoustic feature frame. For example, in the above formula, m represents a multiple, and is determined according to a ratio of a frame rate of the first acoustic feature frame to a frame rate of the first image frame. For example, if the frame rate of the first acoustic feature frame is 10 frames/s and the frame rate of the first image frame is 40 frames/s, the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame is 1/4 (10/40), and then m takes 4 in the above formula. For example, if the frame rate of the first acoustic feature frame is 25 frames/s and the frame rate of the first image frame is 50 frames/s, the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame is 1/2 (25/50), and then m takes 2 in the above formula. For a clearer explanation of time alignment, reference is made to fig. 12 for a further explanation with reference to m for 4. Fig. 12 is a schematic diagram of time series alignment according to an embodiment of the present application. As shown in fig. 12, the white box in the figure represents the input image frame of the second neural network, and as shown in fig. 12, the image frame of the 4-frame input is shown. Assuming that the duration of the input 1 frame image frame is the same as the duration of the 4 frames of acoustic feature frames, that is, when m takes 4, after the time sequence alignment process of the second neural network, the input one frame image frame corresponds to the 4 frames of processed image frames, and the duration of each frame of the 4 frames of processed image frames is the same as the duration of the acoustic frames. As shown in fig. 12, the black box represents the image frame after the second neural network time alignment processing, the second neural network outputs a masking function of the image frame after the alignment processing, and as shown in fig. 12, includes 16 image frames after the time alignment processing in total, and outputs a masking function corresponding to the 16 image frames after the time alignment processing. The 16 image frames are each aligned in time with one acoustic feature frame, in other words, 1 image frame represented by a white box is aligned in time with 4 acoustic feature frames, and 1 image frame represented by a black box is aligned in time with 1 acoustic feature frame.
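A minimal sketch of this time-sequence alignment (the frame counts follow the fig. 12 example; repeating each image frame m times is one simple way to realize the alignment and is an assumption here):

```python
import numpy as np

def align_image_frames(image_frames, m):
    # image_frames: (num_image_frames, H, W)
    # each input image frame (white box) is expanded into m consecutive
    # processed image frames (black boxes), so that every acoustic feature
    # frame has one time-aligned image frame
    return np.repeat(image_frames, m, axis=0)

image_frames = np.random.rand(4, 64, 64)          # 4 input image frames
aligned = align_image_frames(image_frames, m=4)   # 16 aligned image frames
print(aligned.shape)                              # (16, 64, 64)
```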
After the training of the second neural network is completed, the reference image is input into the second neural network during voice enhancement to obtain the masking function of the reference image. In an actual implementation, some preprocessing may be performed on the reference image, and the preprocessed reference image is input into the second neural network; for example, the reference image may be resampled to a specified image frame rate. Face feature extraction may be performed on the reference image to obtain a face image, and the face feature extraction may be performed through a face feature extraction algorithm. Face feature extraction algorithms include recognition algorithms based on face feature points, recognition algorithms based on the whole face image, template-based recognition algorithms, and the like. For example, face detection may be based on a face feature point detection algorithm. Face feature extraction may also be performed through a neural network; for example, the face features may be extracted through a convolutional neural network model, such as face detection based on a multi-task convolutional neural network. The face image obtained through face feature extraction may be used as the input of the second neural network. The second neural network may further process the face image; for example, it may extract the image frames corresponding to the motion features of the human mouth and perform time-sequence alignment processing on these image frames.
704. And determining a second enhancement signal of the voice to be enhanced according to the operation result of the first enhancement signal and the masking function.
The present embodiment may output the first enhancement signal through a first neural network and output the masking function of the reference image through a second neural network. Because the second neural network establishes a mapping relationship between the image information and the frequency band energy of the speech, the masking function can indicate whether the frequency band energy corresponding to the reference image is smaller than a preset value; frequency band energy smaller than the preset value indicates that the corresponding frequency band of the speech to be enhanced is noise, and frequency band energy not smaller than the preset value indicates that it is clean speech. Compared with the first enhancement signal alone, that is, compared with a scheme that performs speech enhancement with only a single neural network, the second enhancement signal determined from the operation result of the first enhancement signal and the masking function can achieve a better speech enhancement effect. For example, assume that for a first frequency band included in the speech to be enhanced at a certain moment, the first neural network outputs a signal-to-noise ratio A for that band, where A indicates that the first neural network judges the band to be speech dominant, while the second neural network outputs a band energy B that is smaller than the preset value, that is, B indicates that the second neural network judges the band to be noise dominant. A mathematical operation, such as one or more of addition, multiplication or squaring, may then be performed on A and B, and the proportions of A and B in the finally output second enhancement signal may be determined from the result of that operation. The principle of operating on the first enhancement signal and the masking function is that the actual meaning of the masking function is to measure whether a given frequency band carries sufficient energy. When the first enhancement signal output by the first neural network is inconsistent with the masking function output by the second neural network, the situation is one of the following:
The second neural network outputs a small value and the first neural network outputs a large value: the first neural network (audio side) considers that the energy in a certain frequency band (such as the first frequency band) forms a pronunciation, while the second neural network (video side) considers that the person's mouth shape cannot produce the corresponding sound;
The second neural network outputs a large value and the first neural network outputs a small value: the first neural network (audio side) considers that the frequency band (such as the first frequency band) has no energy forming a pronunciation, while the second neural network (video side) considers that the person's mouth shape is producing some possible sound;
Through the operation on the first enhancement signal and the masking function, the inconsistent part is scaled down to a smaller value while the consistent part is kept unchanged, yielding a new, fused second enhancement signal in which the band energy that does not correspond to a pronunciation, or that is inconsistent between the audio and the video, is compressed to a smaller value.
As can be seen from the embodiment corresponding to fig. 7, the first neural network is used to output the first enhancement signal of the speech to be enhanced, and the second neural network is used to model the association between the image information and the speech information, so that the masking function of the reference image output by the second neural network can indicate whether the speech to be enhanced corresponding to the reference image is noise or speech. With the technical scheme provided by the application, image information can be applied to the speech enhancement process, and in relatively noisy environments both the enhancement capability and the listening experience can be improved considerably.
In the embodiment corresponding to fig. 7 above, it is described that the second enhancement signal of the speech to be enhanced may be determined based on the operation result of the first enhancement signal and the masking function. A preferred scheme is given below in which the second enhancement signal of the speech to be enhanced is determined through a third neural network, specifically according to the weight output by the third neural network. The weight indicates the output proportions of the first enhancement signal and of the correction signal in the second enhancement signal, where the correction signal is the operation result of the masking function and the first enhancement signal. The third neural network is a neural network obtained by training, with the IRM as the training target, on the output data of the first neural network and the output data of the second neural network.
Fig. 13 is a schematic flow chart of another voice enhancement method according to an embodiment of the present application.
As shown in fig. 13, another voice enhancement method provided by an embodiment of the present application may include the following steps:
1301. and acquiring the voice to be enhanced and the reference image.
Step 1301 can be understood with reference to step 701 in the corresponding embodiment of fig. 7, and the detailed description is not repeated here.
1302. And outputting a first enhancement signal of the voice to be enhanced according to the first neural network.
Step 1302 may be understood by referring to step 702 in the corresponding embodiment of fig. 7, and the detailed description is not repeated here.
1303. And outputting a masking function of the reference image according to the second neural network.
Step 1303 can be understood with reference to step 703 in the corresponding embodiment of fig. 7, and the detailed description will not be repeated here.
In a specific embodiment, the method may further include: it is determined whether the reference image includes face information. And outputting a masking function of the reference image according to the second neural network if the reference image is determined to comprise face information.
1304. And determining a second enhancement signal according to the weight value output by the third neural network.
And taking the first enhancement signal and the masking function as input data of the third neural network, and determining the second enhancement signal according to the weight value output by the third neural network. The weight indicates the output ratio of the first enhancement signal and the correction signal in the second enhancement signal, and the correction signal is the operation result of the masking function and the first enhancement signal. The third neural network is a neural network which is obtained by training the output data of the first neural network and the output data of the second neural network by taking the IRM as a training target.
The third neural network is trained on the output data of the first neural network and the output data of the second neural network, specifically on multiple groups of first enhancement signals output by the first neural network during training and multiple groups of masking functions output by the second neural network during training. Since the second neural network time-series aligns the image frames with the acoustic feature frames used by the first neural network in step 1302, the output of the first neural network and the output of the second neural network received by the third neural network at the same moment are time-aligned data. The third neural network may be trained on the operation results of the first enhancement signal and the masking function; the mathematical operation between the first enhancement signal and the masking function has already been described above and is not repeated here. The present application does not limit the type of the third neural network. By way of example, when the third neural network is an LSTM and the mathematical operation between the first enhancement signal and the masking function is a multiplication, the third neural network is trained on the output data of the first neural network and the output data of the second neural network to output a weight (gate), which can be expressed by the following formula:
gate=LSTM(IBM×IRM)
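The patent does not fix the architecture beyond the formula above; purely as an assumed sketch, a PyTorch-style gate network that consumes the per-band product IBM×IRM over time and outputs a per-band weight in (0, 1) could look as follows (the layer sizes and the sigmoid projection are illustrative choices).

```python
import torch
import torch.nn as nn

class GateNet(nn.Module):
    """Illustrative third-network sketch implementing gate = LSTM(IBM x IRM)."""

    def __init__(self, num_bands=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_bands, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, num_bands)

    def forward(self, irm, ibm):
        # irm, ibm: (batch, time, num_bands), already time-aligned as described above
        fused = irm * ibm                       # elementwise product per frequency band
        out, _ = self.lstm(fused)               # temporal modelling of the fused signal
        return torch.sigmoid(self.proj(out))    # gate in (0, 1) per band and per frame
```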
A few specific scenarios to which this scheme may apply were mentioned in step 701 above, where the reference image may include face information, in particular an image including face information at the sound source of the speech to be enhanced. In some scenarios the reference image may instead carry no face information; for example, it may be unrelated to the image at the sound source. The training data of the second neural network includes not only images that correspond to the sound source of the speech used by the first neural network and include lip features, but also some image data that does not include lip features or data that does not include a face image. So, in different scenarios, whether the output of the second neural network should be combined for enhancing the speech, and if so, in what proportions the output of the second neural network and the output of the first neural network contribute to the finally output second enhancement signal, are both determined by the weight output by the third neural network. Illustratively, taking the mathematical operation between the first enhancement signal and the masking function to be a multiplication, the second enhancement signal may be represented by the following formula, where IRM' represents the second enhancement signal:
IRM'=gate×(IBM×IRM)+(1-gate)×IRM
Since the output of the second neural network is not perfectly accurate, part of the first enhancement signal may be scaled down erroneously. A third neural network is therefore added so that, through the weight, the confident part is preserved and filled in by the first enhancement signal. This design also ensures that when the visual modality is absent (that is, no face signal or lip information is detected in the reference image), IRM' = IRM, i.e. the second enhancement signal equals the first enhancement signal, so the scheme provided by the application maintains good speech enhancement performance under different conditions.
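The behaviour of the formula can be seen from a tiny numerical sketch (the values are made up for illustration): as the gate approaches 1 the audio-visual product dominates, and as it approaches 0 the output falls back to the first enhancement signal.

```python
def fuse(irm, ibm, gate):
    # IRM' = gate * (IBM * IRM) + (1 - gate) * IRM
    return gate * (ibm * irm) + (1.0 - gate) * irm

# One frequency band where audio and video disagree: the audio side reports
# energy (IRM = 0.7) but the video side sees no matching mouth movement (IBM = 0).
irm, ibm = 0.7, 0.0
print(fuse(irm, ibm, gate=1.0))   # 0.0  -> visual evidence fully trusted, band suppressed
print(fuse(irm, ibm, gate=0.5))   # 0.35 -> partial suppression
print(fuse(irm, ibm, gate=0.0))   # 0.7  -> no face/lips detected, IRM' equals IRM
```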
In a specific embodiment, the correction signal is determined according to the result of a product operation between M signal-to-noise ratios and the masking function at a first moment, where M is a positive integer, the first enhancement signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is the masking function output by the second neural network at the first moment. This process is illustrated below in conjunction with fig. 14. Fig. 14 is a schematic flow chart of another voice enhancement method according to an embodiment of the present application. As shown in fig. 14, a distribution curve of the frequencies of the speech to be enhanced is given; the speech to be enhanced at the first moment includes one acoustic feature frame, and this acoustic feature frame includes 4 frequency bands. It should be noted that the first moment may be any moment corresponding to the speech to be enhanced, and the choice of 4 frequency bands at the first moment is merely illustrative: how many frequency bands each moment includes may be preset, for example one moment may be set to include 4 frequency bands, or 5 frequency bands, which is not limited in this embodiment of the present application. Let the signal-to-noise ratios of the 4 frequency bands be 0.8, 0.5, 0.1 and 0.6, respectively. The second neural network outputs the masking functions of the 4 frequency bands corresponding to the reference image at the first moment, because the second neural network time-series aligns the image frames with the acoustic feature frames of the first neural network, as described above. Let the masking functions corresponding to the 4 frequency bands be 1, 1, 0 and 1, respectively. The correction signal then includes 4 frequency bands whose energies are 0.8 (1 × 0.8), 0.5 (1 × 0.5), 0 (0 × 0.1) and 0.6 (1 × 0.6), respectively.
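The band-wise product in this example can be reproduced directly; the values below are the ones used in the paragraph above.

```python
import numpy as np

snr  = np.array([0.8, 0.5, 0.1, 0.6])   # signal-to-noise ratios of the 4 bands at the first moment
mask = np.array([1.0, 1.0, 0.0, 1.0])   # masking function values for the same 4 bands

correction = mask * snr                  # -> array([0.8, 0.5, 0. , 0.6]), the correction signal
print(correction)
```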
With this embodiment of the present application, the scheme provided by the application can support streaming decoding, with the latency theoretically bounded by the duration of one acoustic feature frame. Taking an acoustic feature frame duration of 10 ms as an example, the theoretical bound on the delay of the output second enhanced speech is therefore 10 ms. Because the second neural network outputs the masking function at the moments corresponding to the acoustic feature frames (see the description of time series alignment above, which is not repeated here), the third neural network receives the first enhancement signal corresponding to an acoustic feature frame, processes that first enhancement signal together with the masking function corresponding to the same moment, and outputs the second enhancement signal at that moment. Since the speech to be enhanced can be processed frame by frame, the second enhancement signal can also be played frame by frame. In other words, the speech to be enhanced can be processed in units of acoustic feature frames, the second neural network correspondingly outputs the masking function at the moments corresponding to the acoustic feature frames, and the third neural network can therefore output the second enhancement signal in units of acoustic feature frames.
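A hedged sketch of this frame-by-frame use of the three networks follows; the network callables and the assumption that the image features arrive already time-aligned are stand-ins for the components described above.

```python
def stream_enhance(acoustic_frames, aligned_image_frames, net1, net2, net3):
    """Illustrative streaming loop: one output frame per input acoustic feature frame.

    acoustic_frames      : iterable of per-frame acoustic features (e.g. one every 10 ms)
    aligned_image_frames : iterable of image features already aligned frame-for-frame
    net1, net2, net3     : assumed stand-ins for the three networks described above
    """
    for audio_feat, image_feat in zip(acoustic_frames, aligned_image_frames):
        irm = net1(audio_feat)                         # first enhancement signal for this frame
        ibm = net2(image_feat)                         # masking function for the same moment
        gate = net3(irm, ibm)                          # weight for this frame
        yield gate * (ibm * irm) + (1 - gate) * irm    # second enhancement signal, per frame
```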
For a better understanding of the solution provided by the present application, the following description is made in connection with fig. 15.
Fig. 15 is a flowchart of another voice enhancement method according to an embodiment of the present application. Assume that a video segment is provided that includes the speech to be enhanced and a reference image. The speech to be enhanced is feature-transformed to obtain its frequency domain features, and these frequency domain features are input into the first neural network. As shown in fig. 15, it is assumed that the speech to be enhanced is sampled into 3 pieces of audio and that, after feature transformation, each piece of audio includes 4 acoustic feature frames, which form the input of the first neural network in fig. 15. It is assumed that the reference image is resampled according to the preset ratio of the image frame rate to the acoustic feature frame rate, and it is determined that every 4 acoustic feature frames correspond to 1 image frame. After the second neural network performs time alignment processing on this 1 image frame, it can output 4 image frames corresponding to the 4 acoustic feature frames, which form the output of the second neural network in fig. 15. The first enhancement signals corresponding to the 4 acoustic feature frames output by the first neural network and the masking functions corresponding to the 4 image frames output by the second neural network may be input to the third neural network in sequence, and the third neural network may output the second enhancement signals corresponding to the 4 acoustic feature frames, i.e. the output of the third neural network in fig. 15. An inverse feature transformation is then performed on the second enhancement signal to obtain the time domain enhanced signal of the speech to be enhanced.
After the third neural network is trained, when the voice is enhanced, the first enhancement signal and the masking function can be used as input data of the third neural network, and the second enhancement signal can be determined according to the weight output by the third neural network.
In a specific embodiment, after the training of the third neural network, the method may further include, during speech enhancement, performing an inverse feature transformation on the result output by the third neural network to obtain a time domain signal. For example, if the frequency domain features obtained from a short-time Fourier transform of the speech to be enhanced are the input of the first neural network, then an inverse short-time Fourier transform (ISTFT) may be performed on the second enhancement signal output by the third neural network to obtain a time domain signal.
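The transform pair around the networks can be sketched with SciPy's STFT/ISTFT; the sampling rate, window length and 10 ms hop are assumed values, and the random mask stands in for the networks' second enhancement signal.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                        # assumed sampling rate
nperseg, noverlap = 400, 240      # 25 ms window with a 10 ms hop (assumed values)

speech = np.random.randn(fs)      # 1 s of audio standing in for the speech to be enhanced

# Feature transformation: STFT of the speech to be enhanced.
_, _, spec = stft(speech, fs=fs, nperseg=nperseg, noverlap=noverlap)
mag, phase = np.abs(spec), np.angle(spec)

# Placeholder for the per-band second enhancement signal produced by the networks.
irm2 = np.clip(np.random.rand(*mag.shape), 0.0, 1.0)

# Inverse feature transformation: apply the mask and ISTFT back to the time domain.
_, enhanced = istft((mag * irm2) * np.exp(1j * phase), fs=fs, nperseg=nperseg, noverlap=noverlap)
```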
As can be seen from the embodiments corresponding to figs. 7 and 15, the training data of the second neural network may further include some image data that does not include lip features, or data that does not include a face image. In some specific embodiments, the training data of the second neural network may include only image data that includes lip features or data that includes a face image. In some specific embodiments, it may first be determined whether the reference image includes face information or lip information: if it does not, the enhancement signal of the speech to be enhanced is output according to the first neural network only; if it does, the enhancement signal of the speech to be enhanced is output according to the first neural network, the second neural network and the third neural network. Fig. 16 is a flow chart illustrating another voice enhancement method according to an embodiment of the present application. The system first judges whether the reference image includes face information or lip information. If it does not, the system determines the enhancement signal of the speech to be enhanced according to the first enhancement signal output by the first neural network, that is, the second enhancement signal is the first enhancement signal. If the system judges that the reference image does include face information or lip information, the second enhancement signal is determined through the third neural network according to the masking function output by the second neural network and the first enhancement signal output by the first neural network; how the second enhancement signal is determined according to the third neural network has been described in detail above and is not repeated here.
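The decision flow of fig. 16 can be condensed into a few lines; `detect_face_or_lips` and the three network callables below are assumed stand-ins for the components described above, not APIs defined by the patent.

```python
def enhance_frame(speech_feats, reference_image, net1, net2, net3, detect_face_or_lips):
    """Illustrative dispatch matching fig. 16."""
    irm = net1(speech_feats)                       # first enhancement signal
    if not detect_face_or_lips(reference_image):   # no face or lip information
        return irm                                 # second enhancement signal == first
    ibm = net2(reference_image)                    # masking function of the reference image
    gate = net3(irm, ibm)                          # weight from the third neural network
    return gate * (ibm * irm) + (1 - gate) * irm   # second enhancement signal
```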
The flow of the voice enhancement method provided by the embodiments of the present application includes an application flow and a training flow. The application flow provided by the application has been introduced above, specifically a voice enhancement method; the training flow provided by the application is introduced below, specifically a method of training a neural network.
The present application provides a method of training a neural network for speech enhancement, the method may include: training data is acquired, which may include a mixture of speech and noise, and corresponding images, which may include lip features, at the source of the speech. And training the mixed data by taking the ideal float value masking IRM as a training target to obtain a first neural network, wherein the trained first neural network is used for outputting a first enhancement signal of the voice to be enhanced. And training the image by taking the ideal binary masking IBM as a training target to obtain a second neural network, wherein the trained second neural network is used for outputting a masking function of the reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value to indicate that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the operation results of the first enhancement signal and the masking function are used for determining a second enhancement signal of the voice to be enhanced.
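The IRM and IBM training targets are standard constructions in mask-based enhancement; the sketch below uses one common definition (an energy-ratio IRM and an SNR-thresholded IBM) purely for illustration, since the exact formulas are not fixed in this passage.

```python
import numpy as np

def irm_target(speech_mag, noise_mag, eps=1e-8):
    # Ideal ratio (float value) mask: per-band ratio of speech energy to total energy.
    return np.sqrt(speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps))

def ibm_target(speech_mag, noise_mag, threshold_db=0.0, eps=1e-8):
    # Ideal binary mask: 1 where the local SNR exceeds a threshold, else 0.
    snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (snr_db > threshold_db).astype(np.float32)
```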
In a specific embodiment, the reference image is an image corresponding to the sound source of the speech to be enhanced, which may include lip features.
In a specific embodiment, the operation result of the first enhancement signal and the masking function is used to determine the second enhancement signal of the speech to be enhanced, and may include: the first enhancement signal and the masking function are used as input data of a third neural network, the second enhancement signal is determined according to a weight value output by the third neural network, the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking the first mask as a training target.
In a specific embodiment, the method may further comprise: it is determined whether the image may include face information or lip information. When the image does not include face information or lip information, the weight indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is one hundred percent.
In a specific embodiment, the correction signal is the result of a product operation of the first enhancement signal and the masking function.
In a specific embodiment, the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first moment, M is a positive integer, the first enhancement signal output by the first neural network at the first moment may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is a masking function output by the second neural network at the first moment.
In a specific embodiment, the speech to be enhanced may include a first acoustic feature frame, the moment corresponding to the first acoustic feature frame is indicated by a first time index, the image may include a first image frame, the first image frame is input data of the second neural network, and outputting the masking function of the image according to the second neural network may include: outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
In a specific embodiment, the method may further comprise: and carrying out feature transformation on the voice to be enhanced so as to obtain the frequency domain features of the voice to be enhanced. The method may further comprise: and performing characteristic inverse transformation on the second enhancement signal to obtain enhanced voice.
In a specific embodiment, performing feature transformation on the voice to be enhanced may include: the short-time fourier transform STFT is performed on the speech to be enhanced. Performing the inverse feature transformation on the second enhancement signal may include: and performing inverse short time Fourier transform ISTFT on the second enhancement signal.
In a specific embodiment, the method may further comprise: and sampling the image to enable the frame rate of the image frames which can be included in the image to be a preset frame rate.
In a specific embodiment, the lip feature is obtained by feature extraction of a face image, which is obtained by face detection of an image.
In a specific embodiment, the band energy of the image is represented by an activation function, and the value of the activation function is approximated to IBM to obtain the second neural network.
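One reading of this is that the second network ends in a sigmoid whose per-band value is pushed toward the 0/1 entries of the IBM during training. The PyTorch sketch below illustrates such a training step; the network body, feature dimensions and band count are assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

video_net = nn.Sequential(               # assumed stand-in for the second neural network body
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 64),                  # 64 frequency bands, illustrative
)
criterion = nn.BCEWithLogitsLoss()       # drives sigmoid(logits) toward the 0/1 IBM target
optimizer = torch.optim.Adam(video_net.parameters(), lr=1e-4)

lip_feats = torch.randn(8, 512)                      # batch of aligned image-frame features (illustrative)
ibm = torch.randint(0, 2, (8, 64)).float()           # IBM training target per band

loss = criterion(video_net(lip_feats), ibm)          # approximate the activation's value to the IBM
optimizer.zero_grad()
loss.backward()
optimizer.step()

masking_function = torch.sigmoid(video_net(lip_feats))   # activation value in (0, 1) per band
```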
In a specific embodiment, the speech to be enhanced is obtained through a single audio channel.
In one particular embodiment, the first mask is an ideal float masking IRM and the second mask is an ideal binary masking IBM.
The experimental dataset uses the Grid dataset as the clean speech corpus: 32 speakers with 1000 utterances each, 32000 utterances in total, divided into a training set of 27000 utterances (30 speakers, 900 each), a Seen test set of 3000 utterances (30 speakers, 100 each) and an Unseen test set of 2000 utterances (2 speakers, 1000 each). The CHiME background dataset is split 8:2 into a training noise set and a common-environment test noise set, and AudioSet human noise is used as the human-voice environment test set. The main comparison baselines are an audio-only acoustic model (AO), the Visual Speech Enhancement (VSE) model and the Looking to Listen (L2L) model. The experiments are evaluated mainly by PESQ scores. The experimental data show that, by using visual information, the scheme provided by the application improves the speech enhancement task across the -5 dB to 20 dB range.
The voice enhancement method and the neural network training method according to the embodiments of the present application are described in detail above with reference to the accompanying drawings, and the related devices according to the embodiments of the present application are described in detail below. It should be understood that the related apparatus can perform the voice enhancement method and the respective steps of the neural network training according to the embodiments of the present application, and repetitive description will be omitted as appropriate when describing the related apparatus.
Fig. 17 is a schematic structural diagram of a voice enhancement device according to an embodiment of the present application.
In a specific embodiment, the speech enhancement apparatus comprises: an obtaining module 1701, configured to obtain a to-be-enhanced voice and a reference image, where the to-be-enhanced voice and the reference image are data obtained simultaneously. The audio processing module 1702 is configured to output a first enhancement signal of a voice to be enhanced according to a first neural network, where the first neural network is a neural network obtained by training mixed data of the voice and noise with a first mask as a training target. The image processing module 1703 is configured to output a masking function of the reference image according to the second neural network, where the masking function indicates whether the frequency band energy corresponding to the reference image is smaller than a preset value, and the frequency band energy being smaller than the preset value indicates that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the second neural network is a neural network obtained by training the image including the lip feature corresponding to the sound source of the voice adopted by the first neural network with the second mask as a training target. The integrated processing module 1704 is configured to determine a second enhancement signal of the speech to be enhanced according to the first enhancement signal and the operation result of the masking function.
In a specific embodiment, the reference image is a corresponding image comprising lip features at the source of the speech to be enhanced.
In one particular embodiment, the integrated processing module 1704 is specifically configured to: the first enhancement signal and the masking function are used as input data of a third neural network, the second enhancement signal is determined according to a weight value output by the third neural network, the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking the first mask as a training target.
In a specific embodiment, the apparatus further comprises: the feature extraction module is used for determining whether the reference image comprises face information or lip information. When the reference image does not include face information or lip information, the weight indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is one hundred percent.
In a specific embodiment, the correction signal is the result of a product operation of the first enhancement signal and the masking function.
In a specific embodiment, the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first moment, M is a positive integer, the first enhancement signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is a masking function output by the second neural network at the first moment.
In a specific embodiment, the speech to be enhanced includes a first acoustic feature frame, the time corresponding to the first acoustic feature frame is indicated by a first time index, the reference image includes a first image frame, and the first image frame is input data of the second neural network, and the image processing module 1703 is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
In a specific embodiment, performing feature transformation on the voice to be enhanced may include: the short-time fourier transform STFT is performed on the speech to be enhanced. Performing the inverse feature transformation on the second enhancement signal may include: and performing inverse short time Fourier transform ISTFT on the second enhancement signal.
In a specific embodiment, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that may be included in the reference image is a preset frame rate.
In a specific embodiment, the lip feature is obtained by feature extraction of a face image, which is obtained by face detection of a reference image.
In a specific embodiment, the band energy of the reference image is represented by an activation function, and the value of the activation function is approximated to IBM to obtain the second neural network.
In a specific embodiment, the speech to be enhanced is obtained through a single audio channel.
In one particular embodiment, the first mask is an ideal float masking IRM and the second mask is an ideal binary masking IBM.
Fig. 18 is a schematic structural diagram of an apparatus for training a neural network according to an embodiment of the present application.
The application provides a device for training a neural network, which is used for voice enhancement, and comprises: the acquiring module 1801 is configured to acquire training data, where the training data includes mixed data of speech and noise and a corresponding image including lip features at a sound source of the speech. The audio processing module 1802 is configured to train the mixed data with the ideal float value masking IRM as a training target to obtain a first neural network, where the trained first neural network is used to output a first enhancement signal of the speech to be enhanced. The image processing module 1803 is configured to train the image with the ideal binary mask IBM as a training target to obtain a second neural network, where the trained second neural network is configured to output a masking function of the reference image, the masking function indicates whether band energy of the reference image is smaller than a preset value, the band energy is smaller than the preset value to indicate that a speech band to be enhanced corresponding to the reference image is noise, and an operation result of the first enhancement signal and the masking function is used to determine a second enhancement signal of the speech to be enhanced.
In a specific embodiment, the reference image is a corresponding image comprising lip features at the source of the speech to be enhanced.
In a specific embodiment, the method further comprises: the integrated processing module 1804, is configured to use the first enhancement signal and the masking function as input data of the third neural network, determine the second enhancement signal according to a weight value output by the third neural network, where the weight value indicates an output ratio of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network with the first mask as a training target.
In a specific embodiment, the apparatus further comprises a feature extraction module, configured to determine whether the image includes face information or lip information. When the image does not include face information or lip information, the weight indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.
In a specific embodiment, the correction signal is the result of a product operation of the first enhancement signal and the masking function.
In a specific embodiment, the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first moment, M is a positive integer, the first enhancement signal output by the first neural network at the first moment includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first moment is a masking function output by the second neural network at the first moment.
In a specific embodiment, the speech to be enhanced includes a first acoustic feature frame, the moment corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, the first image frame is input data of the second neural network, and the image processing module 1803 is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
In a specific embodiment, performing feature transformation on the voice to be enhanced may include: the short-time fourier transform STFT is performed on the speech to be enhanced. Performing the inverse feature transformation on the second enhancement signal may include: and performing inverse short time Fourier transform ISTFT on the second enhancement signal.
In a specific embodiment, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that may be included in the reference image is a preset frame rate.
In a specific embodiment, the lip feature is obtained by feature extraction of a face image, which is obtained by face detection of a reference image.
In a specific embodiment, the band energy of the reference image is represented by an activation function, and the value of the activation function is approximated to IBM to obtain the second neural network.
In a specific embodiment, the speech to be enhanced is obtained through a single audio channel.
In one particular embodiment, the first mask is an ideal float masking IRM and the second mask is an ideal binary masking IBM.
Fig. 19 is a schematic structural diagram of another voice enhancement device according to an embodiment of the present application.
Fig. 19 is a schematic block diagram of a speech enhancement apparatus of an embodiment of the present application. The speech enhancement apparatus shown in fig. 19 includes a memory 1901, a processor 1902, a communication interface 1903, and a bus 1904. The memory 1901, the processor 1902, and the communication interface 1903 are connected to each other by the bus 1904.
The communication interface 1903 corresponds to the obtaining module 1701 in the speech enhancement apparatus, and the processor 1902 corresponds to the audio processing module 1702, the image processing module 1703 and the integrated processing module 1704 in the speech enhancement apparatus. The individual components of the speech enhancement apparatus are described in detail below.
The memory 1901 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1901 may store a program, and when the program stored in the memory 1901 is executed by the processor 1902, the processor 1902 and the communication interface 1903 are configured to perform the respective steps of the speech enhancement method of the embodiments of the present application. In particular, the communication interface 1903 may acquire the speech to be enhanced and the reference image from memory or another device, and the processor 1902 may then perform speech enhancement on them.
The processor 1902 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits for executing associated programs, so as to perform the functions required to be performed by the modules in the speech enhancement apparatus of the embodiments of the present application (for example, the processor 1902 may perform the functions required to be performed by the audio processing module 1702, the image processing module 1703 and the integrated processing module 1704 in the speech enhancement apparatus described above), or to perform the speech enhancement method of the embodiments of the present application.
The processor 1902 may also be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the speech enhancement method of the embodiments of the present application may be performed by integrated logic circuitry in hardware or by instructions in software in the processor 1902.
The processor 1902 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1901, and the processor 1902 reads the information in the memory 1901 and, in combination with its hardware, performs the functions required to be performed by the modules included in the speech enhancement apparatus of the embodiments of the present application, or performs the speech enhancement method of the method embodiments of the present application.
The communication interface 1903 enables communication between the apparatus and other devices or a communication network using a transceiver apparatus such as, but not limited to, a transceiver. For example, the speech to be enhanced and the reference image may be acquired through the communication interface 1903.
The bus 1904 may include a path for transferring information between various components of the device module (e.g., the memory 1901, the processor 1902, the communication interface 1903).
Fig. 20 is a schematic structural diagram of another apparatus for training a neural network according to an embodiment of the present application.
Fig. 20 is a schematic diagram of a hardware configuration of a training neural network device according to an embodiment of the present application. Similar to the above-described devices, the training neural network device shown in fig. 20 includes a memory 2001, a processor 2002, a communication interface 2003, and a bus 2004. The memory 2001, the processor 2002, and the communication interface 2003 are connected to each other by a bus 2004.
The memory 2001 may store a program, and when the program stored in the memory 2001 is executed by the processor 2002, the processor 2002 is used to perform the respective steps of the training method of the neural network of the embodiment of the present application.
The processor 2002 may employ a general-purpose CPU, microprocessor, ASIC, GPU, or one or more integrated circuits for executing associated routines to perform the neural network training methods of embodiments of the present application.
The processor 2002 may also be an integrated circuit chip with signal processing capabilities. In implementation, various steps of the neural network training method of embodiments of the present application may be performed by hardware integrated logic circuits or software-form instructions in the processor 2002.
It should be understood that the neural network is trained by the training neural network device shown in fig. 20, and the trained neural network can be used to perform the method according to the embodiment of the present application.
Specifically, the apparatus shown in fig. 20 may acquire training data and a neural network to be trained from the outside through the communication interface 2003, and then train the neural network to be trained according to the training data by the processor.
It should be noted that while the above-described apparatus modules and apparatus only illustrate memory, processors, and communication interfaces, those skilled in the art will appreciate that in a particular implementation, the apparatus modules and apparatus may also include other devices necessary to achieve proper operation. Also, as will be appreciated by those of skill in the art, the apparatus modules and apparatus may also include hardware devices that perform other additional functions, as desired. Furthermore, it will be appreciated by those skilled in the art that the apparatus module and apparatus may also include only the necessary components to implement the embodiments of the present application, and not all of the components shown in fig. 19 and 20.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (46)

1. A method of speech enhancement, comprising:
acquiring a voice to be enhanced and a reference image, wherein the voice to be enhanced and the reference image are data acquired simultaneously;
Outputting a first enhancement signal of the voice to be enhanced according to a first neural network, wherein the first neural network is obtained by training the mixed data of the voice and the noise by taking a first mask as a training target;
Outputting a masking function of the reference image according to a second neural network, wherein the masking function indicates whether the frequency band energy corresponding to the reference image is smaller than a preset value, the frequency band energy smaller than the preset value indicates that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the second neural network is a neural network obtained by training an image including lip characteristics corresponding to a voice source adopted by the first neural network by taking a second mask as a training target;
and determining a second enhancement signal of the voice to be enhanced according to the first enhancement signal and the operation result of the masking function.
2. The method of claim 1, wherein the reference image is a corresponding image including lip features at a sound source of the speech to be enhanced.
3. The method according to claim 1 or 2, wherein the determining the second enhancement signal of the speech to be enhanced according to the first enhancement signal and the operation result of the masking function includes:
the first enhancement signal and the masking function are used as input data of a third neural network, the second enhancement signal is determined according to a weight value output by the third neural network, the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking the first mask as a training target.
4. A method of speech enhancement according to claim 3, wherein said method further comprises:
Determining whether the reference image includes face information or lip information;
When the reference image does not include the face information or the lip information, the weight indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is one hundred percent.
5. A method of speech enhancement according to claim 3, wherein said correction signal is the result of a product operation of said first enhancement signal and said masking function.
6. The method according to claim 5, wherein the correction signal is determined according to a product operation result of M signal-to-noise ratios and a masking function at a first time, M is a positive integer, the first enhancement signal output by the first neural network at the first time includes M frequency bands, each of the M frequency bands corresponds to a signal-to-noise ratio, and the masking function at the first time is the masking function output by the second neural network at the first time.
7. The method according to any one of claims 1 to 2, wherein the speech to be enhanced comprises a first acoustic feature frame, a time instant to which the first acoustic feature frame corresponds is indicated by a first time index, the reference image comprises a first image frame, the first image frame is input data of the second neural network, and the outputting the masking function of the reference image according to the second neural network comprises:
Outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
8. The speech enhancement method according to any of claims 1 to 2, characterized in that the method further comprises:
performing feature transformation on the voice to be enhanced to obtain frequency domain features of the voice to be enhanced;
The method further comprises the steps of:
and performing feature inverse transformation on the second enhancement signal to obtain enhanced voice.
9. The method for speech enhancement according to claim 8, wherein,
The feature transformation of the voice to be enhanced comprises the following steps:
Performing short-time Fourier transform (STFT) on the voice to be enhanced;
Said performing an inverse feature transformation on said second enhancement signal comprises:
And performing inverse short time Fourier transform ISTFT on the second enhancement signal.
10. The speech enhancement method according to any of claims 1 to 2, characterized in that the method further comprises:
and sampling the reference image to enable the frame rate of the image frames included in the reference image to be a preset frame rate.
11. The method according to any one of claims 1 to 2, wherein the lip feature is obtained by feature extraction of a face map obtained by face detection of the reference image.
12. The speech enhancement method according to any of claims 1 to 2, wherein the band energy of the reference image is represented by an activation function whose value is approximated to an ideal binary mask IBM to obtain the second neural network.
13. The speech enhancement method according to any one of claims 1 to 2, wherein the speech to be enhanced is obtained through a single audio channel.
14. The speech enhancement method according to any of claims 1-2, wherein said first mask is an ideal float masking IRM and said second mask is an ideal binary masking IBM.
15. A method of training a neural network for speech enhancement, the method comprising:
Acquiring training data, wherein the training data comprises mixed data of voice and noise and an image which corresponds to a sound source of the voice and comprises lip characteristics;
The method comprises the steps of taking an ideal floating value masking IRM as a training target, training the mixed data to obtain a first neural network, and outputting a first enhancement signal of voice to be enhanced by the trained first neural network;
And training the image by taking an ideal binary masking IBM as a training target to obtain a second neural network, wherein the trained second neural network is used for outputting a masking function of a reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value to indicate that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the operation results of the first enhancement signal and the masking function are used for determining a second enhancement signal of the voice to be enhanced.
16. The method of claim 15, wherein the reference image is a corresponding image including lip features at a source of the speech to be enhanced.
17. The method of training a neural network according to claim 15 or 16, wherein the operation result of the first enhancement signal and the masking function is used to determine the second enhancement signal of the speech to be enhanced, comprising:
The first enhancement signal and the masking function are used as input data of a third neural network, the second enhancement signal is determined according to a weight value output by the third neural network, the weight value indicates the output proportion of the first enhancement signal and the correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network by taking a first mask as a training target.
18. The method of training a neural network of claim 17, further comprising:
Determining whether the image includes face information or lip information;
When the image does not include the face information or the lip information, the weight indicates that the output ratio of the correction signal in the second enhancement signal is 0, and the output ratio of the first enhancement signal is one hundred percent.
19. The method of training a neural network of claim 17, wherein the correction signal is a product of the first enhancement signal and the masking function.
20. The method of claim 19, wherein the correction signal is determined according to a result of a product operation of M signal-to-noise ratios and the masking function at a first time, wherein M is a positive integer, the first enhancement signal output by the first neural network at the first time comprises M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is the masking function output by the second neural network at the first time.
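A sketch of the per-band operation of claim 20 at a single time step, with an assumed band count of M = 64; the first network's output at time t is treated as a vector of per-band values and multiplied element-wise by the masking function at the same time:

```python
import numpy as np

M = 64                                                 # assumed number of frequency bands
band_snr_t = np.random.rand(M)                         # first-network output at time t
mask_fn_t = (np.random.rand(M) > 0.5).astype(float)    # masking function at time t
correction_t = band_snr_t * mask_fn_t                  # element-wise product over the M bands
```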
21. The method of training a neural network according to any one of claims 15 to 16, wherein the speech to be enhanced comprises a first acoustic feature frame, the time instant corresponding to the first acoustic feature frame is indicated by a first time index, the image comprises a first image frame, the first image frame is input data of the second neural network, and outputting the masking function of the image according to the second neural network comprises:
outputting, according to the second neural network, the masking function corresponding to the first image frame at a first moment, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
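An illustration of the time alignment in claim 21 under assumed frame rates of 100 acoustic feature frames per second and 25 image frames per second, giving a multiple of 4; the actual rates are not fixed by the claim:

```python
def image_frame_for_audio_index(audio_index: int, acoustic_fps: int = 100,
                                video_fps: int = 25) -> int:
    # multiple = acoustic frame rate / image frame rate (4 under these assumptions),
    # so one image frame's masking function is reused across 4 acoustic frames.
    ratio = acoustic_fps // video_fps
    return audio_index // ratio

assert image_frame_for_audio_index(7) == 1   # acoustic frames 4..7 share image frame 1
```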
22. A method of training a neural network according to any one of claims 15 to 16, further comprising:
performing a feature transformation on the speech to be enhanced to obtain frequency-domain features of the speech to be enhanced; and
performing an inverse feature transformation on the second enhancement signal to obtain enhanced speech.
23. The method of training a neural network of claim 22, wherein
the performing of the feature transformation on the speech to be enhanced comprises: performing a short-time Fourier transform (STFT) on the speech to be enhanced; and
the performing of the inverse feature transformation on the second enhancement signal comprises: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.
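A brief sketch of the STFT/ISTFT round trip of claims 22 and 23 using SciPy; the sampling rate, window length, and hop size are assumptions, not claim limitations:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.random.randn(fs)                                   # 1 s of single-channel audio
f, t, X = stft(x, fs=fs, nperseg=400, noverlap=240)       # feature transformation (STFT)
# ...the enhancement would modify X here, e.g. scaling |X| by the estimated mask...
_, x_enhanced = istft(X, fs=fs, nperseg=400, noverlap=240)  # inverse transformation (ISTFT)
```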
24. A method of training a neural network according to any one of claims 15 to 16, further comprising:
sampling the image so that the frame rate of the image frames comprised in the image is a preset frame rate.
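A sketch of the frame-rate normalization of claim 24, using nearest-neighbour index selection to bring a frame stack to an assumed preset rate of 25 fps:

```python
import numpy as np

def resample_frames(frames: np.ndarray, src_fps: float, dst_fps: float = 25.0) -> np.ndarray:
    """Nearest-neighbour resampling of a (n_frames, H, W) frame stack to dst_fps."""
    n_out = int(round(frames.shape[0] * dst_fps / src_fps))
    idx = np.round(np.linspace(0, frames.shape[0] - 1, n_out)).astype(int)
    return frames[idx]

video = np.zeros((30, 64, 64), dtype=np.uint8)   # 1 s of 30 fps toy video
assert resample_frames(video, src_fps=30).shape[0] == 25
```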
25. A method of training a neural network according to any one of claims 15 to 16, wherein the lip features are obtained by performing feature extraction on a face map, and the face map is obtained by performing face detection on the image.
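One possible realization of claim 25, using an OpenCV Haar cascade for face detection and taking the lower third of the detected face box as a rough lip region; the claim does not prescribe a particular detector or lip-feature extractor, so both choices are assumptions:

```python
import cv2

def lip_region(image_bgr):
    """Face detection -> face map -> rough lip crop (lower third of the face box)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                     # no face found: the claim-18 style fallback applies
    x, y, w, h = faces[0]               # face map
    return image_bgr[y + 2 * h // 3 : y + h, x : x + w]
```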
26. A method of training a neural network according to any one of claims 15 to 16, wherein the frequency band energy of the image is represented by an activation function, and the value of the activation function is approximated to the IBM to obtain the second neural network.
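A sketch of claim 26: the image network ends in a sigmoid activation whose per-band outputs are pushed toward the binary IBM target, for example with a binary cross-entropy loss; the loss choice and band count are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, ibm_target, eps=1e-7):
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)      # activation values in (0, 1)
    return float(np.mean(-(ibm_target * np.log(p)
                           + (1.0 - ibm_target) * np.log(1.0 - p))))

logits = np.random.randn(64)                          # 64 assumed frequency bands
ibm_target = (np.random.rand(64) > 0.5).astype(float)
loss = bce_loss(logits, ibm_target)                   # drives the activation toward the IBM
```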
27. A method of training a neural network according to any of claims 15 to 16, wherein the speech to be enhanced is acquired through a single audio channel.
28. The method of training a neural network according to any one of claims 15 to 16, wherein the first neural network is trained with a first mask as a training target, the second neural network is trained with a second mask as a training target, the first mask is the ideal floating-value mask (IRM), and the second mask is the ideal binary mask (IBM).
29. A speech enhancement apparatus, comprising:
an acquisition module, configured to acquire speech to be enhanced and a reference image, wherein the speech to be enhanced and the reference image are data acquired simultaneously;
an audio processing module, configured to output a first enhancement signal of the speech to be enhanced according to a first neural network, wherein the first neural network is a neural network obtained by training on mixed data of speech and noise with a first mask as a training target;
an image processing module, configured to output a masking function of the reference image according to a second neural network, wherein the masking function indicates whether the frequency band energy corresponding to the reference image is smaller than a preset value, frequency band energy smaller than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise, and the second neural network is a neural network obtained by training, with a second mask as a training target, on an image that comprises lip features and corresponds to the sound source of the speech used by the first neural network; and
an integrated processing module, configured to determine a second enhancement signal of the speech to be enhanced according to an operation result of the first enhancement signal and the masking function.
30. The speech enhancement apparatus according to claim 29, wherein the reference image is an image that corresponds to the sound source of the speech to be enhanced and comprises lip features.
31. The speech enhancement apparatus according to claim 29 or 30, wherein said integrated processing module is specifically configured to:
use the first enhancement signal and the masking function as input data of a third neural network, and determine the second enhancement signal according to a weight output by the third neural network, wherein the weight indicates respective output proportions of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training on the output data of the first neural network and the output data of the second neural network with the first mask as a training target.
32. The speech enhancement apparatus of claim 31, wherein the apparatus further comprises a feature extraction module,
wherein the feature extraction module is configured to determine whether the reference image includes face information or lip information; and when the reference image does not include face information or lip information, the weight indicates that the output proportion of the correction signal in the second enhancement signal is 0 and the output proportion of the first enhancement signal is one hundred percent.
33. The speech enhancement apparatus of claim 31 wherein said correction signal is the result of a product operation of said first enhancement signal and said masking function.
34. The speech enhancement apparatus of claim 33, wherein said correction signal is determined based on a product of M signal-to-noise ratios and a masking function at a first time, said M being a positive integer, said first enhancement signal output by said first neural network at said first time comprising M frequency bands, each of said M frequency bands corresponding to a signal-to-noise ratio, said masking function at said first time being said masking function output by said second neural network at said first time.
35. The speech enhancement apparatus according to any of claims 29 to 30, wherein the speech to be enhanced comprises a first acoustic feature frame, the moment corresponding to the first acoustic feature frame being indicated by a first time index, the reference image comprising a first image frame, the first image frame being input data of the second neural network, the image processing module being specifically configured to:
Outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
36. An apparatus for training a neural network for speech enhancement, the apparatus comprising:
an acquisition module, configured to acquire training data, wherein the training data comprises mixed data of speech and noise and an image that corresponds to the sound source of the speech and comprises lip features;
an audio processing module, configured to train a first neural network on the mixed data with an ideal floating-value mask (IRM) as a training target, wherein the trained first neural network is configured to output a first enhancement signal of speech to be enhanced; and
an image processing module, configured to train a second neural network on the image with an ideal binary mask (IBM) as a training target, wherein the trained second neural network is configured to output a masking function of a reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, frequency band energy smaller than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise, and an operation result of the first enhancement signal and the masking function is used to determine a second enhancement signal of the speech to be enhanced.
37. The apparatus for training a neural network of claim 36, wherein the reference image is an image that corresponds to the sound source of the speech to be enhanced and comprises lip features.
38. The apparatus for training a neural network of claim 36 or 37, further comprising an integrated processing module,
wherein the integrated processing module is configured to use the first enhancement signal and the masking function as input data of a third neural network and to determine the second enhancement signal according to a weight output by the third neural network, wherein the weight indicates respective output proportions of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training on the output data of the first neural network and the output data of the second neural network with the first mask as a training target.
39. The apparatus for training a neural network of claim 38, further comprising a feature extraction module,
wherein the feature extraction module is configured to determine whether the image includes face information or lip information; and
when the image does not include face information or lip information, the weight indicates that the output proportion of the correction signal in the second enhancement signal is 0 and the output proportion of the first enhancement signal is one hundred percent.
40. The apparatus for training a neural network of claim 38, wherein said correction signal is a result of a product operation of said first enhancement signal and said masking function.
41. The apparatus for training a neural network of claim 40, wherein said correction signal is determined based on a result of a product of M signal-to-noise ratios and a masking function at a first time, said M being a positive integer, said first enhancement signal output by said first neural network at said first time comprising M frequency bands, each of said M frequency bands corresponding to a signal-to-noise ratio, said masking function at said first time being said masking function output by said second neural network at said first time.
42. The apparatus for training a neural network according to any one of claims 36 to 37, wherein the speech to be enhanced comprises a first acoustic feature frame, the time instant to which the first acoustic feature frame corresponds being indicated by a first time index, the image comprising a first image frame, the first image frame being input data of the second neural network, the image processing module being specifically configured to:
Outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
43. A speech enhancement apparatus, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, wherein the processor is configured to perform the method according to any one of claims 1 to 14 when the program stored in the memory is executed.
44. An apparatus for training a neural network, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, wherein the processor is configured to perform the method according to any one of claims 15 to 28 when the program stored in the memory is executed.
45. A computer storage medium, characterized in that the computer storage medium stores a program code comprising instructions for performing the steps of the method according to any of claims 1-14.
46. A computer storage medium, characterized in that the computer storage medium stores a program code comprising instructions for performing the steps of the method according to any of claims 15-28.
CN202010281044.1A 2020-04-10 2020-04-10 Voice enhancement method, neural network training method and related equipment Active CN113516990B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010281044.1A CN113516990B (en) 2020-04-10 2020-04-10 Voice enhancement method, neural network training method and related equipment
PCT/CN2021/079047 WO2021203880A1 (en) 2020-04-10 2021-03-04 Speech enhancement method, neural network training method, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010281044.1A CN113516990B (en) 2020-04-10 2020-04-10 Voice enhancement method, neural network training method and related equipment

Publications (2)

Publication Number Publication Date
CN113516990A CN113516990A (en) 2021-10-19
CN113516990B true CN113516990B (en) 2024-08-13

Family

ID=78022804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010281044.1A Active CN113516990B (en) 2020-04-10 2020-04-10 Voice enhancement method, neural network training method and related equipment

Country Status (2)

Country Link
CN (1) CN113516990B (en)
WO (1) WO2021203880A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093347A (en) * 2021-11-26 2022-02-25 青岛海尔科技有限公司 Awakening word energy calculation method and system, voice awakening system and storage medium
CN113921030B (en) * 2021-12-07 2022-06-07 江苏清微智能科技有限公司 Speech enhancement neural network training method and device based on weighted speech loss
CN114255307A (en) * 2021-12-08 2022-03-29 中国联合网络通信集团有限公司 Virtual face control method, device, equipment and storage medium
CN114581832B (en) * 2022-03-04 2024-08-06 中国科学院声学研究所 Voice enhancement method
CN114898767B (en) * 2022-04-15 2023-08-15 中国电子科技集团公司第十研究所 U-Net-based airborne voice noise separation method, equipment and medium
CN114783454B (en) * 2022-04-27 2024-06-04 北京百度网讯科技有限公司 Model training and audio noise reduction method, device, equipment and storage medium
CN115440251A (en) * 2022-09-01 2022-12-06 有米科技股份有限公司 Audio reconstruction method and device for image-assisted audio completion
CN118354028B (en) * 2024-05-20 2024-09-27 国家电网有限公司信息通信分公司 Video conference sound delay detection method and system based on machine learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096961A (en) * 2014-05-06 2015-11-25 华为技术有限公司 Voice separation method and device
CN110246512A (en) * 2019-05-30 2019-09-17 平安科技(深圳)有限公司 Sound separation method, device and computer readable storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8635066B2 (en) * 2010-04-14 2014-01-21 T-Mobile Usa, Inc. Camera-assisted noise cancellation and speech recognition
US20150032449A1 (en) * 2013-07-26 2015-01-29 Nuance Communications, Inc. Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition
US9666183B2 (en) * 2015-03-27 2017-05-30 Qualcomm Incorporated Deep neural net based filter prediction for audio event classification and extraction
CN106328156B (en) * 2016-08-22 2020-02-18 华南理工大学 Audio and video information fusion microphone array voice enhancement system and method
WO2019008580A1 (en) * 2017-07-03 2019-01-10 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Method and system for enhancing a speech signal of a human speaker in a video using visual information
WO2019104229A1 (en) * 2017-11-22 2019-05-31 Google Llc Audio-visual speech separation
CN108447495B (en) * 2018-03-28 2020-06-09 天津大学 Deep learning voice enhancement method based on comprehensive feature set
CN110970057B (en) * 2018-09-29 2022-10-28 华为技术有限公司 Sound processing method, device and equipment
CN109326302B (en) * 2018-11-14 2022-11-08 桂林电子科技大学 Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN109616139B (en) * 2018-12-25 2023-11-03 平安科技(深圳)有限公司 Speech signal noise power spectral density estimation method and device
CN110390350B (en) * 2019-06-24 2021-06-15 西北大学 Hierarchical classification method based on bilinear structure
CN110390950B (en) * 2019-08-17 2021-04-09 浙江树人学院(浙江树人大学) End-to-end voice enhancement method based on generation countermeasure network

Also Published As

Publication number Publication date
WO2021203880A1 (en) 2021-10-14
CN113516990A (en) 2021-10-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant