CN114333863A - Voice enhancement method and device, electronic equipment and computer readable storage medium - Google Patents

Voice enhancement method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN114333863A
Authority
CN
China
Prior art keywords
features
voice
audio data
target
power spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111544776.6A
Other languages
Chinese (zh)
Inventor
李渊强
殷保才
刘文超
程虎
陈航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111544776.6A
Publication of CN114333863A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a voice enhancement method, a voice enhancement device, electronic equipment and a computer-readable storage medium, wherein the method comprises the following steps: acquiring video data and original audio data of a target, wherein the video data is obtained by shooting the target when the original audio data is acquired; extracting visual features by using video data, and extracting semantic features and voice features by using original audio data; and performing voice enhancement processing based on the visual features, the semantic features and the voice features to obtain enhanced audio data. By the method, the robustness of voice enhancement can be improved.

Description

Voice enhancement method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of speech technologies, and in particular, to a speech enhancement method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Speech enhancement aims to remove background noise and interfering sounds from a speaker's voice and is essentially a separation task. Much research has explored improving the enhancement effect by combining auxiliary information beyond the speech itself. For example, current multi-modal enhancement schemes use the visual modality to assist speech enhancement, but visual information is highly unstable: problems such as illumination and equipment can degrade it to the point where the visual cue no longer works, which makes the speech enhancement effect unstable.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a method, an apparatus, an electronic device and a computer-readable storage medium for speech enhancement, which can improve the robustness of speech enhancement.
To solve the above technical problem, a first aspect of the present application provides a speech enhancement method, including: acquiring video data and original audio data of a target, wherein the video data is obtained by shooting the target when the original audio data is acquired; extracting visual features by using video data, and extracting semantic features and voice features by using original audio data; and performing voice enhancement processing based on the visual features, the semantic features and the voice features to obtain enhanced audio data.
In order to solve the above technical problem, a second aspect of the present application provides a speech enhancement apparatus, including: the acquisition module is used for acquiring video data and original audio data of a target, wherein the video data is obtained by shooting the target when the original audio data is acquired; the feature extraction module is used for extracting visual features by using the video data and extracting semantic features and voice features by using the original audio data; and the voice enhancement module is used for carrying out voice enhancement processing based on the visual characteristic, the semantic characteristic and the voice characteristic to obtain enhanced audio data.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the memory is used for storing program data, and the processor is used for executing the program data to implement the foregoing method.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium, in which program data are stored, and when the program data are executed by a processor, the program data are used for implementing the foregoing method.
The beneficial effects of the present application are as follows. Different from the prior art, the present application obtains video data and original audio data of a target, where the video data is obtained by shooting the target while the original audio data is collected; it then extracts visual features from the video data and extracts semantic features and speech features from the original audio data; finally, it performs speech enhancement processing based on the visual features, the semantic features and the speech features to obtain enhanced audio data. By introducing semantic features and combining visual, semantic and speech features for multi-modal speech enhancement, the semantic features can still assist the enhancement when the visual features are unstable, which helps improve the robustness of speech enhancement.
Drawings
In order to more clearly illustrate the technical solutions in the present application, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without inventive effort. Wherein:
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a speech enhancement method of the present application;
FIG. 2 is another flow chart of an embodiment of the speech enhancement method of the present application;
FIG. 3 is a schematic flowchart of another embodiment of step S12 of the present application;
FIG. 4 is a schematic flowchart of yet another embodiment of step S12 of the present application;
FIG. 5 is a schematic flow chart illustrating another embodiment of step S13 in FIG. 1;
FIG. 6 is another flowchart of another embodiment of step S13 in FIG. 1;
FIG. 7 is a schematic flow chart of another embodiment of step S133 in FIG. 5;
FIG. 8 is a flow chart illustrating another embodiment of the speech enhancement method of the present application;
FIG. 9 is a block diagram schematically illustrating the structure of an embodiment of the speech enhancement apparatus of the present application;
FIG. 10 is a block diagram illustrating the structure of an embodiment of the electronic device of the present application;
FIG. 11 is a block diagram illustrating the structure of one embodiment of the computer-readable storage medium of the present application.
Detailed Description
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The terms "first" and "second" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, speech enhancement schemes can be divided into two types: single-modality schemes and multi-modality schemes. Single-modality schemes are based on speech features alone: enhanced speech is obtained by learning a mask of clean speech, or the enhancement effect for a specific speaker is improved by extracting voiceprint features. The latter is stable for that specific speaker but can only be customized for a target person, so its generalization ability is weak and the usability of the model is reduced. Multi-modality schemes mainly add visual features, but visemes are coarser-grained than phonemes and visual features have a certain instability, so the model is not stable enough and the robustness of speech enhancement is poor. In addition, the feature fusion methods adopted do not fully exploit the ability of the visual modality to help extract the audio signal, and can even be harmful in low-noise environments.
Based on the above, the present application provides a speech enhancement method that extracts multi-modal features from video data and original audio data to assist speech enhancement. Audio and video each provide a good auxiliary effect in low-noise and high-noise scenarios respectively, so fusing the two improves robustness to the noise environment. Second, lip key-point information strengthens the extraction of visual features, and the introduced semantic features carry information such as the articulation manner and semantic content, which further improves the robustness of speech enhancement. In addition, features of different modalities are effectively fused through a Transformer, so that the system can still produce a good speech enhancement result under different noise levels and when the visual features are unstable.
Referring to fig. 1 to fig. 2, fig. 1 is a schematic flowchart illustrating a speech enhancement method according to an embodiment of the present application, and fig. 2 is a schematic flowchart illustrating another speech enhancement method according to an embodiment of the present application.
The method may comprise the steps of:
step S11: and acquiring video data and original audio data of the target, wherein the video data is obtained by shooting the target when the original audio data is acquired.
The target is any subject capable of producing sound, for example a person or an anthropomorphic robot. The target may include one or more subjects. In an example, the raw audio data may be the sound emitted by one or more persons when speaking.
The video data is a sequence of image frames and may or may not include audio data. In a related technology, only a single face picture is used to extract visual features; although this reduces the computation of the model, such visual features are only effective on in-set data and the usability of the model is poor. In this embodiment, motion-related visual features are extracted from the video data, which gives better model usability than extracting visual features from a single face picture.
In an application scenario, a camera may be used to shoot the target to obtain video data of the target, and an audio collection device (e.g., a microphone) may be used to collect audio data of the target to obtain the original audio data. The original audio data may be noisy due to the collection environment and other factors, and the speech enhancement method provided by this embodiment can remove the noise in the original audio data to obtain enhanced audio data. The noise may be, but is not limited to, non-target sounds such as ambient noise, reverberation, and breathing sounds.
In some embodiments, the video data and the original audio data may be preprocessed before step S12 to ensure the validity of the video data and the original audio data.
In one example, the preprocessing may be to remove, based on face key-point information, image frames in the video data that do not include a region of interest, where the region of interest includes the lip shape of the target. It can be understood that the video data is used to assist speech enhancement, and lip motion is closely related to pronunciation, so combining lip motion information can strengthen the speech enhancement result; the valid image frames in the video data therefore need to contain the lip shape of the target. Specifically, a face key-point algorithm may be used to perform face key-point detection on the image frames in the video data to obtain face key-point information, and image frames that do not include the region of interest can then be identified based on this information and removed. In other cases, the region of interest may be another location related to pronunciation and is not limited to the lip shape of the target. For example, when the target is an anthropomorphic robot, its eyes (specifically, the eye region shown on a display screen) may be open when it is speaking and closed when it is not, and in that case the region where the eyes are located may be used as the region of interest.
In another example, the preprocessing may be to remove unsynchronized data from the audio data and the video data to obtain synchronized audio data and video data. Since the present application is a multi-modal speech enhancement method, both audio data and video data are required, and they need to be well synchronized. In some application scenarios, there may be a delay or misalignment when the audio data and the video data are captured, so that the obtained audio data and video data are out of sync.
In yet another example, the preprocessing may first remove image frames in the video data that do not contain the region of interest based on the face key-point information, and then remove unsynchronized data from the audio data and the video data. Because part of the image frames in the video data have been removed, the audio data and the video data are no longer synchronous, so the audio data corresponding to the removed image frames can also be removed. In addition, the frame rate of the video data may be unified, for example, to 25 fps (frames per second).
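A minimal sketch of this preprocessing idea is shown below; it is an illustration, not the patent's implementation. The landmark detector detect_face_landmarks is a placeholder (any off-the-shelf face key-point model could fill the role), and the lip index range assumes a 68-point landmark convention.

```python
import cv2


def detect_face_landmarks(frame):
    """Placeholder for a face key-point detector; returns [(x, y), ...] or None."""
    raise NotImplementedError


def keep_lip_frames(video_path, lip_indices=range(48, 68)):
    """Read a video and keep only frames whose key points include the lip region."""
    cap = cv2.VideoCapture(video_path)
    kept, kept_idx = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pts = detect_face_landmarks(frame)
        # Drop frames where no face / no lip key points are detected.
        if pts is not None and len(pts) > max(lip_indices):
            kept.append(frame)
            kept_idx.append(idx)
        idx += 1
    cap.release()
    return kept, kept_idx  # kept_idx can be used to drop the matching audio segments
```

The returned frame indices make it straightforward to remove the corresponding, now-unsynchronized audio segments as described above.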
Step S12: visual features are extracted using the video data, and semantic and speech features are extracted using the raw audio data.
In some embodiments, the visual features include information about the movement of the target's lips, such as opening and closing. Semantics refers to the meaning of the data, and the semantic features include the meaning expressed by the target in the original audio data. The speech features include characteristics of the sound produced by the target in the original audio data, such as timbre and pitch.
Step S13: performing speech enhancement processing based on the visual features, the semantic features and the speech features to obtain enhanced audio data.
Specifically, as shown in fig. 2, the visual feature, the semantic feature, and the speech feature may be feature-fused to obtain an enhanced feature, and then the enhanced audio data may be obtained based on the enhanced feature. In some embodiments, the enhancement features may be directly output as the enhanced audio data, and in other embodiments, the enhancement features may be further processed to obtain the enhanced audio data. For details, reference may be made to the following examples, respectively.
Visual information can be unstable due to illumination, equipment and other problems, which makes the speech enhancement effect unstable. However, speech enhancement is essentially the extraction of the target's audio data, and semantic features and speech features can be extracted from the audio data itself to assist the enhancement. In this embodiment, the visual features, semantic features and speech features are used together for multi-modal speech enhancement processing, so that high-quality speech can still be extracted from the semantic features and speech features even when the visual features are unstable, which improves the robustness of speech enhancement.
In this embodiment, video data and original audio data of a target are obtained, where the video data is obtained by shooting the target while the original audio data is collected; visual features are then extracted from the video data, semantic features and speech features are extracted from the original audio data, and finally speech enhancement processing is performed based on the visual features, the semantic features and the speech features to obtain enhanced audio data.
The manner of acquiring the visual feature, the voice feature and the semantic feature will be described with reference to fig. 3 to 4.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating another embodiment of step S12 of the present application.
In the present embodiment, extracting visual features using video data may include steps S121 to S123.
Step S121: and intercepting a focus area image of the target by using the video data, wherein the focus area comprises the lip of the target.
As shown in fig. 2, in some embodiments, the region-of-interest image is an image centered on a lip (or lip). The types of region of interest images may include, but are not limited to: RGB (color) images, grayscale images, infrared images.
Step S122: and generating a key point mask image according to the lip-shaped key points.
Specifically, the region-of-interest image may be processed to obtain lip-shaped key points, so that a key point mask image (mask image) may be generated from the lip-shaped key points for extracting lip motion features, i.e., visual features.
Step S123: and extracting visual features based on the key point mask image and the attention area image.
In some embodiments, when performing visual feature extraction, it is required that the size of the region of interest image and the key point mask image are the same, for example, 64 × 64 pixels each. In other embodiments, the dimensions of the region of interest image and the keypoint mask image may not be limited. In other embodiments, even if the size of the annotation region image and the key point mask image are not the same or do not match the network input size, the image size can be adjusted as needed.
In some embodiments, the key point mask image and the attention area image may be subjected to preset prediction by using a three-dimensional convolution residual error network, so as to obtain visual features. Specifically, a lip-centered 64x64 color region of interest image and corresponding keypoint mask images may be merged into a 4-channel sequence as the network input. In the embodiment, the visual features have a certain time sequence, and each frame image is in a certain connection with the adjacent images in front and back, so that the feature change between the adjacent frames can be better integrated and more effective features can be extracted by adopting a neural network and adopting a 3D convolution and residual network structure (such as Conv3D + resnet 18).
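The following is a rough PyTorch sketch of one common "Conv3D + ResNet18"-style visual front end; the exact layer sizes, the per-frame 2D ResNet-18 trunk, and the 512-dimensional output are assumptions for illustration only, since the description names only the overall structure.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class VisualEncoder(nn.Module):
    """3D-conv front end followed by a per-frame ResNet-18 trunk (sketch)."""

    def __init__(self):
        super().__init__()
        # 3D conv over the 4-channel input (RGB region of interest + key-point mask).
        self.frontend = nn.Sequential(
            nn.Conv3d(4, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        trunk = resnet18()
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        trunk.fc = nn.Identity()               # keep the 512-d pooled feature
        self.trunk = trunk

    def forward(self, x):                      # x: (B, 4, T, 64, 64)
        b, _, t, _, _ = x.shape
        y = self.frontend(x)                   # (B, 64, T, H', W')
        y = y.transpose(1, 2).flatten(0, 1)    # (B*T, 64, H', W')
        y = self.trunk(y)                      # (B*T, 512)
        return y.view(b, t, -1)                # frame-level visual features (B, T, 512)


clips = torch.randn(2, 4, 25, 64, 64)          # 25 frames of 64x64 ROI + mask channels
print(VisualEncoder()(clips).shape)            # torch.Size([2, 25, 512])
```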
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a step S12 according to another embodiment of the present application.
In this embodiment, extracting semantic features and speech features from the audio data may include steps S124 to S126. It should be noted that steps S121 to S123 and steps S124 to S126 have no fixed order relative to each other and may be executed in parallel or sequentially.
Step S124: extracting frequency-domain features from the audio data, and performing a short-time Fourier transform on the audio data to obtain a mixed speech power spectrum.
In one aspect, a predetermined number of dimensions (e.g., 40 dimensions) of frequency domain features may be extracted using the audio data as input to a semantic extraction network for extraction of semantic features. The preset number can be set according to the requirement of the semantic extraction network.
On the other hand, a mixed speech power spectrum and a mixed speech phase can be obtained by performing Short-Time Fourier Transform (STFT) on the audio data. Wherein, the mixed voice power spectrum is used for extracting the voice characteristics. The mixed speech phase is used for the subsequent prediction of the target speech phase, which can be seen in the following embodiments.
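A small sketch of this STFT step is given below. It assumes 16 kHz audio and uses the 25 ms window / 10 ms hop mentioned later in this description; the file name and the use of librosa are illustrative assumptions, not part of the patent.

```python
import numpy as np
import librosa


def stft_features(wav, sr=16000, win_ms=25, hop_ms=10):
    n_fft = int(sr * win_ms / 1000)            # 400 samples = 25 ms
    hop = int(sr * hop_ms / 1000)              # 160 samples = 10 ms
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop, win_length=n_fft)
    power = np.abs(spec) ** 2                  # mixed speech power spectrum
    phase = np.angle(spec)                     # mixed speech phase
    return power, phase


wav, sr = librosa.load("mixed.wav", sr=16000)  # hypothetical noisy recording
power_spec, mixed_phase = stft_features(wav, sr)
```

The power spectrum feeds the enhancement network (step S126), while the phase is kept for the later reconstruction and phase prediction steps.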
Step S125: and processing the frequency domain features by utilizing a semantic extraction network to obtain semantic features.
In the semantic features, components with identification degrees in the audio data need to be extracted, and influences of noise and the like are removed. Wherein, the speech feature output by the semantic extraction network is the frame level.
Step S126: and processing the mixed voice power spectrum by utilizing an enhanced network to obtain voice characteristics.
Since the target voice needs to be restored by voice enhancement, the present embodiment inputs the pre-trained enhancement network by using the mixed voice power spectrum as the initial voice feature, so as to perform enhancement processing on the initial voice feature through the enhancement network, thereby obtaining the voice feature.
In some embodiments, the frame length and the step size of the speech feature and the semantic feature are the same, for example, the frame length is 25ms, and the step size is 10 ms; the visual feature is, for example, a frame length of 40ms, and in order to keep consistent with the speech feature, the visual feature and the speech feature may be aligned by using bilinear interpolation. Therefore, the visual feature, the semantic feature and the voice feature can be conveniently subjected to feature fusion subsequently.
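A sketch of this alignment step is shown below, assuming visual features at 25 fps (40 ms per frame) and audio features at a 10 ms hop. The description mentions bilinear interpolation; for a one-dimensional feature sequence this sketch simplifies it to linear interpolation along the time axis, and the feature dimension of 512 is an assumption.

```python
import torch
import torch.nn.functional as F


def align_visual(visual, num_audio_frames):
    """visual: (B, T_video, D) -> (B, num_audio_frames, D)."""
    x = visual.transpose(1, 2)                 # (B, D, T_video)
    x = F.interpolate(x, size=num_audio_frames, mode="linear", align_corners=False)
    return x.transpose(1, 2)


visual = torch.randn(2, 25, 512)               # 1 s of video at 25 fps
aligned = align_visual(visual, 100)            # match 1 s of audio at a 10 ms hop
print(aligned.shape)                           # torch.Size([2, 100, 512])
```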
In this embodiment, the semantic extraction network and the enhancement network may be obtained by training Deep Neural Networks (DNNs), and in other examples, other types of Neural Networks may also be used.
Referring to fig. 5 to 6, fig. 5 is a schematic flowchart of another embodiment of step S13 in fig. 1, and fig. 6 is another flowchart of another embodiment of step S13 in fig. 1.
In the present embodiment, step S13 may include sub-steps S131 to S133.
Step S131: and combining the visual features and the semantic features to obtain the auxiliary features.
Because the visual features enhance the effect wisely when the noise is large, the semantic features can keep clean voice undistorted when the noise is small, and in order to effectively fuse the advantages of the two features, the two features are combined in the time dimension to obtain an auxiliary feature (denoted as F).
Step S132: and fusing the auxiliary characteristic and the voice characteristic to obtain an enhanced characteristic.
Since the present application relates to a multi-modal speech enhancement scheme, the selection of the feature fusion mode between different modalities also has an influence on the final speech enhancement effect. In this regard, the present embodiment also provides a feature fusion method suitable for the method, that is, fusion is performed by using an attention mechanism.
After the assist features are obtained, the assist features and the speech features can be fused using an attention mechanism to obtain enhanced features. Specifically, as shown in fig. 6, the assist features and the speech features may be input into a transform model to fuse them based on a Multi-head Attention Mechanism (MHA). Corresponding features can be extracted from the voice features according to the feature keys extracted from the auxiliary features, and then enhanced features, namely the features of clean voice, are obtained after three layers of MHA.
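Below is a rough sketch of such a three-layer cross-attention fusion. The exact query/key/value arrangement, feature dimensions, and residual/normalization choices are assumptions (here the auxiliary features act as the query stream that pulls the corresponding components out of the speech features); the patent only specifies fusion by multi-head attention over three layers.

```python
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, speech):
        # Attend into the speech features, guided by the query stream.
        out, _ = self.attn(query=query, key=speech, value=speech)
        return self.norm(query + out)          # residual connection


class Fusion(nn.Module):
    def __init__(self, dim=256, layers=3):
        super().__init__()
        self.blocks = nn.ModuleList([FusionBlock(dim) for _ in range(layers)])

    def forward(self, aux, speech):
        stream = aux                           # auxiliary features guide the extraction
        for block in self.blocks:
            stream = block(stream, speech)
        return stream                          # enhanced features


aux = torch.randn(2, 100, 256)                 # auxiliary features (B, T, D)
speech = torch.randn(2, 100, 256)              # speech features (B, T, D)
print(Fusion()(aux, speech).shape)             # torch.Size([2, 100, 256])
```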
Step S133: based on the enhancement features, enhanced audio data is obtained.
Since the enhanced features obtained after feature fusion have removed noise, the enhanced features can be directly used as enhanced audio data in this embodiment.
However, in other embodiments, the enhancement features may be further post-processed in order to strengthen the connection between adjacent frames and avoid speech distortion problems.
Referring to fig. 7, fig. 7 is a schematic flowchart illustrating another embodiment of step S133 in fig. 5.
In this embodiment, step S133 may further include substeps S1331 to S1333.
Step S1331: and extracting to obtain a target voice power spectrum based on the enhanced features.
Specifically, the enhanced features may be processed by using a Long Short-Term Memory neural network (LSTM) to obtain a target speech power spectrum. The Long-Short Term Memory neural network may be, but is not limited to, any one of a Bi-directional Long-Short Term Memory neural network (BilSTM) and a unidirectional Long-Short Term Memory neural network (unidirectional LSTM). Wherein, the BilSTM is formed by combining a forward LSTM and a backward LSTM.
In some embodiments, the enhancement features may be processed using BilSTM to obtain a target speech power spectrum. The inventor of the application finds that the speech enhancement result of the BilSTM is obviously improved compared with the speech enhancement result of the unidirectional LSTM in experiments.
Step S1332: and performing preset operation on the target voice power spectrum and the mixed voice power spectrum to obtain a target voice logarithmic power spectrum, wherein the mixed voice power spectrum is obtained by short-time Fourier transform of original audio data.
Alternatively, the preset operation may be a point product (inner product, number product of vectors) or other mathematical operation method capable of playing a similar role.
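A sketch of steps S1331 and S1332 is shown below. Following the fig. 8 description, the BiLSTM output is treated as a mask that is dot-multiplied with the mixed speech power spectrum before taking the logarithm; the layer sizes, the sigmoid activation, and the small constant added before the log are assumptions.

```python
import torch
import torch.nn as nn


class MaskEstimator(nn.Module):
    def __init__(self, feat_dim=256, freq_bins=201, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, freq_bins)

    def forward(self, enhanced_feat, mixed_power):
        h, _ = self.blstm(enhanced_feat)        # (B, T, 2*hidden)
        mask = torch.sigmoid(self.proj(h))      # target speech power spectrum / mask
        target_power = mask * mixed_power       # point-wise (dot) multiplication
        return torch.log(target_power + 1e-8)   # target speech log power spectrum


enhanced_feat = torch.randn(2, 100, 256)
mixed_power = torch.rand(2, 100, 201)           # |STFT|^2, (B, T, F)
log_power = MaskEstimator()(enhanced_feat, mixed_power)
print(log_power.shape)                          # torch.Size([2, 100, 201])
```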
Step S1333: and performing preset transformation on the target voice logarithmic power spectrum based on the target voice phase to obtain enhanced audio data.
Alternatively, the preset Transform is an Inverse Short-Time Fourier Transform (iSTFT), but is not limited thereto.
After short-time Fourier transform, the audio data are divided into two parts, namely a power spectrum and a phase, wherein the phase also has influence on the voice enhancement effect. In some embodiments, a mixed voice phase obtained by short-time fourier transform of the original audio data may be used as the target voice phase, and inverse short-time fourier transform may be performed in combination with the target voice log power spectrum.
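The following sketch illustrates this reconstruction when the mixed speech phase is reused as the target speech phase: the log power spectrum is turned back into a magnitude, combined with the phase, and inverted with an inverse STFT. Window settings mirror the earlier STFT sketch and, like the array shapes (F, T), are assumptions.

```python
import numpy as np
import librosa


def reconstruct(log_power, phase, hop=160, win=400):
    """log_power, phase: numpy arrays of shape (F, T) -> enhanced waveform."""
    magnitude = np.sqrt(np.exp(log_power))      # power -> magnitude
    spec = magnitude * np.exp(1j * phase)       # complex spectrum
    return librosa.istft(spec, hop_length=hop, win_length=win)


# enhanced_wav = reconstruct(target_log_power, mixed_phase)
```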
In other embodiments, the phase corresponding to the target speech power spectrum may also be predicted. For example, before step S1333, the method may further include: performing preset processing on the mixed speech phase and the target speech log power spectrum with a phase prediction network to obtain the target speech phase, where the mixed speech phase is obtained by short-time Fourier transform of the original audio data. Specifically, the mixed speech phase and the target speech log power spectrum are stacked along the channel dimension and used as the input of the phase prediction network, which outputs the deviation between the target speech phase and the mixed speech phase; this deviation is then added to the mixed speech phase to obtain the target speech phase. Using the target speech phase in this way can further improve the speech enhancement effect.
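A sketch of this phase prediction step follows: the two inputs are stacked along the channel dimension, a small network predicts the deviation from the mixed phase, and the deviation is added back. The convolutional architecture and channel count are assumptions for illustration; the patent does not specify the network structure.

```python
import torch
import torch.nn as nn


class PhasePredictor(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, mixed_phase, target_log_power):
        # Stack along the channel dimension: (B, 2, T, F).
        x = torch.stack([mixed_phase, target_log_power], dim=1)
        deviation = self.net(x).squeeze(1)       # predicted phase deviation (B, T, F)
        return mixed_phase + deviation           # target speech phase


mixed_phase = torch.randn(2, 100, 201)
target_log_power = torch.randn(2, 100, 201)
print(PhasePredictor()(mixed_phase, target_log_power).shape)  # torch.Size([2, 100, 201])
```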
In some embodiments, to better predict the target speech phase, a phase-related loss function may also be added when training the phase prediction network. Specifically, the phase prediction network may be trained with a phase loss function, and the weight corresponding to the phase loss function may be increased for target points in the target speech log power spectrum whose value is greater than a preset threshold. The preset threshold can be set according to the actual situation. Increasing the phase weight of the points with larger power-spectrum values makes the phase prediction of these important points more accurate.
In some embodiments, the phase loss function can be formulated as follows:

L_{phase} = \sum_{t,f} \hat{S}_{tf} \cdot \left| \hat{\phi}_{tf} - \phi_{tf} \right|    (1)

In formula (1), L_{phase} is the phase loss value, \hat{S}_{tf} is the target speech log power spectrum, \hat{\phi}_{tf} is the target speech phase, and \phi_{tf} is the predicted speech phase. The power spectrum is treated as a picture of dimension T × F, and tf denotes the power-spectrum index. Each sampling point in the target speech log power spectrum can be given a corresponding phase weight, and increasing the phase weight of the sampling points with larger values makes the phase prediction of these important points more accurate.
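A minimal sketch of the weighting idea described above is given below; the exact loss in formula (1) may differ, and the threshold and extra-weight values are assumptions.

```python
import torch


def phase_loss(target_log_power, target_phase, predicted_phase,
               threshold=0.0, boost=2.0):
    """Weighted absolute phase error; points with larger log power get a larger weight."""
    weight = torch.where(target_log_power > threshold,
                         torch.full_like(target_log_power, boost),
                         torch.ones_like(target_log_power))
    return (weight * torch.abs(target_phase - predicted_phase)).mean()
```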
If the loss value does not meet the training stop condition of the model, new samples are selected to continue training the model; if the loss value meets the training stop condition, the currently trained network is used as the phase prediction network for the corresponding service scenario.
Referring to fig. 8, fig. 8 is a flowchart illustrating a speech enhancement method according to another embodiment of the present application.
(1) Acquiring video data and original audio data of the target, wherein the video data is obtained by shooting the target while the original audio data is collected.
(2) Visual features are extracted using the video data, and semantic and speech features are extracted using the raw audio data.
Extracting the visual features includes: processing the video data to obtain a key point mask image and a region-of-interest image, and inputting the key point mask image and the region-of-interest image into a three-dimensional convolutional residual network (including Conv3D + ResNet18) to obtain the visual features.
Extracting the semantic features includes: extracting frequency-domain features from the audio data, inputting the frequency-domain features into a semantic extraction network, and outputting the semantic features.
Extracting the speech features includes: performing a short-time Fourier transform on the audio data to obtain a mixed speech power spectrum and a mixed speech phase, and inputting the mixed speech power spectrum into an enhancement network to obtain the speech features.
(3) Performing speech enhancement processing based on the visual features, the semantic features and the speech features to obtain enhanced audio data.
The visual features and the semantic features are combined along the time dimension to obtain auxiliary features, and the auxiliary features and the speech features are then input into a feature fusion module, where they are fused using a Transformer to obtain enhanced features. In some embodiments, noise has already been removed from the enhanced features, which can therefore be output as the enhanced audio data.
In other embodiments, the enhanced features may also be post-processed to strengthen the association between adjacent frames and avoid speech distortion. This includes the following steps: the enhanced features are input into a BiLSTM to obtain a target speech power spectrum (not shown), and the target speech power spectrum is used as a mask and dot-multiplied with the mixed speech power spectrum to obtain a target speech log power spectrum. Further, the target speech log power spectrum and the mixed speech phase can be input into a phase prediction network to obtain a target speech phase, and finally an inverse short-time Fourier transform is applied to the target speech log power spectrum and the target speech phase to obtain the enhanced audio data.
On this basis, this embodiment also provides a training method. First, the visual extraction network (the three-dimensional convolutional residual network) and the semantic extraction network can be pre-trained. During the overall training, the parameters of the semantic extraction network are kept fixed; because of the differences between modalities, the parameters of the visual extraction network can also be fixed first and then jointly optimized and updated after the whole model has converged, i.e., the visual extraction network is updated only after the model converges, which makes the effect more stable. It should be noted that the speech extraction network (the enhancement network) does not need to be pre-trained here; it is trained together with the whole model. Finally, a loss value (e.g., L2 loss) between the enhanced audio data and the corresponding clean audio data is calculated, and the parameters of the entire model are updated based on this loss value.
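A condensed training-loop sketch of this strategy is given below: the semantic extractor stays frozen, the visual extractor is frozen at first and unfrozen later, and an L2 loss is computed between the enhanced output and the clean audio. The model attributes (semantic_extractor, visual_extractor), the data loader format, the optimizer settings, and the unfreeze epoch are assumptions; the patent ties unfreezing to convergence rather than to a fixed epoch.

```python
import torch
import torch.nn as nn


def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable


def train(model, loader, epochs=10, unfreeze_epoch=8):
    set_trainable(model.semantic_extractor, False)   # fixed throughout
    set_trainable(model.visual_extractor, False)     # fixed until "convergence"
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
    l2 = nn.MSELoss()
    for epoch in range(epochs):
        if epoch == unfreeze_epoch:                  # stand-in for the convergence check
            set_trainable(model.visual_extractor, True)
            opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
        for video, noisy, clean in loader:
            enhanced = model(video, noisy)
            loss = l2(enhanced, clean)               # L2 loss against clean audio
            opt.zero_grad()
            loss.backward()
            opt.step()
```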
Referring to fig. 9, fig. 9 is a schematic block diagram of a structure of a speech enhancement device according to an embodiment of the present application.
In this embodiment, the speech enhancement apparatus 100 may include an obtaining module 110, a feature extracting module 120, and a speech enhancement module 130, where the obtaining module 110 is configured to obtain video data and original audio data of a target, where the video data is obtained by shooting the target when the original audio data is obtained; the feature extraction module 120 is configured to extract visual features using the video data, and extract semantic features and speech features using the original audio data; the speech enhancement module 130 is configured to perform speech enhancement processing based on the visual feature, the semantic feature, and the speech feature to obtain enhanced audio data.
In some embodiments, the speech enhancement module 130 is further configured to combine the visual features and the semantic features to obtain auxiliary features; fusing the auxiliary features and the voice features to obtain enhanced features; based on the enhancement features, enhanced audio data is obtained.
In some embodiments, the speech enhancement module 130 is further configured to fuse the auxiliary features and the speech features using an attention mechanism to obtain enhanced features, and extract a target voice power spectrum based on the enhanced features; perform a preset operation on the target voice power spectrum and the mixed voice power spectrum to obtain a target voice logarithmic power spectrum, wherein the mixed voice power spectrum is obtained by short-time Fourier transform of the original audio data; and perform a preset transformation on the target voice logarithmic power spectrum based on the target voice phase to obtain the enhanced audio data.
In some embodiments, the predetermined operation is a point multiplication, and/or the speech enhancement module 130 is further configured to process the enhanced features using a long-short term memory neural network to obtain a target speech power spectrum.
In some embodiments, the pre-set transform is an inverse short-time fourier transform, and/or the speech enhancement module 130 is further configured to perform a pre-set process on the mixed speech phase and the target speech logarithmic power spectrum by using a phase prediction network to obtain a target speech phase, where the mixed speech phase is obtained by performing short-time fourier transform on the original audio data.
In some embodiments, in training the phase prediction network, further comprising training the phase prediction network with a phase loss function; and increasing the weight value of a target point in the target voice logarithmic power spectrum corresponding to the phase loss function, wherein the value of the target point in the target voice logarithmic power spectrum is greater than a preset threshold value.
In some embodiments, the feature extraction module 120 is further configured to capture a region-of-interest image of the target using the video data, the region of interest including the lip shape of the target; generate a key point mask image according to lip-shaped key points; and extract visual features based on the key point mask image and the region-of-interest image.
In some embodiments, the feature extraction module 120 is further configured to perform preset prediction on the key point mask image and the region-of-interest image by using a three-dimensional convolutional residual network, so as to obtain the visual features.
In some embodiments, the feature extraction module 120 is further configured to extract frequency domain features using the audio data, and perform short-time fourier transform on the audio data to obtain a mixed voice power spectrum; processing the frequency domain features by utilizing a semantic extraction network to obtain semantic features; and processing the mixed voice power spectrum by utilizing an enhanced network to obtain voice characteristics.
In some embodiments, the obtaining module 110 is further configured to remove image frames in the video data that do not include the region of interest, which includes the lip shape of the target, based on the face key point information before extracting the visual features using the video data and extracting the semantic features and the voice features using the original audio data; and/or remove data that is not synchronized in the audio data and the video data.
In this embodiment, the speech enhancement apparatus is used to implement the speech enhancement method in the above embodiments, so that the description of the above steps may be referred to the method embodiment, and will not be described herein again.
Referring to fig. 10, fig. 10 is a schematic block diagram of a structure of an embodiment of an electronic device according to the present application.
In this embodiment, the electronic device 200 may comprise a memory 210 and a processor 220 coupled to each other, the memory 210 being configured to store program data, and the processor 220 being configured to execute the program data to implement the steps in any of the above-mentioned method embodiments. Specifically, the electronic device 200 may include, but is not limited to: desktop computers, notebook computers, servers, mobile phones, tablet computers, and the like, without limitation.
In particular, the processor 220 is configured to control itself and the memory 210 to implement the steps of any of the above-described method embodiments. Processor 220 may also be referred to as a CPU (Central Processing Unit). The processor 220 may be an integrated circuit chip having signal processing capabilities. The Processor 220 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, processor 220 may be commonly implemented by multiple integrated circuit chips.
Referring to fig. 11, fig. 11 is a schematic block diagram of a structure of an embodiment of a computer-readable storage medium according to the present application.
In the embodiment, the computer readable storage medium 300 stores program data 310, and the program data 310 is used for implementing the steps of any of the above method embodiments when being executed by a processor.
The computer-readable storage medium 300 may be a medium that can store a computer program, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or may be a server that stores the computer program, and the server may transmit the stored computer program to another device for operation or may self-operate the stored computer program.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (13)

1. A method of speech enhancement, comprising:
acquiring video data and original audio data of a target, wherein the video data is obtained by shooting the target when the original audio data is acquired;
extracting visual features by using the video data, and extracting semantic features and voice features by using the original audio data;
and performing voice enhancement processing based on the visual features, the semantic features and the voice features to obtain enhanced audio data.
2. The method of claim 1,
the performing speech enhancement processing based on the visual features, the semantic features, and the speech features to obtain enhanced audio data includes:
merging the visual features and the semantic features to obtain auxiliary features;
fusing the auxiliary feature and the voice feature to obtain an enhanced feature;
based on the enhancement features, the enhanced audio data is obtained.
3. The method of claim 2,
the fusing the auxiliary feature and the voice feature to obtain an enhanced feature, including:
fusing the auxiliary features and the voice features by using an attention mechanism to obtain the enhanced features;
the obtaining the enhanced audio data based on the enhanced features comprises:
extracting and obtaining a target voice power spectrum based on the enhanced features;
performing preset operation on the target voice power spectrum and the mixed voice power spectrum to obtain a target voice logarithmic power spectrum, wherein the mixed voice power spectrum is obtained by short-time Fourier transform of the original audio data;
and performing preset transformation on the target voice logarithmic power spectrum based on the target voice phase to obtain enhanced audio data.
4. The method of claim 3, wherein the predetermined operation is dot multiplication;
and/or, the extracting the target voice power spectrum based on the enhanced features comprises:
and processing the enhanced features by using a long-short term memory neural network to obtain a target voice power spectrum.
5. The method of claim 3, wherein the pre-set transform is an inverse short-time Fourier transform;
and/or, the preset transformation is performed on the target voice logarithmic power spectrum based on the target voice phase, and before the enhanced audio data is obtained, the method further comprises:
and presetting the mixed voice phase and the target voice logarithmic power spectrum by using a phase prediction network to obtain a target voice phase, wherein the mixed voice phase is obtained by short-time Fourier transform of the original audio data.
6. The method of claim 5, further comprising:
training the phase prediction network by using a phase loss function;
and increasing the weight value of a target point in the target voice logarithmic power spectrum, which corresponds to the phase loss function, wherein the value of the target point in the target voice logarithmic power spectrum is greater than a preset threshold value.
7. The method of claim 1,
the extracting visual features using the video data includes:
capturing a region-of-interest image of the target by using the video data, wherein the region-of-interest comprises a lip shape of the target;
generating a key point mask image according to the lip-shaped key points;
and extracting the visual features based on the key point mask image and the region-of-interest image.
8. The method of claim 7,
the extracting the visual features based on the key point mask image and the region-of-interest image comprises:
performing preset prediction on the key point mask image and the region-of-interest image by using a three-dimensional convolutional residual network to obtain the visual features.
9. The method of claim 1,
the extracting semantic features and voice features by using the audio data comprises the following steps:
extracting frequency domain characteristics by using the audio data, and carrying out short-time Fourier transform on the audio data to obtain a mixed voice power spectrum;
processing the frequency domain features by utilizing the semantic extraction network to obtain the semantic features;
and processing the mixed voice power spectrum by utilizing an enhancement network to obtain the voice characteristics.
10. The method of claim 1,
before the extracting visual features by using the video data and extracting semantic features and voice features by using the original audio data, the method further comprises the following steps:
removing image frames which do not contain a region of interest in the video data based on the face key point information, wherein the region of interest comprises the lip shape of the target; and/or
And removing the unsynchronized data in the audio data and the video data.
11. A speech enhancement apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring video data and original audio data of a target, and the video data is obtained by shooting the target when the original audio data is acquired;
the characteristic extraction module is used for extracting visual characteristics by utilizing the video data and extracting semantic characteristics and voice characteristics by utilizing the original audio data;
and the voice enhancement module is used for carrying out voice enhancement processing on the basis of the visual feature, the semantic feature and the voice feature to obtain enhanced audio data.
12. An electronic device, characterized in that the electronic device comprises a memory and a processor coupled to each other, the memory being adapted to store program data and the processor being adapted to execute the program data to implement the method according to any of claims 1-10.
13. A computer-readable storage medium, in which program data are stored which, when being executed by a processor, are adapted to carry out the method according to any one of claims 1-10.
CN202111544776.6A 2021-12-16 2021-12-16 Voice enhancement method and device, electronic equipment and computer readable storage medium Pending CN114333863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111544776.6A CN114333863A (en) 2021-12-16 2021-12-16 Voice enhancement method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111544776.6A CN114333863A (en) 2021-12-16 2021-12-16 Voice enhancement method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114333863A true CN114333863A (en) 2022-04-12

Family

ID=81052603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111544776.6A Pending CN114333863A (en) 2021-12-16 2021-12-16 Voice enhancement method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114333863A (en)

Similar Documents

Publication Publication Date Title
CN110084775B (en) Image processing method and device, electronic equipment and storage medium
Lorenzo-Trueba et al. Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
US20210319809A1 (en) Method, system, medium, and smart device for cutting video using video content
CN110970057B (en) Sound processing method, device and equipment
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN110097890B (en) Voice processing method and device for voice processing
CN112786052B (en) Speech recognition method, electronic equipment and storage device
WO2017195775A1 (en) Sign language conversation assistance system
CN112565885B (en) Video segmentation method, system, device and storage medium
EP4207195A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
Ivanko et al. Multimodal speech recognition: increasing accuracy using high speed video data
CN114330631A (en) Digital human generation method, device, equipment and storage medium
CN107871494A (en) The method, apparatus and electronic equipment of a kind of phonetic synthesis
CN116129931B (en) Audio-visual combined voice separation model building method and voice separation method
Huang et al. Audio-visual speech recognition using an infrared headset
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
US11842745B2 (en) Method, system, and computer-readable medium for purifying voice using depth information
US9576587B2 (en) Example-based cross-modal denoising
EP4280211A1 (en) Sound signal processing method and electronic device
Jha et al. Cross-language speech dependent lip-synchronization
Wang et al. Improving ultrasound-based multimodal speech recognition with predictive features from representation learning
CN114333863A (en) Voice enhancement method and device, electronic equipment and computer readable storage medium
CN113223553B (en) Method, apparatus and medium for separating voice signal
Anderson et al. Robust tri-modal automatic speech recognition for consumer applications

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230522

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.