CN116072136A - Speech enhancement method, electronic device, storage medium and chip system


Info

Publication number
CN116072136A
CN116072136A (application number CN202111279132.9A)
Authority
CN
China
Prior art keywords
user
audio data
face
sound
voice
Prior art date
Legal status
Pending
Application number
CN202111279132.9A
Other languages
Chinese (zh)
Inventor
林泽一
刘恒
李力骏
李志刚
Current Assignee
Huawei Device Co Ltd
Original Assignee
Huawei Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Device Co Ltd
Priority to application CN202111279132.9A
Publication of CN116072136A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques for comparison or discrimination, for processing of video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems

Abstract

The application relates to the field of audio technology and provides a speech enhancement method, an electronic device, a storage medium, and a chip system. The method comprises the following steps: collecting a first face image of a first user; if the first face image does not match the stored face data, acquiring sound features of the first user; storing first face data of the first user; collecting a second face image and first audio data of the first user; and if the second face image matches the stored first face data, enhancing the sound of the first user in the first audio data based on the sound features of the first user and then outputting the result. Collecting the second face image improves the accuracy of identifying the first user, and combining the sound features of the first user improves the accuracy of enhancing the first user's sound in the first audio data.

Description

Speech enhancement method, electronic device, storage medium and chip system
Technical Field
The present disclosure relates to the field of audio technologies, and in particular, to a voice enhancement method, an electronic device, a storage medium, and a chip system.
Background
With the continuous development of terminal devices, their functions keep increasing, and so do the scenarios in which a terminal device needs to use audio and video call functions.
During an audio or video call, the terminal device collects the sound produced by the user in the current scene to obtain audio data. The current scene may contain various sounds, so the collected audio data includes, besides the user's voice, other sounds in the scene, i.e. noise. To improve call quality, the terminal device may process the audio data, for example by denoising it or by enhancing a specific sound (such as the user's voice). However, current terminal devices do not process audio data accurately enough, and this needs to be improved.
Disclosure of Invention
The application provides a speech enhancement method, an electronic device, a storage medium, and a chip system, which solve the prior-art problem that, when a terminal device optimizes audio data, the enhancement of the user's voice is not accurate enough.
To achieve the above purpose, the present application adopts the following technical solutions:
in a first aspect, a method of speech enhancement is provided, the method comprising:
Collecting a first face image of a first user;
if the first face image is not matched with the stored face data, acquiring the sound characteristics of the first user;
storing first face data of the first user;
collecting a second face image and first audio data of the first user;
and if the second face image is matched with the stored first face data, enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user and outputting the enhanced sound.
By collecting the second face image and the first audio data, the terminal device can determine the first user who is speaking in the current scene, and can enhance the sound of the first user in the first audio data in combination with the first user's sound features already acquired by the device. Collecting the second face image improves the accuracy of identifying the first user, and combining the first user's sound features improves the accuracy of enhancing the first user's sound in the first audio data.
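For illustration only, this decision flow can be sketched as follows. The helper names (enhance_fn, learn_profile_fn), the cosine-similarity matching, and the threshold are assumptions made for the sketch, not details fixed by this disclosure.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def process_frame(face_feat, audio_frame, face_db, voice_profiles,
                  enhance_fn, learn_profile_fn, threshold=0.8):
    """Sketch of the first-aspect flow. face_feat is a face embedding for the
    current frame; enhance_fn and learn_profile_fn stand in for the speech
    enhancement network and are supplied by the caller."""
    # Try to match the collected face image against stored face data.
    match_id = None
    for uid, stored in face_db.items():
        if cosine_sim(face_feat, stored) >= threshold:
            match_id = uid
            break
    if match_id is None:
        # No match: acquire the user's sound features and store the face data.
        new_id = len(face_db)
        face_db[new_id] = face_feat
        voice_profiles[new_id] = learn_profile_fn(audio_frame, face_feat)
        return audio_frame                      # output without enhancement for now
    # Match found: enhance this user's voice based on the stored sound features.
    return enhance_fn(audio_frame, voice_profiles[match_id])
```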
In a first possible implementation manner of the first aspect, the method further includes:
collecting second audio data of the first user;
And if the first face image is not matched with the stored face data, outputting third audio data according to the second audio data, wherein the sound of the first user in the third audio data is not enhanced.
When it is determined that the speech enhancement network has not learned the sound features of the first user, the terminal device does not denoise the collected second audio data through the speech enhancement network. This avoids erroneous silencing of the second audio data by the speech enhancement network and improves the reliability of the terminal device.
With reference to any one of the foregoing possible implementation manners of the first aspect, in a second possible implementation manner of the first aspect, if the second face image matches the stored first face data, the enhancing the sound of the first user in the first audio data based on the sound feature of the first user includes:
detecting whether movement of the lips of the first user occurs;
and if the second face image is matched with the stored first face data and the lips of the first user move, enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user and outputting the enhanced sound.
Before enhancing the first audio data, the terminal device may determine whether the user's lips are moving. If the lips are not moving, the user is not currently speaking, and the first user's voice in the first audio data does not need to be enhanced. If the lips are moving, the user is currently speaking; from the dynamic change of the user's lips, what the user is saying can be determined through lip-reading technology, and combining this with the acquired sound features of the first user further improves the accuracy of enhancing the user's voice in the audio data.
Based on the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the detecting whether the lip of the first user moves includes:
acquiring a first lip movement sequence image of the first user;
detecting whether a lip in the first lip movement sequence image moves or not;
the step of outputting the first user's voice in the first audio data after enhancing the voice based on the voice characteristics of the first user includes:
and according to the first lip movement sequence image, the second face image and the first audio data, enhancing the sound of the first user in the first audio data through the voice enhancement network and outputting the enhanced sound.
By detecting whether the first user's lips are moving, the first user's voice in the first audio data can be enhanced in combination with the first lip movement sequence image when the lips are moving, i.e. when the user is speaking, which improves the accuracy of enhancing the first user's voice.
With reference to any one of the foregoing possible implementation manners of the first aspect, in a fourth possible implementation manner of the first aspect, after the enhancing the sound of the first user in the first audio data based on the sound feature of the first user, the method further includes:
acquiring enhanced first audio data;
detecting whether the enhanced first audio data has a silencing phenomenon or not;
if the enhanced first audio data has the silencing phenomenon, acquiring the sound characteristics of the first user again;
if the enhanced first audio data does not have the silencing phenomenon, continuing to enhance the sound of the first user in the re-acquired audio data based on the sound characteristics of the first user and outputting the enhanced sound.
By detecting whether the enhanced first audio data has the silencing phenomenon or not, the reliability of the terminal equipment for enhancing the sound of the first user can be improved.
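The disclosure does not fix how the silencing phenomenon is detected; as one hedged illustration, a simple energy-based check could compare frame energies before and after enhancement. The frame length and thresholds below are arbitrary assumptions.

```python
import numpy as np

def detect_silencing(raw_audio, enhanced_audio, frame_len=1024,
                     speech_thresh=1e-3, mute_ratio=0.1):
    """Illustrative check: flag frames where the raw signal clearly carries
    energy but the enhanced signal is almost silent. Thresholds are arbitrary
    and assume audio samples normalised to [-1, 1]."""
    n = min(len(raw_audio), len(enhanced_audio)) // frame_len
    muted = 0
    voiced = 0
    for i in range(n):
        raw = raw_audio[i * frame_len:(i + 1) * frame_len]
        enh = enhanced_audio[i * frame_len:(i + 1) * frame_len]
        raw_e = float(np.mean(raw ** 2))
        enh_e = float(np.mean(enh ** 2))
        if raw_e > speech_thresh:
            voiced += 1
            if enh_e < mute_ratio * raw_e:
                muted += 1          # energy was present but got suppressed
    return voiced > 0 and muted / voiced > 0.5   # silencing phenomenon suspected
```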
In a fifth possible implementation manner of the first aspect, the acquiring the sound feature of the first user includes:
acquiring fourth audio data of the first user and a first sequence of images, the first sequence of images comprising: face information and lip information;
and acquiring sound characteristics of the first user based on the first sequence of images and the fourth audio data.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, the acquiring a sound feature of the first user includes:
the sound characteristics of the first user are learned through a voice enhanced network.
Based on the sixth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the learning, by the voice enhancement network, the voice feature of the first user includes:
acquiring a third face image and a second lip movement sequence image according to the first sequence image;
and inputting the third face image, the second lip movement sequence image and the fourth audio data into the voice enhancement network, and learning the sound characteristics of the first user through the voice enhancement network.
In an eighth possible implementation manner of the first aspect, before the inputting of the third face image, the second lip movement sequence image, and the fourth audio data into the speech enhancement network and the learning of the sound features of the first user through the speech enhancement network, the method further includes:
determining whether a lip of the first user moves according to the second lip moving sequence image;
determining whether the current scene is a quiet environment according to the fourth audio data;
the inputting the third face image, the second lip motion sequence image, and the fourth audio data into the voice enhancement network, learning the sound features of the first user through the voice enhancement network, comprising:
and if the current scene is a quiet environment and the lips of the first user move, inputting the third face image, the second lip movement sequence image and the fourth audio data into the voice enhancement network, and learning the voice characteristics of the first user through the voice enhancement network.
By determining whether the second lip movement sequence image and the fourth audio data acquired by the terminal device satisfy the conditions for learning the first user's sound features, and learning those features from the face sequence image and the fourth audio data only when the conditions are satisfied, the accuracy of the learned sound features of the first user can be improved.
Based on the eighth possible implementation manner of the first aspect, in a ninth possible implementation manner of the first aspect, the determining, according to the fourth audio data, whether the current scene is a quiet environment includes:
inputting the fourth audio data into the voice enhancement network to obtain first denoising data;
comparing the fourth audio data with the first denoising data;
if the similarity between the fourth audio data and the first denoising data is greater than or equal to a similarity threshold value, determining that the current scene is a quiet environment;
and if the similarity between the fourth audio data and the first denoising data is smaller than the similarity threshold value, determining that the current scene is not a quiet environment.
By collecting the fourth audio data in a quiet environment, the collected fourth audio data can be used as label data for training the speech enhancement network. This simplifies the training process, improves training efficiency, and shortens the time before the terminal device can invoke the speech enhancement network.
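As a hedged sketch of the quiet-environment test in the ninth implementation manner, the denoised output can be compared with the original capture. The normalised cross-correlation used here and the threshold value are illustrative assumptions, with denoise_fn standing in for the speech enhancement network.

```python
import numpy as np

def is_quiet_environment(audio, denoise_fn, sim_threshold=0.95):
    """Denoise the captured audio and compare it with the original;
    high similarity means denoising removed little, i.e. the scene is quiet."""
    denoised = denoise_fn(audio)
    a = audio - np.mean(audio)
    b = denoised - np.mean(denoised)
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sim >= sim_threshold
```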
In a tenth possible implementation manner of the first aspect, based on any one of the seventh to ninth possible implementation manners of the first aspect, the inputting the third face image, the second lip-moving sequence image, and the fourth audio data into the voice enhancement network, learning, by the voice enhancement network, a sound feature of the first user includes:
Mixing the fourth audio data with pre-stored noise data to obtain mixed audio data;
inputting the mixed sound data, the third face image and the second lip movement sequence image into the voice enhancement network to obtain second denoising data;
and adjusting the voice enhancement network according to the second denoising data and the fourth audio data so that the voice enhancement network learns to obtain the sound characteristics of the first user.
Through continuous training, different second denoising data are obtained and compared with the collected clean audio data, and the speech enhancement network is adjusted according to the comparison result until it has learned the sound features of the first user, which further improves the accuracy of those features.
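A minimal sketch of this online training step, assuming a PyTorch-style network that maps (mixed audio, face image, lip sequence) to denoised audio; the additive mixing and the MSE objective are illustrative stand-ins, not the patent's exact loss.

```python
import torch

def online_finetune(net, clean_audio, noise_bank, face_img, lip_seq,
                    steps=50, lr=1e-4):
    """`net` is a caller-supplied torch.nn.Module; `noise_bank` is a list of
    pre-stored noise tensors, each assumed at least as long as clean_audio."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for step in range(steps):
        # Mix the clean (fourth) audio with a pre-stored noise clip.
        noise = noise_bank[step % len(noise_bank)]
        mixed = clean_audio + noise[..., : clean_audio.shape[-1]]
        denoised = net(mixed, face_img, lip_seq)     # second denoising data
        loss = loss_fn(denoised, clean_audio)        # compare with the clean audio
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```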
In a second aspect, there is provided another speech enhancement method, the method comprising:
acquiring first audio data and a first sequence of images of a first user, the first sequence of images comprising: face information and lip information of a first user;
if the face information of the first user is determined to be matched with the stored face data according to the first sequence image and the lips of the first user move, the sound of the first user in the first audio data is enhanced according to the sound characteristics of the first user and then output.
In a first possible implementation manner of the second aspect, the method further includes:
and if the face information of the first user is not matched with the stored face data according to the first sequence image, or if the lips of the first user are not moved according to the first sequence image, outputting second audio data according to the first audio data, wherein the sound of the first user in the second audio data is not enhanced.
With reference to any one of the foregoing possible implementation manners of the second aspect, in a second possible implementation manner of the second aspect, the enhancing, by the sound feature of the first user, the sound of the first user in the first audio data, includes:
according to the first sequence image, a first face image and a first lip movement sequence image are extracted;
and according to the first audio data, the first face image and the first lip movement sequence image, enhancing the sound of the first user in the first audio data through the voice enhancement network and outputting the enhanced sound.
In a third possible implementation manner of the second aspect, before the enhancing the sound of the first user in the first audio data according to the sound feature of the first user, the method further includes:
Extracting the first sequence image to obtain a first lip movement sequence image;
and judging whether the lips of the first user move or not according to the first lip movement sequence image.
In a fourth possible implementation manner of the second aspect, before the enhancing the sound of the first user in the first audio data according to the sound feature of the first user, the method further includes:
extracting the first sequence image to obtain first face information;
and traversing each stored face data, and determining whether each stored face data comprises face data matched with the first face information.
With reference to any one of the foregoing possible implementation manners of the second aspect, in a fifth possible implementation manner of the second aspect, after the enhancing, according to the sound feature of the first user, the sound of the first user in the first audio data is output, the method further includes:
acquiring enhanced first audio data;
detecting whether the enhanced first audio data has a silencing phenomenon or not;
if the enhanced first audio data has the silencing phenomenon, acquiring the sound characteristics of the first user again;
If the enhanced first audio data does not have the silencing phenomenon, continuing to enhance the sound of the first user in the re-acquired audio data based on the sound characteristics of the first user and outputting the enhanced sound.
With reference to any one of the foregoing possible implementation manners of the second aspect, in a sixth possible implementation manner of the second aspect, the acquiring the first audio data and the first sequence image of the first user includes:
the first audio data and the first sequence of images of the first user are acquired in response to an operation for invoking a speech enhancement network.
In a third aspect, a speech enhancement apparatus is provided, the apparatus comprising:
the acquisition module is used for acquiring a first face image of the first user;
the first acquisition module is used for acquiring the sound characteristics of the first user if the first face image is not matched with the stored face data;
the storage module is used for storing the first face data of the first user;
the acquisition module is also used for acquiring a second face image and first audio data of the first user;
and the output module is used for outputting the sound of the first user in the first audio data after enhancing the sound characteristic of the first user if the second face image is matched with the stored first face data.
In a first possible implementation manner of the third aspect, the acquisition module is further configured to acquire second audio data of the first user;
and the output module is further configured to output third audio data according to the second audio data if the first face image is not matched with the stored face data, where the sound of the first user in the third audio data is not enhanced.
With reference to any one of the foregoing possible implementation manners of the third aspect, in a second possible implementation manner of the third aspect, the output module is specifically configured to detect whether a lip of the first user moves; and if the second face image is matched with the stored first face data and the lips of the first user move, enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user and outputting the enhanced sound.
With reference to the second possible implementation manner of the third aspect, in a third possible implementation manner of the third aspect, the output module is specifically configured to obtain a first lip motion sequence image of the first user; detecting whether a lip in the first lip movement sequence image moves or not; the step of outputting the first user's voice in the first audio data after enhancing the voice based on the voice characteristics of the first user includes: and according to the first lip movement sequence image, the second face image and the first audio data, enhancing the sound of the first user in the first audio data through the voice enhancement network and outputting the enhanced sound.
With reference to any one of the foregoing possible implementation manners of the third aspect, in a fourth possible implementation manner of the third aspect, the apparatus further includes:
the second acquisition module is used for acquiring the enhanced first audio data;
the detection module is used for detecting whether the enhanced first audio data has a silencing phenomenon or not;
the first obtaining module is further configured to obtain, if the enhanced first audio data has a silencing phenomenon, a sound feature of the first user again;
and the output module is further used for continuing to enhance the sound of the first user in the re-collected audio data based on the sound characteristics of the first user if the enhanced first audio data does not have the silencing phenomenon, and outputting the enhanced sound.
In a fifth possible implementation manner of the third aspect, the first acquiring module is specifically configured to acquire fourth audio data of the first user and a first sequence of images, where the first sequence of images includes: face information and lip information; and acquiring sound characteristics of the first user based on the first sequence of images and the fourth audio data.
With reference to the fifth possible implementation manner of the third aspect, in a sixth possible implementation manner of the third aspect, the first obtaining module is further specifically configured to learn, through a voice enhancement network, a voice feature of the first user.
With reference to the sixth possible implementation manner of the third aspect, in a seventh possible implementation manner of the third aspect, the first obtaining module is further specifically configured to obtain a third face image and a second lip motion sequence image according to the first sequence image; and inputting the third face image, the second lip movement sequence image and the fourth audio data into the voice enhancement network, and learning the sound characteristics of the first user through the voice enhancement network.
With reference to the seventh possible implementation manner of the third aspect, in an eighth possible implementation manner of the third aspect, the apparatus further includes:
the first judging module is used for determining whether the lips of the first user move according to the second lip movement sequence image;
the second judging module is used for determining whether the current scene is a quiet environment according to the fourth audio data;
the first obtaining module is further specifically configured to input the third face image, the second lip movement sequence image, and the fourth audio data into the voice enhancement network if the current scene is a quiet environment and the lips of the first user are moving, and learn the voice characteristics of the first user through the voice enhancement network.
Based on the eighth possible implementation manner of the third aspect, in a ninth possible implementation manner of the third aspect, the second judging module is specifically configured to input the fourth audio data into the speech enhancement network to obtain first denoising data; comparing the fourth audio data with the first denoising data; if the similarity between the fourth audio data and the first denoising data is greater than or equal to a similarity threshold value, determining that the current scene is a quiet environment; and if the similarity between the fourth audio data and the first denoising data is smaller than the similarity threshold value, determining that the current scene is not a quiet environment.
With reference to any one of the seventh to ninth possible implementation manners of the third aspect, in a tenth possible implementation manner of the third aspect, the first obtaining module is further specifically configured to mix the fourth audio data with pre-stored noise data to obtain mixed audio data; inputting the mixed sound data, the third face image and the second lip movement sequence image into the voice enhancement network to obtain second denoising data; and adjusting the voice enhancement network according to the second denoising data and the fourth audio data so that the voice enhancement network learns to obtain the sound characteristics of the first user.
In a fourth aspect, there is provided another speech enhancement apparatus, the apparatus comprising:
the system comprises an acquisition module for acquiring first audio data and a first sequence of images of a first user, the first sequence of images comprising: face information and lip information of a first user;
and the output module is used for enhancing the sound of the first user in the first audio data according to the sound characteristics of the first user if the face information of the first user is determined to be matched with the stored face data according to the first sequence image and the lips of the first user move.
In a first possible implementation manner of the fourth aspect, the output module is further configured to output second audio data according to the first audio data if it is determined that the face information of the first user does not match the stored face data according to the first sequence image, or if it is determined that the lips of the first user do not move according to the first sequence image, and sound of the first user in the second audio data is not enhanced.
With reference to any one of the foregoing possible implementation manners of the fourth aspect, in a second possible implementation manner of the fourth aspect, the output module is specifically configured to extract, according to the first sequence image, a first face image and a first lip motion sequence image; and according to the first audio data, the first face image and the first lip movement sequence image, enhancing the sound of the first user in the first audio data through the voice enhancement network and outputting the enhanced sound.
In a third possible implementation manner of the fourth aspect, the apparatus further includes:
the first extraction module is used for extracting the first sequence image to obtain a first lip movement sequence image;
and the first judging module is used for judging whether the lips of the first user move or not according to the first lip movement sequence image.
In a fourth possible implementation manner of the fourth aspect, the apparatus further includes:
the second extraction module is used for extracting the first sequence image to obtain first face information;
and the second judging module is used for traversing each stored face data and determining whether each stored face data comprises the face data matched with the first face information.
With reference to any one of the foregoing possible implementation manners of the fourth aspect, in a fifth possible implementation manner of the fourth aspect, the apparatus further includes:
the first acquisition module is used for acquiring the enhanced first audio data;
the detection module is used for detecting whether the enhanced first audio data has a silencing phenomenon or not;
the second acquisition module is used for acquiring the sound characteristics of the first user again if the enhanced first audio data has the silencing phenomenon;
And the output module is further used for continuing to enhance the sound of the first user in the re-collected audio data based on the sound characteristics of the first user if the enhanced first audio data does not have the silencing phenomenon, and outputting the enhanced sound.
With reference to any one of the foregoing possible implementation manners of the fourth aspect, in a sixth possible implementation manner of the fourth aspect, the acquiring module is specifically configured to acquire the first audio data and the first sequence image of the first user in response to an operation for invoking a voice enhancement network.
In a fifth aspect, a method for enhancing voice is provided, which is applied to a scene of a voice or video call composed of a first terminal device and a second terminal device, and the method includes:
when the first terminal device detects an operation that triggers initiation of a voice or video call to the second terminal device, or the first terminal device receives a voice or video call request initiated by the second terminal device, collecting a first face image of a first user;
if the first face image is not matched with the stored face data, the first terminal equipment acquires the sound characteristics of the first user;
The first terminal device stores first face data of the first user;
the first terminal equipment collects a second face image and first audio data of the first user;
and if the second face image is matched with the stored first face data, the first terminal equipment enhances the sound of the first user in the first audio data based on the sound characteristics of the first user and outputs the enhanced sound to the second terminal equipment.
In a first possible implementation manner of the fifth aspect, the method further includes:
the first terminal equipment collects second audio data of the first user;
if the first face image is not matched with the stored face data, the first terminal device outputs third audio data according to the second audio data, and the sound of the first user in the third audio data is not enhanced.
In a sixth aspect, there is provided an electronic device comprising: a processor for running a computer program stored in a memory to implement the speech enhancement method according to any one of the first or second aspects.
In a seventh aspect, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the speech enhancement method of any one of the first or second aspects.
In an eighth aspect, there is provided a chip system comprising a memory and a processor executing a computer program stored in the memory to implement the speech enhancement method of any one of the first or second aspects.
It will be appreciated that the advantages of the second to eighth aspects may be found in the relevant description of the first aspect, and are not repeated here.
Drawings
Fig. 1A is a schematic diagram of a framework in which a terminal device uses a multi-modal speech enhancement technique to enhance audio data according to an embodiment of the present application;
Fig. 1B is a schematic diagram of a framework in which another terminal device uses a multi-modal speech enhancement technique to enhance audio data according to an embodiment of the present application;
Fig. 2 is a schematic view of a speech enhancement scene according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a terminal device reminding a user to input audio data and face sequence images according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of a speech enhancement method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a framework for online training of a speech enhancement network by a terminal device according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a speech enhancement framework of a speech enhancement network according to an embodiment of the present application;
Fig. 7 is a block diagram of a speech enhancement apparatus according to an embodiment of the present application;
Fig. 8 is a block diagram of another speech enhancement apparatus according to an embodiment of the present application;
Fig. 9 is a block diagram of a further speech enhancement apparatus according to an embodiment of the present application;
Fig. 10 is a block diagram of a further speech enhancement apparatus according to an embodiment of the present application;
Fig. 11 is a block diagram of a further speech enhancement apparatus according to an embodiment of the present application;
Fig. 12 is a block diagram of a speech enhancement apparatus according to an embodiment of the present application;
Fig. 13 is a block diagram of a speech enhancement apparatus according to an embodiment of the present application;
Fig. 14 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known model training methods, speech enhancement techniques, audio mixing techniques, and terminal devices are omitted so as not to obscure the description of the present application with unnecessary detail.
The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms, for example "one or more," unless the context clearly indicates otherwise.
The terminal device can use speech enhancement technology to denoise the collected audio data, obtain denoised audio data, and enhance the sound produced by the user in the audio data. In a multi-person scenario, however, the terminal device cannot accurately recognize and extract the sound of a particular user from the sounds of multiple users. Therefore, the acquired audio data can be denoised using a multi-modal speech enhancement technique in combination with the face sequence images of the multiple users in the scene where the terminal device is located, so as to obtain the sound produced by a particular user.
The face sequence images are image sets formed by sequentially arranging a plurality of face images. For example, the face sequence image of the user may be a set of images obtained by arranging multiple frames of face images corresponding to the user in time sequence in the video data collected by the terminal device.
Referring to fig. 1A, fig. 1A shows a schematic frame diagram of a terminal device for enhancing audio data using a multi-modal speech enhancement technique.
Specifically, after the terminal device collects the audio data, a short-time Fourier transform (STFT) can be performed on the audio data to obtain its magnitude spectrum and phase spectrum, and feature extraction is then performed on the magnitude spectrum through a two-dimensional convolutional neural network (CNN) to obtain speech features. The terminal device can also perform feature extraction on the acquired lip sequence image through a three-dimensional CNN to obtain visual features.
Then, the terminal device may combine the extracted speech features and visual features, and input the combined features into the speech enhancement network to obtain a mask for the audio data, so that an enhanced magnitude spectrum can be obtained from the mask and the magnitude spectrum of the audio data.
Finally, the terminal device may input the enhanced magnitude spectrum and the phase spectrum of the audio data together into another speech enhancement network to obtain an enhanced phase spectrum and a re-enhanced magnitude spectrum, and then apply an inverse short-time Fourier transform (ISTFT) to the enhanced phase spectrum and the re-enhanced magnitude spectrum to obtain the enhanced audio data.
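As a hedged illustration of a Fig. 1A-style pipeline (not the patent's exact networks), the skeleton below uses SciPy for the STFT/ISTFT bookkeeping and treats the 2-D CNN, the 3-D CNN, and the mask-producing enhancement network as caller-supplied functions; for simplicity it applies the mask once and reuses the original phase instead of running the second enhancement network.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_multimodal(audio, lip_seq, audio_cnn, visual_cnn, mask_net,
                       fs=16000, nperseg=512):
    """audio_cnn, visual_cnn and mask_net stand in for the 2-D CNN, the 3-D CNN
    and the speech enhancement network; their signatures are assumptions."""
    # STFT of the captured audio -> magnitude and phase spectra.
    _, _, spec = stft(audio, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(spec), np.angle(spec)
    # Speech features from the magnitude spectrum, visual features from the
    # lip sequence image; the two are assumed to have concatenable shapes.
    speech_feat = audio_cnn(mag)
    visual_feat = visual_cnn(lip_seq)
    fused = np.concatenate([speech_feat, visual_feat], axis=-1)
    # The enhancement network outputs a mask applied to the magnitude spectrum.
    mask = mask_net(fused, mag.shape)
    enhanced_mag = mask * mag
    # Recombine with the (here unmodified) phase and invert the STFT.
    _, out = istft(enhanced_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return out
```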
Referring to fig. 1B, fig. 1B is a schematic diagram illustrating another architecture in which a terminal device enhances audio data using a multi-modal speech enhancement technique.
Specifically, the terminal device may receive input audio data and a face sequence image, perform feature extraction on the audio data to obtain a real part and an imaginary part of the audio data, and perform feature extraction on the real part and the imaginary part through the two-dimensional CNN to obtain a voice feature.
The terminal equipment can also perform feature extraction by combining face sequence images respectively corresponding to a plurality of users through a preset face feature extraction network (facenet), and further extract the extracted features through a two-dimensional CNN network to obtain visual features respectively corresponding to the plurality of users.
Then, the terminal device may combine the speech feature with the visual feature and input the combined feature into the speech enhancement network to obtain masks corresponding to the real and imaginary parts of the audio data; multiplying the masks by the real and imaginary parts yields the enhanced real and imaginary parts of the audio data. Finally, an ISTFT is applied to the enhanced real and imaginary parts to obtain the speech-enhanced audio data.
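Correspondingly, a hedged sketch of a Fig. 1B-style output stage, where the real and imaginary masks (assumed to already match the spectrum's shape) are applied to the complex spectrum before the ISTFT:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_complex_mask(audio, real_mask, imag_mask, fs=16000, nperseg=512):
    """Apply caller-supplied real/imaginary masks to the complex spectrum,
    then invert with an ISTFT to obtain the enhanced audio."""
    _, _, spec = stft(audio, fs=fs, nperseg=nperseg)
    enhanced = real_mask * spec.real + 1j * (imag_mask * spec.imag)
    _, out = istft(enhanced, fs=fs, nperseg=nperseg)
    return out
```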
However, in the multi-modal speech enhancement technique shown in Fig. 1A, the speech enhancement network is not trained on the specific face features of a specific user, so its accuracy when extracting that user's speech is not high. Moreover, because of factors such as illumination and viewing angle, the speech enhancement network may wrongly judge, while analysing the specific user's lip sequence image, that the user is not speaking, so that the specific user is erroneously silenced (for example, an average of 7 seconds of erroneous silencing within a 1-minute period, with the signal-to-noise ratio between the audio data output by the speech enhancement network and the clean audio data reaching 7.4).
In the multi-modal speech enhancement technique shown in Fig. 1B, although information about the specific speaker is encoded, the speech enhancement network is not trained for that speaker and cannot focus on the lips in the face sequence images, which, compared with the technique shown in Fig. 1A, also degrades the speech enhancement effect.
Therefore, this application provides a training method for the speech enhancement network: by acquiring the user's audio data and combining it with the user's face sequence images, the speech enhancement network is trained online in real time. When the terminal device then uses the trained network for denoising and enhancement, the user's lip movements can be accurately identified based on the user's specific face features and corresponding sound features, erroneous silencing is reduced, and the accuracy of the network's speech enhancement is improved.
For example, in a video call scene, the terminal device may determine, according to the collected face image, a target user speaking in the current scene, and in combination with the sound feature of the target user obtained by the terminal device, the terminal device may enhance the sound of the target user in the collected audio data, so that the accuracy of identifying the target user may be improved, and further the accuracy of enhancing the sound of the target user may be improved.
In a multi-person scene, the audio data collected by the terminal device may include the voices of multiple users, while the terminal device needs to enhance the voice of its holder. The terminal device can determine the holder's sound features from the collected face image of the holder, thereby identifying the holder's voice in the collected audio data and enhancing it, which improves the accuracy with which the terminal device enhances the voice of the specified user.
In a noisy environment, if the terminal device is in a video call, the audio data it collects contains more noise, and this interference lowers the clarity of the sound made by the user. The terminal device can determine the user's sound features from the collected face images of the user, so that the user's unclear voice in the audio data can be enhanced according to those features, improving the device's speech enhancement effect for the user.
Of course, the embodiment of the application is not limited to the above scenes, and may be applied to other scenes, such as video conference, webcast, instant messaging, etc., where the application scene of the voice enhancement method is not limited.
The voice enhancement method provided by the embodiment of the application can be applied to various scenes, each scene can comprise a plurality of terminal devices, and each terminal device can train the voice enhancement network. For example, in various scenes such as video call, video conference and live webcast, the terminal device can train the voice enhancement network aiming at specific users in the scenes to obtain the voice enhancement network trained on specific face features, so that the voice enhancement network can recognize sounds made by the specific users more accurately.
Referring to fig. 2, fig. 2 illustrates a speech enhancement scenario, for example, where two terminal devices are engaged in a video call, the speech enhancement scenario may include: the first terminal device 201 and the second terminal device 202 are located in the same network, the holder of the first terminal device 201 is a first user, and the holder of the second terminal device 202 is a second user.
During the video call, the first terminal device 201 may collect a first sequence image of the first user and audio data of the scene where the first terminal device 201 is located, and the second terminal device 202 may collect a second sequence image of the second user and audio data of the scene where the second terminal device 202 is located. If, after acquiring the first sequence image, the first terminal device 201 determines from any face image in the first sequence image that its face library does not include the first face feature of the first user, this indicates that the speech enhancement network has not been trained on the first face feature of the first user; the first terminal device 201 can then learn the first user's sound features in real time from the first user's audio data and the first sequence image, thereby completing online training of the speech enhancement network and improving the device's speech enhancement effect.
The speech enhancement network has been trained for each sample face feature in the face library. For example, after the terminal device learns a user's sound features from that user's face sequence images and finishes training the speech enhancement network, it can add the user's specific face features to the face library. The user's sound features may include the timbre, frequency, voiceprint, and other characteristics of the user's voice; the embodiments of this application do not limit the information included in the sound features.
Moreover, the first sequence image acquired by the first terminal device may include image information and lip movement information. The image information extracted by the first terminal device from the first sequence image may include face images of the first user, and the lip movement information extracted from the first sequence image may be the first user's lip movement sequence image. Furthermore, the face image and the lip movement sequence image can be extracted from the face sequence image.
In addition, face information, face data, and face features can all be used to represent the user's face, and their forms may be the same or different. Face information may itself be face data, or face data may be extracted from face information; for example, if a face image contains the face of the first user, the face shown in the image may be the face information, and the face-related features extracted from the face image may be the face data. The embodiments of the present application are not particularly limited in this respect.
Similarly, if the second terminal device 202 determines that the voice enhancement network of the second terminal device 202 does not include the second face feature of the second user after the second sequence image of the second user is acquired, the second terminal device 202 may perform online training on the voice enhancement network of the second terminal device 202 in a similar manner to the above process.
It should be noted that, when the terminal device needs to perform denoising enhancement through the voice enhancement network for the first time, the terminal device may remind the user to input audio data and a face sequence image in a quiet environment, so that the voice enhancement network of the terminal device may obtain the voice characteristics of the user.
For example, referring to Fig. 3, which shows a schematic diagram of a terminal device reminding a user to input audio data and face sequence images: when the terminal device detects, from an operation triggered by the user, that the speech enhancement network needs to be invoked for the first time, it can display a reminder interface with a reminder message such as "For a better video call experience, please face the phone and record your voice in a quiet environment", reminding the user to input the audio data and the face sequence images so that the user can better experience the speech enhancement function. At the same time, the terminal device can provide a confirm option and a cancel option for the user to choose from. If the terminal device detects that the user triggers the confirm option, it can switch to an audio acquisition interface, display a passage of text in the interface, remind the user to read the text aloud, and provide options such as re-record, confirm, and cancel. Meanwhile, the camera of the terminal device can be invoked to collect face sequence images of the user, so that after the collection is finished, the terminal device can learn the user's sound features from the collected audio data and face sequence images to complete the training of the speech enhancement network. If the terminal device detects that the user triggers the cancel option on the reminder interface, it can switch to the video call interface, denoise and enhance the audio data through a default speech enhancement network, and remind the user to input the audio data and face sequence images again the next time it detects that the speech enhancement network needs to be invoked.
It should be noted that, in the embodiment of the present application, the first face feature, the second face feature, and the specific face feature are all used to indicate the face feature of the user holding the terminal device, that is, the face feature of the user currently using the terminal device, and the description of the face feature of the user holding the terminal device is not limited.
The above example of two terminal devices in a video call illustrates that a terminal device can train, in real time, a speech enhancement network for the user holding it, thereby improving the speech enhancement effect. In practical applications, the terminal device may also apply the speech enhancement network in other scenarios. For example, when multiple terminal devices are in a video conference, a terminal device can enhance the sound of the speaker (the user currently talking) in the audio data by combining the face sequence images corresponding to the multiple users. Or, while one terminal device live-streams to other terminal devices, it can extract the user's voice from the audio data and remove environmental noise, thereby achieving speech enhancement. Or, in a scenario where the user interacts by voice with an in-vehicle head unit, the head unit can collect the sound produced by the user to obtain audio data, extract the user's voice through the speech enhancement network, convert the extracted audio data into text information, and finally execute the instruction corresponding to the text information, i.e. the instruction issued by the user.
Fig. 4 is a schematic flowchart of a speech enhancement method provided in an embodiment of the present application, which may be applied to any of the terminal devices described above. Referring to Fig. 4, by way of example and not limitation, the method includes:
step 401, acquiring specific face features according to the acquired face sequence images.
In a scenario where the terminal device needs to invoke the speech enhancement network to perform speech enhancement on audio data, the terminal device may first collect the audio data and, at the same time, collect a face sequence image of the user currently using the terminal device (hereinafter simply referred to as the user), so that the terminal device can perform feature extraction on the face sequence image through a face feature extraction network to obtain the specific face features corresponding to the user.
Specifically, when the terminal device detects that the speech enhancement network needs to be invoked, it can acquire face sequence images in real time, traverse each image in the acquired sequence in chronological order, and determine whether the images include the user's face, thereby obtaining the face images in the face sequence. When the terminal device detects that any image of the face sequence includes the user's face, it can input that face image into a preset face feature extraction network and extract the specific face features through that network.
Further, after detecting the face image including the face part, the terminal device may continue to collect the face sequence image and continue to identify whether other images in the face sequence image include the face part. If the other images also comprise the face part, the terminal equipment can also input the other face images comprising the face part into a face feature extraction network to obtain specific face features.
In addition, after the terminal device acquires the face sequence image, the lip motion sequence image of the user can be extracted according to the face position in the face sequence image, so that in the subsequent step, the terminal device can determine whether the user is speaking according to the lip motion sequence image, or find the user currently speaking from the face sequence images of a plurality of users.
Step 402, comparing the specific face features with the sample face features in the face library.
The face library is generated according to face features used for training the voice enhancement network. For example, after training the voice enhancement network for a certain face feature, the terminal device may add the face feature to the face library as a sample face feature, so as to determine, according to each sample face feature in the face library, whether the voice enhancement network is trained for the certain face feature, that is, whether the voice feature of the user is learned in the voice enhancement network may be determined according to each sample face feature in the face library.
Correspondingly, after extracting the specific face features, the terminal device can compare the specific face features with each sample face feature in the face library to determine whether the face library comprises sample face features identical or similar to the specific face features.
If any sample face feature in the face library is the same as or similar to the specific face feature, it indicates that the terminal device may have trained the voice enhancement network for the specific face feature, the voice enhancement network learns the voice feature of the user, and the terminal device may execute step 403 to enhance the voice of the user in the collected audio data through the voice enhancement network, so as to obtain the enhanced audio data.
If the face library does not include a sample face feature identical or similar to the specific face feature, it indicates that the terminal device has not trained the voice enhancement network for the specific face feature and the voice enhancement network has not learned the sound features of the user. The terminal device may then execute step 404 and temporarily refrain from invoking the voice enhancement network to denoise and enhance the audio data until the voice enhancement network has been trained according to the specific face feature.
It should be noted that, in a multi-user scenario, specific face features of multiple users may be extracted from the face sequence images acquired by the terminal device in step 401. If the terminal device detects that specific face features of multiple users have been collected, it may further detect the lip features associated with each specific face feature and determine whether the corresponding user is speaking, thereby determining the current speaker, and then compare the specific face feature of the speaker with each sample face feature in the face library.
Of course, the plurality of specific face features extracted from the face sequence images may also all correspond to the same user. In practice, for the same user, the specific face features extracted by the terminal device may differ somewhat from the sample face features stored in the face library owing to factors such as the environment and lighting. Correspondingly, the terminal device may compare the plurality of specific face features extracted from the plurality of images with the sample face features and determine whether the differences fall within an error range, so as to determine whether the face library includes a sample face feature that is the same as or similar to the specific face features.
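As a hedged illustration of this comparison, the sketch below uses cosine similarity and an illustrative threshold; the embodiment does not specify the distance metric, the threshold, or the feature dimensionality.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def library_contains(specific_features, face_library, threshold=0.8):
    """Return True if any extracted specific face feature matches any stored sample
    face feature within the tolerance; a match means the network was already trained
    for this user (step 403), otherwise steps 404 and 405 apply."""
    return any(cosine_similarity(f, s) >= threshold
               for f in specific_features for s in face_library)
```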
Step 403, if any sample face feature in the face library is the same as the specific face feature, denoising and enhancing the collected audio data through the voice enhancement network.
After the terminal device determines that the specific face feature is the same as a certain sample face feature in the face library, this indicates that the voice enhancement network has been trained for the user currently using the terminal device, and the terminal device may invoke the voice enhancement network to enhance the collected audio data.
Before enhancing the audio data through the voice enhancement network, the terminal device may determine whether the user is currently speaking according to the movement of the user's lips in the acquired face sequence images. If obvious lip movement is detected, the user is speaking, and the terminal device may obtain the lip movement sequence image of the user from the face sequence images, that is, the sequence image of the user's lip movements while speaking. The terminal device may then input the extracted lip movement sequence image, the collected audio data, and a face image from the face sequence images into the voice enhancement network, and perform multi-modal voice enhancement processing through the voice enhancement network to obtain denoised and enhanced audio data.
In the process of obtaining the lip movement sequence image from the face sequence images, the terminal device may first identify the face included in each image, then identify the user's lips within the identified face region, and retain the region where the lips are located in each image while removing the other regions, thereby obtaining the lip movement sequence image of the user.
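A minimal sketch of this lip-region extraction, together with a rough lip-movement check, is given below. The lip locator, the pixel-difference motion test, and the threshold are assumptions for illustration only.

```python
import numpy as np

def extract_lip_sequence(face_sequence, locate_lips):
    """Keep only the lip region of each frame and discard the other regions;
    locate_lips is a placeholder landmark detector returning (x, y, w, h) or None."""
    lips = []
    for frame in face_sequence:
        box = locate_lips(frame)
        if box is None:
            continue
        x, y, w, h = box
        lips.append(frame[y:y + h, x:x + w].astype(np.float32))
    return lips

def lips_moving(lip_sequence, motion_threshold=5.0):
    """Very rough speaking check: a mean absolute difference between consecutive
    lip crops above the threshold is treated as obvious lip movement."""
    diffs = [float(np.mean(np.abs(a - b)))
             for a, b in zip(lip_sequence, lip_sequence[1:]) if a.shape == b.shape]
    return bool(diffs) and max(diffs) > motion_threshold
```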
It should be noted that, the terminal device may also acquire the lip movement sequence image of the user in other manners. For example, when the terminal device includes a plurality of cameras, the terminal device may use different cameras to focus on different positions, and acquire a face sequence image and a lip motion sequence image of the user. Of course, other ways of acquiring the lip movement sequence image of the user may be used, which is not limited in the embodiment of the present application.
In addition, during the multi-modal voice enhancement processing, the voice enhancement network may extract the specific face feature from the input face image and, according to the sound features of the user corresponding to that specific face feature, locate the user's voice in the collected audio data, so that the voice corresponding to the specific face feature is enhanced in the currently collected audio data and the enhanced audio data is obtained. For the other parts of the multi-modal voice enhancement processing performed by the voice enhancement network, reference may be made to the prior art, and details are not described again in the embodiments of the present application.
If the terminal device does not detect obvious lip movement in the acquired face sequence images, it may continue to monitor the user's lips until obvious movement is detected, that is, until the user starts speaking. The terminal device may then perform multi-modal voice enhancement processing in the above manner to obtain the denoised audio data.
It should be noted that, in order to improve the reliability of denoising, after denoising the audio data through the voice enhancement network, the terminal device may detect whether erroneous silencing occurs in the audio data output by the voice enhancement network. If the output audio data still exhibits silencing, indicating that the output of the voice enhancement network is still in error, the terminal device does not execute step 403 to invoke the voice enhancement network for denoising enhancement, but instead executes steps 404 and 405 to output the currently collected audio data, retrain the voice enhancement network, and reacquire the sound features of the user, so as to reduce the silencing phenomenon of the voice enhancement network and improve its accuracy and reliability.
If the audio data output by the voice enhancement network does not have the silencing condition, which indicates that the voice enhancement network can achieve a better denoising enhancement effect for the collected audio data, the terminal device can continue to execute step 403 to call the voice enhancement network for denoising enhancement until the terminal device stops calling the voice enhancement network according to the operation triggered by the user.
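The embodiment does not specify how silencing in the output is detected; the sketch below is one simple frame-energy heuristic, offered only as an assumption, that flags frames where the enhanced output is nearly silent although the input still carries energy.

```python
import numpy as np

def has_erroneous_silencing(noisy, enhanced, sample_rate=16000,
                            frame_seconds=0.02, drop_ratio=0.05):
    """Flag frames where the enhanced output is almost silent although the input
    still carried clear energy (possible over-suppression of the user's speech)."""
    noisy = np.asarray(noisy, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    frame = int(sample_rate * frame_seconds)
    n = min(len(noisy), len(enhanced)) // frame
    for i in range(n):
        e_in = float(np.mean(noisy[i * frame:(i + 1) * frame] ** 2))
        e_out = float(np.mean(enhanced[i * frame:(i + 1) * frame] ** 2))
        if e_in > 1e-4 and e_out < drop_ratio * e_in:
            return True
    return False
```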
Step 404, if each sample face feature in the face library is different from the specific face feature, outputting the audio data acquired currently.
After the terminal device determines that the face library does not include a sample face feature identical to the specific face feature, the voice enhancement network has not been trained for the user currently using the terminal device and has not learned the sound features of the user. Therefore, the terminal device may refrain from invoking the voice enhancement network to denoise and enhance the audio data and instead output the unprocessed audio data, that is, the currently collected audio data, thereby avoiding erroneous silencing caused by the voice enhancement network.
It should be noted that, in the process of executing step 404 by the terminal device, the terminal device may also execute step 405 simultaneously, and perform online training on the voice enhancement network in real time according to the currently acquired sequence image and audio data.
Step 405, training the voice enhancement network according to the face sequence image and the audio data.
After determining that the voice enhancement network has not been trained for the current user, the terminal device may perform online training of the voice enhancement network in real time according to the face sequence images collected in step 401 and the collected audio data, so as to obtain the sound features of the user, until a preset training condition is met. The training condition may be that the number of training iterations reaches a count threshold, or that the error output by the voice enhancement network is less than or equal to an error threshold.
Referring to fig. 5, fig. 5 is a schematic diagram of a framework for online training of a voice enhancement network by a terminal device according to an embodiment of the present application, and a process for online training of a voice enhancement network in step 405 may include the following steps:
Step 4051, performing denoising enhancement through the voice enhancement network according to the collected audio data A and the lip movement sequence image B extracted from the face sequence images, to obtain first denoising data C.
Step 4052, comparing the audio data A and the first denoising data C, and determining whether the audio data A and the first denoising data C are similar.
Step 4053, determining whether the user is currently speaking or not according to the lip sequence image B.
Step 4054, determining whether the face library includes a specific face feature extracted from the face sequence image.
Step 4055, if the audio data A and the first denoising data C are similar, the user is speaking, and the face library does not include the specific face feature, acquiring the pre-stored noise data D, and mixing the noise data D with the audio data A to obtain mixed data.
Step 4056, inputting the mixed data into a voice enhancement network, and training the voice enhancement network by combining the audio data A to obtain a trained voice enhancement network.
Step 4057, taking the specific face features as sample face features, and adding the sample face features into a face library.
Here, it has already been determined in step 402 that the specific face feature is not included in the face library, so it is not necessary to determine again whether the face library includes the specific face feature when step 4054 is performed.
In addition, the process of extracting the lip sequence image B by the terminal device in the step 4051 is similar to the process of extracting the lip sequence image by the terminal device in the step 403, and will not be described herein. In addition, after the face sequence image is extracted in step 401, the terminal device may extract the lip motion sequence image B according to the face sequence image, so as to determine the reliability of the voice enhancement network according to the extracted lip motion sequence image B in step 403, and also train the voice enhancement network in step 405. That is, after performing step 401, the terminal device may perform step 4051.
Then, the terminal device may input the audio data A, the lip movement sequence image B, and the face image in the face sequence images into the voice enhancement network, and denoise the audio data A through the voice enhancement network to obtain the first denoising data C. The terminal device may then compare the audio data A with the first denoising data C and determine the similarity between them, thereby determining the environment in which the terminal device is currently located, that is, whether the user is using the terminal device in a quiet environment.
If the similarity between the audio data A and the first denoising data C is greater than or equal to the similarity threshold, indicating that the terminal device is currently in a quiet environment, the terminal device may train the voice enhancement network according to the currently collected audio data A. If the similarity between the audio data A and the first denoising data C is smaller than the similarity threshold, indicating that the terminal device is not currently in a quiet environment, the terminal device needs to continue collecting audio data until it is in a quiet environment. Further, if the terminal device is not currently in a quiet environment, it may remind the user to move to a quiet environment in order to train and use the voice enhancement network.
For example, suppose the duration of each of the audio data A and the first denoising data C is 10 seconds (s). The terminal device may then compare the audio data A and the first denoising data C at each moment (1 s, 2 s, 3 s, ..., 10 s) to obtain the data difference (in parameters such as frequency and amplitude) between them at each moment, and thereby obtain the similarity between the two. If a similarity of 1 denotes identical data and the similarity threshold is 0.9, then when the similarity is greater than or equal to 0.9, the terminal device determines that its current environment is a quiet environment; if the similarity is smaller than 0.9, it determines that its current environment is not a quiet environment.
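A minimal sketch of this per-second comparison follows. The sampling rate, the use of a normalized correlation per segment, and the 0.9 threshold reused from the example above are assumptions; the embodiment does not fix the exact parameters compared.

```python
import numpy as np

def segment_similarity(audio_a, denoised_c, sample_rate=16000, seg_seconds=1.0):
    """Compare audio data A with first denoising data C segment by segment and
    average a simple normalized correlation per segment."""
    audio_a = np.asarray(audio_a, dtype=float)
    denoised_c = np.asarray(denoised_c, dtype=float)
    seg = int(sample_rate * seg_seconds)
    n = min(len(audio_a), len(denoised_c)) // seg
    sims = []
    for i in range(n):
        a = audio_a[i * seg:(i + 1) * seg]
        c = denoised_c[i * seg:(i + 1) * seg]
        denom = np.linalg.norm(a) * np.linalg.norm(c) + 1e-9
        sims.append(float(np.dot(a, c) / denom))
    return float(np.mean(sims)) if sims else 0.0

# Quiet-environment decision using the illustrative threshold from the example:
# is_quiet = segment_similarity(A, C) >= 0.9
```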
While determining the environment, the terminal device may also detect, according to the lip movement sequence image B, whether the user's lips are moving, that is, whether the user is speaking. If lip movement is detected according to the lip movement sequence image B, the user is speaking, and the terminal device may train the voice enhancement network according to the lip movement sequence image B. If no lip movement is detected, that is, the user is not speaking, the terminal device may prompt the user to speak, so as to collect the audio data A and the lip movement sequence image B and complete the training of the voice enhancement network.
Because the terminal device has already determined in step 402 that the specific face feature is not in the face library, when it determines that it is currently in a quiet environment and that the user's lips are moving, it can train the voice enhancement network according to the collected audio data A, the face sequence images, and the lip movement sequence image B, in combination with the pre-stored noise data D, so as to obtain a voice enhancement network for the current user.
In the process of training the voice enhancement network, the terminal device may acquire the pre-stored noise data D and mix it with the collected audio data A to obtain noise-added mixed data. The terminal device may then input the mixed data, the lip movement sequence image B, and the face image from the face sequence images into the voice enhancement network, and take the audio data A collected in the quiet environment as the label data for training: the second denoising data output by the voice enhancement network is compared with the collected audio data A, the voice enhancement network is adjusted according to the comparison result, and this process is repeated until a voice enhancement network meeting the training condition is obtained.
The training condition may be that the number of times the voice enhancement network has been trained reaches a count threshold, or that the similarity between the audio data output by the voice enhancement network and the collected audio data reaches a similarity threshold, or may be another condition.
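A minimal sketch of this online training loop (steps 4055 and 4056) is given below, assuming a PyTorch model with an (audio, lip sequence, face image) to audio interface, an MSE loss, and noise data at least as long as the clean recording. The actual architecture, loss, and optimizer are not specified by the embodiment; both training conditions mentioned above (iteration count and output similarity) are illustrated.

```python
import torch

def train_online(model, audio_a, lips_b, face_img, noise_d,
                 max_steps=100, sim_threshold=0.9, lr=1e-4):
    """Online training sketch: mix stored noise D with clean audio A collected in a
    quiet environment, feed the mix plus visual inputs to the network, and use A as
    the label. Assumes model(mixed, lips_b, face_img) returns audio shaped like A."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mixed = audio_a + noise_d[: audio_a.shape[-1]]            # noise-added mixed data
    for step in range(max_steps):                              # condition 1: step count
        denoised = model(mixed, lips_b, face_img)              # second denoising data
        loss = torch.nn.functional.mse_loss(denoised, audio_a) # compare with label A
        opt.zero_grad()
        loss.backward()
        opt.step()
        # condition 2: similarity between the network output and the clean audio A
        sim = torch.nn.functional.cosine_similarity(
            denoised.detach().flatten(), audio_a.flatten(), dim=0)
        if sim >= sim_threshold:
            break
    return model
```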
Furthermore, the terminal device may add the specific face feature of the current user to the face library as a sample face feature, completing the update of the face library, so that the user corresponding to the specific face feature does not need to train the voice enhancement network again the next time the voice enhancement network is invoked on the terminal device.
It should be noted that the embodiment of the present application is described by taking, as an example, the case in which step 4053 and step 4054 are performed after step 4051; in practical applications, step 4053 and step 4054 may be performed simultaneously with step 4051 or before step 4051, which is not limited in the embodiments of the present application.
Step 406, calling a voice enhancement network to perform denoising enhancement on the collected audio data.
After the training of the voice enhancement network is completed, the terminal device may start the voice enhancement function and invoke the trained voice enhancement network to denoise and enhance the currently collected audio data, so that the sound made by the user is enhanced and becomes clearer.
It should be noted that, in step 401, the terminal device may identify one or more face images including the face part from the collected face sequence images. The voice enhancement process of the terminal device is summarized below by taking the case in which the terminal device identifies one face image as an example.
Referring to fig. 6, fig. 6 is a schematic diagram of a framework for voice enhancement by the voice enhancement network. The process in which the terminal device invokes the multi-modal voice enhancement network to perform voice enhancement on the collected audio data may include the following:
the terminal equipment can collect the face sequence image first, identify the face sequence image to obtain a face image comprising a face part in the face sequence image, and extract the characteristics of the face image to obtain the face characteristics of the user.
Meanwhile, the terminal equipment can further identify the face sequence image, and extract the lip sequence image based on the lips of the face parts acquired in the face sequence image. Moreover, the terminal equipment can collect the audio data while collecting the face sequence images.
The terminal device may then input the acquired face features, lip sequence images and audio data into a speech enhancement network.
Correspondingly, the voice enhancement network can firstly perform feature extraction on the lip sequence image and the audio data respectively to obtain visual features corresponding to the lip sequence image and audio features corresponding to the audio data. In addition, in order to improve the effect of voice enhancement, the terminal device may also align the visual feature and the audio feature, so as to obtain the aligned visual feature and the aligned audio feature.
Then, the voice enhancement network may fuse the face features, the aligned visual features, and the aligned audio features to obtain fused data, and perform feature extraction on the fused data using a self-attention model to obtain initial features, from which the fused audio features are extracted. Finally, the voice enhancement network performs an inverse short-time Fourier transform (ISTFT) on the fused audio features to obtain the enhanced audio data.
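To make the pipeline concrete, the sketch below mirrors the stages described above (feature extraction, temporal alignment, fusion, self-attention, ISTFT) as a small PyTorch module. The mask-based spectral design, all layer sizes, the summation fusion, and the linear interpolation used for alignment are assumptions for illustration and are not the network defined by the embodiment.

```python
import torch
import torch.nn as nn

class MultiModalEnhancer(nn.Module):
    def __init__(self, n_fft=512, hop=160, dim=256, lip_feat=512, face_feat=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        freq = n_fft // 2 + 1
        self.audio_proj = nn.Linear(freq, dim)       # audio feature per STFT frame
        self.visual_proj = nn.Linear(lip_feat, dim)  # visual feature per video frame
        self.face_proj = nn.Linear(face_feat, dim)   # face feature projection
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mask_head = nn.Linear(dim, freq)        # per-frame magnitude mask

    def forward(self, noisy_wav, lip_feats, face_feat):
        # audio features from the magnitude spectrogram of the collected audio
        spec = torch.stft(noisy_wav, self.n_fft, self.hop,
                          window=torch.hann_window(self.n_fft), return_complex=True)
        mag, phase = spec.abs(), spec.angle()                    # (freq, frames)
        a = self.audio_proj(mag.transpose(0, 1))                 # (frames, dim)
        # naive alignment: resample visual features to the audio frame rate
        v = self.visual_proj(lip_feats)                          # (video_frames, dim)
        v = nn.functional.interpolate(v.t().unsqueeze(0), size=a.shape[0],
                                      mode="linear").squeeze(0).t()
        f = self.face_proj(face_feat).unsqueeze(0)               # (1, dim)
        fused = (a + v + f).unsqueeze(0)                         # fusion by summation
        fused, _ = self.attn(fused, fused, fused)                # self-attention model
        mask = torch.sigmoid(self.mask_head(fused)).squeeze(0).transpose(0, 1)
        enhanced = torch.polar(mag * mask, phase)                # reuse the noisy phase
        return torch.istft(enhanced, self.n_fft, self.hop,      # ISTFT back to audio
                           window=torch.hann_window(self.n_fft),
                           length=noisy_wav.shape[-1])
```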
In addition, it should be noted that the foregoing embodiment describes the process of training the voice enhancement network by taking only one user as an example. In practical applications, the terminal device may also train the voice enhancement network in a multi-person scenario in a manner similar to the foregoing training process to obtain a trained voice enhancement network.
In summary, according to the voice enhancement method provided by the embodiment of the present application, the terminal device collects a first face image of the user; if the first face image does not match the stored face data, it acquires the sound features of the user and stores the first face data of the user; it then collects a second face image and first audio data of the user; and if the second face image matches the stored first face data, it enhances the sound of the first user in the first audio data based on the sound features of the first user and outputs the result. The terminal device can learn the sound features of the first user in combination with the first face features of the first user, can improve the accuracy of identifying the first user by collecting the second face image, and can improve the accuracy of enhancing the sound of the first user in the first audio data in combination with the sound features of the first user (for example, erroneous silencing of about 1 second on average within a 1-minute duration, and a signal-to-noise ratio of 8.8 between the audio data output by the voice enhancement network and the clean audio data).
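For reference, one common way to compute such a signal-to-noise ratio is sketched below; the embodiment does not state the unit or the exact formula, so the decibel scale and the residual-as-noise convention here are assumptions.

```python
import numpy as np

def snr_db(clean, enhanced):
    """Signal-to-noise ratio in decibels between clean reference audio and the
    enhanced output, treating the residual difference as noise."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n = min(len(clean), len(enhanced))
    noise = enhanced[:n] - clean[:n]
    return 10.0 * np.log10(np.sum(clean[:n] ** 2) / (np.sum(noise ** 2) + 1e-12))
```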
Before enhancing the first audio data, the terminal device may determine whether the lips of the user move. If the lips of the user do not move, it indicates that the user is not currently speaking, and it is not necessary to enhance the voice of the first user in the first audio data. If the lips of the user move, it indicates that the user is currently speaking; according to the dynamic change of the user's lips, what the user is saying can be determined through lip-reading technology, and, in combination with the acquired sound features of the first user, the accuracy of enhancing the user's voice in the collected audio data can be further improved.
In addition, by inputting the face image of the user into the voice enhancement network, the terminal device can combine the sound features of the first user learned by the voice enhancement network to denoise and enhance the collected audio data, which can improve both the denoising effect of the voice enhancement network and the enhancement effect on the user's voice.
In addition, under the condition that the voice enhancement network is determined not to learn the voice characteristics of the first user, the terminal equipment does not conduct denoising enhancement on the collected audio data through the voice enhancement network, but outputs the collected audio data, so that the situation of error silencing caused by the voice enhancement network on the audio data is avoided, and the reliability of the terminal equipment is improved.
Further, by collecting the audio data in a quiet environment, the collected audio data can be used as the label data for training the voice enhancement network, which simplifies the process of training the voice enhancement network, can improve the efficiency of training the voice enhancement network, and can improve the timeliness with which the terminal device invokes the voice enhancement network.
Further, in the running process of the terminal equipment, whether the specific face features of the user currently using the terminal equipment are included in the face library can be determined according to the collected face sequence images and the collected audio data, so that whether the voice enhancement network can be invoked to carry out denoising enhancement on the audio data is determined, the accuracy of invoking the voice enhancement network can be improved, and the accuracy of denoising enhancement of the voice enhancement network can be improved.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Corresponding to the speech enhancement method described in the above embodiments, fig. 7 is a block diagram of a speech enhancement apparatus provided in the embodiment of the present application, and for convenience of explanation, only the portions related to the embodiment of the present application are shown.
Referring to fig. 7, the apparatus includes:
the acquisition module 701 is configured to acquire a first face image of a first user;
a first obtaining module 702, configured to obtain a sound feature of the first user if the first face image does not match the stored face data;
a storage module 703, configured to store first face data of the first user;
the acquisition module 701 is further configured to acquire a second face image and first audio data of the first user;
and an output module 704, configured to, if the second face image matches the stored first face data, enhance the sound of the first user in the first audio data based on the sound feature of the first user, and output the enhanced sound.
Optionally, the collection module 701 is further configured to collect second audio data of the first user;
the output module 704 is further configured to output third audio data according to the second audio data if the first face image does not match with the stored face data, where the sound of the first user in the third audio data is not enhanced.
Optionally, the output module 704 is specifically configured to detect whether the lips of the first user move; and if the second face image is matched with the stored first face data and the lips of the first user move, enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user and outputting the enhanced sound.
Optionally, the output module 704 is specifically configured to: acquire a first lip movement sequence image of the first user; detect whether the lips in the first lip movement sequence image move; and, when enhancing and outputting the sound of the first user in the first audio data based on the sound features of the first user, enhance the sound of the first user in the first audio data through the voice enhancement network according to the first lip movement sequence image, the second face image, and the first audio data, and output the enhanced sound.
Optionally, referring to fig. 8, the apparatus further includes:
a second obtaining module 705, configured to obtain the enhanced first audio data;
a detecting module 706, configured to detect whether a silencing phenomenon occurs in the enhanced first audio data;
the first obtaining module 702 is further configured to obtain the sound feature of the first user again if the enhanced first audio data has a silencing phenomenon;
the output module 704 is further configured to, if the enhanced first audio data does not have the silencing phenomenon, continue to enhance the sound of the first user in the re-collected audio data based on the sound feature of the first user and output the enhanced sound.
Optionally, the first obtaining module 702 is specifically configured to collect the fourth audio data and a first sequence of images of the first user, where the first sequence of images includes: face information and lip information; and acquiring sound characteristics of the first user based on the first sequence of images and fourth audio data.
Optionally, the first obtaining module 702 is further specifically configured to learn, through a voice enhancement network, a voice feature of the first user.
Optionally, the first obtaining module 702 is further specifically configured to obtain a third face image and a second lip motion sequence image according to the first sequence image; and inputting the third face image, the second lip movement sequence image and the fourth audio data into the voice enhancement network, and learning the sound characteristics of the first user through the voice enhancement network.
Optionally, referring to fig. 9, the apparatus further includes:
a first determining module 707, configured to determine whether a lip of the first user moves according to the second lip movement sequence image;
a second judging module 708, configured to determine whether the current scene is a quiet environment according to the fourth audio data;
the first obtaining module 702 is further specifically configured to input the third face image, the second lip movement sequence image, and the fourth audio data into the voice enhancement network if the current scene is a quiet environment and the lips of the first user are moving, and learn the voice characteristics of the first user through the voice enhancement network.
Optionally, the second determining module 708 is specifically configured to input the fourth audio data into the speech enhancement network to obtain first denoising data; comparing the fourth audio data with the first denoising data; if the similarity between the fourth audio data and the first denoising data is greater than or equal to a similarity threshold value, determining that the current scene is a quiet environment; and if the similarity between the fourth audio data and the first denoising data is smaller than the similarity threshold value, determining that the current scene is not a quiet environment.
Optionally, the first obtaining module 702 is further specifically configured to mix the fourth audio data with pre-stored noise data to obtain mixed audio data; inputting the mixed sound data, the third face image and the second lip movement sequence image into the voice enhancement network to obtain second denoising data; and adjusting the voice enhancement network according to the second denoising data and the fourth audio data so that the voice enhancement network learns to obtain the sound characteristics of the first user.
Fig. 10 is a block diagram of another voice enhancement device according to an embodiment of the present application, and only a portion related to the embodiment of the present application is shown for convenience of explanation.
Referring to fig. 10, the apparatus includes:
an acquisition module 1001, configured to acquire first audio data and a first sequence of images of a first user, where the first sequence of images includes: face information and lip information of a first user;
and an output module 1002, configured to, if it is determined according to the first sequence image that the face information of the first user matches the stored face data and the lips of the first user move, enhance the sound of the first user in the first audio data according to the sound feature of the first user, and output the enhanced sound.
Optionally, the output module 1002 is further configured to output second audio data according to the first audio data if it is determined that the face information of the first user does not match with the stored face data according to the first sequence image, or if it is determined that the lips of the first user do not move according to the first sequence image, and the sound of the first user in the second audio data is not enhanced.
Optionally, the output module 1002 is specifically configured to extract, according to the first sequence image, a first face image and a first lip motion sequence image; and according to the first audio data, the first face image and the first lip movement sequence image, enhancing the sound of the first user in the first audio data through the voice enhancement network and outputting the enhanced sound.
Optionally, referring to fig. 11, the apparatus further includes:
a first extraction module 1003, configured to extract the first sequence image to obtain a first lip motion sequence image;
a first judging module 1004 is configured to judge whether the lips of the first user move according to the first lip movement sequence image.
Optionally, referring to fig. 12, the apparatus further includes:
A second extraction module 1005, configured to extract the first sequence of images to obtain the first face information;
a second determining module 1006, configured to traverse each stored face data, and determine whether each stored face data includes face data that matches the first face information.
Optionally, referring to fig. 13, the apparatus further includes:
a first obtaining module 1007 for obtaining the enhanced first audio data;
a detection module 1008, configured to detect whether a silencing phenomenon occurs in the enhanced first audio data;
a second obtaining module 1009, configured to obtain, if the enhanced first audio data has a silencing phenomenon, a sound feature of the first user again;
the output module 1002 is further configured to, if the enhanced first audio data does not have the silencing phenomenon, continue to enhance the sound of the first user in the re-collected audio data based on the sound feature of the first user and output the enhanced sound.
Optionally, the acquiring module 1001 is specifically configured to acquire the first audio data and the first sequence image of the first user in response to an operation for invoking the voice enhancement network.
In summary, in the voice enhancement apparatus provided by the embodiment of the present application, the terminal device collects a first face image of the user; if the first face image does not match the stored face data, it acquires the sound features of the user and stores the first face data of the user; it then collects a second face image and first audio data of the user; and if the second face image matches the stored first face data, it enhances the sound of the first user in the first audio data based on the sound features of the first user and outputs the result. The terminal device can learn the sound features of the first user in combination with the first face features of the first user, can improve the accuracy of identifying the first user by collecting the second face image, and can improve the accuracy of enhancing the sound of the first user in the first audio data in combination with the sound features of the first user (for example, erroneous silencing of about 1 second on average within a 1-minute duration, and a signal-to-noise ratio of 8.8 between the audio data output by the voice enhancement network and the clean audio data).
The following describes a terminal device according to an embodiment of the present application. Referring to fig. 14, fig. 14 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
The terminal device can include a processor 1410, an external memory interface 1420, an internal memory 1421, a universal serial bus (universal serial bus, USB) interface 1430, a charge management module 1440, a power management module 1441, a battery 1442, an antenna 1, an antenna 2, a mobile communication module 1450, a wireless communication module 1460, an audio module 1470, a speaker 1470A, a receiver 1470B, a microphone 1470C, an earphone interface 1470D, a sensor module 1480, keys 1490, a motor 1491, an indicator 1492, a camera 1493, a display screen 1494, and a subscriber identity module (subscriber identification module, SIM) card interface 1495, among others. The sensor modules 1480 may include, among others, pressure sensors 1480A, gyroscope sensors 1480B, barometric pressure sensors 1480C, magnetic sensors 1480D, acceleration sensors 1480E, distance sensors 1480F, proximity sensors 1480G, fingerprint sensors 1480H, temperature sensors 1480J, touch sensors 1480K, ambient light sensors 1480L, bone conduction sensors 1480M, and the like.
It will be appreciated that the structure illustrated in the embodiments of the present invention does not constitute a specific limitation on the terminal device. In other embodiments of the present application, the terminal device may include more or fewer components than illustrated, some components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 1410 may include one or more processing units, such as: the processor 1410 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can be a neural center and a command center of the terminal equipment. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 1410 for storing instructions and data. In some embodiments, the memory in the processor 1410 is a cache memory. The memory may hold instructions or data that the processor 1410 has just used or uses cyclically. If the processor 1410 needs to use the instructions or data again, they can be called directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 1410, and thus improves the efficiency of the system.
In some embodiments, processor 1410 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The I2C interface is a bi-directional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 1410 may contain multiple sets of I2C buses. The processor 1410 may be coupled to the touch sensor 1480K, the charger, the flash, the camera 1493, and the like through different I2C bus interfaces. For example: the processor 1410 may be coupled to the touch sensor 1480K through an I2C interface, so that the processor 1410 and the touch sensor 1480K communicate through the I2C bus interface to implement the touch function of the terminal device.
The I2S interface may be used for audio communication. In some embodiments, processor 1410 may contain multiple sets of I2S buses. The processor 1410 may be coupled to the audio module 1470 through an I2S bus to enable communication between the processor 1410 and the audio module 1470. In some embodiments, the audio module 1470 may communicate audio signals to the wireless communication module 1460 via the I2S interface to implement the function of answering a call via a bluetooth headset.
PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 1470 and the wireless communication module 1460 may be coupled through a PCM bus interface. In some embodiments, the audio module 1470 may also communicate audio signals to the wireless communication module 1460 via the PCM interface to enable a phone call to be received via the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 1410 with the wireless communication module 1460. For example: the processor 1410 communicates with a bluetooth module in the wireless communication module 1460 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 1470 may communicate audio signals to the wireless communication module 1460 through a UART interface to implement the function of playing music through a bluetooth headset.
The MIPI interface may be used to connect processor 1410 with peripheral devices such as a display screen 1494, a camera 1493, and the like. The MIPI interfaces include camera serial interfaces (camera serial interface, CSI), display serial interfaces (display serial interface, DSI), and the like. In some embodiments, the processor 1410 and the camera 1493 communicate through CSI interfaces to implement a photographing function of the terminal device. The processor 1410 and the display screen 1494 communicate through a DSI interface to realize a display function of the terminal device.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In some embodiments, GPIO interfaces may be used to connect the processor 1410 with the camera 1493, the display screen 1494, the wireless communication module 1460, the audio module 1470, the sensor module 1480, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, an MIPI interface, etc.
The USB interface 1430 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 1430 may be used to connect a charger to charge the terminal device, or to transfer data between the terminal device and a peripheral device. It may also be used to connect a headset and play audio through the headset. The interface may also be used to connect other terminal devices, such as AR devices.
It should be understood that the connection relationship between the modules illustrated in the embodiment of the present invention is only illustrative, and does not limit the structure of the terminal device. In other embodiments of the present application, the terminal device may also use different interfacing manners in the foregoing embodiments, or a combination of multiple interfacing manners.
The charge management module 1440 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 1440 may receive a charging input of a wired charger through the USB interface 1430. In some wireless charging embodiments, the charging management module 1440 may receive wireless charging input through a wireless charging coil of a terminal device. The charging management module 1440 may also provide power to the terminal device via the power management module 1441 while charging the battery 1442.
The power management module 1441 is configured to couple the battery 1442, the charge management module 1440, and the processor 1410. The power management module 1441 receives input from the battery 1442 and/or the charge management module 1440 and provides power to the processor 1410, the internal memory 1421, the external memory, the display screen 1494, the camera 1493, the wireless communication module 1460, and the like. The power management module 1441 may also be configured to monitor battery capacity, battery cycle times, battery health (leakage, impedance), and other parameters. In other embodiments, the power management module 1441 may also be provided in the processor 1410. In other embodiments, the power management module 1441 and the charge management module 1440 may be provided in the same device.
The wireless communication function of the terminal device can be implemented by the antenna 1, the antenna 2, the mobile communication module 1450, the wireless communication module 1460, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the terminal device may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 1450 may provide solutions for wireless communication including 2G/3G/4G/5G etc. applied on the terminal device. The mobile communication module 1450 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), or the like. The mobile communication module 1450 receives electromagnetic waves from the antenna 1, filters, amplifies the received electromagnetic waves, and transmits the electromagnetic waves to the modem processor for demodulation. The mobile communication module 1450 can amplify the signal modulated by the modem processor and convert the signal into electromagnetic waves to radiate the electromagnetic waves through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 1450 may be provided in the processor 1410. In some embodiments, at least some of the functional modules of the mobile communication module 1450 may be provided in the same device as at least some of the modules of the processor 1410.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to speakers 1470A, receivers 1470B, etc.), or displays images or video through a display screen 1494. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 1450 or other functional modules, independent of the processor 1410.
The wireless communication module 1460 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc. applied to the terminal device. The wireless communication module 1460 may be one or more devices that integrate at least one communication processing module. The wireless communication module 1460 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 1410. The wireless communication module 1460 may also receive signals to be transmitted from the processor 1410, frequency modulate them, amplify them, and convert them to electromagnetic waves for radiation via the antenna 2.
In some embodiments, the antenna 1 of the terminal device is coupled to the mobile communication module 1450 and the antenna 2 is coupled to the wireless communication module 1460 so that the terminal device can communicate with the network and other devices through wireless communication technology. The wireless communication techniques may include the Global System for Mobile communications (global system for mobile communications, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include a global satellite positioning system (global positioning system, GPS), a global navigation satellite system (global navigation satellite system, GLONASS), a beidou satellite navigation system (beidou navigation satellite system, BDS), a quasi zenith satellite system (quasi-zenith satellite system, QZSS) and/or a satellite based augmentation system (satellite based augmentation systems, SBAS).
The terminal device implements display functions through the GPU, the display screen 1494, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 1494 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 1410 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 1494 is used for displaying images, videos, and the like. The display screen 1494 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device may include 1 or N display screens 1494, N being a positive integer greater than 1.
The terminal device may implement shooting functions through an ISP, a camera 1493, a video codec, a GPU, a display screen 1494, an application processor, and the like.
The ISP is used to process the data fed back by the camera 1493. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 1493.
The camera 1493 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, the terminal device may include 1 or N cameras 1493, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process other digital signals in addition to digital image signals. For example, when the terminal device selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, and so on.
Video codecs are used to compress or decompress digital video. The terminal device may support one or more video codecs. In this way, the terminal device can play or record video in multiple encoding formats, for example: moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and so on.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent cognition of terminal equipment can be realized through NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The external memory interface 1420 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the terminal device. The external memory card communicates with the processor 1410 through an external memory interface 1420 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
Internal memory 1421 can be used to store computer-executable program code that includes instructions. The processor 1410 executes various functional applications of the terminal device and data processing by executing instructions stored in the internal memory 1421. The internal memory 1421 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the terminal device (such as audio data, phonebook, etc.), etc. In addition, the internal memory 1421 can include high-speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash memory (universal flash storage, UFS), and the like.
The terminal device may implement audio functions through an audio module 1470, a speaker 1470A, a receiver 1470B, a microphone 1470C, an earphone interface 1470D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 1470 is used to convert digital audio information to an analog audio signal output and also to convert an analog audio input to a digital audio signal. The audio module 1470 may also be used to encode and decode audio signals. In some embodiments, the audio module 1470 may be disposed in the processor 1410, or a portion of the functional module of the audio module 1470 may be disposed in the processor 1410.
The speaker 1470A, also known as a "horn," is used to convert audio electrical signals into sound signals. The terminal device can listen to music, or to hands-free calls, through speaker 1470A.
A receiver 1470B, also known as a "earpiece," is used to convert the audio electrical signal into a sound signal. When the terminal device picks up a call or voice message, the voice can be picked up by placing the receiver 1470B close to the human ear.
The microphone 1470C, also known as a "mic" or "mike", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can make a sound near the microphone 1470C with the mouth, inputting the sound signal into the microphone 1470C. The terminal device may be provided with at least one microphone 1470C. In other embodiments, the terminal device may be provided with two microphones 1470C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal device may further be provided with three, four, or more microphones 1470C to collect sound signals, reduce noise, identify sound sources, implement a directional recording function, and so on.
The earphone interface 1470D is used to connect wired earphones. The earphone interface 1470D may be the USB interface 1430, or a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 1480A is used to sense a pressure signal, which can be converted into an electrical signal. In some embodiments, pressure sensor 1480A may be provided on display 1494. The pressure sensor 1480A is of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a capacitive pressure sensor comprising at least two parallel plates with conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 1480A. The terminal device determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 1494, the terminal device detects the intensity of the touch operation based on the pressure sensor 1480A. The terminal device may also calculate the position of the touch from the detection signal of the pressure sensor 1480A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example: and executing an instruction for checking the short message when the touch operation with the touch operation intensity smaller than the first pressure threshold acts on the short message application icon. And executing an instruction for newly creating the short message when the touch operation with the touch operation intensity being greater than or equal to the first pressure threshold acts on the short message application icon.
The gyro sensor 1480B may be used to determine the motion posture of the terminal device. In some embodiments, the angular velocities of the terminal device about three axes (i.e., the x, y, and z axes) may be determined by the gyro sensor 1480B. The gyro sensor 1480B may be used for image stabilization during photographing. For example, when the shutter is pressed, the gyro sensor 1480B detects the angle by which the terminal device shakes, calculates the distance the lens module needs to compensate, and lets the lens counteract the shake through reverse motion, thereby achieving image stabilization. The gyro sensor 1480B may also be used in navigation and motion-sensing game scenarios.
The air pressure sensor 1480C is used to measure air pressure. In some embodiments, the terminal device calculates altitude from the barometric pressure value measured by the air pressure sensor 1480C, to assist in positioning and navigation.
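One common way to turn a barometric reading into altitude is the international barometric formula; the sketch below assumes a sea-level reference pressure of 1013.25 hPa and is only an example of the kind of calculation involved, not a formula specified by this application.

```python
def altitude_from_pressure(pressure_hpa: float, sea_level_hpa: float = 1013.25) -> float:
    """Estimate altitude in metres from barometric pressure (hPa)."""
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))

# Example: roughly 111 m for a reading of 1000 hPa at standard sea-level pressure.
print(round(altitude_from_pressure(1000.0), 1))
```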
The magnetic sensor 1480D includes a Hall sensor. The terminal device can detect the opening and closing of a flip leather case using the magnetic sensor 1480D. In some embodiments, when the terminal device is a flip phone, the terminal device can detect the opening and closing of the flip cover according to the magnetic sensor 1480D, and then set features such as automatic unlocking upon opening based on the detected open or closed state of the leather case or the flip cover.
The acceleration sensor 1480E can detect the magnitude of acceleration of the terminal device in various directions (typically along three axes). The magnitude and direction of gravity can be detected when the terminal device is stationary. The acceleration sensor can also be used to identify the posture of the terminal device, in applications such as landscape/portrait switching and pedometers.
The distance sensor 1480F is used to measure distance. The terminal device may measure distance by infrared or laser. In some embodiments, in a photographing scenario, the terminal device may use the distance sensor 1480F to measure distance to achieve fast focusing.
The proximity light sensor 1480G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The terminal device emits infrared light outward through the light-emitting diode and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the terminal device can determine that there is an object nearby; when insufficient reflected light is detected, the terminal device can determine that there is no object nearby. Using the proximity light sensor 1480G, the terminal device can detect that the user is holding it close to the ear during a call and automatically turn off the screen to save power. The proximity light sensor 1480G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
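The reflected-light decision can be summarised, for illustration only, as a threshold test on the photodiode reading; the threshold below is an assumed placeholder rather than a value from this application.

```python
REFLECTION_THRESHOLD = 50  # placeholder ADC value for "sufficient reflected light"

def object_nearby(photodiode_reading: int) -> bool:
    """Sufficient reflected infrared light implies an object close to the device."""
    return photodiode_reading >= REFLECTION_THRESHOLD

def screen_state(photodiode_reading: int, in_call: bool) -> str:
    # During a call with the device held to the ear, turn the screen off to save power.
    return "off" if in_call and object_nearby(photodiode_reading) else "on"
```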
The ambient light sensor 1480L is used to sense ambient light brightness. The terminal device may adaptively adjust the brightness of the display screen 1494 based on the perceived ambient light brightness. The ambient light sensor 1480L may also be used to automatically adjust the white balance when taking a photograph, and may cooperate with the proximity light sensor 1480G to detect whether the terminal device is in a pocket, to prevent accidental touches.
The fingerprint sensor 1480H is used to collect fingerprints. The terminal device can use the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint photographing, fingerprint-based call answering, and the like.
The temperature sensor 1480J is used to detect temperature. In some embodiments, the terminal device executes a temperature processing strategy using the temperature detected by the temperature sensor 1480J. For example, when the temperature reported by the temperature sensor 1480J exceeds a threshold, the terminal device reduces the performance of a processor located near the temperature sensor 1480J, in order to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is below another threshold, the terminal device heats the battery 1442 to prevent low temperature from causing the terminal device to shut down abnormally. In still other embodiments, when the temperature is below a further threshold, the terminal device boosts the output voltage of the battery 1442 to avoid abnormal shutdown caused by low temperature.
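For illustration, the temperature processing strategy can be read as a set of threshold checks; the numeric thresholds below are assumptions for the example, since the application does not fix them.

```python
def temperature_policy(temp_c: float) -> str:
    """Choose an action from the temperature reported by the temperature sensor."""
    HIGH = 45.0       # placeholder: throttle the nearby processor above this
    LOW = 0.0         # placeholder: heat the battery below this
    VERY_LOW = -10.0  # placeholder: also boost the battery output voltage below this
    if temp_c > HIGH:
        return "reduce_processor_performance"   # thermal protection
    if temp_c < VERY_LOW:
        return "heat_battery_and_boost_output_voltage"
    if temp_c < LOW:
        return "heat_battery"
    return "normal_operation"
```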
The touch sensor 1480K is also referred to as a "touch panel." The touch sensor 1480K may be disposed on the display screen 1494; together they form a touch screen, also referred to as a "touchscreen." The touch sensor 1480K is used to detect a touch operation acting on or near it. The touch sensor may pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 1494. In other embodiments, the touch sensor 1480K may also be disposed on the surface of the terminal device, at a position different from that of the display screen 1494.
The bone conduction sensor 1480M can acquire vibration signals. In some embodiments, the bone conduction sensor 1480M can acquire the vibration signal of the vibrating bone of the human voice-producing part. The bone conduction sensor 1480M can also contact the human pulse and receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 1480M may also be provided in a headset, forming a bone conduction headset. The audio module 1470 can parse out a voice signal from the vibration signal of the vocal-part bone obtained by the bone conduction sensor 1480M, to implement a voice function. The application processor can parse heart rate information from the blood pressure pulsation signal acquired by the bone conduction sensor 1480M, to implement a heart rate detection function.
The keys 1490 include a power key, volume keys, and the like. The keys 1490 may be mechanical keys or touch keys. The terminal device may receive key input and generate key signal input related to user settings and function control of the terminal device.
The motor 1491 can generate a vibration alert. The motor 1491 can be used for incoming-call vibration alerts as well as touch vibration feedback. For example, touch operations acting on different applications (such as photographing and audio playback) may correspond to different vibration feedback effects. Touch operations applied to different areas of the display screen 1494 may also correspond to different vibration feedback effects of the motor 1491. Different application scenarios (such as time reminders, received messages, alarm clocks, and games) may also correspond to different vibration feedback effects. The touch vibration feedback effect may also be customized.
The indicator 1492 may be an indicator light, which may be used to indicate a state of charge, a change in charge, a message, a missed call, a notification, etc.
The SIM card interface 1495 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the terminal device by inserting it into or removing it from the SIM card interface 1495. The terminal device may support one or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 1495 may support Nano SIM cards, Micro SIM cards, and the like. Multiple cards can be inserted into the same SIM card interface 1495 at the same time; the cards may be of the same type or of different types. The SIM card interface 1495 may also be compatible with different types of SIM cards and with external memory cards. The terminal device interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the terminal device uses an eSIM, i.e., an embedded SIM card, which can be embedded in the terminal device and cannot be separated from it.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, can implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
Finally, it should be noted that the foregoing is merely a specific embodiment of the present application, and the protection scope of the present application is not limited thereto. Any changes or substitutions within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

1. A method of speech enhancement, the method comprising:
collecting a first face image of a first user;
if the first face image is not matched with the stored face data, acquiring the sound characteristics of the first user;
storing first face data of the first user;
collecting a second face image and first audio data of the first user;
and if the second face image is matched with the stored first face data, enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user and outputting the enhanced sound.
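For readability only, the enrol-then-enhance flow of this claim can be sketched as follows; the face matcher, feature learner, and enhancement function are passed in as placeholders, since the claim does not tie the method to particular implementations of them.

```python
from typing import Callable, Dict, List, Optional

def process_capture(face_image: bytes,
                    audio: List[float],
                    known_faces: Dict[str, bytes],
                    voice_features: Dict[str, List[float]],
                    match_face: Callable[[bytes, Dict[str, bytes]], Optional[str]],
                    learn_features: Callable[[List[float], bytes], List[float]],
                    enhance: Callable[[List[float], List[float]], List[float]]) -> List[float]:
    """Unknown face: learn and store; known face: enhance with stored features."""
    user_id = match_face(face_image, known_faces)
    if user_id is None:
        # The face image does not match stored face data: learn the user's
        # sound characteristics and store the first face data.
        new_id = f"user_{len(known_faces)}"
        known_faces[new_id] = face_image
        voice_features[new_id] = learn_features(audio, face_image)
        return audio  # output without enhancement
    # The face image matches stored face data: enhance this user's sound
    # based on the stored sound characteristics and output the result.
    return enhance(audio, voice_features[user_id])
```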
2. The method according to claim 1, wherein the method further comprises:
collecting second audio data of the first user;
and if the first face image is not matched with the stored face data, outputting third audio data according to the second audio data, wherein the sound of the first user in the third audio data is not enhanced.
3. The method according to claim 1 or 2, wherein if the second face image is matched with the stored first face data, the enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user includes:
detecting whether movement of the lips of the first user occurs;
and if the second face image is matched with the stored first face data and the lips of the first user move, enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user and outputting the enhanced sound.
4. A method according to claim 3, wherein said detecting whether movement of the lips of the first user has occurred comprises:
acquiring a first lip movement sequence image of the first user;
detecting whether a lip in the first lip movement sequence image moves or not;
the enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user and outputting the enhanced sound includes:
and according to the first lip movement sequence image, the second face image and the first audio data, enhancing the sound of the first user in the first audio data through the voice enhancement network and outputting the enhanced sound.
5. The method of any one of claims 1 to 4, wherein after the enhancing the sound of the first user in the first audio data based on the sound characteristics of the first user, the method further comprises:
acquiring enhanced first audio data;
detecting whether the enhanced first audio data has a silencing phenomenon or not;
if the enhanced first audio data has the silencing phenomenon, acquiring the sound characteristics of the first user again;
if the enhanced first audio data does not have the silencing phenomenon, continuing to enhance the sound of the first user in the re-acquired audio data based on the sound characteristics of the first user and outputting the enhanced sound.
6. The method of claim 1, wherein the acquiring the sound characteristic of the first user comprises:
acquiring fourth audio data of the first user and a first sequence of images, the first sequence of images comprising: face information and lip information;
and acquiring sound characteristics of the first user based on the first sequence of images and the fourth audio data.
7. The method of claim 6, wherein the acquiring the sound characteristic of the first user comprises:
learning the sound characteristics of the first user through a voice enhancement network.
8. The method of claim 7, wherein the learning the sound characteristics of the first user through a voice enhancement network comprises:
acquiring a third face image and a second lip movement sequence image according to the first sequence of images;
and inputting the third face image, the second lip movement sequence image and the fourth audio data into the voice enhancement network, and learning the sound characteristics of the first user through the voice enhancement network.
9. The method of claim 8, wherein before the inputting the third face image, the second lip movement sequence image, and the fourth audio data into the voice enhancement network and learning the sound characteristics of the first user through the voice enhancement network, the method further comprises:
determining whether a lip of the first user moves according to the second lip moving sequence image;
determining whether the current scene is a quiet environment according to the fourth audio data;
the inputting the third face image, the second lip movement sequence image, and the fourth audio data into the voice enhancement network and learning the sound characteristics of the first user through the voice enhancement network comprises:
if the current scene is a quiet environment and the lips of the first user move, inputting the third face image, the second lip movement sequence image, and the fourth audio data into the voice enhancement network, and learning the sound characteristics of the first user through the voice enhancement network.
10. The method of claim 9, wherein the determining whether the current scene is a quiet environment based on the fourth audio data comprises:
inputting the fourth audio data into the voice enhancement network to obtain first denoising data;
comparing the fourth audio data with the first denoising data;
if the similarity between the fourth audio data and the first denoising data is greater than or equal to a similarity threshold value, determining that the current scene is a quiet environment;
and if the similarity between the fourth audio data and the first denoising data is smaller than the similarity threshold value, determining that the current scene is not a quiet environment.
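As an informal sketch of this check, the similarity between the captured audio and its denoised version can be measured, for example, with cosine similarity; the measure and the default threshold value here are assumptions for illustration, not quantities fixed by the claim.

```python
import numpy as np

def is_quiet_environment(audio: np.ndarray, denoised: np.ndarray,
                         similarity_threshold: float = 0.9) -> bool:
    """Quiet scene if denoising barely changes the captured signal."""
    dot = float(np.dot(audio, denoised))
    norms = float(np.linalg.norm(audio) * np.linalg.norm(denoised)) + 1e-12
    return dot / norms >= similarity_threshold
```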
11. The method according to any one of claims 8 to 10, wherein the inputting the third face image, the second lip movement sequence image, and the fourth audio data into the voice enhancement network and learning the sound characteristics of the first user through the voice enhancement network comprises:
mixing the fourth audio data with pre-stored noise data to obtain mixed audio data;
inputting the mixed audio data, the third face image, and the second lip movement sequence image into the voice enhancement network to obtain second denoising data;
and adjusting the voice enhancement network according to the second denoising data and the fourth audio data so that the voice enhancement network learns to obtain the sound characteristics of the first user.
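Illustratively, one adaptation step in the spirit of this claim mixes the near-clean recording with stored noise, denoises it through the network conditioned on the face and lip images, and measures the error against the clean recording; the enhancement function is a placeholder, the noise is assumed to be at least as long as the recording, and the squared-error loss is an assumption for the example.

```python
import numpy as np

def adaptation_step(clean_audio: np.ndarray, stored_noise: np.ndarray,
                    face_image, lip_sequence, enhance_fn):
    """Return the mixed input, the denoised output, and the error to minimise."""
    mixed = clean_audio + stored_noise[: clean_audio.size]   # mixed audio data
    denoised = enhance_fn(mixed, face_image, lip_sequence)   # second denoising data
    error = float(np.mean((denoised - clean_audio) ** 2))    # drives the network update
    return mixed, denoised, error
```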
12. A method of speech enhancement, the method comprising:
acquiring first audio data and a first sequence of images of a first user, the first sequence of images comprising: face information and lip information of the first user;
if it is determined, according to the first sequence of images, that the face information of the first user is matched with the stored face data and the lips of the first user move, enhancing the sound of the first user in the first audio data according to the sound characteristics of the first user and outputting the enhanced sound.
13. The method according to claim 12, wherein the method further comprises:
and if it is determined, according to the first sequence of images, that the face information of the first user is not matched with the stored face data, or that the lips of the first user do not move, outputting second audio data according to the first audio data, wherein the sound of the first user in the second audio data is not enhanced.
14. The method according to claim 12 or 13, wherein the enhancing the sound of the first user in the first audio data according to the sound characteristics of the first user includes:
extracting a first face image and a first lip movement sequence image according to the first sequence of images;
and according to the first audio data, the first face image and the first lip movement sequence image, enhancing the sound of the first user in the first audio data through the voice enhancement network and outputting the enhanced sound.
15. The method of claim 12, wherein prior to the enhancing the first user's voice in the first audio data based on the first user's voice characteristics, the method further comprises:
extracting a first lip movement sequence image from the first sequence of images;
and determining, according to the first lip movement sequence image, whether the lips of the first user move.
16. The method of claim 12, wherein prior to the enhancing the first user's voice in the first audio data based on the first user's voice characteristics, the method further comprises:
extracting first face information from the first sequence of images;
and traversing the stored face data, and determining whether the stored face data includes face data matched with the first face information.
17. The method according to any one of claims 12 to 16, wherein after the enhancing the sound of the first user in the first audio data according to the sound characteristics of the first user, the method further comprises:
acquiring enhanced first audio data;
detecting whether the enhanced first audio data has a silencing phenomenon or not;
if the enhanced first audio data has the silencing phenomenon, acquiring the sound characteristics of the first user again;
if the enhanced first audio data does not have the silencing phenomenon, continuing to enhance the sound of the first user in the re-acquired audio data based on the sound characteristics of the first user and outputting the enhanced sound.
18. The method of any one of claims 12 to 17, wherein the acquiring the first audio data and the first sequence of images of the first user comprises:
acquiring the first audio data and the first sequence of images of the first user in response to an operation for invoking a speech enhancement network.
19. A speech enhancement method, applied to a scenario of a voice or video call between a first terminal device and a second terminal device, the method comprising:
when the first terminal device detects an operation of initiating a voice or video call to the second terminal device, or the first terminal device receives a voice or video call request initiated by the second terminal device, the first terminal device collects a first face image of a first user;
if the first face image is not matched with the stored face data, the first terminal device acquires the sound characteristics of the first user;
the first terminal device stores first face data of the first user;
the first terminal device collects a second face image and first audio data of the first user;
and if the second face image is matched with the stored first face data, the first terminal device enhances the sound of the first user in the first audio data based on the sound characteristics of the first user and outputs the enhanced sound to the second terminal device.
20. The method of claim 19, wherein the method further comprises:
the first terminal device collects second audio data of the first user;
and if the first face image is not matched with the stored face data, the first terminal device outputs third audio data according to the second audio data, wherein the sound of the first user in the third audio data is not enhanced.
21. An electronic device, comprising: a processor for running a computer program stored in a memory to implement the speech enhancement method of any one of claims 1 to 11 or claims 12 to 18.
22. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the speech enhancement method according to any of claims 1 to 11 or claims 12 to 18.
23. A chip system comprising a memory and a processor executing a computer program stored in the memory to implement the speech enhancement method of any one of claims 1 to 11 or claims 12 to 18.
CN202111279132.9A 2021-10-31 2021-10-31 Speech enhancement method, electronic device, storage medium and chip system Pending CN116072136A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111279132.9A CN116072136A (en) 2021-10-31 2021-10-31 Speech enhancement method, electronic device, storage medium and chip system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111279132.9A CN116072136A (en) 2021-10-31 2021-10-31 Speech enhancement method, electronic device, storage medium and chip system

Publications (1)

Publication Number Publication Date
CN116072136A true CN116072136A (en) 2023-05-05

Family

ID=86180769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111279132.9A Pending CN116072136A (en) 2021-10-31 2021-10-31 Speech enhancement method, electronic device, storage medium and chip system

Country Status (1)

Country Link
CN (1) CN116072136A (en)

Similar Documents

Publication Publication Date Title
CN111050269B (en) Audio processing method and electronic equipment
CN113905179B (en) Method for switching cameras by terminal and terminal
CN113810601B (en) Terminal image processing method and device and terminal equipment
CN111543049B (en) Photographing method and electronic equipment
CN113542580B (en) Method and device for removing light spots of glasses and electronic equipment
CN114422340B (en) Log reporting method, electronic equipment and storage medium
US20230345074A1 (en) Multi-device collaboration method, electronic device, and multi-device collaboration system
CN113393856B (en) Pickup method and device and electronic equipment
CN115914461B (en) Position relation identification method and electronic equipment
CN115641867B (en) Voice processing method and terminal equipment
CN113496477A (en) Screen detection method and electronic equipment
EP4307168A1 (en) Target user determination method, electronic device and computer-readable storage medium
CN113129916A (en) Audio acquisition method, system and related device
CN114120950B (en) Human voice shielding method and electronic equipment
WO2022022319A1 (en) Image processing method, electronic device, image processing system and chip system
CN114466238B (en) Frame demultiplexing method, electronic device and storage medium
CN114120987B (en) Voice wake-up method, electronic equipment and chip system
CN113838478B (en) Abnormal event detection method and device and electronic equipment
CN117093068A (en) Vibration feedback method and system based on wearable device, wearable device and electronic device
CN116072136A (en) Speech enhancement method, electronic device, storage medium and chip system
CN116668763B (en) Screen recording method and device
CN114500725B (en) Target content transmission method, master device, slave device, and storage medium
CN116087930B (en) Audio ranging method, device, storage medium, and program product
CN115150543B (en) Shooting method, shooting device, electronic equipment and readable storage medium
CN113364067B (en) Charging precision calibration method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination