CN111091845A - Audio processing method and device, terminal equipment and computer storage medium

Audio processing method and device, terminal equipment and computer storage medium

Info

Publication number
CN111091845A
CN111091845A
Authority
CN
China
Prior art keywords
face image
mouth
audio signal
audio
neural network
Prior art date
Legal status
Withdrawn
Application number
CN201911275028.5A
Other languages
Chinese (zh)
Inventor
耿杰
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201911275028.5A
Publication of CN111091845A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The application relates to the field of terminal Artificial Intelligence (AI), and in particular to the field of speech recognition, and provides an audio processing method, an audio processing apparatus, a terminal device and a computer storage medium. The method includes: acquiring a face image set to be processed and an audio signal to be denoised; extracting the mouth features of the face images in the face image set to be processed, and extracting the spectral features of the audio signal to be denoised; inputting the mouth features of each face image and the spectral features of the audio signal to be denoised into a preset neural network model to obtain a spectrum mask; and processing the audio signal to be denoised with the spectrum mask to obtain a target audio signal. The method and apparatus address the problems that existing face-based auxiliary noise reduction algorithms place high computing-power demands on terminal devices, are difficult to run on low-computing-power terminal devices, and therefore have limited application scenarios.

Description

Audio processing method and device, terminal equipment and computer storage medium
Technical Field
The present application relates to the field of terminal Artificial Intelligence (AI), and more particularly, to an audio processing method, apparatus, terminal device and computer storage medium.
Background
Currently, many terminal devices have voice interaction functions, such as a voice assistant or a voice input method. When a user uses such a terminal device in a quiet environment, the terminal device can recognize the recorded audio data fairly accurately.
However, once the noise level in the environment is high and the terminal device is not configured with appropriate noise reduction measures, the recognition accuracy of the audio data may drop drastically.
An effective noise reduction approach is very important for terminal devices with a voice interaction function. Researchers have proposed face-based auxiliary noise reduction algorithms. In a face-to-face situation, a human can filter out a speaker's speech from a noisy background by observing changes in the speaker's face, particularly mouth movements, in combination with the sounds heard; this works especially well for isolating the target speaker when several people speak at once. Based on a similar principle, the terminal device can collect the speaker's face images and the audio signal as model inputs, and use the speaker's face images to assist in denoising the audio signal, thereby obtaining a better noise reduction effect.
However, because the existing face-based auxiliary noise reduction algorithms directly use face images as model inputs, a large number of face images in consecutive video frames must be processed, and each face image contains many pixels. This requires a large amount of computation, so these algorithms are difficult to run on low-computing-power terminal devices (such as mobile phones, vehicle-mounted terminals, etc.).
Disclosure of Invention
In view of this, embodiments of the present application provide an audio processing method and apparatus, a terminal device, and a computer storage medium, to solve the problems that the existing face-based auxiliary noise reduction algorithm places high computing-power demands on terminal devices, is difficult to run on low-computing-power terminal devices, and has limited application scenarios.
A first aspect of an embodiment of the present application provides an audio processing method, including:
acquiring a face image set to be processed and an audio signal to be denoised;
extracting mouth features of the face images in the face image set to be processed, and extracting frequency spectrum features of the audio signal to be denoised;
inputting the mouth characteristics of each face image and the frequency spectrum characteristics of the audio signal to be denoised into a preset neural network model to obtain a frequency spectrum mask;
and processing the audio signal to be subjected to noise reduction by using the spectrum mask to obtain a target audio signal.
It should be noted that the above-mentioned mouth features can be determined according to actual situations. For example, the mouth feature may be an image of a mouth region in a face image, or the mouth feature may be a point cloud coordinate matrix of key points of a mouth in the face image.
During audio processing, the mouth features are used as the input of the preset neural network model, so the model does not need to process redundant information from other parts of the face image. This greatly reduces the amount of computation of the preset neural network model and lowers the computing-power requirement on the terminal device.
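As an illustration only, the following Python sketch outlines these four steps; all helper names (extract_mouth_feature, extract_spectral_features, apply_spectrum_mask) are hypothetical placeholders for the operations described in this application, not functions defined by it.

```python
def process_audio(face_images, noisy_audio, preset_model):
    """Hedged sketch of the method of the first aspect."""
    # Extract mouth features from each face image and spectral features from the audio
    mouth_feats = [extract_mouth_feature(img) for img in face_images]
    spec_feats = extract_spectral_features(noisy_audio)
    # Feed both feature sets to the preset neural network model to get a spectrum mask
    mask = preset_model(mouth_feats, spec_feats)
    # Apply the spectrum mask to the audio to be denoised to obtain the target audio signal
    return apply_spectrum_mask(spec_feats, mask)
```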
In a possible implementation manner of the first aspect, before the acquiring the set of facial images to be processed and the audio signal to be denoised, the method further includes:
acquiring a detection audio signal with a first preset time length;
carrying out speaker number identification on the detected audio signals to obtain the number of target speakers;
correspondingly, the acquiring the facial image set to be processed and the audio signal to be denoised includes:
and if the number of the target speakers is more than 1, acquiring a human face image set to be processed and an audio signal to be denoised.
It should be noted that the first preset time period may be set according to actual requirements.
When the number of the target speakers is detected to be more than 1, in order to better identify the human voice signal of the user and suppress unnecessary environmental noise, a human face auxiliary noise reduction function can be started, and a human face image set to be processed and an audio signal to be noise reduced are acquired to execute a subsequent processing flow.
In one possible implementation form of the first aspect, the mouth feature is an image of a mouth region of the face image;
correspondingly, the extracting the mouth features of the face images in the face image set to be processed includes:
and identifying and intercepting images of mouth regions of the face images in the face image set to be processed.
It should be noted that the mouth feature may be an image of a mouth Region of the face image, and in this case, the image of the mouth Region in the face image may be extracted through a Region of interest (ROI) extraction algorithm or other extraction algorithms.
In another possible implementation manner of the first aspect, the mouth feature is a point cloud coordinate matrix of a mouth key point corresponding to the face image;
correspondingly, the extracting the mouth features of the face images in the face image set to be processed includes:
identifying key points of a mouth corresponding to each face image in the face image set to be processed;
and constructing a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
It should be noted that the mouth feature may be a point cloud coordinate matrix of a key point of the mouth corresponding to the face image. The terminal equipment can use the face key point recognition model to recognize the mouth key points corresponding to the face images in the face image set.
In a possible implementation manner of the first aspect, the constructing a point cloud coordinate matrix according to coordinates of each key point of the mouth in the face image, and obtaining the point cloud coordinate matrix of the key point of the mouth corresponding to each face image includes:
carrying out normalization processing on the coordinates of each key point of the mouth in the face image to obtain the normalized coordinates of each key point of the mouth in the face image;
and constructing a point cloud coordinate matrix according to the normalized coordinates of the key points of the mouth in the face image to obtain the point cloud coordinate matrix of the key points of the mouth corresponding to the face image.
It should be noted that the normalization processing of the coordinates of the key points of the mouth can effectively solve the problem of the position and distance offset of the mouth in the face image, and provide the mouth features with smaller calculation amount and higher robustness for the subsequent processing flow.
In a possible implementation manner of the first aspect, the preset neural network model includes a first recurrent neural network, a second recurrent neural network, a third recurrent neural network, and a first fully-connected network;
correspondingly, the inputting the mouth feature of each face image and the spectral feature of the audio signal to be denoised into a preset neural network model to obtain a spectral mask includes:
inputting the mouth features of the face images into a first recurrent neural network to obtain first features;
inputting the frequency spectrum characteristic of the audio signal to be denoised into a second recurrent neural network to obtain a second characteristic;
splicing the first feature and the second feature to obtain a third feature;
inputting the third characteristic into a third recurrent neural network to obtain a fourth characteristic;
and inputting the fourth characteristic into a first full-connection network to obtain a spectrum mask.
It should be noted that when the first recurrent neural network and the second recurrent neural network are used to receive the mouth features of the face images and the spectral features of the audio signal to be denoised, the processing can be performed frame by frame. Because a recurrent neural network is a neural network structure whose inputs are fed in time order, a processing step can be executed each time the mouth feature of one face image frame and the spectral feature of one first audio signal frame are received. The mouth feature of each face image frame and the spectral feature of each first audio signal frame are therefore processed in real time, which broadens the application scenarios of the audio processing method.
In a possible implementation manner of the first aspect, the method further includes:
and carrying out voice recognition on the target audio signal to obtain text information and displaying the text information.
It should be noted that the terminal device may perform speech recognition on the target audio signal with an Automatic Speech Recognition (ASR) engine according to a speech recognition instruction of the user, and display the recognized text information to the user.
In a possible implementation manner of the first aspect, the method further includes:
and playing the target audio signal.
It should be noted that the terminal device may play the target audio signal according to a playback instruction of the user, so that the user can listen to the voice effect after noise reduction.
In a possible implementation manner of the first aspect, the method further includes:
and sending the target audio signal to a designated terminal device.
It should be noted that the terminal device may send the target audio signal to a specified terminal device according to a file sending instruction of a user.
A second aspect of an embodiment of the present application provides an audio processing apparatus, including:
the data acquisition module is used for acquiring a face image set to be processed and an audio signal to be denoised;
the characteristic extraction module is used for extracting the mouth characteristics of all the face images in the face image set to be processed and extracting the frequency spectrum characteristics of the audio signal to be denoised;
the frequency spectrum mask module is used for inputting the mouth characteristics of each face image and the frequency spectrum characteristics of the audio signal to be denoised into a preset neural network model to obtain a frequency spectrum mask;
and the target audio module is used for processing the audio signal to be subjected to noise reduction by using the spectrum mask to obtain a target audio signal.
In a possible implementation manner of the second aspect, the apparatus further includes:
the detection audio module is used for acquiring a detection audio signal with a first preset duration;
the number identification module is used for identifying the number of speakers of the detected audio signals to obtain the number of target speakers;
correspondingly, the data acquisition module is specifically configured to acquire a face image set to be processed and an audio signal to be denoised if the number of target speakers is greater than 1.
In one possible implementation form of the second aspect, the mouth feature is an image of a mouth region of the face image;
correspondingly, the feature extraction module comprises:
and the image intercepting submodule is used for identifying and intercepting images of mouth regions of all the face images in the face image set to be processed.
In another possible implementation manner of the second aspect, the mouth feature is a point cloud coordinate matrix of a mouth key point corresponding to the face image;
correspondingly, the feature extraction module comprises:
the key identification submodule is used for identifying key points of a mouth corresponding to each face image in the face image set to be processed;
and the point cloud matrix submodule is used for constructing a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
In one possible implementation manner of the second aspect, the point cloud matrix sub-module includes:
the normalization submodule is used for performing normalization processing on the coordinates of each key point of the mouth in the face image to obtain the normalized coordinates of each key point of the mouth in the face image;
and the matrix construction submodule is used for constructing a point cloud coordinate matrix according to the normalized coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
In a possible implementation manner of the second aspect, the preset neural network model includes a first recurrent neural network, a second recurrent neural network, a third recurrent neural network, and a first fully-connected network;
correspondingly, the spectrum mask module comprises:
the first characteristic submodule is used for inputting the mouth characteristics of each face image into a first recurrent neural network to obtain first characteristics;
the second characteristic submodule is used for inputting the frequency spectrum characteristic of the audio signal to be denoised into a second recurrent neural network to obtain a second characteristic;
the third characteristic submodule is used for splicing the first characteristic and the second characteristic to obtain a third characteristic;
the fourth characteristic submodule is used for inputting the third characteristic into a third recurrent neural network to obtain a fourth characteristic;
and the mask output submodule is used for inputting the fourth characteristic into the first full-connection network to obtain a spectrum mask.
In a possible implementation manner of the second aspect, the apparatus further includes:
and the text recognition module is used for carrying out voice recognition on the target audio signal to obtain text information and displaying the text information.
In a possible implementation manner of the second aspect, the apparatus further includes:
and the audio playing module is used for playing the target audio signal.
In a possible implementation manner of the second aspect, the apparatus further includes:
and the audio sending module is used for sending the target audio signal to the appointed terminal equipment.
A third aspect of the embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, implements the steps of the method as described above.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to implement the steps of the method as described above.
Compared with the prior art, the embodiment of the application has the advantages that:
the embodiment of the application provides an audio processing method, wherein mouth features in a face image are extracted, the mouth features of the face image and an audio signal to be processed are used as input of a preset neural network model to obtain a spectrum mask, and the audio signal to be processed is processed according to the spectrum mask to obtain a target audio signal. In the process of audio processing, a complete face image is not directly used as model input, but mouth features in the face image are used as model input, and a preset neural network model does not need to process redundant information except for a mouth in the face image, so that the calculated amount of the preset neural network model is greatly reduced, the audio processing method can be applied to various low-computing-power terminal devices, and the problems that the existing face-based auxiliary noise reduction algorithm has high computing power requirement on the terminal devices, is difficult to operate on the low-computing-power terminal devices and has limited application scenes are solved.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an audio processing method provided in an embodiment of the present application;
FIG. 2 is a diagram illustrating a speaker count detection model according to an embodiment of the present disclosure;
fig. 3 is a schematic voting diagram of a voting network provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of voting in another voting network provided by embodiments of the present application;
fig. 5 is a schematic diagram of a face key point provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of key points of a mouth provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of another key point of the mouth provided by an embodiment of the present application;
fig. 8 is a schematic diagram of a mouth keypoint normalization process provided in an embodiment of the present application;
FIG. 9 is a diagram illustrating a normalized result of key points of the mouth according to an embodiment of the present disclosure;
FIG. 10 is a diagram illustrating a neural network model according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of another neural network model provided in an embodiment of the present application;
FIG. 12 is a schematic view of a scenario provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of another scenario provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of another scenario provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of another scenario provided by an embodiment of the present application;
fig. 16 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 17 is a schematic diagram of a terminal device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
The audio processing method provided by the embodiments of the present application can be applied to terminal devices. The terminal device may be any device having a data processing function, including but not limited to a smart phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and other terminal devices.
In addition, in the description of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Currently, many terminal devices have voice interaction functions, such as a voice assistant or a voice input method. When a user uses such a terminal device in a quiet environment, the terminal device can recognize the recorded audio data fairly accurately.
However, once the noise level in the environment is high and the terminal device is not configured with appropriate noise reduction measures, the recognition accuracy of the audio data may drop drastically.
An effective noise reduction approach is very important for terminal devices with a voice interaction function. In some noise reduction approaches, the terminal device may reduce noise by combining software and hardware, e.g. beamforming techniques based on multi-microphone arrays. However, this approach depends on the hardware configuration of the terminal device and is difficult to migrate to different terminal devices.
In other noise reduction approaches, a general noise reduction algorithm may be used. This approach can be applied to most terminal devices, but its noise reduction effect is limited; moreover, when the terminal device is in a multi-person scene, it is difficult for the terminal device to determine which user's human voice signal should be filtered, which causes the noise reduction algorithm to fail.
In view of the above, researchers have proposed face-based auxiliary noise reduction algorithms. In a face-to-face situation, a human can filter out a speaker's speech from a noisy background by observing changes in the speaker's face, particularly mouth movements, in combination with the sounds heard; this works especially well for isolating the target speaker when several people speak at once. Based on a similar principle, the terminal device can collect a speaker's face images and the audio signal as model inputs, and adopt a Convolutional Neural Network (CNN) plus a bidirectional Recurrent Neural Network (BiRNN) as the audio noise reduction model: the CNN extracts the face features of the face images and the audio features of the audio signal, the face features and the audio features are spliced into a target feature, and the target feature is processed by the BiRNN and then fed into a fully connected layer to obtain the spectrum masks corresponding to different human voice signals. The audio signal is then processed with the spectrum mask corresponding to each human voice signal, so the noise in the audio signal can be masked out and the denoised human voice signals are obtained.
However, because the existing face-based auxiliary noise reduction algorithms directly use face images as model inputs, a large number of face images in consecutive video frames must be processed, and each face image contains many pixels. This requires a large amount of computation, so these algorithms are difficult to run on low-computing-power terminal devices (such as mobile phones, vehicle-mounted terminals, etc.).
In summary, the existing face-based auxiliary noise reduction algorithm places high computing-power demands on terminal devices, is difficult to run on low-computing-power terminal devices, and has limited application scenarios. To solve these problems, an embodiment of the present application provides an audio processing method in which the mouth features in the face images are extracted, the mouth features of the face images and the audio signal to be processed are used as the input of a preset neural network model to obtain a spectrum mask, and the audio signal to be processed is processed according to the spectrum mask to obtain a target audio signal. During audio processing, the complete face image is not used directly as the model input; instead, the mouth features in the face image are used as the model input, so the preset neural network model does not need to process redundant information outside the mouth in the face image. This greatly reduces the amount of computation of the preset neural network model, allows the audio processing method to be applied to various low-computing-power terminal devices, and solves the problems that the existing face-based auxiliary noise reduction algorithm places high computing-power demands on terminal devices, is difficult to run on low-computing-power terminal devices, and has limited application scenarios.
Next, the audio processing method provided by the embodiment of the present application will be described in detail from the perspective of the terminal device. Referring to the flowchart of the audio processing method shown in fig. 1, the method includes:
s101, acquiring a face image set to be processed and an audio signal to be denoised;
when the user uses the voice interaction function of the terminal equipment, the audio processing method provided by the embodiment can be actively started through user operations such as clicking a screen and the like; alternatively, the terminal device may automatically enable the audio processing method of the embodiment when detecting that the user uses the voice interaction function.
In some possible implementations, when the user uses the voice interaction function of the terminal device, a trigger button may be provided on the display interface of the terminal device. If the user believes that there is significant noise in the environment, the user can tap the trigger button. When the terminal device detects that the user has tapped the trigger button, it starts the audio processing method provided by this embodiment to perform noise reduction on the acquired audio signal.
In other possible implementations, when the user uses the voice interaction function on the terminal device, the audio processing method provided by this embodiment may be enabled by default to perform noise reduction processing on the acquired audio signal.
In other possible implementations, when the user uses the voice interaction function on the terminal device, the terminal device may detect the number of speakers in the environment in real time. When the number of speakers is detected to be greater than 1, the user can be automatically enabled or prompted to enable the audio signal processing method provided by the embodiment to perform noise reduction processing on the acquired audio signal.
When the number of speakers is detected, an audio signal with a first preset duration can be recorded as a detection audio signal. The first preset duration can be set according to actual requirements. For example, the first preset time period may be set to 3 s.
After the detection audio signal is acquired, the spectral feature of the detection audio signal may be extracted. The method for extracting the spectral characteristics of the audio signal comprises the following steps:
1.1 Perform framing on the audio signal to obtain at least one audio signal frame. During framing, the frame length of each audio signal frame and the frame interval between two adjacent audio signal frames can be set according to actual conditions. For example, the frame length may be set to 25 ms, and the frame interval between two adjacent audio signal frames may be set to 10 ms.
1.2 Sample each audio signal frame so that each frame becomes a discrete signal. The sampling frequency can be set according to actual conditions; for example, it may be set to 16 kHz.
1.3 Perform a short-time discrete Fourier transform (STFT) on the sample data of each audio signal frame to obtain the spectral features corresponding to each audio signal frame.
When performing the short-time discrete Fourier transform, the number of STFT points may be set according to actual conditions. For example, the number of STFT points may be set to 400, meaning that each audio signal frame contains 400 samples. When an audio signal frame contains fewer than 400 samples, it is automatically zero-padded to 400 samples.
The dimension of the extracted spectral features is N × 201 × 2, where N is the number of audio signal frames, 201 is the STFT spectral dimension, and 2 indicates that the spectral features include real and imaginary parts.
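As an illustration, the following Python sketch (function and parameter names are hypothetical; only the frame length, frame interval, sampling rate and STFT settings come from the text above) shows one way such spectral features could be computed:

```python
import numpy as np

def extract_spectral_features(samples, sample_rate=16000,
                              frame_len_ms=25, frame_step_ms=10, n_fft=400):
    """Frame the signal, zero-pad each frame to n_fft points, and take an STFT,
    returning an N x 201 x 2 array holding real and imaginary parts."""
    frame_len = sample_rate * frame_len_ms // 1000    # 400 samples at 16 kHz
    frame_step = sample_rate * frame_step_ms // 1000  # 160 samples at 16 kHz
    n_frames = max(1, 1 + (len(samples) - frame_len) // frame_step)
    feats = np.zeros((n_frames, n_fft // 2 + 1, 2), dtype=np.float32)
    for i in range(n_frames):
        frame = samples[i * frame_step: i * frame_step + frame_len]
        frame = np.pad(frame, (0, n_fft - len(frame)))  # zero-pad short frames to 400 points
        spectrum = np.fft.rfft(frame, n=n_fft)           # 201 frequency bins
        feats[i, :, 0] = spectrum.real
        feats[i, :, 1] = spectrum.imag
    return feats
```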
The detection audio signal can be divided into at least one second audio signal frame by the above spectral feature extraction method, and the spectral feature of each second audio signal frame is extracted.
After the spectral characteristics of each second audio signal frame of the detected audio signal are extracted, the spectral characteristics of each second audio signal frame are input into the speaker number detection model, and speaker number identification is carried out on the detected audio signal to obtain the number of target speakers.
The specific structure of the speaker number detection model can be set according to actual conditions. In some embodiments, the structure of the speaker number detection model may include a fourth recurrent neural network (RNN), a second fully connected network, and a voting network, as shown in fig. 2.
And inputting the spectral characteristics of each second audio signal frame of the detected audio signal into the fourth RNN to obtain fifth characteristics corresponding to each second audio signal frame.
And inputting the fifth characteristics corresponding to each second audio signal frame into a second full-connection network, and performing regression prediction on the fifth characteristics by the second full-connection network to obtain a speaker number prediction result corresponding to each second audio signal frame.
And inputting the speaker number prediction result corresponding to each second audio signal frame into a voting network to obtain the number of the target speakers.
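A minimal PyTorch-style sketch of this flow is given below, assuming a GRU for the fourth recurrent neural network and a single linear layer for the second fully connected network; the layer types and sizes are assumptions rather than details specified by this application.

```python
import torch
import torch.nn as nn

class SpeakerCountModel(nn.Module):
    """Fourth RNN + second fully connected network; the voting network is applied
    afterwards to the per-frame predictions (see the voting sketches below)."""
    def __init__(self, spec_dim=201 * 2, hidden=128):
        super().__init__()
        self.fourth_rnn = nn.GRU(spec_dim, hidden, batch_first=True)
        self.second_fc = nn.Linear(hidden, 1)

    def forward(self, spec_feats):
        # spec_feats: (batch, n_frames, 201, 2) spectral features of the second audio signal frames
        x = spec_feats.flatten(start_dim=2)
        fifth, _ = self.fourth_rnn(x)             # fifth features, one per frame
        pred = self.second_fc(fifth).squeeze(-1)  # per-frame speaker-count regression
        return pred.round().clamp(min=0)          # rounded to an integer count per frame
```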
In some embodiments, the voting network may directly take the speaker-count prediction that occurs most often as the target number of speakers. The voting function of the voting network can be expressed as:

P' = Min(Vote(P{t0, t0 + Δt}))

where P' denotes the voting result; t0 denotes the start time of the detection audio signal; t0 + Δt denotes the end time of the detection audio signal; P{t0, t0 + Δt} denotes the set of speaker-count predictions corresponding to the second audio signal frames in the time period [t0, t0 + Δt]; Vote denotes the voting function, whose output is the value (possibly several tied values) that occurs most often among the predictions; and Min denotes selecting the minimum among several tied values.
For example, as shown in fig. 3, P{t0, t0 + Δt} = {0, 1, 0, 1, 2, 1, 2, 2, 2}. The value 2 occurs most often, so the voting network outputs 2 as the target number of speakers.
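As a hedged illustration, a direct Python implementation of this majority-then-minimum voting rule might look as follows (the function name is hypothetical):

```python
from collections import Counter

def vote(per_frame_counts):
    """Return the most frequent per-frame speaker-count prediction;
    if several values are tied, return the smallest of them (P' = Min(Vote(...)))."""
    counts = Counter(per_frame_counts)
    best = max(counts.values())
    return min(value for value, c in counts.items() if c == best)

# Example from the text: 2 occurs most often, so 2 is the target number of speakers.
print(vote([0, 1, 0, 1, 2, 1, 2, 2, 2]))  # -> 2
```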
In other embodiments, the voting network may apply windowing to the per-frame speaker-count predictions and slide a window over them, taking the prediction that occurs most often within each window as a candidate prediction. The candidate prediction that occurs most often is then selected as the target number of speakers.

For example, as shown in fig. 4, P{t0, t0 + Δt} = {0, 1, 0, 1, 2, 1, 2, 2, 2}, and each window covers 4 consecutive predictions (each arc in fig. 4 represents one window). Sliding the window yields the windows {0,1,0,1}, {1,0,1,2}, {0,1,2,1}, {1,2,1,2}, {2,1,2,2} and {1,2,2,2}. The prediction that occurs most often in each window is taken as a candidate prediction, giving the candidate set {0, 1, 1, 1, 2, 2}. The candidate that occurs most often is then selected as the target number of speakers; since 1 occurs most often in this set, the voting network outputs 1 as the target number of speakers.
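Continuing the sketch above (again with hypothetical names), the sliding-window variant could be written as:

```python
def windowed_vote(per_frame_counts, window=4):
    """Take the most frequent prediction in each sliding window of consecutive
    per-frame predictions, then vote again over the per-window candidates."""
    candidates = [vote(per_frame_counts[i:i + window])
                  for i in range(len(per_frame_counts) - window + 1)]
    return vote(candidates)

# Example from the text: the candidate set is {0, 1, 1, 1, 2, 2}, so 1 is returned.
print(windowed_vote([0, 1, 0, 1, 2, 1, 2, 2, 2]))  # -> 1
```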
When the voting network determines the number of the target speakers by using the second mode, the result jitter generated in the prediction process can be effectively filtered, so that the prediction result of the number of the target speakers is more stable and accurate.
The number of target speakers output by the voting network is the final speaker-count detection result. If the number of target speakers is greater than 1, the terminal device can automatically enable, or prompt the user to enable, the audio signal processing method provided by this embodiment to perform noise reduction on the acquired audio signal.
When the terminal device enables the audio signal processing method provided by the embodiment, a face image set to be processed and an audio signal to be denoised can be acquired. The terminal equipment can start the imaging assembly to collect the face video signal and extract a face image set to be processed from the face video signal.
It can be understood that, since the face image set to be processed is extracted from the face video signal in the present embodiment, the face image set should include at least two face images.
The face images contained in the face image set can be face images belonging to the same user or face images belonging to different users.
In some possible implementation manners, the user needs to filter out the environmental noise irrelevant to the user and the human voice signals of other people in the audio signal, and at this time, the facial image set should only include the facial image of the user using the terminal device.
In some embodiments, the terminal device may display a face collection box to prompt the user to adjust the position of the face of the user into the face collection box. The terminal equipment collects the face video signals of the face collection frame area, and extracts a face image set to be processed from the face video signals, so that the face image set to be processed only contains the face image of the user.
Or, in other embodiments, the terminal device may not prompt the face acquisition box. At this time, the face video signal acquired by the terminal device may include face images of different users. However, according to the conventional application scenario, the user of the terminal device should be closest to the terminal device, and in the face video signal, the area of the face image of the user of the terminal device should be larger than the face images of other users. Therefore, the terminal device can recognize the face image with the largest area in each frame image of the face video signal, and the face image is taken as the face image of the user of the terminal device and is added into the face image set.
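For illustration, assuming a face detector that returns bounding boxes as (x, y, width, height) tuples, selecting the user's face by largest area could be sketched as follows (the detector and its output format are assumptions):

```python
def select_user_face(face_boxes):
    """Keep the detected face with the largest bounding-box area, assumed to be
    the face of the user closest to the terminal device."""
    return max(face_boxes, key=lambda box: box[2] * box[3])
```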
In other possible implementation manners, the user needs to extract the human voice signals of different users from the audio signal, and at this time, the face image set includes the face images of different users. The terminal equipment can identify and extract the face image in each frame of image of the face video signal to obtain a face image set.
In addition, the terminal equipment can also provide a mode exit key on the display interface. If the user considers that the face-assisted denoising function is not needed or the user completes the voice interaction operation with the terminal equipment, the user can quit the face-assisted denoising function by triggering the mode quit key.
The terminal equipment can acquire the audio signal to be denoised through sound pickup equipment such as a microphone while acquiring the face image set to be processed.
It should be noted that, in the description above, the face image set to be processed and the audio signal to be denoised are data collected by the terminal device itself. In actual applications, however, the face image set to be processed and the audio signal to be denoised may either be data acquired by the terminal device or data acquired by other devices, which then transmit the face image set to be processed and the audio signal to be denoised to the terminal device of this embodiment for audio noise reduction. The embodiments of the present application do not limit the original sources of the face image set to be processed and the audio signal to be denoised.
S102, extracting mouth features of all face images in the face image set to be processed, and extracting frequency spectrum features of the audio signal to be denoised;
in the current face-based auxiliary noise reduction algorithm, a complete face image is used as an input of an audio noise reduction model. The pixels of the complete face image are numerous, and the audio noise reduction model consumes a great deal of computing power to process the complete face image.
However, in the practical application process, only the image of the mouth region in the complete face image is related to the processing process of audio noise reduction. In the complete face image, images outside the mouth region are all redundant data.
Therefore, in this embodiment, the mouth features of the face images in the face image set to be processed may be extracted and used as the input of the audio noise reduction model. This greatly reduces the amount of data the audio noise reduction model needs to process and lowers its computing-power requirement on the terminal device, so that the audio processing method of this embodiment can be applied to terminal devices with low computing power, broadening the scenarios to which the audio processing method of this embodiment applies.
In some possible implementations, the mouth feature may be an image of a mouth region in a face image. The terminal device may extract an image of a mouth Region in the face image through a Region of interest (ROI) extraction algorithm or other extraction algorithms.
In this case, the terminal device still uses the image as the input of the audio noise reduction model, but the input image is the image of the mouth region of the face image, and redundant data outside the mouth region in the face image is reduced, thereby reducing the amount of data that the audio noise reduction model needs to process.
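One simple illustration of such a crop, assuming mouth landmark coordinates are already available from some keypoint detector (the margin value and function name are assumptions, not part of this application):

```python
import numpy as np

def crop_mouth_roi(face_image, mouth_points, margin=4):
    """Crop the bounding box of the mouth region from a face image,
    given mouth keypoint coordinates as (x, y) pairs."""
    pts = np.asarray(mouth_points, dtype=int)
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    h, w = face_image.shape[:2]
    return face_image[max(0, y0):min(h, y1), max(0, x0):min(w, x1)]
```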
In other possible implementations, the mouth feature may be a point cloud coordinate matrix of a key point of the mouth corresponding to the face image. The terminal equipment can use the face key point recognition model to recognize the mouth key points corresponding to the face images in the face image set.
In some embodiments, the face keypoint recognition model may directly recognize mouth keypoints in the face image. And the terminal equipment constructs a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image in the face image set.
In other embodiments, the face keypoint identification model may identify all face keypoints in the face image, and extract mouth keypoints from all face keypoints. And the terminal equipment constructs a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image in the face image set.
For example, referring to fig. 5 and fig. 6, the terminal device may process the face image with the face key point model to obtain the 68 commonly used face key points. Then, from these 68 face key points, the terminal device extracts the 18 mouth key points, namely key points 49 to 68.
After the terminal device identifies the mouth key points in a face image, it can normalize the coordinates of the mouth key points. The principle of the normalization algorithm is to map the key points at the two corners of the mouth to the coordinates (0, 0.5) and (1, 0.5), and then map the remaining mouth key points with the same affine rotation, scaling and translation, thereby obtaining the normalized coordinates of the mouth key points in each face image. The specific process is as follows:
2.1 As shown in fig. 7, the mouth key points are denoted C_k, k ∈ [0, 17] (18 mouth key points in total). The leftmost and rightmost mouth key points are denoted C_0 and C_1, and the coordinates of C_k before normalization are denoted (x_k, y_k).

2.2 In the normalization algorithm, the normalized coordinates (x'_k, y'_k) of each mouth key point (x_k, y_k) are computed by the transformation

(x'_k, y'_k)^T = A · (x_k, y_k)^T + b

where A is a coefficient matrix (rotation and scaling) and b is an auxiliary matrix (translation), determined such that C_0 is mapped to (0, 0.5) and C_1 is mapped to (1, 0.5).
As shown in fig. 8, the above transformation maps the key points at the two corners of the mouth to the coordinates (0, 0.5) and (1, 0.5). The other mouth key points are then mapped with the same affine rotation, scaling and translation, giving the normalized coordinates of every mouth key point. The normalized point cloud coordinates of the mouth key points are shown in fig. 9.
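A hedged NumPy sketch of this normalization, assuming the 18 mouth keypoints are ordered so that the two mouth corners come first (the ordering and the function name are assumptions):

```python
import numpy as np

def normalize_mouth_keypoints(points):
    """Map the mouth corners C0 and C1 to (0, 0.5) and (1, 0.5) and apply the
    same rotation, scaling and translation to the remaining mouth keypoints."""
    points = np.asarray(points, dtype=np.float64)   # shape (18, 2)
    c0, c1 = points[0], points[1]                   # leftmost and rightmost corners
    d = c1 - c0
    dist = np.hypot(d[0], d[1])
    cos_t, sin_t = d / dist                         # rotate the corner axis onto the x-axis
    A = (1.0 / dist) * np.array([[cos_t, sin_t],    # coefficient matrix: rotation + scaling
                                 [-sin_t, cos_t]])
    b = np.array([0.0, 0.5]) - A @ c0               # auxiliary vector: translation
    return (A @ points.T).T + b                     # normalized point cloud coordinates
```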
The normalization processing is carried out on the coordinates of the key points of the mouth part, so that the problem of mouth position and distance deviation in the face image can be effectively solved, and the mouth features with smaller calculated amount and higher robustness are provided for the audio noise reduction model.
After the normalization processing is performed on the coordinates of the key points of the mouth, the terminal device can construct a point cloud coordinate matrix according to the normalized coordinates of the key points of the mouth in the face image, and obtain the point cloud coordinate matrix of the key points of the mouth corresponding to each face image.
Compared with using the image of the mouth region as the input of the audio noise reduction model, using the point cloud coordinate matrix of the mouth key points as the input further reduces the amount of data the audio noise reduction model needs to process, further lowering its computing-power requirement.
It should be noted that one face image includes a plurality of key points of the mouth, and the coordinates of the key points of the mouth may construct a point cloud coordinate matrix, that is, one face image corresponds to one point cloud coordinate matrix.
In addition to extracting the mouth features of each face image in the face image set to be processed, the spectral features of the audio signal to be denoised also need to be extracted. The extraction may follow the spectral feature extraction method described above: the audio signal to be denoised is divided into a plurality of first audio signal frames, and the spectral features of each first audio signal frame are extracted.
S103, inputting the mouth characteristics of each face image and the frequency spectrum characteristics of the audio signal to be denoised into a preset neural network model to obtain a frequency spectrum mask;
after the terminal device obtains the mouth features of the face images in the face image set to be processed and the spectral features of the audio signals to be denoised, the mouth features of the face images in the face image set to be processed and the spectral features of the audio signals to be denoised are input into a preset neural network model to obtain a spectral mask.
The user can select a proper neural network model as an audio noise reduction model in advance according to actual requirements.
As shown in fig. 10, in some possible implementations, the preset neural network model may include a first convolutional network, a second convolutional network, a fifth RNN network, and a third fully-connected network. At this time, the mouth feature may be an image of a mouth region in the face image.
The image of the mouth region in each face image is input into the first convolutional network to obtain a sixth feature. The spectral features of the audio signal to be denoised are input into the second convolutional network to obtain a seventh feature. The sixth feature and the seventh feature are spliced to obtain an eighth feature. The eighth feature is input into the fifth RNN network to obtain a ninth feature. The ninth feature is input into the third fully connected network to obtain the spectrum mask.
As shown in fig. 11, in other possible implementations, the preset neural network model may include a first RNN network, a second RNN network, a third RNN network, and a first fully-connected network. At this time, the mouth feature may be a point cloud coordinate matrix of key points of the mouth in the face image.
The terminal device may input the point cloud coordinate matrix of the key point of the mouth corresponding to each face image into the first RNN network, so as to obtain the first feature. And the terminal equipment inputs the frequency spectrum characteristic of the audio signal to be denoised into the second RNN network to obtain a second characteristic. And the terminal equipment splices the first characteristic and the second characteristic to obtain a third characteristic. And the terminal equipment inputs the third characteristic into a third RNN network to obtain a fourth characteristic. And the terminal equipment inputs the fourth characteristic into the first full-connection network to obtain a spectrum mask.
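A hedged PyTorch-style sketch of this second model variant is shown below; the use of GRU layers, the hidden size, the sigmoid output, and the assumption that mouth features are aligned one per audio frame are illustrative choices, not details specified by this application.

```python
import torch
import torch.nn as nn

class AudioDenoiseNet(nn.Module):
    """First RNN (mouth features) + second RNN (spectral features) -> splice ->
    third RNN -> first fully connected network -> spectrum mask in [0, 1]."""
    def __init__(self, mouth_dim=18 * 2, spec_dim=201 * 2, hidden=128):
        super().__init__()
        self.first_rnn = nn.GRU(mouth_dim, hidden, batch_first=True)
        self.second_rnn = nn.GRU(spec_dim, hidden, batch_first=True)
        self.third_rnn = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.first_fc = nn.Sequential(nn.Linear(hidden, 201), nn.Sigmoid())

    def forward(self, mouth_feats, spec_feats):
        # mouth_feats: (batch, frames, 18, 2) point cloud coordinate matrices
        # spec_feats:  (batch, frames, 201, 2) spectral features of the first audio signal frames
        first, _ = self.first_rnn(mouth_feats.flatten(start_dim=2))
        second, _ = self.second_rnn(spec_feats.flatten(start_dim=2))
        third = torch.cat([first, second], dim=-1)   # spliced feature
        fourth, _ = self.third_rnn(third)
        return self.first_fc(fourth)                 # spectrum mask
```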
The convolutional networks mentioned above are composed of one or more convolutional layers. The type of convolutional layer can be selected according to actual requirements; for example, a conventional convolutional layer or a dilated convolutional layer may be used.
The structure of the RNN networks mentioned above may be selected according to actual requirements. For example, a single-layer RNN, a multi-layer RNN, or a variant RNN structure such as a single-layer LSTM or a multi-layer LSTM may be selected as the RNN structure.
The above-mentioned fully-connected network consists of one or more fully-connected layers.
When the preset neural network model of the first implementation is adopted, the structures of the first and second convolutional networks impose a limitation: the terminal device must acquire the complete face image set to be processed and the complete audio signal to be denoised before they can be input into the preset neural network model for processing. Consequently, the mouth features of each face image frame and the spectral features of each first audio signal frame cannot be processed in real time, which greatly limits the application scenarios of the audio processing method of this embodiment.
When the preset neural network model of the second implementation is adopted, the first RNN network and the second RNN network receive the mouth features of the face images and the spectral features of the audio signal to be denoised. Because an RNN is a neural network structure whose inputs are fed in time order, a processing step can be executed each time the mouth feature of one face image frame and the spectral feature of one first audio signal frame are received. The mouth feature of each face image frame and the spectral feature of each first audio signal frame are therefore processed in real time, which broadens the application scenarios of the audio processing method of this embodiment.
It should be noted that the spectrum mask mentioned above is a matrix whose values lie in [0, 1] and whose size is the same as that of the spectrum.
The number of spectrum masks mentioned above is related to the number of users in the face image set to be processed. If the face images in the face image set all belong to the same user, one spectrum mask is obtained. If the face image set contains face images of multiple users, the number of spectrum masks matches the number of users in the face image set.
S104: processing the audio signal to be denoised by using the spectrum mask to obtain a target audio signal.
After obtaining the spectrum mask, the terminal device may process the audio signal to be denoised by using the spectrum mask: the spectrum mask is multiplied element-wise with the spectral feature of the audio signal to be denoised to generate a denoised spectrum, and the target audio signal is then obtained from the denoised spectrum.
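A common way to realize this step is to mask the short-time Fourier transform (STFT) magnitude of the noisy signal and resynthesize the waveform with the noisy phase; the sketch below assumes that choice (and uses librosa for the transforms), since the embodiment does not restate the spectral transform here. The mask is expected with shape (F, T), matching the STFT, so the model output from the earlier sketches would be transposed first.

```python
import numpy as np
import librosa

def apply_spectrum_mask(noisy_audio, mask, n_fft=512, hop_length=160):
    """Multiply the spectrum mask element-wise with the noisy magnitude
    spectrum and resynthesize the waveform. Reusing the noisy phase and the
    STFT parameters shown here are assumptions of this sketch."""
    stft = librosa.stft(noisy_audio, n_fft=n_fft, hop_length=hop_length)  # (F, T) complex
    magnitude, phase = np.abs(stft), np.angle(stft)
    denoised_mag = magnitude * mask                    # mask: (F, T), values in [0, 1]
    denoised_stft = denoised_mag * np.exp(1j * phase)  # keep the noisy phase
    return librosa.istft(denoised_stft, hop_length=hop_length)
```

When the face image set contains several users, the same function can be called once per spectrum mask to obtain the noise-reduced audio of each user, as described below.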
The target audio signal is the noise-reduced audio signal. When the face images in the face image set all belong to the same user, the spectrum mask suppresses the noise signals other than that user's voice in the audio signal to be denoised, highlighting the user's voice and achieving the noise reduction effect.
When the face image set contains face images of multiple users, there are multiple spectrum masks. The audio signal to be denoised is processed with each spectrum mask separately to obtain the noise-reduced audio corresponding to each user. In the noise-reduced audio corresponding to a given user, the voice signals of other users and the environmental noise are suppressed and that user's voice is highlighted, achieving the noise reduction effect.
After the target audio signal is obtained, the terminal device can cache the target audio signal and perform corresponding operations according to user requirements.
In some possible implementations, the terminal device may perform speech recognition on the target audio signal by using an automatic speech recognition (ASR) engine according to a speech recognition instruction of the user, and display the recognized text information to the user.
In other possible implementation manners, the terminal device may play the target audio signal according to a playback instruction of the user, so that the user can listen to the voice effect after noise reduction.
In other possible implementations, the terminal device may send the target audio signal to a specified terminal device according to a file sending instruction of a user.
The following describes the audio processing method provided in this embodiment with reference to specific application scenarios:
Referring to fig. 12, when the user wants to enable the voice interaction function, the interface shown in fig. 12 may be entered. The microphone-shaped key in fig. 12 is the recording key.
The user can click the recording button to start the audio recording function.
As shown in fig. 13, after the user starts the audio recording function, the terminal device may record a detection audio signal of a first preset duration (e.g., 3 seconds) and perform speaker-count detection on the detection audio signal. When voice signals of multiple users exist in the environment, the terminal device prompts the user that multiple people have been detected nearby and that face-assisted noise reduction is being enabled, and then starts the face-assisted noise reduction function.
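As a rough sketch of this gating step, the snippet below records the detection audio signal and enables face-assisted noise reduction only when more than one speaker is detected. `count_speakers` is a hypothetical placeholder for the device's speaker-count detector, and the sounddevice library is used purely for illustration.

```python
import sounddevice as sd

SAMPLE_RATE = 16000
PRESET_SECONDS = 3  # the first preset duration mentioned above

def should_enable_face_assisted_denoising(count_speakers):
    """Record the detection audio signal and decide whether to enable
    face-assisted noise reduction. `count_speakers` is a hypothetical
    callable standing in for the device's speaker-count detector."""
    detection_audio = sd.rec(int(PRESET_SECONDS * SAMPLE_RATE),
                             samplerate=SAMPLE_RATE, channels=1)
    sd.wait()  # block until the recording of the preset duration finishes
    return count_speakers(detection_audio.squeeze()) > 1
```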
As shown in fig. 14, after the terminal device starts the face-assisted noise reduction function, it may display a face acquisition box and prompt "please place the face in the picture", so that the face image set of the user can be acquired accurately. In addition, the terminal device may provide a mode exit key together with a prompt for turning off the assisted noise reduction, so that the user is informed of the function of the mode exit key. When the user wants to quit the face-assisted noise reduction function, the user can do so by clicking the mode exit key.
The terminal device then acquires the face image set to be processed and the audio signal to be denoised, extracts the mouth feature of each face image in the face image set to be processed, and extracts the spectral feature of the audio signal to be denoised. The terminal device inputs the mouth features of the face images and the spectral feature of the audio signal to be denoised into the preset neural network model for processing to obtain a spectrum mask, and processes the audio signal to be denoised with the spectrum mask to obtain the target audio signal.
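Tying the earlier sketches together, an end-to-end pass over this scenario might look as follows. `extract_mouth_keypoints` and `extract_spectrum` are hypothetical placeholders for the feature-extraction steps described above, and `apply_spectrum_mask` refers to the earlier mask-application sketch.

```python
import torch

def denoise_with_face(face_frames, noisy_audio, model,
                      extract_mouth_keypoints, extract_spectrum):
    """End-to-end sketch of the scenario above, built on the earlier sketches.
    The two extractor callables are hypothetical placeholders for the
    mouth-feature and spectral-feature extraction of this embodiment."""
    mouth = torch.stack([extract_mouth_keypoints(f) for f in face_frames])   # (T, K, 2)
    spectrum = extract_spectrum(noisy_audio)                                 # (T, F) tensor
    with torch.no_grad():
        mask = model(mouth.unsqueeze(0), spectrum.unsqueeze(0))[0]           # (T, F)
    # apply_spectrum_mask (from the earlier sketch) expects an (F, T) mask.
    return apply_spectrum_mask(noisy_audio, mask.numpy().T)
```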
As shown in fig. 15, the terminal device may perform speech recognition on the target audio signal using the ASR engine, and display the recognized text information "where to play" to the user.
Meanwhile, the terminal device can also provide a play key and a share key. In fig. 15, the leftmost key at the bottom is the play key, and the rightmost key is the share key.
When the user clicks the play key, the terminal device can play the target audio signal, so that the user can conveniently listen to the voice effect after noise reduction.
When the user clicks the sharing key, the terminal device may send the target audio signal to a designated terminal device.
In the audio processing method provided by this embodiment, the mouth feature of each face image is extracted, the mouth features and the spectral feature of the audio signal to be denoised are used as the input of the preset neural network model to obtain a spectrum mask, and the audio signal to be denoised is then processed according to the spectrum mask to obtain the target audio signal.
In the audio processing procedure, noise reduction is performed based on the mouth features of the face images and the spectral feature of the audio signal to be denoised, without relying on hardware noise reduction, so the method can be conveniently extended to any terminal device equipped with a sound pickup device (such as a microphone) and an imaging component (such as a camera).
Meanwhile, because noise reduction is driven jointly by the mouth features and the spectral feature of the audio signal to be denoised, the audio processing method of this embodiment can generate a target audio signal with a higher signal-to-noise ratio.
In addition, the audio processing method of this embodiment does not use a complete face image directly as the model input; instead, it uses the mouth features extracted from the face image as the model input, so the preset neural network model does not need to process the redundant information in the face image outside the mouth region. This greatly reduces the computation required by the preset neural network model and allows the audio processing method to run on various low-compute terminal devices, addressing the problem that existing face-assisted noise reduction algorithms demand high computing power from terminal devices, are difficult to run on low-compute terminal devices, and therefore have limited application scenarios.
Moreover, when the preset neural network model uses an RNN structure to receive the mouth feature of each face image and the spectral feature of the audio signal to be denoised, the mouth feature of each frame of face image and the spectral feature of each first audio signal frame can be processed continuously and the processing result output in real time.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Referring to fig. 16, an audio processing apparatus according to an embodiment of the present application is shown. For convenience of illustration, only the parts related to the present application are shown. As shown in fig. 16, the audio processing apparatus includes:
a data obtaining module 1601, configured to obtain a set of facial images to be processed and an audio signal to be denoised;
a feature extraction module 1602, configured to extract mouth features of each facial image in the facial image set to be processed, and extract spectral features of the audio signal to be denoised;
a spectrum mask module 1603, configured to input the mouth features of each face image and the spectrum features of the audio signal to be denoised into a preset neural network model to obtain a spectrum mask;
the target audio module 1604 is configured to process the audio signal to be denoised by using the spectrum mask, so as to obtain a target audio signal.
Further, the apparatus further comprises:
the detection audio module is used for acquiring a detection audio signal with a first preset duration;
the number identification module is used for identifying the number of speakers of the detected audio signals to obtain the number of target speakers;
correspondingly, the data obtaining module 1601 is specifically configured to obtain a to-be-processed face image set and an audio signal to be denoised if the number of target speakers is greater than 1.
Further, the mouth feature is an image of a mouth region of the face image;
accordingly, the feature extraction module 1602 includes:
and the image intercepting submodule is used for identifying and intercepting images of mouth regions of all the face images in the face image set to be processed.
Further, the mouth features are point cloud coordinate matrixes of mouth key points corresponding to the face images;
accordingly, the feature extraction module 1602 includes:
the key identification submodule is used for identifying key points of a mouth corresponding to each face image in the face image set to be processed;
and the point cloud matrix submodule is used for constructing a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
Further, the point cloud matrix sub-module includes:
the normalization submodule is used for performing normalization processing on the coordinates of each key point of the mouth in the face image to obtain the normalized coordinates of each key point of the mouth in the face image;
and the matrix construction submodule is used for constructing a point cloud coordinate matrix according to the normalized coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
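A minimal numpy sketch of what these two sub-modules compute is shown below; normalizing the coordinates by the image width and height is one possible scheme and is assumed here only for illustration, as is the function name.

```python
import numpy as np

def mouth_point_cloud_matrix(keypoints, image_width, image_height):
    """Normalize each mouth key point's (x, y) coordinates and stack them into
    the point cloud coordinate matrix. Normalizing by the image size is one
    possible scheme and is assumed here for illustration."""
    pts = np.asarray(keypoints, dtype=np.float32)   # (K, 2) pixel coordinates
    pts[:, 0] /= image_width                        # x into [0, 1]
    pts[:, 1] /= image_height                       # y into [0, 1]
    return pts                                      # (K, 2) point cloud coordinate matrix
```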
Further, the preset neural network model comprises a first recurrent neural network, a second recurrent neural network, a third recurrent neural network and a first fully-connected network;
correspondingly, the spectrum mask module 1603 includes:
the first characteristic submodule is used for inputting the mouth characteristics of each face image into a first recurrent neural network to obtain first characteristics;
the second characteristic submodule is used for inputting the frequency spectrum characteristic of the audio signal to be denoised into a second recurrent neural network to obtain a second characteristic;
the third characteristic submodule is used for splicing the first characteristic and the second characteristic to obtain a third characteristic;
the fourth characteristic submodule is used for inputting the third characteristic into a third recurrent neural network to obtain a fourth characteristic;
and the mask output submodule is used for inputting the fourth characteristic into the first full-connection network to obtain a spectrum mask.
Further, the apparatus further comprises:
and the text recognition module is used for carrying out voice recognition on the target audio signal to obtain text information and displaying the text information.
Further, the apparatus further comprises:
and the audio playing module is used for playing the target audio signal.
Further, the apparatus further comprises:
and the audio sending module is used for sending the target audio signal to the appointed terminal equipment.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Referring to fig. 17, an embodiment of the present application further provides a terminal device, where the terminal device 17 includes: a processor 170, a memory 171, and a computer program 172 stored in the memory 171 and executable on the processor 170. The processor 170, when executing the computer program 172, implements the steps in the above-described audio processing method embodiments, such as the steps S101 to S104 shown in fig. 1. Alternatively, the processor 170, when executing the computer program 172, implements the functions of the modules/units in the above-described device embodiments, such as the functions of the modules 1601 to 1604 shown in fig. 16.
Illustratively, the computer program 172 may be partitioned into one or more modules/units that are stored in the memory 171 and executed by the processor 170 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 172 in the terminal device 17. For example, the computer program 172 may be divided into a data acquisition module, a feature extraction module, a spectrum mask module, and a target audio module, and each module specifically functions as follows:
the data acquisition module is used for acquiring a face image set to be processed and an audio signal to be denoised;
the characteristic extraction module is used for extracting the mouth characteristics of all the face images in the face image set to be processed and extracting the frequency spectrum characteristics of the audio signal to be denoised;
the frequency spectrum mask module is used for inputting the mouth characteristics of each face image and the frequency spectrum characteristics of the audio signal to be denoised into a preset neural network model to obtain a frequency spectrum mask;
and the target audio module is used for processing the audio signal to be subjected to noise reduction by using the spectrum mask to obtain a target audio signal.
The terminal device 17 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, a processor 170 and a memory 171. Those skilled in the art will appreciate that fig. 17 is merely an example of the terminal device 17 and does not constitute a limitation on the terminal device 17, which may include more or fewer components than shown, combine some components, or use different components; for example, the terminal device may also include input/output devices, network access devices, buses, and the like.
The processor 170 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 171 may be an internal storage unit of the terminal device 17, such as a hard disk or memory of the terminal device 17. The memory 171 may also be an external storage device of the terminal device 17, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device 17. Further, the memory 171 may include both an internal storage unit and an external storage device of the terminal device 17. The memory 171 is used to store the computer program and other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (20)

1. An audio processing method, comprising:
acquiring a face image set to be processed and an audio signal to be denoised;
extracting mouth features of the face images in the face image set to be processed, and extracting frequency spectrum features of the audio signal to be denoised;
inputting the mouth characteristics of each face image and the frequency spectrum characteristics of the audio signal to be denoised into a preset neural network model to obtain a frequency spectrum mask;
and processing the audio signal to be subjected to noise reduction by using the spectrum mask to obtain a target audio signal.
2. The audio processing method according to claim 1, wherein before said obtaining the set of facial images to be processed and the audio signal to be noise-reduced, further comprising:
acquiring a detection audio signal with a first preset time length;
carrying out speaker number identification on the detected audio signals to obtain the number of target speakers;
correspondingly, the acquiring the facial image set to be processed and the audio signal to be denoised includes:
and if the number of the target speakers is more than 1, acquiring a human face image set to be processed and an audio signal to be denoised.
3. The audio processing method according to claim 1, wherein the mouth feature is an image of a mouth region of the face image;
correspondingly, the extracting the mouth features of the face images in the face image set to be processed includes:
and identifying and intercepting images of mouth regions of the face images in the face image set to be processed.
4. The audio processing method according to claim 1, wherein the mouth feature is a point cloud coordinate matrix of a mouth key point corresponding to the face image;
correspondingly, the extracting the mouth features of the face images in the face image set to be processed includes:
identifying key points of a mouth corresponding to each face image in the face image set to be processed;
and constructing a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
5. The audio processing method according to claim 4, wherein the constructing a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image, and obtaining the point cloud coordinate matrix of the key point of the mouth corresponding to each face image comprises:
carrying out normalization processing on the coordinates of each key point of the mouth in the face image to obtain the normalized coordinates of each key point of the mouth in the face image;
and constructing a point cloud coordinate matrix according to the normalized coordinates of the key points of the mouth in the face image to obtain the point cloud coordinate matrix of the key points of the mouth corresponding to the face image.
6. The audio processing method according to claim 1, wherein the preset neural network model comprises a first recurrent neural network, a second recurrent neural network, a third recurrent neural network, and a first fully-connected network;
correspondingly, the inputting the mouth feature of each face image and the spectral feature of the audio signal to be denoised into a preset neural network model to obtain a spectral mask includes:
inputting the mouth features of the face images into a first recurrent neural network to obtain first features;
inputting the frequency spectrum characteristic of the audio signal to be denoised into a second recurrent neural network to obtain a second characteristic;
splicing the first feature and the second feature to obtain a third feature;
inputting the third characteristic into a third recurrent neural network to obtain a fourth characteristic;
and inputting the fourth characteristic into a first full-connection network to obtain a spectrum mask.
7. The audio processing method of claim 1, wherein the method further comprises:
and carrying out voice recognition on the target audio signal to obtain text information and displaying the text information.
8. The audio processing method of claim 1, wherein the method further comprises:
and playing the target audio signal.
9. The audio processing method of claim 1, wherein the method further comprises:
and sending the target audio signal to a designated terminal device.
10. An audio processing apparatus, comprising:
the data acquisition module is used for acquiring a face image set to be processed and an audio signal to be denoised;
the characteristic extraction module is used for extracting the mouth characteristics of all the face images in the face image set to be processed and extracting the frequency spectrum characteristics of the audio signal to be denoised;
the frequency spectrum mask module is used for inputting the mouth characteristics of each face image and the frequency spectrum characteristics of the audio signal to be denoised into a preset neural network model to obtain a frequency spectrum mask;
and the target audio module is used for processing the audio signal to be subjected to noise reduction by using the spectrum mask to obtain a target audio signal.
11. The audio processing apparatus of claim 10, wherein the apparatus further comprises:
the detection audio module is used for acquiring a detection audio signal with a first preset duration;
the number identification module is used for identifying the number of speakers of the detected audio signals to obtain the number of target speakers;
correspondingly, the data acquisition module is specifically configured to acquire a face image set to be processed and an audio signal to be denoised if the number of target speakers is greater than 1.
12. The audio processing apparatus according to claim 10, wherein the mouth feature is an image of a mouth region of the face image;
correspondingly, the feature extraction module comprises:
and the image intercepting submodule is used for identifying and intercepting images of mouth regions of all the face images in the face image set to be processed.
13. The audio processing apparatus according to claim 10, wherein the mouth feature is a point cloud coordinate matrix of a mouth key point corresponding to the face image;
correspondingly, the feature extraction module comprises:
the key identification submodule is used for identifying key points of a mouth corresponding to each face image in the face image set to be processed;
and the point cloud matrix submodule is used for constructing a point cloud coordinate matrix according to the coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
14. The audio processing apparatus of claim 13, wherein the point cloud matrix sub-module comprises:
the normalization submodule is used for performing normalization processing on the coordinates of each key point of the mouth in the face image to obtain the normalized coordinates of each key point of the mouth in the face image;
and the matrix construction submodule is used for constructing a point cloud coordinate matrix according to the normalized coordinates of each key point of the mouth in the face image to obtain the point cloud coordinate matrix of the key point of the mouth corresponding to each face image.
15. The audio processing apparatus according to claim 10, wherein the preset neural network model includes a first recurrent neural network, a second recurrent neural network, a third recurrent neural network, and a first fully-connected network;
correspondingly, the spectrum mask module comprises:
the first characteristic submodule is used for inputting the mouth characteristics of each face image into a first recurrent neural network to obtain first characteristics;
the second characteristic submodule is used for inputting the frequency spectrum characteristic of the audio signal to be denoised into a second recurrent neural network to obtain a second characteristic;
the third characteristic submodule is used for splicing the first characteristic and the second characteristic to obtain a third characteristic;
the fourth characteristic submodule is used for inputting the third characteristic into a third recurrent neural network to obtain a fourth characteristic;
and the mask output submodule is used for inputting the fourth characteristic into the first full-connection network to obtain a spectrum mask.
16. The audio processing apparatus of claim 10, wherein the apparatus further comprises:
and the text recognition module is used for carrying out voice recognition on the target audio signal to obtain text information and displaying the text information.
17. The audio processing apparatus of claim 10, wherein the apparatus further comprises:
and the audio playing module is used for playing the target audio signal.
18. The audio processing apparatus of claim 10, wherein the apparatus further comprises:
and the audio sending module is used for sending the target audio signal to the appointed terminal equipment.
19. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 9 when executing the computer program.
20. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN201911275028.5A 2019-12-12 2019-12-12 Audio processing method and device, terminal equipment and computer storage medium Withdrawn CN111091845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911275028.5A CN111091845A (en) 2019-12-12 2019-12-12 Audio processing method and device, terminal equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911275028.5A CN111091845A (en) 2019-12-12 2019-12-12 Audio processing method and device, terminal equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN111091845A true CN111091845A (en) 2020-05-01

Family

ID=70396341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911275028.5A Withdrawn CN111091845A (en) 2019-12-12 2019-12-12 Audio processing method and device, terminal equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN111091845A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883091A (en) * 2020-07-09 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 Audio noise reduction method and training method of audio noise reduction model
CN111798543A (en) * 2020-09-10 2020-10-20 北京易真学思教育科技有限公司 Model training method, data processing method, device, equipment and storage medium
CN112614508A (en) * 2020-12-11 2021-04-06 北京华捷艾米科技有限公司 Audio and video combined positioning method and device, electronic equipment and storage medium
CN113093106A (en) * 2021-04-09 2021-07-09 北京华捷艾米科技有限公司 Sound source positioning method and system
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN112951258B (en) * 2021-04-23 2024-05-17 中国科学技术大学 Audio/video voice enhancement processing method and device
CN118098205A (en) * 2024-02-29 2024-05-28 广州市中航服商务管理有限公司 Dialogue type air ticket inquiring system

Similar Documents

Publication Publication Date Title
CN111091576B (en) Image segmentation method, device, equipment and storage medium
CN111091845A (en) Audio processing method and device, terminal equipment and computer storage medium
CN111079576B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN110807361B (en) Human body identification method, device, computer equipment and storage medium
CN110047468B (en) Speech recognition method, apparatus and storage medium
CN111696570B (en) Voice signal processing method, device, equipment and storage medium
CN110570460B (en) Target tracking method, device, computer equipment and computer readable storage medium
CN110600040B (en) Voiceprint feature registration method and device, computer equipment and storage medium
CN112257552B (en) Image processing method, device, equipment and storage medium
WO2022017006A1 (en) Video processing method and apparatus, and terminal device and computer-readable storage medium
CN111445901A (en) Audio data acquisition method and device, electronic equipment and storage medium
CN112967730A (en) Voice signal processing method and device, electronic equipment and storage medium
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
CN112508959B (en) Video object segmentation method and device, electronic equipment and storage medium
CN115129932A (en) Video clip determination method, device, equipment and storage medium
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN112233688B (en) Audio noise reduction method, device, equipment and medium
CN115168643B (en) Audio processing method, device, equipment and computer readable storage medium
CN115206305B (en) Semantic text generation method and device, electronic equipment and storage medium
CN110232417B (en) Image recognition method and device, computer equipment and computer readable storage medium
CN111797873A (en) Scene recognition method and device, storage medium and electronic equipment
CN113301444B (en) Video processing method and device, electronic equipment and storage medium
CN114996515A (en) Training method of video feature extraction model, text generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200501