CN116312570A - Voice noise reduction method, device, equipment and medium based on voiceprint recognition - Google Patents

Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Info

Publication number
CN116312570A
CN116312570A (Application CN202310267948.2A)
Authority
CN
China
Prior art keywords
audio
personnel
voice
scene
noise reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310267948.2A
Other languages
Chinese (zh)
Inventor
尹青山
冯落落
李沛
黄洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202310267948.2A
Publication of CN116312570A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 - Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The application discloses a voice noise reduction method, device, equipment and medium based on voiceprint recognition. The method comprises: acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio. By performing voice separation on the scene audio and matching the designated person's audio among the person audio of the individual speakers, the audio corresponding to the designated person can be obtained, so that in a multi-speaker conversation scene all audio other than the target speaker's can be treated as noise and removed while the target speaker's voice is retained.

Description

Voice noise reduction method, device, equipment and medium based on voiceprint recognition
Technical Field
The application relates to the field of voice noise reduction, in particular to a voice noise reduction method, device, equipment and medium based on voiceprint recognition.
Background
Voice noise reduction refers to extracting, as far as possible, the voice signal that carries useful information from a contaminated recording or noisy call, thereby reducing interference from background sounds. Speech quality and intelligibility are easily degraded by environmental noise, reverberation and echo; suppressing background noise effectively improves call quality and thus communication efficiency. Through years of research and development by many enterprises and scholars, voice noise reduction algorithms have achieved substantial results: they can remove noise and reverberation from audio while retaining the original voice to the greatest extent.
However, voice noise reduction algorithms remain vulnerable to interference from other human voices in the ambient sound. The selective-attention ability of the human auditory system allows a listener to focus on the voice of a target speaker in an environment where many different speakers are talking. Although the human auditory system can separate one speaker's voice from others, existing voice noise reduction algorithms find this difficult.
Disclosure of Invention
In order to solve the above problems, the present application provides a voice noise reduction method, device, equipment and medium based on voiceprint recognition, wherein the method comprises:
acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio.
In an example, performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers specifically includes: acquiring a first data set, and training an initial separation model on the first data set to obtain a separation model; and inputting the scene audio into the separation model to obtain the person audio corresponding to each individual speaker. The initial separation model is composed of a first encoder, a first decoder and a mask network; the first encoder comprises a one-dimensional convolutional network and a rectified linear unit (ReLU), the first decoder comprises a one-dimensional transposed convolution layer, and the mask network comprises layer normalization and a MossFormer module.
In one example, matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person specifically includes: acquiring a second data set, training an initial matching model on the second data set to obtain a matching model, and inputting the person audio and the voiceprint template information into the matching model to obtain the designated-person audio corresponding to the designated person. The initial matching model comprises a time-delay neural network layer, a channel attention mechanism layer, a multi-scale feature fusion layer and an attention-based pooling layer, wherein the channel attention mechanism layer comprises a Squeeze-and-Excitation module and a one-dimensional Res2Net layer.
In one example, performing noise reduction on the designated-person audio to obtain target audio specifically includes: acquiring a third data set, and training an initial noise reduction model on the third data set to obtain a noise reduction model; and inputting the designated-person audio into the noise reduction model to obtain the target audio. The initial noise reduction model includes a second encoder, a second decoder, an attention module and a recurrent module, and the second encoder and the second decoder each contain a plurality of convolutional recurrent modules.
In one example, performing noise reduction on the designated-person audio specifically includes: performing a first noise reduction pass on the designated-person audio to obtain intermediate audio; determining the speech purity of the intermediate audio and judging whether it is below a preset threshold; and if so, performing a second noise reduction pass on the intermediate audio, repeating until the speech purity of the intermediate audio exceeds the preset threshold.
In one example, before performing voice separation on the scene audio, the method further includes preprocessing the scene audio, which comprises: applying framing, pre-emphasis and windowing to the scene audio to obtain intermediate speech frames; performing a fast Fourier transform on the intermediate speech frames to obtain a corresponding first amplitude spectrum and first phase spectrum; taking the absolute value of the result to obtain a second amplitude spectrum; and applying Mel filtering to the second amplitude spectrum to obtain a Mel spectrum, then taking the logarithm of the result to obtain a log-Mel spectrum.
In one example, after obtaining the target audio, the method further comprises: receiving a text conversion request from a user, and outputting the target audio as a text document in a specified format according to the request.
The application also provides a voice noise reduction device based on voiceprint recognition, comprising: an acquisition module, which acquires voiceprint template information of a designated person and scene audio containing the designated person's voice; a separation module, which performs voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; a matching module, which matches the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and a noise reduction module, which performs noise reduction on the designated-person audio to obtain target audio.
The application also provides voice noise reduction equipment based on voiceprint recognition, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to: acquire voiceprint template information of a designated person and scene audio containing the designated person's voice; perform voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; match the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and perform noise reduction on the designated-person audio to obtain target audio.
The present application also provides a non-volatile computer storage medium storing computer-executable instructions configured to: acquire voiceprint template information of a designated person and scene audio containing the designated person's voice; perform voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; match the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and perform noise reduction on the designated-person audio to obtain target audio.
The method provided by the application has the following beneficial effects. The voice separation model is based on a convolution-augmented attention encoder-decoder structure that applies attention separately to local and global features, improving the model's long-range feature interaction and local feature extraction; convolution at local positions further strengthens the model's ability to capture local structure in the data, allowing it to complete voice separation tasks in multi-speaker and noisy environments. The voiceprint recognition model improves on the traditional time-delay neural network: a channel attention module extracts relations between feature channels, multi-scale features are fused to capture information at different depths, and an attention-based pooling layer finally generates a voiceprint with global characteristics. The voice noise reduction model introduces a recurrent network into the convolutional encoder-decoder structure, improving the model's ability to handle long-sequence data, while a feedforward sequential memory network (FSMN) reduces the complexity of the recurrent network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic flow chart of a voice noise reduction method based on voiceprint recognition in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a voice noise reduction method based on voiceprint recognition in an embodiment of the present application;
fig. 3 is a schematic flow chart of preprocessing scene audio in the embodiment of the application;
FIG. 4 is a schematic structural diagram of an initial separation model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an initial matching model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a channel attention mechanism module in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an initial noise reduction model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a voice noise reduction device based on voiceprint recognition in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a voice noise reduction device based on voiceprint recognition in an embodiment of the present application.
Detailed Description
For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of a voice noise reduction method based on voiceprint recognition according to one or more embodiments of the present disclosure. The process may be performed by computing devices in the relevant field, and some input parameters or intermediate results in the process may be adjusted by manual intervention to help improve accuracy.
The execution body of the method according to the embodiments of the present application may be a terminal device or a server, which is not specifically limited in this application. For ease of understanding and description, the following embodiments are described in detail by taking a server as an example.
It should be noted that the server may be a single device, or may be a system formed by a plurality of devices, that is, a distributed server, which is not specifically limited in this application.
As shown in fig. 1 and 2, an embodiment of the present application provides a voice noise reduction method based on voiceprint recognition, including:
s101: and acquiring voiceprint template information of the appointed person and scene audio comprising voice of the appointed person.
First, the voiceprint template information of the designated person and the audio of a conversation scene between the designated person and other speakers are obtained. The voiceprint template information here refers to audio of the designated person speaking alone.
In one embodiment, as shown in fig. 3, after the scene audio is obtained, a data preprocessing operation is performed on it to convert the scene audio into a data format better suited to training a deep learning model. Specifically, framing, pre-emphasis and windowing are applied to the speech signal of the scene audio; the resulting speech frames are passed through a fast Fourier transform to obtain a corresponding first amplitude spectrum and first phase spectrum; the absolute value of the fast Fourier transform result is taken to obtain a second amplitude spectrum; Mel filtering is then applied to the second amplitude spectrum to obtain a Mel spectrum, and the logarithm of the result is taken to obtain a log-Mel spectrum.
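As an illustration, the preprocessing chain above (pre-emphasis, framing, windowing, FFT, magnitude, Mel filtering, logarithm) can be sketched in plain NumPy. All parameter values below (16 kHz sample rate, 25 ms frames with a 10 ms hop, 40 Mel bands) are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

def log_mel_spectrum(signal, sr=16000, frame_len=400, hop=160,
                     n_fft=512, n_mels=40, pre_emph=0.97):
    """Framing, pre-emphasis, windowing, FFT, Mel filtering, logarithm."""
    # Pre-emphasis: boost high frequencies before analysis.
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing: 25 ms frames with a 10 ms hop at 16 kHz.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)          # windowing
    # Fast Fourier transform; the absolute value gives the amplitude spectrum.
    mag = np.abs(np.fft.rfft(frames, n_fft))
    # Triangular Mel filterbank.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        if c > lo:
            fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / (c - lo)
        if hi > c:
            fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / (hi - c)
    # Mel spectrum, then logarithm -> log-Mel spectrum.
    return np.log(mag @ fbank.T + 1e-10)

feat = log_mel_spectrum(np.random.default_rng(0).standard_normal(16000))
```

One second of 16 kHz audio yields 98 frames of 40 log-Mel coefficients each; in practice the patent's models would consume such frame-level features.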
S102: and performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise.
The acquired scene audio is subjected to voice separation, decomposing the conversation scene audio into person audio corresponding to each of several individual speakers. At this stage the person audio still contains scene noise and other speakers' voices that are difficult to separate.
In one embodiment, when performing voice separation, a first data set is first acquired, and an initial separation model is trained on it to obtain a separation model; the scene audio is then input into the separation model to obtain the person audio corresponding to each individual speaker. As shown in fig. 4, the initial separation model is composed of a first encoder-first decoder structure and a masking network. The one-dimensional convolutional network in the first encoder extracts features from the data, and the rectified linear unit (ReLU) in the first encoder introduces non-linearity. The one-dimensional transposed convolution layer in the first decoder outputs the result of the speech separation. Layer normalization in the mask network scales the range of the data to address data non-uniformity and improves the stability and efficiency of the network. The MossFormer module in the mask network handles long-sequence data.
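For intuition, the encoder-mask-decoder layout can be sketched with toy NumPy operations. The random weights, and in particular the random projections standing in for the trained MossFormer blocks, are purely illustrative; a real separation model learns all of them from the first data set:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride):
    # x: (T,) waveform, w: (C, K) filters -> (C, T') latent representation
    K = w.shape[1]
    T = (len(x) - K) // stride + 1
    cols = np.stack([x[i * stride:i * stride + K] for i in range(T)], axis=1)
    return w @ cols

def deconv1d(h, w, stride, out_len):
    # Transposed convolution: (C, T') latent -> (out_len,) waveform
    C, T = h.shape
    K = w.shape[1]
    y = np.zeros(out_len)
    for t in range(T):
        y[t * stride:t * stride + K] += w.T @ h[:, t]
    return y

def layer_norm(h):
    return (h - h.mean(0)) / (h.std(0) + 1e-8)

def separate(mix, n_src=2, C=64, K=16, stride=8):
    w_enc = rng.standard_normal((C, K)) * 0.1
    h = np.maximum(conv1d(mix, w_enc, stride), 0.0)   # encoder: 1-D conv + ReLU
    z = layer_norm(h)                                  # mask network front-end
    # Stand-in for the MossFormer blocks: one sigmoid mask per source.
    masks = [1.0 / (1.0 + np.exp(-(rng.standard_normal((C, C)) @ z)))
             for _ in range(n_src)]
    w_dec = rng.standard_normal((C, K)) * 0.1
    # Decoder: transposed 1-D convolution of each masked latent.
    return [deconv1d(h * m, w_dec, stride, len(mix)) for m in masks]

srcs = separate(rng.standard_normal(4000))  # 4000-sample mixture -> 2 streams
```

The point is the data flow: the encoder lifts the waveform into a latent space, the mask network produces one multiplicative mask per speaker, and the decoder maps each masked latent back to a waveform of the original length.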
S103: and matching the personnel audio and the voiceprint template information to determine the appointed personnel audio corresponding to the appointed personnel.
The person audio of the several individual speakers is matched against the voiceprint template information of the designated person, so that the audio belonging to the designated person can be identified among the separated streams.
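A minimal sketch of this matching step, assuming speaker embeddings (voiceprints) have already been extracted: score each separated stream's embedding against the template and keep the best match. The embedding vectors below are toy stand-ins, and cosine similarity is one common choice of score, not necessarily the one used by the application:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def match_designated_speaker(template_emb, person_embs):
    # Compare the designated person's voiceprint template against the
    # embedding of each separated single-speaker audio stream.
    scores = [cosine_similarity(template_emb, e) for e in person_embs]
    best = int(np.argmax(scores))
    return best, scores[best]

rng = np.random.default_rng(0)
template = rng.standard_normal(192)                  # toy voiceprint embedding
others = [rng.standard_normal(192) for _ in range(3)]
target = template + 0.1 * rng.standard_normal(192)   # designated person's stream
idx, score = match_designated_speaker(template, others + [target])
```

The stream whose embedding is a slightly perturbed copy of the template wins by a wide margin, which is the behaviour the matching model is trained to produce.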
In one embodiment, voice matching requires acquiring a second data set, training an initial matching model on it to obtain a matching model, and inputting the person audio and the voiceprint template information into the matching model to obtain the designated-person audio corresponding to the designated person. As shown in fig. 5, the initial matching model is composed of a traditional time-delay neural network, a channel attention mechanism, a multi-scale feature extraction module, and an attention-based pooling layer. The one-dimensional convolutional network in the time-delay neural network layer combines rectified linear units with batch normalization, improving the stability and efficiency of the network and facilitating parameter propagation. The channel attention mechanism combines a Squeeze-and-Excitation module with a one-dimensional Res2Net layer to build relations between feature channels, enriching the information the model attends to. The model fuses multi-layer, multi-scale features, exploiting information from both the shallow and deep layers of the network. The attention module can model global context, and the attention-based pooling layer generates features based on global information. The structure of the Squeeze-and-Excitation module is shown in FIG. 6.
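The Squeeze-and-Excitation idea behind the channel attention mechanism fits in a few lines: pool each channel over time (squeeze), pass the pooled vector through a small bottleneck (excitation), and use a sigmoid gate to reweight the channels. The shapes and the reduction ratio `r` below are illustrative assumptions:

```python
import numpy as np

def squeeze_and_excitation(features, w1, w2):
    # features: (C, T) frame-level feature map
    s = features.mean(axis=1)                  # squeeze: global average over time
    e = np.maximum(w1 @ s, 0.0)                # excitation: FC bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ e)))     # FC + sigmoid -> channel weights
    return features * gate[:, None]            # rescale each channel

rng = np.random.default_rng(1)
C, T, r = 32, 100, 4                           # channels, frames, reduction ratio
x = rng.standard_normal((C, T))
out = squeeze_and_excitation(x,
                             rng.standard_normal((C // r, C)) * 0.1,
                             rng.standard_normal((C, C // r)) * 0.1)
```

Because the gate is computed from a time-pooled summary, each channel is scaled by a single learned weight per utterance, which is how the module expresses relations between feature channels.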
S104: and carrying out noise reduction processing on the appointed personnel audio to obtain target audio.
Noise reduction is performed on the acquired designated-person audio to remove noise and environmental sounds that do not belong to the speaker, yielding the filtered target audio.
In one embodiment, in the noise reduction process, a third data set is first acquired and an initial noise reduction model is trained on it to obtain a noise reduction model; the designated-person audio is then input into the noise reduction model to obtain the target audio. As shown in fig. 7, the initial noise reduction model is composed of a second encoder, a second decoder and a feedforward sequential memory network. The convolutional recurrent modules in the second encoder extract features from the data while retaining long-sequence information; the convolution module in the second decoder outputs the result. The feedforward sequential memory network discards irrelevant context information, improving the speed of voice noise reduction.
In one embodiment, noise reduction first applies a first noise reduction pass to the designated-person audio to obtain intermediate audio; the speech purity of the intermediate audio is then determined and compared against a preset threshold. If the purity is below the threshold, a second noise reduction pass (secondary noise reduction) is applied to the intermediate audio, repeating until the speech purity of the intermediate audio exceeds the preset threshold.
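The purity-gated loop described here can be sketched as follows. `denoise_pass` and `speech_purity` are hypothetical stand-ins for the trained noise reduction model and the purity estimator; the demonstration models them as a toy system in which each pass halves the residual noise level:

```python
def iterative_denoise(audio, denoise_pass, speech_purity,
                      threshold=0.9, max_passes=5):
    out = denoise_pass(audio)            # first noise-reduction pass
    passes = 1
    # Re-denoise until purity clears the preset threshold (bounded for safety).
    while speech_purity(out) < threshold and passes < max_passes:
        out = denoise_pass(out)          # secondary noise reduction
        passes += 1
    return out

# Toy demonstration: represent the audio by its residual noise level.
halve_noise = lambda noise: noise / 2
purity = lambda noise: 1.0 - noise
clean = iterative_denoise(0.4, halve_noise, purity)
```

Starting from noise level 0.4, two passes bring the level to 0.1 (purity 0.9), at which point the loop stops. The `max_passes` bound is an added safeguard the source does not mention, protecting against audio whose purity never clears the threshold.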
In one embodiment, after the target audio is obtained, if a text conversion request is received from a user, the target audio may be converted into and output as a text document in the format specified by the request.
It should be noted that the initial separation model, initial matching model and initial noise reduction model are mathematical models built on machine learning algorithms, including but not limited to neural network models, support vector machine models and the like. The constructed models are trained in advance on a training data set, and once the set training precision and accuracy are reached, the trained model is considered complete and ready for use in prediction.
As shown in fig. 8, the embodiment of the present application further provides a voice noise reduction device based on voiceprint recognition, including:
the obtaining module 801 obtains voiceprint template information of a specified person and scene audio including voice of the specified person.
The separation module 802 performs voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel, where the personnel audio includes scene noise.
And the matching module 803 is used for determining the designated personnel audio corresponding to the designated personnel by matching the personnel audio and the voiceprint template information.
And the noise reduction module 804 performs noise reduction processing on the appointed personnel audio to obtain target audio.
As shown in fig. 9, the embodiment of the present application further provides a voice noise reduction device based on voiceprint recognition, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio.
The embodiments also provide a non-volatile computer storage medium storing computer executable instructions configured to:
acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio.
All embodiments in the application are described in a progressive manner, and identical and similar parts of all embodiments are mutually referred, so that each embodiment mainly describes differences from other embodiments. In particular, for the apparatus and medium embodiments, the description is relatively simple, as it is substantially similar to the method embodiments, with reference to the section of the method embodiments being relevant.
The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not described in detail herein.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (10)

1. A voice noise reduction method based on voiceprint recognition, comprising:
acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
matching the personnel audio and the voiceprint template information to determine designated personnel audio corresponding to the designated person;
and performing noise reduction processing on the designated personnel audio to obtain target audio.
2. The method according to claim 1, wherein the performing voice separation on the scene audio to obtain the personnel audio corresponding to a plurality of single personnel respectively specifically comprises:
acquiring a first data set, and training an initial separation model through the first data set to obtain a separation model;
inputting the scene audio into the separation model to obtain personnel audio corresponding to a plurality of single personnel respectively;
the initial separation model is composed of a first encoder, a first decoder and a mask network; the first encoder comprises a one-dimensional convolution network and a rectification linear unit, and the first decoder comprises a one-dimensional transposition convolution layer; the masking network includes a layer normalization and a MossFormer module.
3. The method according to claim 1, wherein the determining the designated personnel audio corresponding to the designated person by matching the personnel audio and the voiceprint template information specifically comprises:
acquiring a second data set, training an initial matching model through the second data set to obtain a matching model, and inputting the personnel audio and the voiceprint template information into the matching model to obtain the designated personnel audio corresponding to the designated person;
the initial matching model comprises a time-delay neural network layer, a channel attention mechanism layer, a multi-scale feature fusion layer and an attention-based pooling layer, wherein the channel attention mechanism layer comprises a Squeeze-and-Excitation module and a one-dimensional Res2Net layer.
4. The method according to claim 1, wherein the performing noise reduction processing on the designated personnel audio to obtain a target audio specifically comprises:
acquiring a third data set, and training an initial noise reduction model through the third data set to obtain a noise reduction model;
inputting the designated personnel audio into the noise reduction model to obtain the target audio;
the initial noise reduction model comprises a second encoder, a second decoder, an attention module and a recurrent module, the second encoder and the second decoder each comprising a plurality of convolution recurrent modules.
5. The method according to claim 1, wherein the performing noise reduction processing on the designated personnel audio specifically comprises:
performing first noise reduction processing on the designated personnel audio to obtain intermediate audio;
and determining the voice purity of the intermediate audio, judging whether the voice purity is lower than a preset threshold, and if so, performing second noise reduction processing on the intermediate audio until the voice purity of the intermediate audio is higher than the preset threshold.
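The loop of claim 5 can be sketched as follows; `denoise` and `purity` are hypothetical stand-ins for the trained noise reduction model and the voice-purity estimator, neither of which the claim specifies:

```python
def iterative_denoise(audio, denoise, purity, threshold=0.9, max_passes=5):
    """Apply a first noise-reduction pass, then keep re-denoising while
    the estimated speech purity stays below the preset threshold.
    `max_passes` is an added safety bound, not part of the claim."""
    audio = denoise(audio)                       # first noise reduction
    passes = 1
    while purity(audio) < threshold and passes < max_passes:
        audio = denoise(audio)                   # second (and further) passes
        passes += 1
    return audio, passes
```

Here `audio` can be any representation the two callables agree on; in a real system it would be a waveform or spectrogram.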
6. The method of claim 1, wherein prior to the speech separation of the scene audio, the method further comprises:
preprocessing the scene audio, wherein the preprocessing process comprises the following steps:
performing framing, pre-emphasis and windowing on the scene audio to obtain intermediate voice frames;
performing a fast Fourier transform on the intermediate voice frames to obtain a corresponding first amplitude spectrum and a first phase spectrum;
taking absolute values of the first amplitude spectrum and the first phase spectrum to obtain a second amplitude spectrum;
and carrying out Mel filtering transformation on the second amplitude spectrum to obtain a Mel spectrum, and taking the logarithm of the result to obtain a logarithmic Mel spectrum.
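A sketch of the claim-6 preprocessing chain in NumPy (pre-emphasis, framing, Hamming windowing, FFT, magnitude, mel filtering, log). The frame length, hop size and the simplified triangular mel filterbank are illustrative choices, not values from the patent:

```python
import numpy as np

def log_mel(audio, sr=16000, frame_len=400, hop=160, n_mels=8, n_fft=512):
    """Compute a logarithmic Mel spectrum from a waveform."""
    audio = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])   # pre-emphasis
    frames = np.stack([audio[i:i + frame_len]                    # framing
                       for i in range(0, len(audio) - frame_len + 1, hop)])
    frames = frames * np.hamming(frame_len)                      # windowing
    mag = np.abs(np.fft.rfft(frames, n_fft))                     # amplitude spectrum
    # simplified triangular mel filterbank
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    mel_pts = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_mels + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel = mag @ fbank.T                                          # Mel filtering
    return np.log(mel + 1e-10)                                   # logarithmic Mel spectrum
```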
7. The method of claim 1, wherein after the obtaining the target audio, the method further comprises:
and receiving a text conversion request from a user, and outputting the target audio as a text document in a specified format according to the text conversion request.
8. A voice noise reduction device based on voiceprint recognition, comprising:
the voiceprint module is used for acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
the separation module is used for carrying out voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
the matching module is used for determining the designated personnel audio corresponding to the designated person by matching the personnel audio and the voiceprint template information;
and the noise reduction module is used for performing noise reduction processing on the designated personnel audio to obtain target audio.
9. A voice noise reduction device based on voiceprint recognition, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform:
acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
matching the personnel audio and the voiceprint template information to determine designated personnel audio corresponding to the designated person;
and performing noise reduction processing on the designated personnel audio to obtain target audio.
10. A non-transitory computer storage medium storing computer-executable instructions, the computer-executable instructions configured to:
acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
matching the personnel audio and the voiceprint template information to determine designated personnel audio corresponding to the designated person;
and performing noise reduction processing on the designated personnel audio to obtain target audio.
CN202310267948.2A 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition Pending CN116312570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310267948.2A CN116312570A (en) 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310267948.2A CN116312570A (en) 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Publications (1)

Publication Number Publication Date
CN116312570A (en) 2023-06-23

Family

ID=86833911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310267948.2A Pending CN116312570A (en) 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Country Status (1)

Country Link
CN (1) CN116312570A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095674A (en) * 2023-08-25 2023-11-21 广东福临门世家智能家居有限公司 Interactive control method and system for intelligent doors and windows
CN117095674B (en) * 2023-08-25 2024-03-26 广东福临门世家智能家居有限公司 Interactive control method and system for intelligent doors and windows

Similar Documents

Publication Publication Date Title
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110379412B (en) Voice processing method and device, electronic equipment and computer readable storage medium
Tan et al. Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios
Hummersone et al. On the ideal ratio mask as the goal of computational auditory scene analysis
CN114203163A (en) Audio signal processing method and device
CN112102846B (en) Audio processing method and device, electronic equipment and storage medium
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN116312570A (en) Voice noise reduction method, device, equipment and medium based on voiceprint recognition
CN112614504A (en) Single sound channel voice noise reduction method, system, equipment and readable storage medium
CN112289334B (en) Reverberation elimination method and device
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
CN113409756B (en) Speech synthesis method, system, device and storage medium
CN114333874A (en) Method for processing audio signal
CN109801643B (en) Processing method and device for reverberation suppression
EP4350695A1 (en) Apparatus, methods and computer programs for audio signal enhancement using a dataset
Fahad hossain et al. A continuous word segmentation of Bengali noisy speech
US20240079022A1 (en) General speech enhancement method and apparatus using multi-source auxiliary information
Martín Doñas Online multichannel speech enhancement combining statistical signal processing and deep neural networks
WO2024018429A1 (en) Audio signal processing method, audio signal processing apparatus, computer device and storage medium
US20230326475A1 (en) Apparatus, Methods and Computer Programs for Noise Suppression
CN112216303A (en) Voice processing method and device and electronic equipment
KR20210010133A (en) Speech recognition method, learning method for speech recognition and apparatus thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination