CN116312570A - Voice noise reduction method, device, equipment and medium based on voiceprint recognition - Google Patents

Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Info

Publication number
CN116312570A
CN116312570A (Application CN202310267948.2A)
Authority
CN
China
Prior art keywords
audio
personnel
voice
scene
noise reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310267948.2A
Other languages
Chinese (zh)
Inventor
尹青山
冯落落
李沛
黄洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202310267948.2A
Publication of CN116312570A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 - Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The application discloses a voice noise reduction method, device, equipment and medium based on voiceprint recognition. The method comprises: acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio. By performing voice separation on the scene audio and matching the designated person's audio among the person audio of the individual speakers, the audio corresponding to the designated person can be obtained, so that in a multi-speaker conversation scene all audio other than the target speaker's can be treated as noise and removed while the target speaker's voice is retained.

Description

Voice noise reduction method, device, equipment and medium based on voiceprint recognition
Technical Field
The application relates to the field of voice noise reduction, in particular to a voice noise reduction method, device, equipment and medium based on voiceprint recognition.
Background
Voice noise reduction refers to extracting, as far as possible, the voice signal that carries useful information from a contaminated recording or noisy call, thereby reducing interference from background sounds. Speech quality and intelligibility are easily degraded by environmental noise, reverberation and echo; suppressing background noise effectively improves call quality and thus communication efficiency. Through years of research and development by many enterprises and scholars, voice noise reduction algorithms have achieved substantial results: they can remove noise and reverberation from audio while retaining the original voice to the greatest extent.
However, voice noise reduction algorithms remain vulnerable to interference from other human voices in the ambient sound. The selective-attention ability of the human auditory system allows a listener to focus on the voice of a target speaker in an environment where many different speakers are talking. Although the human auditory system can separate one speaker's voice from others, existing voice noise reduction algorithms find this difficult.
Disclosure of Invention
In order to solve the above problems, the present application provides a voice noise reduction method, device, equipment and medium based on voiceprint recognition, wherein the method comprises:
acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio.
In an example, performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers specifically includes: acquiring a first data set, and training an initial separation model on the first data set to obtain a separation model; and inputting the scene audio into the separation model to obtain the person audio corresponding to each individual speaker. The initial separation model is composed of a first encoder, a first decoder and a mask network; the first encoder comprises a one-dimensional convolutional network and a rectified linear unit (ReLU), the first decoder comprises a one-dimensional transposed convolution layer, and the mask network comprises layer normalization and a MossFormer module.
In one example, matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person specifically includes: acquiring a second data set, training an initial matching model on the second data set to obtain a matching model, and inputting the person audio and the voiceprint template information into the matching model to obtain the designated-person audio corresponding to the designated person. The initial matching model comprises a time-delay neural network layer, a channel attention mechanism layer, a multi-scale feature fusion layer and an attention-based pooling layer, wherein the channel attention mechanism layer comprises a Squeeze-and-Excitation module and a one-dimensional Res2Net layer.
In one example, performing noise reduction on the designated-person audio to obtain target audio specifically includes: acquiring a third data set, and training an initial noise reduction model on the third data set to obtain a noise reduction model; and inputting the designated-person audio into the noise reduction model to obtain the target audio. The initial noise reduction model includes a second encoder, a second decoder, an attention module and a recurrent module, and the second encoder and the second decoder each contain a plurality of convolutional recurrent modules.
In one example, performing noise reduction on the designated-person audio specifically includes: performing a first noise reduction pass on the designated-person audio to obtain intermediate audio; determining the speech purity of the intermediate audio and judging whether it is below a preset threshold; and if so, performing a second noise reduction pass on the intermediate audio, repeating until the speech purity of the intermediate audio exceeds the preset threshold.
In one example, before performing voice separation on the scene audio, the method further includes preprocessing the scene audio, which comprises: applying framing, pre-emphasis and windowing to the scene audio to obtain intermediate speech frames; performing a fast Fourier transform on the intermediate speech frames to obtain a corresponding first amplitude spectrum and first phase spectrum; taking the absolute value of the result to obtain a second amplitude spectrum; and applying Mel filtering to the second amplitude spectrum to obtain a Mel spectrum, then taking the logarithm of the result to obtain a log-Mel spectrum.
In one example, after obtaining the target audio, the method further comprises: receiving a text conversion request from a user, and outputting the target audio as a text document in a specified format according to the request.
The application also provides a voice noise reduction device based on voiceprint recognition, comprising: an acquisition module, which acquires voiceprint template information of a designated person and scene audio containing the designated person's voice; a separation module, which performs voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; a matching module, which matches the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and a noise reduction module, which performs noise reduction on the designated-person audio to obtain target audio.
The application also provides voice noise reduction equipment based on voiceprint recognition, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to: acquire voiceprint template information of a designated person and scene audio containing the designated person's voice; perform voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; match the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and perform noise reduction on the designated-person audio to obtain target audio.
The present application also provides a non-volatile computer storage medium storing computer-executable instructions configured to: acquire voiceprint template information of a designated person and scene audio containing the designated person's voice; perform voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; match the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and perform noise reduction on the designated-person audio to obtain target audio.
The method provided by the application has the following beneficial effects. The voice separation model is based on a convolution-augmented attention encoder-decoder structure that applies attention separately to local and global features, improving the model's long-range feature interaction and local feature extraction; convolution at local positions further strengthens the model's ability to capture local structure in the data, allowing it to complete voice separation tasks in multi-speaker and noisy environments. The voiceprint recognition model improves on the traditional time-delay neural network: a channel attention module extracts relations between feature channels, multi-scale features are fused to capture information at different depths, and an attention-based pooling layer finally generates a voiceprint with global characteristics. The voice noise reduction model introduces a recurrent network into the convolutional encoder-decoder structure, improving the model's ability to handle long-sequence data, while a feedforward sequential memory network (FSMN) reduces the complexity of the recurrent network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic flow chart of a voice noise reduction method based on voiceprint recognition in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a voice noise reduction method based on voiceprint recognition in an embodiment of the present application;
fig. 3 is a schematic flow chart of preprocessing scene audio in the embodiment of the application;
FIG. 4 is a schematic structural diagram of an initial separation model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an initial matching model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a channel attention mechanism module in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an initial noise reduction model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a voice noise reduction device based on voiceprint recognition in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a voice noise reduction device based on voiceprint recognition in an embodiment of the present application.
Detailed Description
For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of a voice noise reduction method based on voiceprint recognition according to one or more embodiments of the present disclosure. The process may be performed by computing devices in the relevant field, and some input parameters or intermediate results in the process may be adjusted by manual intervention to help improve accuracy.
The execution body of the method according to the embodiments of the present application may be a terminal device or a server, which is not specifically limited in this application. For ease of understanding and description, the following embodiments are described in detail by taking a server as an example.
It should be noted that the server may be a single device, or may be a system formed by a plurality of devices, that is, a distributed server, which is not specifically limited in this application.
As shown in fig. 1 and 2, an embodiment of the present application provides a voice noise reduction method based on voiceprint recognition, including:
s101: and acquiring voiceprint template information of the appointed person and scene audio comprising voice of the appointed person.
First, the voiceprint template information of the designated person and the audio of a conversation scene between the designated person and other speakers are obtained. The voiceprint template information here refers to audio of the designated person speaking alone.
In one embodiment, as shown in fig. 3, after the scene audio is obtained, a data preprocessing operation is performed on it to convert the scene audio into a data format better suited to training a deep learning model. Specifically, framing, pre-emphasis and windowing are applied to the speech signal of the scene audio; the resulting speech frames are passed through a fast Fourier transform to obtain a corresponding first amplitude spectrum and first phase spectrum; the absolute value of the fast Fourier transform result is taken to obtain a second amplitude spectrum; Mel filtering is then applied to the second amplitude spectrum to obtain a Mel spectrum, and the logarithm of the result is taken to obtain a log-Mel spectrum.
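As an illustration, the preprocessing chain above (pre-emphasis, framing, windowing, FFT, magnitude, Mel filtering, logarithm) can be sketched in plain NumPy. All parameter values below (16 kHz sample rate, 25 ms frames with a 10 ms hop, 40 Mel bands) are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

def log_mel_spectrum(signal, sr=16000, frame_len=400, hop=160,
                     n_fft=512, n_mels=40, pre_emph=0.97):
    """Framing, pre-emphasis, windowing, FFT, Mel filtering, logarithm."""
    # Pre-emphasis: boost high frequencies before analysis.
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing: 25 ms frames with a 10 ms hop at 16 kHz.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)          # windowing
    # Fast Fourier transform; the absolute value gives the amplitude spectrum.
    mag = np.abs(np.fft.rfft(frames, n_fft))
    # Triangular Mel filterbank.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        if c > lo:
            fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / (c - lo)
        if hi > c:
            fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / (hi - c)
    # Mel spectrum, then logarithm -> log-Mel spectrum.
    return np.log(mag @ fbank.T + 1e-10)

feat = log_mel_spectrum(np.random.default_rng(0).standard_normal(16000))
```

One second of 16 kHz audio yields 98 frames of 40 log-Mel coefficients each; in practice the patent's models would consume such frame-level features.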
S102: and performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise.
The acquired scene audio is subjected to voice separation, decomposing the conversation scene audio into person audio corresponding to each of several individual speakers. At this stage the person audio still contains scene noise and other speakers' voices that are difficult to separate.
In one embodiment, when performing voice separation, a first data set is first acquired, and an initial separation model is trained on it to obtain a separation model; the scene audio is then input into the separation model to obtain the person audio corresponding to each individual speaker. As shown in fig. 4, the initial separation model is composed of a first encoder-first decoder structure and a masking network. The one-dimensional convolutional network in the first encoder extracts features from the data, and the rectified linear unit (ReLU) in the first encoder introduces non-linearity. The one-dimensional transposed convolution layer in the first decoder outputs the result of the speech separation. Layer normalization in the mask network scales the range of the data to address data non-uniformity and improves the stability and efficiency of the network. The MossFormer module in the mask network handles long-sequence data.
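For intuition, the encoder-mask-decoder layout can be sketched with toy NumPy operations. The random weights, and in particular the random projections standing in for the trained MossFormer blocks, are purely illustrative; a real separation model learns all of them from the first data set:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride):
    # x: (T,) waveform, w: (C, K) filters -> (C, T') latent representation
    K = w.shape[1]
    T = (len(x) - K) // stride + 1
    cols = np.stack([x[i * stride:i * stride + K] for i in range(T)], axis=1)
    return w @ cols

def deconv1d(h, w, stride, out_len):
    # Transposed convolution: (C, T') latent -> (out_len,) waveform
    C, T = h.shape
    K = w.shape[1]
    y = np.zeros(out_len)
    for t in range(T):
        y[t * stride:t * stride + K] += w.T @ h[:, t]
    return y

def layer_norm(h):
    return (h - h.mean(0)) / (h.std(0) + 1e-8)

def separate(mix, n_src=2, C=64, K=16, stride=8):
    w_enc = rng.standard_normal((C, K)) * 0.1
    h = np.maximum(conv1d(mix, w_enc, stride), 0.0)   # encoder: 1-D conv + ReLU
    z = layer_norm(h)                                  # mask network front-end
    # Stand-in for the MossFormer blocks: one sigmoid mask per source.
    masks = [1.0 / (1.0 + np.exp(-(rng.standard_normal((C, C)) @ z)))
             for _ in range(n_src)]
    w_dec = rng.standard_normal((C, K)) * 0.1
    # Decoder: transposed 1-D convolution of each masked latent.
    return [deconv1d(h * m, w_dec, stride, len(mix)) for m in masks]

srcs = separate(rng.standard_normal(4000))  # 4000-sample mixture -> 2 streams
```

The point is the data flow: the encoder lifts the waveform into a latent space, the mask network produces one multiplicative mask per speaker, and the decoder maps each masked latent back to a waveform of the original length.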
S103: and matching the personnel audio and the voiceprint template information to determine the appointed personnel audio corresponding to the appointed personnel.
The person audio of the several individual speakers is matched against the voiceprint template information of the designated person, so that the audio belonging to the designated person can be identified among the separated streams.
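A minimal sketch of this matching step, assuming speaker embeddings (voiceprints) have already been extracted: score each separated stream's embedding against the template and keep the best match. The embedding vectors below are toy stand-ins, and cosine similarity is one common choice of score, not necessarily the one used by the application:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def match_designated_speaker(template_emb, person_embs):
    # Compare the designated person's voiceprint template against the
    # embedding of each separated single-speaker audio stream.
    scores = [cosine_similarity(template_emb, e) for e in person_embs]
    best = int(np.argmax(scores))
    return best, scores[best]

rng = np.random.default_rng(0)
template = rng.standard_normal(192)                  # toy voiceprint embedding
others = [rng.standard_normal(192) for _ in range(3)]
target = template + 0.1 * rng.standard_normal(192)   # designated person's stream
idx, score = match_designated_speaker(template, others + [target])
```

The stream whose embedding is a slightly perturbed copy of the template wins by a wide margin, which is the behaviour the matching model is trained to produce.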
In one embodiment, voice matching requires acquiring a second data set, training an initial matching model on it to obtain a matching model, and inputting the person audio and the voiceprint template information into the matching model to obtain the designated-person audio corresponding to the designated person. As shown in fig. 5, the initial matching model is composed of a traditional time-delay neural network, a channel attention mechanism, a multi-scale feature extraction module, and an attention-based pooling layer. The one-dimensional convolutional network in the time-delay neural network layer combines rectified linear units with batch normalization, improving the stability and efficiency of the network and facilitating parameter propagation. The channel attention mechanism combines a Squeeze-and-Excitation module with a one-dimensional Res2Net layer to build relations between feature channels, enriching the information the model attends to. The model fuses multi-layer, multi-scale features, exploiting information from both the shallow and deep layers of the network. The attention module can model global context, and the attention-based pooling layer generates features based on global information. The structure of the Squeeze-and-Excitation module is shown in FIG. 6.
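The Squeeze-and-Excitation idea behind the channel attention mechanism fits in a few lines: pool each channel over time (squeeze), pass the pooled vector through a small bottleneck (excitation), and use a sigmoid gate to reweight the channels. The shapes and the reduction ratio `r` below are illustrative assumptions:

```python
import numpy as np

def squeeze_and_excitation(features, w1, w2):
    # features: (C, T) frame-level feature map
    s = features.mean(axis=1)                  # squeeze: global average over time
    e = np.maximum(w1 @ s, 0.0)                # excitation: FC bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ e)))     # FC + sigmoid -> channel weights
    return features * gate[:, None]            # rescale each channel

rng = np.random.default_rng(1)
C, T, r = 32, 100, 4                           # channels, frames, reduction ratio
x = rng.standard_normal((C, T))
out = squeeze_and_excitation(x,
                             rng.standard_normal((C // r, C)) * 0.1,
                             rng.standard_normal((C, C // r)) * 0.1)
```

Because the gate is computed from a time-pooled summary, each channel is scaled by a single learned weight per utterance, which is how the module expresses relations between feature channels.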
S104: and carrying out noise reduction processing on the appointed personnel audio to obtain target audio.
Noise reduction is performed on the acquired designated-person audio to remove noise and environmental sounds that do not belong to the speaker, yielding the filtered target audio.
In one embodiment, in the noise reduction process, a third data set is first acquired and an initial noise reduction model is trained on it to obtain a noise reduction model; the designated-person audio is then input into the noise reduction model to obtain the target audio. As shown in fig. 7, the initial noise reduction model is composed of a second encoder, a second decoder and a feedforward sequential memory network. The convolutional recurrent modules in the second encoder extract features from the data while retaining long-sequence information; the convolution module in the second decoder outputs the result. The feedforward sequential memory network discards irrelevant context information, improving the speed of voice noise reduction.
In one embodiment, noise reduction first applies a first noise reduction pass to the designated-person audio to obtain intermediate audio; the speech purity of the intermediate audio is then determined and compared against a preset threshold. If the purity is below the threshold, a second noise reduction pass (secondary noise reduction) is applied to the intermediate audio, repeating until the speech purity of the intermediate audio exceeds the preset threshold.
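The purity-gated loop described here can be sketched as follows. `denoise_pass` and `speech_purity` are hypothetical stand-ins for the trained noise reduction model and the purity estimator; the demonstration models them as a toy system in which each pass halves the residual noise level:

```python
def iterative_denoise(audio, denoise_pass, speech_purity,
                      threshold=0.9, max_passes=5):
    out = denoise_pass(audio)            # first noise-reduction pass
    passes = 1
    # Re-denoise until purity clears the preset threshold (bounded for safety).
    while speech_purity(out) < threshold and passes < max_passes:
        out = denoise_pass(out)          # secondary noise reduction
        passes += 1
    return out

# Toy demonstration: represent the audio by its residual noise level.
halve_noise = lambda noise: noise / 2
purity = lambda noise: 1.0 - noise
clean = iterative_denoise(0.4, halve_noise, purity)
```

Starting from noise level 0.4, two passes bring the level to 0.1 (purity 0.9), at which point the loop stops. The `max_passes` bound is an added safeguard the source does not mention, protecting against audio whose purity never clears the threshold.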
In one embodiment, after the target audio is obtained, if a text conversion request is received from a user, the target audio may be converted into and output as a text document in the format specified by the request.
It should be noted that the initial separation model, initial matching model and initial noise reduction model are mathematical models built on machine learning algorithms, including but not limited to neural network models, support vector machine models and the like. The constructed models are trained in advance on a training data set, and once the set training precision and accuracy are reached, the trained model is considered complete and ready for use in prediction.
As shown in fig. 8, the embodiment of the present application further provides a voice noise reduction device based on voiceprint recognition, including:
the obtaining module 801 obtains voiceprint template information of a specified person and scene audio including voice of the specified person.
The separation module 802 performs voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel, where the personnel audio includes scene noise.
And the matching module 803 is used for determining the designated personnel audio corresponding to the designated personnel by matching the personnel audio and the voiceprint template information.
And the noise reduction module 804 performs noise reduction processing on the appointed personnel audio to obtain target audio.
As shown in fig. 9, the embodiment of the present application further provides a voice noise reduction device based on voiceprint recognition, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio.
The embodiments also provide a non-volatile computer storage medium storing computer executable instructions configured to:
acquiring voiceprint template information of a designated person and scene audio containing the designated person's voice; performing voice separation on the scene audio to obtain person audio corresponding to each of a plurality of individual speakers, wherein the person audio contains scene noise; matching the person audio against the voiceprint template information to determine the designated-person audio corresponding to the designated person; and performing noise reduction on the designated-person audio to obtain target audio.
All embodiments in the application are described in a progressive manner, and identical and similar parts of all embodiments are mutually referred, so that each embodiment mainly describes differences from other embodiments. In particular, for the apparatus and medium embodiments, the description is relatively simple, as it is substantially similar to the method embodiments, with reference to the section of the method embodiments being relevant.
The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not described in detail herein.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (10)

1. A voice noise reduction method based on voiceprint recognition, comprising:
acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
matching the personnel audio and the voiceprint template information to determine designated personnel audio corresponding to the designated person;
and performing noise reduction processing on the designated personnel audio to obtain target audio.
2. The method according to claim 1, wherein the performing voice separation on the scene audio to obtain the personnel audio corresponding to a plurality of single personnel respectively specifically comprises:
acquiring a first data set, and training an initial separation model through the first data set to obtain a separation model;
inputting the scene audio into the separation model to obtain personnel audio corresponding to a plurality of single personnel respectively;
the initial separation model is composed of a first encoder, a first decoder and a mask network; the first encoder comprises a one-dimensional convolution network and a rectification linear unit, and the first decoder comprises a one-dimensional transposition convolution layer; the masking network includes a layer normalization and a MossFormer module.
3. The method according to claim 1, wherein the determining the designated personnel audio corresponding to the designated person by matching the personnel audio and the voiceprint template information specifically comprises:
acquiring a second data set, training an initial matching model through the second data set to obtain a matching model, and inputting the personnel audio and the voiceprint template information into the matching model to obtain the designated personnel audio corresponding to the designated person;
the initial matching model comprises a time-delay neural network layer, a channel attention mechanism layer, a multi-scale feature fusion layer and an attention-based pooling layer, wherein the channel attention mechanism layer comprises a Squeeze-and-Excitation module and a one-dimensional Res2Net layer.
4. The method according to claim 1, wherein the performing noise reduction processing on the designated personnel audio to obtain a target audio specifically comprises:
acquiring a third data set, and training an initial noise reduction model through the third data set to obtain a noise reduction model;
inputting the designated personnel audio into the noise reduction model to obtain the target audio;
the initial noise reduction model comprises a second encoder, a second decoder, an attention module and a recurrent module, the second encoder and the second decoder each comprising a plurality of convolution recurrent modules.
5. The method according to claim 1, wherein the performing noise reduction processing on the designated personnel audio specifically comprises:
performing first noise reduction processing on the designated personnel audio to obtain intermediate audio;
and determining the voice purity of the intermediate audio, judging whether the voice purity is lower than a preset threshold, and if so, performing second noise reduction processing on the intermediate audio until the voice purity of the intermediate audio is higher than the preset threshold.
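The loop of claim 5 can be sketched as follows; `denoise` and `purity` are hypothetical stand-ins for the trained noise reduction model and the voice-purity estimator, neither of which the claim specifies:

```python
def iterative_denoise(audio, denoise, purity, threshold=0.9, max_passes=5):
    """Apply a first noise-reduction pass, then keep re-denoising while
    the estimated speech purity stays below the preset threshold.
    `max_passes` is an added safety bound, not part of the claim."""
    audio = denoise(audio)                       # first noise reduction
    passes = 1
    while purity(audio) < threshold and passes < max_passes:
        audio = denoise(audio)                   # second (and further) passes
        passes += 1
    return audio, passes
```

Here `audio` can be any representation the two callables agree on; in a real system it would be a waveform or spectrogram.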
6. The method of claim 1, wherein prior to the speech separation of the scene audio, the method further comprises:
preprocessing the scene audio, wherein the preprocessing process comprises the following steps:
performing framing, pre-emphasis and windowing on the scene audio to obtain intermediate voice frames;
performing a fast Fourier transform on the intermediate voice frames to obtain a corresponding first amplitude spectrum and a first phase spectrum;
taking absolute values of the first amplitude spectrum and the first phase spectrum to obtain a second amplitude spectrum;
and carrying out Mel filtering transformation on the second amplitude spectrum to obtain a Mel spectrum, and taking the logarithm of the result to obtain a logarithmic Mel spectrum.
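A sketch of the claim-6 preprocessing chain in NumPy (pre-emphasis, framing, Hamming windowing, FFT, magnitude, mel filtering, log). The frame length, hop size and the simplified triangular mel filterbank are illustrative choices, not values from the patent:

```python
import numpy as np

def log_mel(audio, sr=16000, frame_len=400, hop=160, n_mels=8, n_fft=512):
    """Compute a logarithmic Mel spectrum from a waveform."""
    audio = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])   # pre-emphasis
    frames = np.stack([audio[i:i + frame_len]                    # framing
                       for i in range(0, len(audio) - frame_len + 1, hop)])
    frames = frames * np.hamming(frame_len)                      # windowing
    mag = np.abs(np.fft.rfft(frames, n_fft))                     # amplitude spectrum
    # simplified triangular mel filterbank
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    mel_pts = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_mels + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel = mag @ fbank.T                                          # Mel filtering
    return np.log(mel + 1e-10)                                   # logarithmic Mel spectrum
```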
7. The method of claim 1, wherein after the obtaining the target audio, the method further comprises:
and receiving a text conversion request from a user, and outputting the target audio as a text document in a specified format according to the text conversion request.
8. A voice noise reduction device based on voiceprint recognition, comprising:
the voiceprint module is used for acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
the separation module is used for carrying out voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
the matching module is used for determining the designated personnel audio corresponding to the designated person by matching the personnel audio and the voiceprint template information;
and the noise reduction module is used for performing noise reduction processing on the designated personnel audio to obtain target audio.
9. A voice noise reduction device based on voiceprint recognition, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform:
acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
matching the personnel audio and the voiceprint template information to determine designated personnel audio corresponding to the designated person;
and performing noise reduction processing on the designated personnel audio to obtain target audio.
10. A non-transitory computer storage medium storing computer-executable instructions, the computer-executable instructions configured to:
acquiring voiceprint template information of a designated person and scene audio comprising voices of the designated person;
performing voice separation on the scene audio to obtain personnel audio corresponding to a plurality of single personnel respectively, wherein the personnel audio comprises scene noise;
matching the personnel audio and the voiceprint template information to determine designated personnel audio corresponding to the designated person;
and performing noise reduction processing on the designated personnel audio to obtain target audio.
CN202310267948.2A 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition Pending CN116312570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310267948.2A CN116312570A (en) 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310267948.2A CN116312570A (en) 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Publications (1)

Publication Number Publication Date
CN116312570A (en) 2023-06-23

Family

ID=86833911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310267948.2A Pending CN116312570A (en) 2023-03-15 2023-03-15 Voice noise reduction method, device, equipment and medium based on voiceprint recognition

Country Status (1)

Country Link
CN (1) CN116312570A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095674A (en) * 2023-08-25 2023-11-21 广东福临门世家智能家居有限公司 Interactive control method and system for intelligent doors and windows
CN117095674B (en) * 2023-08-25 2024-03-26 广东福临门世家智能家居有限公司 Interactive control method and system for intelligent doors and windows

Similar Documents

Publication Publication Date Title
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110379412B (en) Voice processing method and device, electronic equipment and computer readable storage medium
Tan et al. Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios
Hummersone et al. On the ideal ratio mask as the goal of computational auditory scene analysis
CN114203163A (en) Audio signal processing method and device
CN112102846B (en) Audio processing method and device, electronic equipment and storage medium
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN116312570A (en) Voice noise reduction method, device, equipment and medium based on voiceprint recognition
CN112614504A (en) Single sound channel voice noise reduction method, system, equipment and readable storage medium
CN112289334B (en) Reverberation elimination method and device
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
CN113409756B (en) Speech synthesis method, system, device and storage medium
CN114333874A (en) Method for processing audio signal
CN109801643B (en) Processing method and device for reverberation suppression
EP4350695A1 (en) Apparatus, methods and computer programs for audio signal enhancement using a dataset
Fahad hossain et al. A continuous word segmentation of Bengali noisy speech
US20240079022A1 (en) General speech enhancement method and apparatus using multi-source auxiliary information
Martín Doñas Online multichannel speech enhancement combining statistical signal processing and deep neural networks
WO2024018429A1 (en) Audio signal processing method, audio signal processing apparatus, computer device and storage medium
US20230326475A1 (en) Apparatus, Methods and Computer Programs for Noise Suppression
CN112216303A (en) Voice processing method and device and electronic equipment
KR20210010133A (en) Speech recognition method, learning method for speech recognition and apparatus thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination