CN115331689A - Training method, device, equipment, storage medium and product of voice noise reduction model - Google Patents


Info

Publication number
CN115331689A
CN115331689A
Authority
CN
China
Prior art keywords: sample data, noise, target, voice, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210964055.9A
Other languages
Chinese (zh)
Inventor
李良斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202210964055.9A
Publication of CN115331689A
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2021/02082: Noise filtering the noise being echo, reverberation of the speech

Abstract

The application provides a training method, apparatus, device, storage medium, and product for a voice noise reduction model, belonging to the technical field of audio signal processing. The method comprises the following steps: training a first voice noise reduction model based on multiple groups of first sample data pairs from simulation scenes of multiple scenes; in a target scene among the multiple scenes, performing reverberation processing separately on the initial voice sample data and the initial noise sample data in each group of initial sample data pairs to obtain second voice sample data and target noise data containing the target environment reverberation, and mixing the second voice sample data with the target noise data to obtain second noisy voice sample data; and training the first voice noise reduction model based on multiple groups of third sample data pairs comprising the second voice sample data and the second noisy voice sample data to obtain a target voice noise reduction model. The method adapts the target voice noise reduction model to the target scene and can improve the noise reduction effect on noisy voice data in that scene.

Description

Training method, device, equipment, storage medium and product of voice noise reduction model
Technical Field
The present application relates to the field of audio signal processing technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a product for training a speech noise reduction model.
Background
At present, voice noise reduction models are used to reduce noise in noisy voice data in scenes such as conferences, speeches and lectures, so as to obtain clean voice data with the noise removed.
In the related art, noisy voice sample data and clean voice sample data are generally obtained from simulations of these scenes for model training, yielding a voice noise reduction model common to all of them. However, because an actual scene is never exactly the same as its simulation, reducing noise of the noisy voice data in that scene based on such a model gives a poor noise reduction effect.
Disclosure of Invention
The embodiment of the application provides a training method, a training device, equipment, a storage medium and a product of a voice noise reduction model, which can improve the noise reduction effect of noisy voice data. The technical scheme is as follows:
in one aspect, a method for training a speech noise reduction model is provided, where the method includes:
training to obtain a first voice noise reduction model based on multiple groups of first sample data pairs in simulation scenes of multiple scenes, wherein each group of first sample data pairs comprises first voice sample data and first noisy voice sample data, and the first voice sample data is obtained by reducing noise of the first noisy voice sample data;
acquiring multiple groups of initial sample data pairs, wherein each group of initial sample data pairs comprises initial voice sample data and initial noise sample data, the initial voice sample data being the first voice sample data with the simulated environment reverberation removed, and the initial noise sample data being obtained by removing the voice data and the simulated environment reverberation from the first noisy voice sample data;
performing reverberation processing on initial voice sample data in the multiple groups of initial sample data pairs in a target scene in the multiple scenes to obtain second voice sample data in multiple groups of second sample data pairs containing target environment reverberation, and performing reverberation processing on initial noise sample data in the multiple groups of initial sample data pairs to obtain target noise data in the multiple groups of second sample data pairs;
mixing second voice sample data and target noise data in each group of second sample data pairs to obtain a plurality of second noisy voice sample data;
training the first voice noise reduction model based on multiple groups of third sample data pairs to obtain a target voice noise reduction model, wherein the target voice noise reduction model is used for reducing noise of noisy voice data in the target scene, and each group of third sample data pairs comprises second voice sample data and second noisy voice sample data.
In some embodiments, the performing reverberation processing on the initial voice sample data in the multiple sets of initial sample data pairs to obtain second voice sample data in multiple sets of second sample data pairs containing target environment reverberation includes:
for each group of initial sample data pairs, playing the initial voice sample data in the target scene through a target playing device, and performing sound acquisition in the target scene through a target sound acquisition device to obtain the second voice sample data containing the target environment reverberation; or
for each group of initial sample data pairs, obtaining impulse response data in the target scene, and performing convolution processing on the impulse response data and the initial voice sample data to obtain the second voice sample data containing the target environment reverberation.
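The second option above, convolving a dry signal with an impulse response measured in the target scene, can be sketched in a few lines of numpy. This is an illustrative sketch, not the patent's implementation; the function name, trimming convention, and toy impulse response are assumptions:

```python
import numpy as np

def apply_reverb(dry_signal, impulse_response):
    """Simulate target-scene reverberation by convolving a dry (reverb-free)
    signal with a room impulse response measured in that scene.
    The full convolution is trimmed back to the input length."""
    wet = np.convolve(dry_signal, impulse_response)
    return wet[: len(dry_signal)]

# A unit impulse played through the room reproduces the impulse response itself.
dry = np.zeros(8)
dry[0] = 1.0
rir = np.array([1.0, 0.6, 0.3, 0.1])  # toy RIR: direct path plus decaying reflections
wet = apply_reverb(dry, rir)
```

The same routine applies unchanged to the initial noise sample data, which is how the later steps obtain target noise data containing the target environment reverberation.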
In some embodiments, the reverberation processing on the initial noise sample data in the multiple sets of initial sample data pairs to obtain the target noise data in the multiple sets of second sample data pairs includes:
for each group of initial sample data pairs, determining a noise type corresponding to the initial noise sample data, and acquiring noise of that noise type in the target scene to obtain the target noise data containing the target environment reverberation; or
for each group of initial sample data pairs, playing the initial noise sample data in the target scene through a target playing device, and performing sound acquisition in the target scene through a target sound acquisition device to obtain the target noise data containing the target environment reverberation.
In some embodiments, the mixing the second voice sample data and the target noise data in each of the plurality of groups of second sample data pairs to obtain a plurality of second noisy voice sample data includes:
and for each group of second sample data pairs, mixing the second voice sample data and the target noise data based on a target signal-to-noise ratio to obtain second noisy voice sample data.
In some embodiments, the target noise data includes target noise data of a plurality of different noise types and there are a plurality of second noisy voice sample data, and the mixing of the second voice sample data and the target noise data based on the target signal-to-noise ratio includes at least one of the following implementations:
mixing the target noise data of the different noise types with the second voice sample data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data;
and for the target noise data of each noise type, mixing it with at least one target noise data of a different noise type to obtain a plurality of mixed noise data, and mixing the plurality of mixed noise data with the second voice sample data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data.
In some embodiments, the second voice sample data includes second voice sample data of a plurality of different speaker groups and there are a plurality of second noisy voice sample data, and the mixing of the second voice sample data and the target noise data based on the target signal-to-noise ratio includes at least one of the following implementations:
mixing the second voice sample data of the different speaker groups with the target noise data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data;
and for the second voice sample data of each speaker group, mixing it with at least one second voice sample data of a different speaker group to obtain a plurality of mixed voice sample data, and mixing the plurality of mixed voice sample data with the target noise data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data.
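The signal-to-noise-ratio mixing used throughout these embodiments can be sketched with the standard power-ratio definition of SNR. All names below are illustrative rather than taken from the patent:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db,
    then add it to the speech to form a noisy sample."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for second voice sample data
noise = rng.standard_normal(16000)   # stand-in for target noise data
noisy = mix_at_snr(speech, noise, snr_db=10.0)

# The multi-noise-type variant described above reuses the same routine:
# sum two noise signals first, then mix the sum at the target SNR.
mixed_noise = noise + rng.standard_normal(16000)
noisy_multi = mix_at_snr(speech, mixed_noise, snr_db=10.0)
```

By construction the residual `noisy - speech` is exactly the scaled noise, so the achieved SNR matches the target.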
In another aspect, an apparatus for training a speech noise reduction model is provided, the apparatus comprising:
the first training module is used for training to obtain a first voice noise reduction model based on a plurality of groups of first sample data pairs in a simulation scene of various scenes, wherein each group of first sample data pairs comprises first voice sample data and first noise-carrying voice sample data, and the first voice sample data is the first noise-carrying voice sample data after noise reduction;
the acquisition module is used for acquiring multiple groups of initial sample data pairs, each group of initial sample data pairs comprising initial voice sample data and initial noise sample data, the initial voice sample data being the first voice sample data with the simulated environment reverberation removed, and the initial noise sample data being obtained by removing the voice data and the simulated environment reverberation from the first noisy voice sample data;
a processing module, configured to perform reverberation processing on initial voice sample data in the multiple sets of initial sample data pairs in a target scene of the multiple scenes to obtain second voice sample data in multiple sets of second sample data pairs including target environment reverberation, and perform reverberation processing on initial noise sample data in the multiple sets of initial sample data pairs to obtain target noise data in the multiple sets of second sample data pairs;
the mixing module is used for mixing the second voice sample data and the target noise data in each group of second sample data pairs to obtain a plurality of second noisy voice sample data;
and the second training module is used for training the first voice noise reduction model based on the multiple groups of third sample data pairs to obtain a target voice noise reduction model, the target voice noise reduction model is used for reducing noise of the noisy voice data in the target scene, and each group of third sample data pairs comprises second voice sample data and second noisy voice sample data.
In some embodiments, the processing module is to:
for each group of initial sample data pairs, playing the initial voice sample data in the target scene through a target playing device, and performing sound acquisition in the target scene through a target sound acquisition device to obtain the second voice sample data containing the target environment reverberation; or
for each group of initial sample data pairs, obtaining impulse response data in the target scene, and performing convolution processing on the impulse response data and the initial voice sample data to obtain the second voice sample data containing the target environment reverberation.
In some embodiments, the processing module is to:
for each group of initial sample data pairs, determining a noise type corresponding to the initial noise sample data, and acquiring noise of that noise type in the target scene to obtain the target noise data containing the target environment reverberation; or
for each group of initial sample data pairs, playing the initial noise sample data in the target scene through a target playing device, and performing sound acquisition in the target scene through a target sound acquisition device to obtain the target noise data containing the target environment reverberation.
In some embodiments, the mixing module is configured to, for each group of second sample data pairs, mix the second voice sample data and the target noise data based on a target signal-to-noise ratio to obtain the second noisy voice sample data.
In some embodiments, the target noise data comprises target noise data of a plurality of different noise types and there are a plurality of second noisy voice sample data, and the mixing module is configured to:
mix the target noise data of the plurality of different noise types with the second voice sample data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data;
and for the target noise data of each noise type, mix it with at least one target noise data of a different noise type to obtain a plurality of mixed noise data, and mix the plurality of mixed noise data with the second voice sample data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data.
In some embodiments, the second voice sample data comprises second voice sample data of a plurality of different speaker groups and there are a plurality of second noisy voice sample data, and the mixing module is configured to:
mix the second voice sample data of the plurality of different speaker groups with the target noise data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data;
and for the second voice sample data of each speaker group, mix it with at least one second voice sample data of a different speaker group to obtain a plurality of mixed voice sample data, and mix the plurality of mixed voice sample data with the target noise data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data.
In another aspect, a computer device is provided, which includes one or more processors and one or more memories, and at least one program code is stored in the one or more memories, and the at least one program code is loaded and executed by the one or more processors to implement the method for training a speech noise reduction model according to any of the above implementations.
In another aspect, a computer-readable storage medium is provided, where at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the method for training a speech noise reduction model according to any of the above-mentioned implementation manners.
In another aspect, a computer program product is provided. The computer program product includes computer program code stored in a computer-readable storage medium; a processor of a computer device reads the computer program code from the storage medium and executes it, causing the computer device to perform the training method of a voice noise reduction model according to any of the above implementations.
The embodiment of the application provides a training method for a voice noise reduction model. A basic voice noise reduction model common to multiple scenes is first obtained based on sample data from simulation scenes. Reverberation processing is then applied to the initial voice sample data and initial noise sample data in a target scene, yielding voice sample data and noise data containing the target environment reverberation, and noisy voice sample data is obtained from them; that is, voice sample data and noisy voice sample data adapted to the target scene are obtained. The basic voice noise reduction model is then trained on this target-scene data, so that the trained target voice noise reduction model is suitable for the target scene. Reducing noise of noisy voice data in the target scene based on the target voice noise reduction model therefore improves the noise reduction effect.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for training a speech noise reduction model according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for training a speech noise reduction model according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for training a speech noise reduction model according to an embodiment of the present application;
FIG. 5 is a block diagram of a training apparatus for a speech noise reduction model according to an embodiment of the present application;
fig. 6 is a block diagram of a terminal according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a server according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions. For example, the sample data pairs referred to in this application are all obtained with sufficient authorization.
The training method for the speech noise reduction model provided by the embodiment of the application can be executed by a computer device, and in some embodiments, the computer device is at least one of a terminal or a server. Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment of a training method for a speech noise reduction model according to an embodiment of the present application, where the implementation environment includes at least one of a terminal 10 and a server 20, and the terminal 10 and the server 20 can be directly or indirectly connected through a wired or wireless communication manner, and the present application is not limited herein. The training method of the speech noise reduction model provided in the embodiment of the present application may be executed by the terminal 10 alone, or may be executed by the server 20, or is implemented by the terminal 10 or the server 20 through data interaction, which is not limited in the embodiment of the present application. In some embodiments, the server 20 undertakes primary computational tasks and the terminal 10 undertakes secondary computational tasks; alternatively, the server 20 undertakes the secondary computing job and the terminal 10 undertakes the primary computing job; alternatively, the server 20 and the terminal 10 perform cooperative computing by using a distributed computing architecture.
In some embodiments, the training method of the speech noise reduction model provided in the embodiments of the present application is applied to a conference, a lecture, a workshop, and the like, where noise reduction is required for noisy speech data, for example, in a lecture scene, noise reduction processing may be performed on the collected noisy speech data of a speaker based on the speech noise reduction model to obtain clean speech data with noise removed.
The terminal 10 is at least one of a mobile phone, a tablet Computer, and a PC (Personal Computer) device. The server 20 may be at least one of a server, a server cluster composed of a plurality of servers, a cloud server, a cloud computing platform, and a virtualization center.
Fig. 2 is a flowchart of a training method for a voice noise reduction model according to an embodiment of the present application. The method may be implemented by at least one of a terminal and a server; in this embodiment, the terminal and the server are collectively referred to as a computer device, and the computer device is taken as the execution subject. Referring to fig. 2, the method includes:
201. the computer equipment trains to obtain a first voice noise reduction model based on multiple groups of first sample data pairs in a simulation scene of multiple scenes, wherein each group of first sample data pairs comprises first voice sample data and first noise-carrying voice sample data, and the first voice sample data is the first noise-carrying voice sample data subjected to noise reduction.
In an embodiment of the present application, the first voice sample data and the first noisy voice sample data are both voice signals. The first voice sample data is clean voice sample data that contains no noise but contains simulated environment reverberation, and the first noisy voice sample data is voice sample data that contains both noise and simulated environment reverberation. The simulated environment reverberation is the environment reverberation in the simulation scene; in any simulation scene it arises from sound reflections in that scene and is related to the size and shape of the scene and the materials in it.
In this embodiment of the application, the computer device learns a first noise reduction rule based on the multiple groups of first sample data pairs, where the first noise reduction rule is a rule for reducing noise of first noisy voice sample data to obtain first voice sample data, and then generates the first voice noise reduction model based on that rule. The multiple scenes include scenes such as speeches, conferences, lectures, factories and subways, which are not specifically limited here. Training the first voice noise reduction model based on the multiple groups of first sample data pairs from the simulations of the multiple scenes yields a basic voice noise reduction model common to the multiple scenes.
202. The computer equipment acquires multiple groups of initial sample data pairs, each group of initial sample data pairs comprises initial voice sample data and initial noise sample data, the initial voice sample data is first voice sample data without simulated environment reverberation, and the initial noise sample data is data obtained by removing the voice data and the simulated environment reverberation from the first voice sample data with noise.
In an embodiment of the present application, the initial voice sample data is clean voice sample data that does not include noise and does not include simulated ambient reverberation.
203. The computer equipment performs reverberation processing on initial voice sample data in multiple groups of initial sample data pairs in a target scene in multiple scenes to obtain second voice sample data in multiple groups of second sample data pairs containing target environment reverberation, and performs reverberation processing on initial noise sample data in the multiple groups of initial sample data pairs to obtain target noise data in the multiple groups of second sample data pairs.
In the embodiment of the present application, the target scene may be any scene among the multiple scenes in which voice noise reduction is to be performed. The target environment reverberation is the environment reverberation in the target scene; it arises from sound reflections in the target scene and is related to the size and shape of the target scene and the materials in it.
204. And the computer equipment mixes the second voice sample data and the target noise data in each group of second sample data pairs to obtain a plurality of second noise-carrying voice sample data.
205. The computer device trains the first voice noise reduction model based on multiple groups of third sample data pairs to obtain a target voice noise reduction model, the target voice noise reduction model is used for reducing noise of noisy voice data in a target scene, and each group of third sample data pairs comprises second voice sample data and second noisy voice sample data.
In this embodiment of the application, the computer device learns a second noise reduction rule based on the plurality of sets of third sample data pairs, where the second noise reduction rule refers to a rule for reducing noise of a second noisy speech sample data to obtain a second speech sample data, and then generates the target speech noise reduction model based on the second noise reduction rule.
The embodiment of the application provides a training method for a voice noise reduction model. A basic voice noise reduction model common to multiple scenes is first obtained based on sample data from simulation scenes. Reverberation processing is then applied to the initial voice sample data and initial noise sample data in a target scene, yielding voice sample data and noise data containing the target environment reverberation, and noisy voice sample data is obtained from them; that is, voice sample data and noisy voice sample data adapted to the target scene are obtained. The basic voice noise reduction model is then trained on this target-scene data, so that the trained target voice noise reduction model is suitable for the target scene. Reducing noise of noisy voice data in the target scene based on the target voice noise reduction model therefore improves the noise reduction effect.
Fig. 3 is a flowchart of a training method for a voice noise reduction model according to an embodiment of the present application. The method may be implemented by at least one of a terminal and a server; in this embodiment, the terminal and the server are collectively referred to as a computer device, and the computer device is taken as the execution subject. Referring to fig. 3, the method includes:
301. the computer equipment trains to obtain a first voice noise reduction model based on a plurality of groups of first sample data pairs in a simulation scene of various scenes, wherein each group of first sample data pairs comprises first voice sample data and first noise-carrying voice sample data, and the first voice sample data is the first noise-carrying voice sample data subjected to noise reduction.
Optionally, the process of training the first voice noise reduction model based on the multiple groups of first sample data pairs includes: the computer device inputs the first noisy voice sample data in each group of first sample data pairs into an initial voice noise reduction model to obtain predicted first voice data, and then adjusts the model parameters of the initial voice noise reduction model based on a loss value between the predicted first voice data and the first voice sample data. The computer device iteratively performs these steps over the multiple groups of first sample data pairs until a stop condition is reached, obtaining the first voice noise reduction model.
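The two-stage procedure (pretrain on simulation-scene pairs, then fine-tune on target-scene pairs, as in steps 301 and later) can be sketched with a deliberately tiny linear model. The patent assumes a full voice noise reduction model; the per-frequency gain model, learning rate, and data shapes below are stand-in assumptions purely to show the loss-driven parameter updates and the reuse of pretrained parameters:

```python
import numpy as np

def train_denoiser(noisy, clean, w=None, lr=0.01, steps=200):
    """Fit per-frequency gains w so that w * noisy approximates clean,
    by gradient descent on the mean squared error.

    noisy, clean: (num_examples, num_bins) magnitude-spectrum features."""
    if w is None:
        w = np.ones(noisy.shape[1])
    for _ in range(steps):
        pred = noisy * w                                   # element-wise spectral gain
        grad = 2.0 * np.mean((pred - clean) * noisy, axis=0)
        w = w - lr * grad                                  # loss-driven parameter update
    return w

rng = np.random.default_rng(0)

# Stage 1: pairs from the simulation scenes pretrain a general model.
clean_sim = rng.random((64, 8))
noisy_sim = clean_sim + 0.3 * rng.random((64, 8))
w_general = train_denoiser(noisy_sim, clean_sim)

# Stage 2: target-scene pairs fine-tune the pretrained parameters.
clean_tgt = rng.random((64, 8))
noisy_tgt = clean_tgt + 0.5 * rng.random((64, 8))
w_target = train_denoiser(noisy_tgt, clean_tgt, w=w_general.copy())
```

The key point mirrored from the method is that stage 2 starts from the pretrained parameters rather than from scratch, adapting the general model to the target scene.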
In some embodiments, the computer device is a terminal capable of acquiring multiple groups of first sample data pairs based on the simulation scenes of the multiple scenes. Optionally, the terminal is a simulator that can simulate multiple scenes and generate noisy voice sample data and voice sample data, yielding the multiple groups of first sample data pairs; or the terminal is a sound acquisition device used to collect noisy voice sample data and voice sample data in a simulation scene, yielding the multiple groups of first sample data pairs.
302. The computer device acquires multiple groups of initial sample data pairs, where each group of initial sample data pairs includes initial voice sample data and initial noise sample data, the initial voice sample data is the first voice sample data without the simulated environment reverberation, and the initial noise sample data is the data obtained by removing the voice data and the simulated environment reverberation from the first noisy voice sample data.
In one implementation, the initial sample data pairs and the first sample data pairs are stored correspondingly, and the computer device directly obtains the initial sample data pairs corresponding to the multiple groups of first sample data pairs respectively to obtain the multiple groups of initial sample data pairs.
303. For each group of initial sample data pairs, the computer device plays the initial voice sample data in the target scene through a target playing device, performs sound collection in the target scene through a target sound collection device to obtain voice data containing environmental noise and target environment reverberation, and performs noise reduction processing on the voice data to obtain second voice sample data containing the target environment reverberation.
Optionally, the computer device plays the initial voice sample data and performs sound collection in a specified environment corresponding to the target scene, where the specified environment contains only environmental noise and no other interference noise; the environmental noise is noise that is unavoidable in the target scene, while the other interference noise is noise that can be avoided in the target scene. For example, in a conference scene the specified environment is a conference room: background noise in the conference room is unavoidable environmental noise, while sounds such as keyboard tapping, door opening and closing, and hand clapping are other interference noise. Because sound collection is carried out with only environmental noise present, the collected voice data contains only environmental noise, and stable noise reduction of this voice data can be achieved with a general noise reduction method, yielding the second voice sample data for the target scene.
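The "general noise reduction method" is not specified in this application; spectral subtraction is one common choice for suppressing quasi-stationary environmental noise and is sketched below purely as an illustrative assumption. The function name and parameters are hypothetical; it subtracts an estimated noise magnitude spectrum from a single noisy frame.

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_estimate, floor=0.01):
    """One-frame spectral subtraction: subtract the estimated noise
    magnitude spectrum from the noisy magnitude spectrum, flooring at a
    small fraction of the noisy magnitude to avoid negative magnitudes,
    then resynthesize using the noisy phase."""
    spec = np.fft.rfft(noisy_frame)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = np.abs(np.fft.rfft(noise_estimate))
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy_frame))
```

In practice the noise spectrum would be averaged over many noise-only frames, and the frame-wise processing embedded in an overlap-add loop.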
It should be noted that different sound collection devices have different frequency responses, where a frequency response indicates differences in processing capability for signals of different frequencies. In this embodiment of the present application, to obtain second voice sample data that better matches the target scene, the target sound collection device is the sound collection device actually used in the target scene to collect the noisy voice data to be denoised, so that the second voice sample data matches the target scene more closely. The target playing device may likewise be a playing device actually used in the target scene, so that no additional playing device needs to be configured, improving the convenience of playing the initial voice sample data.
In the embodiment of the present application, the process of performing reverberation processing on the initial voice sample data to obtain the second voice sample data in the target scene is realized through step 303. However, for some scenes with complex conditions, it cannot be guaranteed that only environmental noise exists in the specified environment. To ensure the efficiency and accuracy of obtaining the second voice sample data, in some embodiments the process by which the computer device performs reverberation processing on the initial voice sample data in the multiple groups of initial sample data pairs to obtain the second voice sample data containing the target environment reverberation includes the following implementation: for each group of initial sample data pairs, the computer device acquires impulse response data for the target scene and performs convolution processing on the impulse response data and the initial voice sample data to obtain second voice sample data containing the target environment reverberation.
Here, the impulse response data reflects the target environment reverberation of the target scene. Optionally, the computer device obtains the impulse response data by performing an impulse response measurement in a room of the target scene. In this implementation, the second voice sample data containing the target environment reverberation is obtained based on the impulse response data, avoiding the need to collect and denoise voice data containing environmental noise, which improves the acquisition efficiency of the second voice sample data.
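The convolution step above can be sketched directly. Assuming a measured room impulse response loaded as a 1-D array (names are illustrative), adding the target environment reverberation is a single convolution:

```python
import numpy as np

def add_reverb(dry_speech, impulse_response):
    """Convolve dry speech with a measured room impulse response (RIR)
    to simulate the target environment's reverberation."""
    wet = np.convolve(dry_speech, impulse_response)
    # Trim to the original length so sample pairs stay aligned.
    return wet[:len(dry_speech)]
```

For long signals, an FFT-based convolution (e.g. `scipy.signal.fftconvolve`) is the usual faster alternative with identical output.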
304. For each group of initial sample data pairs, the computer device determines the noise type corresponding to the initial noise sample data and collects noise of that type in the target scene to obtain target noise data containing the target environment reverberation.
The target noise data contains not only the target environment reverberation but also the environmental noise of the target scene and noise corresponding to the noise type of the initial noise sample data. Noise types include walking sounds, crying and laughing, keyboard tapping, the sound of tables and chairs being moved, wind, rain, waves, and so on, and are not specifically limited here.
The computer device collects the noise corresponding to the noise type in the specified environment corresponding to the target scene, where only noise of that type is present and no other sounds occur; the other sounds include speech and the sound of the target playing device playing voice data, among others. Pure noise data can thus be obtained. Optionally, the noise corresponding to the noise type is produced manually in the target scene, so that noise can be produced flexibly as needed, improving the flexibility and efficiency of acquiring the target noise data. Optionally, the computer device collects the noise through the target sound collection device, making the target noise data match the target scene more closely.
In the embodiment of the application, noise is collected in a target scene, target noise data containing target environment reverberation is obtained, the target noise data is matched with the target scene, and then noisy speech sample data matched with the target scene can be obtained based on the target noise data.
In the embodiment of the present application, reverberation processing is performed on the initial noise sample data in the multiple groups of initial sample data pairs through step 304 to obtain the target noise data in the multiple groups of second sample data pairs. In yet other embodiments, this process includes the following implementation: for each group of initial sample data pairs, the computer device plays the initial noise sample data in the target scene through the target playing device and performs sound collection in the target scene through the target sound collection device to obtain target noise data containing the target environment reverberation.
In this implementation, the target noise data is obtained by sound collection in the target scene while the initial noise sample data is played, avoiding the need to produce noise manually and improving the convenience of acquiring the target noise data.
305. For each group of second sample data pairs, the computer device mixes the second voice sample data and the target noise data to obtain multiple second noisy voice sample data.
In one implementation, the computer device mixes the second voice sample data and the target noise data based on the target signal-to-noise ratio for each group of second sample data pairs to obtain second noisy voice sample data.
Here, the target signal-to-noise ratio refers to the ratio between the signal intensity of the second voice sample data and the signal intensity of the target noise data. Optionally, the target signal-to-noise ratio in the embodiment of the present application is determined based on the signal-to-noise ratio of noisy voice data in the target scene, so that the second noisy voice sample data better matches the target scene. Since noisy voice data in the target scene may occur at several signal-to-noise ratios, there may be multiple target signal-to-noise ratios. Correspondingly, the computer device mixes the second voice sample data and the target noise data at each target signal-to-noise ratio, obtaining multiple second noisy voice sample data. This enriches the amount of second noisy voice sample data, which can improve the noise reduction effect of the trained speech noise reduction model.
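Mixing at a target signal-to-noise ratio amounts to rescaling the noise before adding it to the speech. A minimal sketch, assuming a power-based SNR expressed in dB (the function name is illustrative):

```python
import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `target_snr_db` (in dB), then add it to `speech`."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power required for the target SNR: P_s / P_n = 10^(SNR/10).
    desired_noise_power = speech_power / (10 ** (target_snr_db / 10))
    scale = np.sqrt(desired_noise_power / noise_power)
    return speech + scale * noise
```

Calling this once per target SNR on the same (speech, noise) pair yields the multiple second noisy voice sample data described above.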
The initial noise sample data covers multiple noise types, so the target noise data includes target noise data of multiple different noise types, and multiple second noisy voice sample data are obtained. Accordingly, in some embodiments, the above process of the computer device mixing the second voice sample data and the target noise data based on the target signal-to-noise ratio to obtain the second noisy voice sample data includes at least one of the following implementations:
in one implementation, the computer device mixes the target noise data of the multiple different noise types with the second voice sample data respectively, based on the target signal-to-noise ratio, to obtain multiple second noisy voice sample data. In this implementation, mixing target noise data of multiple different noise types with the second voice sample data ensures that the second noisy voice sample data covers multiple types of noise, enriching the noise types of the noisy voice sample data; training the model on noisy voice sample data covering multiple noise types can improve the general applicability of the trained speech noise reduction model.
In another implementation, for the target noise data of each noise type, the computer device mixes that target noise data with at least one target noise data of a different noise type to obtain multiple mixed noise data, and then mixes the multiple mixed noise data with the second voice sample data respectively, based on the target signal-to-noise ratio, to obtain multiple second noisy voice sample data.
Optionally, the computer device mixes the multiple target noise data based on a signal energy ratio, which refers to the ratio between the signal energies of the multiple target noise data and may be set and changed as needed. For example, to make the signal energy distribution of the multiple target noise data in the mixed noise data more uniform, the signal energy ratio may be set to 1:1.
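Mixing several noise signals at a prescribed signal energy ratio can be sketched as rescaling each noise to its target energy share and summing. A minimal sketch with illustrative names, defaulting to the uniform (equal-energy) ratio described above:

```python
import numpy as np

def mix_noises(noises, energy_ratio=None):
    """Mix several equal-length noise signals so their signal energies
    follow `energy_ratio` (defaults to equal energy, i.e. 1:1:...:1)."""
    if energy_ratio is None:
        energy_ratio = [1.0] * len(noises)
    mixed = np.zeros_like(noises[0], dtype=float)
    for noise, target_energy in zip(noises, energy_ratio):
        energy = np.sum(noise ** 2)
        mixed += noise * np.sqrt(target_energy / energy)  # rescale to target energy
    return mixed
```

The mixed result would then be combined with the second voice sample data at the target signal-to-noise ratio as in the previous step.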
It should be noted that multiple different types of noise may exist simultaneously in the target scene. In this implementation, target noise data of multiple different noise types are mixed, so that the mixed noise data better matches the target scene, and second noisy voice sample data that better matches the target scene can then be obtained.
It should be noted that the second voice sample data may include second voice sample data of multiple different groups of people, in which case multiple second noisy voice sample data are obtained. Accordingly, in some embodiments, the above process of the computer device mixing the second voice sample data and the target noise data based on the target signal-to-noise ratio to obtain the second noisy voice sample data includes at least one of the following implementations:
in one implementation, the computer device mixes the second voice sample data of multiple different groups of people with the target noise data respectively, based on the target signal-to-noise ratio, to obtain multiple second noisy voice sample data.
The groups of people may be divided according to different criteria. If divided by age, the groups include infants, juveniles, young people, middle-aged people, the elderly, and so on; if divided by gender, the groups include males, females, and so on. This is not specifically limited in the embodiments of the present application.
In this implementation, the second voice sample data of multiple different groups of people are each mixed with the target noise data, so that the second noisy voice sample data covers the voices of various groups of people. Third sample data pairs are formed from the second voice sample data and the second noisy voice sample data, enriching the amount of third sample data pairs; performing model training on these third sample data pairs can improve the general applicability of the trained speech noise reduction model.
In another implementation, for the second voice sample data of each group of people, the computer device mixes it with at least one second voice sample data of a different group to obtain multiple mixed voice sample data, and then mixes the multiple mixed voice sample data with the target noise data respectively, based on the target signal-to-noise ratio, to obtain multiple second noisy voice sample data.
Alternatively, the computer device mixes the plurality of second voice sample data based on a signal energy ratio, which may be set and changed as needed, and is not particularly limited herein.
It should be noted that voice data of multiple different people may exist simultaneously in the target scene, as in a chorus scene. In this implementation, second voice sample data of multiple different groups of people are mixed, so that the mixed voice sample data better matches the target scene, and second noisy voice sample data that better matches the target scene can then be obtained.
It should be noted that the steps in this embodiment are numbered for convenience of description, and the numbering does not limit the execution order. Steps 302 to 305 may be executed after step 301, before step 301, or concurrently with step 301; that is, in this embodiment, the first speech noise reduction model may be obtained based on the first sample data pairs before, after, or at the same time as the third sample data pairs are obtained based on the initial sample data pairs, which is not limited here.
306. The computer device trains the first voice noise reduction model based on multiple groups of third sample data pairs to obtain a target voice noise reduction model, the target voice noise reduction model is used for reducing noise of noisy voice data in a target scene, and each group of third sample data pairs comprises second voice sample data and second noisy voice sample data.
Optionally, the training of the first speech noise reduction model by the computer device based on the multiple groups of third sample data pairs includes: the computer device inputs the second noisy voice sample data in each group of third sample data pairs into the first speech noise reduction model to obtain predicted second speech data, and then adjusts the model parameters of the first speech noise reduction model based on a loss value between the second speech data and the second voice sample data. The computer device iterates the above steps over the multiple groups of third sample data pairs until a stop condition is reached, obtaining the target speech noise reduction model.
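This second training stage mirrors the first but starts from pretrained parameters and typically uses a smaller learning rate so that the general-purpose behaviour learned in the simulation stage is preserved. A self-contained toy sketch, in which a linear transform again stands in for the first speech noise reduction model and all names are illustrative:

```python
import numpy as np

def fine_tune(W, scene_pairs, lr=1e-3, iters=100):
    """Continue training a pretrained (toy linear) denoiser W on
    target-scene sample pairs; the small learning rate keeps the update
    close to the pretrained general-purpose model."""
    W = W.copy()
    for _ in range(iters):
        for clean, noisy in scene_pairs:
            err = W @ noisy - clean          # loss between prediction and clean sample
            W -= lr * np.outer(err, noisy)   # adjust model parameters
    return W
```

In practice the same adaptation is usually done with a deep network and a framework optimizer, often freezing early layers; the staged pretrain-then-fine-tune structure is what matters here.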
Referring to fig. 4, fig. 4 is a flowchart illustrating the training of a speech noise reduction model according to an embodiment of the present application. The computer device plays and collects the initial voice sample data to obtain voice data in an actual interference-free environment of the target scene, where the actual interference-free environment refers to an environment containing only environmental noise, and performs noise reduction on this voice data to obtain clean second voice sample data. The computer device then collects noise of the same noise type as the initial noise sample data in the actual environment of the target scene to obtain target noise data. Next, the computer device mixes the target noise data and the second voice sample data to obtain second noisy voice sample data. Finally, the computer device optimally trains a trained basic speech noise reduction model on a specified data set to obtain a target speech noise reduction model suited to the target scene, where the basic speech noise reduction model is the first speech noise reduction model trained on the first sample data pairs from the simulation scenarios, and the specified data set includes multiple groups of third sample data pairs consisting of second voice sample data and second noisy voice sample data.
In the embodiment of the present application, a general-purpose speech noise reduction model is first trained on sample data pairs from simulation scenarios of various scenes; this general-purpose model then serves as the basic speech noise reduction model and is fine-tuned on a training data set formed from sample data pairs of the target scene, so that data actually collected in the target scene is applied to model training and the resulting target speech noise reduction model is suited to the target scene. Using the model trained on simulation scenarios as the base and then optimally training it on sample data of the target scene improves the efficiency of that optimization, allows the base model to be quickly adapted to a specified target scene, and makes it possible to quickly obtain speech noise reduction models suited to each of several scenes.
The embodiment of the present application provides a training method for a speech noise reduction model. A basic speech noise reduction model that is general across various scenes is obtained based on sample data in simulation scenarios; reverberation processing is performed on initial voice sample data and initial noise sample data in a target scene to obtain voice sample data and noise data containing the target environment reverberation; and noisy voice sample data is obtained based on these, that is, voice sample data and noisy voice sample data adapted to the target scene are obtained. The basic speech noise reduction model is then trained on the voice sample data and noisy voice sample data adapted to the target scene, so that the trained target speech noise reduction model is suited to the target scene. Reducing the noise of noisy voice data in the target scene with this target speech noise reduction model can therefore improve the noise reduction effect.
The embodiment of the present application further provides a training apparatus for a speech noise reduction model, referring to fig. 5, the apparatus includes:
the first training module 501 is configured to train to obtain a first voice noise reduction model based on multiple groups of first sample data pairs in a simulation scene of multiple scenes, where each group of first sample data pairs includes first voice sample data and first noise-carrying voice sample data, and the first voice sample data is the noise-reduced first noise-carrying voice sample data;
an obtaining module 502, configured to obtain multiple sets of initial sample data pairs, where each set of initial sample data pairs includes initial voice sample data and initial noise sample data, the initial voice sample data is first voice sample data that does not include simulated environment reverberation, and the initial noise sample data is data obtained by removing voice data and simulated environment reverberation from the first noisy voice sample data;
a processing module 503, configured to perform reverberation processing on initial voice sample data in multiple sets of initial sample data pairs in a target scene of multiple scenes to obtain second voice sample data in multiple sets of second sample data pairs including target environment reverberation, and perform reverberation processing on initial noise sample data in the multiple sets of initial sample data pairs to obtain target noise data in the multiple sets of second sample data pairs;
a mixing module 504, configured to mix second voice sample data and target noise data in each group of second sample data pairs to obtain multiple second noisy voice sample data;
the second training module 505 is configured to train the first voice noise reduction model based on multiple sets of third sample data pairs to obtain a target voice noise reduction model, where the target voice noise reduction model is configured to reduce noise of noisy voice data in a target scene, and each set of third sample data pairs includes second voice sample data and second noisy voice sample data.
In some embodiments, the processing module 503 is configured to:
for each group of initial sample data pairs, playing the initial voice sample data in the target scene through a target playing device, and performing sound collection in the target scene through a target sound collection device to obtain second voice sample data containing the target environment reverberation; or,
and for each group of initial sample data pairs, impulse response data under the target scene are obtained, and convolution processing is carried out on the impulse response data and the initial voice sample data to obtain second voice sample data containing target environment reverberation.
In some embodiments, the processing module 503 is configured to:
for each group of initial sample data pairs, determining the noise type corresponding to the initial noise sample data, and collecting noise corresponding to the noise type in the target scene to obtain target noise data containing the target environment reverberation; or,
and for each group of initial sample data pairs, playing initial noise sample data in a target scene through target playing equipment, and acquiring sound in the target scene through target sound acquisition equipment to obtain target noise data containing target environment reverberation.
In some embodiments, the mixing module 504 is configured to, for each group of second sample data pairs, mix the second voice sample data and the target noise data based on the target signal-to-noise ratio to obtain second noisy voice sample data.
In some embodiments, the target noise data includes target noise data of multiple different noise types, there are multiple second noisy voice sample data, and the mixing module 504 is configured to:
mixing the target noise data of multiple different noise types with the second voice sample data respectively, based on the target signal-to-noise ratio, to obtain multiple second noisy voice sample data;
and for the target noise data of each noise type, mixing the target noise data with at least one target noise data of a different noise type to obtain multiple mixed noise data, and mixing the multiple mixed noise data with the second voice sample data respectively, based on the target signal-to-noise ratio, to obtain multiple second noisy voice sample data.
In some embodiments, the second voice sample data includes second voice sample data of multiple different groups of people, there are multiple second noisy voice sample data, and the mixing module 504 is configured to:
based on the target signal-to-noise ratio, mixing the second voice sample data of multiple different groups of people with the target noise data respectively to obtain multiple second noisy voice sample data;
and for the second voice sample data of each group of people, mixing it with at least one second voice sample data of a different group to obtain multiple mixed voice sample data, and mixing the multiple mixed voice sample data with the target noise data respectively, based on the target signal-to-noise ratio, to obtain multiple second noisy voice sample data.
The embodiment of the present application provides a training apparatus for a speech noise reduction model. A basic speech noise reduction model that is general across various scenes is obtained based on sample data in simulation scenarios; reverberation processing is performed on initial voice sample data and initial noise sample data in a target scene to obtain voice sample data and noise data containing the target environment reverberation; and noisy voice sample data is obtained based on these, that is, voice sample data and noisy voice sample data adapted to the target scene are obtained. The basic speech noise reduction model is then trained on the voice sample data and noisy voice sample data adapted to the target scene, so that the trained target speech noise reduction model is suited to the target scene. Reducing the noise of noisy voice data in the target scene with this target speech noise reduction model can therefore improve the noise reduction effect.
It should be noted that: the training device of the speech noise reduction model provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the terminal is divided into different functional modules to complete all or part of the above described functions. In addition, the training apparatus for the speech noise reduction model and the training method embodiment for the speech noise reduction model provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.
In some embodiments, the computer device is provided as a terminal. Fig. 6 shows a block diagram of a terminal 600 according to an exemplary embodiment of the present application. The terminal 600 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, the terminal 600 includes: a processor 601 and a memory 602.
Processor 601 may include one or more processing cores, such as 4-core processors, 8-core processors, and so forth. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 602 is used to store at least one program code for execution by the processor 601 to implement the training method of the speech noise reduction model provided by the method embodiments in the present application.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera assembly 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripherals interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on separate chips or circuit boards, which is not limited by the present embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 605 may be one, disposed on the front panel of the terminal 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in other embodiments, the display 605 may be a flexible display disposed on a curved surface or a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera. The main camera and the depth-of-field camera may be fused to realize a background blurring function, and the main camera and the wide-angle camera may be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment, converts them into electrical signals, and inputs the electrical signals to the processor 601 for processing or to the radio frequency circuit 604 to realize voice communication. For stereo sound collection or noise reduction, a plurality of microphones may be provided at different parts of the terminal 600. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker converts electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a traditional diaphragm speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can not only convert an electrical signal into sound waves audible to humans, but can also convert an electrical signal into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 607 may also include a headphone jack.
The positioning component 608 is used to determine the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 609 is used to supply power to the various components in the terminal 600. The power supply 609 may use alternating current, direct current, disposable batteries, or rechargeable batteries. When the power supply 609 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charge technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, optical sensor 614, and proximity sensor 615.
The acceleration sensor 611 may detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used to collect motion data for games or user activity.
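As an illustration only (not part of the patent), the landscape/portrait decision described above can be sketched from the gravity components, assuming the y axis runs along the long edge of the screen and the x axis along the short edge:

```python
def orientation_from_gravity(ax, ay, az):
    # Hypothetical rule: whichever screen axis carries more of the
    # ~9.8 m/s^2 gravity vector points "down"; gravity mostly along
    # the long (y) axis means the device is held upright (portrait).
    return "portrait" if abs(ay) >= abs(ax) else "landscape"

print(orientation_from_gravity(0.3, 9.7, 0.8))   # device upright
print(orientation_from_gravity(9.6, 0.5, 1.0))   # device rotated 90 degrees
```

A production implementation would additionally low-pass filter the raw accelerometer signal and apply hysteresis so the UI does not flip at the 45-degree boundary.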
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization while shooting, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side bezel of the terminal 600 and/or beneath the display screen 605. When the pressure sensor 613 is disposed on a side bezel of the terminal 600, it can detect the user's grip on the terminal 600, and the processor 601 performs left- or right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed beneath the display screen 605, the processor 601 controls operability controls on the UI according to the pressure the user applies to the display screen 605. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 614 is used to collect the ambient light intensity. In one embodiment, the processor 601 may control the display brightness of the display screen 605 based on the ambient light intensity collected by the optical sensor 614: when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is decreased. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 based on the ambient light intensity collected by the optical sensor 614.
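A minimal sketch (not from the patent) of such ambient-light-driven brightness control, assuming a simple linear policy between a minimum and a maximum panel brightness:

```python
def display_brightness(lux, min_nits=2.0, max_nits=500.0, max_lux=1000.0):
    # Hypothetical linear policy: brighter ambient light -> brighter screen,
    # clamped to the panel's supported range. All parameter values here are
    # illustrative, not taken from the patent.
    frac = min(max(lux / max_lux, 0.0), 1.0)
    return min_nits + frac * (max_nits - min_nits)

print(display_brightness(0.0))     # darkest setting in the dark
print(display_brightness(2000.0))  # clamped to the maximum in sunlight
```

Real devices typically use a logarithmic or table-driven curve instead of a linear one, since perceived brightness is roughly logarithmic in luminance.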
The proximity sensor 615, also called a distance sensor, is typically provided on the front panel of the terminal 600. The proximity sensor 615 is used to measure the distance between the user and the front surface of the terminal 600. In one embodiment, when the proximity sensor 615 detects that the distance between the user and the front surface of the terminal 600 is gradually decreasing, the processor 601 controls the display screen 605 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 615 detects that the distance between the user and the front surface of the terminal 600 is gradually increasing, the processor 601 controls the display screen 605 to switch from the dark-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is not intended to be limiting of terminal 600 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
In some embodiments, the computer device is provided as a server. Fig. 7 is a block diagram of a server provided in an embodiment of the present application. The server 700 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 701 and one or more memories 702, where the memories 702 store executable program code and the processors 701 are configured to execute the executable program code to implement the training method of the speech noise reduction model provided in the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which are not described here again.
In an exemplary embodiment, a storage medium including program code is also provided, such as the memory 702 including program code, which is executable by the processor 701 of the server 700 to perform the above training method of the speech noise reduction model. Optionally, the storage medium may be a non-transitory computer-readable storage medium; for example, the non-transitory computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
An embodiment of the present application further provides a computer-readable storage medium, where at least one program code is stored in the computer-readable storage medium, and the at least one program code is loaded and executed by a processor, so as to implement the method for training a speech noise reduction model according to any implementation manner.
Embodiments of the present application further provide a computer program product, where the computer program product includes computer program code, the computer program code is stored in a computer-readable storage medium, and a processor of the computer device reads the computer program code from the computer-readable storage medium, and executes the computer program code, so that the computer device executes the training method for a speech noise reduction model according to any one of the above implementation manners.
In some embodiments, the computer program product according to the embodiments of the present application may be deployed and executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network; the multiple computer devices distributed at multiple sites and interconnected by a communication network may constitute a blockchain system.
The present application is intended to cover various modifications, alternatives, and equivalents, which may be included within the spirit and scope of the present application.

Claims (10)

1. A method for training a speech noise reduction model, the method comprising:
training to obtain a first voice noise reduction model based on multiple groups of first sample data pairs in simulation scenes of multiple scenes, wherein each group of first sample data pairs comprises first voice sample data and first noisy voice sample data, and the first voice sample data is the first noisy voice sample data after noise reduction;
acquiring multiple groups of initial sample data pairs, wherein each group of initial sample data pairs comprises initial voice sample data and initial noise sample data, the initial voice sample data is the first voice sample data with the simulation environment reverberation removed, and the initial noise sample data is data obtained by removing the voice data and the simulation environment reverberation from the first noisy voice sample data;
performing reverberation processing on initial voice sample data in the multiple groups of initial sample data pairs in a target scene in the multiple scenes to obtain second voice sample data in multiple groups of second sample data pairs containing target environment reverberation, and performing reverberation processing on initial noise sample data in the multiple groups of initial sample data pairs to obtain target noise data in the multiple groups of second sample data pairs;
mixing second voice sample data and target noise data in each group of second sample data pairs to obtain a plurality of second noisy voice sample data;
training the first voice noise reduction model based on multiple groups of third sample data pairs to obtain a target voice noise reduction model, wherein the target voice noise reduction model is used for reducing noise of noisy voice data in the target scene, and each group of third sample data pairs comprises second voice sample data and second noisy voice sample data.
2. The method according to claim 1, wherein performing reverberation processing on initial voice sample data in the multiple sets of initial sample data pairs to obtain second voice sample data in multiple sets of second sample data pairs containing target environment reverberation, comprises:
for each group of initial sample data pairs, playing the initial voice sample data in the target scene through target playing equipment, performing sound acquisition in the target scene through target sound acquisition equipment to obtain voice data containing environmental noise and the target environment reverberation, and performing noise reduction processing on the voice data to obtain the second voice sample data containing the target environment reverberation; or,
and for each group of initial sample data pairs, obtaining impulse response data in the target scene, and performing convolution processing on the impulse response data and the initial voice sample data to obtain second voice sample data containing the target environment reverberation.
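For illustration only (not part of the claims), the convolution alternative above can be sketched with a toy impulse response; in practice the impulse response data would be measured in the target scene and the convolution applied to real audio:

```python
def convolve(x, h):
    # Full linear convolution: y[n] = sum_k h[k] * x[n - k].
    y = [0.0] * (len(x) + len(h) - 1)
    for k, hk in enumerate(h):
        for m, xm in enumerate(x):
            y[k + m] += hk * xm
    return y

# Toy impulse response: direct path plus one attenuated echo 3 samples later.
rir = [1.0, 0.0, 0.0, 0.5]
dry = [1.0, -1.0, 0.5]       # anechoic "initial voice sample data"
wet = convolve(dry, rir)     # "second voice sample data" with reverberation
```

For signals of realistic length, an FFT-based convolution (e.g. `scipy.signal.fftconvolve`) would be used instead of this O(n·m) direct form.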
3. The method of claim 1, wherein performing reverberation on initial noise sample data in the multiple sets of initial sample data pairs to obtain target noise data in the multiple sets of second sample data pairs comprises:
for each group of initial sample data pairs, determining a noise type corresponding to the initial noise sample data, and acquiring noise corresponding to the noise type in the target scene to obtain the target noise data containing the target environment reverberation; or,
and for each group of initial sample data pairs, playing the initial noise sample data in the target scene through target playing equipment, and carrying out sound acquisition in the target scene through target sound acquisition equipment to obtain the target noise data containing the target environment reverberation.
4. The method of claim 1, wherein the mixing the second voice sample data and the target noise data in each group of second sample data pairs to obtain a plurality of second noisy voice sample data comprises:
and for each group of second sample data pairs, mixing the second voice sample data and the target noise data based on a target signal-to-noise ratio to obtain second noisy voice sample data.
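A minimal sketch of SNR-controlled mixing (an assumption; the patent does not give an exact formula): the noise is scaled so that the speech-to-noise power ratio equals the target signal-to-noise ratio before the two are added:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_noise_scaled) == snr_db,
    # then add it to the speech sample by sample.
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

# 0 dB target: speech power 1.0, noise power 0.25 -> noise gain 2.0.
noisy = mix_at_snr([1.0, -1.0], [0.5, 0.5], 0.0)
```

Sweeping `snr_db` over a range (e.g. -5 to 20 dB) is a common way to generate many second noisy voice sample data from one speech/noise pair.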
5. The method of claim 4, wherein the target noise data comprises a plurality of target noise data of different noise types, there are a plurality of second noisy voice sample data, and the mixing the second voice sample data and the target noise data based on the target signal-to-noise ratio to obtain the second noisy voice sample data comprises at least one of:
mixing the target noise data with different noise types with the second voice sample data respectively based on the target signal-to-noise ratio to obtain a plurality of second voice sample data with noise;
and for the target noise data of each noise type, mixing the target noise data with at least one target noise data of a different noise type to obtain a plurality of mixed noise data, and mixing the plurality of mixed noise data with the second voice sample data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data.
6. The method of claim 4, wherein the second voice sample data comprises a plurality of second voice sample data of different crowds, there are a plurality of second noisy voice sample data, and the mixing the second voice sample data and the target noise data based on the target signal-to-noise ratio to obtain the second noisy voice sample data comprises at least one of:
mixing second voice sample data of the different crowds with the target noise data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data;
and for the second voice sample data of each crowd, mixing the second voice sample data with at least one second voice sample data of a different crowd to obtain a plurality of mixed voice sample data, and mixing the plurality of mixed voice sample data with the target noise data respectively based on the target signal-to-noise ratio to obtain a plurality of second noisy voice sample data.
7. An apparatus for training a speech noise reduction model, the apparatus comprising:
the first training module is used for training to obtain a first voice noise reduction model based on multiple groups of first sample data pairs in simulation scenes of multiple scenes, wherein each group of first sample data pairs comprises first voice sample data and first noisy voice sample data, and the first voice sample data is the first noisy voice sample data after noise reduction;
the acquisition module is used for acquiring a plurality of groups of initial sample data pairs, each group of initial sample data pairs comprises initial voice sample data and initial noise sample data, the initial voice sample data is the first voice sample data without the reverberation of the simulation environment, and the initial noise sample data is data obtained by removing the voice data and the reverberation of the simulation environment from the first voice sample data with noise;
a processing module, configured to perform reverberation processing on initial voice sample data in the multiple sets of initial sample data pairs in a target scene of the multiple scenes to obtain second voice sample data in multiple sets of second sample data pairs including target environment reverberation, and perform reverberation processing on initial noise sample data in the multiple sets of initial sample data pairs to obtain target noise data in the multiple sets of second sample data pairs;
the mixing module is used for mixing the second voice sample data and the target noise data in each group of second sample data pairs to obtain a plurality of second noisy voice sample data;
and the second training module is used for training the first voice noise reduction model based on the multiple groups of third sample data pairs to obtain a target voice noise reduction model, the target voice noise reduction model is used for reducing noise of the noisy voice data in the target scene, and each group of third sample data pairs comprises second voice sample data and second noisy voice sample data.
8. A computer device comprising one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded and executed by the one or more processors to implement the method of training a speech noise reduction model according to any one of claims 1 to 6.
9. A computer-readable storage medium, wherein at least one program code is stored in the storage medium, and the at least one program code is loaded and executed by a processor to implement the method for training a speech noise reduction model according to any one of claims 1 to 6.
10. A computer program product, characterized in that the computer program product comprises computer program code, which is stored in a computer readable storage medium, from which a processor of a computer device reads the computer program code, the processor executing the computer program code, causing the computer device to perform the method of training a speech noise reduction model according to any of claims 1 to 6.
CN202210964055.9A 2022-08-11 2022-08-11 Training method, device, equipment, storage medium and product of voice noise reduction model Pending CN115331689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210964055.9A CN115331689A (en) 2022-08-11 2022-08-11 Training method, device, equipment, storage medium and product of voice noise reduction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210964055.9A CN115331689A (en) 2022-08-11 2022-08-11 Training method, device, equipment, storage medium and product of voice noise reduction model

Publications (1)

Publication Number Publication Date
CN115331689A true CN115331689A (en) 2022-11-11

Family

ID=83924055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210964055.9A Pending CN115331689A (en) 2022-08-11 2022-08-11 Training method, device, equipment, storage medium and product of voice noise reduction model

Country Status (1)

Country Link
CN (1) CN115331689A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117558266A (en) * 2024-01-12 2024-02-13 腾讯科技(深圳)有限公司 Model training method, device, equipment and computer readable storage medium
CN117558266B (en) * 2024-01-12 2024-03-22 腾讯科技(深圳)有限公司 Model training method, device, equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN109887494B (en) Method and apparatus for reconstructing a speech signal
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN110688082B (en) Method, device, equipment and storage medium for determining adjustment proportion information of volume
CN112487940B (en) Video classification method and device
CN109003621B (en) Audio processing method and device and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN109192223B (en) Audio alignment method and device
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN111863020A (en) Voice signal processing method, device, equipment and storage medium
CN112614500A (en) Echo cancellation method, device, equipment and computer storage medium
CN113420177A (en) Audio data processing method and device, computer equipment and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN115331689A (en) Training method, device, equipment, storage medium and product of voice noise reduction model
CN113409805A (en) Man-machine interaction method and device, storage medium and terminal equipment
CN113343709B (en) Method for training intention recognition model, method, device and equipment for intention recognition
CN112750449B (en) Echo cancellation method, device, terminal, server and storage medium
CN113963707A (en) Audio processing method, device, equipment and storage medium
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium
CN113192531A (en) Method, terminal and storage medium for detecting whether audio is pure music audio
CN113301444A (en) Video processing method and device, electronic equipment and storage medium
CN113362836A (en) Vocoder training method, terminal and storage medium
CN112560903A (en) Method, device and equipment for determining image aesthetic information and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination